Home > Back-end >  SQL query to count unique csv in a table
SQL query to count unique csv in a table

Time:12-14

Say I have the following SQLite v3 table, which holds some names per each country:

Germany   Peter,Jan,David,Florian
USA       James,Joe,Bob,David,Alan,George
UK        George,Jack,Peter
Israel    David,Moshe,Chaim

The names for each country are separated using commas.

I want to count how many unique names there are in total, which in this case will be 12 (since e.g. David is both in Germany, USA and Israel).

Is there a direct way to do it via a SQL query?

CodePudding user response:

Is there a direct way to do it via a SQL query?

I believe that the following will produce the count of unique names directly:-

WITH
    splt(value,rest) AS 
        (
            SELECT 
                substr(names,1,instr(names,',')-1),
                substr(names,instr(names,',') 1)||',' 
            FROM thetable
            UNION ALL SELECT 
                substr(rest,1,instr(rest,',')-1),
                substr(rest,instr(rest,',') 1) 
            FROM splt 
            WHERE length(rest) > 0 
            LIMIT 20 /* just in case limit to 20 iterations increase if more iterations exected */
    ),
    intermediate AS 
        (
            SELECT count(*),
                group_concat(value) 
            FROM splt 
            WHERE length(value) > 0 
            GROUP BY value
        )
SELECT count(*) AS unique_names FROM intermediate;

Explanation

This assumes that the country is in a column and that the names are in another column and that the column name is names in a table named thetable

The query consists of 2 CTE's (Common Table Expressions which are basically temporary tables).

The first CTE named splt is recursive it extracts each name in the list as a row.

Note that a recursive CTE MUST have some means of determining when to stop iterating a WHERE clause or a LIMIT. In the case both are used the WHERE clause is the correct check ceasing the iterations (on a per source row basis) when the length of the extracted value is greater than 0. The LIMIT 20 is a precautionary measure, of course it may be increased.

The second CTE, named intermediate then removes 0 length names and duplicates by grouping according to value using the result from the splt CTE.

Finally the number of remaining rows is counted.

Demonstration

Using the following to demonstrate:-

DROP TABLE IF EXISTS thetable;
CREATE TABLE IF NOT EXISTS thetable (country TEXT, names TEXT);
INSERT INTO thetable VALUES
    ('Germany','Peter,Jan,David,Florian'),
    ('USA','James,Joe,Bob,David,Alan,George'),
    ('UK','George,Jack,Peter'),
    ('Isreal','David,Moshe,Chaim'),
    /*<<<<< ADDED to test resillience*/
    ('Spain',''), 
    ('France',null),
    ('Italy',zeroblob(100))
;
WITH
    splt(value,rest) AS 
        (
            SELECT 
                substr(names,1,instr(names,',')-1),
                substr(names,instr(names,',') 1)||',' 
            FROM thetable
            UNION ALL SELECT 
                substr(rest,1,instr(rest,',')-1),
                substr(rest,instr(rest,',') 1) 
            FROM splt 
            WHERE length(rest) > 0 
            LIMIT 20 /* just in case limit to 20 iterations increase if more iterations exected */
    ),
    intermediate AS 
        (
            SELECT count(*),
                group_concat(value) 
            FROM splt 
            WHERE length(value) > 0 
            GROUP BY value
        )
SELECT count(*) AS unique_names FROM intermediate;
DROP TABLE IF EXISTS thetable;

Results in:-

enter image description here

  • Related