Home > Net >  SQL: Summarize rows by column value, without using UNION
SQL: Summarize rows by column value, without using UNION

Time:10-20

Im trying to write an sql statement which summarizes the count of rows by values in a specific column.

I have several tables, starting with s0_ Every table has a huge amount of rows with multiple, different values in column load_nr.

Example: s0_table1:

load_nr column2 column3...
101 someData someData...
101 someData someData...
102 someData someData...
103 someData someData...
103 someData someData...
103 someData someData...

S0_table2:

load_nr column2 column3...
101 someData someData...
102 someData someData...
102 someData someData...
102 someData someData...
102 someData someData...
103 someData someData...

Desired result:

load_nr count s0_table1 count s0_table2
101 2 1
102 1 4
103 3 1

New data with up counting load_nr is added every day. Best case is, that the table names are fetched dynamically (s0_*).

Could anyone give me a cloue how to build this statement? I'm a newbe and not so experienced with more complicated statements.

Best Regards

*edit: Hopefully the tables are displayed correctly xD

CodePudding user response:

For that you need a FULL OUTER JOIN in case the load_nr don't match up

SELECT COALESCE(s1.load_nr,s2.load_nr) as load_nr, Count_S1, Count_S2
FROM
(SELECT COUNT(*) Count_S1,load_nr FROM S0_table1 GROUP BY load_nr) s1 FULL OUTER JOIN 
(SELECT COUNT(*) Count_S2,load_nr FROM S0_table2 GROUP BY load_nr) s2  ON s1.load_nr = s2.load_nr

ORDER BY load_nr

CodePudding user response:

Dynamic table names are going to be a problem. You basically have no choice but to write a SQL statement dynamically. There's a good chance that summing a UNION-ed subquery is going to be the fastest way to go, but if you can also try this:

SELECT 
    COALESCE(
        Table1Count.[load_nr],
        Table2Count.[load_nr],
        Table3Count.[load_nr]       
    ) [load_nr],
    COALESCE(Table1Count.[rows],0) [s0_table1],
    COALESCE(Table2Count.[rows],0) [s0_table2],
    COALESCE(Table3Count.[rows],0) [s0_table3]
FROM            (   SELECT  
                        [load_nr],
                        COUNT(1) [rows]
                    FROM        s0_table1
                    GROUP BY    [load_nr]
                ) Table1Count
FULL OUTER JOIN (   SELECT  
                        [load_nr],
                        COUNT(1) [rows]
                    FROM        s0_table2
                    GROUP BY    [load_nr]
                ) Table2Count
ON              Table2Count.load_nr = Table1Count.load_nr
FULL OUTER JOIN (   SELECT  
                        [load_nr],
                        COUNT(1) [rows]
                    FROM        s0_table3
                    GROUP BY    [load_nr]
                ) Table3Count
ON              COALESCE(
                    Table1Count.[load_nr],
                    Table2Count.[load_nr]       
                ) = Table3Count.load_nr

Of course you'd need to build the query dynamically and the nested coalesces for each subsequent join could get a bit tricky.

Another option if you have the load numbers you need in another table you can make it cleaner and easier to construct dynamically:

SELECT 
    loads.[load_nr],
    (SELECT COUNT(1) FROM s0_table1 WHERE load_nr = loads.[load_nr]) [s0_table1Count],
    (SELECT COUNT(1) FROM s0_table2 WHERE load_nr = loads.[load_nr]) [s0_table2Count],
    (SELECT COUNT(1) FROM s0_table3 WHERE load_nr = loads.[load_nr]) [s0_table3Count],
    (SELECT COUNT(1) FROM s0_table4 WHERE load_nr = loads.[load_nr]) [s0_table4Count]
FROM    loads

And if you need to remove the loads with no rows in any table you can just wrap it in a subquery and add a where clause:

SELECT 
    SubQuery.*
FROM    (   SELECT 
                loads.[load_nr],
                (SELECT COUNT(1) FROM s0_table1 WHERE load_nr = loads.[load_nr]) [s0_table1Count],
                (SELECT COUNT(1) FROM s0_table2 WHERE load_nr = loads.[load_nr]) [s0_table2Count],
                (SELECT COUNT(1) FROM s0_table3 WHERE load_nr = loads.[load_nr]) [s0_table3Count],
                (SELECT COUNT(1) FROM s0_table4 WHERE load_nr = loads.[load_nr]) [s0_table4Count]
            FROM    loads
        ) SubQuery
WHERE   (       SubQuery.[s0_table1Count] <> 0
            OR  SubQuery.[s0_table2Count] <> 0
            OR  SubQuery.[s0_table3Count] <> 0
            OR  SubQuery.[s0_table4Count] <> 0
        )

This could be pretty slow though depending on the shape of your data.

CodePudding user response:

I'm not sure if any of the following will work for you. This is something that I haven't ever used in a real project, but maybe it will provide you with some inspiration. The final solution will get 99% to having the best case that you wanted (more on this at the end).

Here's the basic idea. psql provides a command called \crosstabview that is able to display pivot tables, which is what you want. So, for your sample data, we could do this (notice that instead of terminating the query with a semicolon ;, we're terminating it with \crosstabview):

select load_nr, 's0_table1', count(*) from s0_table1 group by load_nr
union
select load_nr, 's0_table2', count(*) from s0_table2 group by load_nr
order by load_nr asc

\crosstabview
┌─────────┬───────────┬───────────┐
│ load_nr │ s0_table1 │ s0_table2 │
├─────────┼───────────┼───────────┤
│     101 │         2 │         1 │
│     102 │         1 │         4 │
│     103 │         3 │         1 │
└─────────┴───────────┴───────────┘

Try the same query with a semicolon terminator to see what the query has returned before psql post-processed the result:

select load_nr, 's0_table1', count(*) from s0_table1 group by load_nr
union
select load_nr, 's0_table2', count(*) from s0_table2 group by load_nr
order by load_nr asc
┌─────────┬───────────┬───────┐
│ load_nr │ ?column?  │ count │
├─────────┼───────────┼───────┤
│     101 │ s0_table1 │     2 │
│     101 │ s0_table2 │     1 │
│     102 │ s0_table1 │     1 │
│     102 │ s0_table2 │     4 │
│     103 │ s0_table2 │     1 │
│     103 │ s0_table1 │     3 │
└─────────┴───────────┴───────┘

Ok. Now this was a psql thing, but it PostgreSQL provides a built-in extension called tablefunc which exposes a function called crosstab that does precisely what the psql \crosstabview command does. Furthermore, this function accepts as argument a string containing the SQL text of the query you need, which means we could generate it.

For the moment we can just use the hardcoded query (I'm using $$-delimited strings below so that I don't have to escape single quotes):

create extension if not exists tablefunc;

select * from crosstab(
  -- The 1st argument is the query producing the values on the two dimentions:
  -- `load_nr` and table name.
  $$
  select load_nr, 's0_table1', count(*) from s0_table1 group by load_nr
  union
  select load_nr, 's0_table2', count(*) from s0_table2 group by load_nr
  order by load_nr asc
  $$,

  -- The 2nd argument must return the headers names, which in our case are
  -- the table names
  $$
  select * from (values ('s0_table1'), ('s0_table2')) as headers
  $$
) as (
  "load_nr" int,
  "count s0_table1" int,
  "count s0_table2" int
);

The above will return this, which reproduces perfectly the \crosstabview result:

┌─────────┬─────────────────┬─────────────────┐
│ load_nr │ count s0_table1 │ count s0_table2 │
├─────────┼─────────────────┼─────────────────┤
│     101 │               2 │               1 │
│     102 │               1 │               4 │
│     103 │               3 │               1 │
└─────────┴─────────────────┴─────────────────┘

Ok, now on to dynamically generate the two SQL queries that crosstab wants.

First, we can obtain a list of the table names prefixed with s0_:

select
  relname
from
  pg_catalog.pg_class
where relname like 's0_%'
order by relname asc;
┌───────────┐
│  relname  │
├───────────┤
│ s0_table1 │
│ s0_table2 │
└───────────┘

Now, we can use these values to build individual SELECT statements that we'll later join via UNION. First the SELECTs:

select
  'SELECT load_nr, ''' || relname || ''', count(*) FROM ' || relname || ' GROUP BY load_nr' as sql_text
from
  pg_catalog.pg_class
where relname like 's0_%'
order by relname asc;
┌───────────────────────────────────────────────────────────────────────┐
│                               sql_text                                │
├───────────────────────────────────────────────────────────────────────┤
│ SELECT load_nr, 's0_table1', count(*) FROM s0_table1 GROUP BY load_nr │
│ SELECT load_nr, 's0_table2', count(*) FROM s0_table2 GROUP BY load_nr │
└───────────────────────────────────────────────────────────────────────┘
(2 rows)

Now we can join the 2 rows into a single one:

select string_agg(sql_text, E'\nUNION\n') || E'\nORDER BY load_nr ASC' from (
  select
    'SELECT load_nr, ''' || relname || ''', count(*) FROM ' || relname || ' GROUP BY load_nr' as sql_text
  from
    pg_catalog.pg_class
  where relname like 's0_%'
  order by relname asc
) as queries;
┌───────────────────────────────────────────────────────────────────────┐
│                               ?column?                                │
├───────────────────────────────────────────────────────────────────────┤
│ SELECT load_nr, 's0_table1', count(*) FROM s0_table1 GROUP BY load_nr↵│
│ UNION                                                                ↵│
│ SELECT load_nr, 's0_table2', count(*) FROM s0_table2 GROUP BY load_nr↵│
│ ORDER BY load_nr ASC                                                  │
└───────────────────────────────────────────────────────────────────────┘
(1 row)

If we take this query and manually run it using \crosstabview, we'll get:

SELECT load_nr, 's0_table1', count(*) FROM s0_table1 GROUP BY load_nr
UNION
SELECT load_nr, 's0_table2', count(*) FROM s0_table2 GROUP BY load_nr
ORDER BY load_nr ASC \crosstabview
┌─────────┬───────────┬───────────┐
│ load_nr │ s0_table1 │ s0_table2 │
├─────────┼───────────┼───────────┤
│     101 │         2 │         1 │
│     102 │         1 │         4 │
│     103 │         3 │         1 │
└─────────┴───────────┴───────────┘
(3 rows)

Nice, so this was precisely what we wanted. We can now use SQL-generating query as the first argument to the function crosstab:

with
  source_sql as (
    select string_agg(sql_text, E'\nUNION\n') || E'\nORDER BY load_nr ASC' from (
      select
        'SELECT load_nr, ''' || relname || ''', count(*) FROM ' || relname || ' GROUP BY load_nr' as sql_text
      from
        pg_catalog.pg_class
      where relname like 's0_%'
      order by relname asc
    ) as queries
  )
  select * from crosstab(
    -- The 1st argument is the query producing the values on the two dimentions:
    -- `load_nr` and table name.
    (select * from source_sql),

    -- The 2nd argument must return the headers names, which in our case are
    -- the table names
    $$
    select * from (values ('s0_table1'), ('s0_table2')) as headers
    $$
  ) as (
    "load_nr" int,
    "count s0_table1" int,
    "count s0_table2" int
  );

Executing it will get us:

┌─────────┬─────────────────┬─────────────────┐
│ load_nr │ count s0_table1 │ count s0_table2 │
├─────────┼─────────────────┼─────────────────┤
│     101 │               2 │               1 │
│     102 │               1 │               4 │
│     103 │               3 │               1 │
└─────────┴─────────────────┴─────────────────┘
(3 rows)

Now, the final thing is to replace the 2nd argument to crosstab with a proper query — a way to fetch all table names prefixed with s0_, which we already know how to do — and with that the solution is complete:

with
  source_sql as (
    select string_agg(sql_text, E'\nUNION\n') || E'\nORDER BY load_nr ASC' from (
      select
        'SELECT load_nr, ''' || relname || ''', count(*) FROM ' || relname || ' GROUP BY load_nr' as sql_text
      from
        pg_catalog.pg_class
      where relname like 's0_%'
      order by relname asc
    ) as queries
  )

  -- select * from source_sql;

  select * from public.crosstab(
    (select * from source_sql),
    $categories$
      select relname from pg_catalog.pg_class
      where relname like 's0_%'
      order by relname asc
    $categories$
  ) as (
    load_nr int,
    "count s0_table1" int,
    "count s0_table2" int
  );

Coming back to my observation that this will bring you just 99% to your ideal solution. The reason for that is the fact that crosstab requires a column list definition, this part, where you have to manually list the table headers (again):

  ) as (
    load_nr int,
    "count s0_table1" int,
    "count s0_table2" int
  );

Unfortunately, I don't know of a way to get rid of that last hurdle. And as I mentioned in the beginning, this query might perform quite badly, but maybe it is good enough for your use case.

  • Related