Im trying to write an sql statement which summarizes the count of rows by values in a specific column.
I have several tables, starting with s0_ Every table has a huge amount of rows with multiple, different values in column load_nr.
Example: s0_table1:
load_nr | column2 | column3... |
---|---|---|
101 | someData | someData... |
101 | someData | someData... |
102 | someData | someData... |
103 | someData | someData... |
103 | someData | someData... |
103 | someData | someData... |
S0_table2:
load_nr | column2 | column3... |
---|---|---|
101 | someData | someData... |
102 | someData | someData... |
102 | someData | someData... |
102 | someData | someData... |
102 | someData | someData... |
103 | someData | someData... |
Desired result:
load_nr | count s0_table1 | count s0_table2 |
---|---|---|
101 | 2 | 1 |
102 | 1 | 4 |
103 | 3 | 1 |
New data with up counting load_nr is added every day. Best case is, that the table names are fetched dynamically (s0_*).
Could anyone give me a cloue how to build this statement? I'm a newbe and not so experienced with more complicated statements.
Best Regards
*edit: Hopefully the tables are displayed correctly xD
CodePudding user response:
For that you need a FULL OUTER JOIN in case the load_nr don't match up
SELECT COALESCE(s1.load_nr,s2.load_nr) as load_nr, Count_S1, Count_S2
FROM
(SELECT COUNT(*) Count_S1,load_nr FROM S0_table1 GROUP BY load_nr) s1 FULL OUTER JOIN
(SELECT COUNT(*) Count_S2,load_nr FROM S0_table2 GROUP BY load_nr) s2 ON s1.load_nr = s2.load_nr
ORDER BY load_nr
CodePudding user response:
Dynamic table names are going to be a problem. You basically have no choice but to write a SQL statement dynamically. There's a good chance that summing a UNION-ed subquery is going to be the fastest way to go, but if you can also try this:
SELECT
COALESCE(
Table1Count.[load_nr],
Table2Count.[load_nr],
Table3Count.[load_nr]
) [load_nr],
COALESCE(Table1Count.[rows],0) [s0_table1],
COALESCE(Table2Count.[rows],0) [s0_table2],
COALESCE(Table3Count.[rows],0) [s0_table3]
FROM ( SELECT
[load_nr],
COUNT(1) [rows]
FROM s0_table1
GROUP BY [load_nr]
) Table1Count
FULL OUTER JOIN ( SELECT
[load_nr],
COUNT(1) [rows]
FROM s0_table2
GROUP BY [load_nr]
) Table2Count
ON Table2Count.load_nr = Table1Count.load_nr
FULL OUTER JOIN ( SELECT
[load_nr],
COUNT(1) [rows]
FROM s0_table3
GROUP BY [load_nr]
) Table3Count
ON COALESCE(
Table1Count.[load_nr],
Table2Count.[load_nr]
) = Table3Count.load_nr
Of course you'd need to build the query dynamically and the nested coalesces for each subsequent join could get a bit tricky.
Another option if you have the load numbers you need in another table you can make it cleaner and easier to construct dynamically:
SELECT
loads.[load_nr],
(SELECT COUNT(1) FROM s0_table1 WHERE load_nr = loads.[load_nr]) [s0_table1Count],
(SELECT COUNT(1) FROM s0_table2 WHERE load_nr = loads.[load_nr]) [s0_table2Count],
(SELECT COUNT(1) FROM s0_table3 WHERE load_nr = loads.[load_nr]) [s0_table3Count],
(SELECT COUNT(1) FROM s0_table4 WHERE load_nr = loads.[load_nr]) [s0_table4Count]
FROM loads
And if you need to remove the loads with no rows in any table you can just wrap it in a subquery and add a where clause:
SELECT
SubQuery.*
FROM ( SELECT
loads.[load_nr],
(SELECT COUNT(1) FROM s0_table1 WHERE load_nr = loads.[load_nr]) [s0_table1Count],
(SELECT COUNT(1) FROM s0_table2 WHERE load_nr = loads.[load_nr]) [s0_table2Count],
(SELECT COUNT(1) FROM s0_table3 WHERE load_nr = loads.[load_nr]) [s0_table3Count],
(SELECT COUNT(1) FROM s0_table4 WHERE load_nr = loads.[load_nr]) [s0_table4Count]
FROM loads
) SubQuery
WHERE ( SubQuery.[s0_table1Count] <> 0
OR SubQuery.[s0_table2Count] <> 0
OR SubQuery.[s0_table3Count] <> 0
OR SubQuery.[s0_table4Count] <> 0
)
This could be pretty slow though depending on the shape of your data.
CodePudding user response:
I'm not sure if any of the following will work for you. This is something that I haven't ever used in a real project, but maybe it will provide you with some inspiration. The final solution will get 99% to having the best case that you wanted (more on this at the end).
Here's the basic idea. psql
provides a command called \crosstabview
that is
able to display pivot tables, which is what you want. So, for your sample data,
we could do this (notice that instead of terminating the query with a semicolon
;
, we're terminating it with \crosstabview
):
select load_nr, 's0_table1', count(*) from s0_table1 group by load_nr
union
select load_nr, 's0_table2', count(*) from s0_table2 group by load_nr
order by load_nr asc
\crosstabview
┌─────────┬───────────┬───────────┐
│ load_nr │ s0_table1 │ s0_table2 │
├─────────┼───────────┼───────────┤
│ 101 │ 2 │ 1 │
│ 102 │ 1 │ 4 │
│ 103 │ 3 │ 1 │
└─────────┴───────────┴───────────┘
Try the same query with a semicolon terminator to see what the query has returned
before psql
post-processed the result:
select load_nr, 's0_table1', count(*) from s0_table1 group by load_nr
union
select load_nr, 's0_table2', count(*) from s0_table2 group by load_nr
order by load_nr asc
┌─────────┬───────────┬───────┐
│ load_nr │ ?column? │ count │
├─────────┼───────────┼───────┤
│ 101 │ s0_table1 │ 2 │
│ 101 │ s0_table2 │ 1 │
│ 102 │ s0_table1 │ 1 │
│ 102 │ s0_table2 │ 4 │
│ 103 │ s0_table2 │ 1 │
│ 103 │ s0_table1 │ 3 │
└─────────┴───────────┴───────┘
Ok. Now this was a psql
thing, but it PostgreSQL provides a built-in
extension called tablefunc
which exposes a function called crosstab
that
does precisely what the psql \crosstabview
command does. Furthermore, this
function accepts as argument a string containing the SQL text of the query
you need, which means we could generate it.
For the moment we can just use the hardcoded query (I'm using $$
-delimited
strings below so that I don't have to escape single quotes):
create extension if not exists tablefunc;
select * from crosstab(
-- The 1st argument is the query producing the values on the two dimentions:
-- `load_nr` and table name.
$$
select load_nr, 's0_table1', count(*) from s0_table1 group by load_nr
union
select load_nr, 's0_table2', count(*) from s0_table2 group by load_nr
order by load_nr asc
$$,
-- The 2nd argument must return the headers names, which in our case are
-- the table names
$$
select * from (values ('s0_table1'), ('s0_table2')) as headers
$$
) as (
"load_nr" int,
"count s0_table1" int,
"count s0_table2" int
);
The above will return this, which reproduces perfectly the \crosstabview
result:
┌─────────┬─────────────────┬─────────────────┐
│ load_nr │ count s0_table1 │ count s0_table2 │
├─────────┼─────────────────┼─────────────────┤
│ 101 │ 2 │ 1 │
│ 102 │ 1 │ 4 │
│ 103 │ 3 │ 1 │
└─────────┴─────────────────┴─────────────────┘
Ok, now on to dynamically generate the two SQL queries that crosstab
wants.
First, we can obtain a list of the table names prefixed with s0_
:
select
relname
from
pg_catalog.pg_class
where relname like 's0_%'
order by relname asc;
┌───────────┐
│ relname │
├───────────┤
│ s0_table1 │
│ s0_table2 │
└───────────┘
Now, we can use these values to build individual SELECT
statements that we'll
later join via UNION
. First the SELECT
s:
select
'SELECT load_nr, ''' || relname || ''', count(*) FROM ' || relname || ' GROUP BY load_nr' as sql_text
from
pg_catalog.pg_class
where relname like 's0_%'
order by relname asc;
┌───────────────────────────────────────────────────────────────────────┐
│ sql_text │
├───────────────────────────────────────────────────────────────────────┤
│ SELECT load_nr, 's0_table1', count(*) FROM s0_table1 GROUP BY load_nr │
│ SELECT load_nr, 's0_table2', count(*) FROM s0_table2 GROUP BY load_nr │
└───────────────────────────────────────────────────────────────────────┘
(2 rows)
Now we can join the 2 rows into a single one:
select string_agg(sql_text, E'\nUNION\n') || E'\nORDER BY load_nr ASC' from (
select
'SELECT load_nr, ''' || relname || ''', count(*) FROM ' || relname || ' GROUP BY load_nr' as sql_text
from
pg_catalog.pg_class
where relname like 's0_%'
order by relname asc
) as queries;
┌───────────────────────────────────────────────────────────────────────┐
│ ?column? │
├───────────────────────────────────────────────────────────────────────┤
│ SELECT load_nr, 's0_table1', count(*) FROM s0_table1 GROUP BY load_nr↵│
│ UNION ↵│
│ SELECT load_nr, 's0_table2', count(*) FROM s0_table2 GROUP BY load_nr↵│
│ ORDER BY load_nr ASC │
└───────────────────────────────────────────────────────────────────────┘
(1 row)
If we take this query and manually run it using \crosstabview
, we'll get:
SELECT load_nr, 's0_table1', count(*) FROM s0_table1 GROUP BY load_nr
UNION
SELECT load_nr, 's0_table2', count(*) FROM s0_table2 GROUP BY load_nr
ORDER BY load_nr ASC \crosstabview
┌─────────┬───────────┬───────────┐
│ load_nr │ s0_table1 │ s0_table2 │
├─────────┼───────────┼───────────┤
│ 101 │ 2 │ 1 │
│ 102 │ 1 │ 4 │
│ 103 │ 3 │ 1 │
└─────────┴───────────┴───────────┘
(3 rows)
Nice, so this was precisely what we wanted. We can now use SQL-generating query
as the first argument to the function crosstab
:
with
source_sql as (
select string_agg(sql_text, E'\nUNION\n') || E'\nORDER BY load_nr ASC' from (
select
'SELECT load_nr, ''' || relname || ''', count(*) FROM ' || relname || ' GROUP BY load_nr' as sql_text
from
pg_catalog.pg_class
where relname like 's0_%'
order by relname asc
) as queries
)
select * from crosstab(
-- The 1st argument is the query producing the values on the two dimentions:
-- `load_nr` and table name.
(select * from source_sql),
-- The 2nd argument must return the headers names, which in our case are
-- the table names
$$
select * from (values ('s0_table1'), ('s0_table2')) as headers
$$
) as (
"load_nr" int,
"count s0_table1" int,
"count s0_table2" int
);
Executing it will get us:
┌─────────┬─────────────────┬─────────────────┐
│ load_nr │ count s0_table1 │ count s0_table2 │
├─────────┼─────────────────┼─────────────────┤
│ 101 │ 2 │ 1 │
│ 102 │ 1 │ 4 │
│ 103 │ 3 │ 1 │
└─────────┴─────────────────┴─────────────────┘
(3 rows)
Now, the final thing is to replace the 2nd argument to crosstab
with a proper
query — a way to fetch all table names prefixed with s0_
, which we already
know how to do — and with that the solution is complete:
with
source_sql as (
select string_agg(sql_text, E'\nUNION\n') || E'\nORDER BY load_nr ASC' from (
select
'SELECT load_nr, ''' || relname || ''', count(*) FROM ' || relname || ' GROUP BY load_nr' as sql_text
from
pg_catalog.pg_class
where relname like 's0_%'
order by relname asc
) as queries
)
-- select * from source_sql;
select * from public.crosstab(
(select * from source_sql),
$categories$
select relname from pg_catalog.pg_class
where relname like 's0_%'
order by relname asc
$categories$
) as (
load_nr int,
"count s0_table1" int,
"count s0_table2" int
);
Coming back to my observation that this will bring you just 99% to your ideal
solution. The reason for that is the fact that crosstab
requires a column
list definition, this part, where you have to manually list the table headers
(again):
) as (
load_nr int,
"count s0_table1" int,
"count s0_table2" int
);
Unfortunately, I don't know of a way to get rid of that last hurdle. And as I mentioned in the beginning, this query might perform quite badly, but maybe it is good enough for your use case.