Home > Enterprise >  How to select cumulative counts by group over time across in postgres
How to select cumulative counts by group over time across in postgres

Time:12-02

I have cumulative counts for two groups over time in this format:

Date Group Cumulative Count
1/1/2020 A 1
1/2/2020 A 3
1/2/2020 B 1
1/3/2020 B 2

And I'd like to reshape this data into this format:

Date Group Cumulative Count
1/1/2020 A 1
1/1/2020 B 0
1/2/2020 A 3
1/2/2020 B 1
1/3/2020 A 3
1/3/2020 B 2

So that I can get it to display accurately in a stacked area chart in metabase - any advice?

CodePudding user response:

You may generate all possible pairs of dates and groups with a cross join before using a left join to get the combined dataset eg

Since your dataset already has the cumulative counts, the missing values identified by null have been replaced using the most recent cumulative count with MAX and COALESCE.

SELECT
    d."Date"::text,
    d."Group",
    COALESCE(m."CumulativeCount",COALESCE(MAX(m."CumulativeCount") OVER (
        PARTITION BY d."Group"
        ORDER BY d."Date"
    ),0)) as CumulativeCount
FROM (
    SELECT "Date", "Group" FROM (
        SELECT DISTINCT
            "Date"
        FROM
            my_data
    ) t1
    CROSS JOIN (
        SELECT DISTINCT
            "Group"
        FROM
            my_data
    ) t2
) d 
LEFT JOIN my_data m ON m."Date"=d."Date" AND
                       m."Group" = d."Group"
ORDER BY 1,2;
Date Group cumulativecount
2020-01-01 A 1
2020-01-01 B 0
2020-01-02 A 3
2020-01-02 B 1
2020-01-03 A 3
2020-01-03 B 2

View working demo on DB Fiddle

Update 1

If it is that you would like to generate values for dates between missing dates eg you had the next date to be 1/7/2020 and you wanted to fill the gaps for 1/3/2020 you could use generate_series to generate the possible dates and MAX to get the most recent value. I have included a fiddle below with additional sample data eg

Schema (PostgreSQL v13)

CREATE TABLE my_data (
  "Date" DATE,
  "Group" VARCHAR(1),
  "CumulativeCount" INTEGER
);

INSERT INTO my_data
  ("Date", "Group", "CumulativeCount")
VALUES
  ('1/1/2020', 'A', '1'),
  ('1/2/2020', 'A', '3'),
  ('1/2/2020', 'B', '1'),
  ('1/3/2020', 'B', '2'),
  ('1/1/2020', 'C', '2'),
  ('1/7/2020', 'C', '3');

Query #1

SELECT
    d."Date"::text,
    d."Group",
    COALESCE(
        m."CumulativeCount",
        COALESCE(MAX(m."CumulativeCount") OVER (
            PARTITION BY d."Group"
            ORDER BY d."Date"
        ),0)
    ) as CumulativeCount
FROM (
    SELECT "Date", "Group" FROM (
        SELECT
            GENERATE_SERIES(
                MIN("Date"),
                MAX("Date"),
                INTERVAL '1' DAY
            ) as "Date"
        FROM
            my_data
    ) t1
    CROSS JOIN (
        SELECT DISTINCT
            "Group"
        FROM
            my_data
    ) t2
) d 
LEFT JOIN my_data m ON m."Date"=d."Date" AND
                       m."Group" = d."Group"
ORDER BY 1,2;
Date Group cumulativecount
2020-01-01 00:00:00 00 A 1
2020-01-01 00:00:00 00 B 0
2020-01-01 00:00:00 00 C 2
2020-01-02 00:00:00 00 A 3
2020-01-02 00:00:00 00 B 1
2020-01-02 00:00:00 00 C 2
2020-01-03 00:00:00 00 A 3
2020-01-03 00:00:00 00 B 2
2020-01-03 00:00:00 00 C 2
2020-01-04 00:00:00 00 A 3
2020-01-04 00:00:00 00 B 2
2020-01-04 00:00:00 00 C 2
2020-01-05 00:00:00 00 A 3
2020-01-05 00:00:00 00 B 2
2020-01-05 00:00:00 00 C 2
2020-01-06 00:00:00 00 A 3
2020-01-06 00:00:00 00 B 2
2020-01-06 00:00:00 00 C 2
2020-01-07 00:00:00 00 A 3
2020-01-07 00:00:00 00 B 2
2020-01-07 00:00:00 00 C 3

View working demo on DB Fiddle

CodePudding user response:

You can achieve this by following steps:

Note: I have created few tables. Feel free to use sub-query or CTE depending on your preference.

Firstly, creating possible date-group pairs:

create table ads as 
SELECT
    sq."date",
    sq."group",
    COALESCE(m."cummulativecount",0) as CummulativeCount
FROM 
(
    SELECT "date", "group" 
    FROM 
    (
        SELECT DISTINCT "date"
        FROM tbl
    ) dt
    
    CROSS JOIN 
    (
        SELECT DISTINCT "group"
        FROM tbl
    ) grp
) sq 
LEFT JOIN tbl m 
   ON m."date"=sq."date" 
   AND
   m."group" = sq."group"

(The above step was referenced from @ggordon's answer). Now, Since we imputed all cummulativeCount as 0 for additional records, we have to get recent cummulativeCount for the actual table (i.e., for A, 3 will be taken from "1/2/2020". If recent most value is unavailable, then it will not be taken.

create table prev_cnt as 
select t."group", t.cummulativecount 
from tbl as t
inner join 
(
   select tbl."group", max(tbl."date") m_date
   from ads 
   inner join tbl
      on ads."group"=tbl."group"
   where ads.cummulativecount =0 and
      tbl."date" < ads."date"
   group by tbl."group"
) as sq
   on t."group"=sq."group" 
   and t."date"=sq.m_date

Finally, joining the recent most values with derived ads table:

create table fin_ads as 
select 
   ads."date", 
   ads."group", 
   case 
     when ads.cummulativecount=0 and pc.cummulativecount IS NOT NULL then pc.cummulativecount 
     else ads.cummulativecount 
   end as cummulativecount
from ads
LEFT join prev_cnt as pc
   on ads."group"=pc."group"

The table fin_ads will be your desired output.

sample code and date here: DB<>fiddle

  • Related