I have cumulative counts for two groups over time in this format:
Date | Group | Cumulative Count |
---|---|---|
1/1/2020 | A | 1 |
1/2/2020 | A | 3 |
1/2/2020 | B | 1 |
1/3/2020 | B | 2 |
And I'd like to reshape this data into this format:
Date | Group | Cumulative Count |
---|---|---|
1/1/2020 | A | 1 |
1/1/2020 | B | 0 |
1/2/2020 | A | 3 |
1/2/2020 | B | 1 |
1/3/2020 | A | 3 |
1/3/2020 | B | 2 |
So that I can get it to display accurately in a stacked area chart in metabase - any advice?
CodePudding user response:
You may generate all possible pairs of dates and groups with a cross join before using a left join to get the combined dataset eg
Since your dataset already has the cumulative counts, the missing values identified by null have been replaced using the most recent cumulative count with MAX
and COALESCE
.
SELECT
d."Date"::text,
d."Group",
COALESCE(m."CumulativeCount",COALESCE(MAX(m."CumulativeCount") OVER (
PARTITION BY d."Group"
ORDER BY d."Date"
),0)) as CumulativeCount
FROM (
SELECT "Date", "Group" FROM (
SELECT DISTINCT
"Date"
FROM
my_data
) t1
CROSS JOIN (
SELECT DISTINCT
"Group"
FROM
my_data
) t2
) d
LEFT JOIN my_data m ON m."Date"=d."Date" AND
m."Group" = d."Group"
ORDER BY 1,2;
Date | Group | cumulativecount |
---|---|---|
2020-01-01 | A | 1 |
2020-01-01 | B | 0 |
2020-01-02 | A | 3 |
2020-01-02 | B | 1 |
2020-01-03 | A | 3 |
2020-01-03 | B | 2 |
View working demo on DB Fiddle
Update 1
If it is that you would like to generate values for dates between missing dates eg you had the next date to be 1/7/2020
and you wanted to fill the gaps for 1/3/2020
you could use generate_series
to generate the possible dates and MAX
to get the most recent value. I have included a fiddle below with additional sample data eg
Schema (PostgreSQL v13)
CREATE TABLE my_data (
"Date" DATE,
"Group" VARCHAR(1),
"CumulativeCount" INTEGER
);
INSERT INTO my_data
("Date", "Group", "CumulativeCount")
VALUES
('1/1/2020', 'A', '1'),
('1/2/2020', 'A', '3'),
('1/2/2020', 'B', '1'),
('1/3/2020', 'B', '2'),
('1/1/2020', 'C', '2'),
('1/7/2020', 'C', '3');
Query #1
SELECT
d."Date"::text,
d."Group",
COALESCE(
m."CumulativeCount",
COALESCE(MAX(m."CumulativeCount") OVER (
PARTITION BY d."Group"
ORDER BY d."Date"
),0)
) as CumulativeCount
FROM (
SELECT "Date", "Group" FROM (
SELECT
GENERATE_SERIES(
MIN("Date"),
MAX("Date"),
INTERVAL '1' DAY
) as "Date"
FROM
my_data
) t1
CROSS JOIN (
SELECT DISTINCT
"Group"
FROM
my_data
) t2
) d
LEFT JOIN my_data m ON m."Date"=d."Date" AND
m."Group" = d."Group"
ORDER BY 1,2;
Date | Group | cumulativecount |
---|---|---|
2020-01-01 00:00:00 00 | A | 1 |
2020-01-01 00:00:00 00 | B | 0 |
2020-01-01 00:00:00 00 | C | 2 |
2020-01-02 00:00:00 00 | A | 3 |
2020-01-02 00:00:00 00 | B | 1 |
2020-01-02 00:00:00 00 | C | 2 |
2020-01-03 00:00:00 00 | A | 3 |
2020-01-03 00:00:00 00 | B | 2 |
2020-01-03 00:00:00 00 | C | 2 |
2020-01-04 00:00:00 00 | A | 3 |
2020-01-04 00:00:00 00 | B | 2 |
2020-01-04 00:00:00 00 | C | 2 |
2020-01-05 00:00:00 00 | A | 3 |
2020-01-05 00:00:00 00 | B | 2 |
2020-01-05 00:00:00 00 | C | 2 |
2020-01-06 00:00:00 00 | A | 3 |
2020-01-06 00:00:00 00 | B | 2 |
2020-01-06 00:00:00 00 | C | 2 |
2020-01-07 00:00:00 00 | A | 3 |
2020-01-07 00:00:00 00 | B | 2 |
2020-01-07 00:00:00 00 | C | 3 |
View working demo on DB Fiddle
CodePudding user response:
You can achieve this by following steps:
Note: I have created few tables. Feel free to use sub-query or CTE depending on your preference.
Firstly, creating possible date-group
pairs:
create table ads as
SELECT
sq."date",
sq."group",
COALESCE(m."cummulativecount",0) as CummulativeCount
FROM
(
SELECT "date", "group"
FROM
(
SELECT DISTINCT "date"
FROM tbl
) dt
CROSS JOIN
(
SELECT DISTINCT "group"
FROM tbl
) grp
) sq
LEFT JOIN tbl m
ON m."date"=sq."date"
AND
m."group" = sq."group"
(The above step was referenced from @ggordon's answer). Now, Since we imputed all cummulativeCount as 0 for additional records, we have to get recent cummulativeCount for the actual table (i.e., for A, 3
will be taken from "1/2/2020"
. If recent most value is unavailable, then it will not be taken.
create table prev_cnt as
select t."group", t.cummulativecount
from tbl as t
inner join
(
select tbl."group", max(tbl."date") m_date
from ads
inner join tbl
on ads."group"=tbl."group"
where ads.cummulativecount =0 and
tbl."date" < ads."date"
group by tbl."group"
) as sq
on t."group"=sq."group"
and t."date"=sq.m_date
Finally, joining the recent most values with derived ads table:
create table fin_ads as
select
ads."date",
ads."group",
case
when ads.cummulativecount=0 and pc.cummulativecount IS NOT NULL then pc.cummulativecount
else ads.cummulativecount
end as cummulativecount
from ads
LEFT join prev_cnt as pc
on ads."group"=pc."group"
The table fin_ads
will be your desired output.
sample code and date here: DB<>fiddle