I have a table that looks the following way
time | group | sub_group | count |
---|---|---|---|
2022-01-01 | A | True | 3 |
2022-01-01 | A | False | 1 |
2022-01-01 | B | True | 2 |
2022-01-01 | B | False | 1 |
2022-01-02 | A | False | 2 |
2022-01-02 | A | True | 5 |
2022-01-02 | B | False | 3 |
2022-01-03 | A | False | 3 |
2022-01-03 | B | False | 4 |
2022-01-03 | B | True | 3 |
So an increasing count per group sub_group per day, unless on a day when a count did not change for a group subgroup, the row is missing.
in the example above missing rows would be: ...
| 2022-01-02 | B | True | 2 |
...
| 2022-01-03 | A | True | 5 |
...
For ease of data handling, I need a continuous timestamp per day for all groups sub_groups. So the result would look like this:
time | group | sub_group | count |
---|---|---|---|
2022-01-01 | A | True | 3 |
2022-01-01 | A | False | 1 |
2022-01-01 | B | True | 2 |
2022-01-01 | B | False | 1 |
2022-01-02 | A | False | 2 |
2022-01-02 | A | True | 5 |
2022-01-02 | B | False | 3 |
2022-01-02 | B | True | 2 |
2022-01-03 | A | False | 3 |
2022-01-03 | A | True | 5 |
2022-01-03 | B | False | 4 |
2022-01-03 | B | True | 3 |
How could I achieve this? Probably some parition by
... over
select construct, but I can't wrap my head around how to partition by timestamps from other groups in this case, as I don't have the NULL counts to forward fill for each group as intermediate.
update: So far, I seem to have the reached the intermediate state that filled the missing timestamps (basically just daily frequency is fine here) between groups like this:
with time_range as (
select min(time) as start_time, -- current_timestamp - interval '2 day'
max(time) as end_time
from my_table-- current_timestamp
),
interested_events as (
select e.group, e.sub_group, e.time, e.count
from my_table e
),
classes_having_events as (
select distinct group, sub_group
from interested_events
ORDER BY group, sub_group
),
periods as (
select ts as period_start, ts interval '1 day' as period_end
from generate_series(
(select start_time from time_range),
(select end_time from time_range) - interval '1 second',
interval '1 day') ts
), resampled as (
SELECT period_start,
period_end,
classes_having_events.group,
classes_having_events.sub_group,
interested_events.count
FROM periods
CROSS JOIN classes_having_events
LEFT JOIN interested_events
ON time >= period_start AND time < period_end
AND interested_events.group = classes_having_events.group
AND interested_events.sub_group = classes_having_events.sub_group
ORDER BY period_start DESC
)
CodePudding user response:
Okay, seems like I was pretty close and rubber duck debugging helped.
This seems to do what I wanted to have:
WITH time_range AS (
SELECT MIN(time) AS start_time, -- current_timestamp - interval '2 day'
MAX(time) AS end_time
FROM my_table-- current_timestamp
),
interested_events AS (
SELECT e.group, e.sub_group, e.time, e.count
FROM my_table e
),
classes_having_events AS (
SELECT DISTINCT
GROUP, sub_group
FROM interested_events
ORDER BY
GROUP, sub_group
),
periods AS (
SELECT ts AS period_start, ts INTERVAL '1 day' AS period_end
FROM GENERATE_SERIES(
(
SELECT start_time
FROM time_range
),
(
SELECT end_time
FROM time_range
) - INTERVAL '1 second',
INTERVAL '1 day') ts
),
resampled AS (
SELECT period_start,
period_end,
classes_having_events.group,
classes_having_events.sub_group,
interested_events.count
FROM periods
CROSS JOIN classes_having_events
LEFT JOIN interested_events
ON time >= period_start AND time < period_end
AND interested_events.group = classes_having_events.group
AND interested_events.sub_group = classes_having_events.sub_group
ORDER BY period_start DESC
)
SELECT period_start AS time,
"group",
sub_group,
MAX(count) OVER (PARTITION BY "group", "sub_group" ORDER BY period_start) AS count
FROM resampled
ORDER BY period_start DESC, "group", sub_group;