Home > OS >  Forward fill grouped timeseries using timestamps from other groups in postgresql
Forward fill grouped timeseries using timestamps from other groups in postgresql

Time:11-21

I have a table that looks the following way

time group sub_group count
2022-01-01 A True 3
2022-01-01 A False 1
2022-01-01 B True 2
2022-01-01 B False 1
2022-01-02 A False 2
2022-01-02 A True 5
2022-01-02 B False 3
2022-01-03 A False 3
2022-01-03 B False 4
2022-01-03 B True 3

So an increasing count per group sub_group per day, unless on a day when a count did not change for a group subgroup, the row is missing.

in the example above missing rows would be: ...

| 2022-01-02 | B | True | 2 |

...

| 2022-01-03 | A | True | 5 |

...

For ease of data handling, I need a continuous timestamp per day for all groups sub_groups. So the result would look like this:

time group sub_group count
2022-01-01 A True 3
2022-01-01 A False 1
2022-01-01 B True 2
2022-01-01 B False 1
2022-01-02 A False 2
2022-01-02 A True 5
2022-01-02 B False 3
2022-01-02 B True 2
2022-01-03 A False 3
2022-01-03 A True 5
2022-01-03 B False 4
2022-01-03 B True 3

How could I achieve this? Probably some parition by ... over select construct, but I can't wrap my head around how to partition by timestamps from other groups in this case, as I don't have the NULL counts to forward fill for each group as intermediate.

update: So far, I seem to have the reached the intermediate state that filled the missing timestamps (basically just daily frequency is fine here) between groups like this:

with time_range as (
    select min(time) as start_time,   -- current_timestamp - interval '2 day'
           max(time) as end_time
    from my_table-- current_timestamp
),
interested_events as (
    select e.group, e.sub_group, e.time, e.count
    from my_table e
),
classes_having_events as (
    select distinct group, sub_group
    from interested_events
    ORDER BY group, sub_group
),
periods as (
    select ts as period_start, ts   interval '1 day' as period_end
    from generate_series(
        (select start_time from time_range),
        (select end_time from time_range) - interval '1 second',
        interval '1 day') ts
), resampled as (
    SELECT period_start,
           period_end,
           classes_having_events.group,
           classes_having_events.sub_group,
           interested_events.count
    FROM periods
             CROSS JOIN classes_having_events
             LEFT JOIN interested_events
                       ON time >= period_start AND time < period_end
                           AND interested_events.group = classes_having_events.group
                           AND interested_events.sub_group = classes_having_events.sub_group
    ORDER BY period_start DESC
    )

CodePudding user response:

Okay, seems like I was pretty close and rubber duck debugging helped.

This seems to do what I wanted to have:

WITH time_range AS (
    SELECT MIN(time) AS start_time, -- current_timestamp - interval '2 day'
           MAX(time) AS end_time
    FROM my_table-- current_timestamp
    ),
     interested_events AS (
         SELECT e.group, e.sub_group, e.time, e.count
         FROM my_table e
         ),
     classes_having_events AS (
         SELECT DISTINCT
         GROUP, sub_group
         FROM interested_events
         ORDER BY
         GROUP, sub_group
         ),
     periods AS (
         SELECT ts AS period_start, ts   INTERVAL '1 day' AS period_end
         FROM GENERATE_SERIES(
                      (
                          SELECT start_time
                          FROM time_range
                          ),
                      (
                          SELECT end_time
                          FROM time_range
                          ) - INTERVAL '1 second',
                      INTERVAL '1 day') ts
         ),
     resampled AS (
         SELECT period_start,
                period_end,
                classes_having_events.group,
                classes_having_events.sub_group,
                interested_events.count
         FROM periods
                  CROSS JOIN classes_having_events
                  LEFT JOIN interested_events
                            ON time >= period_start AND time < period_end
                                AND interested_events.group = classes_having_events.group
                                AND interested_events.sub_group = classes_having_events.sub_group
         ORDER BY period_start DESC
         )
SELECT period_start                                                              AS time,
       "group",
       sub_group,
       MAX(count) OVER (PARTITION BY "group", "sub_group" ORDER BY period_start) AS count
FROM resampled
ORDER BY period_start DESC, "group", sub_group;
  • Related