Snowflake/SQL: create a time-series table such that every ID is visible, and if ID is null, it uses-CodePudding

Suppose I have the following table:

Day	ID	Value
2022-11-05	0	A
2022-11-06	1	B
2022-11-07	0	C

Now given a time window of 1 day, I want to create a time-series table that:

The Day column granular unit is 1 day
Each Day row displays every ID in the table (like cross-join)
Moreover, if for that day, the ID is not recorded, then it uses the Value from the previous day. If it does not exist before this day, we can ignore it.

Let's say I want to view this time series from 2022-11-05 to 2022-11-08, this is the desired output:

Day	ID	Value
2022-11-05	0	A
2022-11-06	0	A
2022-11-06	1	B
2022-11-07	0	C
2022-11-07	1	B
2022-11-08	0	C
2022-11-08	1	B

Explanation: ID=0 is not recorded on 11-06 so it uses the value from the previous day. ID=1 does not record new value on 11-07 so it uses the value from 11-06.

Note that the number of columns can be large, so if possible, I am looking for a solution that handles it too.

CodePudding user response：

Way One:

first we start with some data
then we find the_days in the period we are interested in
then we find the data_start for each id
then we join those values together, and use LAG with the IGNORE NULLS OVER clause to find the "prior values" if the current values in not present via NVL

with data(Day, ID, Value) as (
    select * from values
        ('2022-11-05'::date, 0, 'A'),
        ('2022-11-06'::date, 1, 'B'),
        ('2022-11-07'::date, 0, 'C')
), the_days as (
    select 
        row_number() over (order by null)-1 as rn
        ,dateadd('day', rn, from_day) as day
    from (
        select 
            min(day) as from_day
            ,'2022-11-08' as to_day
            ,datediff('days', from_day, to_day) as days
        from data
    ), table(generator(ROWCOUNT => 200))
    qualify rn <= days
), data_starts as (
    select 
        id, 
        min(day) as start_day
    from data
    group by 1
)
select 
    td.day,
    ds.id,
    nvl(d.value, lag(d.value) ignore nulls over (partition by ds.id order by td.day)) as value
from data_starts as ds
join the_days as td 
    on td.day >= ds.start_day
left join data as d
    on ds.id = d.id and d.day = td.day
order by 1,2;

gives:

DAY	ID	VALUE
2022-11-05	0	A
2022-11-06	0	A
2022-11-06	1	B
2022-11-07	0	C
2022-11-07	1	B
2022-11-08	0	C
2022-11-08	1	B

Way Two:

with data(Day, ID, Value) as (
    select * from values
        ('2022-11-05'::date, 0, 'A'),
        ('2022-11-06'::date, 1, 'B'),
        ('2022-11-07'::date, 0, 'C')
), the_days as (
    select 
        dateadd('day', row_number() over (order by null)-1, '2022-11-05') as day
    from table(generator(ROWCOUNT => 4))
)
select 
    td.day,
    i.id,
    nvl(d.value, lag(d.value) ignore nulls over (partition by i.id order by td.day)) as _value
from the_days as td
cross join (select distinct id from data) as i
left join data as d
    on i.id = d.id and d.day = td.day
qualify _value is not null
order by 1,2;

this requires a unique name for the _values output so it can be referenced in the qualify without needing to duplicate the code.