Home > Software design >  Best way to interpolate missing dates by group in SQL
Best way to interpolate missing dates by group in SQL

Time:11-09

I have a dataset like this:

Date Account Spend
2/1/21 A 4
3/1/21 A 6
5/1/21 A 7
6/1/21 A 2
4/1/21 B 8
5/1/21 B 2
6/1/21 B 1
9/1/21 B 7

Note that the dates the accounts span are NOT the same. I want to fill in zeroes for the missing months that are in between the minimum and maximum date for each account. The final table would look like this:

Date Account Spend
2/1/21 A 4
3/1/21 A 6
4/1/21 A 0
5/1/21 A 7
6/1/21 A 2
4/1/21 B 8
5/1/21 B 2
6/1/21 B 1
7/1/21 B 0
8/1/21 B 0
9/1/21 B 7

What's the best way to go about this in Snowflake SQL?

I thought I could cross join the accounts with another table that contains all months. Then I could join back to the original table and fill in any missing values within the Spend column with zeroes. But I'm unsure how to deal with the "lagging" and "leading" nulls that result from this. For example, there would be a null value for the combination of 2/1/21 and B after the cross join but that date occurs before the first occurrence of B in the original table (4/1/21) so I wouldn't want that row in my final dataset.

CodePudding user response:

We indeed can solve this using a calendar table of month beginnings, say calendar(date).

We can define the date range of each account in an aggregate subquery, then cross join it with the calendar table ; this gives us all possible date/account tuples. All that is left to do is to try and bring the corresponding row, if any, with a left join.

select c.date, a.account, coalesce(t.spend, 0) spend
from (
    select account, min(date) min_date, max(date) max_date 
    from mytable
    group by account
) a
inner join calendar c on c.date >= a.min_date and c.date <= a.max_date
left join mytable   t on t.date = c.date and t.account = a.account

CodePudding user response:

Steps and approach:

  • First you need to find the max and min date for each group using the first_value function

  • You need a data set that contains unique months combined with all the unique accounts - this is where the cross join comes in. You can adjust the name of the tables accordingly

  • You need to join the max_date and min_date values to each account value, this is needed to filter out the rows where you have the case that the min_date is out of the range of the actual data set, just like you described the case of the account B for date 2/1/21. To achieve that you need to join only on the key accounts = accounts

  • Then you need to join based on the dates as well and the accounts i.e m2, to get the spend values based on dates and the accounts

  • lastly, you can filter out the rows that occur after the max and before the min of the original dates that you have for a given account in the where clause

with main as (

select 
date,
account,
coalesce(spend,0) as total_spend,
first_value(Date) over(partition by account order by date desc) as max_date,
first_value(Date) over(partition by account order by date) as min_date
from <table_name>
),
combining as (
-- make sure you have distinct accounts and months stored in these tables
select distinct accounts from <account_name_table>
cross join <calendar_month_table>
),
joining as (
select
c.date,
c.accounts,
coalesce(m2.total_spend,0) as new_spend,
main.max_date,
main.min_date
from combining
left join main
on combining.accounts = main.accounts
left join main as m2
on combining.accounts = m2.accounts
and combining.date = m2.date
)
select * from joining where min_date <= date and max_date >= date
  • Related