I have a dataset like this:
Date | Account | Spend |
---|---|---|
1/2/21 | A | null |
1/3/21 | A | null |
1/4/21 | A | 4 |
1/5/21 | A | 6 |
1/6/21 | A | null |
1/7/21 | A | 7 |
1/8/21 | A | null |
1/2/21 | B | null |
1/3/21 | B | 4 |
1/4/21 | B | null |
1/5/21 | B | 7 |
1/6/21 | B | null |
I want to trim any leading and lagging nulls by group but keep the nulls where there is a value both before and after. The final dataset would look like this:
Date | Account | Spend |
---|---|---|
1/4/21 | A | 4 |
1/5/21 | A | 6 |
1/6/21 | A | null |
1/7/21 | A | 7 |
1/3/21 | B | 4 |
1/4/21 | B | null |
1/5/21 | B | 7 |
How can I do this with SQL (specifically Snowflake SQL)?
CodePudding user response:
I believe the basic idea here is to check before and after values in case a given spend column value is NULL. If so, delete it.
To check before and after we have window functions lead
and lag
; however, we cannot use them in where clause and so cannot build a straight delete.
Here's my take on it -
delete from yourtable a where exists
(select * from
(select date1, account, spend,
case
when spend is null AND (lag(spend) over (order by 0) is null
OR
lead(spend) over (order by 0) is null)
then 0 else 1
end testcol
from yourtable
) as x
where
x.date1 = a.date1
and x.account = a.account
and coalesce(x.spend,0) = coalesce(a.spend,0)
and x.testcol=0
)
CodePudding user response:
You can identify the offending rows with lag
, lead
and conditional logic; the following query evits them from the result set:
select *
from (
select t.*,
lead(spend) over(partition by account order by date) lead_spend,
lag(spend) over(partition by account order by date) lag_spend
from mytable t
) t
where spend is not null or coalesce(lead_sped, lag_spend) is not null
In Snowflake we might be able to express this with the qualify
clause:
select *
from mytable
qualify
spend is not null
or coalesce(
lead(spend) over(partition by account order by date),
lag (spend) over(partition by account order by date)
) is not null
If you want to actually delete
the rows from the table, I would recommend the using
clause. Assuming that (date, account)
can be used as a unique key:
delete from mytable t
using (
select *
from mytable
qualify
spend is null
and (
lead(spend) over(partition by account order by date) is null
or lag (spend) over(partition by account order by date) is null
)
) x
where x.date = t.date and x.account = t.account