Home > Software design >  Remove rows with leading and lagging nulls by group in SQL
Remove rows with leading and lagging nulls by group in SQL

Time:11-09

I have a dataset like this:

Date Account Spend
1/2/21 A null
1/3/21 A null
1/4/21 A 4
1/5/21 A 6
1/6/21 A null
1/7/21 A 7
1/8/21 A null
1/2/21 B null
1/3/21 B 4
1/4/21 B null
1/5/21 B 7
1/6/21 B null

I want to trim any leading and lagging nulls by group but keep the nulls where there is a value both before and after. The final dataset would look like this:

Date Account Spend
1/4/21 A 4
1/5/21 A 6
1/6/21 A null
1/7/21 A 7
1/3/21 B 4
1/4/21 B null
1/5/21 B 7

How can I do this with SQL (specifically Snowflake SQL)?

CodePudding user response:

I believe the basic idea here is to check before and after values in case a given spend column value is NULL. If so, delete it.

To check before and after we have window functions lead and lag; however, we cannot use them in where clause and so cannot build a straight delete.

Here's my take on it -

delete from yourtable a where exists 
    (select * from 
        (select date1, account, spend,
        case 
            when spend is null AND (lag(spend) over (order by 0) is null
            OR
            lead(spend) over (order by 0) is null) 
            then 0 else 1
        end testcol
        from yourtable 
        ) as x
    where
        x.date1 = a.date1
        and x.account = a.account 
        and coalesce(x.spend,0) = coalesce(a.spend,0)
        and x.testcol=0
)

CodePudding user response:

You can identify the offending rows with lag, lead and conditional logic; the following query evits them from the result set:

select *
from (
    select t.*,
        lead(spend) over(partition by account order by date) lead_spend,
        lag(spend) over(partition by account order by date) lag_spend
    from mytable t
) t
where spend is not null or coalesce(lead_sped, lag_spend) is not null

In Snowflake we might be able to express this with the qualify clause:

select *
from mytable
qualify 
    spend is not null
    or coalesce(
        lead(spend) over(partition by account order by date),
        lag (spend) over(partition by account order by date)
    ) is not null

If you want to actually delete the rows from the table, I would recommend the using clause. Assuming that (date, account) can be used as a unique key:

delete from mytable t
using (
    select *
    from mytable
    qualify 
        spend is null
        and (
               lead(spend) over(partition by account order by date) is null
            or lag (spend) over(partition by account order by date) is null
        )
) x
where x.date = t.date and x.account = t.account
  • Related