I have 10k IDs and, for each one of them, there are 10 zones; each zone can be affected in some way
I want to count how long each zone was affected for each ID, day by day (considering last week as a whole)
To know if/when a zone was affected, the column AFFECTED_ZONE returns a value from 1 to 10 (identifying which zone it was)
I know the zone was normalized once the next row in AFFECTED_ZONE is 0
So, for example, it looks a little like this:
DATE | ID | AFFECTED_ZONE |
---|---|---|
2022-12-21 15:00:00 | 1 | 1 |
2022-12-21 15:03:00 | 1 | 0 |
2022-12-21 15:15:00 | 1 | 3 |
2022-12-21 15:25:00 | 1 | 0 |
2022-12-21 16:00:00 | 1 | 0 |
2022-12-21 16:43:00 | 1 | 4 |
2022-12-21 17:00:00 | 1 | 0 |
In this case, zone 1 of ID 1 was affected at 15:00:00 and was normalized at 15:03:00 - overall affected time should be 3 min; same thing with zone 4 in this example (affected at 16:43:00 and normalized at 17:00:00 - overall affected time should be 17 min)
For zone 3, the affectation happened at 15:15:00 and was normalized at 15:25:00 (the first 0); the second 0 at a later time is not considered - overall affected time should be 10 min
The problem is that, sometimes, it can look like this:
DATE | ID | AFFECTED_ZONE |
---|---|---|
2022-12-21 15:00:00 | 1 | 1 |
2022-12-21 15:03:00 | 1 | 1 |
2022-12-21 15:15:00 | 1 | 0 |
2022-12-21 15:25:00 | 1 | 6 |
2022-12-21 16:00:00 | 1 | 4 |
2022-12-21 16:43:00 | 1 | 3 |
2022-12-21 17:00:00 | 1 | 0 |
In this case, zone 1 of ID 1 was affected at 15:00:00 and was normalized at 15:15:00; the 1 that shows up again at 15:03:00 should be disregarded, since the same zone had already been affected since 15:00:00 - overall affected time should be 15 min
After this, zones 6, 4 and 3 were affected in a row, and normalization only came at 17:00:00; the overall affected times for each zone, respectively, should be 95 min, 60 min and 17 min
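In case it helps, here is the sample data for this second scenario (EVENTS is just a placeholder table name, since I can't share the real one):

-- sample data for the second scenario; EVENTS is a placeholder name
create or replace table events (
    date          timestamp,
    id            integer,
    affected_zone integer
);
insert into events values
    ('2022-12-21 15:00:00', 1, 1),
    ('2022-12-21 15:03:00', 1, 1),
    ('2022-12-21 15:15:00', 1, 0),
    ('2022-12-21 15:25:00', 1, 6),
    ('2022-12-21 16:00:00', 1, 4),
    ('2022-12-21 16:43:00', 1, 3),
    ('2022-12-21 17:00:00', 1, 0);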
I can't figure this second part out. At first, I separated the dates of each event (affectation and normalization) like this:
case when affected_zone <> 0 then date end as affected_at,
case when affected_zone = 0 then date end as normal_at
Then, I added a LEAD() function so that I could subtract the AFFECTED_AT date from the NORMAL_AT date and thus find the overall affected time, like this:
datediff(minutes, affected_at, lead(normal_at) over (partition by id order by date)) as lead
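Putting those pieces together, the full query looks roughly like this (EVENTS again being the placeholder table name):

select date,
       id,
       affected_zone,
       affected_at,
       normal_at,
       datediff(minutes, affected_at,
                lead(normal_at) over (partition by id order by date)) as lead
from (select date,
             id,
             affected_zone,
             case when affected_zone <> 0 then date end as affected_at,
             case when affected_zone = 0 then date end as normal_at
      from events) foo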
It works just fine for the first scenario:
DATE | ID | AFFECTED_ZONE | AFFECTED_AT | NORMAL_AT | LEAD |
---|---|---|---|---|---|
2022-12-21 15:00:00 | 1 | 1 | 2022-12-21 15:00:00 | null | 3 |
2022-12-21 15:03:00 | 1 | 0 | null | 2022-12-21 15:03:00 | null |
2022-12-21 15:15:00 | 1 | 3 | 2022-12-21 15:15:00 | null | 10 |
2022-12-21 15:25:00 | 1 | 0 | null | 2022-12-21 15:25:00 | null |
2022-12-21 16:00:00 | 1 | 0 | null | 2022-12-21 16:00:00 | null |
2022-12-21 16:43:00 | 1 | 4 | 2022-12-21 16:43:00 | null | 17 |
2022-12-21 17:00:00 | 1 | 0 | null | 2022-12-21 17:00:00 | null |
However, for the second one, the LEAD() only considers the last row in which the AFFECTED_AT column is not null, disregarding the other ones, like this:
DATE | ID | AFFECTED_ZONE | AFFECTED_AT | NORMAL_AT | LEAD |
---|---|---|---|---|---|
2022-12-21 15:00:00 | 1 | 1 | 2022-12-21 15:00:00 | null | null |
2022-12-21 15:03:00 | 1 | 1 | 2022-12-21 15:03:00 | null | 12 |
2022-12-21 15:15:00 | 1 | 0 | null | 2022-12-21 15:15:00 | null |
2022-12-21 15:25:00 | 1 | 6 | 2022-12-21 15:25:00 | null | null |
2022-12-21 16:00:00 | 1 | 4 | 2022-12-21 16:00:00 | null | null |
2022-12-21 16:43:00 | 1 | 3 | 2022-12-21 16:43:00 | null | 17 |
2022-12-21 17:00:00 | 1 | 0 | null | 2022-12-21 17:00:00 | null |
I could use IGNORE NULLS with the LEAD() function, and it would work well for the cases in which different zones come one after the other, but it wouldn't work when the same zone repeats itself, as I would be adding unnecessary time. For example:
DATE | ID | AFFECTED_ZONE | AFFECTED_AT | NORMAL_AT | LEAD |
---|---|---|---|---|---|
2022-12-21 15:00:00 | 1 | 1 | 2022-12-21 15:00:00 | null | 15 |
2022-12-21 15:03:00 | 1 | 1 | 2022-12-21 15:03:00 | null | 12 |
2022-12-21 15:15:00 | 1 | 0 | null | 2022-12-21 15:15:00 | null |
2022-12-21 15:25:00 | 1 | 6 | 2022-12-21 15:25:00 | null | 95 |
2022-12-21 16:00:00 | 1 | 4 | 2022-12-21 16:00:00 | null | 60 |
2022-12-21 16:43:00 | 1 | 3 | 2022-12-21 16:43:00 | null | 17 |
2022-12-21 17:00:00 | 1 | 0 | null | 2022-12-21 17:00:00 | null |
The overall affected time for zone 1 should be 15 min, but if I added everything up it would be 27 min
Any ideas on how to solve this? I'm no expert on Snowflake/SQL (quite on the contrary) so I would much appreciate it!!
CodePudding user response:
I can think of two possible approaches; the second is probably the best, but I'll let you decide:
1 - Remove Extra Records
Assuming, based on your question, that an ID can only affect an AFFECTED_ZONE once (each occurrence possibly including multiple records), i.e.
DATE | ID | AFFECTED_ZONE |
---|---|---|
2022-12-21 15:00:00 | 1 | 1 |
2022-12-21 15:03:00 | 1 | 1 |
2022-12-21 15:15:00 | 1 | 0 |
2022-12-21 15:25:00 | 1 | 6 |
2022-12-21 16:00:00 | 1 | 4 |
2022-12-21 16:43:00 | 1 | 3 |
2022-12-21 17:00:00 | 1 | 0 |
and not
DATE | ID | AFFECTED_ZONE |
---|---|---|
2022-12-21 15:00:00 | 1 | 1 |
2022-12-21 15:03:00 | 1 | 1 |
2022-12-21 15:15:00 | 1 | 0 |
2022-12-21 15:25:00 | 1 | 1 |
2022-12-21 16:00:00 | 1 | 0 |
2022-12-21 16:43:00 | 1 | 3 |
2022-12-21 17:00:00 | 1 | 0 |
We could use a LAG function to find each record's previous AFFECTED_ZONE and remove records with the same ID and AFFECTED_ZONE as their predecessor, while keeping all rows where AFFECTED_ZONE = 0. If you do have more than one occurrence of an ID, AFFECTED_ZONE pairing, this process would merge them together.
select foo.id,
foo.date,
foo.affected_zone
from (select id,
date,
affected_zone,
lag(affected_zone,1) over (partition by id
order by date) prev_affected_zone
from your_table) foo
where ifnull(foo.affected_zone,-1) != ifnull(foo.prev_affected_zone,-1)
or ifnull(foo.affected_zone,-1) = 0
This approach will give you something like
DATE | ID | AFFECTED_ZONE |
---|---|---|
2022-12-21 15:00:00 | 1 | 1 |
2022-12-21 15:15:00 | 1 | 0 |
2022-12-21 15:25:00 | 1 | 6 |
2022-12-21 16:00:00 | 1 | 4 |
2022-12-21 16:43:00 | 1 | 3 |
2022-12-21 17:00:00 | 1 | 0 |
This allows you to use your existing LEAD() logic, now safely with IGNORE NULLS, since repeated zones have been removed; see the sketch below.
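A rough, untested sketch of both steps combined (your_table as above):

-- step 1: drop repeated zones; step 2: LEAD ... IGNORE NULLS to find the
-- next normalization for each affected row
with deduped as (
  select foo.id, foo.date, foo.affected_zone
  from (select id, date, affected_zone,
               lag(affected_zone,1) over (partition by id order by date) prev_affected_zone
        from your_table) foo
  where ifnull(foo.affected_zone,-1) != ifnull(foo.prev_affected_zone,-1)
     or ifnull(foo.affected_zone,-1) = 0
),
timed as (
  select id, date, affected_zone,
         datediff(minutes, date,
                  lead(case when affected_zone = 0 then date end) ignore nulls
                    over (partition by id order by date)) as affected_minutes
  from deduped
)
select id, affected_zone, date as affected_at, affected_minutes
from timed
where affected_zone != 0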
2 - Use FIRST_VALUE instead of LEAD
Use your current process but replace LEAD with FIRST_VALUE.
FIRST_VALUE selects the first value in an ordered group of values, so we can ignore nulls and return the first normal_at value at or after the current row.
select date,
id,
affected_zone,
affected_at,
       first_value(normal_at) ignore nulls over (partition by id
                                                  order by date
                                                  rows between current row and unbounded following) normal_at
from (select id,
date,
affected_zone,
case when affected_zone != 0 then date end affected_at,
case when affected_zone = 0 then date end normal_at
from your_table) foo
This should give you:
DATE | ID | AFFECTED_ZONE | AFFECTED_AT | NORMAL_AT |
---|---|---|---|---|
2022-12-21 15:00:00 | 1 | 1 | 2022-12-21 15:00:00 | 2022-12-21 15:15:00 |
2022-12-21 15:03:00 | 1 | 1 | 2022-12-21 15:03:00 | 2022-12-21 15:15:00 |
2022-12-21 15:15:00 | 1 | 0 | null | 2022-12-21 15:15:00 |
2022-12-21 15:25:00 | 1 | 6 | 2022-12-21 15:25:00 | 2022-12-21 17:00:00 |
2022-12-21 16:00:00 | 1 | 4 | 2022-12-21 16:00:00 | 2022-12-21 17:00:00 |
2022-12-21 16:43:00 | 1 | 3 | 2022-12-21 16:43:00 | 2022-12-21 17:00:00 |
2022-12-21 17:00:00 | 1 | 0 | null | 2022-12-21 17:00:00 |
You can then do your duration calculation and select the first record for each ID, AFFECTED_ZONE pairing, probably with a ROW_NUMBER; a sketch follows.
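Something like this (untested; I partition by NORMAL_AT as well so that a zone affected again in a later, separate occurrence is not merged into the first one):

with zoned as (
  select date, id, affected_zone, affected_at,
         first_value(normal_at) ignore nulls
           over (partition by id order by date
                 rows between current row and unbounded following) normal_at
  from (select id, date, affected_zone,
               case when affected_zone != 0 then date end affected_at,
               case when affected_zone = 0 then date end normal_at
        from your_table) foo
)
select id, affected_zone, affected_at, normal_at,
       datediff(minutes, affected_at, normal_at) as affected_minutes
from (select id, affected_zone, affected_at, normal_at,
             row_number() over (partition by id, affected_zone, normal_at
                                order by affected_at) rn
      from zoned
      where affected_zone != 0) z
where rn = 1
order by id, affected_at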
CodePudding user response:
I would approach this as a gaps-and-islands problem, which gives us a lot of flexibility to address the various use cases.
My pick would be to define groups of adjacent records that start with one or more affected zones and end with a normalization (affected_zone = 0), using window functions:
select t.*,
sum(case when lag_affected_zone = 0 then 1 else 0 end) over(partition by id order by date) grp
from (
select t.*,
lag(affected_zone, 1, 0) over(partition by id order by date) lag_affected_zone
from mytable t
) t
Starting with a mix of some of the data you provided, which I hope represents the different use cases, this returns:
DATE | ID | AFFECTED_ZONE | lag_affected_zone | grp |
---|---|---|---|---|
2022-12-21 15:00:00.000 | 1 | 1 | 0 | 1 |
2022-12-21 15:03:00.000 | 1 | 1 | 1 | 1 |
2022-12-21 15:15:00.000 | 1 | 0 | 1 | 1 |
2022-12-21 15:17:00.000 | 1 | 0 | 0 | 2 |
2022-12-21 15:25:00.000 | 1 | 6 | 0 | 3 |
2022-12-21 16:00:00.000 | 1 | 4 | 6 | 3 |
2022-12-21 16:43:00.000 | 1 | 3 | 4 | 3 |
2022-12-21 16:50:00.000 | 1 | 1 | 3 | 3 |
2022-12-21 17:00:00.000 | 1 | 0 | 1 | 3 |
You can see how records are grouped together to form consistent islands. Now we can work on each group: we want to bring out the earliest date of each affected zone in the group and compare it to the latest date of the group (which corresponds to the normalization step); we can use aggregation:
select *
from (
select id, affected_zone, min(date) affected_at, max(max(date)) over(partition by id, grp) normalized_at
from (
select t.*,
sum(case when lag_affected_zone = 0 then 1 else 0 end) over(partition by id order by date) grp
from (
select t.*,
lag(affected_zone, 1, 0) over(partition by id order by date) lag_affected_zone
from mytable t
) t
) t
group by id, affected_zone, grp
) t
where affected_zone != 0
order by id, affected_at
id | affected_zone | affected_at | normalized_at |
---|---|---|---|
1 | 1 | 2022-12-21 15:00:00.000 | 2022-12-21 15:15:00.000 |
1 | 6 | 2022-12-21 15:25:00.000 | 2022-12-21 17:00:00.000 |
1 | 4 | 2022-12-21 16:00:00.000 | 2022-12-21 17:00:00.000 |
1 | 3 | 2022-12-21 16:43:00.000 | 2022-12-21 17:00:00.000 |
1 | 1 | 2022-12-21 16:50:00.000 | 2022-12-21 17:00:00.000 |
Here is a demo on DB Fiddle: it runs on SQL Server, but uses standard SQL that Snowflake supports as well.
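If you also need the per-day totals the question mentions, the last step is an aggregation over these islands; a sketch in Snowflake syntax (note that an island spanning midnight is counted entirely on the day it started):

with islands as (
  select id, affected_zone, min(date) affected_at,
         max(max(date)) over(partition by id, grp) normalized_at
  from (
    select t.*,
           sum(case when lag_affected_zone = 0 then 1 else 0 end)
             over(partition by id order by date) grp
    from (
      select t.*,
             lag(affected_zone, 1, 0) over(partition by id order by date) lag_affected_zone
      from mytable t
    ) t
  ) t
  group by id, affected_zone, grp
)
select id,
       affected_zone,
       to_date(affected_at) as day,
       sum(datediff(minute, affected_at, normalized_at)) as affected_minutes
from islands
where affected_zone != 0
group by id, affected_zone, to_date(affected_at)
order by id, day, affected_zone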