I have 10k IDs and, for each one of them, there are 10 zones; each zone can be affected in some way
I want to count how long each zone was affected for each ID, day by day (considering last week as a whole)
To know if/when a zone was affected, the column AFFECTED_ZONE returns a value from 1 to 10 (identifying which zone it was)
I know the zone was normalized once the next row in AFFECTED_ZONE is 0
So, for example, it looks a little like this:
DATE | ID | AFFECTED_ZONE |
---|---|---|
2022-12-21 15:00:00 | 1 | 1 |
2022-12-21 15:03:00 | 1 | 0 |
2022-12-21 15:15:00 | 1 | 3 |
2022-12-21 15:25:00 | 1 | 0 |
2022-12-21 16:00:00 | 1 | 0 |
2022-12-21 16:43:00 | 1 | 4 |
2022-12-21 17:00:00 | 1 | 0 |
In this case, zone 1 of ID 1 was affected at 15:00:00 and was normalized at 15:03:00 - overall affected time should be 3 min; same thing with zone 4 in this example (affected at 16:43:00 and normalized at 17:00:00 - overall affected time should be 17 min)
For zone 3, the affectation happened at 15:15:00 and was normalized at 15:25:00 (the first 0); the second 0 at a later time is not considered - overall affected time should be 10 min
The problem is that, sometimes, it can look like this:
DATE | ID | AFFECTED_ZONE |
---|---|---|
2022-12-21 15:00:00 | 1 | 1 |
2022-12-21 15:03:00 | 1 | 1 |
2022-12-21 15:15:00 | 1 | 0 |
2022-12-21 15:25:00 | 1 | 6 |
2022-12-21 16:00:00 | 1 | 4 |
2022-12-21 16:43:00 | 1 | 3 |
2022-12-21 17:00:00 | 1 | 0 |
In this case, zone 1 of ID 1 was affected at 15:00:00 and was normalized at 15:15:00; the 1 that shows up again at 15:03:00 should be disregarded, since the same zone had already been affected since 15:00:00 - overall affected time should be 15 min
After this, zones 6, 4 and 3 were affected in a row, and normalization only came at 17:00:00; the overall affected times for each zone, respectively, should be 95 min, 60 min and 17 min
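In case it helps, here is the sample data for this second scenario (EVENTS is just a placeholder table name, since I can't share the real one):

-- sample data for the second scenario; EVENTS is a placeholder name
create or replace table events (
    date          timestamp,
    id            integer,
    affected_zone integer
);
insert into events values
    ('2022-12-21 15:00:00', 1, 1),
    ('2022-12-21 15:03:00', 1, 1),
    ('2022-12-21 15:15:00', 1, 0),
    ('2022-12-21 15:25:00', 1, 6),
    ('2022-12-21 16:00:00', 1, 4),
    ('2022-12-21 16:43:00', 1, 3),
    ('2022-12-21 17:00:00', 1, 0);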
I can't figure this second part out. At first, I separated the dates of each event (affectation and normalization) like this:
case when affected_zone <> 0 then date end as affected_at,
case when affected_zone = 0 then date end as normal_at
Then, I added a LEAD() function so that I could subtract the AFFECTED_AT date from the NORMAL_AT date and thus find the overall affected time, like this:
datediff(minutes, affected_at, lead(normal_at) over (partition by id order by date)) as lead
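Putting those pieces together, the full query looks roughly like this (EVENTS again being the placeholder table name):

select date,
       id,
       affected_zone,
       affected_at,
       normal_at,
       datediff(minutes, affected_at,
                lead(normal_at) over (partition by id order by date)) as lead
from (select date,
             id,
             affected_zone,
             case when affected_zone <> 0 then date end as affected_at,
             case when affected_zone = 0 then date end as normal_at
      from events) foo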
It works just fine for the first scenario:
DATE | ID | AFFECTED_ZONE | AFFECTED_AT | NORMAL_AT | LEAD |
---|---|---|---|---|---|
2022-12-21 15:00:00 | 1 | 1 | 2022-12-21 15:00:00 | null | 3 |
2022-12-21 15:03:00 | 1 | 0 | null | 2022-12-21 15:03:00 | null |
2022-12-21 15:15:00 | 1 | 3 | 2022-12-21 15:15:00 | null | 10 |
2022-12-21 15:25:00 | 1 | 0 | null | 2022-12-21 15:25:00 | null |
2022-12-21 16:00:00 | 1 | 0 | null | 2022-12-21 16:00:00 | null |
2022-12-21 16:43:00 | 1 | 4 | 2022-12-21 16:43:00 | null | 17 |
2022-12-21 17:00:00 | 1 | 0 | null | 2022-12-21 17:00:00 | null |
However, for the second one, the LEAD() only considers the last row in which the AFFECTED_AT column is not null, disregarding the other ones, like this:
DATE | ID | AFFECTED_ZONE | AFFECTED_AT | NORMAL_AT | LEAD |
---|---|---|---|---|---|
2022-12-21 15:00:00 | 1 | 1 | 2022-12-21 15:00:00 | null | null |
2022-12-21 15:03:00 | 1 | 1 | 2022-12-21 15:03:00 | null | 12 |
2022-12-21 15:15:00 | 1 | 0 | null | 2022-12-21 15:15:00 | null |
2022-12-21 15:25:00 | 1 | 6 | 2022-12-21 15:25:00 | null | null |
2022-12-21 16:00:00 | 1 | 4 | 2022-12-21 16:00:00 | null | null |
2022-12-21 16:43:00 | 1 | 3 | 2022-12-21 16:43:00 | null | 17 |
2022-12-21 17:00:00 | 1 | 0 | null | 2022-12-21 17:00:00 | null |
I could use IGNORE NULLS with the LEAD() function, and it would work well for the cases in which different zones come one after the other, but it wouldn't work when the same zone repeats itself, as I would be adding unnecessary time. For example:
DATE | ID | AFFECTED_ZONE | AFFECTED_AT | NORMAL_AT | LEAD |
---|---|---|---|---|---|
2022-12-21 15:00:00 | 1 | 1 | 2022-12-21 15:00:00 | null | 15 |
2022-12-21 15:03:00 | 1 | 1 | 2022-12-21 15:03:00 | null | 12 |
2022-12-21 15:15:00 | 1 | 0 | null | 2022-12-21 15:15:00 | null |
2022-12-21 15:25:00 | 1 | 6 | 2022-12-21 15:25:00 | null | 95 |
2022-12-21 16:00:00 | 1 | 4 | 2022-12-21 16:00:00 | null | 60 |
2022-12-21 16:43:00 | 1 | 3 | 2022-12-21 16:43:00 | null | 17 |
2022-12-21 17:00:00 | 1 | 0 | null | 2022-12-21 17:00:00 | null |
The overall affected time for zone 1 should be 15 min, but if I added everything up it would be 27 min
Any ideas on how to solve this? I'm no expert on Snowflake/SQL (quite on the contrary) so I would much appreciate it!!
CodePudding user response:
I can think of two possible approaches; the second is probably the best, but I'll let you decide:
1 - Remove Extra Records
Assuming, based on your question, that an ID can only affect an AFFECTED_ZONE once (each occurrence possibly including multiple records), i.e.
DATE | ID | AFFECTED_ZONE |
---|---|---|
2022-12-21 15:00:00 | 1 | 1 |
2022-12-21 15:03:00 | 1 | 1 |
2022-12-21 15:15:00 | 1 | 0 |
2022-12-21 15:25:00 | 1 | 6 |
2022-12-21 16:00:00 | 1 | 4 |
2022-12-21 16:43:00 | 1 | 3 |
2022-12-21 17:00:00 | 1 | 0 |
and not
DATE | ID | AFFECTED_ZONE |
---|---|---|
2022-12-21 15:00:00 | 1 | 1 |
2022-12-21 15:03:00 | 1 | 1 |
2022-12-21 15:15:00 | 1 | 0 |
2022-12-21 15:25:00 | 1 | 1 |
2022-12-21 16:00:00 | 1 | 0 |
2022-12-21 16:43:00 | 1 | 3 |
2022-12-21 17:00:00 | 1 | 0 |
We could use a LAG function to find each record's previous AFFECTED_ZONE and remove records with the same ID and AFFECTED_ZONE as their predecessor, while keeping all rows where AFFECTED_ZONE = 0. If you do have more than one occurrence of an ID, AFFECTED_ZONE pairing, this process would merge them together.
select foo.id,
foo.date,
foo.affected_zone
from (select id,
date,
affected_zone,
lag(affected_zone,1) over (partition by id
order by date) prev_affected_zone
from your_table) foo
where ifnull(foo.affected_zone,-1) != ifnull(foo.prev_affected_zone,-1)
or ifnull(foo.affected_zone,-1) = 0
This approach will give you something like
DATE | ID | AFFECTED_ZONE |
---|---|---|
2022-12-21 15:00:00 | 1 | 1 |
2022-12-21 15:15:00 | 1 | 0 |
2022-12-21 15:25:00 | 1 | 6 |
2022-12-21 16:00:00 | 1 | 4 |
2022-12-21 16:43:00 | 1 | 3 |
2022-12-21 17:00:00 | 1 | 0 |
This allows you to use your existing LEAD() logic, now safely with IGNORE NULLS, since repeated zones have been removed; see the sketch below.
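A rough, untested sketch of both steps combined (your_table as above):

-- step 1: drop repeated zones; step 2: LEAD ... IGNORE NULLS to find the
-- next normalization for each affected row
with deduped as (
  select foo.id, foo.date, foo.affected_zone
  from (select id, date, affected_zone,
               lag(affected_zone,1) over (partition by id order by date) prev_affected_zone
        from your_table) foo
  where ifnull(foo.affected_zone,-1) != ifnull(foo.prev_affected_zone,-1)
     or ifnull(foo.affected_zone,-1) = 0
),
timed as (
  select id, date, affected_zone,
         datediff(minutes, date,
                  lead(case when affected_zone = 0 then date end) ignore nulls
                    over (partition by id order by date)) as affected_minutes
  from deduped
)
select id, affected_zone, date as affected_at, affected_minutes
from timed
where affected_zone != 0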
2 - Use FIRST_VALUE instead of LEAD
Use your current process but replace LEAD with FIRST_VALUE.
FIRST_VALUE selects the first value in an ordered group of values, so we can ignore nulls and return the first normal_at value at or after the current row.
select date,
id,
affected_zone,
affected_at,
       first_value(normal_at) ignore nulls over (partition by id
                                                  order by date
                                                  rows between current row and unbounded following) normal_at
from (select id,
date,
affected_zone,
case when affected_zone != 0 then date end affected_at,
case when affected_zone = 0 then date end normal_at
from your_table) foo
This should give you:
DATE | ID | AFFECTED_ZONE | AFFECTED_AT | NORMAL_AT |
---|---|---|---|---|
2022-12-21 15:00:00 | 1 | 1 | 2022-12-21 15:00:00 | 2022-12-21 15:15:00 |
2022-12-21 15:03:00 | 1 | 1 | 2022-12-21 15:03:00 | 2022-12-21 15:15:00 |
2022-12-21 15:15:00 | 1 | 0 | null | 2022-12-21 15:15:00 |
2022-12-21 15:25:00 | 1 | 6 | 2022-12-21 15:25:00 | 2022-12-21 17:00:00 |
2022-12-21 16:00:00 | 1 | 4 | 2022-12-21 16:00:00 | 2022-12-21 17:00:00 |
2022-12-21 16:43:00 | 1 | 3 | 2022-12-21 16:43:00 | 2022-12-21 17:00:00 |
2022-12-21 17:00:00 | 1 | 0 | null | 2022-12-21 17:00:00 |
You can then do your duration calculation and select the first record for each ID, AFFECTED_ZONE pairing, probably with a ROW_NUMBER; a sketch follows.
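Something like this (untested; I partition by NORMAL_AT as well so that a zone affected again in a later, separate occurrence is not merged into the first one):

with zoned as (
  select date, id, affected_zone, affected_at,
         first_value(normal_at) ignore nulls
           over (partition by id order by date
                 rows between current row and unbounded following) normal_at
  from (select id, date, affected_zone,
               case when affected_zone != 0 then date end affected_at,
               case when affected_zone = 0 then date end normal_at
        from your_table) foo
)
select id, affected_zone, affected_at, normal_at,
       datediff(minutes, affected_at, normal_at) as affected_minutes
from (select id, affected_zone, affected_at, normal_at,
             row_number() over (partition by id, affected_zone, normal_at
                                order by affected_at) rn
      from zoned
      where affected_zone != 0) z
where rn = 1
order by id, affected_at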
CodePudding user response:
I would approach this as a gaps-and-islands problem, which gives us a lot of flexibility to address the various use cases.
My pick would be to define groups of adjacent records that start with one or more affected zones and end with a normalization (affected_zone = 0), using window functions:
select t.*,
sum(case when lag_affected_zone = 0 then 1 else 0 end) over(partition by id order by date) grp
from (
select t.*,
lag(affected_zone, 1, 0) over(partition by id order by date) lag_affected_zone
from mytable t
) t
Starting with a mix of some of the data you provided, which I hope represents the different use cases, this returns:
DATE | ID | AFFECTED_ZONE | lag_affected_zone | grp |
---|---|---|---|---|
2022-12-21 15:00:00.000 | 1 | 1 | 0 | 1 |
2022-12-21 15:03:00.000 | 1 | 1 | 1 | 1 |
2022-12-21 15:15:00.000 | 1 | 0 | 1 | 1 |
2022-12-21 15:17:00.000 | 1 | 0 | 0 | 2 |
2022-12-21 15:25:00.000 | 1 | 6 | 0 | 3 |
2022-12-21 16:00:00.000 | 1 | 4 | 6 | 3 |
2022-12-21 16:43:00.000 | 1 | 3 | 4 | 3 |
2022-12-21 16:50:00.000 | 1 | 1 | 3 | 3 |
2022-12-21 17:00:00.000 | 1 | 0 | 1 | 3 |
You can see how records are grouped together to form consistent islands. Now we can work on each group: we want to bring out the earliest date of each affected zone in the group and compare it to the latest date of the group (which corresponds to the normalization step); we can use aggregation:
select *
from (
select id, affected_zone, min(date) affected_at, max(max(date)) over(partition by id, grp) normalized_at
from (
select t.*,
sum(case when lag_affected_zone = 0 then 1 else 0 end) over(partition by id order by date) grp
from (
select t.*,
lag(affected_zone, 1, 0) over(partition by id order by date) lag_affected_zone
from mytable t
) t
) t
group by id, affected_zone, grp
) t
where affected_zone != 0
order by id, affected_at
id | affected_zone | affected_at | normalized_at |
---|---|---|---|
1 | 1 | 2022-12-21 15:00:00.000 | 2022-12-21 15:15:00.000 |
1 | 6 | 2022-12-21 15:25:00.000 | 2022-12-21 17:00:00.000 |
1 | 4 | 2022-12-21 16:00:00.000 | 2022-12-21 17:00:00.000 |
1 | 3 | 2022-12-21 16:43:00.000 | 2022-12-21 17:00:00.000 |
1 | 1 | 2022-12-21 16:50:00.000 | 2022-12-21 17:00:00.000 |
Here is a demo on DB Fiddle: it runs on SQL Server, but uses standard SQL that Snowflake supports as well.
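If you also need the per-day totals the question mentions, the last step is an aggregation over these islands; a sketch in Snowflake syntax (note that an island spanning midnight is counted entirely on the day it started):

with islands as (
  select id, affected_zone, min(date) affected_at,
         max(max(date)) over(partition by id, grp) normalized_at
  from (
    select t.*,
           sum(case when lag_affected_zone = 0 then 1 else 0 end)
             over(partition by id order by date) grp
    from (
      select t.*,
             lag(affected_zone, 1, 0) over(partition by id order by date) lag_affected_zone
      from mytable t
    ) t
  ) t
  group by id, affected_zone, grp
)
select id,
       affected_zone,
       to_date(affected_at) as day,
       sum(datediff(minute, affected_at, normalized_at)) as affected_minutes
from islands
where affected_zone != 0
group by id, affected_zone, to_date(affected_at)
order by id, day, affected_zone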