I have to rewrite a script written for Snowflake into Databricks and need some help on how to replicate CONDITIONAL_TRUE_EVENT as Databricks doesn't have that function.
The need is for me to group events together if they have the same user and device and took place within 300 seconds (5 minutes) of each other.
CREATE TABLE events
(
event_timestamp timestamp,
user_id bigint,
device_id bigint
);
INSERT INTO events VALUES
('2022-07-12 05:00:00',1,1),
('2022-07-12 05:03:00',1,1),
('2022-07-12 05:04:00',1,2),
('2022-07-12 05:05:00',1,2),
('2022-07-12 05:06:00',2,1),
('2022-07-12 05:07:00',1,1),
('2022-07-12 05:15:00',1,1);
SELECT event_timestamp, user_id, device_id, group_id
FROM events
should return
'2022-07-12 05:00:00',1,1,1
'2022-07-12 05:03:00',1,1,1
'2022-07-12 05:04:00',1,2,2
'2022-07-12 05:05:00',1,2,2
'2022-07-12 05:06:00',2,1,3
'2022-07-12 05:07:00',1,1,1
'2022-07-12 05:15:00',1,1,4
The first 3 instances where user_id = 1, device_id = 1 are all group_id = 1 because the next event is within 5 minute of the previous event except for the last one because (group_id = 4) because at 05:15:00 it is more than 5 minutes away from the previous event with user_id = 1, device_id = 1 (05:07:00).
It seems to me that I should be able to find some combination of LAG, CASE, and SUM to calculate the group_id, but I just cannot figure it out.
Any help is appreciated!
CodePudding user response:
Edit: I had previously answered this for the CONDITIONAL_CHANGE_EVENT
, which is a bit more challenging to express in ANSI SQL. This updated answer is for CONDITIONAL_TRUE_EVENT
as the question asks.
It is simply a matter of conditional summing in the window function.
create or replace table T1(PK int, EVNT string);
insert into T1(PK, EVNT) values
(1, 'A'), (2, 'C'), (3, 'B'), (4, 'A'), (5, 'A'),
(6, 'C'), (7, 'C'), (8, 'A'), (9, 'D'), (10, 'A');
select
PK,
conditional_true_event(EVNT = 'A') over (partition by null order by PK)
from T1;
PK | CONDITIONAL_TRUE_EVENT(EVNT = 'A') OVER (PARTITION BY NULL ORDER BY PK) |
---|---|
1 | 1 |
2 | 1 |
3 | 1 |
4 | 2 |
5 | 3 |
6 | 3 |
7 | 3 |
8 | 4 |
9 | 4 |
10 | 5 |
select
PK,
sum(iff(EVNT = 'A', 1, 0)) over (partition by null order by PK) as TRUE_EVENT
from T1;
PK | TRUE_EVENT |
---|---|
1 | 1 |
2 | 1 |
3 | 1 |
4 | 2 |
5 | 3 |
6 | 3 |
7 | 3 |
8 | 4 |
9 | 4 |
10 | 5 |
So for your query you would replace iif(EVNT = 'A', 1, 0)
in the conditional sum with iif(TIME_DIFF > 300, 1, 0)
CodePudding user response:
this expected output is based on the sample table I have provided in the OP.