I'm trying to detect users first sessions.
ID | event_server_date | Event | Row_number |
---|---|---|---|
1 | 2022-10-26 09:43 | abc | 1 |
1 | 2022-10-26 09:45 | cde | 2 |
1 | 2022-10-26 09:47 | ykz | 3 |
1 | 2022-10-26 09:48 | fun | 4 |
1 | 2022-10-26 09:50 | start_event | 5 |
1 | 2022-10-26 09:55 | x | 6 |
1 | 2022-10-26 09:56 | y | 7 |
1 | 2022-10-26 09:56 | z | 8 |
2 | 2022-10-26 09:12 | plz | 1 |
2 | 2022-10-26 09:15 | rck | 2 |
2 | 2022-10-26 09:15 | dsp | 3 |
2 | 2022-10-26 09:17 | vnl | 4 |
2 | 2022-10-26 09:23 | start_event | 5 |
2 | 2022-10-26 09:23 | k | 6 |
2 | 2022-10-26 09:26 | l | 7 |
Desired Output:
ID | Timestamp | Event | Row_number |
---|---|---|---|
1 | 2022-10-26 09:50 | start_event | 5 |
1 | 2022-10-26 09:55 | x | 6 |
1 | 2022-10-26 09:56 | y | 7 |
1 | 2022-10-26 09:56 | z | 8 |
2 | 2022-10-26 09:23 | start_event | 5 |
2 | 2022-10-26 09:23 | k | 6 |
2 | 2022-10-26 09:26 | l | 7 |
Real timestamp column looks like this: 1970-01-20 06:57:25.583738 UTC I'm using event based data and my table is quite large.
Is there any way for me pick those desired rows only? And discard all events before start_event in every partition.
I've got this far but I have no idea how to discard unwanted events for every partition.
SELECT ID, event_server_date , Event,
row_number() over(partition by ID ORDER BY event_server_date ASC) AS Row_number
FROM `my_table`
ORDER BY event_server_date ASC
Note: I have been using SQL for only two months so I might not know the concepts you're talking about.
CodePudding user response:
We can use COUNT()
as an analytic function along with a rolling sum:
WITH cte AS (
SELECT *, COUNT(CASE WHEN Event = 'start_event' THEN 1 END)
OVER (PARTITION BY ID ORDER BY event_server_date) cnt,
ROW_NUMBER() OVER (PARTITION BY makro_id
ORDER BY event_server_date) AS Row_number
FROM my_table
)
SELECT ID, event_server_date, Event, Row_number
FROM cte
WHERE cnt > 0
ORDER BY ID, event_server_date;
The count generated above will be one or greater once a row is encountered having start_event
. We only retain rows late enough for this condition to be true.
CodePudding user response:
I would do it with LAST_VALUE()
analytic function and QUALIFY
SELECT *
FROM my_table
QUALIFY LAST_VALUE(IF(event = 'start_event', TRUE, NULL) IGNORE NULLS)
OVER (PARTITION BY ID ORDER BY event_server_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
You can ignore ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
because it's default window in analytic functions but I like it to leave it for the sake of clarity.