The title is a bit confusing so the easiest think is to describe my problem using an example.
Assuming I'm having this dataset:
Username | Event_time | Is_new_session |
---|---|---|
userA | 2022-09-30 00:00:01.000000 | True |
userA | 2022-09-30 00:00:02.000000 | False |
userA | 2022-09-30 01:00:00.000000 | True |
userA | 2022-09-30 02:00:00.000000 | True |
userA | 2022-09-30 02:00:02.000000 | False |
userA | 2022-09-30 02:00:04.000000 | False |
userA | 2022-09-30 00:00:05.000000 | False |
userB | 2022-09-30 03:00:00.000000 | True |
At first I want to achieve something like this:
Username | Event_time | Is_new_session | rnk |
---|---|---|---|
userA | 2022-09-30 00:00:01.000000 | True | 1 |
userA | 2022-09-30 00:00:02.000000 | False | 1 |
userA | 2022-09-30 01:00:00.000000 | True | 2 |
userA | 2022-09-30 02:00:00.000000 | True | 3 |
userA | 2022-09-30 02:00:02.000000 | False | 3 |
userA | 2022-09-30 02:00:04.000000 | False | 3 |
userA | 2022-09-30 02:00:05.000000 | False | 3 |
userB | 2022-09-30 03:00:00.000000 | True | 4 |
So I want to add the same rnk value (probably rank is not the correct term) to all the records of the group until the next record has Is_new_session = True, in order to apply some aggregate functions similar and produce something like this:
SessiondID | Username | First_Event_time | Last_Event_time | Event_counts |
---|---|---|---|---|
1 | userA | 2022-09-30 00:00:01.000000 | 2022-09-30 00:00:02.000000 | 2 |
2 | userA | 2022-09-30 01:00:00.000000 | 2022-09-30 01:00:00.000000 | 1 |
3 | userA | 2022-09-30 02:00:00.000000 | 2022-09-30 02:00:05.000000 | 4 |
4 | userB | 2022-09-30 03:00:00.000000 | 2022-09-30 03:00:00.000000 | 1 |
Do you have any ideas on window functions that could help achieve this "ranking" (results shown on table no.2 )? T-SQL won't really work in my case. Any other recommendations for going from the 1st table to the 3rd without the need of the 2nd are of course very welcome.
CodePudding user response:
We use case when
to mark every time event is true and then run a windw_function count()
to divide the table into groups.
select session_id
,Username
,min(Event_time) as First_Event_time
,max(Event_time) as Last_Event_time
,count(*) as Event_counts
from (
select *
,count(case Is_new_session when 'True' then 1 end) over(order by Event_time) as session_id
from t
) t
group by Username, session_id
session_id | Username | First_Event_time | Last_Event_time | Event_counts |
---|---|---|---|---|
1 | userA | 2022-09-30 00:00:01.000 | 2022-09-30 00:00:02.000 | 2 |
2 | userA | 2022-09-30 01:00:00.000 | 2022-09-30 01:00:00.000 | 1 |
3 | userA | 2022-09-30 02:00:00.000 | 2022-09-30 02:00:05.000 | 4 |
4 | userB | 2022-09-30 03:00:00.000 | 2022-09-30 03:00:00.000 | 1 |