I have a table like the following and I am required to show the subtotal of the use_time_sec column grouping by event_datetime, event_name (only show login), user_id and system_id.
with sample_input as (
select '12/01/2023 14:27:59' as event_datetime, 'login' as event_name,'1' as user_id, 'X' as system_id, '0' as use_time_sec
union all
select '12/01/2023 14:28:05', 'screen 1', '1', 'X', '2'
union all
select '12/01/2023 14:28:05', 'screen 2', '1', 'X', '5',
union all
select '12/01/2023 14:28:17', 'screen 1', '1', 'X', '3',
union all
select '12/01/2023 14:28:23', 'logout', '1', '', '0',
union all
select '12/01/2023 14:28:23', 'login', '2', 'Y', '0',
union all
select '12/01/2023 14:28:23', 'screen 1', '2', 'Y', '10',
union all
select '12/01/2023 14:28:24', 'screen 2', '2', 'Y', '100',
union all
select '12/01/2023 14:28:29', 'login', '1', 'X', '0',
union all
select '12/01/2023 14:28:29', 'screen 1', '1', 'X', '500',
union all
select '12/01/2023 14:28:29', 'logout', '1', '', '0',
)
select * from sample_input
I can loop through the table to get my desired output. But thats not the most efficient solution as there are few millions of record in the table and growing everyday.
Will appreciate if someone can provide a better solution than what I have.
Note: The data is in google BigQuery.
Thanks
CodePudding user response:
This is known as the Gaps and Islands problem. We're trying to identify the islands of user sessions. We need to do a query which gives us some way to identify a session. This relies heavily on window functions.
One way is to count the number of logins seen per user.
select
*,
sum(1)
filter(where event_name = 'login')
over(partition by user_id order by event_time)
as session_num
from events
order by event_time
That will keep a tally per user_id. It will add to the tally every time it sees a user login.
event_time | event_type | user_id | use_time_sec | session_num |
---|---|---|---|---|
1000 | login | 1 | 0 | 1 |
1001 | things | 1 | 3 | 1 |
1001 | login | 2 | 10 | 1 |
1002 | logout | 1 | 7 | 1 |
1005 | logout | 2 | 20 | 1 |
1100 | login | 1 | 5 | 2 |
1101 | logout | 1 | 10 | 2 |
Now we have a way to identify each user's sessions. We can grouping by user_id and session_num. These are our islands.
with sessions as (
select
*,
sum(1)
filter(where event_name = 'login')
over(partition by user_id order by event_time)
as session_num
from events
order by event_time
)
select
min(event_time) as session_start,
user_id,
sum(use_time_sec) as total_use_time_sec
from sessions
group by user_id, session_num
order by session_start
session_start | user_id | total_use_time_sec |
---|---|---|
1000 | 1 | 10 |
1001 | 2 | 130 |
1100 | 1 | 15 |
Demonstration in PostgreSQL, but it should work fine on BigQuery.
CodePudding user response:
Thank you so much Schwern. That worked. I had to replace filter with IF.
with sample_input as (
select '12/01/2023 14:27:59' as event_datetime, 'login' as event_name,'1' as user_id, 'X' as system_id, '0' as use_time_sec
union all
select '12/01/2023 14:28:05', 'screen 1', '1', 'X', '2'
union all
select '12/01/2023 14:28:05', 'screen 2', '1', 'X', '5',
union all
select '12/01/2023 14:28:17', 'screen 1', '1', 'X', '3',
union all
select '12/01/2023 14:28:23', 'logout', '1', 'X', '0',
union all
select '12/01/2023 14:28:23', 'login', '2', 'Y', '0',
union all
select '12/01/2023 14:28:23', 'screen 1', '2', 'Y', '10',
union all
select '12/01/2023 14:28:24', 'screen 2', '2', 'Y', '100',
union all
select '12/01/2023 14:28:29', 'login', '1', 'X', '0',
union all
select '12/01/2023 14:28:30', 'screen 1', '1', 'X', '500',
union all
select '12/01/2023 14:35:26', 'logout', '1', 'X', '0',
),
page_time_be_session_details as (
select
*,
sum(if(event_name = 'login',1,0)) over(partition by user_id, system_id order by sample_input.event_datetime) as session_pageview
from sample_input
order by sample_input.event_datetime
)
select min(event_datetime) as login_time, user_id, system_id,sum(cast(use_time_sec as int)) as usage_duration
from page_time_be_session_details
group by user_id, system_id, session_pageview