Assume I have data structured as so in BigQuery:
WITH session_log AS (
SELECT 'ABC' as site_id, 1234 user_id, 12 session_id, '2020-02-10 00:29:59.376000 UTC' start_time, '2020-02-10 01:13:02.817000 UTC' end_time UNION ALL
SELECT 'ABC' as site_id, 1234 user_id, 13 session_id, '2020-02-10 02:41:56.330000 UTC' start_time, '2020-02-10 02:41:56.389999 UTC' end_time UNION ALL
SELECT 'ABC' as site_id, 1234 user_id, 14 session_id, '2020-02-10 04:24:46.649999 UTC' start_time, '2020-02-10 05:14:08.243000 UTC' end_time UNION ALL
SELECT 'ABC' as site_id, 1234 user_id, 15 session_id, '2020-02-10 04:59:21.356999 UTC' start_time, '2020-02-10 15:57:11.501000 UTC' end_time
SELECT 'ABC' as site_id, 6789 user_id, 25 session_id, '2020-02-10 02:15:38.560000 UTC' start_time, '2020-02-10 02:56:38.784500 UTC' end_time UNION ALL
SELECT 'ABC' as site_id, 6789 user_id, 26 session_id, '2020-02-10 04:59:21.356999 UTC' start_time, '2020-02-10 15:57:11.501000 UTC' end_time
)
SELECT site_id, user_id, session_id, start_time, end_time FROM session_log
I would like to query the session_log
so that a session is flagged as "concurrent" if there is another record with the same site_id
and user_id
whose timeframe overlaps another session. (The meaning here is that a user is visiting the same site on more than one device simultaneously.)
Ideally, I need a query that would produce the following, because at least a portion of sessions 14 and 15 overlap for that specific user. Session 26 overlaps sessions 14 & 15 but is not concurrent because it is a different user_id
.
site_id | user_id | session_id | start_time | end_time | concurrent_session |
---|---|---|---|---|---|
ABC | 1234 | 12 | 2020-02-10 00:29:59.376000 UTC | 2020-02-10 01:13:02.817000 UTC | false |
ABC | 1234 | 13 | 2020-02-10 02:41:56.330000 UTC | 2020-02-10 02:41:56.389999 UTC | false |
ABC | 1234 | 14 | 2020-02-10 04:24:46.649999 UTC | 2020-02-10 05:14:08.243000 UTC | true |
ABC | 1234 | 15 | 2020-02-10 04:59:21.356999 UTC | 2020-02-10 05:57:11.501000 UTC | true |
ABC | 6789 | 25 | 2020-02-10 02:15:38.560000 UTC | 2020-02-10 02:56:38.784500 UTC | false |
ABC | 6789 | 26 | 2020-02-10 04:44:21.356999 UTC | 2020-02-10 06:57:11.501000 UTC | false |
I tried to create a user defined function that would search the table for sessions with that same site_id
and user_id
but not the session_id
with overlapping times but it's failing miserably. I'm almost embarrassed to put this here, but...what the heck.
CREATE TEMPORARY FUNCTION getConcurrentSessions(_site_id STRING, _user_id INT64, _session_id INT64, _start_time TIMESTAMP, _end_time TIMESTAMP)
AS
(
(
SELECT count(session_id)
FROM `session_log`
WHERE site_id = _site_id
AND user_id = _user_id
AND session_id != _session_id
AND (
(_start_time BETWEEN start_time AND end_time)
OR
(_end_time BETWEEN start_time AND end_time)
)
)
);
SELECT site_id, user_id, session_id, start_time, end_time,
IF (
getConcurrentSessions(site_id, user_id, session_id, start_time, end_time) > 0,
TRUE,
FALSE
) AS concurrent_sessions
FROM session_log
Any and all advice appreciated. Thank you.
CodePudding user response:
This could be calculated with a "correlated subquery" like this:
WITH session_log AS (
SELECT 'ABC' as site_id, 1234 user_id, 12 session_id, '2020-02-10 00:29:59.376000 UTC' start_time, '2020-02-10 01:13:02.817000 UTC' end_time UNION ALL
SELECT 'ABC' as site_id, 1234 user_id, 13 session_id, '2020-02-10 02:41:56.330000 UTC' start_time, '2020-02-10 02:41:56.389999 UTC' end_time UNION ALL
SELECT 'ABC' as site_id, 1234 user_id, 14 session_id, '2020-02-10 04:24:46.649999 UTC' start_time, '2020-02-10 05:14:08.243000 UTC' end_time UNION ALL
SELECT 'ABC' as site_id, 1234 user_id, 15 session_id, '2020-02-10 04:59:21.356999 UTC' start_time, '2020-02-10 15:57:11.501000 UTC' end_time UNION ALL
SELECT 'ABC' as site_id, 6789 user_id, 25 session_id, '2020-02-10 02:15:38.560000 UTC' start_time, '2020-02-10 02:56:38.784500 UTC' end_time UNION ALL
SELECT 'ABC' as site_id, 6789 user_id, 26 session_id, '2020-02-10 04:59:21.356999 UTC' start_time, '2020-02-10 15:57:11.501000 UTC' end_time
)
select
(select count(*)
from session_log as xtra_sess
where logins.site_id = xtra_sess.site_id
and logins.user_id = xtra_sess.user_id
and logins.session_id <> xtra_sess.session_id
and (xtra_sess.start_time between logins.start_time and logins.end_time
or xtra_sess.end_time between logins.start_time and logins.end_time
)
) xtras_sess
, logins.*
from session_log as logins
or you could use a self join and group by
select
count(distinct xtra_sess.site_id) as xtra_sess
, logins.site_id
, logins.user_id
, logins.session_id
, logins.start_time
, logins.end_time
from session_log as logins
left join session_log as xtra_sess
on logins.site_id = xtra_sess.site_id
and logins.user_id = xtra_sess.user_id
and logins.session_id <> xtra_sess.session_id
and (xtra_sess.start_time between logins.start_time and logins.end_time
or xtra_sess.end_time between logins.start_time and logins.end_time
)
group by
logins.site_id
, logins.user_id
, logins.session_id
, logins.start_time
, logins.end_time
Result
------------ --------- --------- ------------ -------------------------------- --------------------------------
| xtras_sess | site_id | user_id | session_id | start_time | end_time |
------------ --------- --------- ------------ -------------------------------- --------------------------------
| 0 | ABC | 1234 | 12 | 2020-02-10 00:29:59.376000 UTC | 2020-02-10 01:13:02.817000 UTC |
| 0 | ABC | 1234 | 13 | 2020-02-10 02:41:56.330000 UTC | 2020-02-10 02:41:56.389999 UTC |
| 1 | ABC | 1234 | 14 | 2020-02-10 04:24:46.649999 UTC | 2020-02-10 05:14:08.243000 UTC |
| 1 | ABC | 1234 | 15 | 2020-02-10 04:59:21.356999 UTC | 2020-02-10 15:57:11.501000 UTC |
| 0 | ABC | 6789 | 25 | 2020-02-10 02:15:38.560000 UTC | 2020-02-10 02:56:38.784500 UTC |
| 0 | ABC | 6789 | 26 | 2020-02-10 04:59:21.356999 UTC | 2020-02-10 15:57:11.501000 UTC |
------------ --------- --------- ------------ -------------------------------- --------------------------------