Home > Software engineering >  Find overlapping time periods in BigQuery
Find overlapping time periods in BigQuery

Time:03-19

Assume I have data structured as so in BigQuery:

WITH session_log AS (
  SELECT 'ABC' as site_id, 1234 user_id, 12 session_id, '2020-02-10 00:29:59.376000 UTC' start_time, '2020-02-10 01:13:02.817000 UTC' end_time UNION ALL
  SELECT 'ABC' as site_id, 1234 user_id, 13 session_id, '2020-02-10 02:41:56.330000 UTC' start_time, '2020-02-10 02:41:56.389999 UTC' end_time UNION ALL
  SELECT 'ABC' as site_id, 1234 user_id, 14 session_id, '2020-02-10 04:24:46.649999 UTC' start_time, '2020-02-10 05:14:08.243000 UTC' end_time UNION ALL
  SELECT 'ABC' as site_id, 1234 user_id, 15 session_id, '2020-02-10 04:59:21.356999 UTC' start_time, '2020-02-10 15:57:11.501000 UTC' end_time  
  SELECT 'ABC' as site_id, 6789 user_id, 25 session_id, '2020-02-10 02:15:38.560000 UTC' start_time, '2020-02-10 02:56:38.784500 UTC' end_time UNION ALL
  SELECT 'ABC' as site_id, 6789 user_id, 26 session_id, '2020-02-10 04:59:21.356999 UTC' start_time, '2020-02-10 15:57:11.501000 UTC' end_time
)
SELECT site_id, user_id, session_id, start_time, end_time FROM session_log

I would like to query the session_log so that a session is flagged as "concurrent" if there is another record with the same site_id and user_id whose timeframe overlaps another session. (The meaning here is that a user is visiting the same site on more than one device simultaneously.)

Ideally, I need a query that would produce the following, because at least a portion of sessions 14 and 15 overlap for that specific user. Session 26 overlaps sessions 14 & 15 but is not concurrent because it is a different user_id.

site_id user_id session_id start_time end_time concurrent_session
ABC 1234 12 2020-02-10 00:29:59.376000 UTC 2020-02-10 01:13:02.817000 UTC false
ABC 1234 13 2020-02-10 02:41:56.330000 UTC 2020-02-10 02:41:56.389999 UTC false
ABC 1234 14 2020-02-10 04:24:46.649999 UTC 2020-02-10 05:14:08.243000 UTC true
ABC 1234 15 2020-02-10 04:59:21.356999 UTC 2020-02-10 05:57:11.501000 UTC true
ABC 6789 25 2020-02-10 02:15:38.560000 UTC 2020-02-10 02:56:38.784500 UTC false
ABC 6789 26 2020-02-10 04:44:21.356999 UTC 2020-02-10 06:57:11.501000 UTC false

I tried to create a user defined function that would search the table for sessions with that same site_id and user_id but not the session_id with overlapping times but it's failing miserably. I'm almost embarrassed to put this here, but...what the heck.

CREATE TEMPORARY FUNCTION getConcurrentSessions(_site_id STRING, _user_id INT64, _session_id INT64, _start_time TIMESTAMP, _end_time TIMESTAMP)
AS 
(
    (
        SELECT count(session_id)
        FROM `session_log`
        WHERE site_id = _site_id
        AND user_id = _user_id
        AND session_id != _session_id
        AND (
            (_start_time BETWEEN start_time AND end_time)
            OR
            (_end_time BETWEEN start_time AND end_time)
        )
    )
);
SELECT site_id, user_id, session_id, start_time, end_time,
IF (
  getConcurrentSessions(site_id, user_id, session_id, start_time, end_time) > 0,
  TRUE,
  FALSE
) AS concurrent_sessions 
FROM session_log

Any and all advice appreciated. Thank you.

CodePudding user response:

This could be calculated with a "correlated subquery" like this:

WITH session_log AS (
  SELECT 'ABC' as site_id, 1234 user_id, 12 session_id, '2020-02-10 00:29:59.376000 UTC' start_time, '2020-02-10 01:13:02.817000 UTC' end_time UNION ALL
  SELECT 'ABC' as site_id, 1234 user_id, 13 session_id, '2020-02-10 02:41:56.330000 UTC' start_time, '2020-02-10 02:41:56.389999 UTC' end_time UNION ALL
  SELECT 'ABC' as site_id, 1234 user_id, 14 session_id, '2020-02-10 04:24:46.649999 UTC' start_time, '2020-02-10 05:14:08.243000 UTC' end_time UNION ALL
  SELECT 'ABC' as site_id, 1234 user_id, 15 session_id, '2020-02-10 04:59:21.356999 UTC' start_time, '2020-02-10 15:57:11.501000 UTC' end_time UNION ALL
  SELECT 'ABC' as site_id, 6789 user_id, 25 session_id, '2020-02-10 02:15:38.560000 UTC' start_time, '2020-02-10 02:56:38.784500 UTC' end_time UNION ALL
  SELECT 'ABC' as site_id, 6789 user_id, 26 session_id, '2020-02-10 04:59:21.356999 UTC' start_time, '2020-02-10 15:57:11.501000 UTC' end_time
)
select
  (select count(*) 
   from session_log as xtra_sess
   where logins.site_id = xtra_sess.site_id
   and logins.user_id = xtra_sess.user_id
   and logins.session_id <> xtra_sess.session_id
   and (xtra_sess.start_time between logins.start_time and logins.end_time
     or xtra_sess.end_time between logins.start_time and logins.end_time
     )
  ) xtras_sess
  , logins.*
from session_log as logins

or you could use a self join and group by

select
  count(distinct xtra_sess.site_id) as xtra_sess
, logins.site_id
, logins.user_id
, logins.session_id
, logins.start_time
, logins.end_time
from session_log as logins
left join session_log as xtra_sess
   on  logins.site_id = xtra_sess.site_id
   and logins.user_id = xtra_sess.user_id
   and logins.session_id <> xtra_sess.session_id
   and (xtra_sess.start_time between logins.start_time and logins.end_time
     or xtra_sess.end_time between logins.start_time and logins.end_time
     )
group by
  logins.site_id
, logins.user_id
, logins.session_id
, logins.start_time
, logins.end_time

Result

 ------------ --------- --------- ------------ -------------------------------- -------------------------------- 
| xtras_sess | site_id | user_id | session_id |           start_time           |            end_time            |
 ------------ --------- --------- ------------ -------------------------------- -------------------------------- 
|          0 | ABC     |    1234 |         12 | 2020-02-10 00:29:59.376000 UTC | 2020-02-10 01:13:02.817000 UTC |
|          0 | ABC     |    1234 |         13 | 2020-02-10 02:41:56.330000 UTC | 2020-02-10 02:41:56.389999 UTC |
|          1 | ABC     |    1234 |         14 | 2020-02-10 04:24:46.649999 UTC | 2020-02-10 05:14:08.243000 UTC |
|          1 | ABC     |    1234 |         15 | 2020-02-10 04:59:21.356999 UTC | 2020-02-10 15:57:11.501000 UTC |
|          0 | ABC     |    6789 |         25 | 2020-02-10 02:15:38.560000 UTC | 2020-02-10 02:56:38.784500 UTC |
|          0 | ABC     |    6789 |         26 | 2020-02-10 04:59:21.356999 UTC | 2020-02-10 15:57:11.501000 UTC |
 ------------ --------- --------- ------------ -------------------------------- -------------------------------- 

see: db<>fiddle enter image description here

  • Related