Home > other >  Bigquery extracting sequences from timeseries data
Bigquery extracting sequences from timeseries data

Time:01-27

I have a timeseries in BQ, with additional data, and based on some of the data I want to extract sequences from the timeseries for further processing.

The following demonstrates the source table:

with dataset as (
 select
    timestamp('2023-01-25 00:00:00') as last_seen, 1 as vehicle_id, 1 as mode, 0 as activity 
    union all select timestamp('2023-01-25 00:00:02'), 1, 1, 0
    union all select timestamp('2023-01-25 00:00:04'), 1, 1, 0
    union all select timestamp('2023-01-25 00:00:00'), 2, 1, 0
    union all select timestamp('2023-01-25 00:00:02'), 2, 1, 0
    union all select timestamp('2023-01-25 00:00:04'), 2, 1, 0
    union all select timestamp('2023-01-25 00:00:06'), 1, 2, 1
    union all select timestamp('2023-01-25 00:00:08'), 1, 2, 1
    union all select timestamp('2023-01-25 00:00:10'), 1, 2, 1
    union all select timestamp('2023-01-25 00:00:12'), 1, 1, 0
    union all select timestamp('2023-01-25 00:00:14'), 1, 1, 0
    union all select timestamp('2023-01-25 00:00:16'), 1, 1, 0
    union all select timestamp('2023-01-25 00:00:12'), 2, 1, 1
    union all select timestamp('2023-01-25 00:00:14'), 2, 1, 1
    union all select timestamp('2023-01-25 00:00:17'), 2, 1, 1
)

What I want is to have a result that for every time the mode and/or activity changes for each vehicle_id which includes the start and end timestamps. Eg like this:

vehicle_id mode activity start end
1 1 0 2023-01-25 00:00:00 2023-01-25 00:00:04
1 2 1 2023-01-25 00:00:06 2023-01-25 00:00:10
1 1 0 2023-01-25 00:00:12 2023-01-25 00:00:16
2 1 0 2023-01-25 00:00:00 2023-01-25 00:00:04
2 1 1 2023-01-25 00:00:12 2023-01-25 00:00:17

I have tried:

select * from dataset where true
qualify ifnull(mode != lag(mode) over win or activity != lag(activity) over win or mode != lead(mode) over win or activity != lead(activity) over win, true)
window win as (partition by vehicle_id order by last_seen)

But that gives start and end on separate rows, so it feels like a dead end as it might cause issues if a sequence does not have an end.

Thanks

CodePudding user response:

You might consider below.

SELECT vehicle_id, 
       ANY_VALUE(mode) mode, ANY_VALUE(activity) activity,
       MIN(last_seen) AS start, MAX(last_seen) AS `end`
  FROM (
    SELECT *, COUNTIF(flag) OVER w1 AS part FROM (
      SELECT *, mode <> LAG(mode) OVER w0 OR activity <> LAG(activity) OVER w0 AS flag
        FROM dataset
      WINDOW w0 AS (PARTITION BY vehicle_id ORDER BY last_seen)
    ) WINDOW w1 AS (PARTITION BY vehicle_id ORDER BY last_seen)
  ) GROUP BY vehicle_id, part;

Query results

enter image description here

  • Related