I am working in Google Bigquery, and I am trying to calculate a column in standard SQL that would assign a Group ID to rows, based on some criteria. The criteria would be that a group ID, starting at 1, should be created per unique Variable
value, and the group should be split into a new group if the time difference between the current and consecutive Time
value is > 2 mins.
See image: Sample Data
I have added a column called LEAD_Time
, allowing me to also calculate a Time_Diff
column (mins). My desired result is the last column (GroupID
). Note how variable C has been split into two groups between rows 23 and 24 due to the time difference being > 2 mins.
It is my understanding that I would need to partition by Variable
, and also by some alteration of the TimeStamp_Diff
column. I have however not been able to reproduce the last column as per the sample image.
Any help would be greatly appreciated!
CodePudding user response:
Try the following
with sample_data as (
SELECT 1 as row, 'A' as variable, TIME '07:31:30' as time UNION ALL
SELECT 2, 'A', TIME '07:33:30' UNION ALL
SELECT 3, 'A', TIME '07:35:30' UNION ALL
SELECT 4, 'A', TIME '07:37:30' UNION ALL
SELECT 5, 'B', TIME '08:01:30' UNION ALL
SELECT 6, 'B', TIME '08:03:30' UNION ALL
SELECT 7, 'B', TIME '08:05:30' UNION ALL
SELECT 8, 'B', TIME '08:07:30' UNION ALL
SELECT 9, 'C', TIME '09:03:30' UNION ALL
SELECT 10, 'C', TIME '09:05:30' UNION ALL
SELECT 11, 'C', TIME '09:07:30' UNION ALL
SELECT 12, 'C', TIME '09:09:30' UNION ALL
SELECT 13, 'C', TIME '09:11:30' UNION ALL
SELECT 14, 'C', TIME '09:21:30' UNION ALL
SELECT 15, 'C', TIME '09:31:30' UNION ALL
SELECT 16, 'C', TIME '09:33:30' UNION ALL
SELECT 17, 'D', TIME '09:55:30'
),
time_diff_data as (
SELECT *
, LEAD(time) OVER (PARTITION BY variable ORDER BY time) as lead_time
, TIME_DIFF(LEAD(time) OVER (PARTITION BY variable ORDER BY time), time, minute) as time_diff
, TIME_DIFF(time, LAG(time) OVER (PARTITION BY variable ORDER BY time), minute) as prev_time_diff
FROM sample_data
)
select *
,(countif(prev_time_diff > 2) OVER (PARTITION BY variable ORDER BY time)) 1 as group_id
from time_diff_data
I think the problem is you really want to be looking at the lag in time_diff rather than the lead. From there you can perform a count based on if the prev_time_diff has gone beyond your threshold.