Home > Software design >  Is there a way to create a Group ID in standard SQL that changes based on some criteria in other col
Is there a way to create a Group ID in standard SQL that changes based on some criteria in other col

Time:11-05

I am working in Google Bigquery, and I am trying to calculate a column in standard SQL that would assign a Group ID to rows, based on some criteria. The criteria would be that a group ID, starting at 1, should be created per unique Variable value, and the group should be split into a new group if the time difference between the current and consecutive Time value is > 2 mins.

See image: Sample Data

I have added a column called LEAD_Time, allowing me to also calculate a Time_Diff column (mins). My desired result is the last column (GroupID). Note how variable C has been split into two groups between rows 23 and 24 due to the time difference being > 2 mins.

It is my understanding that I would need to partition by Variable, and also by some alteration of the TimeStamp_Diff column. I have however not been able to reproduce the last column as per the sample image.

Any help would be greatly appreciated!

CodePudding user response:

Try the following

with sample_data as (
    SELECT 1 as row, 'A' as variable, TIME '07:31:30' as time UNION ALL
    SELECT 2, 'A', TIME '07:33:30' UNION ALL 
    SELECT 3, 'A', TIME '07:35:30' UNION ALL 
    SELECT 4, 'A', TIME '07:37:30' UNION ALL 
    SELECT 5, 'B', TIME '08:01:30' UNION ALL
    SELECT 6, 'B', TIME '08:03:30' UNION ALL
    SELECT 7, 'B', TIME '08:05:30' UNION ALL
    SELECT 8, 'B', TIME '08:07:30' UNION ALL
    SELECT 9, 'C', TIME '09:03:30' UNION ALL
    SELECT 10, 'C', TIME '09:05:30' UNION ALL
    SELECT 11, 'C', TIME '09:07:30' UNION ALL
    SELECT 12, 'C', TIME '09:09:30' UNION ALL
    SELECT 13, 'C', TIME '09:11:30' UNION ALL
    SELECT 14, 'C', TIME '09:21:30' UNION ALL
    SELECT 15, 'C', TIME '09:31:30' UNION ALL
    SELECT 16, 'C', TIME '09:33:30' UNION ALL 
    SELECT 17, 'D', TIME '09:55:30'
),
time_diff_data as (
    SELECT *
        , LEAD(time) OVER (PARTITION BY variable ORDER BY time) as lead_time
        , TIME_DIFF(LEAD(time) OVER (PARTITION BY variable ORDER BY time), time, minute) as time_diff
        , TIME_DIFF(time, LAG(time) OVER (PARTITION BY variable ORDER BY time), minute) as prev_time_diff
    FROM sample_data
)

select *
    ,(countif(prev_time_diff > 2) OVER (PARTITION BY variable ORDER BY time)) 1 as group_id
from time_diff_data

I think the problem is you really want to be looking at the lag in time_diff rather than the lead. From there you can perform a count based on if the prev_time_diff has gone beyond your threshold.

  • Related