Home > Software design >  BigQuery SQL: Group rows with shared ID that occur within 7 days of each other, and return values fr
BigQuery SQL: Group rows with shared ID that occur within 7 days of each other, and return values fr

Time:11-01

I have a table of datestamped events that I need to bundle into 7-day groups, starting with the earliest occurrence of each event_id.

The final output should return each bundle's start and end date and 'value' column of the most recent event from each bundle.

There is no predetermined start date, and the '7-day' windows are arbitrary, not 'week of the year'.

I've tried a ton of examples from other posts but none quite fit my needs or use things I'm not sure how to refactor for BigQuery

Sample Data;

Event_Id Event_Date Value
1 2022-01-01 010203
1 2022-01-02 040506
1 2022-01-03 070809
1 2022-01-20 101112
1 2022-01-23 131415
2 2022-01-02 161718
2 2022-01-08 192021
3 2022-02-12 212223

Expected output;

Event_Id Start_Date End_Date Value
1 2022-01-01 2022-01-03 070809
1 2022-01-20 2022-01-23 131415
2 2022-01-02 2022-01-08 192021
3 2022-02-12 2022-02-12 212223

CodePudding user response:

You might consider below.

CREATE TEMP FUNCTION cumsumbin(a ARRAY<INT64>) RETURNS INT64
LANGUAGE js AS """
  bin = 0;
  a.reduce((c, v) => {
    if (c   Number(v) > 6) { bin  = 1; return 0; }
    else return c  = Number(v); 
  }, 0);

  return bin;
""";

WITH sample_data AS (
    select 1 event_id, DATE '2022-01-01' event_date, '010203' value union all
    select 1 event_id, '2022-01-02' event_date, '040506' value union all
    select 1 event_id, '2022-01-03' event_date, '070809' value union all
    select 1 event_id, '2022-01-20' event_date, '101112' value union all
    select 1 event_id, '2022-01-23' event_date, '131415' value union all
    select 2 event_id, '2022-01-02' event_date, '161718' value union all
    select 2 event_id, '2022-01-08' event_date, '192021' value union all
    select 3 event_id, '2022-02-12' event_date, '212223' value
),
binning AS (
  SELECT *, cumsumbin(ARRAY_AGG(diff) OVER w1) bin
    FROM (
      SELECT *, DATE_DIFF(event_date, LAG(event_date) OVER w0, DAY) AS diff
        FROM sample_data
      WINDOW w0 AS (PARTITION BY event_id ORDER BY event_date)
    ) WINDOW w1 AS (PARTITION BY event_id ORDER BY event_date)
)
SELECT event_id, 
       MIN(event_date) start_date,
       ARRAY_AGG(
         STRUCT(event_date AS end_date, value) ORDER BY event_date DESC LIMIT 1
       )[OFFSET(0)].*
  FROM binning GROUP BY event_id, bin;

enter image description here

  • Related