I have a table of datestamped events that I need to bundle into 7-day groups, starting with the earliest occurrence of each event_id.
The final output should return each bundle's start and end date and 'value' column of the most recent event from each bundle.
There is no predetermined start date, and the '7-day' windows are arbitrary, not 'week of the year'.
I've tried a ton of examples from other posts but none quite fit my needs or use things I'm not sure how to refactor for BigQuery
Sample Data;
Event_Id | Event_Date | Value |
---|---|---|
1 | 2022-01-01 | 010203 |
1 | 2022-01-02 | 040506 |
1 | 2022-01-03 | 070809 |
1 | 2022-01-20 | 101112 |
1 | 2022-01-23 | 131415 |
2 | 2022-01-02 | 161718 |
2 | 2022-01-08 | 192021 |
3 | 2022-02-12 | 212223 |
Expected output;
Event_Id | Start_Date | End_Date | Value |
---|---|---|---|
1 | 2022-01-01 | 2022-01-03 | 070809 |
1 | 2022-01-20 | 2022-01-23 | 131415 |
2 | 2022-01-02 | 2022-01-08 | 192021 |
3 | 2022-02-12 | 2022-02-12 | 212223 |
CodePudding user response:
You might consider below.
CREATE TEMP FUNCTION cumsumbin(a ARRAY<INT64>) RETURNS INT64
LANGUAGE js AS """
bin = 0;
a.reduce((c, v) => {
if (c Number(v) > 6) { bin = 1; return 0; }
else return c = Number(v);
}, 0);
return bin;
""";
WITH sample_data AS (
select 1 event_id, DATE '2022-01-01' event_date, '010203' value union all
select 1 event_id, '2022-01-02' event_date, '040506' value union all
select 1 event_id, '2022-01-03' event_date, '070809' value union all
select 1 event_id, '2022-01-20' event_date, '101112' value union all
select 1 event_id, '2022-01-23' event_date, '131415' value union all
select 2 event_id, '2022-01-02' event_date, '161718' value union all
select 2 event_id, '2022-01-08' event_date, '192021' value union all
select 3 event_id, '2022-02-12' event_date, '212223' value
),
binning AS (
SELECT *, cumsumbin(ARRAY_AGG(diff) OVER w1) bin
FROM (
SELECT *, DATE_DIFF(event_date, LAG(event_date) OVER w0, DAY) AS diff
FROM sample_data
WINDOW w0 AS (PARTITION BY event_id ORDER BY event_date)
) WINDOW w1 AS (PARTITION BY event_id ORDER BY event_date)
)
SELECT event_id,
MIN(event_date) start_date,
ARRAY_AGG(
STRUCT(event_date AS end_date, value) ORDER BY event_date DESC LIMIT 1
)[OFFSET(0)].*
FROM binning GROUP BY event_id, bin;