How do I calculate a rolling average over a specific range timeframe in BigQuery?-CodePudding

I have a BigQuery table like the one below, where data wasn't necessarily recorded at a consistent rate:

| timestamp               | value |
|-------------------------|-------|
| 2022-10-01 00:03:00 UTC | 2.43  |
| 2022-10-01 00:17:00 UTC | 4.56  |
| 2022-10-01 00:36:00 UTC | 3.64  |
| 2022-10-01 00:58:00 UTC | 2.15  |
| 2022-10-01 01:04:00 UTC | 2.90  |
| 2022-10-01 01:13:00 UTC | 5.88  |
  ...                       ...

I want to calculate a rolling average (as a new column) on value over a certain timeframe, e.g. the previous 12 hours. I know it's relatively simple to do over a fixed number of rows, and I've tried using LAG and TIMESTAMP_SUB functions to select the right values to average over, but I'm quite new to SQL so I'm not even sure if this is the right approach.

Does anyone know how to go about this? Thanks!

CodePudding user response：

Please use a window function.

You need to calculate a date and hour column as an integer. For this we take the unix date and multiply it by 24 hours. Then we add the hours of the day. We ignore daylight saving time.

WITH
  tbl AS (SELECT 10* rand() as val, timestamp_add(snapshot_date,interval cast(rand()*5000 as int64) minute) as timestamps FROM UNNEST(GENERATE_Timestamp_ARRAY("2021-01-01 00:00:00","2023-01-01 0:00:00",INTERVAL 1 hour)) AS snapshot_date)
SELECT
  *,
 unix_date(date(timestamps))*24 extract(hour from timestamps) as dummy_time,
 avg(val) over WIN1_range as rolling_avg,
 sum(1) over WIN1_range as values_in_avg

FROM
  tbl
  window WIN1_range as (order by unix_date(date(timestamps))*24 extract(hour from timestamps) range between 12 PRECEDING and current row)

CodePudding user response：

BigQuery has simplified specifications for the range frame of window functions:

Tip: If you want to use a range with a date, use ORDER BY with the UNIX_DATE() function. If you want to use a range with a timestamp, use the UNIX_SECONDS(), UNIX_MILLIS(), or UNIX_MICROS() function.

Here, we can simply use unix_seconds() when ordering the records in the partition, and accordingly specify an interval of 12 hours as seconds:

select ts, val,
    avg(value) over(
        order by unix_seconds(ts)
        range between 12 * 60 * 60 preceding and current row
    ) as avg_last_12_hours
from mytable

Now say we wanted the average over the last 2 days, we would use unix_date() instead:

select ts, val,
    avg(value) over(
        order by unix_date(ts)
        range between 2 preceding and current row
    ) as avg_last_12_hours
from mytable