Recursive loop in BigQuery for capped cumulative sum?-CodePudding

I'd like to be able to implement a "capped" cumulative sum in BigQuery using SQL.

Here's what I mean: I have a table whose rows have the amount by which a value is increased/decreased each day, but the value cannot go below 0 or above 100. I want to compute the cumulative sum of the changes to keep track of this value.

As an example, consider the following table:

day | change 
--------------
1   | 70
2   | 50
3   | 20
4   | -30
5   | 10
6   | -90
7   | 20

I want to make a column that has the capped cumulative sum so that it looks like this:

day | change | capped cumsum
----------------------------
1   | 70     | 70
2   | 50     | 100
3   | 20     | 100
4   | -30    | 70
5   | 10     | 80
6   | -90    | 0
7   | 20     | 20

Simply doing SUM (change) OVER (ORDER BY day) and capping the values at 100 and 0 won't work. I need some sort of recursive loop and I don't know how to implement this in BigQuery.

Eventually I'd also like to do this over partitions, so that if I have something like

day | class | change 
--------------
1   | A     | 70
1   | B     | 12
2   | A     | 50
2   | B     | 83
3   | A     | -30
3   | B     | 17
4   | A     | 10
5   | A     | -90
6   | A     | 20

I can do the capped cumulative sum partitioned over each class.

CodePudding user response：

I need some sort of recursive loop and I don't know how to implement this in BigQuery

Super naïve / cursor based approach

declare cumulative_change int64 default 0;

create temp table temp_table as (
  select * , 0 as capped_cumsum from your_table where false
);

for rec in (select * from your_table order by day)
do
  set cumulative_change = cumulative_change   rec.change;
  set cumulative_change = case when cumulative_change < 0 then 0 when cumulative_change > 100 then 100 else cumulative_change end;
  insert into temp_table (select rec.*, cumulative_change);
end for;

select * from temp_table order by day;

if applied to sample data in your question - output is

Slightly modified option with use of array instead of temp table

declare cumulative_change int64 default 0;
declare result array<struct<day int64, change int64, capped_cumsum int64>>;
 
for rec in (select * from your_table order by day)
do
  set cumulative_change = cumulative_change   rec.change;
  set cumulative_change = case when cumulative_change < 0 then 0 when cumulative_change > 100 then 100 else cumulative_change end;
  set result = array(select as struct * from unnest(result) union all select as struct rec.*, cumulative_change);
end for;

select * from unnest(result) order by day;

P.S. I like none of above options so far :o)
Meantime, that approach might work for relatively small tables, set of data

CodePudding user response：

Using RECURSIVE CTE can be another option:

DECLARE sample ARRAY<STRUCT<day INT64, change INT64>> DEFAULT [
 (1, 70), (2, 50), (3, 20), (4, -30), (5, 10), (6, -90), (7, 20)
];

WITH RECURSIVE ccsum AS (
  SELECT 0 AS n, vals[OFFSET(0)] AS change, 
         CASE 
           WHEN vals[OFFSET(0)] > 100 THEN 100
           WHEN vals[OFFSET(0)] < 0 THEN 0 
           ELSE vals[OFFSET(0)] 
         END AS cap_csum 
    FROM sample
   UNION ALL
  SELECT n   1 AS n, vals[OFFSET(n   1)] AS change,
         CASE 
           WHEN cap_csum   vals[OFFSET(n   1)] > 100 THEN 100
           WHEN cap_csum   vals[OFFSET(n   1)] < 0 THEN 0 
           ELSE cap_csum   vals[OFFSET(n   1)] 
         END AS cap_csum
    FROM ccsum, sample 
   WHERE n < ARRAY_LENGTH(vals) - 1
),
sample AS (
  SELECT ARRAY_AGG(change ORDER BY day) vals FROM UNNEST(sample)
)
SELECT * EXCEPT(n) FROM ccsum ORDER BY n;

output: