Home > Software design >  BigQuery: resources exceeded when computing running sum
BigQuery: resources exceeded when computing running sum

Time:06-05

I have the following query which causes a 'resources exceeded' error

SELECT TIME, VALUE, SUM(TAG) OVER (ORDER BY TIME ASC) AS RUNNING 
FROM `grid-frequency.frequency.tagged_excursions` 

The error states that the OVER clause is to blame. But if I just sort the table by time without computing a running total like

SELECT TIME, VALUE 
FROM `grid-frequency.frequency.tagged_excursions`  ORDER BY TIME ASC LIMIT 1000

It works just fine. Why is the former more expensive than the latter, and how could I compute this more efficiently?

CodePudding user response:

Actually, your second query doesn't require the whole table to be sorted. It's more close to Top-N sort problem due to LIMIT 1000 which is more efficient than the first query, I guess.

What I tried before to address the similar problem is to divide the problem space into smaller ones, conquer them one by one and combine them to generate same result as I wanted.

Below is a simplified query I'd tried before. In the query, I've divided the table into smaller ones by using month info from a date and calculate cumulative monthly sum first. By adding this cumulative monthly sum to net cumulative daily sum, same results could be calculated.

I think his approach reduces the complexity of sorting and help resolve 'resources exceeded' error.

Hope this is helpful.

DECLARE purchase_log ARRAY<STRUCT<
    dt STRING,
    order_id INT64,
    user_id STRING,
    purchase_amount INT64
>>
DEFAULT [
  ('2014-01-01',  1, 'rhwpvvitou', 13900),
  ('2014-01-02',  4, 'wkmqqwbyai', 14893),
  ('2014-01-03',  5, 'ciecbedwbq', 13054),
  ('2014-02-03',  7, 'dfgqftdocu', 15591),
  ('2014-02-04',  8, 'sbgqlzkvyn',  3025),
  ('2014-02-05', 11, 'jqcmmguhik',  4235),
  ('2014-03-05', 13, 'pgeojzoshx', 16008),
  ('2014-03-06', 16, 'gbchhkcotf',  3966),
  ('2014-03-07', 17, 'zfmbpvpzvu', 28159),
  ('2014-04-07', 19, 'uyqboqfgex', 10805),
  ('2014-04-08', 21, 'zosbvlylpv', 13999),
  ('2014-05-08', 22, 'bwfbchzgnl',  2299),
  ('2014-05-09', 23, 'zzgauelgrt', 16475),
  ('2014-05-09', 24, 'qrzfcwecge',  6469),
  ('2014-05-10', 26, 'cyxfgumkst', 11339)
];


WITH sales AS (
  SELECT p.*,
         -- divide the problem space into smaller ones  
         EXTRACT(MONTH FROM DATE(dt)) AS month,
         SUM(purchase_amount) OVER (PARTITION BY EXTRACT(MONTH FROM DATE(dt)) ORDER BY dt) AS net_cumulative_sales,
    FROM UNNEST(purchase_log) p
),
monthly_cumulative_sales AS (
  SELECT month,
         IFNULL(SUM(SUM(purchase_amount)) OVER w, 0) AS cumulative_monthly_sales 
    FROM sales GROUP BY 1
  WINDOW w AS (ORDER BY month RANGE BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
)
SELECT dt, purchase_amount,
       net_cumulative_sales   cumulative_monthly_sales AS cumulative_sales,
       -- below column is for validation, should be same as `cumulative_sales`
       SUM(purchase_amount) OVER (ORDER BY dt) AS validation
  FROM sales JOIN monthly_cumulative_sales USING (month)
;

enter image description here

  • Related