I have the following query which causes a 'resources exceeded' error
SELECT TIME, VALUE, SUM(TAG) OVER (ORDER BY TIME ASC) AS RUNNING
FROM `grid-frequency.frequency.tagged_excursions`
The error states that the OVER clause is to blame. But if I just sort the table by time without computing a running total like
SELECT TIME, VALUE
FROM `grid-frequency.frequency.tagged_excursions` ORDER BY TIME ASC LIMIT 1000
It works just fine. Why is the former more expensive than the latter, and how could I compute this more efficiently?
CodePudding user response:
Actually, your second query doesn't require the whole table to be sorted. It's more close to Top-N sort problem due to LIMIT 1000
which is more efficient than the first query, I guess.
What I tried before to address the similar problem is to divide the problem space into smaller ones, conquer them one by one and combine them to generate same result as I wanted.
Below is a simplified query I'd tried before. In the query, I've divided the table into smaller ones by using month
info from a date and calculate cumulative monthly sum
first. By adding this cumulative monthly sum
to net cumulative daily sum
, same results could be calculated.
I think his approach reduces the complexity of sorting and help resolve 'resources exceeded' error.
Hope this is helpful.
DECLARE purchase_log ARRAY<STRUCT<
dt STRING,
order_id INT64,
user_id STRING,
purchase_amount INT64
>>
DEFAULT [
('2014-01-01', 1, 'rhwpvvitou', 13900),
('2014-01-02', 4, 'wkmqqwbyai', 14893),
('2014-01-03', 5, 'ciecbedwbq', 13054),
('2014-02-03', 7, 'dfgqftdocu', 15591),
('2014-02-04', 8, 'sbgqlzkvyn', 3025),
('2014-02-05', 11, 'jqcmmguhik', 4235),
('2014-03-05', 13, 'pgeojzoshx', 16008),
('2014-03-06', 16, 'gbchhkcotf', 3966),
('2014-03-07', 17, 'zfmbpvpzvu', 28159),
('2014-04-07', 19, 'uyqboqfgex', 10805),
('2014-04-08', 21, 'zosbvlylpv', 13999),
('2014-05-08', 22, 'bwfbchzgnl', 2299),
('2014-05-09', 23, 'zzgauelgrt', 16475),
('2014-05-09', 24, 'qrzfcwecge', 6469),
('2014-05-10', 26, 'cyxfgumkst', 11339)
];
WITH sales AS (
SELECT p.*,
-- divide the problem space into smaller ones
EXTRACT(MONTH FROM DATE(dt)) AS month,
SUM(purchase_amount) OVER (PARTITION BY EXTRACT(MONTH FROM DATE(dt)) ORDER BY dt) AS net_cumulative_sales,
FROM UNNEST(purchase_log) p
),
monthly_cumulative_sales AS (
SELECT month,
IFNULL(SUM(SUM(purchase_amount)) OVER w, 0) AS cumulative_monthly_sales
FROM sales GROUP BY 1
WINDOW w AS (ORDER BY month RANGE BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
)
SELECT dt, purchase_amount,
net_cumulative_sales cumulative_monthly_sales AS cumulative_sales,
-- below column is for validation, should be same as `cumulative_sales`
SUM(purchase_amount) OVER (ORDER BY dt) AS validation
FROM sales JOIN monthly_cumulative_sales USING (month)
;