Home > Mobile >  Ho do I a SQL query in BigQuery to calculate maximum events within a certain time interval?
Ho do I a SQL query in BigQuery to calculate maximum events within a certain time interval?

Time:06-08

I have a table like below:

timestamp               | customer_id | product_name
2022-04-01 23:49:07 UTC |   a23       |  dog_shampoo
2022-04-01 23:49:07 UTC |   a33       |  dog_shampoo
2022-04-01 23:49:07 UTC |   a45       |  dog_toy
2022-04-01 23:49:09 UTC |   a67       |  dog_shampoo
2022-04-01 23:49:09 UTC |   a66       |  dog_toy
2022-04-01 23:49:09 UTC |   a63       |  dog_toy
2022-04-01 23:50:10 UTC |   a50       |  dog_shampoo
2022-04-01 23:50:11 UTC |   a51       |  dog_shampoo

If I want to get number of concurrent purchases for each product, it can be simply done by a group by like below:

SELECT
   product_name,
   timestamp,
   COUNT(DISTINCT user_id) AS concurrent_purchases
FROM table
GROUP BY 1,2

Getting this output:

result1:
timestamp               | product_name  | concurrent_purchase
2022-04-01 23:49:07 UTC | dog_shampoo   |     2
2022-04-01 23:49:07 UTC | dog_toy       |     1
2022-04-01 23:49:09 UTC | dog_toy       |     2
2022-04-01 23:49:09 UTC | dog_shampoo   |     1
2022-04-01 23:50:10 UTC | dog_shampoo   |     1
2022-04-01 23:50:11 UTC | dog_shampoo   |     1

And then the maximum will be:

SELECT
     product_name,
     MAX(concurrent_purchases) AS max_concurrent_purchases
FROM result1
GROUP BY 1

from which the final result will be

product_name   | max_concurrent_purchases
dog_shampoo    |     2
dog_toy        |     2

But if I define concurrent purchases as within 2 seconds of each other to build an output like below:

timestamp                                          | product_name  | concurrent_purchase
2022-04-01 23:49:07 UTC - 2022-04-01 23:49:09 UTC  | dog_shampoo   |     3
2022-04-01 23:49:09 UTC - 2022-04-01 23:49:11 UTC  | dog_shampoo   |     0
2022-04-01 23:49:09 UTC - 2022-04-01 23:49:11 UTC  | dog_shampoo   |     0
                 .                                 | dog_shampoo   |     0
                 .                                 | dog_shampoo   |     0
                 .                                 | dog_shampoo   |     0
2022-04-01 23:50:10 UTC - 2022-04-01 23:50:12 UTC  | dog_shampoo   |     2
2022-04-01 23:49:07 UTC - 2022-04-01 23:49:09 UTC  | dog_toy       |     3

That'll give this result:

product_name   | max_concurrent_purchases
dog_shampoo    |     3
dog_toy        |     3

And I am interested in making a flexible solution where it is easy to change the interval values (2 seconds - 10 seconds) and run for different values.

I figured I can get the minimum timestamp for each product name and start incrementing by maybe a control flow statement but I am not very familiar with setting variables and looping in SQL and wasn’t sure how slow a solution like this would be.

How can I build a query that gives me the desired output?

CodePudding user response:

I made a query here to solve this but not using loops. What I explored was the GENERATE_TIMESTAMP_ARRAY function (enter image description here

If interval_value is 2, output will be:

enter image description here

CodePudding user response:

Consider below approach

select product_name, max(concurrent_purchase) as max_concurrent_purchases
from (
  select *, count(*) over win as concurrent_purchase 
  from your_table
  window win as (
    partition by product_name 
    order by unix_seconds(timestamp) 
    range between current row and 2 following
  )
)
group by product_name    

if applied to sample data in your question - output is

enter image description here

If to use range between current row and 1 following - the output is

enter image description here

  • Related