I have a table like below:
timestamp | customer_id | product_name
2022-04-01 23:49:07 UTC | a23 | dog_shampoo
2022-04-01 23:49:07 UTC | a33 | dog_shampoo
2022-04-01 23:49:07 UTC | a45 | dog_toy
2022-04-01 23:49:09 UTC | a67 | dog_shampoo
2022-04-01 23:49:09 UTC | a66 | dog_toy
2022-04-01 23:49:09 UTC | a63 | dog_toy
2022-04-01 23:50:10 UTC | a50 | dog_shampoo
2022-04-01 23:50:11 UTC | a51 | dog_shampoo
If I want to get number of concurrent purchases for each product, it can be simply done by a group by like below:
SELECT
product_name,
timestamp,
COUNT(DISTINCT user_id) AS concurrent_purchases
FROM table
GROUP BY 1,2
Getting this output:
result1:
timestamp | product_name | concurrent_purchase
2022-04-01 23:49:07 UTC | dog_shampoo | 2
2022-04-01 23:49:07 UTC | dog_toy | 1
2022-04-01 23:49:09 UTC | dog_toy | 2
2022-04-01 23:49:09 UTC | dog_shampoo | 1
2022-04-01 23:50:10 UTC | dog_shampoo | 1
2022-04-01 23:50:11 UTC | dog_shampoo | 1
And then the maximum will be:
SELECT
product_name,
MAX(concurrent_purchases) AS max_concurrent_purchases
FROM result1
GROUP BY 1
from which the final result will be
product_name | max_concurrent_purchases
dog_shampoo | 2
dog_toy | 2
But if I define concurrent purchases as within 2 seconds of each other to build an output like below:
timestamp | product_name | concurrent_purchase
2022-04-01 23:49:07 UTC - 2022-04-01 23:49:09 UTC | dog_shampoo | 3
2022-04-01 23:49:09 UTC - 2022-04-01 23:49:11 UTC | dog_shampoo | 0
2022-04-01 23:49:09 UTC - 2022-04-01 23:49:11 UTC | dog_shampoo | 0
. | dog_shampoo | 0
. | dog_shampoo | 0
. | dog_shampoo | 0
2022-04-01 23:50:10 UTC - 2022-04-01 23:50:12 UTC | dog_shampoo | 2
2022-04-01 23:49:07 UTC - 2022-04-01 23:49:09 UTC | dog_toy | 3
That'll give this result:
product_name | max_concurrent_purchases
dog_shampoo | 3
dog_toy | 3
And I am interested in making a flexible solution where it is easy to change the interval values (2 seconds - 10 seconds) and run for different values.
I figured I can get the minimum timestamp for each product name and start incrementing by maybe a control flow statement
but I am not very familiar with setting variables and looping in SQL and wasn’t sure how slow a solution like this would be.
How can I build a query that gives me the desired output?
CodePudding user response:
I made a query here to solve this but not using loops. What I explored was the GENERATE_TIMESTAMP_ARRAY function (
If interval_value
is 2, output will be:
CodePudding user response:
Consider below approach
select product_name, max(concurrent_purchase) as max_concurrent_purchases
from (
select *, count(*) over win as concurrent_purchase
from your_table
window win as (
partition by product_name
order by unix_seconds(timestamp)
range between current row and 2 following
)
)
group by product_name
if applied to sample data in your question - output is
If to use range between current row and 1 following
- the output is