I came across this question in "Ace the Data Science Interview":
"Assume you are given the table below containing information on user purchases. Write a query to obtain the number of people who purchased at least one or more of the same product on multiple days."
purchases
column_name | type |
---|---|
purchase_id | integer |
user_id | integer |
product_id | integer |
quantity | integer |
price | float |
purchase_time | datetime |
I came up with the following query:
SELECT
count(DISTINCT user_id) as num_ppl_w_multiple_purchases
FROM (
SELECT
user_id,
count(*) as num_product_purchases /* only increments for purchases made on different days */
FROM
purchases
GROUP BY
user_id,
product_id,
DATE(purchase_time)
HAVING
num_product_purchases > 1
) t
And the book provides this solution:
SELECT
COUNT(DISTINCT user_id)
FROM
(
SELECT
user_id,
RANK() OVER (
PARTITION BY user_id,
product_id
ORDER BY
CAST(purchase_time as DATE)
) AS purchase_no
FROM
purchases
) t
WHERE
purchase_no = 2
I have no way to verify if these two queries achieve the same thing. Could someone give me affirmation that my query is correct and if it isn't why not?
EDIT Modified query in response to Bill Karwin's response:
SELECT
count(DISTINCT user_id) as num_ppl_w_multiple_purchases
FROM (
SELECT
user_id,
count(*) as num_product_purchases
FROM (
SELECT
DISTINCT user_id, product_id, DATE(purchase_time)
FROM
purchases
) t1
GROUP BY
user_id,
product_id
HAVING
num_product_purchases > 1
) t2
CodePudding user response:
Your query looks like it would match a user who bought the same product multiple times on the same day, but it would not find users who bought the same product on multiple days. Your GROUP BY criteria include the date, so the count will only include those entries per date.
------------- --------- ------------ ---------- ------- ---------------------
| purchase_id | user_id | product_id | quantity | price | purchase_time |
------------- --------- ------------ ---------- ------- ---------------------
| 1 | 2 | 10 | 1 | 10 | 2022-04-20 00:00:00 |
| 2 | 2 | 10 | 1 | 10 | 2022-04-21 00:00:00 |
| 3 | 8 | 10 | 1 | 10 | 2022-04-19 00:00:00 |
| 4 | 8 | 10 | 1 | 10 | 2022-04-19 00:00:00 |
------------- --------- ------------ ---------- ------- ---------------------
I'll demonstrate by using part of your subquery:
SELECT user_id, date(purchase_time), count(*) as num_product_purchases
FROM purchases GROUP BY user_id, product_id, DATE(purchase_time);
--------- --------------------- -----------------------
| user_id | date(purchase_time) | num_product_purchases |
--------- --------------------- -----------------------
| 2 | 2022-04-20 | 1 |
| 2 | 2022-04-21 | 1 |
| 8 | 2022-04-19 | 2 |
--------- --------------------- -----------------------
The user 2 has made purchases on two separate days, but since each day is in its own group, they are both counted as 1. So they will be excluded by the HAVING condition. Only user 8 gets a count of 2, because they purchased their product on the same day.
But user 8 should not be counted as having purchased the product on multiple days, because they only purchased on one day. So the results are wrong for both users.
The results from testing your updated query (just part of the subquery so we can see the interim result):
SELECT user_id, count(*) as num_product_purchases FROM (
SELECT DISTINCT user_id, product_id, DATE(purchase_time) FROM purchases) t1
GROUP BY user_id, product_id;
--------- -----------------------
| user_id | num_product_purchases |
--------- -----------------------
| 2 | 2 |
| 8 | 1 |
--------- -----------------------
That looks better. I can't think of another data sample that would make this wrong.