Would these two queries yield the same result (Using Window Function vs Group By)-CodePudding

I came across this question in "Ace the Data Science Interview":

"Assume you are given the table below containing information on user purchases. Write a query to obtain the number of people who purchased at least one or more of the same product on multiple days."

purchases

column_name	type
purchase_id	integer
user_id	integer
product_id	integer
quantity	integer
price	float
purchase_time	datetime

I came up with the following query:

SELECT 
    count(DISTINCT user_id) as num_ppl_w_multiple_purchases
FROM (
    SELECT
        user_id,
        count(*) as num_product_purchases /* only increments for purchases made on different days */
    FROM
        purchases
    GROUP BY
        user_id,
        product_id,
        DATE(purchase_time)
    HAVING
        num_product_purchases > 1
) t

And the book provides this solution:

SELECT
    COUNT(DISTINCT user_id)
FROM
    (
        SELECT
            user_id,
            RANK() OVER (
                PARTITION BY user_id,
                product_id
                ORDER BY
                    CAST(purchase_time as DATE)
            ) AS purchase_no
        FROM
            purchases
    ) t
WHERE
    purchase_no = 2

I have no way to verify if these two queries achieve the same thing. Could someone give me affirmation that my query is correct and if it isn't why not?

EDIT Modified query in response to Bill Karwin's response:

SELECT 
    count(DISTINCT user_id) as num_ppl_w_multiple_purchases
FROM (
    SELECT
        user_id,
        count(*) as num_product_purchases
    FROM (
        SELECT
            DISTINCT user_id, product_id, DATE(purchase_time)
        FROM
            purchases
    ) t1
    GROUP BY
        user_id,
        product_id
    HAVING
        num_product_purchases > 1
) t2

CodePudding user response：

Your query looks like it would match a user who bought the same product multiple times on the same day, but it would not find users who bought the same product on multiple days. Your GROUP BY criteria include the date, so the count will only include those entries per date.

 ------------- --------- ------------ ---------- ------- --------------------- 
| purchase_id | user_id | product_id | quantity | price | purchase_time       |
 ------------- --------- ------------ ---------- ------- --------------------- 
|           1 |       2 |         10 |        1 |    10 | 2022-04-20 00:00:00 |
|           2 |       2 |         10 |        1 |    10 | 2022-04-21 00:00:00 |
|           3 |       8 |         10 |        1 |    10 | 2022-04-19 00:00:00 |
|           4 |       8 |         10 |        1 |    10 | 2022-04-19 00:00:00 |
 ------------- --------- ------------ ---------- ------- ---------------------

I'll demonstrate by using part of your subquery:

SELECT user_id, date(purchase_time), count(*) as num_product_purchases 
FROM purchases GROUP BY user_id, product_id, DATE(purchase_time);

 --------- --------------------- ----------------------- 
| user_id | date(purchase_time) | num_product_purchases |
 --------- --------------------- ----------------------- 
|       2 | 2022-04-20          |                     1 |
|       2 | 2022-04-21          |                     1 |
|       8 | 2022-04-19          |                     2 |
 --------- --------------------- -----------------------

The user 2 has made purchases on two separate days, but since each day is in its own group, they are both counted as 1. So they will be excluded by the HAVING condition. Only user 8 gets a count of 2, because they purchased their product on the same day.

But user 8 should not be counted as having purchased the product on multiple days, because they only purchased on one day. So the results are wrong for both users.

The results from testing your updated query (just part of the subquery so we can see the interim result):

SELECT user_id, count(*) as num_product_purchases FROM (
  SELECT DISTINCT user_id, product_id, DATE(purchase_time) FROM purchases) t1 
GROUP BY user_id, product_id;

 --------- ----------------------- 
| user_id | num_product_purchases |
 --------- ----------------------- 
|       2 |                     2 |
|       8 |                     1 |
 --------- -----------------------

That looks better. I can't think of another data sample that would make this wrong.