How to calculate the difference in the average check for M/F users for a certain period in PostgresS-CodePudding

There are two tables:

Trips (order_id, user_id, order_dt, order_cost) - all taxi trips

Users (user_id, gender) - table of users, where the gender of the user (M / W) is marked

Required: Calculate the difference in the average check for M/W users over a certain period (using any method in SQL, Python, Excel, etc.) Explain why it may differ

My solution to the problem: To calculate the revenue, you need to multiply the average check at the end of the month by the number of customers who made a purchase from you. To calculate the average check, you need to divide the revenue for the day (or the period of interest to us) by the number of checks, the resulting amount is the average check for the day. It can differ only from the number of male and female users.

1)in SQL -- calculation of the average check

SELECT SUM(u.order_cost) / COUNT(u.order_id)
FROM taxi AS t
INNER JOIN users AS u ON t.user_id = u.user_id
GROUP BY u.gender

-- difference I think so

SELECT avg(order_cost) filter(where gender = 'M') - avg(order_cost) filter(where gender = 'W') 
FROM taxi AS t 
INNER JOIN users AS u ON t.user_id = u.user_id

But they tell me it's wrong. I don't understand where I have an error: in the syntax or should I use a subquery, window functions? I'm new to SQL, so I don't understand how to do it right here.

In Excel, this task is calculated in 2 minutes with pulling up the floor, the summary and the calculated field of the average check.
In Python, I would use the Pandas and Numpy libraries

CodePudding user response：

This

SELECT SUM(u.order_cost) / COUNT(u.order_id)
FROM taxi AS t
INNER JOIN users AS u ON t.user_id = u.user_id
GROUP BY u.gender

Will give two rows -- one for male and one for female.

This will give for male only

SELECT SUM(u.order_cost) / COUNT(u.order_id)
FROM taxi AS t
INNER JOIN users AS u ON t.user_id = u.user_id
WHERE u.gender = 'M'

Here is how I would do it, the simple way

SELECT SUM(AM.AVG_O) / COUNT(AM.USER_ID) AS M_AVG,
       SUM(AW.AVG_O) / COUNT(AW.USER_ID) AS W_AVG,
       SUM(AM.AVG_O) / COUNT(AM.USER_ID) - SUM(AW.AVG_O) / COUNT(AW.USER_ID) AS DIFF
FROM taxi AS t
JOIN (
  SELECT USER_ID, SUM(u.order_cost) / COUNT(u.order_id) AS AVG_O
  FROM users 
  WHERE u.gender = 'M'
  GROUP BY USER_ID
) AS AM ON AM.USER_ID = T.USER_ID
JOIN (
  SELECT USER_ID, SUM(u.order_cost) / COUNT(u.order_id) AS AVG_O
  FROM users 
  WHERE u.gender = 'W'
  GROUP BY USER_ID
) AS AW ON AW.USER_ID = T.USER_ID

You could be fancy and reduce this to only one sub-query but this is clearer to understand. (Use CASE in main select)

Looking at it again I realized it is possible to simplify since you don't care about each user in the outer query. A cross join feels bad, but there is only 1 row in each of these tables so the result is just one row:

SELECT AM.AVG_O AS M_AVG,
       AW.AVG_O AS W_AVG,
       AM.AVG_O - AW.AVG AS DIFF
FROM taxi AS t
CROSS JOIN (
  SELECT SUM(u.order_cost) / COUNT(u.order_id) AS AVG_O
  FROM users 
  WHERE u.gender = 'M'
) 
CROSS JOIN (
  SELECT SUM(u.order_cost) / COUNT(u.order_id) AS AVG_O
  FROM users 
  WHERE u.gender = 'W'
)