(Simplified example). I have this table:
USER ID | code | sales
1 x 100
1 y 200
1 z 150
2 x 300
2 z 17
2 y 100
For each user, I need to show just the record with the max sales.
USER ID | code | sales
1 y 200
2 x 300
In the real table I have, It is millions of records. What would be the most efficient way of doing this?
Right now I find for each user the max sales record and join it unto itself to find the full record.
SELECT * from T
WHERE sales = (SELECT MAX(sales) FROM T WHERE user_id=outr.user_id)
Let's assume that sales number does not repeat itself for a specific user.
CodePudding user response:
Here's my attempt:
SELECT T_with_maxsales.user_id,
T_with_maxsales.code,
T_with_maxsales.sales
FROM (SELECT T.*,
MAX(sales) OVER(PARTITION BY user_id) AS max_sales
FROM T) T_with_maxsales
WHERE sales = max_sales
CodePudding user response:
Refactor your dependent subquery into an independent subquery and then JOIN it.
SELECT T.*
FROM T
JOIN (
SELECT MAX(sales) max_sales, user_id
FROM T
GROUP BY user_id
) M ON T.user_id = M.user_id
AND T.sales = M.max_sales;
The subquery gets the largest sale for each user it. The JOIN operation's ON clause retrieves the detail record for each user's largest sale.
A multicolumn index on (user_id, sales)
will make this query work efficiently when running on a large table.
This works on both mysql and postgresql.
CodePudding user response:
With Postgres the most efficient way is typically using distinct on()
select distinct on (user_id) *
from the_table
order by user_id, sales desc;