I have this query that will run on very large data sets and it is too slow.
SELECT *
FROM tableA
WHERE columnA =
(SELECT MAX(columnA) -- select most recent date from entries we care about
FROM tableA
WHERE columnB = '1234' )
AND columnC in (1,2,3) -- pull a subset out of those entries, this set here can be a thousand (ish) large.
table A looks something like this
pk | columnA | columnB | columnC |
---|---|---|---|
1 | 5/6/2022 | 1234 | 1 |
2 | 5/6/2022 | 1234 | 2 |
3 | 5/5/2022 | 0000 | 3 |
4 | 5/3/2022 | 0000 | 4 |
There are about 1000 distinct entries in columnB
and many orders of magnitude more in the table. Is there a better way to structure the query? Or columns I can add to the table that will make it faster?
CodePudding user response:
I suspect that it will be the last line which is taking the most time because the list must be parsed.
AND columnC in (1,2,3)
-- pull a subset out of those entries, this set here can be a thousand (ish) large.
It would be better to put these values in a table with an index (PRIMARY KEY) so that the query only consults the index.
Join tableX X
On x.id = columnC;
We can also create indexes on columns A and B.
https://dbfiddle.uk/?rdbms=postgres_12&fiddle=6223777b7cbfa986d1eb852ac08aeaaf
CodePudding user response:
You can use window functions to improve performance. For instance,
SELECT *
FROM tableA
QUALIFY ROW_NUMBER() OVER (
PARTITION BY columnB
ORDER BY columnA DESC
) = 1
The above query will select the most recent columnB, ordered by columnA.
However, it appears as though you have a tie for pk's 1 and 2... so you might want to think about adding a tie-breaker in the orderby clause.
I am not sure if postgres syntax is slightly different, but another way to do this is:
SELECT
a.*
FROM
tableA as a
INNER JOIN (
SELECT
columnB,
MAX_A
FROM
(
SELECT
columnB,
MAX(columnsA) as MAX_A
FROM
tableA
GROUP BY
columnB
) as rsMax
GROUP BY
columnB,
MAX_A
) as rsUnique ON a.columnA = rsUnique.MAX_A
AND a.columnB = rsUnique.columnB
I had to nest the inner subquery rsMax and dedupe it, because of the ties.
I used Rasgo to generate the SQL and tested it on Snowflake. At least 1 of these 2 methods should work for you in postgres.