PostgreSQL get all records with most recent date-CodePudding

I have this query that will run on very large data sets and it is too slow.

SELECT *
FROM tableA
WHERE columnA =
        (SELECT MAX(columnA) -- select most recent date from entries we care about
            FROM tableA
            WHERE columnB = '1234' )
    AND columnC in (1,2,3) -- pull a subset out of those entries, this set here can be a thousand (ish) large.

table A looks something like this

pk	columnA	columnB	columnC
1	5/6/2022	1234	1
2	5/6/2022	1234	2
3	5/5/2022	0000	3
4	5/3/2022	0000	4

There are about 1000 distinct entries in columnB and many orders of magnitude more in the table. Is there a better way to structure the query? Or columns I can add to the table that will make it faster?

CodePudding user response：

I suspect that it will be the last line which is taking the most time because the list must be parsed.

AND columnC in (1,2,3) 
-- pull a subset out of those entries, this set here can be a thousand (ish) large.

It would be better to put these values in a table with an index (PRIMARY KEY) so that the query only consults the index.

Join tableX X
On x.id = columnC;

We can also create indexes on columns A and B.
https://dbfiddle.uk/?rdbms=postgres_12&fiddle=6223777b7cbfa986d1eb852ac08aeaaf

CodePudding user response：

You can use window functions to improve performance. For instance,

SELECT *
FROM tableA
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY columnB
    ORDER BY columnA DESC
) = 1

The above query will select the most recent columnB, ordered by columnA.

However, it appears as though you have a tie for pk's 1 and 2... so you might want to think about adding a tie-breaker in the orderby clause.

I am not sure if postgres syntax is slightly different, but another way to do this is:

SELECT 
  a.* 
FROM 
  tableA as a 
  INNER JOIN (
    SELECT 
      columnB, 
      MAX_A 
    FROM 
      (
        SELECT 
          columnB, 
          MAX(columnsA) as MAX_A 
        FROM 
          tableA 
        GROUP BY 
          columnB
      ) as rsMax 
    GROUP BY 
      columnB, 
      MAX_A
  ) as rsUnique ON a.columnA = rsUnique.MAX_A 
  AND a.columnB = rsUnique.columnB

I had to nest the inner subquery rsMax and dedupe it, because of the ties.

I used Rasgo to generate the SQL and tested it on Snowflake. At least 1 of these 2 methods should work for you in postgres.