Incorrect index usage Postgresql Version 12-CodePudding

Query Plan:

db=> explain
db-> SELECT MIN("id"), MAX("id") FROM "public"."tablename" WHERE ( "updated_at" >= '2022-07-24 09:08:05.926533' AND "updated_at" < '2022-07-28 09:16:54.95459' );
                                                                                   QUERY PLAN                                                                                    
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Result  (cost=128.94..128.95 rows=1 width=16)
   InitPlan 1 (returns $0)
     ->  Limit  (cost=0.57..64.47 rows=1 width=8)
           ->  Index Scan using tablename_pkey on tablename  (cost=0.57..416250679.26 rows=6513960 width=8)
                 Index Cond: (id IS NOT NULL)
                 Filter: ((updated_at >= '2022-07-24 09:08:05.926533'::timestamp without time zone) AND (updated_at < '2022-07-28 09:16:54.95459'::timestamp without time zone))
   InitPlan 2 (returns $1)
     ->  Limit  (cost=0.57..64.47 rows=1 width=8)
           ->  Index Scan Backward using tablename_pkey on tablename tablename_1  (cost=0.57..416250679.26 rows=6513960 width=8)
                 Index Cond: (id IS NOT NULL)
                 Filter: ((updated_at >= '2022-07-24 09:08:05.926533'::timestamp without time zone) AND (updated_at < '2022-07-28 09:16:54.95459'::timestamp without time zone))
(11 rows)

Indexes:
    "tablename_pkey" PRIMARY KEY, btree (id)
    "tablename_updated_at_incl_id_partial_idx" btree (updated_at) INCLUDE (id) WHERE updated_at >= '2022-07-01 00:00:00'::timestamp without time zone

Idea is when there is already a filtered index which only has small subset of records, why is query doing index scan on primary key, instead of tablename_updated_at_incl_id_partial_idx. Also this is a heap table not clustered table.

CodePudding user response：

Because you're using MIN and MAX, try redefining your second index so id is part of the BTREE index, not just INCLUDEd in it. That may make searching for the MIN and MAX items faster.

CodePudding user response：

Since a small fraction of your table really is over 6e6 rows, then your data must be huge. And I am guessing that id and updated_at are nearly perfectly correlated with each other, so selecting specifically for recent updated_at means you are also selecting for higher id. But the planner doesn't now about that. It thinks that by walking up the id index it can stop after walking about 1/6513960 of it, once it finds the first row qualifying on the time column. But instead it has to walk most of the index before finding that row.

The simplest solution probably to introduce some dummy arithmetic into the aggregates: SELECT MIN("id" 0), MAX("id" 0) ... This will force it not to use the index on id. This will probably be the most robust and simplest solution as long as you have the flexibility to change the query text in your app. But even if you can't change the app, this should at least allow you to verify my assumptions and capture an EXPLAIN (ANALYZE) of it while it is not using the pk index.

None of PostgreSQL's advanced statistics will (as of yet) fix this problem. so you are stuck with fixing it by changing the query or the indexes. Changing the query in the silly way I described is the best currently available solution, but if you need to do just with indexes there are some other less-good options but which will likely still be better than what you currently have.

One is to make the horrible index scan at least into a horrible index-only scan. You could replace your existing primary key index with one like create unique index on tablename (id) include (updated_at). Here the INCLUDE is necessary because otherwise the UNIQUE would not do what you want. It will still have to walk a large part of the index, but at least it won't need to keep jumping between index and table to fetch the time column. (Make sure the table is well-vacuumed)

Or, you could provide a partial index that the planner would find attractive, by switching the order of the columns in it: create index on tablename (id, updated_at) WHERE updated_at >= '2022-07-01 00:00:00'::timestamp without time zone The only thing that makes this better than your existing partial index is that this one would actually get used.