Postgres query planner won't use less expensive composite index


We've run into a strange issue with Postgres, initially unearthed as we prepared to upgrade from 10.21 to 11. What we'd initially thought was a performance regression in newer versions of Postgres appears, in fact, to be a performance regression once more accurate stats are generated for our database. As it turns out, both last_analyze and last_autoanalyze are NULL for the table in question. The upgrade corrects this, but makes a certain subset of queries much slower.
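For reference, this is how we checked the analyze timestamps (a minimal sketch using the standard pg_stat_user_tables view):

SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'events_test';
-- both timestamps come back NULL on our production DB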

Background

We're currently running PG 10.21 in production. The table we're interacting with here, events_test, includes the two columns we're testing on: account (varchar(40)) and timestamp (timestamptz). There are single-column indexes on account and on timestamp, as well as a composite index on (account, timestamp).
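For context, here's a rough sketch of the relevant parts of the schema. The column types come from the description above and two of the index names appear in the plans below; the single-column index on account and the remaining columns are assumptions:

CREATE TABLE events_test (
    account     varchar(40),
    "timestamp" timestamptz
    -- ... other event columns omitted
);

CREATE INDEX events_test_account_idx           ON events_test (account);  -- name assumed
CREATE INDEX events_test_timestamp_idx         ON events_test ("timestamp");
CREATE INDEX events_test_account_timestamp_idx ON events_test (account, "timestamp");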

Our test query is

EXPLAIN (ANALYZE, BUFFERS) 
SELECT * FROM events_test 
WHERE account = '1234' AND TIMESTAMP >= (NOW() AT TIME ZONE 'EST'  - interval '1 month');

We're testing it against a couple different account values — some we expect to return a large quantity of data (tens of thousands of rows for account 1234) and some a low quantity (tens of rows for account 5678).

On our current production DB, this runs quickly enough for our purposes, around 80-100ms for the large response (account 1234) and ~1ms for the small response (account 5678). In both cases, the query planner opts to start with an index scan on the composite index events_test_account_timestamp_idx.

Out of date stats

For the account returning a large number of rows, the query planner vastly underestimates the number of rows to be returned — it estimates ~500 rows, but gets around ~60,000. This is a result of the table stats being out of date on our production DB.

Manually running ANALYZE events_test;, or upgrading to PG11, PG12 or PG14 (which seems to cause this reanalysis to be run shortly after the upgrade) updates the table stats and causes the row estimate to be closer to the true value — the new estimate is on the order of ~20,000 rows for the account with more data returned.
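For what it's worth, the per-column statistics that drive this row estimate can be inspected directly (a sketch using the standard pg_stats view):

SELECT attname, n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'events_test' AND attname = 'account';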

Query planner

Unfortunately, this more-accurate estimate seems to result in the query planner running a much less efficient query. For the account with little data, the row estimate doesn't change and the planner continues to use the composite index. For the account with a larger amount of returned data, however, the planner now opts for an index scan on the events_test_timestamp_idx, followed by a filter on the account field, which takes the execution time of the query up to 400-500ms.

The query that expects a small number of rows to be returned continues to use the composite index on all versions of Postgres.

What we've tried

We've dug around on SO looking for similar questions about Postgres bypassing a composite index, but nothing we've found is of use. Also:

Analyze, reindex, stats target

We've tried updating the default_statistics_target from 100 to 1000 (and have also tried it briefly at 10 and 10000), followed by reanalyzing our table and reindexing, and the slower query plan has not changed.
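For completeness, this is roughly the sequence we ran (a sketch, shown here with the 1000 target):

SET default_statistics_target = 1000;  -- session-level; picked up by the ANALYZE below
ANALYZE events_test;                   -- rebuild the planner statistics with the new target
REINDEX TABLE events_test;             -- rebuild the table's indexes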

Postgres versions

This issue was initially revealed as we validated against newer versions of Postgres. We've tested after upgrading to PG11, PG12 and PG14, and the slower query planning shows up in all cases. We can also cause it to happen on PG10 by reanalyzing to update our stats, so it doesn't appear that this is a change in planner behaviour introduced by a Postgres update.

Force composite index

We've verified that we can indeed force Postgres to use the composite index by (temporarily) dropping the timestamp index the planner is now preferring with this test command:

BEGIN;

DROP INDEX events_test_timestamp_idx;

EXPLAIN (ANALYZE, BUFFERS) 
SELECT * FROM events_test 
WHERE account = '1234' AND TIMESTAMP >= (NOW() AT TIME ZONE 'EST'  - interval '1 month');

ROLLBACK;

This causes Postgres to revert to the composite index events_test_account_timestamp_idx; however, it now chooses to do a bitmap index scan on that index. This continues to run in under 100ms, and is comparable in performance to the previous straight index scan on the composite index.

Query EXPLAINs

Here is what the query looks like when it's running more efficiently, in production, on PG10:

Index Scan using events_test_account_timestamp_idx on events_test  (cost=0.11..1109.11 rows=579 width=206) (actual time=0.062..116.439 rows=71700 loops=1)
  Index Cond: (((account)::text = '1234'::text) AND ("timestamp" >= (timezone('EST'::text, now()) - '1 mon'::interval)))
  Buffers: shared hit=68619 read=1249
  I/O Timings: read=23.426
Planning time: 0.662 ms
Execution time: 119.339 ms

And here is what it looks like once it has updated statistics, on PG14 (but the output is similar on all versions):

Index Scan using events_test_timestamp_idx on events_test  (cost=0.11..31867.79 rows=18994 width=204) (actual time=0.164..311.960 rows=55583 loops=1)
  Index Cond: ("timestamp" >= ((now() AT TIME ZONE 'EST'::text) - '1 mon'::interval))
  Filter: ((account)::text = '1234'::text)
  Rows Removed by Filter: 462327
  Buffers: shared hit=492867
Planning:
  Buffers: shared hit=144 read=30
  I/O Timings: read=6.160
Planning Time: 7.021 ms
Execution Time: 314.676 ms

And finally, if we force it to use the composite index by temporarily dropping the timestamp index, on PG14 (but again, similar on all versions):

Bitmap Heap Scan on events_test  (cost=187.05..35867.12 rows=18992 width=204) (actual time=11.373..38.937 rows=55582 loops=1)
  Recheck Cond: (((account)::text = '1234'::text) AND ("timestamp" >= ((now() AT TIME ZONE 'EST'::text) - '1 mon'::interval)))
  Heap Blocks: exact=13586
  Buffers: shared hit=13803
  ->  Bitmap Index Scan on events_test_account_timestamp_idx  (cost=0.00..186.10 rows=18992 width=0) (actual time=9.376..9.376 rows=55582 loops=1)
        Index Cond: (((account)::text = '1234'::text) AND ("timestamp" >= ((now() AT TIME ZONE 'EST'::text) - '1 mon'::interval)))
        Buffers: shared hit=217
Planning:
  Buffers: shared hit=5
Planning Time: 0.316 ms
Execution Time: 41.446 ms

Edit: It seems like the core issue here is that, with an accurate estimate, Postgres expects the index scan plus filter on the single index to be cheaper (31867.79) than using the composite index (35867.12), when in fact the composite is nearly 10x faster in the case above. That's interesting, given that the last two EXPLAINs above use the same set of statistics to arrive at these predictions.

CodePudding user response:

Sometimes, when a query selects only a small amount of data, or when the selected rows are a large fraction of the table, PostgreSQL will use a filter rather than an index and discard the non-matching rows. If you want to interfere with this and push it towards index use, try checking the execution plan and timing with enable_seqscan temporarily set to off, and if possible, also read Examining Index Usage in the documentation.
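A sketch of how to run that check inside a transaction so the setting doesn't persist (note that the slow plan above is already an index scan, so enable_seqscan alone may not change it):

BEGIN;
SET LOCAL enable_seqscan = off;  -- planner setting scoped to this transaction
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM events_test
WHERE account = '1234' AND TIMESTAMP >= (NOW() AT TIME ZONE 'EST' - interval '1 month');
ROLLBACK;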

CodePudding user response:

I can't put my finger on the problem, but here a few considerations:

  • Did you create the composite index on (account, "timestamp")? The order matters, and this order would be better.

  • It seems that PostgreSQL expects the condition on account to be not very selective, but that estimate is wrong. Perhaps extended statistics can improve it:

    CREATE STATISTICS events_test_account_timestamp (dependencies, mcv)
       ON account, "timestamp" FROM events_test;
    ANALYZE events_test;
    
  • Did you set random_page_cost the same on both systems? (A quick way to check is sketched below.)

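A sketch of how to compare and experiment with that setting (1.1 is just a value commonly used for SSD-backed storage, not a recommendation for this system):

SHOW random_page_cost;       -- 4.0 is the default
SET random_page_cost = 1.1;  -- session-only experiment
-- then re-run the EXPLAIN (ANALYZE, BUFFERS) query and see whether the plan changes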