How does postgres decide what multi-column index to use?-CodePudding

Given the following ddl

CREATE TABLE test
(
    c1 INTERVAL  NOT NULL,
    c2 TIMESTAMP NOT NULL,
    c3 VARCHAR   NOT NULL,
    PRIMARY KEY (c1, c2)
);

CREATE INDEX c2_c1_index ON test (c2, c1);

The following query

EXPLAIN ANALYSE
SELECT *
FROM test
WHERE c1 = '1 minute'::INTERVAL
  AND c2 > '2020-06-27 00:00:00.0'::TIMESTAMP AND c2 < '2022-06-27 00:00:00.0'
ORDER BY c2
LIMIT 10000

returns this query plan in PostgreSQL 13.3

Limit  (cost=0.43..1102.23 rows=10000 width=40) (actual time=2.127..15.670 rows=10000 loops=1)
  ->  Index Scan using c2_c1_index on test  (cost=0.43..110174.94 rows=999957 width=40) (actual time=2.126..14.715 rows=10000 loops=1)
        Index Cond: ((c2 > '2020-06-25 00:00:00'::timestamp without time zone) AND (c2 < '2022-06-25 00:00:00'::timestamp without time zone) AND (c1 = '00:01:00'::interval))
Planning Time: 0.290 ms
Execution Time: 16.201 ms

Why does postgres chooses (c2, c1) index instead of (c1, c2) primary key index? Based on the postgres documentation and many other SO answers, columns that are used for equality constraints should be leading column in the index definition. So (c1, c2) seems like a perfect index for that query especially considering sorting.

Also I should note that c1 only has 8 possible values but based on other answers (e.g.) it shouldn't really matter in multicolumn index.

CodePudding user response：

I am assuming that the physical ordering of rows in the table is closely aligned with the order of c2, in other words that pg_stats.correlation is close to 1.0 for c2. So it thinks that by following the index leading with c2, it will get to read the needed portion of the table mostly in order, minimizing the the number of block reads (because each block will have many needed rows) and making the reads it does need to do be more sequential than random. That is a pretty big benefit. The downside is that you need to filter out the rows with the wrong c1, but if most rows have the correct c1 ('1 minute' is the dominant value), that is not much of a downside. Also, this filtering mostly involves CPU usage, and you can expect to get a lot of CPU work done in the time needed to wait for a random disk read.

You could reason that if the table is well correlated on c2, then it probably continues to be well correlated on it even after conditioning on c1, so the other index should enjoy the same benefit. But, the planner does not reason that far ahead, it only looks at the correlation of the first column of the index.

CodePudding user response：

I suspect it's preferring that for reasons of cardinality. You hinted at this:

Also I should note that c1 only has 8 possible values but based on other answers (e.g.) it shouldn't really matter in multicolumn index.

Postgres tracks the number of distinct values in each column, as well as how often the most frequent values occur.

Example of column statistics from Postgres documentation:

SELECT null_frac, n_distinct, most_common_vals, most_common_freqs > FROM pg_stats 
WHERE tablename='tenk1' AND attname='stringu1';

null_frac         | 0
n_distinct        | 672
most_common_vals  | {FDAAAA,NHAAAA,ATAAAA,BGAAAA,EBAAAA,MOAAAA,NDAAAA,OWAAAA,BHAAAA,BJAAAA}
most_common_freqs | {0.00333333,0.00333333,0.003,0.003,0.003,0.003,0.003,0.003,0.00266667,0.00266667}

That means that if '1 minute' is one of the most common values in the c1 column, the planner would know approximately how many rows would be selected from just the c1 = '1 minute' part of the query. If that number is high, then it would be preferable to do an index scan on another index first, even if it's a range query. (Assuming the range query is selective enough, which it can estimate through a histogram. See the link above for how that works.)

An easy way to check this theory would be to re-create your table, with a different value for c1 in each row. Does it still prefer c2_c1_index?