Why postgresql can efficiently use the index in this query?

Time: 06-29

Given the following DDL:

CREATE TABLE test
(
    c1 SMALLINT  NOT NULL,
    c2 INTERVAL  NOT NULL,
    c3 TIMESTAMP NOT NULL,
    c4 VARCHAR   NOT NULL,
    PRIMARY KEY (c1, c2, c3)
);

CREATE INDEX test_index ON test (c3, c2);

The following query:

SELECT *
FROM test
WHERE c2 = '1 minute'::INTERVAL
ORDER BY c3
LIMIT 1000

gives the following query plan in PostgreSQL 13.3:

Limit  (cost=0.43..49.92 rows=1000 width=60)
    ->  Index Scan using test_index on test  (cost=0.43..316739.07 rows=6400526 width=60)
            Index Cond: (c2 = '00:01:00'::interval)

Considering that test_index has its columns in the order (c3, c2), how can Postgres efficiently filter by c2 and sort by c3 using this index? From my understanding, columns that appear in ORDER BY must be the last ones in the index definition, otherwise the index will not be used. The query also behaves the same way with ORDER BY c3 DESC.

CodePudding user response:

Without the actual run statistics (EXPLAIN ANALYZE), we don't know that it is efficient. We only know that the planner thinks it is more efficient than the alternatives.
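To get those actual run statistics, you could run the query under EXPLAIN ANALYZE; the BUFFERS option is optional but shows how much I/O the scan really did:

```sql
-- Reports actual time, row counts, and buffer usage alongside the estimates
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM test
WHERE c2 = '1 minute'::INTERVAL
ORDER BY c3
LIMIT 1000;
```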

By reading the rows already in the desired order, it can filter out the ones that fail the c2 condition, then stop once it has accumulated 1000 rows that pass the condition. The planner thinks it will accomplish this after reading only about 1/6000 of the index.

The plan doesn't explicitly say that the index is being used to provide ordering. We can infer that based on the absence of a Sort node. PostgreSQL knows how to follow an index in either direction, which is why the index also works if the order is DESC.
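For instance, reversing the sort direction should still avoid a Sort node. The plan shape below is illustrative (it is not taken from the question), but one would expect the same index scanned backward:

```sql
SELECT *
FROM test
WHERE c2 = '1 minute'::INTERVAL
ORDER BY c3 DESC
LIMIT 1000;

-- Expected plan shape (illustrative): the same index walked in reverse,
-- still with no Sort node:
--   Limit
--     ->  Index Scan Backward using test_index on test
--           Index Cond: (c2 = '00:01:00'::interval)
```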

Whatever efficiency this plan has comes mostly from stopping early and avoiding the sort. The filtering on c2 = '00:01:00'::interval is not very efficient: the scan can't jump to the part of the index where it knows that condition to be true; instead it has to walk the index and assess each index tuple individually to filter it. But at least it can apply the filter to the index tuple without needing to visit the table tuple, which can save a lot of random IO. (I think it would be a good idea if the plan somehow distinguished jump-to index usage from in-index filtering usage, but that is easier said than done.)

An index just on c3 could still read in order and stop early, but it would have to visit the table for every tuple, even the ones that end up failing the c2 condition. The better index would be on (c2, c3): that way the scan can jump straight to the part of the index satisfying the c2 condition, and then read in order by c3 within just that part.
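A sketch of that suggestion (the index name test_c2_c3 is made up for illustration):

```sql
-- Hypothetical index per the answer's suggestion: the equality column first,
-- then the ORDER BY column, so the scan can seek on c2 and stay sorted on c3.
CREATE INDEX test_c2_c3 ON test (c2, c3);
```

With this index, the planner can start at the first entry with c2 = '1 minute' and walk forward in c3 order, so no index entries are read and then discarded.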

CodePudding user response:

It may not be efficient.

However, the optimizer chose to filter on the index.

That means it will read index entries that are sorted according to the expected ordering, but not all of them will be useful. That's why it added the filtering predicate c2 = '00:01:00'::interval on the index scan.

Maybe the cost of the index scan that discards entries is still lower than that of a table scan, especially considering it will keep at most 1000 rows.
