PostgreSQL chooses wrong multicolumn index for ordered query-CodePudding

I am using PostgreSQL 13.4.

I would like to execute the following query in order to receive the date of the most recent effect for each campaign in a set:

SELECT DISTINCT ON (campaignid)
    campaignid,
    created
FROM
    effects
WHERE
    campaignid IN(1, 2, 3) -- the condition. may include different values
ORDER BY
    campaignid,
    created DESC;

I also have several indexes on that table:

CREATE INDEX effects_campaignid_created_desc_idx ON effects (campaignid, created DESC);
CREATE INDEX effects_created_idx ON effects (created);

Usually when I execute the query, the index effects_campaignid_created_desc_idx is used. Here's the execution plan:

Unique  (cost=26172.58..26406.28 rows=5 width=16) (actual time=710.816..712.315 rows=2 loops=1)
  Buffers: shared hit=2480 read=2792
  ->  Sort  (cost=26172.58..26289.43 rows=46739 width=16) (actual time=710.814..711.355 rows=12200 loops=1)
        Sort Key: campaignid, created DESC
        Sort Method: quicksort  Memory: 956kB
        Buffers: shared hit=2480 read=2792
        ->  Index Only Scan using effects_campaign_created_desc_idx on effects  (cost=0.57..22547.42 rows=46739 width=16) (actual time=0.954..706.329 rows=12200 loops=1)
"              Index Cond: (campaignid = ANY ('{1,2,3}'::bigint[]))"
              Heap Fetches: 9079
              Buffers: shared hit=2474 read=2792
Planning:
  Buffers: shared hit=145 read=14 dirtied=1
Planning Time: 0.682 ms
Execution Time: 712.417 ms

However:

there is one campaign that has way more effects than any other effect (1000 times fold). Whenever campaign ID 7 is in the requested result set, the query planner chooses a full table scan. The execution plan (no analyze because it runs forever):

Unique  (cost=27736480.20..28359558.40 rows=5 width=16)
  ->  Sort  (cost=27736480.20..28048019.30 rows=124615640 width=16)
        Sort Key: campaignid, created DESC
        ->  Seq Scan on effects  (cost=0.00..7329244.30 rows=124615640 width=16)
"              Filter: (campaignid = ANY ('{1,2,3,7}'::bigint[]))"
JIT:
  Functions: 5
  Options: Inlining true, Optimization true, Expressions true, Deforming true

when I change the condition to WHERE campaignid = 7 the query planner changes behavior again and this time chooses a backward scan on the index effects_created_idx, which is equally not efficient. Here's the execution plan:

Unique  (cost=0.57..8071964.32 rows=5 width=16)
  ->  Index Scan Backward using effects_created_idx on effects  (cost=0.57..8071964.32 rows=124568901 width=16)
        Filter: (campaignid = 7)
Planning:
  Buffers: shared hit=5
JIT:
  Functions: 4
  Options: Inlining true, Optimization true, Expressions true, Deforming true

I understand that PostgreSQL 'knows' that most of the effects have campaignid=7, for this reason it chooses to not focus on using the index on campaignid and chooses another index instead or a sequential scan.

Is there a way to hint / convince PostgreSQL to use the more efficient index effects_campaignid_created_desc_idx for these queries, regardless of which campaign I choose to get results for?

CodePudding user response：

To do this efficiently, PostgreSQL would probably need a skip scan. But it doesn't know how to do one of them. (People are working to implement it, but even when done I don't know if it would work for DISTINCT ON).

Until it works automatically, you can get an efficient implementation using LATERAL and LIMIT.

select * from 
    (values(1),(2),(3),(7)) f(campaignid) 
cross join lateral 
    (select created from effects where f.campaignid=campaignid order by created desc limit 1)foo;

CodePudding user response：

Try to replace campaignid IN(1, 2, 3) by campaignid = ANY(1, 2, 3). This made several my queries faster.

However, I had campaignid IN (SELECT UNNEST($1::TEXT[])) and I replaced it by campaignid = ANY($1) so it may differ in your case if your values in query are constants.