I am using PostgreSQL 13.4.
I would like to execute the following query in order to receive the date of the most recent effect for each campaign in a set:
SELECT DISTINCT ON (campaignid)
campaignid,
created
FROM
effects
WHERE
campaignid IN(1, 2, 3) -- the condition. may include different values
ORDER BY
campaignid,
created DESC;
I also have several indexes on that table:
CREATE INDEX effects_campaignid_created_desc_idx ON effects (campaignid, created DESC);
CREATE INDEX effects_created_idx ON effects (created);
Usually when I execute the query, the index effects_campaignid_created_desc_idx
is used. Here's the execution plan:
Unique (cost=26172.58..26406.28 rows=5 width=16) (actual time=710.816..712.315 rows=2 loops=1)
Buffers: shared hit=2480 read=2792
-> Sort (cost=26172.58..26289.43 rows=46739 width=16) (actual time=710.814..711.355 rows=12200 loops=1)
Sort Key: campaignid, created DESC
Sort Method: quicksort Memory: 956kB
Buffers: shared hit=2480 read=2792
-> Index Only Scan using effects_campaign_created_desc_idx on effects (cost=0.57..22547.42 rows=46739 width=16) (actual time=0.954..706.329 rows=12200 loops=1)
" Index Cond: (campaignid = ANY ('{1,2,3}'::bigint[]))"
Heap Fetches: 9079
Buffers: shared hit=2474 read=2792
Planning:
Buffers: shared hit=145 read=14 dirtied=1
Planning Time: 0.682 ms
Execution Time: 712.417 ms
However:
- there is one campaign that has way more effects than any other effect (1000 times fold). Whenever campaign ID 7 is in the requested result set, the query planner chooses a full table scan. The execution plan (no
analyze
because it runs forever):
Unique (cost=27736480.20..28359558.40 rows=5 width=16)
-> Sort (cost=27736480.20..28048019.30 rows=124615640 width=16)
Sort Key: campaignid, created DESC
-> Seq Scan on effects (cost=0.00..7329244.30 rows=124615640 width=16)
" Filter: (campaignid = ANY ('{1,2,3,7}'::bigint[]))"
JIT:
Functions: 5
Options: Inlining true, Optimization true, Expressions true, Deforming true
- when I change the condition to
WHERE campaignid = 7
the query planner changes behavior again and this time chooses a backward scan on the indexeffects_created_idx
, which is equally not efficient. Here's the execution plan:
Unique (cost=0.57..8071964.32 rows=5 width=16)
-> Index Scan Backward using effects_created_idx on effects (cost=0.57..8071964.32 rows=124568901 width=16)
Filter: (campaignid = 7)
Planning:
Buffers: shared hit=5
JIT:
Functions: 4
Options: Inlining true, Optimization true, Expressions true, Deforming true
I understand that PostgreSQL 'knows' that most of the effects
have campaignid=7
, for this reason it chooses to not focus on using the index on campaignid
and chooses another index instead or a sequential scan.
Is there a way to hint / convince PostgreSQL to use the more efficient index effects_campaignid_created_desc_idx
for these queries, regardless of which campaign I choose to get results for?
CodePudding user response:
To do this efficiently, PostgreSQL would probably need a skip scan. But it doesn't know how to do one of them. (People are working to implement it, but even when done I don't know if it would work for DISTINCT ON).
Until it works automatically, you can get an efficient implementation using LATERAL and LIMIT.
select * from
(values(1),(2),(3),(7)) f(campaignid)
cross join lateral
(select created from effects where f.campaignid=campaignid order by created desc limit 1)foo;
CodePudding user response:
Try to replace campaignid IN(1, 2, 3)
by campaignid = ANY(1, 2, 3)
. This made several my queries faster.
However, I had campaignid IN (SELECT UNNEST($1::TEXT[]))
and I replaced it by campaignid = ANY($1)
so it may differ in your case if your values in query are constants.