PostgreSQL doesn't consistently use partial index


I have a postgres 13.3 table that looks like the following:

CREATE TABLE public.enrollments (
    id bigint NOT NULL,
    portfolio_id bigint NOT NULL,
    consumer_id character varying(255) NOT NULL,
    identity_id character varying(255) NOT NULL,
    deleted_at timestamp(0) without time zone,
    batch_replace boolean DEFAULT false NOT NULL
);
CREATE UNIQUE INDEX enrollments_portfolio_id_consumer_id_index ON public.enrollments 
  USING btree (portfolio_id, consumer_id) WHERE (deleted_at IS NULL);

Each portfolio typically contains many millions of enrollments. My customers regularly send me a batch file that contains all of their enrollments, so I have to make the db match this file. I read a chunk of about 1000 rows at a time and then query to check whether those enrollments already exist, with a query such as the following:

SELECT * FROM enrollments WHERE deleted_at IS NULL AND portfolio_id = 1 
  AND consumer_id = ANY(ARRAY['C1', 'C2', ..., 'C1000'])

It appears that for a new portfolio it doesn't use the unique partial index, so this query can take up to 30 seconds. Once there are already several million enrollments in the portfolio, the index seems to work and the query takes around 20 ms. I've had to change the SQL to query one enrollment at a time, which takes about 1 second per 1000 rows. This isn't ideal, as it can take up to a day to finish a file, but at least it finishes.
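For reference, the one-at-a-time workaround is essentially the same query restricted to a single consumer_id; a minimal sketch (the literal values here are placeholders):

-- One lookup per enrollment: fast individually, but it adds up to about 1 s per 1000 rows
SELECT * FROM enrollments
  WHERE deleted_at IS NULL AND portfolio_id = 1 AND consumer_id = 'C1';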

Does anybody know what I can do to get the unique partial index to be used consistently when using many consumer_ids in the select?

Below is some explain output. The slow query took a little over 4 seconds, and this grows to at least 30 seconds as more and more enrollments get inserted into the portfolio, until at some point it drops to about 20 ms.

Existing enrollments in this portfolio: 78140485

Index Scan using enrollments_portfolio_id_consumer_id_index on enrollments e0  (cost=0.70..8637.14 rows=1344 width=75) (actual time=3.529..37.827 rows=1000 loops=1)
  Index Cond: ((portfolio_id = '59031'::bigint) AND ((consumer_id)::text = ANY ('{C1,C2,...,C1000}'::text[])))
  I/O Timings: read=27.280
Planning Time: 0.477 ms
Execution Time: 37.914 ms

Benchmark time: 20 ms


Existing enrollments in this portfolio: 136000

Index Scan using enrollments_portfolio_id_consumer_id_index on enrollments e0  (cost=0.70..8.87 rows=1 width=75) (actual time=76.615..4354.081 rows=1000 loops=1)
  Index Cond: (portfolio_id = '59028'::bigint)
  Filter: ((consumer_id)::text = ANY ('{C1,C2,...,C1000}'::text[]))
  Rows Removed by Filter: 135000
Planning Time: 1.188 ms
Execution Time: 4354.341 ms

Benchmark time: 4398 ms

CodePudding user response:

The thing here that is actually slow is that =ANY is implemented by looping over the 1000 members of your array and testing each one, and doing that for each of the 136000 rows it needs to inspect: roughly 136,000 × 1,000 = 136 million comparisons. That is a lot of looping (but not 4 seconds' worth in my hands, "only" 1.5 s for me). Worse, the planner doesn't anticipate that the =ANY has such a poor implementation, so it sees no reason to choose the other plan to avoid it.

v14 will fix this by using a hash table to implement the =ANY, so it will no longer be so inordinately slow.

If you can't or don't want to upgrade to v14, you could rewrite the query to join to a VALUES list rather than using =ANY:

SELECT * FROM enrollments
  JOIN (VALUES ('C1'), ..., ('C1000')) f(c) ON c = consumer_id
  WHERE deleted_at IS NULL AND portfolio_id = 1
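A closely related rewrite (not from the answer above, but standard PostgreSQL) joins against unnest of an array instead of a VALUES list, which can be easier to bind as a single array parameter from application code; a minimal sketch with placeholder values:

-- Joining to the unnested array lets the planner treat the IDs as a join input,
-- so it can (depending on the plan it picks) probe the partial index once per element
-- instead of filtering every row with =ANY
SELECT e.*
  FROM enrollments e
  JOIN unnest(ARRAY['C1', 'C2', 'C3']) AS f(consumer_id)
    ON f.consumer_id = e.consumer_id
  WHERE e.deleted_at IS NULL AND e.portfolio_id = 1;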