Postgres underestimating the number of rows leading to bad query plan-CodePudding

I have a query which is taking 2.5 seconds to run. On checking the query plan, I got to know that postgres is heavily underestimating the number of rows leading to nested loops.
Following is the query

explain analyze 
SELECT 
  reprocessed_videos.video_id AS reprocessed_videos_video_id 
FROM 
  reprocessed_videos 
  JOIN commit_info ON commit_info.id = reprocessed_videos.commit_id 
WHERE 
  commit_info.tag = 'stop_sign_tbc_inertial_fix' 
  AND reprocessed_videos.reprocess_type_id = 28 
  AND reprocessed_videos.classification_crop_type_id = 0 
  AND reprocessed_videos.reprocess_status = 'success';

Following is the explain analyze output.

  Nested Loop  (cost=0.84..22941.18 rows=1120 width=4) (actual time=31.169..2650.181 rows=179524 loops=1)
   ->  Index Scan using commit_info_tag_key on commit_info  (cost=0.28..8.29 rows=1 width=4) (actual time=0.395..0.397 rows=1 loops=1)
         Index Cond: ((tag)::text = 'stop_sign_tbc_inertial_fix'::text)
   ->  Index Scan using ix_reprocessed_videos_commit_id on reprocessed_videos  (cost=0.56..22919.99 rows=1289 width=8) (actual time=30.770..2634.546 rows=179524 loops=1)
         Index Cond: (commit_id = commit_info.id)
         Filter: ((reprocess_type_id = 28) AND (classification_crop_type_id = 0) AND ((reprocess_status)::text = 'success'::text))
         Rows Removed by Filter: 1190
 Planning Time: 0.326 ms
 Execution Time: 2657.724 ms

As we can see index scan using ix_reprocessed_videos_commit_id anticipated 1289 rows, whereas there were 179524 rows. I have trying to find the reason for this but have been unsuccessful in whatever I tried.

Following are the things I tried.

Vacuum and analyzing all the involved tables. (helped a little but not much maybe because the tables were automatically vacuumed and analyzed)
Increasing the statistics count for commit_id column alter table reprocessed_videos alter column commit_id set statistics 1000; (helped a little)
I read about extended statistics, but not sure if they are of any use here.

Following are the number of tuples in each of these tables

kpis=> SELECT relname, reltuples FROM pg_class where relname in ('reprocessed_videos', 'video_catalog', 'commit_info');
      relname       |   reltuples   
-------------------- ---------------
 commit_info        |          1439
 reprocessed_videos | 3.1563756e 07

Following is some information related to table schemas

                                                 Table "public.reprocessed_videos"
           Column            |            Type             | Collation | Nullable |                    Default                     
----------------------------- ----------------------------- ----------- ---------- ------------------------------------------------
 id                          | integer                     |           | not null | nextval('reprocessed_videos_id_seq'::regclass)
 video_id                    | integer                     |           |          | 
 reprocess_status            | character varying           |           |          | 
 commit_id                   | integer                     |           |          | 
 reprocess_type_id           | integer                     |           |          | 
 classification_crop_type_id | integer                     |           |          | 
Indexes:
    "reprocessed_videos_pkey" PRIMARY KEY, btree (id)
    "ix_reprocessed_videos_commit_id" btree (commit_id)
    "ix_reprocessed_videos_video_id" btree (video_id)
    "reprocessed_videos_video_commit_reprocess_crop_key" UNIQUE CONSTRAINT, btree (video_id, commit_id, reprocess_type_id, classification_crop_type_id)
Foreign-key constraints:
    "reprocessed_videos_commit_id_fkey" FOREIGN KEY (commit_id) REFERENCES commit_info(id)

                                         Table "public.commit_info"
         Column         |       Type        | Collation | Nullable |                 Default                 
------------------------ ------------------- ----------- ---------- -----------------------------------------
 id                     | integer           |           | not null | nextval('commit_info_id_seq'::regclass)
 tag                    | character varying |           |          | 
 commit                 | character varying |           |          | 

Indexes:
    "commit_info_pkey" PRIMARY KEY, btree (id)
    "commit_info_tag_key" UNIQUE CONSTRAINT, btree (tag)

I am sure that postgres should not use nested loops in this case, but is using them because of bad row estimates. Any help is highly appreciated.

Following are the experiments I tried.

Disabling index scan

 Nested Loop  (cost=734.59..84368.70 rows=1120 width=4) (actual time=274.694..934.965 rows=179524 loops=1)
   ->  Bitmap Heap Scan on commit_info  (cost=4.29..8.30 rows=1 width=4) (actual time=0.441..0.444 rows=1 loops=1)
         Recheck Cond: ((tag)::text = 'stop_sign_tbc_inertial_fix'::text)
         Heap Blocks: exact=1
         ->  Bitmap Index Scan on commit_info_tag_key  (cost=0.00..4.29 rows=1 width=0) (actual time=0.437..0.439 rows=1 loops=1)
               Index Cond: ((tag)::text = 'stop_sign_tbc_inertial_fix'::text)
   ->  Bitmap Heap Scan on reprocessed_videos  (cost=730.30..84347.51 rows=1289 width=8) (actual time=274.250..920.137 rows=179524 loops=1)
         Recheck Cond: (commit_id = commit_info.id)
         Filter: ((reprocess_type_id = 28) AND (classification_crop_type_id = 0) AND ((reprocess_status)::text = 'success'::text))
         Rows Removed by Filter: 1190
         Heap Blocks: exact=5881
         ->  Bitmap Index Scan on ix_reprocessed_videos_commit_id  (cost=0.00..729.98 rows=25256 width=0) (actual time=273.534..273.534 rows=180714 loops=1)
               Index Cond: (commit_id = commit_info.id)
 Planning Time: 0.413 ms
 Execution Time: 941.874 ms

I also set updated the statistics for the commit_id column. I observe a approximately 3x speed increase.

On trying to disable bitmapscan, the query does a sequential scan and takes 19 seconds to run

CodePudding user response：

The nested loop is the perfect join strategy, because there is only one row from commit_info. Any other join strategy would lose.

The question is if the index scan on reprocessed_videos is really too slow. To experiment, try again after SET enable_indexscan = off; to get a bitmap index scan and see if that is better. Then also SET enable_bitmapscan = off; to get a sequential scan. I suspect that your current plan will win, but the bitmap index scan has a good chance.

If the bitmap index scan is better, you should indeed try to improve the estimate:

ALTER TABLE reprocessed_videos ALTER commit_id SET STATISTICS 1000;
ANALYZE reprocessed_videos;

You can try with other values; pick the lowest that gives you a good enough estimate.

Another thing to try are extended statistics:

CREATE STATISTICS corr (dependencies)
   ON (reprocess_type_id, classification_crop_type_id, reprocess_status)
   FROM reprocessed_videos;

ANALYZE reprocessed_videos;

Perhaps you don't need even all three columns in there; play with it.

If the bitmap index scan does not offer enough benefit, there is one way how you can speed up the current index scan:

CLUSTER reprocessed_videos USING ix_reprocessed_videos_commit_id;

That rewrites the table in index order (and blocks concurrent access while it is running, so be careful!). After that, the index scan could be considerably faster. However, the order is not maintained, so you'll have to repeat the CLUSTER occasionally if enough of the table has been modified.

CodePudding user response：

Create a covering index; one that has all the condition columns (first, in descending order of cardinality) and the value columns (last) needed for you query, which means the index alone can be used - avoiding accessing the table:

create index covering_index on reprocessed_videos(
    reprocess_type_id,
    classification_crop_type_id,
    reprocess_status,
    commit_id,
    video_id
);

And ensure there's one on commit_info(id) too - indexes are not automatically defined in postgres, even for primary keys:

create index commit_info__id on commit_info(id);

To get more accurate query plans, you can manually set the cardinality of condition columns, for example:

select count(distinct reprocess_type_id) from reprocessed_videos;

Then set that value to the column:

alter table reprocessed_videos alter column reprocess_type_id set (n_distinct = number_from_above_query)