I have a query that takes several seconds in my staging environment but a few minutes in my production environment. The query itself is identical; production simply has many more rows in each table (approx. 10x).
The goal of the query is to find all distinct software records which are installed on at least one of the assets in the search results (NOTE: the search criteria can vary across dozens of fields).
The tables:
- Assets: dozens of fields on which a user can search - production has several million records
- InstalledSoftwares: asset_id (reference), software_id (reference) - each asset has 10-100 installed software records, so there are tens of millions of records in production.
- Softwares: the results. Production has fewer than 4,000 unique software records.
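For reference, here is a rough sketch of the relevant schema (simplified; the column types are my best guess, and the real Assets table has dozens more searchable columns):

CREATE TABLE softwares (
    id bigserial PRIMARY KEY
    -- name, version, etc. omitted
);

CREATE TABLE assets (
    id bigserial PRIMARY KEY,
    assettype_id bigint,
    archive_number text,
    archived_at timestamptz,
    expired boolean,
    local_status_id bigint,
    scorable boolean
    -- dozens of other searchable columns omitted
);

CREATE TABLE installed_softwares (
    id bigserial PRIMARY KEY,
    asset_id bigint REFERENCES assets (id),
    software_id bigint REFERENCES softwares (id)
);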
I've removed duplicate WHERE clauses based on the suggestions from Laurenz Albe. How can I make this more efficient?
The Query:
SELECT DISTINCT "softwares"."id" FROM "softwares"
INNER JOIN "installed_softwares" ON "installed_softwares"."software_id" = "softwares"."id"
WHERE "installed_softwares"."asset_id" IN
(SELECT "assets"."id" FROM "assets"
WHERE "assets"."assettype_id" = 3
AND "assets"."archive_number" IS NULL
AND "assets"."expired" = FALSE
AND "assets"."local_status_id" != 4
AND "assets"."scorable" = TRUE)
Here is the EXPLAIN (analyze, buffers) for the query:
Unique  (cost=153558.09..153588.92 rows=5524 width=8) (actual time=4224.203..5872.293 rows=3525 loops=1)
  Buffers: shared hit=112428 read=187, temp read=3145 written=3164
  I/O Timings: read=2.916
  ->  Sort  (cost=153558.09..153573.51 rows=6165 width=8) (actual time=4224.200..5065.158 rows=1087807 loops=1)
        Sort Key: softwares.id
        Sort Method: external merge  Disk: 19240kB
        Buffers: shared hit=112428 read=187, temp read=3145 written=3164
        I/O Timings: read=2.916
        ->  Hash Join  (cost=119348.05..153170.01 rows=6165 width=8) (actual time=342.860..3159.458 rows=1087807 loops=1)
              Hash Cond: (installed_softwares.software_id = softwares.id)
              Buffers: shared hit=112428 read=187
              I/O Timings: read=2.916
              ->  Gather  (cost=119119.76..152925.53 rows=6165 width=8) (actual time=333.981..1320.277 rows=1087807 loops=1)
                    Workers Planned: 2
                    Workers Launched: 2
                    Buffers: shared hit=112324 read=187
                    I/O Timings: read=2.916
                    ->  Parallel Hash Join  (cost=118119.76..151309.03 rows=2569 width=8) (actual time=331.836..2010.171 rows=362602 loops=3)
                          Hash Cond: (installed_softwares.asset_id = assets.id)
                          Buffers: shared hit=112324 read=187
                          I/O Timings: read=2.916
                          ->  Parallel Seq Scan on installed_softwares  (cost=0.00..30518.88 rows=1017288 width=16) (actual time=0.007..667.564 rows=813396 loops=3)
                                Buffers: shared hit=20159 read=187
                                I/O Timings: read=2.916
                          ->  Parallel Hash  (cost=118065.04..118065.04 rows=4378 width=4) (actual time=331.648..331.651 rows=23407 loops=3)
                                Buckets: 131072 (originally 16384)  Batches: 1 (originally 1)  Memory Usage: 4672kB
                                Buffers: shared hit=92058
                                ->  Parallel Seq Scan on assets  (cost=0.00..118065.04 rows=4378 width=4) (actual time=0.012..302.134 rows=23407 loops=3)
                                      Filter: ((archive_number IS NULL) AND (NOT expired) AND scorable AND (local_status_id <> 4) AND (assettype_id = 3))
                                      Rows Removed by Filter: 1363624
                                      Buffers: shared hit=92058
              ->  Hash  (cost=159.24..159.24 rows=5524 width=8) (actual time=8.859..8.862 rows=5546 loops=1)
                    Buckets: 8192  Batches: 1  Memory Usage: 281kB
                    Buffers: shared hit=104
                    ->  Seq Scan on softwares  (cost=0.00..159.24 rows=5524 width=8) (actual time=0.006..4.426 rows=5546 loops=1)
                          Buffers: shared hit=104
Planning Time: 0.534 ms
Execution Time: 5878.476 ms
CodePudding user response:
The problem is the bad row count estimate for the scan on assets. The estimate is so bad (48 rows instead of 23407) because the condition is somewhat redundant:
archived_at IS NULL AND archive_number IS NULL AND
NOT expired AND
scorable AND
local_status_id <> 4 AND local_status_id <> 4 AND
assettype_id = 3
PostgreSQL treats all these conditions as statistically independent, which leads it astray. One of the conditions (local_status_id <> 4) is present twice; remove one copy. The first two conditions seem somewhat redundant too; perhaps one of them can be omitted.
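If the remaining columns are still correlated after that, another option is multivariate extended statistics on the filtered columns, so that PostgreSQL no longer has to assume independence. A sketch only (the statistics name is arbitrary, and functional-dependency statistics are applied only to equality clauses, so conditions like local_status_id <> 4 or archive_number IS NULL may not benefit):

-- teach the planner about correlations between the search columns
CREATE STATISTICS assets_search_stats (dependencies, mcv)
    ON assettype_id, expired, scorable, local_status_id
    FROM assets;

-- recollect statistics so the new extended statistics are populated
ANALYZE assets;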
Perhaps that is enough to improve the estimate so that PostgreSQL does not choose the slow nested loop join.
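A quick way to check whether the estimate has improved is to run just the assets part of the query (conditions copied from the question) and compare the estimated and actual row counts:

EXPLAIN (ANALYZE, BUFFERS)
SELECT id
FROM assets
WHERE assettype_id = 3
  AND archive_number IS NULL
  AND NOT expired
  AND scorable
  AND local_status_id <> 4;

If the rows= estimate on that sequential scan comes close to the actual count, the planner has a much better chance of choosing a good join strategy for the complete query.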