An app that I'm working with runs a query like this:
SELECT
"categories"."id"
FROM
"categories"
WHERE
(
('f' )
OR categories.id NOT IN
(
SELECT
category_id
FROM
category_groups
)
);
This query takes an extremely long time to finish, and I haven't seen it finish yet. Our "categories" table has 65947 rows and our "category_groups" table has 131,780 rows.
When running "EXPLAIN" on this query, it says that this query will "cost" a ton of work (cost=1000.29..109033834.49).
I edited this query to remove that ('f') OR
section of the clause, as shown in the below query:
SELECT
"categories"."id"
FROM
"categories"
WHERE
(
categories.id NOT IN
(
SELECT
category_id
FROM
category_groups
)
);
This query finishes very quickly, and it's "cost" when EXPLAIN-ing it is much lower (6283.94..10190.09).
Similarly, if I replace the ('f') OR
with ('t') OR
, the query completes quickly, and the cost goes down (back to 6283.94..10190.09).
Why would adding that ('f') OR
clause damage this query's performance so much?
Edit:
Here's the full EXPLAIN (VERBOSE) for the query with ('f') OR
Gather (cost=1000.29..109033834.49 rows=32952 width=4)
Output: categories.id
Workers Planned: 1
-> Parallel Index Only Scan using categories_pkey on public.categories (cost=0.29..109029539.29 rows=19384 width=4)
Output: categories.id
Filter: (NOT (SubPlan 1))
SubPlan 1
-> Materialize (cost=0.00..5295.43 rows=131762 width=4)
Output: category_groups.category_id
-> Seq Scan on public.category_groups (cost=0.00..4121.62 rows=131762 width=4)
Output: category_groups.category_id
And here's the full explain for the query without ('f') OR
:
Hash Anti Join (cost=6283.94..10190.09 rows=1131 width=4)
Output: categories.id
Hash Cond: (categories.id = category_groups.category_id)
-> Index Only Scan using categories_pkey on public.categories (cost=0.29..2213.44 rows=65903 width=4)
Output: categories.id
-> Hash (cost=4121.62..4121.62 rows=131762 width=4)
Output: category_groups.category_id
-> Seq Scan on public.category_groups (cost=0.00..4121.62 rows=131762 width=4)
Output: category_groups.category_id
CodePudding user response:
The FALSE OR
prevents PostgreSQL from optimizing your NOT IN
condition to an anti-join. The reason is that the PostgreSQL optimizer doesn't think hard enough for that. However, since it is easy to rewrite the query to do better (simple remove the FALSE OR
), there is little reason to make the optimizer smarter. A smart optimizer is slow, and the gain of optimizing an ill-written query better does not outweigh the price that everyone would have to pay.