There are 2 tables, main_table
and other_table
. A user is allowed access to main_table
items if his user_id
matches the main_table.user_id
column or if there is an entry in other_table
where his user_id
is in the read_acl
column (array using gin index). Both tables have about 2 million rows.
The query is pretty slow especially, it seems Postgresql is doing an index scan on all entries in the main_table
. If I remove the or
clause and rewrite it to 2 queries then the speed is much better. Can this be solved somehow?
Current query is this:
select *
from main_table
where user_id = 123 or exists (select * from other_table f
where main_table.main_table_id = f.main_table_id
and '{123}' && read_acl)
order by main_table_id limit 10;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.43..170.97 rows=10 width=2590) (actual time=172.734..389.819 rows=10 loops=1)
-> Index Scan using main_table_pk on main_table (cost=0.43..14158795.44 rows=830198 width=2590) (actual time=172.733..389.811 rows=10 loops=1)
Filter: ((user_id = 123) OR (alternatives: SubPlan 1 or hashed SubPlan 2))
Rows Removed by Filter: 776709
SubPlan 1
-> Index Scan using other_table_main_table_id_idx on other_table f (cost=0.43..8.45 rows=1 width=0) (never executed)
Index Cond: (main_table.main_table_id = main_table_id)
Filter: ('{123}'::text[] && read_acl)
SubPlan 2
-> Bitmap Heap Scan on other_table f_1 (cost=1678.58..2992.04 rows=333 width=8) (actual time=9.413..9.432 rows=12 loops=1)
Recheck Cond: ('{123}'::text[] && read_acl)
Heap Blocks: exact=12
-> Bitmap Index Scan on other_table_read_acl (cost=0.00..1678.50 rows=333 width=0) (actual time=9.401..9.401 rows=12 loops=1)
Index Cond: ('{123}'::text[] && read_acl)
Planning Time: 0.395 ms
Execution Time: 389.877 ms
(16 rows)
Rewritten query 1 first part of OR clause:
select *
from main_table
where user_id = 123
order by main_table_id limit 10;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
Limit (cost=482.27..482.29 rows=10 width=2590) (actual time=0.039..0.040 rows=8 loops=1)
-> Sort (cost=482.27..482.58 rows=126 width=2590) (actual time=0.038..0.039 rows=8 loops=1)
Sort Key: main_table_id
Sort Method: quicksort Memory: 25kB
-> Bitmap Heap Scan on main_table (cost=5.40..479.54 rows=126 width=2590) (actual time=0.020..0.031 rows=8 loops=1)
Recheck Cond: (user_id = 123)
Heap Blocks: exact=8
-> Bitmap Index Scan on test500 (cost=0.00..5.37 rows=126 width=0) (actual time=0.015..0.015 rows=8 loops=1)
Index Cond: (user_id = 123)
Planning Time: 0.130 ms
Execution Time: 0.066 ms
(11 rows)
Rewritten query 2:
select *
from main_table
where exists (select * from other_table f
where main_table.main_table_id = f.main_table_id
and '{123}' && read_acl)
order by main_table_id limit 10;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=5771.59..5771.62 rows=10 width=2590) (actual time=8.083..8.086 rows=10 loops=1)
-> Sort (cost=5771.59..5772.42 rows=333 width=2590) (actual time=8.082..8.083 rows=10 loops=1)
Sort Key: main_table.main_table_id
Sort Method: quicksort Memory: 26kB
-> Nested Loop (cost=2985.30..5764.40 rows=333 width=2590) (actual time=8.018..8.072 rows=12 loops=1)
-> HashAggregate (cost=2984.88..2988.21 rows=333 width=8) (actual time=7.999..8.004 rows=12 loops=1)
Group Key: f.main_table_id
-> Bitmap Heap Scan on other_table f (cost=1670.58..2984.04 rows=333 width=8) (actual time=7.969..7.990 rows=12 loops=1)
Recheck Cond: ('{123}'::text[] && read_acl)
Heap Blocks: exact=12
-> Bitmap Index Scan on other_table_read_acl (cost=0.00..1670.50 rows=333 width=0) (actual time=7.957..7.958 rows=12 loops=1)
Index Cond: ('{123}'::text[] && read_acl)
-> Index Scan using main_table_pk on main_table (cost=0.43..8.34 rows=1 width=2590) (actual time=0.005..0.005 rows=1 loops=12)
Index Cond: (main_table_id = f.main_table_id)
Planning Time: 0.431 ms
Execution Time: 8.137 ms
(16 rows)
CodePudding user response:
Well, it looks like this is how the optimizer works with the OR
condition. It is not smart enough to split the single query into two automatically. Or, maybe, it knows about such transformation, but decided that in this case it won't help. You know better, so guide the optimizer.
You can run two queries in your code explicitly, or use two CTEs and UNION
results (UNION
, not UNION ALL
). Something like this:
WITH
CTE1
AS
(
select *
from main_table
where user_id = 123
order by main_table_id limit 10
)
,CTE2
AS
(
select *
from main_table
where exists (select * from other_table f
where main_table.main_table_id = f.main_table_id
and '{123}' && read_acl)
order by main_table_id limit 10
)
SELECT *
FROM CTE1
UNION
SELECT *
FROM CTE2
order by main_table_id limit 10;
If there was no limit 10
in each of the simple queries you could just put everything in a big UNION
query without CTEs. You'd better try with and without CTEs and compare execution plans and performance.
CodePudding user response:
That is as expected. OR
in a WHERE
condition often is a problem for the optimizer. Yes, you can usually rewrite the query to a UNION
or UNION ALL
of the partial queries you show in your question, but this is not something that the optimizer can do automatically.
Consider this example:
CREATE TABLE a (
id integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
x integer,
p integer
);
CREATE TABLE b (
id integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
x integer,
q integer
);
INSERT INTO a (x, p) VALUES
(1, 1),
(1, 1),
(2, 1);
INSERT INTO b (x, q) VALUES
(1, 3),
(2, 3);
Now the query with OR
gives you:
SELECT x, a.p, b.q
FROM a JOIN b USING (x)
WHERE a.p = 1 OR b.q = 3;
x │ p │ q
═══╪═══╪═══
1 │ 1 │ 3
1 │ 1 │ 3
2 │ 1 │ 3
(3 rows)
Rewriting the query with UNION
or UNION ALL
produces a different result:
SELECT x, a.p, b.q
FROM a JOIN b USING (x)
WHERE a.p = 1
UNION
SELECT x, a.p, b.q
FROM a JOIN b USING (x)
WHERE b.q = 3;
x │ p │ q
═══╪═══╪═══
2 │ 1 │ 3
1 │ 1 │ 3
(2 rows)
SELECT x, a.p, b.q
FROM a JOIN b USING (x)
WHERE a.p = 1
UNION ALL
SELECT x, a.p, b.q
FROM a JOIN b USING (x)
WHERE b.q = 3;
x │ p │ q
═══╪═══╪═══
1 │ 1 │ 3
1 │ 1 │ 3
2 │ 1 │ 3
1 │ 1 │ 3
1 │ 1 │ 3
2 │ 1 │ 3
(6 rows)
The case in your question uses SELECT *
, which will include the primary key, so in this case it would be safe to use UNION
. But that is reasoning that goes beyond the capabilities of the optimizer.