I have 2 tables with one-to-many
relationship.
- Users-> 1 million (1)
- Requests-> 10 millions (n)
What I'm trying to do, is to fetch the user alongside with the latest request
made; and be able to filter the whole dataset based on the (last) request
columns.
The current query is fetching the correct results but it is painfully slow. ~7-9 seconds
SELECT *
FROM users AS u
INNER JOIN requests AS r
ON u.id = r.user_id
WHERE (r.created_at = u.last_request_date AND r.ignored = false)
ORDER BY u.last_request_date DESC
LIMIT 10 OFFSET 0
I have also tried to JOIN
the r.created_at
as a second ON
condition instead of filtering on the WHERE
statement, but without a difference in performance.
UPDATE:
Indexes:
- Users: last_request_date
- Requests: created_at, user_id(foreign)
Execution plan: https://explain.depesz.com/s/JsLr#source
Execution plan:
Limit (cost=1000.88..21080.19 rows=10 width=139) (actual time=15966.670..15990.322 rows=10 loops=1)
Buffers: shared hit=3962420 read=152361
-> Gather Merge (cost=1000.88..757990.77 rows=377 width=139) (actual time=15966.653..15990.138 rows=10 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=3962420 read=152361
-> Nested Loop (cost=0.86..756947.24 rows=157 width=139) (actual time=9456.384..10622.180 rows=7 loops=3)
Buffers: shared hit=3962420 read=152361
" -> Parallel Index Scan Backward using users_last_request_date on users ""User"" (cost=0.42..55742.72 rows=420832 width=75) (actual time=0.061..2443.484 rows=333340 loops=3)"
Buffers: shared hit=5102 read=15849
-> Index Scan using requests_user_id on requests (cost=0.43..1.66 rows=1 width=64) (actual time=0.010..0.010 rows=0 loops=1000019)
" Index Cond: (user_id = ""User"".id)"
" Filter: ((NOT ignored) AND (""User"".last_request_date = created_at))"
Rows Removed by Filter: 10
Buffers: shared hit=3957318 read=136512
Planning Time: 0.745 ms
Execution Time: 15990.489 ms
CodePudding user response:
Try creating this BTREE index to handle the requests table lookup more efficiently.
CREATE INDEX id_ignored_date ON requests (user_id, ignored, created_at);
Your plan says
-> Index Scan using requests_user_id on requests
(cost=0.43..1.66 rows=1 width=64) (actual time=0.010..0.010 rows=0 loops=1000019)
"Index Cond: (user_id = ""User"".id)"
"Filter: ((NOT ignored) AND (""User"".last_request_date = created_at))"
and this index will move the Filter conditions into the Index Cond, which should be faster.
Pro tip: @Kendle is right. Don't use SELECT *
in production software, especially performance-sensitive software, unless you have a good reason. It makes your RDBMS server, network, and client program work harder for no good reason.
CodePudding user response:
As you only want to last 10 users I would suggest that we only fetch the last 100 records from requests. This lmay avoid a million join comparisons, to test as the query optimiser may already be doing this.
This number should be modified according to your application. It may be that the last 10 records will always be 10 different users or that we need to fetch more than 100 to be sure of having 10 users.
SELECT *
FROM users AS u
INNER JOIN (select * from requests
where r.ignored = false
order by created_at desc
limit 100) AS r
ON u.id = r.user_id
WHERE (r.created_at = u.last_request_date)
ORDER BY u.last_request_date DESC
LIMIT 10 OFFSET 0
CodePudding user response:
The biggest bottleneck from your execution plan was this part, requests
might add one more column created_at
to the index (because there is a filter cost)
-> Index Scan using requests_user_id on requests (cost=0.43..1.66 rows=1 width=64) (actual time=0.010..0.010 rows=0 loops=1000019)
" Index Cond: (user_id = ""User"".id)"
" Filter: ((NOT ignored) AND (""User"".last_request_date = created_at))"
Rows Removed by Filter: 10
Buffers: shared hit=3957318 read=136512
so you might try to create an index like the below.
CREATE INDEX IX_requests ON requests (
user_id,
created_at
);
If the ignored = false
is a small amount on the requests
table you can try to use Partial Indexes which might help you reduce your storage and improve your index performance.
CREATE INDEX FIX_requests ON requests (
user_id,
created_at
)
WHERE ignored = false;
On the other thing, I would use an index as below for users
table because there is an order by on the last_request_date
column and users
table join request
table by id
CREATE INDEX IX_users ON users (
last_request_date,
id
);
NOTE
I would avoid using SELECT *
because it might cost more than IO
we might not need use select all columns from the table in most scenes.