Query Optimization with WHERE condition and a single JOIN-CodePudding

I have 2 tables with one-to-many relationship.

Users-> 1 million (1)
Requests-> 10 millions (n)

What I'm trying to do, is to fetch the user alongside with the latest request made; and be able to filter the whole dataset based on the (last) request columns.

The current query is fetching the correct results but it is painfully slow. ~7-9 seconds

SELECT *
FROM users AS u
         INNER JOIN requests AS r
                    ON u.id = r.user_id
WHERE (r.created_at = u.last_request_date AND r.ignored = false)
ORDER BY u.last_request_date DESC
LIMIT 10 OFFSET 0

I have also tried to JOIN the r.created_at as a second ON condition instead of filtering on the WHERE statement, but without a difference in performance.

UPDATE:

Indexes:

Users: last_request_date
Requests: created_at, user_id(foreign)

Execution plan: https://explain.depesz.com/s/JsLr#source

Execution plan:
    Limit  (cost=1000.88..21080.19 rows=10 width=139) (actual time=15966.670..15990.322 rows=10 loops=1)
      Buffers: shared hit=3962420 read=152361
      ->  Gather Merge  (cost=1000.88..757990.77 rows=377 width=139) (actual time=15966.653..15990.138 rows=10 loops=1)
            Workers Planned: 2
            Workers Launched: 2
            Buffers: shared hit=3962420 read=152361
            ->  Nested Loop  (cost=0.86..756947.24 rows=157 width=139) (actual time=9456.384..10622.180 rows=7 loops=3)
                  Buffers: shared hit=3962420 read=152361
    "              ->  Parallel Index Scan Backward using users_last_request_date on users ""User""  (cost=0.42..55742.72 rows=420832 width=75) (actual time=0.061..2443.484 rows=333340 loops=3)"
                        Buffers: shared hit=5102 read=15849
                  ->  Index Scan using requests_user_id on requests  (cost=0.43..1.66 rows=1 width=64) (actual time=0.010..0.010 rows=0 loops=1000019)
    "                    Index Cond: (user_id = ""User"".id)"
    "                    Filter: ((NOT ignored) AND (""User"".last_request_date = created_at))"
                        Rows Removed by Filter: 10
                        Buffers: shared hit=3957318 read=136512
    Planning Time: 0.745 ms
    Execution Time: 15990.489 ms

CodePudding user response：

Try creating this BTREE index to handle the requests table lookup more efficiently.

CREATE INDEX id_ignored_date ON requests (user_id, ignored, created_at);

Your plan says

->  Index Scan using requests_user_id on requests 
       (cost=0.43..1.66 rows=1 width=64) (actual time=0.010..0.010 rows=0 loops=1000019)
    "Index Cond: (user_id = ""User"".id)"
    "Filter: ((NOT ignored) AND (""User"".last_request_date = created_at))"

and this index will move the Filter conditions into the Index Cond, which should be faster.

Pro tip: @Kendle is right. Don't use SELECT * in production software, especially performance-sensitive software, unless you have a good reason. It makes your RDBMS server, network, and client program work harder for no good reason.

CodePudding user response：

As you only want to last 10 users I would suggest that we only fetch the last 100 records from requests. This lmay avoid a million join comparisons, to test as the query optimiser may already be doing this.
This number should be modified according to your application. It may be that the last 10 records will always be 10 different users or that we need to fetch more than 100 to be sure of having 10 users.

SELECT *
FROM users AS u
         INNER JOIN (select * from requests
                    where r.ignored = false
                    order by created_at desc
                    limit 100) AS r
                    ON u.id = r.user_id
WHERE (r.created_at = u.last_request_date)
ORDER BY u.last_request_date DESC
LIMIT 10 OFFSET 0

CodePudding user response：

The biggest bottleneck from your execution plan was this part, requests might add one more column created_at to the index (because there is a filter cost)

->  Index Scan using requests_user_id on requests  (cost=0.43..1.66 rows=1 width=64) (actual time=0.010..0.010 rows=0 loops=1000019)
"                    Index Cond: (user_id = ""User"".id)"
"                    Filter: ((NOT ignored) AND (""User"".last_request_date = created_at))"
    Rows Removed by Filter: 10
    Buffers: shared hit=3957318 read=136512

so you might try to create an index like the below.

CREATE INDEX IX_requests ON requests (
   user_id,
   created_at 
);

If the ignored = false is a small amount on the requests table you can try to use Partial Indexes which might help you reduce your storage and improve your index performance.

CREATE INDEX FIX_requests ON requests (
   user_id,
   created_at 
)
WHERE ignored = false;

On the other thing, I would use an index as below for users table because there is an order by on the last_request_date column and users table join request table by id

CREATE INDEX IX_users ON users (
   last_request_date,
   id 
);

NOTE

I would avoid using SELECT * because it might cost more than IO we might not need use select all columns from the table in most scenes.