Django: Performance issues with querysets using m2m
I asked this question here, but didn't get an answer, so I'm reposting a more detailed question.
When I use ORDER BY with Count-aggregated values, for some reason the index is not used and the query takes a long time to execute.
The videos_video_tags table has about 1.3 million rows.
The following query takes approximately 500-800 ms:
SELECT "videos_tag"."id",
"videos_tag"."name",
COUNT("videos_video_tags"."video_id") AS "count"
FROM "videos_tag"
LEFT OUTER JOIN "videos_video_tags" ON ("videos_tag"."id" = "videos_video_tags"."tag_id")
GROUP BY "videos_tag"."id"
ORDER BY "count" DESC
LIMIT 100;
Removing ORDER BY "count" DESC from this SQL statement cuts the time to about 2-10 ms.
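That fast variant is just the statement above with only the sort dropped:
SELECT "videos_tag"."id", "videos_tag"."name", COUNT("videos_video_tags"."video_id") AS "count"
FROM "videos_tag"
LEFT OUTER JOIN "videos_video_tags" ON ("videos_tag"."id" = "videos_video_tags"."tag_id")
GROUP BY "videos_tag"."id"
LIMIT 100;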
Checking the details of the execution plan with EXPLAIN shows that the query with ORDER BY does not use the index.
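For reference, plans with this level of detail (per-node Output and Buffers lines) can be captured with something like the following; the exact option list is an assumption based on the output shown:
EXPLAIN (ANALYZE, VERBOSE, BUFFERS)
SELECT "videos_tag"."id", "videos_tag"."name", COUNT("videos_video_tags"."video_id") AS "count"
FROM "videos_tag"
LEFT OUTER JOIN "videos_video_tags" ON ("videos_tag"."id" = "videos_video_tags"."tag_id")
GROUP BY "videos_tag"."id"
ORDER BY "count" DESC
LIMIT 100;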
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=35198.66..35198.91 rows=100 width=37) (actual time=770.355..770.376 rows=100 loops=1)
Output: videos_tag.id, videos_tag.name, (count(videos_video_tags.video_id))
Buffers: shared hit=6928 read=4311
-> Sort (cost=35198.66..35212.53 rows=5548 width=37) (actual time=770.354..770.366 rows=100 loops=1)
Output: videos_tag.id, videos_tag.name, (count(videos_video_tags.video_id))
Sort Key: (count(videos_video_tags.video_id)) DESC
Sort Method: top-N heapsort Memory: 37kB
Buffers: shared hit=6928 read=4311
-> HashAggregate (cost=34931.14..34986.62 rows=5548 width=37) (actual time=766.050..768.090 rows=5548 loops=1)
Output: videos_tag.id, videos_tag.name, count(videos_video_tags.video_id)
Group Key: videos_tag.id
Batches: 1 Memory Usage: 977kB
Buffers: shared hit=6928 read=4311
-> Hash Right Join (cost=221.83..28246.14 rows=1337000 width=45) (actual time=2.840..497.697 rows=1337000 loops=1)
Output: videos_tag.id, videos_tag.name, videos_video_tags.video_id
Inner Unique: true
Hash Cond: (videos_video_tags.tag_id = videos_tag.id)
Buffers: shared hit=6928 read=4311
-> Seq Scan on public.videos_video_tags (cost=0.00..24512.00 rows=1337000 width=32) (actual time=0.008..109.061 rows=1337000 loops=1)
Output: videos_video_tags.id, videos_video_tags.video_id, videos_video_tags.tag_id
Buffers: shared hit=6831 read=4311
-> Hash (cost=152.48..152.48 rows=5548 width=29) (actual time=2.795..2.796 rows=5548 loops=1)
Output: videos_tag.id, videos_tag.name
Buckets: 8192 Batches: 1 Memory Usage: 399kB
Buffers: shared hit=97
-> Seq Scan on public.videos_tag (cost=0.00..152.48 rows=5548 width=29) (actual time=0.008..1.048 rows=5548 loops=1)
Output: videos_tag.id, videos_tag.name
Buffers: shared hit=97
Planning:
Buffers: shared hit=14
Planning Time: 0.497 ms
Execution Time: 770.812 ms
(32 rows)
Time: 772.336 ms
Without ORDER BY, the plan looks like this:
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.71..1689.61 rows=100 width=37) (actual time=0.069..9.664 rows=100 loops=1)
Output: videos_tag.id, videos_tag.name, (count(videos_video_tags.video_id))
Buffers: shared hit=7761
-> GroupAggregate (cost=0.71..93700.72 rows=5548 width=37) (actual time=0.069..9.647 rows=100 loops=1)
Output: videos_tag.id, videos_tag.name, count(videos_video_tags.video_id)
Group Key: videos_tag.id
Buffers: shared hit=7761
-> Merge Left Join (cost=0.71..86960.24 rows=1337000 width=45) (actual time=0.060..8.222 rows=11375 loops=1)
Output: videos_tag.id, videos_tag.name, videos_video_tags.video_id
Merge Cond: (videos_tag.id = videos_video_tags.tag_id)
Buffers: shared hit=7761
-> Index Scan using videos_tag_pkey on public.videos_tag (cost=0.28..635.50 rows=5548 width=29) (actual time=0.011..0.066 rows=101 loops=1)
Output: videos_tag.id, videos_tag.name, videos_tag.is_actress, videos_tag.created_at
Buffers: shared hit=102
-> Index Scan using videos_video_tags_tag_id_2673cfc8 on public.videos_video_tags (cost=0.43..69598.37 rows=1337000 width=32) (actual time=0.012..5.928 rows=11375 loops=1)
Output: videos_video_tags.id, videos_video_tags.video_id, videos_video_tags.tag_id
Buffers: shared hit=7659
Planning:
Buffers: shared hit=14
Planning Time: 0.364 ms
Execution Time: 9.734 ms
(21 rows)
Time: 10.639 ms
I think the indexes are also present without any problem:
public | videos_tag_name_key | index | postgres | videos_tag
public | videos_tag_pkey | index | postgres | videos_tag
public | videos_video_tags_pkey | index | postgres | videos_video_tags
public | videos_video_tags_tag_id_2673cfc8 | index | postgres | videos_video_tags
public | videos_video_tags_video_id_8220dbb8 | index | postgres | videos_video_tags
public | videos_video_tags_video_id_tag_id_f8d6ba70_uniq | index | postgres | videos_video_tags
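For what it's worth, a listing in that format comes from psql; something like the following should reproduce it (the videos_* pattern is an assumption):
\di videos_*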
I have spent quite a bit of time on this and still have not been able to solve it. What do you think could be the cause?
CodePudding user response:
The indexes are not used directly for selectivity. They are used to produce the rows already ordered by a field which is useful both for joining and for grouping. Once 100 groups have been emitted in some convenient (to the system) order, it can then stop, substantially early.
But with the ORDER BY, you can't order by the count across all groups unless you know the count for every group. There is no stopping-early opportunity. Since that is the main advantage of using the indexes here, once that opportunity is gone there is no reason to use the indexes anymore. The hash join is probably more efficient anyway when it has to run to completion.
So how do I do this? Since we are implementing pagination, we need to sort by count and get the top 100 or so.
Don't implement pagination in the database. 5548 rows is not that many; compute the counts once, send them all to the client or app server, and let it handle pagination on its side. And these counts won't change very quickly, so use a materialized view to store the summaries and re-compute it once an hour or so.
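A minimal sketch of that approach, assuming a view named videos_tag_counts (the name and the refresh cadence are illustrative, not prescribed):

-- Precompute the per-tag counts once.
CREATE MATERIALIZED VIEW videos_tag_counts AS
SELECT "videos_tag"."id", "videos_tag"."name", COUNT("videos_video_tags"."video_id") AS "count"
FROM "videos_tag"
LEFT OUTER JOIN "videos_video_tags" ON ("videos_tag"."id" = "videos_video_tags"."tag_id")
GROUP BY "videos_tag"."id";

-- A unique index is required to refresh without locking out readers.
CREATE UNIQUE INDEX ON videos_tag_counts (id);

-- Re-run this on a schedule (e.g. hourly) to update the summaries.
REFRESH MATERIALIZED VIEW CONCURRENTLY videos_tag_counts;

-- Reading the top tags is now trivial, whatever the ordering.
SELECT "id", "name", "count" FROM videos_tag_counts ORDER BY "count" DESC LIMIT 100;

The app-side pagination then amounts to slicing that small, precomputed result set.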