Sql: Avoiding external merge from group by in postgres


I have a fairly simple query that gets the top ten scores from a large table (just under 10 million rows) and returns them in descending order. The scores are sums produced by a group by clause, and that group by appears to be the expensive step: the plan shows Sort Method: external merge  Disk: 190080kB.

Is there some way I can speed this up? I already have indices on user.test_id and user.score (descending). I'd prefer not to change work_mem as I have limited control over our postgres settings.
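
For completeness: I know that if the settings ever do become adjustable, work_mem can be raised for just one session or transaction rather than globally. A minimal sketch of that, with a purely illustrative value:

begin;
-- Raise work_mem only for this transaction so the group-by sort can fit in memory.
-- 256MB is an illustrative figure, not a tuned recommendation.
set local work_mem = '256MB';
-- ... run the aggregate query here ...
commit;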

Query:

select
    (select test from test where top_scores.test_id = test.id),
    (select type from test where top_scores.test_id = test.id),
    sum_score
from (select sum(score) as sum_score,
             test_id
      from user
      group by test_id
      order by sum_score desc
      limit 10
     ) top_scores

Query Plan:

Subquery Scan on top_scores  (cost=1412662.62..1412831.69 rows=10 width=16) (actual time=164098.107..164098.714 rows=10 loops=1)
  ->  Limit  (cost=1412662.62..1412662.64 rows=10 width=16) (actual time=164098.042..164098.144 rows=10 loops=1)
        ->  Sort  (cost=1412662.62..1419366.96 rows=2681736 width=16) (actual time=164098.033..164098.067 rows=10 loops=1)
              Sort Key: (sum(user.score)) DESC
              Sort Method: top-N heapsort  Memory: 25kB
              ->  GroupAggregate  (cost=1271799.65..1354711.27 rows=2681736 width=16) (actual time=72815.313..152605.093 rows=2499234 loops=1)
                    Group Key: user.test_id
                    ->  Sort  (cost=1271799.65..1290497.74 rows=7479234 width=16) (actual time=72815.273..107823.507 rows=7479234 loops=1)
                          Sort Key: user.test_id
                          Sort Method: external merge  Disk: 190080kB
                          ->  Seq Scan on user  (cost=0.00..162238.34 rows=7479234 width=16) (actual time=0.009..33795.669 rows=7479234 loops=1)
  SubPlan 1
    ->  Index Scan using test_id_idx on test  (cost=0.43..8.45 rows=1 width=14) (actual time=0.012..0.016 rows=1 loops=10)
          Index Cond: (top_scores.test_id = id)
  SubPlan 2
    ->  Index Scan using test_id_idx on test test_1  (cost=0.43..8.45 rows=1 width=3) (actual time=0.006..0.010 rows=1 loops=10)
          Index Cond: (top_scores.test_id = id)
Planning time: 0.724 ms
Execution time: 164135.458 ms

CodePudding user response:

Try a single join instead of the two correlated subqueries:

select test, type, sum_score
from test
join (select sum(score) as sum_score,
             test_id
      from user
      group by test_id
      order by sum_score desc
      limit 10
     ) top_scores
on test.id = top_scores.test_id

CodePudding user response:

I wouldn't expect separate indexes on test_id and score to help here. But a multicolumn index on (test_id, score) should allow an index-only scan and so avoid the sort. If it doesn't help immediately, VACUUM the table so the visibility map bits get set.
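
A sketch of what that could look like (the index name is made up; the table and column names are taken from the question):

-- Multicolumn index covering the aggregate, so the group by can read test_id and score
-- straight from the index in order, via an index-only scan:
create index user_test_id_score_idx on "user" (test_id, score);

-- Set the visibility map bits so the index-only scan doesn't have to visit the heap:
vacuum "user";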

Also, your hardware appears to be spectacularly bad, or maybe just extremely overloaded.
