Postgres: Group by is extremely slow


I have an 800 GB table. I have tried to create a minimal example that reproduces the issue. You can recreate it with:

create table test(id int, date timestamp without time zone);
insert into test
  (select trunc(random() * 10000 + 1),
          NOW() + (random() * (NOW() + '90 days' - NOW())) + '30 days'
   from generate_series(1, 1000000));
create index on test(id, date);

I have a table with the following schema.

                          Table "public.test"
 Column |            Type             | Collation | Nullable | Default 
--------+-----------------------------+-----------+----------+---------
 id     | integer                     |           |          | 
 date   | timestamp without time zone |           |          | 
Indexes:
    "test_id_date_idx" btree (id, date)

I am trying to find the latest timestamp for each id using the following.

select id, max(date) from test group by id;

Following is the execution plan

 Finalize HashAggregate  (cost=14852.44..14952.25 rows=9981 width=12) (actual time=136.638..137.846 rows=10000 loops=1)
   Group Key: id
   Buffers: shared hit=5406
   ->  Gather  (cost=12656.62..14752.63 rows=19962 width=12) (actual time=125.804..130.419 rows=30000 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared hit=5406
         ->  Partial HashAggregate  (cost=11656.62..11756.43 rows=9981 width=12) (actual time=120.528..121.882 rows=10000 loops=3)
               Group Key: id
               Buffers: shared hit=5406
               ->  Parallel Seq Scan on test  (cost=0.00..9573.08 rows=416708 width=12) (actual time=0.015..31.523 rows=333367 loops=3)
                     Buffers: shared hit=5406

My first question: why doesn't the query use the index, since the index already has everything sorted?

If I filter on id, the planner switches to an index scan. For example:

select id, max(date) from test where id in (1000 random ids) group by id;

Following is the query plan.


 GroupAggregate  (cost=0.42..27941.65 rows=9981 width=12) (actual time=0.902..98.836 rows=947 loops=1)
   Group Key: id
   ->  Index Only Scan using test_id_date_idx on test  (cost=0.42..27342.16 rows=99936 width=12) (actual time=0.649..86.269 rows=95712 loops=1)
         Index Cond: (id = ANY ('{thousand random ids}'::integer[]))
         Heap Fetches: 95712
 Planning Time: 2.948 ms
 Execution Time: 98.961 ms

But this query is doing a lot of heap fetches. I don't understand why this is happening, because if I get the max(date) for a single id it is very fast and requires very little I/O. Example:

postgres=# explain analyze select max(date) from test where id = 3072;
                                                                      QUERY PLAN                                                                       
-------------------------------------------------------------------------------------------------------------------------------------------------------
 Result  (cost=4.48..4.49 rows=1 width=8) (actual time=0.177..0.179 rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Limit  (cost=0.42..4.48 rows=1 width=8) (actual time=0.168..0.169 rows=1 loops=1)
           ->  Index Only Scan Backward using test_id_date_idx on test  (cost=0.42..406.42 rows=100 width=8) (actual time=0.165..0.166 rows=1 loops=1)
                 Index Cond: ((id = 3072) AND (date IS NOT NULL))
                 Heap Fetches: 1
 Planning Time: 0.327 ms
 Execution Time: 0.241 ms
(8 rows)


I analyzed the table before running any of these queries. All these numbers are amplified on my 800 GB table.

CodePudding user response:

You didn't disclose your PostgreSQL version, but it is probably an older one.

To avoid the heap fetches, VACUUM the table; that updates the visibility map, and PostgreSQL may then choose an index-only scan. But since the table has no columns beyond those in the index, scanning the table is about as cheap as scanning the index, so the planner may still prefer a sequential scan.
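A minimal sketch of that suggestion, using the table from the question (the plan actually chosen will depend on your PostgreSQL version and statistics):

```sql
-- Update the visibility map so index-only scans can skip heap fetches.
VACUUM (ANALYZE) test;

-- Re-check the plan: if an index-only scan is chosen, the
-- "Heap Fetches" counter should now be at or near zero.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, max(date) FROM test GROUP BY id;
```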

A hash aggregate doesn't need sorted input, so the index's sort order gives the plain GROUP BY no advantage.
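The per-id query in the question is fast precisely because it can probe the index backwards and stop at the first row. A standard way to get that behavior for every id is to emulate a "loose index scan" with a recursive CTE; PostgreSQL does not do this automatically, and this sketch is not from the original answer:

```sql
-- Emulate a loose index scan over (id, date): one backward index
-- probe per distinct id instead of scanning all million rows.
WITH RECURSIVE t AS (
    -- Highest id, together with its latest date.
    (SELECT id, date FROM test ORDER BY id DESC, date DESC LIMIT 1)
    UNION ALL
    -- For each id found, jump to the next lower id and its latest date.
    SELECT s.id, s.date
    FROM t
    CROSS JOIN LATERAL (
        SELECT id, date FROM test
        WHERE id < t.id
        ORDER BY id DESC, date DESC
        LIMIT 1
    ) s
)
SELECT id, date AS max_date FROM t;
```

Each step is an index probe like the `id = 3072` example, so the cost scales with the number of distinct ids rather than the table size.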
