Postgres: Group by is extremely slow-CodePudding

I have a 800 gb table. I have tried to create a minimal example to replicate things.
You can also create the same using.

create table test(id int, date timestamp without time zone);
insert into test  (select trunc(random() * 10000   1), NOW()   (random() * (NOW() '90 days' - NOW()))   '30 days' from  generate_series(1, 1000000));
create index on test(id, date);

I have a table with the following schema.

                          Table "public.test"
 Column |            Type             | Collation | Nullable | Default 
-------- ----------------------------- ----------- ---------- ---------
 id     | integer                     |           |          | 
 date   | timestamp without time zone |           |          | 
Indexes:
    "test_id_date_idx" btree (id, date)

I am trying to find the latest timestamp for each id using the following.

select id, max(date) from test group by id;

Following is the execution plan

 Finalize HashAggregate  (cost=14852.44..14952.25 rows=9981 width=12) (actual time=136.638..137.846 rows=10000 loops=1)
   Group Key: id
   Buffers: shared hit=5406
   ->  Gather  (cost=12656.62..14752.63 rows=19962 width=12) (actual time=125.804..130.419 rows=30000 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared hit=5406
         ->  Partial HashAggregate  (cost=11656.62..11756.43 rows=9981 width=12) (actual time=120.528..121.882 rows=10000 loops=3)
               Group Key: id
               Buffers: shared hit=5406
               ->  Parallel Seq Scan on test  (cost=0.00..9573.08 rows=416708 width=12) (actual time=0.015..31.523 rows=333367 loops=3)
                     Buffers: shared hit=5406

My first question is that why doesn't the query use an index? (since the index will already have everything sorted)

If I filter on id, the planner shows it started using an index. For example.

select id, max(date) from test where id in (1000 random ids) group by id;

Following is the query plan.


 GroupAggregate  (cost=0.42..27941.65 rows=9981 width=12) (actual time=0.902..98.836 rows=947 loops=1)
   Group Key: id
   ->  Index Only Scan using test_id_date_idx on test  (cost=0.42..27342.16 rows=99936 width=12) (actual time=0.649..86.269 rows=95712 loops=1)
         Index Cond: (id = ANY ('{thousand random ids}'::integer[]))
         Heap Fetches: 95712
 Planning Time: 2.948 ms
 Execution Time: 98.961 ms

But this query is doing a lot of heap fetches. I don't understand why is this happening because if I get the max(date) for a particular id it is pretty fast and takes very low IO. Example

postgres=# explain analyze select max(date) from test where id = 3072;
                                                                      QUERY PLAN                                                                       
-------------------------------------------------------------------------------------------------------------------------------------------------------
 Result  (cost=4.48..4.49 rows=1 width=8) (actual time=0.177..0.179 rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Limit  (cost=0.42..4.48 rows=1 width=8) (actual time=0.168..0.169 rows=1 loops=1)
           ->  Index Only Scan Backward using test_id_date_idx on test  (cost=0.42..406.42 rows=100 width=8) (actual time=0.165..0.166 rows=1 loops=1)
                 Index Cond: ((id = 3072) AND (date IS NOT NULL))
                 Heap Fetches: 1
 Planning Time: 0.327 ms
 Execution Time: 0.241 ms
(8 rows)

I ananlyzed the table before any queries. All these numbers get amplified on my 800 gb table.

CodePudding user response：

You didn't disclose your PostgreSQL version, but it is probably an older one.

To avoid the heap fetches, VACUUM the table. Perhaps then PostgreSQL will choose an index-only scan. But since the table doesn't have more columns than the index, it might also prefer a sequential scan.

For a hash aggregate, the table doesn't have to be sorted.