I have a 800 gb table. I have tried to create a minimal example to replicate things.
You can also create the same using.
create table test(id int, date timestamp without time zone);
insert into test (select trunc(random() * 10000 1), NOW() (random() * (NOW() '90 days' - NOW())) '30 days' from generate_series(1, 1000000));
create index on test(id, date);
I have a table with the following schema.
Table "public.test"
Column | Type | Collation | Nullable | Default
-------- ----------------------------- ----------- ---------- ---------
id | integer | | |
date | timestamp without time zone | | |
Indexes:
"test_id_date_idx" btree (id, date)
I am trying to find the latest timestamp for each id using the following.
select id, max(date) from test group by id;
Following is the execution plan
Finalize HashAggregate (cost=14852.44..14952.25 rows=9981 width=12) (actual time=136.638..137.846 rows=10000 loops=1)
Group Key: id
Buffers: shared hit=5406
-> Gather (cost=12656.62..14752.63 rows=19962 width=12) (actual time=125.804..130.419 rows=30000 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=5406
-> Partial HashAggregate (cost=11656.62..11756.43 rows=9981 width=12) (actual time=120.528..121.882 rows=10000 loops=3)
Group Key: id
Buffers: shared hit=5406
-> Parallel Seq Scan on test (cost=0.00..9573.08 rows=416708 width=12) (actual time=0.015..31.523 rows=333367 loops=3)
Buffers: shared hit=5406
My first question is that why doesn't the query use an index? (since the index will already have everything sorted)
If I filter on id
, the planner shows it started using an index. For example.
select id, max(date) from test where id in (1000 random ids) group by id;
Following is the query plan.
GroupAggregate (cost=0.42..27941.65 rows=9981 width=12) (actual time=0.902..98.836 rows=947 loops=1)
Group Key: id
-> Index Only Scan using test_id_date_idx on test (cost=0.42..27342.16 rows=99936 width=12) (actual time=0.649..86.269 rows=95712 loops=1)
Index Cond: (id = ANY ('{thousand random ids}'::integer[]))
Heap Fetches: 95712
Planning Time: 2.948 ms
Execution Time: 98.961 ms
But this query is doing a lot of heap fetches. I don't understand why is this happening because if I get the max(date) for a particular id it is pretty fast and takes very low IO. Example
postgres=# explain analyze select max(date) from test where id = 3072;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------
Result (cost=4.48..4.49 rows=1 width=8) (actual time=0.177..0.179 rows=1 loops=1)
InitPlan 1 (returns $0)
-> Limit (cost=0.42..4.48 rows=1 width=8) (actual time=0.168..0.169 rows=1 loops=1)
-> Index Only Scan Backward using test_id_date_idx on test (cost=0.42..406.42 rows=100 width=8) (actual time=0.165..0.166 rows=1 loops=1)
Index Cond: ((id = 3072) AND (date IS NOT NULL))
Heap Fetches: 1
Planning Time: 0.327 ms
Execution Time: 0.241 ms
(8 rows)
I ananlyzed
the table before any queries. All these numbers get amplified on my 800 gb table.
CodePudding user response:
You didn't disclose your PostgreSQL version, but it is probably an older one.
To avoid the heap fetches, VACUUM
the table. Perhaps then PostgreSQL will choose an index-only scan. But since the table doesn't have more columns than the index, it might also prefer a sequential scan.
For a hash aggregate, the table doesn't have to be sorted.