Performance on aggregate over 10 million records slow in Postgres


I'm trying to find out why a simple aggregate query performs, from my point of view, really slowly on Postgres 15.

I have a growing data table with 10 million records of sensor data:

CREATE TABLE public.ts_number (
    id int4 NOT NULL,
    ts int8 NOT NULL,
    val float4 NULL,
    CONSTRAINT ts_number_pkey PRIMARY KEY (id, ts)
);
CREATE INDEX ts_number_id_idx ON public.ts_number USING btree (id, ts, val);

I want to aggregate all data into 1-minute intervals and join it with the sensor name from another table. But I'm wondering why even the basic aggregate shows such a big difference in execution speed (compared with the same aggregation on SQL Server)? The timestamp is in milliseconds since the UNIX epoch.

From my understanding, an additional index won't help? There are only about 1,000 rows with val = NULL in the whole set of 10 million records.

select
    date_trunc('minute', to_timestamp(tn.ts / 1000)) as "time",
    tn.id,
    round(avg(tn.val::numeric), 1) as value
from
    ts_number tn
where
    tn.val is not null
group by
    tn.id,
    date_trunc('minute', to_timestamp(tn.ts / 1000))
order by
    date_trunc('minute', to_timestamp(tn.ts / 1000));

I've tried different indexes and subqueries to narrow it down, but I cannot find the answer.

Execution plan for Postgres:

jit=off, track_io_timing=true

EXPLAIN (ANALYZE, BUFFERS, TIMING)

QUERY PLAN                                                                                                                              |
---------------------------------------------------------------------------------------------------------------------------------------- 
Sort  (cost=1438043.51..1448107.59 rows=4025633 width=44) (actual time=45605.997..46283.850 rows=3507690 loops=1)                       |
  Sort Key: (date_trunc('minute'::text, to_timestamp(((ts / 1000))::double precision)))                                                 |
  Sort Method: external merge  Disk: 92136kB                                                                                            |
  Buffers: shared hit=32 read=79640, temp read=56259 written=100732                                                                     |
  I/O Timings: shared/local read=12067.837, temp read=1178.562 write=3915.268                                                           |
  ->  HashAggregate  (cost=707851.61..923882.68 rows=4025633 width=44) (actual time=29691.221..42565.892 rows=3507690 loops=1)          |
        Group Key: date_trunc('minute'::text, to_timestamp(((ts / 1000))::double precision)), id                                        |
        Planned Partitions: 32  Batches: 33  Memory Usage: 33041kB  Disk Usage: 368648kB                                                |
        Buffers: shared hit=32 read=79640, temp read=44742 written=89197                                                                |
        I/O Timings: shared/local read=12067.837, temp read=1147.856 write=3727.306                                                     |
        ->  Seq Scan on ts_number tn  (cost=0.00..295394.36 rows=10785399 width=16) (actual time=6.016..18958.130 rows=10785155 loops=1)|
              Filter: (val IS NOT NULL)                                                                                                 |
              Rows Removed by Filter: 1073                                                                                              |
              Buffers: shared hit=32 read=79640                                                                                         |
              I/O Timings: shared/local read=12067.837                                                                                  |
Planning Time: 0.289 ms                                                                                                                 |
Execution Time: 46617.899 ms                                                                                                            |


pg_tune settings:

# DB Version: 15
# OS Type: linux
# DB Type: dw
# Total Memory (RAM): 4 GB
# CPUs num: 4
# Connections num: 25
# Data Storage: ssd

max_connections = 25
shared_buffers = 1GB
effective_cache_size = 3GB
maintenance_work_mem = 512MB
checkpoint_completion_target = 0.9
wal_buffers = 16MB
default_statistics_target = 500
random_page_cost = 1.1
effective_io_concurrency = 200
work_mem = 10485kB
min_wal_size = 4GB
max_wal_size = 16GB
max_worker_processes = 4
max_parallel_workers_per_gather = 2
max_parallel_workers = 4
max_parallel_maintenance_workers = 2


The host is a J4125 quad core with 8GB RAM and a 1TB SSD, running Proxmox. Postgres runs in its own container (plain Debian Bullseye) with virtual resources: 4 cores, 4GB RAM, 2GB swap. No other containers are currently active.

CodePudding user response:

The index on all three columns isn't going to help as your query can't make use of it.

Casting the average in order to round it, rather than casting every value, should already make a difference, i.e. round(avg(tn.val)::numeric, 1) instead of round(avg(tn.val::numeric), 1).
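
For reference, here is the original query with only that change applied (everything else stays the same):

select
    date_trunc('minute', to_timestamp(tn.ts / 1000)) as "time",
    tn.id,
    round(avg(tn.val)::numeric, 1) as value -- aggregate first, cast the single result for round()
from
    ts_number tn
where
    tn.val is not null
group by
    tn.id,
    date_trunc('minute', to_timestamp(tn.ts / 1000))
order by
    date_trunc('minute', to_timestamp(tn.ts / 1000));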


However, the query still needs the costly hash aggregate and the sort (and with your extremely small work_mem both spill to your apparently slow disk).

One way to avoid that is to support the grouping with an index. Building the index is easier if you do the conversion from the UNIX epoch to a proper timestamp in a generated column (which might be helpful in other situations as well):

create table ts_number 
(
  id bigint, 
  ts bigint, 
  val double precision, 
  ts_time timestamp generated always as (timestamp '1970-01-01 00:00:00' + make_interval(secs => ts / 1000.0)) stored,
  primary key (id, ts)
);

The somewhat strange-looking expression for the generated column is necessary because to_timestamp() is not an immutable function and thus can't be used in a generated column (or an index).
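
If you'd rather not recreate the table, the same column can be added in place (a sketch assuming your existing ts column holds milliseconds since the epoch, as in your query; backfilling the 10 million rows will rewrite the table once):

alter table ts_number
  add column ts_time timestamp
  generated always as (timestamp '1970-01-01 00:00:00' + make_interval(secs => ts / 1000.0)) stored;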

Then you can create such an index:

create index on ts_number(date_trunc('minute', ts_time), id) 
   include (val)
where val is not null; --<< won't make a big difference but doesn't hurt either

Which leads to this query:

select date_trunc('minute', ts_time),
       tn.id,
       avg(tn.val) as value
from ts_number tn
where  tn.val is not null
group by tn.id, date_trunc('minute', ts_time)
order by date_trunc('minute', ts_time);

This should already be faster than your original query. If you always aggregate at the minute level, you could do that truncation directly in the generated column:

create table ts_number 
(
  id bigint, 
  ts bigint, 
  val double precision, 
  ts_time timestamp generated always as (date_trunc('minute', timestamp '1970-01-01 00:00:00' + make_interval(secs => ts / 1000.0))) stored,
  primary key (id, ts)
);

which then leads to the simplified index:

create index on ts_number(ts_time, id) 
  include (val)
where val is not null;

and the simplified query:

select ts_time,
       tn.id,
       avg(tn.val) as value
from ts_number tn
where  tn.val is not null
group by tn.id, ts_time
order by ts_time;
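
To check whether the planner now reads the data pre-sorted from the index instead of hashing and sorting everything, run the same statement under EXPLAIN:

explain (analyze, buffers)
select ts_time, tn.id, avg(tn.val) as value
from ts_number tn
where tn.val is not null
group by tn.id, ts_time
order by ts_time;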

This brought the runtime down from initially ~45 seconds to ~5 seconds on my old laptop with a spinning hard disk.

This kind of table is one of the rare cases where PostgreSQL's lack of "index organized tables" (aka "clustered index" in SQL Server) is indeed a disadvantage, because the index needed to support this kind of query needlessly duplicates the storage (the data is kept in both the index and the table).
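
To get a rough idea of that duplication, you can compare the table and index sizes directly, for example:

select pg_size_pretty(pg_relation_size('ts_number')) as table_size,
       pg_size_pretty(pg_indexes_size('ts_number'))  as all_indexes_size;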


Note that a work_mem of only 10MB is way too small if you are serious about doing larger aggregations with Postgres.
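
For a one-off aggregation you can also raise it just for your session or transaction instead of changing it globally (256MB is only an illustrative value; size it to your RAM and expected concurrency):

set work_mem = '256MB';            -- for the current session
-- or only within a single transaction:
begin;
set local work_mem = '256MB';
-- run the aggregate here
commit;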

I don't really know whether your SQL Server installation is also that severely limited with regard to the memory it can use for grouping and hashing, but given the numbers I'd suspect it can use substantially more memory.

CodePudding user response:

Comparisons between a professional tool that required billions of dollars of investment (SQL Server) and do-it-yourself DBMSs are generally out of proportion. This is why companies pay dearly for professional tools...

In this case the reasons why there are such differences are based on the following facts:

  1. parallelism: SQL Server does automatic parallelism and can use several threads to compute an aggregate. Neither MySQL nor PostgreSQL is able to do this.
  2. MVCC: SQL Server keeps the number of "alive" rows in the page header, so it doesn't need to iterate through all the rows of a table to compute the aggregate. Other RDBMSs, because of MVCC (optimistic locking), must scan all rows to distinguish dead rows from live rows within the pages, which costs a lot.
  3. optimizer: Microsoft has been investing massively in research and development for more than 30 years, and its optimizer is currently far ahead of even Oracle's... The level of optimization in MySQL / PostgreSQL is barely higher than what the first RDBMSs of the 1980s offered.
  4. compression: SQL Server makes it possible to compress data without having to decompress it in order to use it logically. This doesn't exist in MySQL or PostgreSQL.
  5. columnar: it is possible to use vertical indexes (columnstore indexes in SQL Server terminology) to compute aggregates even faster in SQL Server. This is obviously not possible in MySQL or PostgreSQL.

There would still be a lot more to say on these subjects; I'll let you take a look at the article I wrote comparing PostgreSQL and SQL Server on this topic, where we observed discrepancies of a factor of more than 1000!
