We are trying to query millions of rows from a single table (14 columns) in under 2 seconds if possible. The table currently holds more than 10 million records and keeps growing, since live data is written into it every second of every day.
The queries we write against this table typically look like:
SELECT "Scale",
COUNT("Weight") AS "Total Punnets",
MAX("Time") - MIN("Time") AS "Delta Time", database
FROM global_packing_data
WHERE "Date" >= '2022-05-22'AND "Date" <= '2022-08-22'
AND global_packing_data.database IN ('line_1_computer','line_2_computer')
GROUP BY "Scale", database
The execution plan for this specific query is as follows:
Finalize GroupAggregate  (cost=1499111.79..1499128.59 rows=120 width=44) (actual time=44754.797..44769.894 rows=30 loops=1)
  Group Key: "Scale", database
  Buffers: shared hit=694 read=915799
  ->  Gather Merge  (cost=1499111.79..1499125.59 rows=120 width=44) (actual time=44754.779..44769.854 rows=60 loops=1)
        Workers Planned: 1
        Workers Launched: 1
        Buffers: shared hit=694 read=915799
        ->  Sort  (cost=1498111.78..1498112.08 rows=120 width=44) (actual time=44722.956..44722.967 rows=30 loops=2)
              Sort Key: "Scale", database
              Sort Method: quicksort  Memory: 27kB
              Buffers: shared hit=694 read=915799
              Worker 0:  Sort Method: quicksort  Memory: 27kB
              ->  Partial HashAggregate  (cost=1498106.44..1498107.64 rows=120 width=44) (actual time=44722.875..44722.888 rows=30 loops=2)
                    Group Key: "Scale", database
                    Batches: 1  Memory Usage: 40kB
                    Buffers: shared hit=679 read=915799
                    Worker 0:  Batches: 1  Memory Usage: 40kB
                    ->  Parallel Seq Scan on global_packing_data  (cost=0.00..1379067.81 rows=9523090 width=32) (actual time=374.969..42710.007 rows=8140736 loops=2)
                          Filter: (("Date" >= '2022-05-22'::date) AND ("Date" <= '2022-08-22'::date) AND (database = ANY ('{line_1_computer,line_2_computer}'::text[])))
                          Rows Removed by Filter: 14327912
                          Buffers: shared hit=679 read=915799
Planning:
  Buffers: shared hit=7 read=4
Planning Time: 1.484 ms
JIT:
  Functions: 21
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 3.144 ms, Inlining 117.817 ms, Optimization 386.307 ms, Emission 249.100 ms, Total 756.369 ms
Execution Time: 44771.410 ms
Attempts to improve performance:
- Created a materialized view based on the table. This worked amazingly at first, but the performance is no longer stellar; we are not sure whether that is because the view has grown since it was first created or because of something else. A sketch of this kind of view appears just after this list.
- Created a multi-column index based on the 2 columns that will always appear in the WHERE clause of all our queries:
CREATE INDEX global_packing_data_index
ON global_packing_data("Date", database);
- Made changes to the PostgreSQL server config file:
max_connections = 100
shared_buffers = 2GB
wal_buffers = 16MB
work_mem = 20971kB
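Roughly, the idea behind the view is to pre-aggregate per day, source database and scale, so the expensive scan happens once per refresh rather than on every query. A sketch only; the name packing_daily_summary and the exact column list are illustrative, not our production definition:

-- One row per day, source database and scale; refreshed periodically.
CREATE MATERIALIZED VIEW packing_daily_summary AS
SELECT "Date",
       database,
       "Scale",
       COUNT("Weight") AS punnet_count,
       MIN("Time")     AS first_time,
       MAX("Time")     AS last_time
FROM global_packing_data
GROUP BY "Date", database, "Scale";

The original query can then be answered from this view with SUM(punnet_count) and MAX(last_time) - MIN(first_time) over the requested date range.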
The PostgreSQL server is hosted on an Amazon EC2 instance. The virtual machine specs are: Ubuntu, 2 vCPUs, 8 GB RAM.
Any ideas on how we can improve performance further, so that we can query all this data in under 2 seconds?
CodePudding user response:
Maybe a so-called covering index will help.
You can think of a BTREE index as an old-school telephone book sorted in order. PostgreSQL can do two things quickly: random-access the telephone book, and scan it in order.
In your WHERE clause you do an equality match on database and a range scan on Date. So the index you create to support your query should start with those two columns in that order: equality matches first, then the range-scan match. PostgreSQL random-accesses the index to the first matching entry, then scans over the Date range.
Next you do GROUP BY on Scale and on database. If your index contains those two columns, PostgreSQL may be able to do the GROUP BY operation in order as it scans the index, making it faster.
Finally, you use Time and Weight in your query.
So, this index might help satisfy your query efficiently.
CREATE INDEX CONCURRENTLY alanders_index ON global_packing_data USING BTREE
("database", "Date", "Scale", "database") INCLUDE ("Time", "Weight");
Because the database column appears twice in the index, this will be worth a comment in your codebase explaining why. Else somebody -- maybe your future self -- might remove it because it looks redundant.
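One place to keep that note, besides the migration script, is on the index itself; a sketch, where the wording of the comment is just an example:

COMMENT ON INDEX alanders_index IS
  'database is listed twice on purpose: leading column for the WHERE equality
   match, trailing column so GROUP BY "Scale", database can be read in index
   order. Do not remove the apparent duplicate.';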
Keep in mind that it's very difficult to guarantee performance for queries on a large table, especially one that grows fast.
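Either way, it is worth rerunning the query under EXPLAIN (ANALYZE, BUFFERS), presumably the same options that produced the plan in the question, to confirm the planner actually picks the new index instead of a Parallel Seq Scan:

EXPLAIN (ANALYZE, BUFFERS)
SELECT "Scale",
       COUNT("Weight") AS "Total Punnets",
       MAX("Time") - MIN("Time") AS "Delta Time",
       database
FROM global_packing_data
WHERE "Date" >= '2022-05-22' AND "Date" <= '2022-08-22'
  AND database IN ('line_1_computer', 'line_2_computer')
GROUP BY "Scale", database;

If it still chooses a sequential scan, running ANALYZE global_packing_data first, so the planner has fresh statistics, is a cheap thing to check.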
Pro Tip: When designing tables, try to avoid using reserved words like DATE or DATABASE for table or column names.
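If renaming is ever an option, it could look something like this (the new names packing_date and source_database are only placeholders, and every query, view and index referencing the old names would need updating):

-- Hypothetical renames to avoid keyword-like, mixed-case identifiers.
ALTER TABLE global_packing_data RENAME COLUMN "Date" TO packing_date;
ALTER TABLE global_packing_data RENAME COLUMN "database" TO source_database;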
CodePudding user response:
There is no way to optimize such queries in PostgreSQL, because of the internal structure of the data storage into rows inside pages. This is a flaw that comes from the design of MVCC, which leaves phantom records inside the table pages...
All queries involving an aggregate in PostgreSQL, such as COUNT, COUNT DISTINCT or DISTINCT, must read all the rows inside the table pages to produce the result, and must read the internal version tag to know whether a row is a past or a current one...
Take a look at the paper I wrote about this problem: PostGreSQL vs Microsoft SQL Server – Comparison part 2 : COUNT performances
Only the materialized view can be useful in this case, on the condition that your data changes rarely... because of the time spent on the refresh!
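If the refresh time is the main worry, one option is a concurrent refresh, which keeps the view readable while it is rebuilt. A sketch, reusing the placeholder view name from the question and assuming a unique index can be defined on it (CONCURRENTLY requires one):

-- A unique index over the grouping columns, required for CONCURRENTLY.
CREATE UNIQUE INDEX packing_daily_summary_key
    ON packing_daily_summary ("Date", database, "Scale");

-- Rebuilds the view without blocking readers, at the cost of extra work.
REFRESH MATERIALIZED VIEW CONCURRENTLY packing_daily_summary;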