I am testing the performance of Postgres full table scans (no index), and it's surprisingly slow.
The following is run on a fresh db.m5.8xlarge box from AWS:
CREATE TABLE test100m AS SELECT * FROM GENERATE_SERIES(1, 100000000) AS id;
SET max_parallel_workers_per_gather = 6;
EXPLAIN ANALYZE SELECT max(id) FROM test100m;
Result:
                                                                      QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=651812.03..651812.04 rows=1 width=4) (actual time=1817.850..1819.931 rows=1 loops=1)
   ->  Gather  (cost=651811.40..651812.01 rows=6 width=4) (actual time=1817.788..1819.921 rows=7 loops=1)
         Workers Planned: 6
         Workers Launched: 6
         ->  Partial Aggregate  (cost=650811.40..650811.41 rows=1 width=4) (actual time=1814.193..1814.194 rows=1 loops=7)
               ->  Parallel Seq Scan on test100m  (cost=0.00..609144.72 rows=16666672 width=4) (actual time=0.003..902.986 rows=14285714 loops=7)
 Planning Time: 0.055 ms
 Execution Time: 1819.953 ms
So, 1800 ms to scan 100M rows. On my laptop (limited to 6 cores, with per-core performance similar to the db.m5.8xlarge), scanning 100M array entries takes 38 ms:
package main

import (
	"fmt"
	"testing"
	"time"
)

func TestTiming(t *testing.T) {
	data := make([]int, 100000000)
	for i := 0; i < len(data); i++ {
		data[i] = i
	}
	start := time.Now()
	max := data[0]
	for i := 0; i < len(data); i++ {
		if max < data[i] {
			max = data[i]
		}
	}
	fmt.Printf("Timing: 100,000,000 %s\n", time.Since(start))
}
That's a difference of ~50x. Granted, I'm not comparing apples to apples here, but I still would've expected a much smaller gap. All the data easily fits in memory.
Can Postgres full-table-scan performance be significantly improved somehow (besides increasing max_parallel_workers_per_gather)? What is it doing that makes it 50x slower?
UPDATE:
Including a more detailed query plan:
> EXPLAIN (ANALYZE, BUFFERS, TIMING) SELECT max(id) FROM test100m;
                                                                      QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=651812.03..651812.04 rows=1 width=4) (actual time=1953.561..1955.891 rows=1 loops=1)
   Buffers: shared hit=442478
   ->  Gather  (cost=651811.40..651812.01 rows=6 width=4) (actual time=1953.505..1955.885 rows=7 loops=1)
         Workers Planned: 6
         Workers Launched: 6
         Buffers: shared hit=442478
         ->  Partial Aggregate  (cost=650811.40..650811.41 rows=1 width=4) (actual time=1950.497..1950.497 rows=1 loops=7)
               Buffers: shared hit=442478
               ->  Parallel Seq Scan on test100m  (cost=0.00..609144.72 rows=16666672 width=4) (actual time=0.004..916.197 rows=14285714 loops=7)
                     Buffers: shared hit=442478
 Planning Time: 0.059 ms
 Execution Time: 1955.916 ms
CodePudding user response:
A 50x difference, compared to just scanning an array in memory, is not that strange. Just consider what the database needs to do:
- check whether the needed block of data is in the buffer cache (using a non-trivial lookup, plus locking to avoid race conditions, etc.);
- parse each 8 kiB block of data and convert it to the in-memory tuple format;
- check, for each row in that format, whether it is visible to the current transaction (another complicated algorithm, needed to avoid problems caused by transaction-id wraparound);
- apply the internal representation of the query plan to each row, which is far more work than comparing two integers;
- gather results from all parallel workers, which requires quite a lot of locking and synchronization to avoid corrupting memory;
- merge the results of the parallel query, using yet another algorithm with its own locking, etc.
And even this is greatly simplified.
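A toy Go benchmark can make the per-row overhead concrete. The sketch below is not Postgres code and all names in it are invented; it adds just two of the costs from the list above — a per-row visibility test against a fake transaction-id stamp, and an indirect call for the aggregate step — to the tight loop from the question, scaled down to 10M rows to keep memory modest:

```go
package main

import (
	"fmt"
	"time"
)

// row carries hypothetical per-tuple metadata: an invented visibility
// stamp loosely standing in for the tuple header a database inspects.
type row struct {
	xmin uint32
	val  int32
}

// agg is reached through an interface, so the comparison cannot be
// inlined into the scan loop -- mimicking per-row executor dispatch.
type agg interface{ advance(v int32) }

type maxAgg struct{ cur int32 }

func (m *maxAgg) advance(v int32) {
	if m.cur < v {
		m.cur = v
	}
}

// scanTight is the question's benchmark: a branch over a raw slice.
func scanTight(vals []int32) int32 {
	max := vals[0]
	for _, v := range vals {
		if max < v {
			max = v
		}
	}
	return max
}

// scanSimulated adds a visibility test and an indirect call per row.
func scanSimulated(rows []row, currentXid uint32) int32 {
	m := &maxAgg{cur: rows[0].val}
	var a agg = m
	for i := range rows {
		if rows[i].xmin < currentXid { // per-row visibility check
			a.advance(rows[i].val)
		}
	}
	return m.cur
}

func main() {
	const n = 10_000_000
	vals := make([]int32, n)
	rows := make([]row, n)
	for i := 0; i < n; i++ {
		vals[i] = int32(i)
		rows[i] = row{xmin: uint32(i % 4096), val: int32(i)}
	}

	start := time.Now()
	m1 := scanTight(vals)
	t1 := time.Since(start)

	start = time.Now()
	m2 := scanSimulated(rows, 1<<30) // every row is "visible"
	t2 := time.Since(start)

	fmt.Printf("tight: %v (max=%d)  simulated: %v (max=%d)\n", t1, m1, t2, m2)
}
```

On most machines the simulated loop is noticeably slower than the tight one, and it still omits block parsing, buffer-cache lookups, and worker synchronization, so it only hints at where the real gap comes from.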