Home > Software design >  posgresql query optimization when fetching a large number of records
posgresql query optimization when fetching a large number of records

Time:01-13

please tell me how to optimize the query. there are two tables:

  • "Trains" (always 209 entries)
  • "Passages" (currently 11620 entries, ~300 added per day) there is a request to get statistics for each train (getting the last passage date and the total number of passages)

train table structure:

 ---- -------- ---------- --------------- ------------------ 
| id | number | depot_id | train_type_id | wheelset_type_id |
 ---- -------- ---------- --------------- ------------------ 

passages table structure:

 ---- ------- -------------- ----------- 
| id | speed | train_number | date_time |
 ---- ------- -------------- ----------- 

the request looks like this:

SELECT
  DISTINCT ON (train.id)
  train.id,
  train.number,
  train.train_type_id,
  train.wheelset_type_id,
  passages.id as passage_id,
  passages.date_time,
  (
    SELECT count(id) FROM passages WHERE train_number = train.number
  ) AS passage_count
FROM train
INNER JOIN passages ON passages.train_number=train.number
ORDER BY train.id, passages.date_time DESC;

through EXPLAIN shows ~ 6.5 seconds, which is very long IMHO. and almost all the waiting time is spent on a piece:

(select count(id) from passages where train_number = train.number) as passage_count

please tell me how to improve the query in terms of optimization.

explain(analyze, verbose, buffers) for request above:

"QUERY PLAN"
"Unique  (cost=2547558.08..2547616.12 rows=211 width=36) (actual time=6845.152..6846.093 rows=109 loops=1)"
"  Output: train.id, train.number, train.train_type_id, train.wheelset_type_id, passages.id, passages.date_time, ((SubPlan 1))"
"  Buffers: shared hit=835018"
"  ->  Sort  (cost=2547558.08..2547587.10 rows=11608 width=36) (actual time=6845.151..6845.547 rows=11283 loops=1)"
"        Output: train.id, train.number, train.train_type_id, train.wheelset_type_id, passages.id, passages.date_time, ((SubPlan 1))"
"        Sort Key: train.id, passages.date_time DESC"
"        Sort Method: quicksort  Memory: 1266kB"
"        Buffers: shared hit=835018"
"        ->  Hash Join  (cost=6.75..2546774.38 rows=11608 width=36) (actual time=0.840..6839.787 rows=11283 loops=1)"
"              Output: train.id, train.number, train.train_type_id, train.wheelset_type_id, passages.id, passages.date_time, (SubPlan 1)"
"              Hash Cond: (passages.train_number = train.number)"
"              Buffers: shared hit=835018"
"              ->  Seq Scan on public.passages  (cost=0.00..190.08 rows=11608 width=16) (actual time=0.010..0.827 rows=11608 loops=1)"
"                    Output: passages.id, passages.speed, passages.train_number, passages.system_id, passages.orientation, passages.date_time"
"                    Buffers: shared hit=74"
"              ->  Hash  (cost=4.11..4.11 rows=211 width=16) (actual time=0.049..0.050 rows=211 loops=1)"
"                    Output: train.id, train.number, train.train_type_id, train.wheelset_type_id"
"                    Buckets: 1024  Batches: 1  Memory Usage: 18kB"
"                    Buffers: shared hit=2"
"                    ->  Seq Scan on public.train  (cost=0.00..4.11 rows=211 width=16) (actual time=0.009..0.027 rows=211 loops=1)"
"                          Output: train.id, train.number, train.train_type_id, train.wheelset_type_id"
"                          Buffers: shared hit=2"
"              SubPlan 1"
"                ->  Aggregate  (cost=219.36..219.37 rows=1 width=8) (actual time=0.605..0.605 rows=1 loops=11283)"
"                      Output: count(passages_1.id)"
"                      Buffers: shared hit=834942"
"                      ->  Seq Scan on public.passages passages_1  (cost=0.00..219.10 rows=103 width=4) (actual time=0.051..0.594 rows=162 loops=11283)"
"                            Output: passages_1.id, passages_1.speed, passages_1.train_number, passages_1.system_id, passages_1.orientation, passages_1.date_time"
"                            Filter: (passages_1.train_number = train.number)"
"                            Rows Removed by Filter: 11446"
"                            Buffers: shared hit=834942"
"Planning Time: 0.166 ms"
"Execution Time: 6846.292 ms"

explain(analyze, verbose, buffers) select * from train:

"QUERY PLAN"
"Seq Scan on public.train  (cost=0.00..4.11 rows=211 width=20) (actual time=0.014..0.024 rows=211 loops=1)"
"  Output: id, number, depot_id, train_type_id, wheelset_type_id"
"  Buffers: shared hit=2"
"Planning Time: 0.081 ms"
"Execution Time: 0.040 ms"

explain(analyze, verbose, buffers) select * from passages:

"QUERY PLAN"
"Seq Scan on public.passages  (cost=0.00..190.08 rows=11608 width=32) (actual time=0.009..0.598 rows=11608 loops=1)"
"  Output: id, speed, train_number, system_id, orientation, date_time"
"  Buffers: shared hit=74"
"Planning Time: 0.050 ms"
"Execution Time: 0.891 ms"

CodePudding user response:

You could rewrite the query to something like this, to get rid of the subquery:

WITH temp_table AS (
  SELECT
    DISTINCT ON (train.id)
    train.id,
    train.number,
    train.train_type_id,
    train.wheelset_type_id,
    passages.id as passage_id,
    passages.date_time
  FROM train
  INNER JOIN passages ON passages.train_number=train.number
  ORDER BY train.id, passages.date_time DESC
),passage_counts as (
  SELECT train_number, count(id) as passage_count FROM passages group by train_number
)
SELECT
  temp_table.id,
  temp_table.number,
  temp_table.train_type_id,
  temp_table.wheelset_type_id,
  temp_table.passage_id,
  temp_table.date_time,
  passage_counts.passage_count
FROM temp_table
  LEFT JOIN passage_counts
    ON temp_table.number = passage_counts.train_number;

Is there any difference in the query plan?

Edit: Another option could be the usage of a window function:

  SELECT
    train.id,
    train.number,
    train.train_type_id,
    train.wheelset_type_id,
    passages.id as passage_id,
    passages.date_time,
    count(passages.id) OVER (PARTITION BY train.number) as passage_count
  FROM train
  INNER JOIN passages ON passages.train_number=train.number
  ORDER BY train.id, passages.date_time DESC;

Give it a try!

  • Related