please tell me how to optimize the query. there are two tables:
- "Trains" (always 209 entries)
- "Passages" (currently 11620 entries, ~300 added per day) there is a request to get statistics for each train (getting the last passage date and the total number of passages)
train table structure:
---- -------- ---------- --------------- ------------------
| id | number | depot_id | train_type_id | wheelset_type_id |
---- -------- ---------- --------------- ------------------
passages table structure:
---- ------- -------------- -----------
| id | speed | train_number | date_time |
---- ------- -------------- -----------
the request looks like this:
SELECT
DISTINCT ON (train.id)
train.id,
train.number,
train.train_type_id,
train.wheelset_type_id,
passages.id as passage_id,
passages.date_time,
(
SELECT count(id) FROM passages WHERE train_number = train.number
) AS passage_count
FROM train
INNER JOIN passages ON passages.train_number=train.number
ORDER BY train.id, passages.date_time DESC;
through EXPLAIN shows ~ 6.5 seconds, which is very long IMHO. and almost all the waiting time is spent on a piece:
(select count(id) from passages where train_number = train.number) as passage_count
please tell me how to improve the query in terms of optimization.
explain(analyze, verbose, buffers) for request above:
"QUERY PLAN"
"Unique (cost=2547558.08..2547616.12 rows=211 width=36) (actual time=6845.152..6846.093 rows=109 loops=1)"
" Output: train.id, train.number, train.train_type_id, train.wheelset_type_id, passages.id, passages.date_time, ((SubPlan 1))"
" Buffers: shared hit=835018"
" -> Sort (cost=2547558.08..2547587.10 rows=11608 width=36) (actual time=6845.151..6845.547 rows=11283 loops=1)"
" Output: train.id, train.number, train.train_type_id, train.wheelset_type_id, passages.id, passages.date_time, ((SubPlan 1))"
" Sort Key: train.id, passages.date_time DESC"
" Sort Method: quicksort Memory: 1266kB"
" Buffers: shared hit=835018"
" -> Hash Join (cost=6.75..2546774.38 rows=11608 width=36) (actual time=0.840..6839.787 rows=11283 loops=1)"
" Output: train.id, train.number, train.train_type_id, train.wheelset_type_id, passages.id, passages.date_time, (SubPlan 1)"
" Hash Cond: (passages.train_number = train.number)"
" Buffers: shared hit=835018"
" -> Seq Scan on public.passages (cost=0.00..190.08 rows=11608 width=16) (actual time=0.010..0.827 rows=11608 loops=1)"
" Output: passages.id, passages.speed, passages.train_number, passages.system_id, passages.orientation, passages.date_time"
" Buffers: shared hit=74"
" -> Hash (cost=4.11..4.11 rows=211 width=16) (actual time=0.049..0.050 rows=211 loops=1)"
" Output: train.id, train.number, train.train_type_id, train.wheelset_type_id"
" Buckets: 1024 Batches: 1 Memory Usage: 18kB"
" Buffers: shared hit=2"
" -> Seq Scan on public.train (cost=0.00..4.11 rows=211 width=16) (actual time=0.009..0.027 rows=211 loops=1)"
" Output: train.id, train.number, train.train_type_id, train.wheelset_type_id"
" Buffers: shared hit=2"
" SubPlan 1"
" -> Aggregate (cost=219.36..219.37 rows=1 width=8) (actual time=0.605..0.605 rows=1 loops=11283)"
" Output: count(passages_1.id)"
" Buffers: shared hit=834942"
" -> Seq Scan on public.passages passages_1 (cost=0.00..219.10 rows=103 width=4) (actual time=0.051..0.594 rows=162 loops=11283)"
" Output: passages_1.id, passages_1.speed, passages_1.train_number, passages_1.system_id, passages_1.orientation, passages_1.date_time"
" Filter: (passages_1.train_number = train.number)"
" Rows Removed by Filter: 11446"
" Buffers: shared hit=834942"
"Planning Time: 0.166 ms"
"Execution Time: 6846.292 ms"
explain(analyze, verbose, buffers) select * from train:
"QUERY PLAN"
"Seq Scan on public.train (cost=0.00..4.11 rows=211 width=20) (actual time=0.014..0.024 rows=211 loops=1)"
" Output: id, number, depot_id, train_type_id, wheelset_type_id"
" Buffers: shared hit=2"
"Planning Time: 0.081 ms"
"Execution Time: 0.040 ms"
explain(analyze, verbose, buffers) select * from passages:
"QUERY PLAN"
"Seq Scan on public.passages (cost=0.00..190.08 rows=11608 width=32) (actual time=0.009..0.598 rows=11608 loops=1)"
" Output: id, speed, train_number, system_id, orientation, date_time"
" Buffers: shared hit=74"
"Planning Time: 0.050 ms"
"Execution Time: 0.891 ms"
CodePudding user response:
You could rewrite the query to something like this, to get rid of the subquery:
WITH temp_table AS (
SELECT
DISTINCT ON (train.id)
train.id,
train.number,
train.train_type_id,
train.wheelset_type_id,
passages.id as passage_id,
passages.date_time
FROM train
INNER JOIN passages ON passages.train_number=train.number
ORDER BY train.id, passages.date_time DESC
),passage_counts as (
SELECT train_number, count(id) as passage_count FROM passages group by train_number
)
SELECT
temp_table.id,
temp_table.number,
temp_table.train_type_id,
temp_table.wheelset_type_id,
temp_table.passage_id,
temp_table.date_time,
passage_counts.passage_count
FROM temp_table
LEFT JOIN passage_counts
ON temp_table.number = passage_counts.train_number;
Is there any difference in the query plan?
Edit: Another option could be the usage of a window function:
SELECT
train.id,
train.number,
train.train_type_id,
train.wheelset_type_id,
passages.id as passage_id,
passages.date_time,
count(passages.id) OVER (PARTITION BY train.number) as passage_count
FROM train
INNER JOIN passages ON passages.train_number=train.number
ORDER BY train.id, passages.date_time DESC;
Give it a try!