Given the following tables
buckets
metric_id|start_date |bucket
------------------------------------
a |2019-12-05 00:00:00|1
a |2019-12-06 00:00:00|2
b |2021-10-31 00:00:00|1
b |2021-11-01 00:00:00|2
points
point_id|metric_id|timestamp
----------------------------
1 |a |2019-12-05 00:00:00
2 |a |2019-12-06 00:00:00
3 |b |2021-10-31 00:00:00
4 |b |2021-11-01 00:00:00
And the following query
select
p.metric_id,
bucket
from points p
left join width_bucket(p.timestamp, (select array(select start_date
from buckets b
where b.metric_id = p.metric_id -- correlated sub-query
))) as bucket on true
Output
metric_id|bucket
-----------------
a |1
a |2
b |1
b |2
How can I remove the correlated sub-query to improve the performance?
Currently ~280,000 points * ~650 buckets = ~180,000,000 loops = very slow!
Basically I want to remove the correlated sub-query and apply the width_bucket function only once per unique metric_id in buckets, so that the performance is improved and the function is still given the correct time series data.
How can this be done in Postgres 13?
CodePudding user response:
you can rewrite your query :
select
p.metric_id,
width_bucket(p.timestamp,array_agg(b.start_date)) bucket
from points p
left join buckets b on b.metric_id = p.metric_id
group by p.metric_id, p.timestamp
also adding index on buckets.start_date & points ( metric_id, timestamp) would help alot.
CodePudding user response:
You can use a cte to aggregate buckets first
with buckets_arr as (
select metric_id, array_agg(start_date order by start_date) arrb
from buckets
group by metric_id
)
select
p.metric_id,
width_bucket(p.timestamp, ba.arrb) bucket
from points p
join buckets_arr ba on p.metric_id = ba.metric_id