I am having trouble with something that I think timescaledb
can help with.
Suppose I have this table:
CREATE TABLE purchase
(
id integer NOT NULL DEFAULT nextval('purchase_id_seq'::regclass),
"timestamp" timestamp without time zone NOT NULL,
country character varying(128) COLLATE pg_catalog."default",
product character varying(128) COLLATE pg_catalog."default",
quantity numeric(64,32),
price numeric(64,32)
)
Which represents Purchases:
- id
- timestamp of purchase
- country it occurred in
- product being purchased
- quantity purchased
- price paid per 1 quantity
For each purchase, I want to calculate (pseudo code):
price - last price for given (country, product) where timestamp - timestamp of old record > 5 minutes
For example, if I have these purchases:
id timestamp country product quantity price
1 2021-12-09 07:12:11.13 US apple 1 1.2
2 2021-12-09 07:13:11.13 US apple 2 1.3
3 2021-12-09 07:19:12.13 US apple 2 1.4
4 2021-12-09 07:20:19.13 US apple 2 0.9
Then I would have these deltas
id timestamp country product quantity price last_price_at_least_five_minutes_ago
1 2021-12-09 07:12:11.13 US apple 1 1.2 NULL
2 2021-12-09 07:13:11.13 US apple 2 1.3 NULL
3 2021-12-09 07:17:12.13 US apple 2 1.4 1.2
4 2021-12-09 07:20:19.13 US apple 2 0.9 1.3
For each CURRENTROW, what is the easiest way to ask for "the price of the row with the highest timestamp that is less than CURRENTROW - '5 minutes' ?
Stupidly I have tried this:
SELECT
t1.country,
t1.product,
t1.timestamp,
t1.id,
t1.price,
t2.id AS last_id,
t2.timestamp AS last_timestamp,
t2.price AS last_price
FROM
purchase t1
LEFT JOIN purchase t2
ON
t2.timestamp < t1.timestamp - INTERVAL '5m' AND
t1.country = t2.country AND
t1.product = t2.product
GROUP BY
t1.country,
t1.product,
t1.id,
t1.price,
t1.timestamp,
t2.id,
t2.price,
t2.timestamp
But this hangs and I am sure it is doing a lot of unnecessary work, as the result set of the join is massive (each row * N where N is number of rows that are 5m ago)
CodePudding user response:
You can use a scalar subquery reading literally "the price of the row with the highest timestamp that is less than CURRENTROW - 5 minutes".
select *,
(
select price
from purchase
where product = currentrow.product -- more conditions can be added here
and "timestamp" < currentrow."timestamp" - interval '5 minutes' -- "less than CURRENTROW - 5 minutes"
order by "timestamp" desc limit 1 -- "the highest timestamp"
) as last_price_at_least_five_minutes_ago
from purchase as currentrow;
currentrow
is not a good enough alias for purchase
table but fits well in the logic of the subquery.
CodePudding user response:
You could also use a window function, although you'd have to determine any performance indications on larger datasets and the range you're querying.
This example selects the last value in the partition (by product in this case) with time ordered ASC. In order to get a value that is at least 5 minutes ago, you have to setup a window range that starts longer ago than 5 minutes (this uses UNBOUNDED PRECEDING
) and 5 minutes ago ('5 minutes'::interval PRECEDING
).
Using UNBOUNDED PRECEDING
could end up being costly depending on how many rows you have, so you could set some kind of starting interval if you knew/expect some kind of regularity to your data (something like range between '20 minute'::interval PRECEDING and '5 minute'::interval PRECEDING
)
select ts, country, product,quantity,price,
last_value(purchase.price) over w as last_price_at_least_five_minutes_ago
from purchase
window w as (partition by product order by ts range between unbounded PRECEDING and '5 minutes'::interval PRECEDING)
order by ts asc;
Results in:
ts |country|product|quantity|price|last_price_at_least_five_minutes_ago|
----------------------- ------- ------- -------- ----- ------------------------------------
2021-12-09 07:12:11.130|US |apple | 1.0| 1.2| |
2021-12-09 07:13:11.130|US |apple | 2.0| 1.3| |
2021-12-09 07:17:12.130|US |apple | 2.0| 1.4| 1.2|
2021-12-09 07:20:19.130|US |apple | 2.0| 0.9| 1.3|