I have a table that looks like this:
id | slot | total |
---|---|---|
1 | 2022-12-01T12:00 | 100 |
2 | 2022-12-01T12:30 | 150 |
3 | 2022-12-01T13:00 | 200 |
There's an index on slot already. The table has ~100mil rows (and a bunch more columns not shown here)
I want to sum the total up to the current moment in time. Let's say the time is currently 2022-12-01T12:45
. If I run select * from my_table where slot < CURRENT_TIMESTAMP()
,
then I get back records 1
and 2
.
However, in my data, the records represent forecasted sales within a time slot. I want to find the forecasts as of 2022-12-01T12:45
, and so I want to find the proportion of the half hour slot of record 2
that has elapsed, and return that proportion of the total.
As of 2022-12-01T12:45
(assuming minute granularity), 50%
of row 2
has elapsed, so I would expect the total to return as 150 / 2 = 75
.
My current query works, but is slow. What are some ways I can optimise this, or other approaches I can take?
Also, how can we extend this solution to be generalised to any interval frequency? Maybe tomorrow we change our forecasting model and the data comes in sporadically. The hardcoded 30
would not work in that case.
select sum(fraction * total) as t from
select total,
LEAST(
timestampdiff(
minute,
datetime,
current_timestamp()
),
30
) / 30 as fraction
from my_table
where slot <= current_timestamp()
CodePudding user response:
Consider computing your sum first, then remove the last element partial total. In order to keep the last element total, I'd prefer applying window functions instead of aggregations, and limit the output to the last row.
SET @current_time = CURRENT_TIMESTAMP();
WITH cte AS (
SELECT slot,
SUM(total) OVER(ORDER BY slot) AS total,
total AS rowtotal
FROM my_table
WHERE slot < @current_time
ORDER BY slot DESC
LIMIT 1
)
SELECT slot,
total - (30 - TIMESTAMPDIFF(MINUTE,
slot,
@current_time))
/30 * rowtotal AS total
FROM cte
Check the demo here.
Note1: Adding an index on the slot field is likely to boost this query performance.
Note2: If your query is running on millions of data, your timestamp may be likely to change during the query. You could store it into a variable before the query is run (or into another cte).
CodePudding user response:
create an ondex in slot
column btree
as it is having high selectivity
;