I'm trying to write a query against the following table that counts the records for each day, date(timestamp), together with the prior 4 days. In other words, I want a rolling count of records over each 5-day window. Every time I run the calculation, the result is slightly off.
Symbol | timestamp | high | number | date(timestamp) | date(timestamp - interval 1 day) | date(timestamp - interval 2 day) | date(timestamp - interval 3 day) | date(timestamp - interval 4 day) |
---|---|---|---|---|---|---|---|---|
SPY | 2021-04-26 04:00:00+00:00 | 416.97 | 1 | 2021-04-26 | 2021-04-25 | 2021-04-24 | 2021-04-23 | 2021-04-22 |
SPY | 2021-04-26 06:20:00+00:00 | 416.91 | 2 | 2021-04-26 | 2021-04-25 | 2021-04-24 | 2021-04-23 | 2021-04-22 |
SPY | 2021-04-26 08:00:00+00:00 | 416.84 | 3 | 2021-04-26 | 2021-04-25 | 2021-04-24 | 2021-04-23 | 2021-04-22 |
SPY | 2021-04-26 08:05:00+00:00 | 416.8 | 4 | 2021-04-26 | 2021-04-25 | 2021-04-24 | 2021-04-23 | 2021-04-22 |
SPY | 2021-04-26 08:10:00+00:00 | 416.81 | 5 | 2021-04-26 | 2021-04-25 | 2021-04-24 | 2021-04-23 | 2021-04-22 |
SPY | 2021-04-26 08:15:00+00:00 | 416.78 | 6 | 2021-04-26 | 2021-04-25 | 2021-04-24 | 2021-04-23 | 2021-04-22 |
SPY | 2021-04-26 08:20:00+00:00 | 416.75 | 7 | 2021-04-26 | 2021-04-25 | 2021-04-24 | 2021-04-23 | 2021-04-22 |
SPY | 2021-04-26 08:25:00+00:00 | 416.54 | 8 | 2021-04-26 | 2021-04-25 | 2021-04-24 | 2021-04-23 | 2021-04-22 |
SPY | 2021-04-26 08:30:00+00:00 | 416.51 | 9 | 2021-04-26 | 2021-04-25 | 2021-04-24 | 2021-04-23 | 2021-04-22 |
SPY | 2021-04-26 08:35:00+00:00 | 416.34 | 10 | 2021-04-26 | 2021-04-25 | 2021-04-24 | 2021-04-23 | 2021-04-22 |
SPY | 2021-04-26 08:40:00+00:00 | 416.33 | 11 | 2021-04-26 | 2021-04-25 | 2021-04-24 | 2021-04-23 | 2021-04-22 |
The following query returns counts, but they are slightly off. For example, for 2021-04-30 the count should match max(number), since the rows are enumerated, but it does not. Please help.
select date(t.timestamp),date(t.timestamp - interval 4 day),
(count(t.timestamp) + count(t.timestamp - interval 4 day) + count(t.timestamp - interval 3 day) + count(t.timestamp - interval 2 day) + count(t.timestamp - interval 1 day)) as cntrecords,
max(number), min(number)
from
(SELECT Symbol,
timestamp,
high,
ROW_NUMBER() OVER (ORDER BY timestamp) AS number,
date(timestamp),
date(timestamp - interval 4 day)
#BETWEEN DATE_SUB(date(timestamp), INTERVAL 4 DAY) AND date(timestamp)
FROM test.rawdata
#WHERE date(timestamp) BETWEEN DATE_SUB('2021-04-30', INTERVAL 4 DAY) AND '2021-04-30'
#group by date(timestamp)
order by timestamp) as t
group by date(t.timestamp);
CodePudding user response:
This groups by symbol and date, then sums the current row's count with the counts of the preceding 4 rows (rows, not calendar days, so any missing dates widen the window) -
SELECT *, SUM(`count`) OVER (
PARTITION BY `Symbol`
ORDER BY `date` ASC
ROWS BETWEEN 4 PRECEDING AND CURRENT ROW
)
FROM (
SELECT `Symbol`, DATE(`timestamp`) `date`, COUNT(*) `count`
FROM `test`.`rawdata`
GROUP BY `Symbol`, `date`
) tbl
Or you could use a recursive CTE to build a list of date ranges to join against and group by -
WITH RECURSIVE `cte` (`date`, `start`, `end`) AS (
SELECT
CAST('2021-04-30' AS DATE),
CAST('2021-04-26 00:00:00' AS DATETIME),
CAST('2021-04-30 23:59:59' AS DATETIME)
UNION ALL
SELECT
`date` + INTERVAL 1 DAY,
`start` + INTERVAL 1 DAY,
`end` + INTERVAL 1 DAY
FROM `cte`
WHERE `date` < '2021-05-31'
)
SELECT `rawdata`.`symbol`, `cte`.`date`, COUNT(*)
FROM `cte`
JOIN `test`.`rawdata`
ON `rawdata`.`timestamp` BETWEEN `cte`.`start` AND `cte`.`end`
GROUP BY `rawdata`.`symbol`, `cte`.`date`;
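To see how the range-list approach behaves, here is a minimal sketch that runs the same idea against an in-memory SQLite database from Python. It is only a stand-in for MySQL: the sample rows are made up, the CTE columns are renamed to day, start_ts and end_ts, and SQLite's date(..., '+1 day') and datetime(..., '+1 day') replace MySQL's + INTERVAL 1 DAY.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rawdata (symbol TEXT, timestamp TEXT)")

# Hypothetical sample rows: 2 on 04-26, 1 on 04-27, 3 on 04-30, 1 on 05-01.
conn.executemany(
    "INSERT INTO rawdata VALUES (?, ?)",
    [
        ("SPY", "2021-04-26 04:00:00"),
        ("SPY", "2021-04-26 08:00:00"),
        ("SPY", "2021-04-27 09:30:00"),
        ("SPY", "2021-04-30 10:00:00"),
        ("SPY", "2021-04-30 11:00:00"),
        ("SPY", "2021-04-30 12:00:00"),
        ("SPY", "2021-05-01 09:00:00"),
    ],
)

rolling = conn.execute("""
    WITH RECURSIVE cte (day, start_ts, end_ts) AS (
        -- Anchor row: the 5-day window that ends on 2021-04-30
        SELECT '2021-04-30', '2021-04-26 00:00:00', '2021-04-30 23:59:59'
        UNION ALL
        -- Each recursive step slides the whole window forward by one day
        SELECT date(day, '+1 day'),
               datetime(start_ts, '+1 day'),
               datetime(end_ts, '+1 day')
        FROM cte
        WHERE day < '2021-05-02'
    )
    SELECT r.symbol, cte.day, COUNT(*)
    FROM cte
    JOIN rawdata AS r
      ON r.timestamp BETWEEN cte.start_ts AND cte.end_ts
    GROUP BY r.symbol, cte.day
    ORDER BY cte.day
""").fetchall()

for row in rolling:
    print(row)
```

Because the windows are defined by calendar boundaries rather than by row position, days with no data simply contribute nothing to a window. Note that with an inner join, a window containing no rows at all would not appear in the output.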
CodePudding user response:
This was not stated in the question, but I'm guessing that columns like number and date(timestamp - interval 3 day) were added to try to help solve this problem, or are results of an intermediate query. Either way, they are unnecessary.
Let's solve this problem for a table which only has symbol, timestamp, and high, using a window function with a frame specification:
-- Get the counts for each day (factored out as a CTE)
with Counts AS (
SELECT
symbol,
DATE(timestamp) AS day,
COUNT(*) AS day_count
FROM
prices
GROUP BY symbol, day
)
SELECT
symbol,
day,
-- Here we use the window function with a frame spec to sum the current day and the 4 preceding days per row
SUM(day_count) OVER (
PARTITION BY symbol
ORDER BY day
RANGE BETWEEN INTERVAL '4' DAY PRECEDING AND CURRENT ROW
)
FROM
Counts
Note that I'm using RANGE instead of ROWS because, if there are days for which there is no data, a ROWS frame would reach back 4 rows rather than 4 days, and that would throw off the counts.
Also note that this will not work in MariaDB, which does not yet support RANGE frames over date/time fields with interval offsets.
You can play around with this example in this DB Fiddle: https://www.db-fiddle.com/f/ctN927ouAMHrWK1QJQD6Z2/0
(Note I only used 1 day in the range there so I didn't have to generate as much test data.)
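The difference between ROWS and RANGE frames can be checked with a small experiment. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL and assumes the bundled SQLite is 3.28 or newer (needed for RANGE with an offset). The counts table and its values are made up, and because SQLite has no INTERVAL syntax, the RANGE frame orders by julianday(day) so that an offset of 4 means 4 days.

```python
import sqlite3

# Hypothetical per-day counts for one symbol; 2021-04-28 and 2021-04-29 have no data.
daily_counts = [
    ("SPY", "2021-04-24", 10),
    ("SPY", "2021-04-25", 20),
    ("SPY", "2021-04-26", 30),
    ("SPY", "2021-04-27", 40),
    ("SPY", "2021-04-30", 50),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE counts (symbol TEXT, day TEXT, day_count INTEGER)")
conn.executemany("INSERT INTO counts VALUES (?, ?, ?)", daily_counts)

rows = conn.execute("""
    SELECT
        day,
        -- ROWS: current row plus the 4 previous rows, whatever their dates
        SUM(day_count) OVER (
            PARTITION BY symbol
            ORDER BY day
            ROWS BETWEEN 4 PRECEDING AND CURRENT ROW
        ) AS rows_sum,
        -- RANGE: every row whose day is within 4 days before the current day
        SUM(day_count) OVER (
            PARTITION BY symbol
            ORDER BY julianday(day)
            RANGE BETWEEN 4 PRECEDING AND CURRENT ROW
        ) AS range_sum
    FROM counts
    ORDER BY day
""").fetchall()

for row in rows:
    print(row)
```

For 2021-04-30 the ROWS frame reaches back to 2021-04-24 and sums all five rows (150), while the RANGE frame only covers 2021-04-26 through 2021-04-30 (120), which is the 5-day window the question asks for.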