Calculating a rolling 14 day average when dates are missing-CodePudding

I need to calculate the rolling 14 day average for a large data set. The data set is private, although I can share a small snippet.

The data set comes from an instrument in the field which does not operate every day. For instance, a snippet of the data frame would look like so:

Date, Value
2022-01-28, 196.00000
2022-01-31, 104.00000
2022-02-01, 0.00000
2022-02-02, 98.00000
2022-02-03, 0.00000
2022-02-07, 139.92308
2022-02-08, 114.50000
2022-02-09, 121.64286
2022-02-10, 96.50000
2022-02-11, 151.63636
2022-02-14, 85.87500
2022-02-15, 98.90000
2022-02-18, 209.40000
2022-02-21, 172.18182
2022-02-22, 0.00000
2022-02-23, 0.00000
2022-02-28, 264.00000
2022-03-01, 131.75000
2022-03-03, 119.33333
2022-03-04, 88.80000
2022-03-07, 152.16667
2022-03-08, 24.50000

I have the following plot.

library(zoo)
library(tidyverse)
ggplot(data=df_days, aes(x=Date, y=Value))  
  geom_line(color="black", lwd=0.5)  
  geom_point(lwd=0.5)  
  geom_line(y=rollmean(df_days$Value, 14, na.pad=TRUE), color="red", lwd=0.8)

I realised that I'm actually taking the 14 point average, i.e the average of 14 data points. Is there a way to take the 14 day average, based upon the dates themselves?

CodePudding user response：

There may be more elegant solutions, but you can fill in the missing dates with NA:

df$Date <- as.Date(df$Date)

library(dplyr)
library(tidyr)
df %>% complete(Date = seq(min(Date),max(Date),1), fill = list(Value = NA))

Output:

# A tibble: 40 × 2
#   Date       Value
#   <date>     <dbl>
# 1 2022-01-28   196
# 2 2022-01-29    NA
# 3 2022-01-30    NA
# 4 2022-01-31   104
# 5 2022-02-01     0
# 6 2022-02-02    98
# ...

CodePudding user response：

1) Using the input from the question shown reproducibly in the Note at the end we calculate the number of points to use at each date, w, and then use rollapplyr with that.

library(zoo)

within(DF, {
  w <- seq_along(Date) - findInterval(Date - 14, Date)
  mean14 <- rollapplyr(Value, w, mean)
})

giving the following where mean14 is the mean and w is the number of points used to calculate that mean. This is calculated in such a way that if there were no missing dates then it would give the same result as rollapplyr(DF$Value, 14, mean, partial = TRUE) but if there are missing dates then it uses fewer based on the number of dates in a 14 day window. (Note that using different numbers of points for each mean can affect the variance.)

         Date    Value    mean14 w
1  2022-01-28 196.0000 196.00000 1
2  2022-01-31 104.0000 150.00000 2
3  2022-02-01   0.0000 100.00000 3
4  2022-02-02  98.0000  99.50000 4
5  2022-02-03   0.0000  79.60000 5
6  2022-02-07 139.9231  89.65385 6
7  2022-02-08 114.5000  93.20330 7
8  2022-02-09 121.6429  96.75824 8
9  2022-02-10  96.5000  96.72955 9
10 2022-02-11 151.6364  91.80026 9
11 2022-02-14  85.8750  89.78637 9
12 2022-02-15  98.9000 100.77526 9
13 2022-02-18 209.4000 127.29716 8
14 2022-02-21 172.1818 131.32951 8
15 2022-02-22   0.0000 117.01700 8
16 2022-02-23   0.0000 101.81165 8
17 2022-02-28 264.0000 124.08030 6
18 2022-03-01 131.7500 129.55530 6
19 2022-03-03 119.3333 128.09502 7
20 2022-03-04  88.8000 110.86645 7
21 2022-03-07 152.1667 108.00714 7
22 2022-03-08  24.5000 111.50714 7

2) Another approach is to add the missing dates, fill in Value in those missing dates with NA and then use rollapplyr.

m <- merge(DF, data.frame(Date = seq(min(DF$Date), max(DF$Date), 1)), all = TRUE)
na.omit(transform(m, 
  mean14 = rollapplyr(Value, 14, mean, na.rm = TRUE, partial = TRUE)))

3) A variation of the above is to use zoo objects. Note that fortify.zoo(zz) can be used to create a data frame from a zoo object.

library(zoo)
z <- read.zoo(DF)

# 1
tt <- time(z)
w <- seq_along(tt) - findInterval(tt - 14, tt)
zz <- rollapplyr(z, w, mean)

# 2
m <- merge(z, zoo(, seq(start(z), end(z), 1)))
zz <- na.omit(rollapply(m, 14, mean, na.rm = TRUE))

Note

Lines <- "Date, Value
2022-01-28, 196.00000
2022-01-31, 104.00000
2022-02-01, 0.00000
2022-02-02, 98.00000
2022-02-03, 0.00000
2022-02-07, 139.92308
2022-02-08, 114.50000
2022-02-09, 121.64286
2022-02-10, 96.50000
2022-02-11, 151.63636
2022-02-14, 85.87500
2022-02-15, 98.90000
2022-02-18, 209.40000
2022-02-21, 172.18182
2022-02-22, 0.00000
2022-02-23, 0.00000
2022-02-28, 264.00000
2022-03-01, 131.75000
2022-03-03, 119.33333
2022-03-04, 88.80000
2022-03-07, 152.16667
2022-03-08, 24.50000"
DF <- read.csv(text = Lines)
DF$Date <- as.Date(DF$Date)