Calculate rolling mean with a minimum number of non-na values in the moving window

Time:12-04

As this example shows, it is easy to calculate a running mean:

library(dplyr)  # provides %>% and mutate
data <- data.frame(dats=c(3,4,NA,4,NA,NA,6,NA,8,1,4,NA,2,NA,NA,6,NA,NA,9,5,NA,8,NA,3))
data <- data %>% mutate(rmean = caTools::runmean(dats, 3, endrule="constant"))

But in some cases the mean is calculated from a single non-NA value in the window. How can I prevent this and get the running mean only when at least a certain number of non-NA values in the window are used in the calculation?

CodePudding user response:

If you don't mind using the zoo library, one solution is to define a custom function:

# With a window of width 3, "> 2" passes only when all 3 values are non-NA
rolling_mean <- function(x) {
    ifelse(length(na.omit(x)) > 2, mean(x), "too_many_missing")
}

Then roll over the dataset using rollapply:

library(zoo)
library(dplyr)
data %>% 
    mutate(remean = rollapply(dats, width=3, FUN=rolling_mean,  partial = 2)) %>%
    na.fill(c("extend", NA))

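Of course, you can change the threshold in the custom function to alter the required number of non-NA values. If you lower it below the window width, also pass na.rm = TRUE to mean(), since mean(x) returns NA whenever x contains an NA.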

Also, you likely want to change the "too_many_missing" string to NA to avoid coercing the whole column to character. With the string left in place, the output is:

   dats           remean
1     3             <NA>
2     4 too_many_missing
3    NA too_many_missing
4     4 too_many_missing
5    NA too_many_missing
6    NA too_many_missing
7     6 too_many_missing
8    NA too_many_missing
9     8 too_many_missing
10    1 4.33333333333333
11    4 too_many_missing
12   NA too_many_missing
13    2 too_many_missing
14   NA too_many_missing
15   NA too_many_missing
16    6 too_many_missing
17   NA too_many_missing
18   NA too_many_missing
19    9 too_many_missing
20    5 too_many_missing
21   NA too_many_missing
22    8 too_many_missing
23   NA too_many_missing
24    3             <NA>
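Putting the two suggestions above together (return NA instead of a string, and make the threshold adjustable), a minimal sketch might look like this. The helper name mean_min_obs and its min_obs argument are made up for illustration, and fill = NA is used here instead of partial windows, so the two end positions are simply left as NA.

```r
library(zoo)

# Hypothetical helper: mean of the non-NA values in a window, or NA when
# fewer than `min_obs` non-NA values are present.
mean_min_obs <- function(x, min_obs = 2) {
  if (sum(!is.na(x)) >= min_obs) mean(x, na.rm = TRUE) else NA_real_
}

# First 12 values of the question's data
dats <- c(3, 4, NA, 4, NA, NA, 6, NA, 8, 1, 4, NA)

# Centred window of width 3; extra arguments (min_obs) are passed on to FUN.
# fill = NA pads the positions where a full window does not fit.
res <- rollapply(dats, width = 3, FUN = mean_min_obs, min_obs = 2, fill = NA)
print(res)
```

Because the result stays numeric, the column is not coerced to character. Position 3 (window 4, NA, 4) gives 4, while position 4 (window NA, 4, NA) gives NA because only one value is present.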

CodePudding user response:

rmean uses rollapply and gives NA unless the window holds at least 2 non-NA values; rmean2 gives the same result using two calls to runmean; and rmean3 is the value calculated in the question.

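library(zoo)
library(dplyr)
library(caTools)  # for runmean

# Mean of the non-NA values, or NA if the window holds fewer than 2 of them
mean2 <- function(x) if (sum(!is.na(x)) >= 2) mean(x, na.rm = TRUE) else NA
data %>%
  mutate(
    rmean = rollapply(dats, 3, mean2, partial = TRUE) |> na.fill(c("extend", NA)),
    # the running mean of the non-NA indicator is the fraction of non-NA values per window
    rmean2 = ifelse(runmean(!is.na(dats), 3, endrule = "constant") > 2/3 - 1e-5,
                    runmean(dats, 3, endrule = "constant"), NA),
    rmean3 = runmean(dats, 3, endrule = "constant"))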

giving:

   dats    rmean   rmean2   rmean3
1     3 3.500000 3.500000 3.500000
2     4 3.500000 3.500000 3.500000
3    NA 4.000000 4.000000 4.000000
4     4       NA       NA 4.000000
5    NA       NA       NA 4.000000
6    NA       NA       NA 6.000000
7     6       NA       NA 6.000000
8    NA 7.000000 7.000000 7.000000
9     8 4.500000 4.500000 4.500000
10    1 4.333333 4.333333 4.333333
11    4 2.500000 2.500000 2.500000
12   NA 3.000000 3.000000 3.000000
13    2       NA       NA 2.000000
14   NA       NA       NA 2.000000
15   NA       NA       NA 6.000000
16    6       NA       NA 6.000000
17   NA       NA       NA 6.000000
18   NA       NA       NA 9.000000
19    9 7.000000 7.000000 7.000000
20    5 7.000000 7.000000 7.000000
21   NA 6.500000 6.500000 6.500000
22    8       NA       NA 8.000000
23   NA 5.500000 5.500000 5.500000
24    3 5.500000 5.500000 5.500000
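The rmean2 column works because the running mean of the indicator !is.na(dats) is the fraction of non-NA values in each window, so comparing it with 2/3 (minus a small tolerance for floating-point error) asks for at least 2 of 3. The same counting idea can be sketched in base R. Note that this swaps in stats::filter for caTools::runmean, so it is an illustration of the idea rather than the code above, and the end positions are left as NA instead of being extended.

```r
# First 10 values of the question's data
dats <- c(3, 4, NA, 4, NA, NA, 6, NA, 8, 1)
k <- 3        # window width
min_obs <- 2  # required number of non-NA values per window

# Number of non-NA values in each centred window (NA at the two ends)
n_obs <- as.numeric(stats::filter(as.numeric(!is.na(dats)), rep(1, k), sides = 2))

# Sum of the non-NA values in each window, treating NA as 0
total <- as.numeric(stats::filter(ifelse(is.na(dats), 0, dats), rep(1, k), sides = 2))

# Keep the mean only where enough values were observed
rmean_base <- ifelse(n_obs >= min_obs, total / n_obs, NA)
print(rmean_base)
```

Comparing an integer count with min_obs avoids the floating-point tolerance that the 2/3 comparison needs.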