Using rollmean filtering out NA with threshold

I am trying to apply a rolling mean with rollapply to a data frame that has large chunks of missing data, with single valid points scattered through the gaps. With my current call, a window needs only one non-NaN value to produce a result, so a single point gets spread over all the surrounding missing cells. My goal is to keep a rolling mean only when at least half of the values in its window are present, and to return NaN whenever more than 50% of the window is NaN. Below is a simple example I made:

library(zoo)

a <- c(0:20)
ind <- c(10:15)
ind2 <- c(10:12,14)
b <- a
b[ind] <- NaN
c <- a
c[ind2] <- NaN
df <- data.frame(a=a,b=b,c=c)
dfroll <- as.data.frame(rollapply(df, width = 5, mean, na.rm = TRUE, fill = NA))

This results in the following data.frame

dfroll
    a    b        c
1  NA   NA       NA
2  NA   NA       NA
3   2  2.0  2.00000
4   3  3.0  3.00000
5   4  4.0  4.00000
6   5  5.0  5.00000
7   6  6.0  6.00000
8   7  6.5  6.50000
9   8  7.0  7.00000
10  9  7.5  7.50000
11 10  8.0 10.00000
12 11  NaN 12.00000
13 12  NaN 13.00000
14 13 15.0 13.66667
15 14 15.5 14.25000
16 15 16.0 15.50000
17 16 16.5 16.00000
18 17 17.0 17.00000
19 18 18.0 18.00000
20 NA   NA       NA
21 NA   NA       NA

For this example, I would like every window (width 5) that contains 3 or more NaN values to return NaN. This would be rows 10:15 in column b and rows 10:13 in column c. How can I set up this threshold so that NaN is returned instead of the averaged value when there is too little data around the cell?

Desired result:

bdesired <- dfroll$b
bdesired[c(10:15)] <- NaN
cdesired <- dfroll$c
cdesired[c(10:13)] <- NaN
dfdesired <- data.frame(a=dfroll$a,b=bdesired,c=cdesired)

 dfdesired
    a    b        c
1  NA   NA       NA
2  NA   NA       NA
3   2  2.0  2.00000
4   3  3.0  3.00000
5   4  4.0  4.00000
6   5  5.0  5.00000
7   6  6.0  6.00000
8   7  6.5  6.50000
9   8  7.0  7.00000
10  9  NaN      NaN
11 10  NaN      NaN
12 11  NaN      NaN
13 12  NaN      NaN
14 13  NaN 13.50000
15 14  NaN 14.33333
16 15 16.0 16.00000
17 16 16.5 16.50000
18 17 17.0 17.00000
19 18 18.0 18.00000
20 NA   NA       NA
21 NA   NA       NA

CodePudding user response：

With across(everything(), ...) you can apply the operation to every column. rollsum(is.na(.x), ...) counts the NAs in each window, and the rolling mean is kept only where that count is below 3.

I did notice that some of the values for the c variable in your dfdesired differ slightly from those in your dfroll. My results match those in dfroll.

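library(tidyverse)
library(zoo)  # provides rollsum() and rollmean()

df %>%
  mutate(across(everything(),
                # NaN where the window holds 3 or more NAs, rolling mean otherwise
                ~ ifelse(rollsum(is.na(.x), 5, fill = NA) > 2,
                         NaN,
                         rollmean(.x, 5, fill = NA, na.rm = TRUE))))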


    a    b        c
1  NA   NA       NA
2  NA   NA       NA
3   2  2.0  2.00000
4   3  3.0  3.00000
5   4  4.0  4.00000
6   5  5.0  5.00000
7   6  6.0  6.00000
8   7  6.5  6.50000
9   8  7.0  7.00000
10  9  NaN      NaN
11 10  NaN      NaN
12 11  NaN      NaN
13 12  NaN      NaN
14 13  NaN 13.66667
15 14  NaN 14.25000
16 15 16.0 15.50000
17 16 16.5 16.00000
18 17 17.0 17.00000
19 18 18.0 18.00000
20 NA   NA       NA
21 NA   NA       NA
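
To see which cells the NA-count test masks, the count per window can be computed on its own. This is only a rough sketch: it rebuilds column b from the question and uses rollapply with sum in place of rollsum, and the names na_count and masked_rows are illustrative, not part of the answer above.

```r
library(zoo)

# Column b from the question: values 0..20 with positions 10..15 missing
b <- 0:20
b[10:15] <- NaN

# Number of missing values in each centered window of width 5
# (the two positions at each end have no full window and get NA)
na_count <- rollapply(is.na(b), width = 5, FUN = sum, fill = NA)

# Positions whose window holds 3 or more missing values
masked_rows <- which(na_count >= 3)
print(masked_rows)  # 10 11 12 13 14 15
```

These are the same rows the question lists for column b.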

CodePudding user response：

1) Define a function that returns NaN if its input contains 3 or more NAs and the mean of the non-NA values otherwise, then use it with rollapply. Convert the result with as.data.frame if a data frame is needed; since the data is entirely numeric, leaving it as a matrix may be sufficient.

Mean <- function(x) if (sum(is.na(x)) >= 3) NaN else mean(x, na.rm = TRUE)
rollapply(df, 5, Mean, fill = NA)
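
If the window width may change, the cutoff can be written as a fraction of the window rather than a fixed count of 3. The following is a sketch under that assumption; mean_if_enough and its max_na_frac argument are hypothetical names introduced here, and the data frame is rebuilt from the question so the snippet runs on its own.

```r
library(zoo)

# Hypothetical helper: mean of the non-missing values, or NaN when more
# than `max_na_frac` of the window is missing
mean_if_enough <- function(x, max_na_frac = 0.5) {
  if (mean(is.na(x)) > max_na_frac) NaN else mean(x, na.rm = TRUE)
}

# Example data from the question
a <- 0:20
b <- a
b[10:15] <- NaN
cc <- a
cc[c(10:12, 14)] <- NaN
df <- data.frame(a = a, b = b, c = cc)

res <- as.data.frame(rollapply(df, width = 5, FUN = mean_if_enough, fill = NA))
```

With width = 5 the default of 0.5 reproduces the 3-or-more rule, since 3/5 exceeds one half while 2/5 does not.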

2) Another possibility is to first run a rolling mean and then add NaN to those output cells for which there were 3 or more NA's in the input window and 0 to the remaining cells. Again use as.data.frame on the result if a data frame is needed.

rollmean(df, 5, na.rm = TRUE, fill = NA) +
  ifelse(rollsum(is.na(df), 5, fill = NA) > 2, NaN, 0)
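
As a quick sanity check, the two approaches can be compared on the example data. This sketch swaps in rollapply with mean and sum for rollmean and rollsum, so it does not depend on how those shortcuts treat missing values; res1, res2, roll_mean and na_count are illustrative names.

```r
library(zoo)

# Example data from the question
a <- 0:20
b <- a
b[10:15] <- NaN
cc <- a
cc[c(10:12, 14)] <- NaN
df <- data.frame(a = a, b = b, c = cc)

# Approach 1: apply the threshold inside the window function
Mean <- function(x) if (sum(is.na(x)) >= 3) NaN else mean(x, na.rm = TRUE)
res1 <- rollapply(df, 5, Mean, fill = NA)

# Approach 2: rolling mean plus a mask that is NaN where a window has 3+ NAs
roll_mean <- rollapply(df, 5, mean, na.rm = TRUE, fill = NA)
na_count <- rollapply(is.na(df), 5, sum, fill = NA)
res2 <- roll_mean + ifelse(na_count > 2, NaN, 0)

# TRUE when both approaches give the same values cell by cell
same <- isTRUE(all.equal(res1, res2, check.attributes = FALSE))
print(same)
```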