Replace NA with previous mean (cummean)

I'd like some help computing the mean of the preceding values in a column, but only for rows where the value is NA.

Here is an example:

library(tidyverse)
 
iname <- rep('a', 10)
itime <- 1:10 
ival <- c(1, 2, 3, NA, 5, 6, NA, 8, 9, 10)
ds <- tibble(iname, itime, ival)
ds
# A tibble: 10 × 3
   iname itime  ival
   <chr> <int> <dbl>
 1 a         1     1
 2 a         2     2
 3 a         3     3
 4 a         4    NA
 5 a         5     5
 6 a         6     6
 7 a         7    NA
 8 a         8     8
 9 a         9     9
10 a        10    10

I'd like to end up with something like this, where the mutate() call is the part I can't work out:

ds %>%
    group_by(iname) %>%
    mutate(...) -> result
result
# A tibble: 10 × 3
   iname itime  ival
   <chr> <int> <dbl>
 1 a         1  1   
 2 a         2  2   
 3 a         3  3   
 4 a         4  2   
 5 a         5  5   
 6 a         6  6   
 7 a         7  3.17
 8 a         8  8   
 9 a         9  9   
10 a        10 10

At itime == 4, ival is NA, so in the result it is replaced with the mean of the previous values, mean(c(1, 2, 3)).

At itime == 7, I don't mind whether the computation includes the value imputed at itime == 4 or not. Both solutions are welcome, since they give slightly different results:

> mean(c(1, 2, 3, 2, 5, 6))
[1] 3.166667
> mean(c(1, 2, 3, 5, 6))
[1] 3.4

I found a similar question here: "Use tidyverse to replace NA with mean of data, by group".

Thanks in advance

CodePudding user response：

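A solution for the variant where each mean ignores previously imputed values (3.4 at itime == 7):

nacummean <- function(x) {
    nas  <- is.na(x)
    # running sum with NAs counted as 0
    cs   <- cumsum(replace(x, nas, 0))
    # divide by the number of non-NA values before each NA
    # (a leading NA gives 0 / 0, i.e. NaN)
    x[nas] <- cs[nas] / (which(nas) - cumsum(nas)[nas])
    x
}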

library(dplyr)
ds %>%
    group_by(iname) %>%
    mutate(ival = nacummean(ival))
## A tibble: 10 x 3
## Groups:   iname [1]
#   iname itime  ival
#   <chr> <int> <dbl>
# 1 a         1   1  
# 2 a         2   2  
# 3 a         3   3  
# 4 a         4   2  
# 5 a         5   5  
# 6 a         6   6  
# 7 a         7   3.4
# 8 a         8   8  
# 9 a         9   9  
#10 a        10  10

The logic is: take the cumulative sum of the variable after replacing NA values with 0, then replace each NA with the cumulative sum at that position divided by the number of non-NA values seen so far (its position in the vector minus the cumulative count of NAs).
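To make that logic concrete, here is a small sketch that prints the intermediate vectors for the example data. The name `n_seen` is only introduced here for illustration; it equals `which(nas) - cumsum(nas)[nas]` at the NA positions.

```r
x   <- c(1, 2, 3, NA, 5, 6, NA, 8, 9, 10)
nas <- is.na(x)

# cumulative sum with NAs treated as 0
cs <- cumsum(replace(x, nas, 0))
cs
# 1 3 6 6 11 17 17 25 34 44

# number of non-NA values seen up to each position (illustrative name)
n_seen <- seq_along(x) - cumsum(nas)
n_seen
# 1 2 3 3 4 5 5 6 7 8

# the replacement values for the two NA positions
fill <- cs[nas] / n_seen[nas]
fill
# 2.0 3.4
```

At position 4 this is 6 / 3 and at position 7 it is 17 / 5, which matches the output above.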

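The other variant, where each mean includes earlier imputed values, is a bit trickier as it will probably require multiple passes. Here's one attempt using a while loop:

nacummean_upd <- function(x) {
    # fill one NA per pass, so later means see earlier imputed values
    while (anyNA(x)) {
        fna <- match(NA, x)          # position of the first remaining NA
        idx <- seq_len(fna)
        x[idx] <- nacummean(x[idx])  # only x[fna] changes
    }
    x
}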

library(dplyr)
ds %>%
    group_by(iname) %>%
    mutate(ival = nacummean_upd(ival))
## A tibble: 10 x 3
## Groups:   iname [1]
#   iname itime  ival
#   <chr> <int> <dbl>
# 1 a         1  1   
# 2 a         2  2   
# 3 a         3  3   
# 4 a         4  2   
# 5 a         5  5   
# 6 a         6  6   
# 7 a         7  3.17
# 8 a         8  8   
# 9 a         9  9   
#10 a        10 10
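As a possible alternative to repeated passes, the same result can probably be reached in one pass by carrying a running sum that includes the imputed values. This is only a sketch: `nacummean_running` is a hypothetical name, not part of the answer above, and it assumes the vector does not start with NA.

```r
# Hypothetical helper: single pass with a running sum.
# Assumes x[1] is not NA; a leading NA would give 0 / 0 (NaN) and
# propagate to every later imputed value.
nacummean_running <- function(x) {
  total <- 0  # sum of all values seen so far, imputed ones included
  for (i in seq_along(x)) {
    if (is.na(x[i])) {
      # every earlier position is already filled, so the count is i - 1
      x[i] <- total / (i - 1)
    }
    total <- total + x[i]
  }
  x
}

ival <- c(1, 2, 3, NA, 5, 6, NA, 8, 9, 10)
res  <- nacummean_running(ival)
res[c(4, 7)]
```

For the example data the two filled values are 6 / 3 and 19 / 6, the same as the while loop version.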

CodePudding user response：

Scenario 1: does not depend on replaced NA values:

not_depend <- function(x) {
  na_inds <- which(is.na(x))
  # mean of the original values up to each NA; other NAs are dropped by na.rm
  x[na_inds] <- vapply(na_inds, function(y) {
    mean(x[seq_len(y)], na.rm = TRUE)
  }, numeric(1))
  x
}

ds %>% mutate(ival = not_depend(ival))
#> # A tibble: 10 × 3
#>    iname itime  ival
#>    <chr> <int> <dbl>
#>  1 a         1   1  
#>  2 a         2   2  
#>  3 a         3   3  
#>  4 a         4   2  
#>  5 a         5   5  
#>  6 a         6   6  
#>  7 a         7   3.4
#>  8 a         8   8  
#>  9 a         9   9  
#> 10 a        10  10

Scenario 2: depends on replaced NA values:

depends <- function(x) {
  # fill NAs left to right so each mean sees the earlier imputed values
  for (i in which(is.na(x))) {
    x[i] <- mean(x[seq_len(i)], na.rm = TRUE)
  }
  x
}

ds %>% mutate(ival = depends(ival))
#> # A tibble: 10 × 3
#>    iname itime  ival
#>    <chr> <int> <dbl>
#>  1 a         1  1   
#>  2 a         2  2   
#>  3 a         3  3   
#>  4 a         4  2   
#>  5 a         5  5   
#>  6 a         6  6   
#>  7 a         7  3.17
#>  8 a         8  8   
#>  9 a         9  9   
#> 10 a        10 10
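The calls above skip group_by(iname), which makes no difference with a single group. With several groups it likely matters, since the question asks for a per-group result. A minimal sketch, using made-up two-group data (`ds2`) and a hypothetical helper name that repeats the logic of depends() so the block runs on its own:

```r
library(dplyr)

# Same idea as depends(), repeated so this block is self-contained.
fill_with_running_mean <- function(x) {
  for (i in which(is.na(x))) {
    x[i] <- mean(x[seq_len(i)], na.rm = TRUE)
  }
  x
}

# Hypothetical data, not from the question.
ds2 <- tibble(
  iname = rep(c("a", "b"), each = 4),
  itime = rep(1:4, times = 2),
  ival  = c(1, 3, NA, 5, 10, 20, NA, 40)
)

# Per group: each NA only sees earlier values of its own group.
grouped <- ds2 %>%
  group_by(iname) %>%
  mutate(ival = fill_with_running_mean(ival)) %>%
  ungroup()

# No grouping: the NA in group "b" also sees the values of group "a".
ungrouped <- ds2 %>%
  mutate(ival = fill_with_running_mean(ival))

grouped$ival[c(3, 7)]
ungrouped$ival[c(3, 7)]
```

With grouping, the NA in group "b" becomes mean(c(10, 20)) = 15; without it, the value is the mean of all six earlier values, 41 / 6.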