I'd like some help to compute lag mean of a column just if a value is NA
Follow the example
library(tidyverse)
iname <- rep('a', 10)
itime <- 1:10
ival <- c(1, 2, 3, NA, 5, 6, NA, 8, 9, 10)
ds <- tibble(iname, itime, ival)
ds
# A tibble: 10 × 3
iname itime ival
<chr> <int> <dbl>
1 a 1 1
2 a 2 2
3 a 3 3
4 a 4 NA
5 a 5 5
6 a 6 6
7 a 7 NA
8 a 8 8
9 a 9 9
10 a 10 10
ds %>%
group_by(iname) %>%
mutate(...) -> result
result
# A tibble: 10 × 3
iname itime ival
<chr> <int> <dbl>
1 a 1 1
2 a 2 2
3 a 3 3
4 a 4 2
5 a 5 5
6 a 6 6
7 a 7 3.16
8 a 8 8
9 a 9 9
10 a 10 10
The itime==4
has an empty value in ival
, so in the result, this value were calculated using the previous values mean(1,2,3)
.
In the itime==7
I don't really care if the computation includes or not the values imputed by the mean. But both solutions are welcome, cos it has some difference.
> mean(c(1, 2, 3, 2, 5, 6))
[1] 3.166667
> mean(c(1, 2, 3, 5, 6))
[1] 3.4
I found an equivalent question here Use tidyverse to replace NA with mean of data, by group
Thanks in advance
CodePudding user response:
A solution for the result where each subsequent mean doesn't rely on the previously updated one:
nacummean <- function(x) {
nas <- is.na(x)
cs <- cumsum(replace(x, nas, 0))
x[nas] <- cs[nas] / (which(nas) - cumsum(nas)[nas])
x
}
library(dplyr)
ds %>%
group_by(iname) %>%
mutate(ival = nacummean(ival))
## A tibble: 10 x 3
## Groups: iname [1]
# iname itime ival
# <chr> <int> <dbl>
# 1 a 1 1
# 2 a 2 2
# 3 a 3 3
# 4 a 4 2
# 5 a 5 5
# 6 a 6 6
# 7 a 7 3.4
# 8 a 8 8
# 9 a 9 9
#10 a 10 10
The logic is: take a cumulative sum along the variable, after replacing NA
values with 0
, then replace all the NA
s with the cumsum value at that point, divided by its position in the vector minus the cumulative number of NA
values.
The other calculation is a bit trickier as it will probably require multiple passes. Here's one attempt using a while
loop:
nacummean_upd <- function(x) {
while(sum(is.na(x)) > 0) {
fna <- match(NA,x)
xs <- head(x, fna)
x[seq(1,fna)] <- nacummean(xs)
}
x
}
library(dplyr)
ds %>%
group_by(iname) %>%
mutate(ival = nacummean_upd(ival))
## A tibble: 10 x 3
## Groups: iname [1]
# iname itime ival
# <chr> <int> <dbl>
# 1 a 1 1
# 2 a 2 2
# 3 a 3 3
# 4 a 4 2
# 5 a 5 5
# 6 a 6 6
# 7 a 7 3.17
# 8 a 8 8
# 9 a 9 9
#10 a 10 10
CodePudding user response:
scenario 1: doesn't depend on replaced NA values
not_depend <- function(x) {
na_inds <- which(is.na(x))
x[na_inds] <- sapply(na_inds, function(y) {
mean(x[seq_len(y)], na.rm = TRUE)
})
x
}
ds %>% mutate(ival = not_depend(ival))
#> # A tibble: 10 × 3
#> iname itime ival
#> <chr> <int> <dbl>
#> 1 a 1 1
#> 2 a 2 2
#> 3 a 3 3
#> 4 a 4 2
#> 5 a 5 5
#> 6 a 6 6
#> 7 a 7 3.4
#> 8 a 8 8
#> 9 a 9 9
#> 10 a 10 10
scenario 2: Depends on replaced NA values:
depends <- function(x) {
na_inds <- which(is.na(x))
sapply(na_inds, function(y){
x[y] <<- mean(x[seq_len(y)], na.rm = TRUE)
})
x
}
ds %>% mutate(ival = depends(ival))
#> # A tibble: 10 × 3
#> iname itime ival
#> <chr> <int> <dbl>
#> 1 a 1 1
#> 2 a 2 2
#> 3 a 3 3
#> 4 a 4 2
#> 5 a 5 5
#> 6 a 6 6
#> 7 a 7 3.17
#> 8 a 8 8
#> 9 a 9 9
#> 10 a 10 10