I have a dataframe in which a column (x) has some missing values and I would like to create a new column (y) replacing these NAs with the nearest value already existing. Also I would like that if the distance is the same both ways, I get the mean of both.
Here is an example of what I mean:
data <- data.frame(x = c(2, 3, 3, NA, NA, 4, NA, 3, 2, NA, NA, NA, 4))
The dataframe I would like to obtain would be that:
x | y |
---|---|
2 | 2 |
3 | 3 |
3 | 3 |
NA | 3 |
NA | 4 |
4 | 4 |
NA | 3.5 |
3 | 3 |
2 | 2 |
NA | 2 |
NA | 3 |
NA | 4 |
4 | 4 |
Thanks in advance
CodePudding user response:
Your case is slightly different from the usual cases, where interpolation is needed from the first NA. Here, interpolation is necessary only when the NA sequence is 1, 3, or more, and for the first and last value in a sequence of NA, you need the closest non-NA.
You can use lead
and lag
to do so, and then apply interpolation with zoo::na.approx
.
library(dplyr)
library(zoo)
data %>%
mutate(y = case_when(is.na(x) & !is.na(lag(x)) & is.na(lead(x)) ~ lag(x),
is.na(x) & is.na(lag(x)) & !is.na(lead(x)) ~ lead(x),
TRUE ~ x) %>%
na.approx(.))
output
x y
1 2 2.0
2 3 3.0
3 3 3.0
4 NA 3.0
5 NA 4.0
6 4 4.0
7 NA 3.5
8 3 3.0
9 2 2.0
10 NA 2.0
11 NA 3.0
12 NA 4.0
13 4 4.0