I have the following example data:
Example <- data.frame(col1 =c(1, NA, NA, 4, NA, NA, 6, NA, NA, NA, 6, 8, NA, 2, NA))
col1 |
---|
1 |
NA |
NA |
4 |
NA |
NA |
6 |
NA |
NA |
NA |
6 |
8 |
NA |
2 |
NA |
I want to fill the NAs with value from above, but only if the NAs are between 2 identical values. In this example the first NA gap from 1 to 4 should not be filled with 1s. But the gap between the first 6 and the second 6 should be filled, with 6s. All other values should stay NA. Therefore, afterwards it should look like:
col1 |
---|
1 |
NA |
NA |
4 |
NA |
NA |
6 |
6 |
6 |
6 |
6 |
8 |
NA |
2 |
NA |
But in reality I do not have only 15 observations, but over 50000. Therefore I need a efficient solution, which is more difficult than I thought. I tried to use the Fill function but was not able to come up with a solution.
CodePudding user response:
One dplyr
and zoo
option could be:
df %>%
mutate(cond = na.locf0(col1) == na.locf0(col1, fromLast = TRUE),
col1 = ifelse(cond, na.locf0(col1), col1)) %>%
select(-cond)
col1
1 1
2 NA
3 NA
4 4
5 NA
6 NA
7 6
8 6
9 6
10 6
11 6
12 8
13 NA
14 2
15 NA
CodePudding user response:
Here is a dply solution:
First I create the data in tibble format:
df <- tibble(
x = c(1, NA_real_, NA_real_,
4, NA_real_, NA_real_,
6, NA_real_, NA_real_, NA_real_,
6, 8, NA_real_, 2, NA_real_)
)
Next, I create two grouping variables which will be helpful in identifying the first and the last non-NA value.
I then save these reference values to ref_start
and ref_end
.
In the end I overwrite the values of x
:
df %>%
mutate(gr1 = cumsum(!is.na(x))) %>%
group_by(gr1) %>%
mutate(ref_start = first(x)) %>%
ungroup() %>%
mutate(gr2 = lag(gr1, default = 1)) %>%
group_by(gr2) %>%
mutate(ref_end = last(x)) %>%
ungroup() %>%
mutate(x = if_else(is.na(x) & ref_start == ref_end, ref_start, x))
# A tibble: 15 x 1
x
<dbl>
1 1
2 NA
3 NA
4 4
5 NA
6 NA
7 6
8 6
9 6
10 6
11 6
12 8
13 NA
14 2
15 NA
CodePudding user response:
df <- data.frame(col1 =c(1, NA, NA, 4, NA, NA, 6, NA, NA, NA, 6, 8, NA, 2, NA))
library(data.table)
library(magrittr)
setDT(df)[!is.na(col1), n := .N, by = col1] %>%
.[, n := nafill(n, type = "locf")] %>%
.[n == 2, col1 := nafill(col1, type = "locf")] %>%
.[, n := NULL] %>%
.[]
#> col1
#> 1: 1
#> 2: NA
#> 3: NA
#> 4: 4
#> 5: NA
#> 6: NA
#> 7: 6
#> 8: 6
#> 9: 6
#> 10: 6
#> 11: 6
#> 12: 8
#> 13: NA
#> 14: 2
#> 15: NA
Created on 2021-10-11 by the reprex package (v2.0.1)
CodePudding user response:
Here is a tidyverse
approach using dplyr
and tidyr
:
Logic:
- Create an
id
column - Remove all na rows
- Flag if next value is the same
right_join
with firstExample
dffill
downflag
and correspondingcol1.y
mutate
with anifelse
library(dplyr)
library(tidyr)
Example <- Example %>%
mutate(id=row_number())
Example %>%
na.omit() %>%
mutate(flag = ifelse(col1==lead(col1), TRUE, FALSE)) %>%
right_join(Example, by="id") %>%
arrange(id) %>%
fill(col1.y, .direction="down") %>%
fill(flag, .direction="down") %>%
mutate(col1.x = ifelse(flag==TRUE, col1.y, col1.x), .keep="unused") %>%
select(col1 = col1.x)
Output:
col1
1 1
2 NA
3 NA
4 4
5 NA
6 NA
7 6
8 6
9 6
10 6
11 6
12 8
13 NA
14 2
15 NA