Consider this data:
data <- data.frame(
group = c(rep("A", 10), rep("B", 10)),
year = rep(2016:2025, 2),
value = c(10, 20, 30, 40, NA, NA, NA, NA, NA, NA,
70, 80, NA, NA, NA, NA, NA, NA, NA, NA)
)
data
#> group year value
#> 1 A 2016 10
#> 2 A 2017 20
#> 3 A 2018 30
#> 4 A 2019 40
#> 5 A 2020 NA
#> 6 A 2021 NA
#> 7 A 2022 NA
#> 8 A 2023 NA
#> 9 A 2024 NA
#> 10 A 2025 NA
#> 11 B 2016 70
#> 12 B 2017 80
#> 13 B 2018 NA
#> 14 B 2019 NA
#> 15 B 2020 NA
#> 16 B 2021 NA
#> 17 B 2022 NA
#> 18 B 2023 NA
#> 19 B 2024 NA
#> 20 B 2025 NA
Now I'd like to fill in the missing values by applying an arbitrary function to the last non-missing value, recursively.
For example, say I want the values to increase by 10 points each year.
Rows where value is not NA should remain unmodified.
Starting from the first row where value is NA, the rule $value_t = value_{t-1} + 10$ is applied.
A naive attempt would be to use dplyr::lag, but this only works for the first missing value in each group, because lag is vectorized: it operates on the original value vector and does not recurse over the previously filled values:
library(dplyr)
data |>
group_by(group) |>
mutate(value_fix = dplyr::lag(value) + 10)
#> # A tibble: 20 × 4
#> # Groups: group [2]
#> group year value value_fix
#> <chr> <int> <dbl> <dbl>
#> 1 A 2016 10 NA
#> 2 A 2017 20 20
#> 3 A 2018 30 30
#> 4 A 2019 40 40
#> 5 A 2020 NA 50
#> 6 A 2021 NA NA
#> 7 A 2022 NA NA
#> 8 A 2023 NA NA
#> 9 A 2024 NA NA
#> 10 A 2025 NA NA
#> 11 B 2016 70 NA
#> 12 B 2017 80 80
#> 13 B 2018 NA 90
#> 14 B 2019 NA NA
#> 15 B 2020 NA NA
#> 16 B 2021 NA NA
#> 17 B 2022 NA NA
#> 18 B 2023 NA NA
#> 19 B 2024 NA NA
#> 20 B 2025 NA NA
This is where I thought tidyr could help, because the task is somewhat similar to tidyr::fill:
data |>
group_by(group) |>
tidyr::fill(value)
but ideally with an .f argument to be applied recursively to the last value.
There does not seem to be anything like that.
Googling around, I came up with this solution:
data |>
group_by(group) |>
mutate(last_value = case_when(
value == dplyr::last(na.omit(value)) ~ value,
TRUE ~ NA_real_
)) |>
mutate(value_fix = purrr::accumulate(
.x = last_value,
.f = ~ coalesce(.x + 10, .y)
))
#> # A tibble: 20 × 5
#> # Groups: group [2]
#> group year value last_value value_fix
#> <chr> <int> <dbl> <dbl> <dbl>
#> 1 A 2016 10 NA NA
#> 2 A 2017 20 NA NA
#> 3 A 2018 30 NA NA
#> 4 A 2019 40 40 40
#> 5 A 2020 NA NA 50
#> 6 A 2021 NA NA 60
#> 7 A 2022 NA NA 70
#> 8 A 2023 NA NA 80
#> 9 A 2024 NA NA 90
#> 10 A 2025 NA NA 100
#> 11 B 2016 70 NA NA
#> 12 B 2017 80 80 80
#> 13 B 2018 NA NA 90
#> 14 B 2019 NA NA 100
#> 15 B 2020 NA NA 110
#> 16 B 2021 NA NA 120
#> 17 B 2022 NA NA 130
#> 18 B 2023 NA NA 140
#> 19 B 2024 NA NA 150
#> 20 B 2025 NA NA 160
This works, but it seems hacky and is not easy to read. It would be cleaner to just write the loop and be done with it.
I really thought that for such a simple case there would be a built-in (vectorized, readable) way to do this in the tidyverse, but I could not find one. Am I missing something? Any ideas on how to do this better?
Created on 2022-08-30 with reprex v2.0.2
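For reference, the plain loop mentioned above could look roughly like the sketch below. The names fill_forward_loop and step are hypothetical, not part of any package, and the sketch assumes each group starts with a non-missing value (a leading NA is simply left as is).

```r
# Hypothetical helper: walk the vector once and fill each NA by applying
# `step` to the previous (possibly already filled) value.
fill_forward_loop <- function(x, step = function(prev) prev + 10) {
  for (i in seq_along(x)[-1]) {
    if (is.na(x[i]) && !is.na(x[i - 1])) {
      x[i] <- step(x[i - 1])
    }
  }
  x
}

# Same data as in the question, written compactly.
data <- data.frame(
  group = rep(c("A", "B"), each = 10),
  year  = rep(2016:2025, 2),
  value = c(10, 20, 30, 40, rep(NA, 6), 70, 80, rep(NA, 8))
)

# ave() applies the helper separately within each group.
data$value_fix <- ave(data$value, data$group, FUN = fill_forward_loop)
```

Because the loop reads x[i - 1] after it has been filled, each step builds on the previous result, which is exactly what lag() cannot do.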
CodePudding user response:
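library(dplyr)
library(purrr)

# x is the accumulated (previous) value, y is the current one:
# keep y when it is present, otherwise step from the previous value
custom_fun <- function(x, y) {
  if (is.na(y)) x + 10 else y
}
data %>%
  group_by(group) %>%
  mutate(value = accumulate(value, custom_fun))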
# Groups: group [2]
group year value
<chr> <int> <dbl>
1 A 2016 10
2 A 2017 20
3 A 2018 30
4 A 2019 40
5 A 2020 50
6 A 2021 60
7 A 2022 70
8 A 2023 80
9 A 2024 90
10 A 2025 100
11 B 2016 70
12 B 2017 80
13 B 2018 90
14 B 2019 100
15 B 2020 110
16 B 2021 120
17 B 2022 130
18 B 2023 140
19 B 2024 150
20 B 2025 160
CodePudding user response:
A base solution with ave() + Reduce(accumulate = TRUE):
transform(data, value = ave(value, group, FUN = \(val) {
Reduce(\(x, y) if (is.na(y)) x + 10 else y, val, accumulate = TRUE)
}))
group year value
1 A 2016 10
2 A 2017 20
3 A 2018 30
4 A 2019 40
5 A 2020 50
6 A 2021 60
7 A 2022 70
8 A 2023 80
9 A 2024 90
10 A 2025 100
11 B 2016 70
12 B 2017 80
13 B 2018 90
14 B 2019 100
15 B 2020 110
16 B 2021 120
17 B 2022 130
18 B 2023 140
19 B 2024 150
20 B 2025 160
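The question asked for something like an .f argument. Building on the Reduce() idea, a small wrapper can take the step function as a parameter. This is only a sketch: fill_recursive and its argument f are made-up names, and it assumes the vector starts with a non-missing value.

```r
# Hypothetical wrapper: `f` is applied to the previous value whenever the
# current value is NA; non-missing values are kept unchanged.
fill_recursive <- function(x, f) {
  Reduce(function(prev, cur) if (is.na(cur)) f(prev) else cur,
         x, accumulate = TRUE)
}

value <- c(70, 80, NA, NA, NA)

plus_ten  <- fill_recursive(value, function(v) v + 10)    # add 10 points per step
growth_5p <- fill_recursive(value, function(v) v * 1.05)  # grow by 5% per step
```

For grouped data this can be combined with ave() in the same way as above, for example ave(value, group, FUN = \(v) fill_recursive(v, \(p) p + 10)).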
CodePudding user response:
You could also write a vectorized function:
fun <- function(x){
idx <- is.na(x)
b <- rle(idx)
id2 <- cumsum(b$lengths)[!b$values]
x[idx] <- sequence(b$lengths[b$values], x[id2] + 10, by = 10)
x
}
transform(data, value = fun(value))
group year value
1 A 2016 10
2 A 2017 20
3 A 2018 30
4 A 2019 40
5 A 2020 50
6 A 2021 60
7 A 2022 70
8 A 2023 80
9 A 2024 90
10 A 2025 100
11 B 2016 70
12 B 2017 80
13 B 2018 90
14 B 2019 100
15 B 2020 110
16 B 2021 120
17 B 2022 130
18 B 2023 140
19 B 2024 150
20 B 2025 160