fill a column down, recursively applying a function to the previous value, using R

Time:08-30

Consider this data

data <- data.frame(
  group = c(rep("A", 10), rep("B", 10)),
  year = rep(2016:2025, 2),
  value = c(10, 20, 30, 40, NA, NA, NA, NA, NA, NA, 
            70, 80, NA, NA, NA, NA, NA, NA, NA, NA)
)
data
#>    group year value
#> 1      A 2016    10
#> 2      A 2017    20
#> 3      A 2018    30
#> 4      A 2019    40
#> 5      A 2020    NA
#> 6      A 2021    NA
#> 7      A 2022    NA
#> 8      A 2023    NA
#> 9      A 2024    NA
#> 10     A 2025    NA
#> 11     B 2016    70
#> 12     B 2017    80
#> 13     B 2018    NA
#> 14     B 2019    NA
#> 15     B 2020    NA
#> 16     B 2021    NA
#> 17     B 2022    NA
#> 18     B 2023    NA
#> 19     B 2024    NA
#> 20     B 2025    NA

Now I’d like to fill in the missing values by recursively applying an arbitrary function to the last non-missing value.

For example, let’s say that I want to let the values increase by 10 points, yearly.

So rows where value is not NA should remain unmodified. From the first row where value is NA onward, each value is computed from the previous one: value_t = value_{t-1} + 10

A naive attempt would be to use dplyr::lag, but this only works for the first missing value: lag is vectorized over the original value vector, so it does not recurse over the newly computed values

library(dplyr)
data |> 
  group_by(group) |> 
  mutate(value_fix = dplyr::lag(value) + 10)
#> # A tibble: 20 × 4
#> # Groups:   group [2]
#>    group  year value value_fix
#>    <chr> <int> <dbl>     <dbl>
#>  1 A      2016    10        NA
#>  2 A      2017    20        20
#>  3 A      2018    30        30
#>  4 A      2019    40        40
#>  5 A      2020    NA        50
#>  6 A      2021    NA        NA
#>  7 A      2022    NA        NA
#>  8 A      2023    NA        NA
#>  9 A      2024    NA        NA
#> 10 A      2025    NA        NA
#> 11 B      2016    70        NA
#> 12 B      2017    80        80
#> 13 B      2018    NA        90
#> 14 B      2019    NA        NA
#> 15 B      2020    NA        NA
#> 16 B      2021    NA        NA
#> 17 B      2022    NA        NA
#> 18 B      2023    NA        NA
#> 19 B      2024    NA        NA
#> 20 B      2025    NA        NA

This is where I thought tidyr could help, because it is somewhat similar to tidyr::fill

data |> 
  group_by(group) |> 
  tidyr::fill(value) 

but ideally, with an .f argument to be applied recursively to the last value. There does not seem to be something like that.

Googling around I came up with this solution

data |> 
  group_by(group) |> 
  mutate(last_value = case_when(
    value == dplyr::last(na.omit(value)) ~ value,
    TRUE ~ NA_real_
  )) |> 
  mutate(value_fix = purrr::accumulate(
      .x = last_value,
      .f = ~ coalesce(.x + 10, .y)
  ))
#> # A tibble: 20 × 5
#> # Groups:   group [2]
#>    group  year value last_value value_fix
#>    <chr> <int> <dbl>      <dbl>     <dbl>
#>  1 A      2016    10         NA        NA
#>  2 A      2017    20         NA        NA
#>  3 A      2018    30         NA        NA
#>  4 A      2019    40         40        40
#>  5 A      2020    NA         NA        50
#>  6 A      2021    NA         NA        60
#>  7 A      2022    NA         NA        70
#>  8 A      2023    NA         NA        80
#>  9 A      2024    NA         NA        90
#> 10 A      2025    NA         NA       100
#> 11 B      2016    70         NA        NA
#> 12 B      2017    80         80        80
#> 13 B      2018    NA         NA        90
#> 14 B      2019    NA         NA       100
#> 15 B      2020    NA         NA       110
#> 16 B      2021    NA         NA       120
#> 17 B      2022    NA         NA       130
#> 18 B      2023    NA         NA       140
#> 19 B      2024    NA         NA       150
#> 20 B      2025    NA         NA       160

Which works, but seems kind of hacky and not easy to read. It would be cleaner to just write the loop and be happy.
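For comparison, the plain loop could look something like the sketch below. The name fill_recursive and its .f argument are hypothetical (they are not part of dplyr, tidyr or purrr); the helper simply applies .f to the previous element whenever the current one is NA.

```r
# Hypothetical helper (not from any package): fill NA entries by
# applying .f to the element just before them, left to right.
fill_recursive <- function(x, .f) {
  # Start at the second element: a leading NA has no previous value
  # and is left as NA.
  for (i in seq_along(x)[-1]) {
    if (is.na(x[i])) x[i] <- .f(x[i - 1])
  }
  x
}

value <- c(10, 20, 30, 40, NA, NA, NA)
fill_recursive(value, function(prev) prev + 10)
#> [1] 10 20 30 40 50 60 70
```

Inside a grouped mutate() this would presumably be called as mutate(value = fill_recursive(value, \(prev) prev + 10)), so each group is filled independently.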

I really thought that for such a simple case there would be a built-in (vectorized, readable) way to do it in the tidyverse, but I could not find one. Am I missing something? Any ideas on how to do this better?

Created on 2022-08-30 with reprex v2.0.2

CodePudding user response:

library(dplyr)
library(purrr)

# x is the result accumulated so far, y is the current element
custom_fun <- function(x, y) {
  if(is.na(y)) x + 10 else y
}


data %>%
  group_by(group)%>%
  mutate(value = accumulate(value, custom_fun))

# Groups:   group [2]
   group  year value
   <chr> <int> <dbl>
 1 A      2016    10
 2 A      2017    20
 3 A      2018    30
 4 A      2019    40
 5 A      2020    50
 6 A      2021    60
 7 A      2022    70
 8 A      2023    80
 9 A      2024    90
10 A      2025   100
11 B      2016    70
12 B      2017    80
13 B      2018    90
14 B      2019   100
15 B      2020   110
16 B      2021   120
17 B      2022   130
18 B      2023   140
19 B      2024   150
20 B      2025   160

CodePudding user response:

A base solution with ave() + Reduce(accumulate = TRUE):

transform(data, value = ave(value, group, FUN = \(val) {
    Reduce(\(x, y) if(is.na(y)) x + 10 else y, val, accumulate = TRUE)
}))

   group year value
1      A 2016    10
2      A 2017    20
3      A 2018    30
4      A 2019    40
5      A 2020    50
6      A 2021    60
7      A 2022    70
8      A 2023    80
9      A 2024    90
10     A 2025   100
11     B 2016    70
12     B 2017    80
13     B 2018    90
14     B 2019   100
15     B 2020   110
16     B 2021   120
17     B 2022   130
18     B 2023   140
19     B 2024   150
20     B 2025   160
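Since the question asks for an arbitrary function, the step passed to Reduce() could be built by a small function factory instead of hard-coding + 10. This is only a sketch: fill_with is a made-up name, and it assumes the first element of the vector is not NA.

```r
# Hypothetical factory: turns a one-argument function f into a
# two-argument step for Reduce(). x is the value accumulated so far,
# y is the current element of the vector.
fill_with <- function(f) {
  function(x, y) if (is.na(y)) f(x) else y
}

# Double the previous value instead of adding 10
Reduce(fill_with(function(v) v * 2), c(100, NA, NA), accumulate = TRUE)
#> [1] 100 200 400
```

The same step function should also work with purrr::accumulate(), since it calls the function with the accumulated value first and the current element second.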

CodePudding user response:

You could also write a vectorized function:

fun <- function(x){
  idx <- is.na(x)
  b <- rle(idx)
  id2 <- cumsum(b$lengths)[!b$values]
  x[idx] <- sequence(b$lengths[b$values], x[id2] + 10, by=10)
  x
}


transform(data, value = fun(value))

  group year value
1      A 2016    10
2      A 2017    20
3      A 2018    30
4      A 2019    40
5      A 2020    50
6      A 2021    60
7      A 2022    70
8      A 2023    80
9      A 2024    90
10     A 2025   100
11     B 2016    70
12     B 2017    80
13     B 2018    90
14     B 2019   100
15     B 2020   110
16     B 2021   120
17     B 2022   130
18     B 2023   140
19     B 2024   150
20     B 2025   160
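To see what the function above does, here are its intermediate steps on a small made-up vector. Two assumptions worth stating: the from and by arguments of sequence() need R 4.0.0 or later, and as far as I can tell the approach relies on the vector starting with an observed value and on every run of observed values being followed by a run of NAs.

```r
x <- c(10, 20, NA, NA, NA, 70, NA, NA)

idx <- is.na(x)
b <- rle(idx)
# b$lengths: 2 3 1 2
# b$values:  FALSE TRUE FALSE TRUE

# Position of the last observed value before each run of NAs
id2 <- cumsum(b$lengths)[!b$values]   # 2 6

# One arithmetic sequence per NA run, starting 10 above the last
# observed value and increasing by 10
filled <- sequence(b$lengths[b$values], x[id2] + 10, by = 10)
filled                                # 30 40 50 80 90

x[idx] <- filled
x                                     # 10 20 30 40 50 70 80 90
```

Note that this version is specific to a constant increment; an arbitrary function would still need the accumulate()/Reduce() style above.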