Home > Software design >  Fill missing values in dataframe in R
Fill missing values in dataframe in R

Time:11-14

I have the following problem: I have a dataframe with several columns. (See below) I am trying to fill in missing values. Concretely, I only want to fill in values when I have a datapoint before and one after the missing value and when they are equal. Note, I also have different IDs, so I want to do this for each ID. In the example below for example I would like to fill in a 1 for row 2 in year 2017 and in the penultimate row for ID 2 I would like to fill in a 2 (in both cases I have the same value "surrounding" the missing value). I do not want to fill in the last rows' value.

ID Year value
1 2016 1
1 2017 -8
1 2018 1
2 2016 -8
2 2017 2
2 2018 2
2 2019 - 8
2 2020 2
3 2017 4
3 2018 4
3 2019 -9

I am somewhat clueless as to how to approach this problem. I have tried using group_by but I don't see a clear way to do it.

Thank you so much for the help!

CodePudding user response:

Using an if_else and lead and lag you could do:

library(dplyr, w = FALSE)

dat |>
  group_by(ID) |>
  mutate(value = if_else(value < 0 &
    (lag(value, default = "-99") == lead(value, default = "-999")),
  lag(value), value
  )) |>
  ungroup()
#> # A tibble: 11 × 3
#>       ID  Year value
#>    <int> <int> <chr>
#>  1     1  2016 1    
#>  2     1  2017 1    
#>  3     1  2018 1    
#>  4     2  2016 -8   
#>  5     2  2017 2    
#>  6     2  2018 2    
#>  7     2  2019 2    
#>  8     2  2020 2    
#>  9     3  2017 4    
#> 10     3  2018 4    
#> 11     3  2019 -9

Note 1: Instead of the default NA I used a -99 and a -999 as the default for lag and lead. Otherwise the if_else would replace some values with an NA.

Note 2:: In row 7 of your example data the missing value was a - 8. I assumed that this was a typo and replaced it by a -8.

DATA

dat <- data.frame(
  stringsAsFactors = FALSE,
  ID = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L),
  Year = c(
    2016L, 2017L, 2018L, 2016L,
    2017L, 2018L, 2019L, 2020L, 2017L, 2018L, 2019L
  ),
  value = c(
    "1", "-8", "1", "-8", "2", "2",
    "-8", "2", "4", "4", "-9"
  )
)
  • Related