Home > other >  R: Adding Missing Rows to a Dataset
R: Adding Missing Rows to a Dataset

Time:12-25

I am working with the R.

I have a dataset that looks something like this:

id = c("john", "john", "john", "john","john", "james", "james", "james", "james", "james")
year = c(2010,2011, 2014, 2016,2017, 2013, 2016, 2017, 2018,2020)
var = c(1,1,1,1,1,1,1,1,1,1)
my_data = data.frame(id, year, var)

> my_data
      id year var
1   john 2010   1
2   john 2011   1
3   john 2014   1
4   john 2016   1
5   john 2017   1
6  james 2013   1
7  james 2016   1
8  james 2017   1
9  james 2018   1
10 james 2020   1

As we can see, there are some missing years (i.e. non-consecutive years) in this dataset - for each ID, I am trying to add rows corresponding to these missing years and assign the "var" variable as "0" in these rows.

As an example, this would look something like this for the first ID:

    id year var
1 john 2010   1
2 john 2011   1
3 john 2012   0
4 john 2013   0
5 john 2014   1
6 john 2015   0
7 john 2016   1
8 john 2017   1

I tried to do this with the following code:

# https://stackoverflow.com/questions/74365569/backfilling-rows-based-on-max-conditions-in-r

library(dplyr)
library(tidyr)

my_data %>% 
    group_by(id) %>% 
    complete(year = full_seq(year, period = 1)) %>% 
    fill(year, var, .direction = "downup") %>%
    mutate(var=  0 ) %>%
   ungroup

But this is not giving the desired result - as we can see, rows have been deleted and all values of "var" have been replaced with 0:

 A tibble: 16 x 3
   id     year   var
   <chr> <dbl> <dbl>
 1 james  2013     0
 2 james  2014     0
 3 james  2015     0
 4 james  2016     0
 5 james  2017     0
 6 james  2018     0
 7 james  2019     0
 8 james  2020     0

Can someone please show me how to fix this problem?

Thanks!

CodePudding user response:

You can create a data.frame with all year's and id's, then do a full_join with the original data.frame

library(dplyr)
library(tidyr)

expand_grid(id = unique(my_data$id),year = min(my_data$year):max(my_data$year)) %>% 
  full_join(my_data) %>% 
  replace_na(replace = list(var = 0))

CodePudding user response:

I would include the fill argument in your complete function. There you can specify in a named list what you want to include as values for missing combinations.

library(tidyverse)

my_data %>% 
  group_by(id) %>% 
  complete(year = full_seq(year, period = 1), fill = list(var = 0)) %>%
  ungroup

Output

   id     year   var
   <chr> <dbl> <dbl>
 1 james  2013     1
 2 james  2014     0
 3 james  2015     0
 4 james  2016     1
 5 james  2017     1
 6 james  2018     1
 7 james  2019     0
 8 james  2020     1
 9 john   2010     1
10 john   2011     1
11 john   2012     0
12 john   2013     0
13 john   2014     1
14 john   2015     0
15 john   2016     1
16 john   2017     1
  • Related