I am working with the R.
I have a dataset that looks something like this:
id = c("john", "john", "john", "john","john", "james", "james", "james", "james", "james")
year = c(2010,2011, 2014, 2016,2017, 2013, 2016, 2017, 2018,2020)
var = c(1,1,1,1,1,1,1,1,1,1)
my_data = data.frame(id, year, var)
> my_data
id year var
1 john 2010 1
2 john 2011 1
3 john 2014 1
4 john 2016 1
5 john 2017 1
6 james 2013 1
7 james 2016 1
8 james 2017 1
9 james 2018 1
10 james 2020 1
As we can see, there are some missing years (i.e. non-consecutive years) in this dataset - for each ID, I am trying to add rows corresponding to these missing years and assign the "var" variable as "0" in these rows.
As an example, this would look something like this for the first ID:
id year var
1 john 2010 1
2 john 2011 1
3 john 2012 0
4 john 2013 0
5 john 2014 1
6 john 2015 0
7 john 2016 1
8 john 2017 1
I tried to do this with the following code:
# https://stackoverflow.com/questions/74365569/backfilling-rows-based-on-max-conditions-in-r
library(dplyr)
library(tidyr)
my_data %>%
group_by(id) %>%
complete(year = full_seq(year, period = 1)) %>%
fill(year, var, .direction = "downup") %>%
mutate(var= 0 ) %>%
ungroup
But this is not giving the desired result - as we can see, rows have been deleted and all values of "var" have been replaced with 0:
A tibble: 16 x 3
id year var
<chr> <dbl> <dbl>
1 james 2013 0
2 james 2014 0
3 james 2015 0
4 james 2016 0
5 james 2017 0
6 james 2018 0
7 james 2019 0
8 james 2020 0
Can someone please show me how to fix this problem?
Thanks!
CodePudding user response:
You can create a data.frame with all year's and id's, then do a full_join
with the original data.frame
library(dplyr)
library(tidyr)
expand_grid(id = unique(my_data$id),year = min(my_data$year):max(my_data$year)) %>%
full_join(my_data) %>%
replace_na(replace = list(var = 0))
CodePudding user response:
I would include the fill
argument in your complete
function. There you can specify in a named list what you want to include as values for missing combinations.
library(tidyverse)
my_data %>%
group_by(id) %>%
complete(year = full_seq(year, period = 1), fill = list(var = 0)) %>%
ungroup
Output
id year var
<chr> <dbl> <dbl>
1 james 2013 1
2 james 2014 0
3 james 2015 0
4 james 2016 1
5 james 2017 1
6 james 2018 1
7 james 2019 0
8 james 2020 1
9 john 2010 1
10 john 2011 1
11 john 2012 0
12 john 2013 0
13 john 2014 1
14 john 2015 0
15 john 2016 1
16 john 2017 1