I am working with the R programming language.
I have the following dataset:
name = c("john", "john", "john", "sarah", "sarah", "peter", "peter", "peter", "peter")
year = c(2010, 2011, 2014, 2010, 2015, 2011, 2012, 2013, 2015)
age = c(21, 22, 25, 55, 60, 61, 62, 63, 65)
gender = c("male", "male", "male", "female", "female", "male", "male", "male", "male" )
country_of_birth = c("australia", "australia", "australia", "uk", "uk", "mexico", "mexico", "mexico", "mexico")
my_data = data.frame(name, year, age, gender, country_of_birth)
name year age gender country_of_birth
1 john 2010 21 male australia
2 john 2011 22 male australia
3 john 2014 25 male australia
4 sarah 2010 55 female uk
5 sarah 2015 60 female uk
6 peter 2011 61 male mexico
7 peter 2012 62 male mexico
8 peter 2013 63 male mexico
9 peter 2015 65 male mexico
As we can see here, some of the people are missing "years" in this dataset. Suppose the first row corresponding to a person is the earliest year and the last row is the maximum year.
For each person in this dataset - I would like to "fill" between the missing rows. For example - in each missing row:
- I would like the "age" variable to increase by 1 (e.g. in 2012, john should be 23 - in 2012, john should be 24)
- I would the "gender" variable to stay the same
- I would like the "country_of_birth" variable to stay the same
Here is the R code I am using:
library(tidyr)
library(dplyr)
my_data %>%
group_by(name) %>%
complete(year = full_seq(year, period = 1)) %>%
fill(year, age, gender, country_of_birth, .direction = "downup") %>%
mutate(real_age= age - (row_number() - 1)) %>%
ungroup
This code runs and seems to have added the missing rows - but the age variable is not being correctly added:
# A tibble: 16 x 6
name year age gender country_of_birth real_age
<chr> <dbl> <dbl> <chr> <chr> <dbl>
1 john 2010 21 male australia 21
2 john 2011 22 male australia 21
3 john 2012 22 male australia 20
4 john 2013 22 male australia 19
5 john 2014 25 male australia 21
6 peter 2011 61 male mexico 61
7 peter 2012 62 male mexico 61
8 peter 2013 63 male mexico 61
9 peter 2014 63 male mexico 60
10 peter 2015 65 male mexico 61
11 sarah 2010 55 female uk 55
12 sarah 2011 55 female uk 54
13 sarah 2012 55 female uk 53
14 sarah 2013 55 female uk 52
15 sarah 2014 55 female uk 51
16 sarah 2015 60 female uk 55
Currently, I am trying to resolve this problem by trying different combinations of mutate(real_age= age - (row_number() - 1))
- but so far, nothing seems to be working.
Can someone please show me how to fix this?
Thanks!
CodePudding user response:
One way is:
library(dplyr)
library(tidyr)
my_data %>%
group_by(name) %>%
complete(year = first(year): last(year)) %>%
mutate(age = ifelse(is.na(age), first(age) row_number()-1,age)) %>%
fill(c(gender, country_of_birth), .direction = "down")
name year age gender country_of_birth
<chr> <dbl> <dbl> <chr> <chr>
1 john 2010 21 male australia
2 john 2011 22 male australia
3 john 2012 23 male australia
4 john 2013 24 male australia
5 john 2014 25 male australia
6 peter 2011 61 male mexico
7 peter 2012 62 male mexico
8 peter 2013 63 male mexico
9 peter 2014 64 male mexico
10 peter 2015 65 male mexico
11 sarah 2010 55 female uk
12 sarah 2011 56 female uk
13 sarah 2012 57 female uk
14 sarah 2013 58 female uk
15 sarah 2014 59 female uk
16 sarah 2015 60 female uk