I am working with the R programming language.
I have this dataset:
name = c("john", "john", "john", "sarah", "sarah", "peter", "peter", "peter", "peter")
year = c(2010, 2011, 2014, 2010, 2015, 2011, 2012, 2013, 2015)
age = c(21, 22, 25, 55, 60, 61, 62, 63, 65)
gender = c("male", "male", "male", "female", "female", "male", "male", "male", "male" )
country_of_birth = c("australia", "australia", "australia", "uk", "uk", "mexico", "mexico", "mexico", "mexico")
source = "ORIGINAL"
my_data = data.frame(name, year, age, gender, country_of_birth, source)
> my_data
name year age gender country_of_birth source
1 john 2010 21 male australia ORIGINAL
2 john 2011 22 male australia ORIGINAL
3 john 2014 25 male australia ORIGINAL
4 sarah 2010 55 female uk ORIGINAL
5 sarah 2015 60 female uk ORIGINAL
6 peter 2011 61 male mexico ORIGINAL
7 peter 2012 62 male mexico ORIGINAL
8 peter 2013 63 male mexico ORIGINAL
9 peter 2015 65 male mexico ORIGINAL
In a previous question(Filling Spaces in Missing Rows) - for each person, I learned how to add rows where the intermediate years are missing, and populate these rows with the "logically" corresponding information (e.g. adjust the "age" variable while keeping "gender" and "country_of_birth" the same):
library(dplyr)
library(tidyr)
final = my_data %>%
group_by(name) %>%
complete(year = first(year): last(year)) %>%
mutate(age = ifelse(is.na(age), first(age) row_number()-1,age)) %>%
fill(c(gender, country_of_birth), .direction = "down")
name year age gender country_of_birth source
<chr> <dbl> <dbl> <chr> <chr> <chr>
1 john 2010 21 male australia ORIGINAL
2 john 2011 22 male australia ORIGINAL
3 john 2012 23 male australia NA
4 john 2013 24 male australia NA
Now, I would like to add a "source" variable that shows whether a row in this final dataset has been added or has existed all along
I had tried to add this logic to the DPLYR code, but I was not sure how to do this (I tried different ways but nothing worked). I had an idea that perhaps I could just do this with an ifelse()
statement?
my_data$source <- ifelse(is.na(my_data$source), "MISSING", my_data$source)
name year age gender country_of_birth source
<chr> <dbl> <dbl> <chr> <chr> <chr>
1 john 2010 21 male australia ORIGINAL
2 john 2011 22 male australia ORIGINAL
3 john 2012 23 male australia MISSING
Can someone please tell me if this logic is correct and how I might have been able to implement this logic within the original DPLYR code?
Thanks!
CodePudding user response:
Alternatively you could try the below
data.frame(name, year, age, gender, country_of_birth, source) %>% group_by(name) %>%
complete(year=full_seq(year, 1)) %>%
mutate(source=ifelse(is.na(source), 'MISSING', source)) %>%
fill(c(age, gender, country_of_birth), .direction = 'down')
Created on 2023-01-23 with reprex v2.0.2