I want to create a new variable(named "treatment") in a dataset using two different datasets. My original datasets are two big datasets with other variables however, for simplicity, let's say I have the following datasets:
#individual level data, birth years
a <- data.frame (country_code = c(2,2,2,10,10,10,10,8),
birth_year = c(1920,1930,1940,1970,1980,1990, 2000, 1910))
#country level reform info with affected cohorts
b <- data.frame(country_code = c(2,10,10,11),
lower_cutoff = c(1928, 1975, 1907, 1934),
upper_cutoff = c(1948, 1995, 1927, 1948),
cohort = c(1938, 1985, 1917, 1942))
Dataset a
is an individual dataset with birth year informations and dataset b
is country-level data with some country reform information. According to dataset b
I want to create a treatment column in the dataset a
. Treatment is 1 if the birth_year
is between the cohort
and upper_cutoff
and 0 if between cohort
and lower_cutoff
. And anything else should be NA.
After creating an empty treatment column, I used the following code below:
for(i in 1:nrow(a)) {
for(j in 1:nrow(b)){
a$treatment[i] <- ifelse(a$country_code[i] == b$country_code[j] &
a$birth_year[i] >= b$cohort[j] &
a$birth_year[i]<= b$upper_cutoff[j], "1",
ifelse(a$ison[i] == b$ison[j] &
a$birth_year[i] < b$cohort[j] &
a$birth_year[i]>= b$lower_cutoff[j], "0", NA))
}
}
As well as:
for(i in 1:nrow(a)) {
for(j in 1:nrow(b)){
a[i, "treatment"] <- case_when(a[i,"country_code"] == b[j, "country_code"] &
a[i,"birth_year"] >= b[j,"cohort"] &
a[i,"birth_year"]<= b[j,"upper_cutoff"] ~ 1,
a[i,"country_code"] == b[j, "country_code"] &
a[i,"birth_year"] < b[j,"cohort"]&
a[i,"birth_year"]>= b[j,"lower_cutoff"] ~ 0)
}
}
Both codes run, but they only return NAs. The following is the result I want to get:
treatment <- c(NA, 0, 1, NA, 0, 1, NA, 0)
Any ideas about what is wrong? Or any other suggestions? Thanks in advance!
CodePudding user response:
I believe you are switching your upper and lower cutoffs. Try this approach with dplyr:
library(dplyr)
left_join(a,b) %>%
mutate(treatment = case_when(
(birth_year>=cohort & birth_year<=lower_cutoff)~1,
(birth_year<cohort & birth_year>=upper_cutoff)~0
))
Output:
country_code birth_year upper_cutoff lower_cutoff cohort treatment
1 2 1920 1928 1948 1938 NA
2 2 1930 1928 1948 1938 0
3 2 1940 1928 1948 1938 1
4 10 1970 1975 1995 1985 NA
5 10 1970 1907 1927 1917 NA
6 10 1980 1975 1995 1985 0
7 10 1980 1907 1927 1917 NA
8 10 1990 1975 1995 1985 1
9 10 1990 1907 1927 1917 NA
10 10 2000 1975 1995 1985 NA
11 10 2000 1907 1927 1917 NA
12 8 1910 NA NA NA NA
CodePudding user response:
- Try this
for loop
for(i in 1:nrow(a)){
x <- which(a$country_code[i] == b$country_code)
a$treatment[i] <- NA
for(j in x){
if(a$birth_year[i] %in% b$cohort[j]:b$upper_cutoff[j]){
a$treatment[i] <- 1
}
if(a$birth_year[i] %in% b$lower_cutoff[j]:b$cohort[j]){
a$treatment[i] <- 0
}
}
}
- Output
country_code birth_year treatment
1 2 1920 NA
2 2 1930 0
3 2 1940 1
4 10 1970 NA
5 10 1980 0
6 10 1990 1
7 10 2000 NA
8 8 1910 NA
CodePudding user response:
I found the mistake in my code. Apparently, I should have used a break to avoid overwriting the variable I'm creating. But, I'm still open to other answers.
for(i in 1:nrow(a)) {
for(j in 1:nrow(b)){
if(!is.na(a$treatment[i])){break} #to make it stop if I already assign a value
a$treatment[i] <- ifelse(a$country_code[i] == b$country_code[j] &
a$birth_year[i] >= b$cohort[j] &
a$birth_year[i]<= b$upper_cutoff[j], "1",
ifelse(a$ison[i] == b$ison[j] &
a$birth_year[i] < b$cohort[j] &
a$birth_year[i]>= b$lower_cutoff[j], "0", NA))
}
}