Home > Blockchain >  Nested for loop to create a column, using two data sets in R
Nested for loop to create a column, using two data sets in R

Time:08-14

I want to create a new variable(named "treatment") in a dataset using two different datasets. My original datasets are two big datasets with other variables however, for simplicity, let's say I have the following datasets:

#individual level data, birth years
a <- data.frame (country_code = c(2,2,2,10,10,10,10,8), 
                               birth_year = c(1920,1930,1940,1970,1980,1990, 2000, 1910))
#country level reform info with affected cohorts
b <- data.frame(country_code = c(2,10,10,11),
                      lower_cutoff = c(1928, 1975, 1907, 1934),
                      upper_cutoff = c(1948, 1995, 1927, 1948),
                      cohort = c(1938, 1985, 1917, 1942))

Dataset a is an individual dataset with birth year informations and dataset b is country-level data with some country reform information. According to dataset b I want to create a treatment column in the dataset a. Treatment is 1 if the birth_year is between the cohort and upper_cutoff and 0 if between cohort and lower_cutoff. And anything else should be NA.

After creating an empty treatment column, I used the following code below:

for(i in 1:nrow(a)) {
  for(j in 1:nrow(b)){
    a$treatment[i] <- ifelse(a$country_code[i] == b$country_code[j] & 
                                    a$birth_year[i] >= b$cohort[j] & 
                                    a$birth_year[i]<= b$upper_cutoff[j], "1",
                                  ifelse(a$ison[i] == b$ison[j] & 
                                           a$birth_year[i] < b$cohort[j] & 
                                           a$birth_year[i]>= b$lower_cutoff[j], "0", NA))
  }
}

As well as:

for(i in 1:nrow(a)) {
  for(j in 1:nrow(b)){
    a[i, "treatment"] <- case_when(a[i,"country_code"] == b[j, "country_code"] &
                                          a[i,"birth_year"] >= b[j,"cohort"] &
                                          a[i,"birth_year"]<= b[j,"upper_cutoff"] ~ 1,
                                        a[i,"country_code"] == b[j, "country_code"] &
                                          a[i,"birth_year"] < b[j,"cohort"]&
                                          a[i,"birth_year"]>= b[j,"lower_cutoff"] ~ 0)
  }
}

Both codes run, but they only return NAs. The following is the result I want to get:

treatment <- c(NA, 0, 1, NA, 0, 1, NA, 0)

Any ideas about what is wrong? Or any other suggestions? Thanks in advance!

CodePudding user response:

I believe you are switching your upper and lower cutoffs. Try this approach with dplyr:

library(dplyr)

left_join(a,b) %>%
  mutate(treatment = case_when(
    (birth_year>=cohort & birth_year<=lower_cutoff)~1,
    (birth_year<cohort & birth_year>=upper_cutoff)~0
  ))

Output:

   country_code birth_year upper_cutoff lower_cutoff cohort treatment
1             2       1920         1928         1948   1938        NA
2             2       1930         1928         1948   1938         0
3             2       1940         1928         1948   1938         1
4            10       1970         1975         1995   1985        NA
5            10       1970         1907         1927   1917        NA
6            10       1980         1975         1995   1985         0
7            10       1980         1907         1927   1917        NA
8            10       1990         1975         1995   1985         1
9            10       1990         1907         1927   1917        NA
10           10       2000         1975         1995   1985        NA
11           10       2000         1907         1927   1917        NA
12            8       1910           NA           NA     NA        NA

CodePudding user response:

  • Try this for loop
for(i in 1:nrow(a)){
    x <- which(a$country_code[i] == b$country_code)
    a$treatment[i] <- NA
    for(j in x){
        if(a$birth_year[i] %in% b$cohort[j]:b$upper_cutoff[j]){
            a$treatment[i] <- 1
        }
        if(a$birth_year[i] %in% b$lower_cutoff[j]:b$cohort[j]){
            a$treatment[i] <- 0
        }
    }
}
  • Output
 country_code birth_year treatment
1            2       1920        NA
2            2       1930         0
3            2       1940         1
4           10       1970        NA
5           10       1980         0
6           10       1990         1
7           10       2000        NA
8            8       1910        NA

CodePudding user response:

I found the mistake in my code. Apparently, I should have used a break to avoid overwriting the variable I'm creating. But, I'm still open to other answers.

for(i in 1:nrow(a)) {
  for(j in 1:nrow(b)){

if(!is.na(a$treatment[i])){break} #to make it stop if I already assign a value

a$treatment[i] <- ifelse(a$country_code[i] == b$country_code[j] & 
                                    a$birth_year[i] >= b$cohort[j] & 
                                    a$birth_year[i]<= b$upper_cutoff[j], "1",
                                  ifelse(a$ison[i] == b$ison[j] & 
                                           a$birth_year[i] < b$cohort[j] & 
                                           a$birth_year[i]>= b$lower_cutoff[j], "0", NA))
  }
}
  • Related