How to fasten nested for-loop R-CodePudding

I have two datasets, and one of them is very big. I'm trying to run the following loop to create a treatment column, treatment, in the dataset a. However, it is way too slow. I looked for a way to fasten for-loops like vectorization or defining conditions outside the loops however I'm having a hard time applying those methods since I have two datasets I'm conditioning on.

Here is my code:

reform_loop <- function(a, b){
  for(i in 1:nrow(a)) {
    
    for(j in 1:nrow(b)){
      if(!is.na(a[i,"treatment"])){break}
      

      a[i,"treatment"] <- case_when(a[i,"country_code"] == b[j, "country_code"] &
                            a[i,"birth_year"] >= b[j,"cohort"] &
                            a[i,"birth_year"]<= b[j,"upper_cutoff"] ~ 1,
                          
                          a[i,"country_code"] == b[j, "country_code"] &
                            a[i,"birth_year"] < b[j,"cohort"]&
                            a[i,"birth_year"]>= b[j,"lower_cutoff"] ~ 0)
      
    }
  }
  return(a)
}

a <- reform_loop(a, b)

You can find a sample dataset below. Dataset a is an individual dataset with birth year informations and dataset b is country-level data with some country reform information. treatment is 1 if thebirth_year is between the cohort and upper_cutoff and 0 if between cohort and lower_cutoff in a specific country which means country_code variables should also be matched. And anything else should be NA.

#individual level data, birth years
a <- data.frame (country_code = c(2,2,2,10,10,10,10,8), 
                               birth_year = c(1920,1930,1940,1970,1980,1990, 2000, 1910))
#country level reform info with affected cohorts
b <- data.frame(country_code = c(2,10,10,11),
                      lower_cutoff = c(1928, 1975, 1907, 1934),
                      upper_cutoff = c(1948, 1995, 1927, 1948),
                      cohort = c(1938, 1985, 1917, 1942))

The following is the result I want to get:

treatment <- c(NA, 0, 1, NA, 0, 1, NA, NA)

Unfortunately, I cannot merge these two datasets since most of the countries in my dataset have more than one reform.

Any ideas on how can I fasten this code? Thank you so much in advance!

CodePudding user response：

This is a range-based non-equi join. As such, this can be done with data.table or fuzzyjoin or sqldf.

data.table

library(data.table)
setDT(a)
setDT(b)
b[, treatment := 1L]
a[b, treatment := i.treatment, on = .(country_code, birth_year >= lower_cutoff, birth_year <= upper_cutoff)]
a[is.na(treatment), treatment := 0L]
a
#    country_code birth_year treatment
#           <num>      <num>     <int>
# 1:            2       1920         0
# 2:            2       1930         1
# 3:            2       1940         1
# 4:           10       1970         0
# 5:           10       1980         1
# 6:           10       1990         1
# 7:           10       2000         0
# 8:            8       1910         0

sqldf

out <- sqldf::sqldf("select a.*, b.treatment from a left join b on a.country_code=b.country_code and a.birth_year between b.lower_cutoff and b.upper_cutoff")
out$treatment[is.na(out$treatment)] <- 0L
out
#   country_code birth_year treatment
# 1            2       1920         0
# 2            2       1930         1
# 3            2       1940         1
# 4           10       1970         0
# 5           10       1980         1
# 6           10       1990         1
# 7           10       2000         0
# 8            8       1910         0

fuzzyjoin

fuzzyjoin::fuzzy_left_join(a, b, by = c("country_code" = "country_code", "birth_year" = "lower_cutoff", "birth_year" = "upper_cutoff"), match_fun = list(`==`, `>=`, `<=`))
#   country_code.x birth_year country_code.y lower_cutoff upper_cutoff cohort treatment
# 1              2       1920             NA           NA           NA     NA        NA
# 2              2       1930              2         1928         1948   1938         1
# 3              2       1940              2         1928         1948   1938         1
# 4             10       1970             NA           NA           NA     NA        NA
# 5             10       1980             10         1975         1995   1985         1
# 6             10       1990             10         1975         1995   1985         1
# 7             10       2000             NA           NA           NA     NA        NA
# 8              8       1910             NA           NA           NA     NA        NA

and then you need to clean up the extra columns (and fill 0 for NA).