I have two datasets, and one of them is very big. I'm trying to run the following loop to create a treatment column, treatment
, in the dataset a
. However, it is way too slow. I looked for a way to fasten for-loops like vectorization or defining conditions outside the loops however I'm having a hard time applying those methods since I have two datasets I'm conditioning on.
Here is my code:
reform_loop <- function(a, b){
for(i in 1:nrow(a)) {
for(j in 1:nrow(b)){
if(!is.na(a[i,"treatment"])){break}
a[i,"treatment"] <- case_when(a[i,"country_code"] == b[j, "country_code"] &
a[i,"birth_year"] >= b[j,"cohort"] &
a[i,"birth_year"]<= b[j,"upper_cutoff"] ~ 1,
a[i,"country_code"] == b[j, "country_code"] &
a[i,"birth_year"] < b[j,"cohort"]&
a[i,"birth_year"]>= b[j,"lower_cutoff"] ~ 0)
}
}
return(a)
}
a <- reform_loop(a, b)
You can find a sample dataset below. Dataset a
is an individual dataset with birth year informations and dataset b
is country-level data with some country reform information. treatment
is 1 if thebirth_year
is between the cohort
and upper_cutoff
and 0 if between cohort
and lower_cutoff
in a specific country which means country_code
variables should also be matched. And anything else should be NA.
#individual level data, birth years
a <- data.frame (country_code = c(2,2,2,10,10,10,10,8),
birth_year = c(1920,1930,1940,1970,1980,1990, 2000, 1910))
#country level reform info with affected cohorts
b <- data.frame(country_code = c(2,10,10,11),
lower_cutoff = c(1928, 1975, 1907, 1934),
upper_cutoff = c(1948, 1995, 1927, 1948),
cohort = c(1938, 1985, 1917, 1942))
The following is the result I want to get:
treatment <- c(NA, 0, 1, NA, 0, 1, NA, NA)
Unfortunately, I cannot merge these two datasets since most of the countries in my dataset have more than one reform.
Any ideas on how can I fasten this code? Thank you so much in advance!
CodePudding user response:
This is a range-based non-equi join. As such, this can be done with data.table
or fuzzyjoin
or sqldf
.
data.table
library(data.table)
setDT(a)
setDT(b)
b[, treatment := 1L]
a[b, treatment := i.treatment, on = .(country_code, birth_year >= lower_cutoff, birth_year <= upper_cutoff)]
a[is.na(treatment), treatment := 0L]
a
# country_code birth_year treatment
# <num> <num> <int>
# 1: 2 1920 0
# 2: 2 1930 1
# 3: 2 1940 1
# 4: 10 1970 0
# 5: 10 1980 1
# 6: 10 1990 1
# 7: 10 2000 0
# 8: 8 1910 0
sqldf
out <- sqldf::sqldf("select a.*, b.treatment from a left join b on a.country_code=b.country_code and a.birth_year between b.lower_cutoff and b.upper_cutoff")
out$treatment[is.na(out$treatment)] <- 0L
out
# country_code birth_year treatment
# 1 2 1920 0
# 2 2 1930 1
# 3 2 1940 1
# 4 10 1970 0
# 5 10 1980 1
# 6 10 1990 1
# 7 10 2000 0
# 8 8 1910 0
fuzzyjoin
fuzzyjoin::fuzzy_left_join(a, b, by = c("country_code" = "country_code", "birth_year" = "lower_cutoff", "birth_year" = "upper_cutoff"), match_fun = list(`==`, `>=`, `<=`))
# country_code.x birth_year country_code.y lower_cutoff upper_cutoff cohort treatment
# 1 2 1920 NA NA NA NA NA
# 2 2 1930 2 1928 1948 1938 1
# 3 2 1940 2 1928 1948 1938 1
# 4 10 1970 NA NA NA NA NA
# 5 10 1980 10 1975 1995 1985 1
# 6 10 1990 10 1975 1995 1985 1
# 7 10 2000 NA NA NA NA NA
# 8 8 1910 NA NA NA NA NA
and then you need to clean up the extra columns (and fill 0
for NA
).