I have two very large datasets: bdd_cases, with 150,000 rows, and bdd_control, with 15 million rows. For simplicity, I have reduced the size of these datasets and shared them via the drive links below. Among other things, I am trying to add matching rows from bdd_control to bdd_cases based on the cluster_case and subset variables.
I have written the following for loop for this purpose, and it works perfectly for the small dataset example given here. However, it takes around 13 seconds even for this small dataset.
#import data
id1 <- "199TNlYFwqzzWpi1iY5qX1-M11UoC51Cp"
id2 <- "1TeFCkqLDtEBz0JMBHh8goNWEjYol4O2z"

library(dplyr)

bdd_cases   <- as.data.frame(read.csv(sprintf("https://docs.google.com/uc?id=%s&export=download", id1)))
bdd_control <- as.data.frame(read.csv(sprintf("https://docs.google.com/uc?id=%s&export=download", id2)))

#declare empty dataframe
bdd_temp <- NULL
list_p   <- unique(bdd_cases$cluster_case)

#for loop
for (i in 1:length(list_p)) {

  temp <- bdd_cases %>%
    filter(cluster_case == list_p[i])   #select the ith case from bdd_cases

  temp0 <- bdd_control %>%
    filter(subset == temp$subset)       #select the rows from bdd_control that match this case on the subset variable

  temp <- rbind(temp, temp0)            #bind the two

  temp$cluster_case <- list_p[i]        #add the ith cluster_case to all the rows

  temp <- temp %>%
    group_by(cluster_case) %>%          #group by cluster_case
    mutate(age_diff = abs(age - age[case_control == "case"]),      #difference in age between case and controls
           fup_diff = foll_up - foll_up[case_control == "case"],   #difference in foll_up between case and controls
           age_fup  = ifelse(age_diff <= 2 & fup_diff == 0, "accept", "delete")) %>%
    filter(age_fup == "accept") %>%     #keep the matching controls and remove the other controls for the ith cluster_case
    select(-age_fup)

  bdd_temp <- bdd_temp %>%              #finally add this matched case and its controls to the growing dataframe
    bind_rows(temp)
}
My problem arises when I run the same for loop on the original datasets with millions of rows: my program has now been running for 2 days. I am running it on an RStudio Server instance with 64 cores and 270 GB of RAM.
I have referred to previous posts like this one (Speed up the loop operation in R), which talk about vectorisation and using lists instead of dataframes, but I have not been able to apply those ideas to my specific situation.
Are there any specific improvements I can make to the commands within my for loop that would speed up the execution? Any little improvement in speed would mean a lot. Thanks.
CodePudding user response:
This should speed things up considerably. On my system, the speed gain is about a factor of 5.
#import data
id1 <- "199TNlYFwqzzWpi1iY5qX1-M11UoC51Cp"
id2 <- "1TeFCkqLDtEBz0JMBHh8goNWEjYol4O2z"

library(data.table)

#use fread for reading: it is fast and you get a nice progress bar as a bonus
bdd_cases   <- fread(sprintf("https://docs.google.com/uc?id=%s&export=download", id1))
bdd_control <- fread(sprintf("https://docs.google.com/uc?id=%s&export=download", id2))

#put everything in a list
L <- lapply(unique(bdd_cases$cluster_case), function(x) {
  #bind the case for this cluster to the controls that share its subset
  temp <- rbind(bdd_cases[cluster_case == x, ],
                bdd_control[subset == bdd_cases[cluster_case == x, ]$subset])
  #tag all rows with the current cluster_case
  temp[, cluster_case := x]
  #differences in age and follow-up between the case and each control
  temp[, `:=`(age_diff = abs(age - age[case_control == "case"]),
              fup_diff = foll_up - foll_up[case_control == "case"])]
  #keep only the matching controls (the case itself also passes, with zero differences)
  temp[age_diff <= 2 & fup_diff == 0, ]
})

#rowbind the list
final <- rbindlist(L, use.names = TRUE, fill = TRUE)
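If even the lapply() version is too slow on 15 million control rows, the per-cluster work can in principle be replaced by a single join, which is essentially the vectorisation the question refers to. Below is a rough sketch of that idea, not a drop-in replacement: it assumes each cluster_case has exactly one case row and that bdd_control has no cluster_case column of its own, it reuses the column names from the question (subset, age, foll_up, case_control), and the case_keys and final2 objects are names introduced here for illustration. The result should be checked against the loop's output before relying on it.

library(data.table)

cases    <- as.data.table(bdd_cases)
controls <- as.data.table(bdd_control)

#keep only the columns needed for matching, one row per case
case_keys <- cases[, .(cluster_case, subset, case_age = age, case_fup = foll_up)]

#one join instead of a loop: each control is paired with every case sharing its subset
matched <- controls[case_keys, on = "subset", allow.cartesian = TRUE]

#apply the same matching rules as the loop, then drop the helper columns
matched[, `:=`(age_diff = abs(age - case_age),
               fup_diff = foll_up - case_fup)]
matched <- matched[age_diff <= 2 & fup_diff == 0]
matched[, c("case_age", "case_fup") := NULL]

#stack the cases on top of their matched controls
final2 <- rbindlist(list(cases, matched), use.names = TRUE, fill = TRUE)
setorder(final2, cluster_case)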