Speed up for-loop with sequential tasks in j using data.table-CodePudding

I have a challenge with a large dataset, with several million lines and a few hundred columns. I am using the data.table format.

My (long) code executes nicely, except for a section of the code working on prescriptions to specific individuals during all dates in the dataset.

I want to create a one-line "memory" of each category of drugs for each date for use later in the code, and have done so with a for-loop, assignment by reference and the toString command - but this is very, very slow.

I have looked at it for quite a few hours, and tried to make a distilled example for this question - any suggestions are most welcome.

I suspect a faster way of making several lines into one by group, ie a faster toString, would solve the problem, but I can't think of a smarter way of doing this. Any suggestions are most welcome.

Here goes the code (the dataset is intentionally large to reproduce the speed problem with a few seconds), the loop that gives me problems is the last piece of the code:

library(data.table)
##This is one long piece of code generating the dataset - apologies for the complexity, did what I could (within my abilities) to simplify:
set.seed(2532)
healthData <- data.table(id = sample(1:10000 , 10000))
healthData <- healthData[ , list(id = id ,
                   date = seq(as.Date("2000-01-01") ,
                              as.Date("2001-01-01") ,
                              by = "day")) ,
                   by = 1:nrow(healthData)]
healthData[ , nrow := NULL]
prescriptionRegistry <- data.table(id = sample(1:10000 , 1000 , replace = TRUE) ,
                                   category = sample(c("paracetamol" , "oxycodon" , "seroquel") , 1000 , replace = TRUE) ,
                                   dose = sample(c(0.5 , 1 , 2) , 1000 , replace = TRUE) ,
                                   endDate = sample(as.Date(as.Date("2000-02-01"):as.Date("2000-12-31") ,
                                                            "1970-01-01") ,
                                                    1000 ,
                                                    replace = TRUE))
prescriptionRegistry <- prescriptionRegistry[ , list(id = id ,
                                                     category = category ,
                                                     dose = dose ,
                                                     endDate = endDate , 
                                                     date = seq(as.Date("2000-01-01") ,
                                                                endDate , by = "day")) ,
                                by = 1:nrow(prescriptionRegistry)]
prescriptionRegistry[ , nrow := NULL]
prescriptionRegistry[category == "seroquel" , c("seroquelDose" , "seroquelEndDate") :=
                                                  list(dose , endDate)]
prescriptionRegistry[category == "paracetamol" , c("paracetamolDose" , "paracetamolEndDate") :=
                                                     list(dose , endDate)]
prescriptionRegistry[category == "oxycodon" , c("oxycodonDose" , "oxycodonEndDate") :=
                                                  list(dose , endDate)]
healthData <- merge(healthData , prescriptionRegistry , by.x = c("id" , "date") , by.y = c("id" , "date") , all.x = TRUE , allow.cartesian = TRUE)

##The purpose of this is to reduce to the data that gives me problems - that is when an individual has several prescriptions a day for the same drug:
setorder(healthData , id , date)
healthData[ , index := 1:.N , by = c("id" , "date")]
index <- healthData[index == 2 , .(id)]
index <- unique(index)
setkey(healthData , id)
setkey(index , id)
healthData <- healthData[index]
rm(index)
##End of code generating dataset

##This is the loop that is very slow on large datasets - suggestions are most welcome.
categories <- c("paracetamol" , "oxycodon" , "seroquel")
for (i in categories) {
    healthData[ ,
               c(paste0(i , "DoseTotal") ,
                 paste0(i , "DoseText") ,
                 paste0(i , "EndDateText")) := list(
                   sum(get(paste0(i , "Dose")) , na.rm = TRUE) ,
                   toString(get(paste0(i , "Dose"))) ,
                   toString(get(paste0(i , "EndDate")))) ,
               by = c("id" , "date")]

My real problem is on a server with data.table 1.12.2 and R 3.61 on a Windows server 2012 R2, but it seems to be quite slow as well on my laptop with Lubuntu 20.04, R 4.1.2 and data.table 4.14.2. To quantify, every iteration of the loop on the server takes 2-3 hours using 30 processor threads and with access to 1 terabyte RAM.

Thank you for your time!

CodePudding user response：

If you are looking for a faster toString, you could use instead a list column. On my computer, your example goes from 2.3 sec to 0.6 sec.

for (i in categories) {
  healthData[ ,
              c(paste0(i , "DoseTotal") ,
                paste0(i , "DoseText") ,
                paste0(i , "EndDateText")) := list(
                  sum(get(paste0(i , "Dose")) , na.rm = TRUE) ,
                  list(get(paste0(i , "Dose"))) ,
                  list(get(paste0(i , "EndDate")))) ,
              by = c("id" , "date")]
}