Parallel computing in R (foreach loop and if statements)

I've been trying to adapt some of my code to work in parallel, since the information that I am processing is not sequential. The idea is that I am going through several tables and processing their entries, changing their values to 1, 2, or 3 depending on their original value. My current code looks like this:

typeof(Table)
[1] "list"
dim(Table)
[1] 5001 1247

Processed_Table <- Table

for (i in 2:length(Table)) {           # Iterating through the columns
  for (j in 1:(length(Table[,2])-1)){  # Iterating through the lines
    if (Table[j,i] > 9) {              # Updating the values based on their range
      Processed_Table[j,i] <- 1 
    } else if (9 > Table[j,i] && Table[j,i] > 5.5){
      Processed_Table[j,i] <- 2 
    } else if (Table[j,i]<5.5){
      Processed_Table[j,i] <- 3 
    }
  }
}

Since my tables are big and I have several of them, I was thinking about parallelizing this task with the parallel library, using one core for each column or line in the loop. I am having trouble understanding how to do this, though, because the values are not updated automatically (as I understand it, values assigned in the workers are not saved back to the original variables). This is what I tried, without success:

library(parallel)

n.cores <- parallel::detectCores() - 1

my.cluster <- parallel::makeCluster(
  n.cores, 
  type = "PSOCK"
)

doParallel::registerDoParallel(cl = my.cluster)

Processed_Table<-foreach(j = 1:(length(Table[,2])-1),.combine = 'cbind') %dopar% {
                   for (i in 2:length(Table)) {
                     if (Table[j,i] > 9) { 
                               1 
                     } else if (9 > Table[j,i] && Table[j,i] > 5.5){
                               2 
                     } else if (Table<5.5){
                               3 
                     }
                   }
                 }

Thanks in advance for the help!

CodePudding user response:

I suggest the 'cut()' function, which is vectorized and will work much faster than a for loop. If you convert the relevant columns of your data frame to a matrix, 'cut()' can bin every value in a single call; you can then restore the matrix dimensions on the result (it may come back as a plain vector) and recombine it with the other columns in your data.
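
As a minimal sketch of that matrix approach, assuming a small made-up data frame whose first column is an identifier (the column names and values below are hypothetical): 'cut()' numbers its intervals from lowest to highest, so the code is subtracted from 4 to match the question's coding (1 for values above 9, 3 for the lowest range).

```r
# Hypothetical example data: an ID column plus two numeric columns
Table <- data.frame(ID = 1:4,
                    A = c(10, 7, 3, 9),
                    B = c(5.5, 12, 6, 1))

# Convert the numeric columns to a matrix so they are binned in one call
m <- as.matrix(Table[, -1])

# cut() gives 1 for (-Inf, 5.5], 2 for (5.5, 9], 3 for (9, Inf];
# 4L - code flips this to the question's coding (1 = above 9)
codes <- 4L - cut(m, breaks = c(-Inf, 5.5, 9, Inf), labels = FALSE)

# Restore the matrix shape and column names explicitly
dim(codes) <- dim(m)
colnames(codes) <- colnames(m)

# Recombine with the untouched ID column
Processed_Table <- cbind(Table["ID"], as.data.frame(codes))
```

With these breaks, a value of exactly 9 lands in group 2 and a value of exactly 5.5 in group 3, so no value is left unassigned.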

I'm not fully sure why you're looking to use parallel computing (too many rows? Columns? Both?), so I'm going to give general advice. Note that a data frame is just a list of vectors, so the mclapply function from the parallel package will happily work on its columns. It relies on forking, so it will not run in parallel on Windows. You can use code like:

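# cut() numbers intervals from lowest to highest, so 4L - code gives 1 for > 9 and 3 for <= 5.5
new_columns <- mclapply(Table[, -1], function(x) 4L - cut(x, breaks = c(-Inf, 5.5, 9, Inf), labels = FALSE))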

Then use something like 'do.call(cbind, new_columns)' to bind the results into one matrix, and append that to your other columns.
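
The column-wise steps could be put together roughly as follows. This is a sketch under assumptions: the data and the helper name `recode_column` are hypothetical, the first column is taken to be an identifier, and the core count is chosen conditionally because mclapply cannot use more than one core on Windows.

```r
library(parallel)

# Hypothetical example data: an ID column plus two numeric columns
Table <- data.frame(ID = 1:4,
                    A = c(10, 7, 3, 9),
                    B = c(5.5, 12, 6, 1))

# Recode one column: 1 for > 9, 2 for (5.5, 9], 3 for <= 5.5
recode_column <- function(x) {
  4L - cut(x, breaks = c(-Inf, 5.5, 9, Inf), labels = FALSE)
}

# mclapply() forks the R process, which Windows does not support,
# so fall back to a single core there
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# A data frame is a list of columns, so each column is one task
new_columns <- mclapply(Table[, -1], recode_column, mc.cores = n_cores)

# Bind the recoded columns into a matrix and put the ID column back
Processed_Table <- data.frame(ID = Table$ID, do.call(cbind, new_columns))
```

Because each column is recoded by an already vectorized call, the parallel version is only likely to pay off when there are many columns or very long ones.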

CodePudding user response:

Depending on how big your data is, you can just use vectorised functions. Here is an approach with dplyr for one table (which you could use in a loop/lapply for all tables). Please note that your approach misses the values 9 and 5.5 because it uses > instead of >=.

library(dplyr)

# assuming your Table is a data.frame
Table <- Table %>% 
  # assuming your first column is called ID, this column is excluded
  mutate(across(-ID, ~case_when(.x > 9 ~ 1,
                                .x <= 9 & .x > 5.5 ~ 2,
                                .x <= 5.5 ~ 3)))
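
The "loop/lapply for all tables" idea could look like the sketch below. It swaps the dplyr pipeline for base R's 'cut()' so that it runs without extra packages; `recode_table`, `tables`, and the sample data are hypothetical names, and the first column of every table is assumed to be the ID.

```r
# Recode every column except the first (assumed to be the ID column)
recode_table <- function(tbl) {
  tbl[-1] <- lapply(tbl[-1], function(x) {
    # 1 for > 9, 2 for (5.5, 9], 3 for <= 5.5
    4L - cut(x, breaks = c(-Inf, 5.5, 9, Inf), labels = FALSE)
  })
  tbl
}

# Hypothetical list of tables that share the same layout
tables <- list(
  first  = data.frame(ID = 1:3, A = c(10, 7, 3)),
  second = data.frame(ID = 1:2, A = c(9, 5.5), B = c(12, 6))
)

# Apply the same recoding to each table; the result is a list of data frames
processed <- lapply(tables, recode_table)
```

Since each table is an independent task, the final lapply call is also the natural place to try a parallel variant such as parallel::mclapply if the serial version turns out to be too slow.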