I have an R function that loads, processes, and saves many files. Here is a dummy version:
load_process_saveFiles <- function(onlyFiles = c()){
  allFiles <- paste(LETTERS, '.csv', sep = '')

  # If desired, only include certain files
  if(length(onlyFiles) > 0){
    allFiles <- allFiles[allFiles %in% onlyFiles]
  }

  for(file in allFiles){
    # Load file
    rawFile <- file

    # Run a super long function
    processedFile <- rawFile

    # Save file
    # write.csv(processedFile, paste('./Other/Path/', file, sep = ''), row.names = FALSE)
    cat('\nDone with file ', file, sep = '')
  }
}
It has to run through about 30 files, and each one takes about 3 minutes, so looping through the whole set is very time consuming. What I'd like to do is run each file separately at the same time, so that everything takes about 3 minutes total instead of 3 x 30 = 90 minutes.
I know I can achieve this by creating a bunch of RStudio sessions or many terminal tabs, but I can't handle having that many sessions or tabs open at once.
Ideally, I'd like to list a separate call for each file in one batchRun.R file which I can run from the terminal:
source('./PathToFunction/load_process_saveFiles.R')
load_process_saveFiles(onlyFiles = 'A.csv')
load_process_saveFiles(onlyFiles = 'B.csv')
load_process_saveFiles(onlyFiles = 'C.csv')
load_process_saveFiles(onlyFiles = 'D.csv')
load_process_saveFiles(onlyFiles = 'E.csv')
load_process_saveFiles(onlyFiles = 'F.csv')
So then I'd run $ Rscript batchRun.R from the terminal.
I've looked at different examples on SO trying to accomplish something similar, but each one has some unique features and I just can't get it to work. Is what I'm trying to do possible? Thanks!
CodePudding user response:
Package parallel gives you a number of options. One option is to parallelize the calls to load_process_saveFiles and have the loop inside of the function run serially. Another option is to parallelize the loop and have the calls run serially. The best way to assess which approach is more suitable for your job is to time them both yourself.
Evaluating the calls to load_process_saveFiles in parallel is relatively straightforward with mclapply, the parallel version of the base function lapply (see ?lapply):
parallel::mclapply(x, load_process_saveFiles, mc.cores = 2L)
Here, x is a list of values of the argument onlyFiles, and mc.cores = 2L indicates that you want to divide the calls among two R processes.
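For example, a minimal sketch assuming the files are named A.csv through F.csv as in the question (the core count is just an example):

x <- as.list(paste0(LETTERS[1:6], ".csv"))
parallel::mclapply(x, load_process_saveFiles, mc.cores = 2L)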
Evaluating the loop inside of load_process_saveFiles in parallel would involve replacing the entire for statement with something like
f <- function(file) {
    cat("Processing file", file, "...")
    x <- read(file)                                  # placeholder: load the raw file
    y <- process(x)                                  # placeholder: your long-running processing step
    write(y, file = file.path("path", "to", file))   # placeholder: save the result
    cat(" done!\n")
}
parallel::mclapply(allFiles, f, ...)
and redefining load_process_saveFiles to allow optional arguments:
load_process_saveFiles <- function(onlyFiles = character(0L), ...) {
    ## body
}
Then you could do, for example, load_process_saveFiles(onlyFiles, mc.cores = 2L).
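Putting the pieces together, a rough sketch of the redefined function might look like this (f is the hypothetical per-file worker from above, and the file list mirrors the dummy version in the question):

load_process_saveFiles <- function(onlyFiles = character(0L), ...) {
    allFiles <- paste0(LETTERS, ".csv")
    # If desired, only include certain files
    if (length(onlyFiles) > 0L) {
        allFiles <- allFiles[allFiles %in% onlyFiles]
    }
    # Forward optional arguments such as mc.cores to mclapply
    parallel::mclapply(allFiles, f, ...)
    invisible(NULL)
}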
I should point out that mclapply relies on forking and so does not run in parallel on Windows. On Windows, you can use parLapply instead, but there are some extra steps involved. These are described in the parallel vignette, which can be opened from R with vignette("parallel", "parallel"). The vignette acts as a general introduction to parallelism in R, so it could be worth reading anyway.
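For instance, a minimal Windows-friendly sketch using a PSOCK cluster (the two-worker count and the six file names are just examples):

library(parallel)
cl <- makeCluster(2L)
# Make the function (and anything it depends on) available on the workers
clusterExport(cl, "load_process_saveFiles")
parLapply(cl, paste0(LETTERS[1:6], ".csv"), load_process_saveFiles)
stopCluster(cl)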
CodePudding user response:
The parallel package is useful in this case. If you are using Linux, I would recommend the doMC package (a foreach backend) instead of parallel. doMC is useful even for looping over the big data sets used in machine learning projects.
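For example, a minimal sketch with doMC and foreach (Linux/macOS only, since doMC relies on forking; the core count and file names are placeholders):

library(doMC)
registerDoMC(cores = 2)

allFiles <- paste0(LETTERS[1:6], ".csv")
foreach(file = allFiles) %dopar% {
    load_process_saveFiles(onlyFiles = file)
}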