I have an R function that loads, processes, and saves many files. Here is a dummy version:
load_process_saveFiles <- function(onlyFiles = c()){
  allFiles <- paste(LETTERS, '.csv', sep = '')

  # If desired, only include certain files
  if(length(onlyFiles) > 0){
    allFiles <- allFiles[allFiles %in% onlyFiles]
  }

  for(file in allFiles){
    # Load file
    rawFile <- file

    # Run a super long function
    processedFile <- rawFile

    # Save file
    # write.csv(processedFile, paste('./Other/Path/', file, sep = ''), row.names = FALSE)
    cat('\nDone with file ', file, sep = '')
  }
}
It has to run through about 30 files, and each one takes about 3 minutes, so looping through the whole set is very time consuming. What I'd like to do is run each file separately at the same time, so that everything takes about 3 minutes total instead of 3 x 30 = 90 minutes.
I know I can achieve this by creating a bunch of RStudio sessions or many terminal tabs, but I can't handle having that many sessions or tabs open at once.
Ideally, I'd like to list a separate call for each file in one batchRun.R file which I can run from the terminal:
source('./PathToFunction/load_process_saveFiles.R')
load_process_saveFiles(onlyFiles = 'A.csv')
load_process_saveFiles(onlyFiles = 'B.csv')
load_process_saveFiles(onlyFiles = 'C.csv')
load_process_saveFiles(onlyFiles = 'D.csv')
load_process_saveFiles(onlyFiles = 'E.csv')
load_process_saveFiles(onlyFiles = 'F.csv')
So then I'd run $ Rscript batchRun.R from the terminal.
I've looked at different examples on SO trying to accomplish something similar, but each one has some unique features and I just can't get it to work. Is what I'm trying to do possible? Thanks!
CodePudding user response:
Package parallel gives you a number of options. One option is to parallelize the calls to load_process_saveFiles and have the loop inside of the function run serially. Another option is to parallelize the loop and have the calls run serially. The best way to assess which approach is more suitable for your job is to time them both yourself.
Evaluating the calls to load_process_saveFiles in parallel is relatively straightforward with mclapply, the parallel version of the base function lapply (see ?lapply):
parallel::mclapply(x, load_process_saveFiles, mc.cores = 2L)
Here, x is a list of values of the argument onlyFiles, and mc.cores = 2L indicates that you want to divide the calls among two R processes.
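For example, a minimal sketch assuming the files are named A.csv through F.csv as in the question (the core count is just an example):

x <- as.list(paste0(LETTERS[1:6], ".csv"))
parallel::mclapply(x, load_process_saveFiles, mc.cores = 2L)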
Evaluating the loop inside of load_process_saveFiles in parallel would involve replacing the entire for statement with something like
f <- function(file) {
    cat("Processing file", file, "...")
    x <- read(file)                                  # placeholder: load the raw file
    y <- process(x)                                  # placeholder: your long-running processing step
    write(y, file = file.path("path", "to", file))   # placeholder: save the result
    cat(" done!\n")
}
parallel::mclapply(allFiles, f, ...)
and redefining load_process_saveFiles to allow optional arguments:
load_process_saveFiles <- function(onlyFiles = character(0L), ...) {
    ## body
}
Then you could do, for example, load_process_saveFiles(onlyFiles, mc.cores = 2L).
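Putting the pieces together, a rough sketch of the redefined function might look like this (f is the hypothetical per-file worker from above, and the file list mirrors the dummy version in the question):

load_process_saveFiles <- function(onlyFiles = character(0L), ...) {
    allFiles <- paste0(LETTERS, ".csv")
    # If desired, only include certain files
    if (length(onlyFiles) > 0L) {
        allFiles <- allFiles[allFiles %in% onlyFiles]
    }
    # Forward optional arguments such as mc.cores to mclapply
    parallel::mclapply(allFiles, f, ...)
    invisible(NULL)
}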
I should point out that mclapply relies on forking and so does not run in parallel on Windows. On Windows, you can use parLapply instead, but there are some extra steps involved. These are described in the parallel vignette, which can be opened from R with vignette("parallel", "parallel"). The vignette acts as a general introduction to parallelism in R, so it could be worth reading anyway.
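For instance, a minimal Windows-friendly sketch using a PSOCK cluster (the two-worker count and the six file names are just examples):

library(parallel)
cl <- makeCluster(2L)
# Make the function (and anything it depends on) available on the workers
clusterExport(cl, "load_process_saveFiles")
parLapply(cl, paste0(LETTERS[1:6], ".csv"), load_process_saveFiles)
stopCluster(cl)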
CodePudding user response:
The parallel package is useful in this case. If you are using Linux, I would recommend the doMC package (a foreach backend) instead of parallel. doMC is useful even for looping over the big data sets used in machine learning projects.
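For example, a minimal sketch with doMC and foreach (Linux/macOS only, since doMC relies on forking; the core count and file names are placeholders):

library(doMC)
registerDoMC(cores = 2)

allFiles <- paste0(LETTERS[1:6], ".csv")
foreach(file = allFiles) %dopar% {
    load_process_saveFiles(onlyFiles = file)
}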