Parallelize both model fitting and dredging (glmmTMB dredge)

My aim is to speed up the dredge() function as much as possible when it is applied to glmmTMB() models. I know that both functions can be parallelized: glmmTMB() with the control argument, and dredge() with the cluster argument.

My question: to get the maximum speed, can I parallelize model fitting and dredging at the same time? In other words, can I combine/stack/add together the speed benefits of parallelizing glmmTMB() and dredge()?

I have attempted to do so by creating two separate clusters in an R session, and by comparing various options with microbenchmark(), it seems like I have achieved my goal.

Nevertheless, since I have just copied code from elsewhere, I have no idea what I'm doing! I have background in statistics and R programming, but parallelization is something I'm just starting to learn. So here are a few questions.

Can this process be sped up even more? Is creating two clusters in one R session a sound idea? Do the speed benefits really add up, or am I just seeing an artifact? Can someone recommend learning resources to understand these functions better?

Many thanks!

## Load libraries

library(glmmTMB)
library(microbenchmark)
library(multcomp)
library(MuMIn)
library(parallel)

## Create large dataset (idea from the glmmTMB vignette on parallel optimization)
N <- 3e5
x1 <- rnorm(N, 1, 2)
x2 <- rnorm(N, 4, 2)
x3 <- rnorm(N, 10, 2)
y <- 0.3 + 0.4 * x1 - 0.2 * x2 + 0.9 * x3 + rnorm(N, 0, 0.25)

df <- data.frame(y,
                 x1,
                 x2,
                 x3)


## Create two clusters

# create cluster "cl", but export nothing
cl  <-  parallel::makeCluster((parallel::detectCores() - 1))

# create cluster "clust" and export data and libraries (following documentation of pdredge)
clust  <-  parallel::makeCluster((parallel::detectCores() - 1))
parallel::clusterEvalQ(clust, library(glmmTMB))
parallel::clusterEvalQ(clust, library(MuMIn))
parallel::clusterExport(clust, "df")


## Compare running times for glmmTMB(): both "cl" and "clust" reduce running times
microbenchmark::microbenchmark(
  
  # No parallel
  glmmTMB::glmmTMB(y ~ x1 + x2 + x3, data = df),
  
  # Parallel model with "cl"
  glmmTMB::glmmTMB(y ~ x1 + x2 + x3, data = df, control = glmmTMBControl(parallel = length(cl))),
  
  # Parallel model with "clust"
  glmmTMB::glmmTMB(y ~ x1 + x2 + x3, data = df, control = glmmTMBControl(parallel = length(clust))),
  
  times = 10
  
)
Unit: seconds
expr
glmmTMB::glmmTMB(y ~ x1 + x2 + x3, data = df)
glmmTMB::glmmTMB(y ~ x1 + x2 + x3, data = df, control = glmmTMBControl(parallel = length(cl)))
glmmTMB::glmmTMB(y ~ x1 + x2 + x3, data = df, control = glmmTMBControl(parallel = length(clust)))
      min       lq     mean   median       uq      max neval cld
 4.526190 4.556430 4.625324 4.631528 4.670585 4.745891    10   b
 2.271729 2.282912 2.315834 2.293132 2.343508 2.393902    10  a 
 2.231709 2.288383 2.382596 2.400160 2.459594 2.507514    10  a 


## Compare running times when parallelization is attempted
## both for glmmTMB() and dredge()

options(na.action = "na.fail")

microbenchmark::microbenchmark(
  
  # No parallel
  MuMIn::dredge(global.model = glmmTMB::glmmTMB(y ~ x1 + x2 + x3, data = df),
                rank = "AICc"),
  
  # Parallel glmmTMB with "cl"
  MuMIn::dredge(global.model = glmmTMB::glmmTMB(y ~ x1 + x2 + x3, data = df, control = glmmTMBControl(parallel = length(cl))),
                rank = "AICc"),
  
  # Parallel dredge with "clust"
  MuMIn::dredge(global.model = glmmTMB::glmmTMB(y ~ x1 + x2 + x3, data = df),
                rank = "AICc", cluster = clust),
  
  # Both: parallel glmmTMB with "cl", parallel dredge with "clust"
  MuMIn::dredge(global.model = glmmTMB::glmmTMB(y ~ x1 + x2 + x3, data = df, control = glmmTMBControl(parallel = length(cl))),
                rank = "AICc", cluster = clust),

times = 10

)




Unit: seconds
                                                                                                                                                                   expr
MuMIn::dredge(global.model = glmmTMB::glmmTMB(y ~ x1 + x2 + x3, data = df),
rank = "AICc")
MuMIn::dredge(global.model = glmmTMB::glmmTMB(y ~ x1 + x2 + x3, data = df, control = glmmTMBControl(parallel = length(cl))),
rank = "AICc")
MuMIn::dredge(global.model = glmmTMB::glmmTMB(y ~ x1 + x2 + x3, data = df),
rank = "AICc", cluster = clust)
MuMIn::dredge(global.model = glmmTMB::glmmTMB(y ~ x1 + x2 + x3, data = df, control = glmmTMBControl(parallel = length(cl))),
rank = "AICc", cluster = clust)

      min       lq     mean   median       uq      max neval  cld
 24.95914 25.17014 25.41935 25.27549 25.53169 26.47337    10    d
 14.21192 14.56461 15.28324 14.93494 15.88009 16.69395    10   c 
 13.48460 13.66408 14.09466 13.99638 14.30151 15.40998    10  b  
 11.07945 11.36578 11.75006 11.60089 12.31227 12.55529    10 a   


## These other options don't work

# Parallel dredge with "cl": Not using cluster, regardless of how I parallelize the model
MuMIn::dredge(global.model = glmmTMB::glmmTMB(y ~ x1 + x2 + x3, data = df, control = glmmTMBControl(parallel = length(cl))),
              rank = "AICc",
              cluster = cl)
MuMIn::dredge(global.model = glmmTMB::glmmTMB(y ~ x1 + x2 + x3, data = df, control = glmmTMBControl(parallel = length(clust))),
              rank = "AICc",
              cluster = cl)

# Parallel dredge and model with "clust": Doesn't work
MuMIn::dredge(global.model = glmmTMB::glmmTMB(y ~ x1 + x2 + x3, data = df, control = glmmTMBControl(parallel = length(clust))),
              rank = "AICc",
              cluster = clust)

User response:

You are not adding together the parallelization of dredge() (from MuMIn) and glmmTMB(). The speed gain comes from exporting the packages and data to the cluster.
Once one level of parallelization keeps all but one of your cores busy, parallelizing a second time gains nothing, because there are no cores left.
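
One way to act on the point about cores running out is to budget them explicitly, so that the number of dredge workers times the number of threads per model never exceeds what the machine has. The following is only a minimal sketch under that assumption: `split_cores()` is a hypothetical helper (not part of MuMIn, glmmTMB, or parallel), and the choice of 4 workers is purely illustrative.

```r
library(parallel)

# Hypothetical helper: divide a core budget between dredge() workers and
# per-model threads so that workers * threads never exceeds the budget.
split_cores <- function(total_cores, n_workers) {
  stopifnot(total_cores >= 1, n_workers >= 1)
  n_workers <- min(n_workers, total_cores)          # never more workers than cores
  n_threads <- max(1L, total_cores %/% n_workers)   # whole threads per worker
  list(workers = as.integer(n_workers), threads = as.integer(n_threads))
}

# Leave one core free for the main R session; detectCores() can return NA.
n_avail <- detectCores()
budget  <- if (is.na(n_avail)) 1L else max(1L, n_avail - 1L)

# Illustrative request for 4 workers (an assumption, adjust to the machine).
plan <- split_cores(budget, n_workers = 4)
print(plan)
```

The resulting numbers could then be used as `parallel::makeCluster(plan$workers)` for the `cluster` argument of dredge() and as `glmmTMBControl(parallel = plan$threads)` for the model. Whether the thread setting actually carries over to the models fitted on the workers is something to confirm with a benchmark on your own machine rather than assume.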