My aim is to speed up the dredge() function as much as possible when it is applied to glmmTMB() models.
I know that both functions can be parallelized: glmmTMB() with the control argument, and dredge() with the cluster argument.
My question: to get the maximum speed, can I parallelize model fitting and dredging at the same time? In other words, can I combine/stack/add together the speed benefits of parallelizing glmmTMB() and dredge()?
I have attempted to do so by creating two separate clusters in an R session. Comparing the various options with microbenchmark(), it seems like I have achieved my goal.
Nevertheless, since I have just copied code from elsewhere, I have no idea what I'm doing! I have a background in statistics and R programming, but parallelization is something I'm just starting to learn. So here are a few questions.
Can this process be sped up even more? Is making two clusters in one R session a sound idea? Can the speed benefits really be added together, or am I just seeing an artifact? Can someone recommend some learning resources to understand these functions better?
Many thanks!
## Load libraries
library(glmmTMB)
library(microbenchmark)
library(multcomp)
library(MuMIn)
library(parallel)
## Create large dataset (idea from the glmmTMB vignette on parallel optimization)
N <- 3e5
x1 <- rnorm(N, 1, 2)
x2 <- rnorm(N, 4, 2)
x3 <- rnorm(N, 10, 2)
y <- 0.3 + 0.4 * x1 - 0.2 * x2 + 0.9 * x3 + rnorm(N, 0, 0.25)
df <- data.frame(y,
x1,
x2,
x3)
## Create two clusters
# create cluster "cl", but export nothing
cl <- parallel::makeCluster((parallel::detectCores() - 1))
# create cluster "clust" and export data and libraries (following documentation of pdredge)
clust <- parallel::makeCluster((parallel::detectCores() - 1))
parallel::clusterEvalQ(clust, library(glmmTMB))
parallel::clusterEvalQ(clust, library(MuMIn))
parallel::clusterExport(clust, "df")
## Compare running times for glmmTMB(): both "cl" and "clust" reduce running times
microbenchmark::microbenchmark(
# No parallel
glmmTMB::glmmTMB(y ~ x1 + x2 + x3, data = df),
# Parallel model with "cl"
glmmTMB::glmmTMB(y ~ x1 + x2 + x3, data = df, control = glmmTMBControl(parallel = length(cl))),
# Parallel model with "clust"
glmmTMB::glmmTMB(y ~ x1 + x2 + x3, data = df, control = glmmTMBControl(parallel = length(clust))),
times = 10
)
Unit: seconds
expr
glmmTMB::glmmTMB(y ~ x1 + x2 + x3, data = df)
glmmTMB::glmmTMB(y ~ x1 + x2 + x3, data = df, control = glmmTMBControl(parallel = length(cl)))
glmmTMB::glmmTMB(y ~ x1 + x2 + x3, data = df, control = glmmTMBControl(parallel = length(clust)))
min lq mean median uq max neval cld
4.526190 4.556430 4.625324 4.631528 4.670585 4.745891 10 b
2.271729 2.282912 2.315834 2.293132 2.343508 2.393902 10 a
2.231709 2.288383 2.382596 2.400160 2.459594 2.507514 10 a
## Compare running times when parallelization is attempted
## both for glmmTMB() and dredge()
options(na.action = "na.fail")
microbenchmark::microbenchmark(
# No parallel
MuMIn::dredge(global.model = glmmTMB::glmmTMB(y ~ x1 + x2 + x3, data = df),
rank = "AICc"),
# Parallel glmmTMB with "cl"
MuMIn::dredge(global.model = glmmTMB::glmmTMB(y ~ x1 + x2 + x3, data = df, control = glmmTMBControl(parallel = length(cl))),
rank = "AICc"),
# Parallel dredge with "clust"
MuMIn::dredge(global.model = glmmTMB::glmmTMB(y ~ x1 + x2 + x3, data = df),
rank = "AICc", cluster = clust),
# Both: parallel glmmTMB with "cl", parallel dredge with "clust"
MuMIn::dredge(global.model = glmmTMB::glmmTMB(y ~ x1 + x2 + x3, data = df, control = glmmTMBControl(parallel = length(cl))),
rank = "AICc", cluster = clust),
times = 10
)
Unit: seconds
expr
MuMIn::dredge(global.model = glmmTMB::glmmTMB(y ~ x1 + x2 + x3, data = df),
rank = "AICc")
MuMIn::dredge(global.model = glmmTMB::glmmTMB(y ~ x1 + x2 + x3, data = df, control = glmmTMBControl(parallel = length(cl))),
rank = "AICc")
MuMIn::dredge(global.model = glmmTMB::glmmTMB(y ~ x1 + x2 + x3, data = df),
rank = "AICc", cluster = clust)
MuMIn::dredge(global.model = glmmTMB::glmmTMB(y ~ x1 + x2 + x3, data = df, control = glmmTMBControl(parallel = length(cl))),
rank = "AICc", cluster = clust)
min lq mean median uq max neval cld
24.95914 25.17014 25.41935 25.27549 25.53169 26.47337 10 d
14.21192 14.56461 15.28324 14.93494 15.88009 16.69395 10 c
13.48460 13.66408 14.09466 13.99638 14.30151 15.40998 10 b
11.07945 11.36578 11.75006 11.60089 12.31227 12.55529 10 a
## These other options don't work
# Parallel dredge with "cl": Not using cluster, regardless of how I parallelize the model
MuMIn::dredge(global.model = glmmTMB::glmmTMB(y ~ x1 + x2 + x3, data = df, control = glmmTMBControl(parallel = length(cl))),
rank = "AICc",
cluster = cl)
MuMIn::dredge(global.model = glmmTMB::glmmTMB(y ~ x1 + x2 + x3, data = df, control = glmmTMBControl(parallel = length(clust))),
rank = "AICc",
cluster = cl)
# Parallel dredge and model with "clust": Doesn't work
MuMIn::dredge(global.model = glmmTMB::glmmTMB(y ~ x1 + x2 + x3, data = df, control = glmmTMBControl(parallel = length(clust))),
rank = "AICc",
cluster = clust)
Answer:
You are not adding up the parallelizations of dredge() (package MuMIn) and glmmTMB(); the speed gain comes from exporting the packages and data to the cluster.
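To see why the exporting step matters, here is a minimal sketch that uses only the base parallel package. The object name dat and the helper has_dat() are made up for illustration and are not part of your code. Each worker is a fresh R session, so it only sees what you send it:

```r
library(parallel)

# Toy data standing in for the question's `df` (hypothetical name: dat)
dat <- data.frame(y = rnorm(10), x1 = rnorm(10))

# Two small clusters: one left bare, one that receives the data
bare  <- makeCluster(2)
ready <- makeCluster(2)
clusterExport(ready, "dat", envir = environment())

# Hypothetical helper: does the worker's global environment contain `dat`?
has_dat <- function() exists("dat", envir = .GlobalEnv, inherits = FALSE)

bare_has  <- unlist(clusterCall(bare, has_dat))   # FALSE on every worker
ready_has <- unlist(clusterCall(ready, has_dat))  # TRUE on every worker

stopCluster(bare)
stopCluster(ready)
```

This is consistent with what you saw: dredge(..., cluster = cl) reported that it was not using the cluster, presumably because cl never received glmmTMB, MuMIn or df.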
When you parallelize once, all but one of your cores are already busy (each cluster is created with detectCores() - 1 workers), so parallelizing a second time gains nothing: there are no cores left.
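If you do want both levels of parallelism, the cores have to be shared between them rather than counted twice. As far as I know, parallel = in glmmTMBControl() is just a number of threads, so length(cl) only supplies an integer and the cluster cl itself is never used by glmmTMB(). A rough sketch of the arithmetic, using a made-up helper split_cores() that is not part of any package:

```r
library(parallel)

# Hypothetical helper: divide the usable cores between cluster workers
# (for dredge) and threads per model fit (for glmmTMB), leaving one core free.
split_cores <- function(total = parallel::detectCores(), threads_per_fit = 1L) {
  if (is.na(total)) total <- 1L            # detectCores() can return NA
  budget  <- max(1L, as.integer(total) - 1L)
  threads <- max(1L, min(as.integer(threads_per_fit), budget))
  workers <- max(1L, budget %/% threads)
  list(workers = workers, threads_per_fit = threads, used = workers * threads)
}

# Example: 8 cores with 2 threads per fit gives 3 workers x 2 threads = 6 cores
plan <- split_cores(total = 8, threads_per_fit = 2)
```

You would then create the dredge cluster with makeCluster(plan$workers) and fit with glmmTMBControl(parallel = plan$threads_per_fit). Whether that beats giving every core to dredge() alone depends on the models, so benchmark both.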