Home > Back-end >  How to speed up the tidymodels bootstrapping with parallelization
How to speed up the tidymodels bootstrapping with parallelization

Time:01-27

I have the following code, that performs bootstrapping and calculates the confidence interval.

library(resample)
ibrary(broom)
library(dplyr)
library(purrr)
library(tibble)

lm_est <- function(split, ...) {
  lm(mpg ~ disp   hp, data = analysis(split)) %>%
    tidy()
}

set.seed(52156)
car_rs <-
  bootstraps(mtcars, 500, apparent = TRUE) %>%
  mutate(results = map(splits, lm_est))

int_pctl(car_rs, results) # this is important 

It produces

> int_pctl(car_rs, results)
# A tibble: 3 × 6
  term         .lower .estimate   .upper .alpha .method   
  <chr>         <dbl>     <dbl>    <dbl>  <dbl> <chr>     
1 (Intercept) 27.7      31.0    34.1       0.05 percentile
2 disp        -0.0431   -0.0295 -0.0123    0.05 percentile
3 hp          -0.0643   -0.0281 -0.00930   0.05 percentile

But it runs very slowly. How can I speed it up with parallelization? Note that the output of the parallelization needs to be able to be processed by int_pctl().

I tried this but failed:

library(parallel)
# set the number of cores to use for parallelization
cores <- detectCores() - 1
cl <- makeCluster(cores)

# use mcmapply to parallelize the bootstrapping process
car_rs$results <- mcmapply(lm_est, car_rs$splits, mc.cores = cores, mc.preschedule = TRUE)

stopCluster(cl)

CodePudding user response:

There are parallel versions of purrr::map*() functions in the furrr package that you can use.

library(rsample)
library(broom)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(purrr)
library(tibble)
library(furrr)  #<- added
#> Loading required package: future

plan(multisession, workers = parallel::detectCores())  #<- added


lm_est <- function(split, ...) {
  library(broom) #<- added to load inside of remote workers
  lm(mpg ~ disp   hp, data = analysis(split)) %>%
    tidy()
}

set.seed(52156)
car_rs <-
  bootstraps(mtcars, 1500, apparent = TRUE) %>%
  mutate(results = future_map(splits, lm_est))   #<- changed

int_pctl(car_rs, results) # this is important 
#> # A tibble: 3 × 6
#>   term         .lower .estimate   .upper .alpha .method   
#>   <chr>         <dbl>     <dbl>    <dbl>  <dbl> <chr>     
#> 1 (Intercept) 27.7      30.8    33.6       0.05 percentile
#> 2 disp        -0.0443   -0.0298 -0.0146    0.05 percentile
#> 3 hp          -0.0584   -0.0267 -0.00718   0.05 percentile

Created on 2023-01-26 by the reprex package (v2.0.1)

  • Related