Fast looping for user-written functions


I have written two functions:
a) the first creates simulated data and estimates a model; b) the second iterates this process a number of times and averages the statistics from the multiple simulations.

The third step I would like to take is to iterate this process across different sample sizes. I know how to do this with a for loop, but it takes very long. Does anyone have suggestions on how to improve looping speed?

In particular, I would be interested in using parallel processing or evaluating alternative looping packages like purrr.

Here is an example:

# packages used throughout
library(dplyr)
library(tibble)
library(purrr)
library(data.table)
library(estimatr)

# the first function simulates data and estimates the model
genmodel <- function(n, meanx, meany){
  # simulate the regressor and the outcome
  df <- data.frame(mean_x = rnorm(n = n, mean = meanx, sd = 1))
  df <- df %>% mutate(mean_y = rnorm(n = n, mean = meany, sd = 1))
  # fit the model with Stata-style robust standard errors
  model <- lm_robust(mean_y ~ mean_x, data = df, se_type = "stata")
  # return the full coefficient table, one row per statistic
  pval <- as.data.frame(list(p = summary(model)$coefficients)) %>% t()
  pval <- as.data.frame(pval) %>% rownames_to_column()
  return(pval)
}
# example
> genmodel(n=100,meanx=2,meany=1)
       rowname  (Intercept)      mean_x
1   p.Estimate 9.984653e-01 -0.05115484
2 p.Std..Error 2.027905e-01  0.10273142
3    p.t.value 4.923630e+00 -0.49794738
4   p.Pr...t.. 3.441203e-06  0.61963671
5   p.CI.Lower 5.960341e-01 -0.25502201
6   p.CI.Upper 1.400896e+00  0.15271232
7         p.DF 9.800000e+01 98.00000000

The second function iterates the first function a number of times and averages the estimated statistics:

average_model <- function(nrep = 100, # number of simulations
                          n,
                          mean_x,
                          mean_y) {
  # run genmodel() nrep times and stack the results
  tmpres <- lapply(1:nrep, function(x) genmodel(n = n, meanx = mean_x, meany = mean_y))
  tmpres <- do.call(rbind, tmpres)
  # average each statistic across the nrep replications
  vec    <- names(tmpres)[2:ncol(tmpres)]
  tmpres <- unique(
    setDT(tmpres)[, paste("avg", vec, sep = "_") := map(.SD, ~ mean(.x)),
                  by = rowname, .SDcols = vec
                  ][, nobs := n] %>%
      select(rowname, `avg_(Intercept)`, avg_mean_x, nobs)
  )
  return(tmpres)
}


# example
tst <- average_model(nrep=50, n=100, mean_x=2, mean_y=1)
tst
        rowname avg_(Intercept)  avg_mean_x nobs
1:   p.Estimate      1.06002378 -0.03100749  100
2: p.Std..Error      0.22368299  0.09921118  100
3:    p.t.value      4.83878275 -0.31190506  100
4:   p.Pr...t..      0.00206157  0.45198433  100
5:   p.CI.Lower      0.61613217 -0.22788884  100
6:   p.CI.Upper      1.50391540  0.16587386  100
7:         p.DF     98.00000000 98.00000000  100

Now my objective is to iterate this average_model function over different sample sizes and to collect everything in a single data frame. This can easily be done with a for loop:

results <- NULL                     # container for results across sample sizes
for (i in seq(from = 100, to = 500, by = 30)) {
  tmpres  <- average_model(nrep = 50, n = i, mean_x = 2, mean_y = 1)
  results <- rbind(results, tmpres) # sequentially append results
}
head(results)
        rowname avg_(Intercept)   avg_mean_x nobs
1:   p.Estimate     1.001296821  0.000989775  100
2: p.Std..Error     0.224800002  0.099078646  100
3:    p.t.value     4.530076894  0.027428073  100
4:   p.Pr...t..     0.001934362  0.504152193  100
5:   p.CI.Lower     0.555188534 -0.195628574  100
6:   p.CI.Upper     1.447405108  0.197608124  100

# it can also be done using `lapply`, but both approaches are quite slow


tmpres <- lapply(seq(from = 100, to = 500, by = 30),
                 function(x) average_model(nrep = 50, n = x, mean_x = 2, mean_y = 1))
tmpres <- do.call(rbind, tmpres)
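For reference, the purrr equivalent of the lapply call would be something along these lines (a sketch; I would not expect it to be faster on its own):

# purrr version of the same outer loop (same work, just a different interface)
tmpres <- purrr::map_dfr(seq(from = 100, to = 500, by = 30),
                         function(x) average_model(nrep = 50, n = x, mean_x = 2, mean_y = 1))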

The problem with this for loop is that it is extremely slow.

Is there a way I could do this using parallel processing? Other suggestions for reducing running time?
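For the parallel route, I imagine something along these lines with the parallel package (a rough, untested sketch), but I am not sure whether this is the right way to set it up:

library(parallel)

# rough sketch: run average_model() for each sample size on a separate worker
cl <- makeCluster(detectCores() - 1)
clusterEvalQ(cl, {
  library(dplyr); library(tibble); library(purrr)
  library(data.table); library(estimatr)
})
clusterExport(cl, c("genmodel", "average_model"))
tmpres  <- parLapply(cl, seq(from = 100, to = 500, by = 30),
                     function(x) average_model(nrep = 50, n = x, mean_x = 2, mean_y = 1))
stopCluster(cl)
results <- do.call(rbind, tmpres)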

CodePudding user response:

This "all data.table" approach is about twice as fast, but still disappointing.

The basic idea is to assemble all the datasets into one large data.table and then cycle through the models using data.table group by.

library(data.table)
library(estimatr)
library(tictoc)
##
tic()
mf     <- data.table(nrep=1:50, meanx=2, meany=1)
mf     <- mf[, .(n=seq(100, 500, 30)), by=.(nrep, meanx, meany)]      # one row per (replication, sample size)
data   <- mf[, .(mean_x=rnorm(n, meanx), mean_y=rnorm(n, meany)), by=.(n, nrep, meanx, meany)]  # simulate all datasets at once
result <- data[, as.data.table(t(summary(lm_robust(mean_y~mean_x, .SD, se_type = 'stata'))$coefficients), keep.rownames = TRUE)
               , by=.(n, nrep, meanx, meany)][, nrep:=NULL]           # one model fit per group
result <- result[, lapply(.SD, mean), by=.(n, meanx, meany, rn)]      # average across replications
toc()
## 2.58 sec elapsed

So this takes between 2.3 - 2.6 sec on my machine, whereas your code runs in about 4.0 - 4.1 sec. About 80% of the time is spent running lm_robust(...). If I swap that out for lm(...) in base R, it runs in about 1 sec.
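For reference, the lm(...) swap is a one-line change; the se_type argument only exists for lm_robust, so it is dropped:

## same pipeline, but with base R's lm() instead of lm_robust()
result <- data[, as.data.table(t(summary(lm(mean_y~mean_x, .SD))$coefficients), keep.rownames = TRUE)
               , by=.(n, nrep, meanx, meany)][, nrep:=NULL]
result <- result[, lapply(.SD, mean), by=.(n, meanx, meany, rn)]
## ~1 sec elapsed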

CodePudding user response:

This can be done in a more straightforward way:


library(tidyverse)   # expand_grid, uncount, rowwise, unnest_wider, str_c, etc.

expand_grid(
  nreps = 50,
  n = seq.default(100, 500, by = 30),
  mean_x = 2, mean_y = 1
) %>% 
  rowid_to_column("n_idx") %>% 
  uncount(nreps, .remove = FALSE) %>% 
  rowid_to_column("nreps_idx") %>% 
  rowwise() %>% 
  mutate(
    lm_robust = 
      estimatr::lm_robust(
        y ~ X,
        data = 
          tibble(y = rnorm(n, mean = mean_y, sd = 1),
                 X = rnorm(n, mean = mean_x, sd = 1)),
        se_type = "stata"
      ) %>% 
      coefficients() %>% 
      set_names(str_c("coef_", names(.))) %>% 
      list()
  ) %>% 
  unnest_wider(lm_robust) %>% 
  group_by(nreps_idx) %>% 
  summarise(
    n = unique(n),
    across(starts_with("coef"), mean),
  )

Which results in:

# A tibble: 700 × 4
   nreps_idx     n `coef_(Intercept)`  coef_X
       <int> <dbl>              <dbl>   <dbl>
 1         1   100              1.34  -0.183 
 2         2   100              0.845  0.0188
 3         3   100              0.949  0.0341
 4         4   100              1.20  -0.0705
 5         5   100              0.731  0.0419
 6         6   100              0.809  0.0564
 7         7   100              0.920  0.0558
 8         8   100              1.22  -0.0673
 9         9   100              1.22  -0.171 
10        10   100              1.26  -0.127 
# … with 690 more rows

Which is computed fairly quickly.

Now, I've not included all the parameters in your code, because honestly it doesn't make sense to take the mean of them, but if you want them as well then...


expand_grid(
  nreps = 50,
  n = seq.default(100, 500, by = 30),
  mean_x = 2, mean_y = 1
) %>% 
  rowid_to_column("n_idx") %>% 
  uncount(nreps, .remove = FALSE) %>% 
  rowid_to_column("nreps_idx") %>% 
  rowwise() %>% 
  mutate(
    lm_robust = 
      estimatr::lm_robust(
        y ~ X,
        data = 
          tibble(y = rnorm(n, mean = mean_y, sd = 1),
                 X = rnorm(n, mean = mean_x, sd = 1)),
        se_type = "stata"
      ) %>% 
      # SECOND APPROACH
      summary() %>% 
      `[[`("coefficients") %>%
      as_tibble(rownames = "rowname") %>% 
      pivot_wider(names_from = "rowname", 
                  values_from = everything()) %>% 
      # FIRST APPROACH
      # coefficients() %>% 
      # set_names(str_c("coef_", names(.))) %>% 
      list()
  ) %>% 
  unnest_wider(lm_robust) %>% 
  print() %>% 
  group_by(nreps_idx) %>% 
  summarise(
    n = unique(n),
    across(starts_with("Estimate"), mean),
    # insert statements here to summarise the other gathered stuff
  )

But this makes things unnecessarily complicated.
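As for the parallel-processing part of your question: the replications are independent, so the outer loop over sample sizes can be farmed out with furrr (a rough, untested sketch, assuming the furrr and future packages are installed and reusing your average_model()):

library(furrr)
library(future)
plan(multisession, workers = availableCores() - 1)

# untested sketch: fit each sample size on a separate worker
tmpres <- future_map(
  seq(100, 500, by = 30),
  ~ average_model(nrep = 50, n = .x, mean_x = 2, mean_y = 1),
  .options = furrr_options(seed = TRUE)   # parallel-safe random numbers
)
results <- data.table::rbindlist(tmpres)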
