I have written two functions:
a) the first simulates data and estimates a model;
b) the second iterates this process a number of times and averages the statistics from the individual simulations.
The third step I would like to add is to iterate this process across different sample sizes. I know how to do this with a for loop, but it takes very long. Does anyone have suggestions on how to improve looping speed?
In particular, I would be interested in using parallel processing or in evaluating alternative looping packages like purrr.
Here is an example:
# the first function simulates data and estimates the model
library(dplyr)
library(tibble)
library(estimatr)

genmodel <- function(n, meanx, meany){
  # simulate the data
  df <- as.data.frame(list(mean_x = rnorm(n = n, mean = meanx, sd = 1)))
  df <- df %>% mutate(mean_y = rnorm(n = n, mean = meany, sd = 1))
  # estimate the model with Stata-style robust standard errors
  model <- lm_robust(mean_y ~ mean_x, data = df, se_type = "stata")
  # reshape the coefficient matrix into a long data frame
  pval <- as.data.frame(list(p = summary(model)$coefficients)) %>% t()
  pval <- as.data.frame(pval) %>% rownames_to_column()
  return(pval)
}
# example
> genmodel(n=100,meanx=2,meany=1)
rowname (Intercept) mean_x
1 p.Estimate 9.984653e-01 -0.05115484
2 p.Std..Error 2.027905e-01 0.10273142
3 p.t.value 4.923630e+00 -0.49794738
4 p.Pr...t.. 3.441203e-06 0.61963671
5 p.CI.Lower 5.960341e-01 -0.25502201
6 p.CI.Upper 1.400896e+00 0.15271232
7 p.DF 9.800000e+01 98.00000000
The second function iterates the first function a number of times and averages the estimated statistics:
library(data.table)
library(purrr)

average_model <- function(nrep = 100,  # number of simulations
                          n,
                          mean_x,
                          mean_y){
  # run genmodel() nrep times and stack the results
  tmpres <- lapply(1:nrep, function(x) genmodel(n = n, meanx = mean_x, meany = mean_y))
  tmpres <- do.call(rbind, tmpres)
  # average every statistic across simulations, by statistic name
  vec <- names(tmpres)[2:ncol(tmpres)]
  tmpres <- unique(
    setDT(tmpres)[, paste("avg", (vec), sep = "_") := map(.SD, ~ mean(.x)), by = rowname, .SDcols = (vec)
                  ][, nobs := n] %>%
      select(rowname, `avg_(Intercept)`, avg_mean_x, nobs)
  )
  return(tmpres)
}
# example
tst <- average_model(nrep=50, n=100, mean_x=2, mean_y=1)
tst
rowname avg_(Intercept) avg_mean_x nobs
1: p.Estimate 1.06002378 -0.03100749 100
2: p.Std..Error 0.22368299 0.09921118 100
3: p.t.value 4.83878275 -0.31190506 100
4: p.Pr...t.. 0.00206157 0.45198433 100
5: p.CI.Lower 0.61613217 -0.22788884 100
6: p.CI.Upper 1.50391540 0.16587386 100
7: p.DF 98.00000000 98.00000000 100
Now my objective is to iterate this average_model function over different sample sizes and to collect everything in a single data frame. This can easily be done using a for loop:
results <- NULL  # container for the combined results
for (i in seq(from=100, to=500, by=30)){
  tmpres <- average_model(nrep=50, n=i, mean_x=2, mean_y=1)
  results <- rbind(results, tmpres) # sequentially append results
}
head(results)
rowname avg_(Intercept) avg_mean_x nobs
1: p.Estimate 1.001296821 0.000989775 100
2: p.Std..Error 0.224800002 0.099078646 100
3: p.t.value 4.530076894 0.027428073 100
4: p.Pr...t.. 0.001934362 0.504152193 100
5: p.CI.Lower 0.555188534 -0.195628574 100
6: p.CI.Upper 1.447405108 0.197608124 100
# it can also be done using `lapply`, but both approaches are quite slow
tmpres <- lapply(seq(from=100, to=500, by=30), function(x) average_model(nrep=50, n=x, mean_x=2, mean_y=1))
tmpres <- do.call(rbind, tmpres)
The problem with this for loop is that it is extremely slow.
Is there a way I could do this using parallel processing? Any other suggestions for reducing the running time?
CodePudding user response:
This "all data.table
" approach is about twice as fast, but still disappointing.
The basic idea is to assemble all the datasets into one large data.table
and then cycle through the models using data.table
group by.
library(data.table)
library(estimatr)
library(tictoc)
##
tic()
## one row per (replication, sample size) combination
mf <- data.table(nrep=1:50, meanx=2, meany=1)
mf <- mf[, .(n=seq(100, 500, 30)), by=.(nrep, meanx, meany)]
## simulate every dataset up front
data <- mf[, .(mean_x=rnorm(n, meanx), mean_y=rnorm(n, meany)), by=.(n, nrep, meanx, meany)]
## fit one model per group and reshape its coefficient matrix to long form
result <- data[, as.data.table(t(summary(lm_robust(mean_y ~ mean_x, .SD, se_type = "stata"))$coefficients), keep.rownames = TRUE)
               , by=.(n, nrep, meanx, meany)][, nrep:=NULL]
## average the statistics across replications
result <- result[, lapply(.SD, mean), by=.(n, meanx, meany, rn)]
toc()
## 2.58 sec elapsed
So this takes between 2.3 and 2.6 sec on my machine, whereas your code runs in about 4.0 to 4.1 sec. About 80% of the time is spent running lm_robust(...). If I swap that out for lm(...) in base R it runs in about 1 sec.
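For reference, that swap only touches the grouped model-fitting line. A minimal sketch of the lm() variant (note that lm() returns classical rather than Stata-style robust standard errors and no CI/DF columns, so the averaged statistics will differ slightly):
## same pipeline, but fitting with base lm() instead of lm_robust()
result <- data[, as.data.table(t(summary(lm(mean_y ~ mean_x, .SD))$coefficients), keep.rownames = TRUE)
               , by=.(n, nrep, meanx, meany)][, nrep:=NULL]
result <- result[, lapply(.SD, mean), by=.(n, meanx, meany, rn)]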
CodePudding user response:
This can be done in a more straightforward way:
library(tidyverse)

expand_grid(
  nreps = 50,
  n = seq.default(100, 500, by = 30),
  mean_x = 2, mean_y = 1
) %>%
  rowid_to_column("n_idx") %>%
  # one row per individual simulation
  uncount(nreps, .remove = FALSE) %>%
  rowid_to_column("nreps_idx") %>%
  rowwise() %>%
  mutate(
    lm_robust =
      estimatr::lm_robust(
        y ~ X,
        data =
          tibble(y = rnorm(n, mean = mean_y, sd = 1),
                 X = rnorm(n, mean = mean_x, sd = 1)),
        se_type = "stata"
      ) %>%
      coefficients() %>%
      set_names(str_c("coef_", names(.))) %>%
      list()
  ) %>%
  unnest_wider(lm_robust) %>%
  group_by(nreps_idx) %>%
  summarise(
    n = unique(n),
    across(starts_with("coef"), mean),
  )
Which results in:
# A tibble: 700 × 4
nreps_idx n `coef_(Intercept)` coef_X
<int> <dbl> <dbl> <dbl>
1 1 100 1.34 -0.183
2 2 100 0.845 0.0188
3 3 100 0.949 0.0341
4 4 100 1.20 -0.0705
5 5 100 0.731 0.0419
6 6 100 0.809 0.0564
7 7 100 0.920 0.0558
8 8 100 1.22 -0.0673
9 9 100 1.22 -0.171
10 10 100 1.26 -0.127
# … with 690 more rows
Which is computed fairly quickly.
Now, I've not included all the parameters from your code, because honestly it doesn't make sense to take the mean of all of them, but if you want them as well, then...
expand_grid(
  nreps = 50,
  n = seq.default(100, 500, by = 30),
  mean_x = 2, mean_y = 1
) %>%
  rowid_to_column("n_idx") %>%
  uncount(nreps, .remove = FALSE) %>%
  rowid_to_column("nreps_idx") %>%
  rowwise() %>%
  mutate(
    lm_robust =
      estimatr::lm_robust(
        y ~ X,
        data =
          tibble(y = rnorm(n, mean = mean_y, sd = 1),
                 X = rnorm(n, mean = mean_x, sd = 1)),
        se_type = "stata"
      ) %>%
      # SECOND APPROACH: keep the full coefficient table
      summary() %>%
      `[[`("coefficients") %>%
      as_tibble(rownames = "rowname") %>%
      pivot_wider(names_from = "rowname",
                  values_from = everything()) %>%
      # FIRST APPROACH: keep the coefficients only
      # coefficients() %>%
      # set_names(str_c("coef_", names(.))) %>%
      list()
  ) %>%
  unnest_wider(lm_robust) %>%
  print() %>%
  group_by(nreps_idx) %>%
  summarise(
    n = unique(n),
    across(starts_with("Estimate"), mean),
    # insert statements here to summarise the other gathered stuff
  )
But this makes things unnecessarily complicated.
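As for the parallel-processing part of the question: every replication is independent, so the outer loop over sample sizes can simply be spread across cores, reusing your average_model() as-is. A minimal sketch with base R's parallel package (mclapply forks, so on Windows you would switch to parLapply or to the future/furrr packages, which also offer a purrr-style interface; not benchmarked here):
library(parallel)
## run each sample size on its own core; the results come back as a list
tmpres <- mclapply(seq(from=100, to=500, by=30),
                   function(x) average_model(nrep=50, n=x, mean_x=2, mean_y=1),
                   mc.cores = detectCores() - 1)
results <- do.call(rbind, tmpres)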