I want to generate 1 million simulated datasets and save them in a list column.
The problem I run into is that every generated dataset is identical, which I assume is a result of the operation being vectorized. Is there a way to maintain the speed of a vectorized operation while changing the seed for each execution of rbinom()?
library(tidyverse)

SIMS <- 100

df <-
  tibble(
    sim = 1:5,
    data =
      list(
        tibble(
          read = 1:100,
          events = rbinom(n = SIMS, size = 1e3, prob = .25),
          trials = 1e3
        )
      )
  )
df
head(df$data[[1]])
head(df$data[[2]])
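A quick check confirms that the list column holds the same tibble in every row:

identical(df$data[[1]], df$data[[2]])
# TRUE: every element of the data column contains the same draws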
I have tried parallel::mclapply(), which solves the seed issue, but it is nowhere near as fast as the vectorized operation.
BASERATE <- .25
DAILY_N <- 1e3
DURATION <- 500

# mclapply() passes each element of X as the first argument to FUN,
# so the function takes a dummy argument it never uses
time_to_insight <- function(i) {
  tibble(
    read = 1:DURATION,
    events = rbinom(n = DURATION, size = DAILY_N, prob = BASERATE),
    trials = DAILY_N
  )
}

temp <-
  parallel::mclapply(
    X = 1:1000,
    FUN = time_to_insight,
    mc.cores = 8L
  )
CodePudding user response:
Depending on how your simulation works, it might be possible to create the data for all of the simulations at once and then nest it into a list column, recovering your original format but with unique data in every simulation. (A sketch at the scale of the question follows the result below.)
n <- 5
SIMS <- 100

set.seed(42)
df <-
  tibble(
    sim = rep(1:n, each = SIMS),
    read = rep(1:SIMS, times = n),
    events = rbinom(n = SIMS * n, size = 1e3, prob = .25),
    trials = 1e3
  ) %>%
  nest(data = -sim)
Result
> head(df$data[[1]])
# A tibble: 6 × 3
read events trials
<int> <int> <dbl>
1 1 240 1000
2 2 262 1000
3 3 240 1000
4 4 250 1000
5 5 214 1000
6 6 244 1000
> head(df$data[[2]])
# A tibble: 6 × 3
read events trials
<int> <int> <dbl>
1 1 247 1000
2 2 260 1000
3 3 253 1000
4 4 245 1000
5 5 259 1000
6 6 247 1000
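The same trick scales toward the full problem in the question. Here is a rough sketch using the question's constants; n_sims and df_big are placeholder names I have added, and note that 1e6 simulations of 500 reads each is 5e8 rows, so memory rather than CPU becomes the practical limit:

BASERATE <- .25
DAILY_N <- 1e3
DURATION <- 500
n_sims <- 1e4   # placeholder; increase toward 1e6 as memory allows

set.seed(42)
df_big <-
  tibble(
    sim = rep(seq_len(n_sims), each = DURATION),
    read = rep(seq_len(DURATION), times = n_sims),
    events = rbinom(n = n_sims * DURATION, size = DAILY_N, prob = BASERATE),
    trials = DAILY_N
  ) %>%
  nest(data = -sim)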
CodePudding user response:
You can (almost) maintain the speed of your original function by generating all of the realisations of your random variable (events) for all of your simulations at the same time. This is possible because your event probability remains constant across the simulations.
Here's how I would do it. First write a function to generate all your simulations at the same time.
generateTibble <- function() {
  tibble() %>%
    expand(sim = 1:5, read = 1:100, trials = 1000) %>%
    mutate(events = rbinom(n = nrow(.), size = 1000, prob = 0.25))
}
And use it
df <- generateTibble()
df
# A tibble: 500 × 4
sim read trials events
<int> <int> <dbl> <int>
1 1 1 1000 246
2 1 2 1000 232
3 1 3 1000 258
4 1 4 1000 233
5 1 5 1000 259
6 1 6 1000 251
7 1 7 1000 254
8 1 8 1000 263
9 1 9 1000 267
10 1 10 1000 266
# … with 490 more rows
Demonstrate that realisations do not repeat across sims:
df %>% group_by(sim) %>% group_map(head, n=2)
[[1]]
# A tibble: 2 × 3
read trials events
<int> <dbl> <int>
1 1 1000 246
2 2 1000 232
[[2]]
# A tibble: 2 × 3
read trials events
<int> <dbl> <int>
1 1 1000 255
2 2 1000 261
[[3]]
# A tibble: 2 × 3
read trials events
<int> <dbl> <int>
1 1 1000 234
2 2 1000 239
[[4]]
# A tibble: 2 × 3
read trials events
<int> <dbl> <int>
1 1 1000 252
2 2 1000 258
[[5]]
# A tibble: 2 × 3
read trials events
<int> <dbl> <int>
1 1 1000 255
2 2 1000 258
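If you still want the list-column layout from the question, you can nest the flat result afterwards, for example:

df %>%
  nest(data = -sim)   # one row per sim, each holding its own 100-row tibble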
Compare the speed of the new function with the original (faulty) version:
SIMS <- 100

originalFunction <- function() {
  df <-
    tibble(
      sim = 1:5,
      data =
        list(
          tibble(
            read = 1:100,
            events = rbinom(n = SIMS, size = 1e3, prob = .25),
            trials = 1e3
          )
        )
    )
  df
}
library(microbenchmark)
microbenchmark(originalFunction(), generateTibble(), times = 100)   # call the functions so the work itself is timed
Timings will vary from machine to machine, but the single-rbinom() version stays close to the speed of the original (faulty) function, which seems a small price to pay for simulations that actually differ from one another.
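A final note on reproducibility: because every random draw now comes from one rbinom() call, a single set.seed() before generateTibble() reproduces the whole simulation. A minimal check (the seed value is arbitrary):

set.seed(123)
df1 <- generateTibble()
set.seed(123)
df2 <- generateTibble()
identical(df1, df2)
# TRUE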