I'm trying to sample data from a binomial distribution using two numeric vectors, both of around 8 million elements each. They look like this:
b = runif(n = 8000000, min = 0, max = 1)
d=rpois( 8000000, lambda =0.1)
I want to apply this function for every element of b and d
rbinom_NA = function(x , y) {
result = rbinom(n = x,
size = 1,
prob = y)
if (length(result) == 0)
return(NA)
return(result)
}
I do it with mcmapply. You can see how long it takes for me on my computer.
rbinom_result = vector(mode = "list", length = 8000000)
rbinom_result = mcmapply(
d,
b,
FUN = rbinom_NA,
mc.cores = detectCores()-1) #run on Ubuntu, 32 cores and 64Gb memory
# system.time result:
# user system elapsed
# 90.631 116.584 154.353
Then I calculate the mean for every element of the previous result:
final_result= vapply(rbinom_result, mean, FUN.VALUE = 1)
# system.time result:
# user system elapsed
# 23.372 0.050 23.412
How can I make these two bits of code run faster? Thank you in advance for your help.
Edit: I need to calculate the mean of every element of rbinom_result separately, which might be NA or a variable number of integers depending on the number passed to rbinom_NA()
by the vector d. So in reality I'd make this calculation with d vectors created from different lambdas, so the rbinom_result now looks like this:
#if lamb=0.01
head(rbinom_result)
$pos1
[1] NA
$pos2
[1] NA
$pos3
[1] NA
#etc
#if lamb=5
head(rbinom_result)
$pos1
[1] 0 0 0 0 0 0 0
$pos2
[1] 1 1 1 0 0 1 1
$pos3
[1] 0 0 0 0 0 0
# etc
and then in the end I want to collate final_result into a data frame that contains the results for every value of lambda (lambdas are columns, with 8e6 rows). So the whole thing looks like this:
library(parallel)
lambdas = c(0.01,5)#just two here, will be more
final_result = vector(mode = "list", length = length(lambdas))
for (lamb in lambdas){
b = runif(n = 8e6, min = 0, max = 1)
d=rpois( 8e6, lambda =lamb)
rbinom_NA = function(x , y) {
result = rbinom(n = x,
size = 1,
prob = y)
if (length(result) == 0)
return(NA)
return(result)
}
rbinom_result = vector(mode = "list", length = 8e6)
rbinom_result = mcmapply(
d,
b,
FUN = rbinom_NA,
mc.cores = detectCores()-1) #run on Ubuntu, 32 cores and 64Gb memory
names(rbinom_result) = paste0("pos",1:8e6)
final_result[[as.character(lamb)]]= vapply(rbinom_result, mean, FUN.VALUE = 1)
}
final_result = do.call("cbind", final_result)
In the end:
head(final_result)
0.01 5
pos1 NA 0.0000000
pos2 NA 0.7142857
pos3 NA 0.0000000
pos4 NA 0.6000000
pos5 NA 1.0000000
pos6 NA 0.1666667
#etc
CodePudding user response:
You can get final_result
for each lambda
with a single command involving rbinom
:
b <- runif(8e6)
d <- rpois(8e6, 0.1)
system.time(final_result <- rbinom(length(b), d, b)/d)
# user system elapsed
# 0.327 0.058 0.384
This works because the the following two lines follow the same distribution for atomic x
and y
:
mean(rbinom(x, 1, y))
rbinom(1, x, y)/x
The difference is that the second is vectorized for equal-length vectors of x
and y
simply by changing the n
(first) argument in rbinom
from 1
to length(x)
. Also, if x = 0
, rbinom(1, x, y)/x
is NA
, so the NA behavior is maintained.
For multiple lambda
:
library(parallel)
fSim <- function(lambda, n){
b <- runif(n)
d <- rpois(n, lambda)
return(rbinom(length(b), d, b)/d)
}
lambdas <- c(0.1, 5)
system.time(final_result <- mcmapply(fSim, lambdas, 8e6, mc.cores = min(length(lambdas), detectCores() - 1)))
# user system elapsed
# 2.340 0.931 2.031