Optimization using rbinom with NAs and mcmapply-CodePudding

I'm trying to sample data from a binomial distribution using two numeric vectors, both of around 8 million elements each. They look like this:


b = runif(n = 8000000, min = 0, max = 1)
d=rpois( 8000000, lambda =0.1)

I want to apply this function for every element of b and d

rbinom_NA = function(x , y) {
  result = rbinom(n = x,
                  size = 1,
                  prob = y) 
  
  if (length(result) == 0)
    return(NA)
  return(result)
}

I do it with mcmapply. You can see how long it takes for me on my computer.

rbinom_result = vector(mode = "list", length = 8000000)
rbinom_result = mcmapply(
  d,
  b,
  FUN = rbinom_NA, 
  mc.cores = detectCores()-1) #run on Ubuntu, 32 cores and 64Gb memory
# system.time result:
#  user  system elapsed 
# 90.631 116.584 154.353

Then I calculate the mean for every element of the previous result:

final_result= vapply(rbinom_result, mean, FUN.VALUE = 1)
# system.time result:
# user  system elapsed 
# 23.372   0.050  23.412

How can I make these two bits of code run faster? Thank you in advance for your help.

Edit: I need to calculate the mean of every element of rbinom_result separately, which might be NA or a variable number of integers depending on the number passed to rbinom_NA() by the vector d. So in reality I'd make this calculation with d vectors created from different lambdas, so the rbinom_result now looks like this:

#if lamb=0.01
head(rbinom_result)
$pos1
[1] NA

$pos2
[1] NA

$pos3
[1] NA
#etc

#if lamb=5
head(rbinom_result)

 $pos1
 [1] 0 0 0 0 0 0 0
 
 $pos2
 [1] 1 1 1 0 0 1 1
 
 $pos3
 [1] 0 0 0 0 0 0
# etc

and then in the end I want to collate final_result into a data frame that contains the results for every value of lambda (lambdas are columns, with 8e6 rows). So the whole thing looks like this:

library(parallel)
lambdas = c(0.01,5)#just two here, will be more
final_result = vector(mode = "list", length = length(lambdas))

for (lamb in lambdas){
b = runif(n = 8e6, min = 0, max = 1)
d=rpois( 8e6, lambda =lamb)

rbinom_NA = function(x , y) {
  result = rbinom(n = x,
                  size = 1,
                  prob = y) 
  
  if (length(result) == 0)
    return(NA)
  return(result)
}


rbinom_result = vector(mode = "list", length = 8e6)
rbinom_result = mcmapply(
  d,
  b,
  FUN = rbinom_NA, 
  mc.cores = detectCores()-1) #run on Ubuntu, 32 cores and 64Gb memory

names(rbinom_result) = paste0("pos",1:8e6)

final_result[[as.character(lamb)]]= vapply(rbinom_result, mean, FUN.VALUE = 1)

}
final_result = do.call("cbind", final_result)

In the end:

 head(final_result)
     0.01         5
pos1   NA 0.0000000
pos2   NA 0.7142857
pos3   NA 0.0000000
pos4   NA 0.6000000
pos5   NA 1.0000000
pos6   NA 0.1666667
#etc

CodePudding user response：

You can get final_result for each lambda with a single command involving rbinom:

b <- runif(8e6)
d <- rpois(8e6, 0.1)
system.time(final_result <- rbinom(length(b), d, b)/d)
#   user  system elapsed 
#  0.327   0.058   0.384

This works because the the following two lines follow the same distribution for atomic x and y:

mean(rbinom(x, 1, y))
rbinom(1, x, y)/x

The difference is that the second is vectorized for equal-length vectors of x and y simply by changing the n (first) argument in rbinom from 1 to length(x). Also, if x = 0, rbinom(1, x, y)/x is NA, so the NA behavior is maintained.

For multiple lambda:

library(parallel)

fSim <- function(lambda, n){
  b <- runif(n)
  d <- rpois(n, lambda)
  return(rbinom(length(b), d, b)/d)
}

lambdas <- c(0.1, 5)
system.time(final_result <- mcmapply(fSim, lambdas, 8e6, mc.cores = min(length(lambdas), detectCores() - 1)))
#   user  system elapsed 
#  2.340   0.931   2.031