Numpy generate Bernoulli trials with different means (or random choice with different p for differen-CodePudding

I want to speed up the code below - namely the for loop. Is there a way to do it in numpy?

import numpy as np
# define seend and random state
rng = np.random.default_rng(0)
# num of throws
N = 10**1
# max number of trials
total_n_trials = 10
# generate the throws' distributions of "performace" - overall scores
# mu_throws = rng.normal(0.7, 0.1, N)
mu_throws = np.linspace(0,1,N)

all_trials = np.zeros(N*total_n_trials)
for i in range(N):
    # generate trials per throw as Bernoulli trials with given mean
    all_trials[i*total_n_trials:(i 1)*total_n_trials] = rng.choice([0,1], size=total_n_trials, p=[1-mu_throws[i],mu_throws[i]])

More explanations - I want to generate N sequences of Bernoulli trials (ie. 0s and 1s, called throws) where each sequence has a mean (probability p) given by values in another array (mu_throws). This can be sampled from a normal distribution or in this case, I took it for simplicity to be a sequence of N=10 numbers from 0 to 1. The approach above works but is slow. I expect N to be at least $10^4$ and total_n_trials then can be anything in the order of hundreds to (tens of) thousands (and this is run several times). I have checked the following post but didn't find an answer. I also know that numpy random choice can generate multidimensional arrays but I didn't find a way to set a different set of ps for different rows. Basically getting the same as what I do above just reshaped:

all_trials.reshape(N,-1)

array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1., 1., 0., 0.],
       [1., 0., 0., 1., 0., 0., 0., 1., 1., 0.],
       [1., 0., 1., 0., 0., 1., 0., 1., 0., 1.],
       [1., 0., 1., 0., 0., 0., 1., 1., 0., 0.],
       [1., 0., 0., 1., 0., 1., 0., 1., 1., 0.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]])

I also thought about this trick but haven't figured how to use it for Bernoulli trials. Thanks for any tips.

CodePudding user response：

Use numpy.random.binomial:

>>> n = 10
>>> mu_throws = np.linspace(0, 1, n)
>>> np.random.binomial(1, mu_throws[:, None], (n, 10))
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 1, 1, 0],
       [1, 0, 0, 0, 1, 1, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 1, 1, 0, 0, 1],
       [0, 1, 1, 1, 1, 0, 1, 1, 1, 0],
       [1, 0, 1, 1, 1, 1, 1, 0, 1, 0],
       [1, 1, 1, 0, 1, 0, 1, 0, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])

Numpy now recommends the new random generator. You can consider using numpy.random.Generator.binomial:

>>> rng = np.random.default_rng()
>>> rng.binomial(1, mu_throws[:, None], (n, 10))
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 1, 0],
       [0, 0, 0, 0, 1, 0, 1, 0, 1, 0],
       [1, 0, 0, 1, 1, 0, 1, 1, 1, 0],
       [0, 1, 0, 1, 1, 1, 1, 1, 1, 1],
       [0, 0, 1, 1, 1, 1, 1, 0, 1, 0],
       [1, 1, 1, 1, 1, 0, 0, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=int64)

CodePudding user response：

N = 11
mu_throws = np.linspace(0,1,N)
total_n_trials = 10_000

all_trials = (np.random.rand(N, total_n_trials).T<mu_throws).T.astype(int)
all_trials # shape (N, total_n_trials)

output:

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       ...,
       [1, 0, 1, ..., 1, 1, 1],
       [1, 1, 1, ..., 0, 1, 1],
       [1, 1, 1, ..., 1, 1, 1]])

Basically, what I'm doing is generate random real numbers in the interval [0, 1) and convert them in boolean results in function of a given probability (in mu_throws).

if you compare the mu_throws (actual probability mean) and the probability estimated in all_trials you have:

np.c_[mu_throws, all_trials.mean(1)]

output:

array([[0.    , 0.    ],
       [0.1   , 0.1003],
       [0.2   , 0.1963],
       [0.3   , 0.305 ],
       [0.4   , 0.4006],
       [0.5   , 0.5056],
       [0.6   , 0.5992],
       [0.7   , 0.6962],
       [0.8   , 0.7906],
       [0.9   , 0.8953],
       [1.    , 1.    ]])

For N and total_n_trials values from your example the time required on my machine is 0.00014019012451171875 sec, vs 0.0012738704681396484 sec of your loop.