Fastest way to generate random bits for Brownian Motion simulation with Python


TL;DR: I am trying to generate large amounts of random numbers from the set {-1, 1}. I found that np.random.Generator.choice is the slowest method, while the fastest is to use np.random.Generator.random and convert from boolean to integer. Is there a better way to generate such random numbers in large amounts?

I want to simulate the motion of m particles in a closed container with reflective surfaces in d-D in a mesh for n iterations. There is also a hitting condition that I check after every iteration. Without loss of generality, let's assume d=2.

Since I use a mesh, the movement part boils down to generating -1s and 1s. At this point I treat each particle individually in a loop, so after generating these -1s and 1s I access them one by one, and as the number of particles still in the system decreases, I generate fewer numbers. This part can be improved and I am open to comments, but it is not what I am asking about.

My question is: what is the fastest way to generate many random numbers from the set {-1, 1}? I thought of five approaches, four with np and one with random.

I first thought the obvious choice was np.random.Generator.choice([-1,1], size=(m,2)). It took 595.6 seconds.

n=int(5e6)
m=1000
rng=np.random.default_rng()
tic=time.time()
for i in range(n):
    a=rng.choice([-1,1], size=(m,2))
    for j in range(m): #This useless loop is here to demonstrate that I am accessing each line one by one
        b=a[j]
print("choice",time.time()-tic)

I then tried to map uniform~[0,1) values to the integers -1 and 1 with the code below, which took 527.6 s. Note that the +0 part converts True to 1 and False to 0.

n=int(5e6)
m=1000
rng=np.random.default_rng()
tic=time.time()
for i in range(n):
    a=((rng.random(size=(m,2))>0.5)+0)*2-1
    for j in range(m): 
        b=a[j]
print("operate",time.time()-tic)

I tried something very similar with np.sign, which took 543.5 s.

n=int(5e6)
m=1000
rng=np.random.default_rng()
tic=time.time()
for i in range(n):
    a=np.sign(rng.random(size=(m,2))-0.5)
    for j in range(m):
        b=a[j]
print("sign",time.time()-tic)

Finally, I tried np.random.Generator.integers with a few extra manipulations. It took 553.8 s.

n=int(5e6)
m=1000
rng=np.random.default_rng()
tic=time.time()
for i in range(n):
    a=(rng.integers(2,size=(m,2))*2)-1
    for j in range(m):
        b=a[j]
print("integer",time.time()-tic)

Since I access them one by one, I could also use random.random() on the spot. That takes far too long (~750 s).
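For completeness, a minimal sketch of what that on-the-spot approach looks like (assuming each particle draws its two step components independently per iteration):

```python
import random

m = 1000

steps = []
for j in range(m):
    # Draw each component on demand: comparison gives True/False, mapped to +1/-1
    dx = 1 if random.random() > 0.5 else -1
    dy = 1 if random.random() > 0.5 else -1
    steps.append((dx, dy))
```

Each call to random.random() has Python-level overhead, which is why doing this m*2 times per iteration is slower than any of the vectorized NumPy variants.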

What baffles me most is that rng.choice() is by far the worst choice. I knew that it didn't work well for small set sizes, but since m is large, I thought it could redeem itself. I am also surprised that the boolean-to-integer conversion offers the fastest solution.

So, my question is: what is the fastest way to generate m*d random integers belonging to the set {-1, 1}? Is it one of those I tried, or am I missing something? Or am I not even on the right track?

CodePudding user response:

It will be a lot faster to generate 32-bit unsigned integers and extract 32 random bits from each integer. By asking for the full range, you avoid any scaling or conversion overhead on the words pulled from the RNG, so it should be the most efficient approach (e.g., rng.bytes uses this method; see the source):

import numpy as np

m = 65536
rng = np.random.default_rng()
max_int = 2 ** 32

def choice():
    return rng.choice([-1,1], size=(m,2))

def random():
    return ((rng.random(size=(m,2))>0.5)+0)*2-1

def uint32_32x_too_many():
    return rng.integers(0, max_int, size=(m, 2), dtype=np.uint32)

def uint32():
    return rng.integers(0, max_int, size=(m//32, 2), dtype=np.uint32)

Benchmarking gives (the file name is rand.py):

python -mtimeit -s'import rand' 'rand.choice()'
500 loops, best of 5: 537 usec per loop
python -mtimeit -s'import rand' 'rand.random()'
500 loops, best of 5: 650 usec per loop
python -mtimeit -s'import rand' 'rand.uint32_32x_too_many()'
1000 loops, best of 5: 260 usec per loop
python -mtimeit -s'import rand' 'rand.uint32()'
20000 loops, best of 5: 11.9 usec per loop

Notice that uint32_32x_too_many() generates m*2 32-bit random ints (i.e., 32x more random bits than required) and is still considerably faster than choice or random. But uint32() is dramatically faster: it generates only m//32 * 2 random ints, which still supplies the required m*2 random bits.

Accessing the random bits in the packed integers will be slightly more expensive. To test the k-th bit, you access bit number k % 32 of the word at index k // 32, e.g.:

(a[k // 32] >> (k % 32)) & 1
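A short self-contained sketch of that lookup, including the final 2*bit - 1 map to ±1 steps (the names bit and steps are mine, not from the answer):

```python
import numpy as np

rng = np.random.default_rng()
num_bits = 64
# One uint32 word per 32 requested bits
a = rng.integers(0, 2**32, size=(num_bits + 31) // 32, dtype=np.uint32)

def bit(k):
    # Word k // 32 holds the k-th bit at position k % 32
    return (int(a[k // 32]) >> (k % 32)) & 1

# Map each random bit 0/1 to a step -1/+1
steps = [2 * bit(k) - 1 for k in range(num_bits)]
```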

If that proves to be slower than you would like, one optimization is to have an inner loop sized so that it consumes 32 random bits. Then you can simply use one new random int per loop, and inside the loop you can test for each random bit, and shift the int each time:

rand_bit = rand_int & 1
rand_int = rand_int >> 1
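Put together, a sketch of such an inner loop that consumes exactly 32 bits per random integer (a hypothetical standalone example, not the answer's verbatim code):

```python
import numpy as np

rng = np.random.default_rng()
ints = rng.integers(0, 2**32, size=4, dtype=np.uint32)

bits = []
for rand_int in ints:
    rand_int = int(rand_int)
    # Each 32-bit int supplies 32 iterations' worth of random bits
    for _ in range(32):
        rand_bit = rand_int & 1
        rand_int >>= 1
        bits.append(2 * rand_bit - 1)  # map 0/1 to -1/+1
```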

If your inner loop is not scalar and requires an array of random bits, then you can efficiently vectorize the conversion of random 32-bit integers to an array of random bits with, eg:

num_ints = 10
a = rng.integers(0, 2**32, size=(num_ints))
rand_bits = (np.repeat(a, 32) >> np.tile(np.arange(0, 32), num_ints)) & 1

This sets rand_bits to an array of 320 random 0/1 values.
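To turn those 0/1 bits into ±1 steps of the (m, 2) shape the question uses, one extra vectorized map suffices (a sketch; the reshape to pairs is my assumption about the intended layout):

```python
import numpy as np

rng = np.random.default_rng()
num_ints = 10
a = rng.integers(0, 2**32, size=num_ints)

# Unpack all 32 bits of each integer, then map 0/1 to -1/+1
rand_bits = (np.repeat(a, 32) >> np.tile(np.arange(0, 32), num_ints)) & 1
steps = (2 * rand_bits - 1).reshape(-1, 2)  # shape (160, 2), values in {-1, +1}
```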

CodePudding user response:

The simplest way to speed things up (not only the RNG) is to use numba.

import numpy as np
import time
from numba import njit, prange

n=int(5e6)
m=1000
rng=np.random.default_rng()
tic=time.time()
for i in range(n):
    a=((rng.random(size=(m,2))>0.5)+0)*2-1
print("operate",time.time()-tic)

@njit(parallel=True)
def getRandomNumbers(n):
    a = np.zeros(n)
    for i in prange(n):
        a[i] = ((np.random.rand()>0.5)+0)*2-1
    return a

tic=time.time()
for i in range(n):
    a = getRandomNumbers(2*m).reshape((m,2))
print("numba",time.time()-tic)

The output:

operate 68.01399993896484
numba 43.64120006561279

You can also try numba's GPU-accelerated RNG. It could be the fastest, especially for large-scale problems. Disclaimer: I have never tried it myself.

P.S. A simple single-threaded approach in C takes ~10 seconds. A parallel variant will be several times faster. Consider using a C backend library to produce the numbers.
