Fast way to get bootstrap means from long list


The code below runs in about 2-3 seconds. Is it realistic to make it 5-10x faster? I tried a Poisson bootstrap, but it was even slower. I also cannot use joblib or multiprocessing here, because I already parallelize at a higher level of the code.

import numpy as np

lst = range(1000000)
result = np.random.choice(lst, size=(100, len(lst)), replace=True).mean(axis=1)

I think the main cost is the resampling inside np.random.choice; maybe that is where the optimization should happen.
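For reference, a Poisson bootstrap of the mean typically replaces resampling with replacement by giving each observation a Poisson(1) weight, so each replicate mean becomes a weighted average. A minimal sketch (assuming this is the kind of approach meant; names are illustrative):

import numpy as np

arr = np.arange(1000000, dtype=np.float64)
n_boot = 100
# One Poisson(1) weight per item per bootstrap replicate
weights = np.random.poisson(1.0, size=(n_boot, arr.size))
# Weighted mean per replicate
result = (weights * arr).sum(axis=1) / weights.sum(axis=1)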

CodePudding user response:

TL;DR: np.random.choice is the main source of the slowdown. NumPy uses a latency-bound algorithm in this case, and it is very hard to do better sequentially (because of the random-access memory pattern). The random number generation itself is also fairly expensive.

Indeed, items at random indices of the input list must be fetched from memory. Since the input list is converted by NumPy to a contiguous array taking about 4 to 8 MiB, it will most likely not fit in your CPU cache. As a result, values are fetched from the main RAM. Moreover, the random access pattern prevents the processor from loading the values quickly. Modern mainstream x86 processor cores can fetch about 10 items concurrently (which is impressive, since the code is supposed to be sequential). However, the latency of current (DDR) RAM is about 50-100 ns, so fetching 100 million random items from RAM should take roughly 10^8 fetches x ~70 ns / ~10 concurrent accesses ≈ 0.7-1 second. Not to mention that 100 million random values must also be generated (rather slow) and the 400-800 MiB intermediate buffer must be filled too.
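To see where the time goes, you can decompose what np.random.choice effectively does here into index generation, the random-access gather, and the reduction, and time each step separately (a rough sketch with illustrative names; actual timings depend on your machine):

import numpy as np

arr = np.arange(1000000)
# 1. Random index generation (RNG cost)
idx = np.random.randint(0, arr.size, size=(100, arr.size))
# 2. Random-access gather from the large array (latency-bound cost)
gathered = arr[idx]
# 3. Reduction into 100 bootstrap means
result = gathered.mean(axis=1)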

One simple way to speed up this latency-bound computation is to use multiple threads. You can use Numba (or Cython) to do that. Moreover, smaller temporary arrays can be used to reduce the overhead of huge NumPy arrays: the means can be computed on the fly.

import numba as nb
import numpy as np

# Assume the input array is contiguous
@nb.njit('float64[:](int_[::1])', parallel=True)
def parallelShuffle(arr):
    n, m = 100, arr.size
    res = np.empty(n, np.float64)
    for i in nb.prange(n):
        s = 0.0
        tmp = np.random.randint(0, m, m)
        for j in range(m):
            s += arr[tmp[j]]
        res[i] = s / m
    return res

lst = range(1000000)
lst = np.fromiter(lst, dtype=np.int_)  # Fast iterable to Numpy conversion
result = parallelShuffle(lst)

This implementation is about 4.5 times faster on my 6-core machine. Note that Numba's integer random number generator is sadly currently slower than NumPy's.

CodePudding user response:

You can get a ~3x boost simply by not converting a Python range object to a NumPy array and selecting from it; passing the upper bound directly lets np.random.choice sample from 0..999999 without the gather step:

result = np.random.choice(1000000, size=(100, 1000000), replace=True).mean(axis=1)

Selecting numbers in this manner is equivalent to just generating random integers in the same range. Unfortunately, the two approaches take about the same amount of time:

result = np.random.randint(1000000, size=(100, 1000000)).mean(axis=1)
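If you are on a reasonably recent NumPy, the newer Generator API may shave off a bit more time, since its PCG64-based integer generation is generally faster than the legacy np.random.randint (a sketch; the exact gain depends on your NumPy version):

rng = np.random.default_rng()
# Same bootstrap means, using the new Generator API for the random integers
result = rng.integers(0, 1000000, size=(100, 1000000)).mean(axis=1)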