I have a 2D dataset with values.shape = (N, N)
and 3 criteria. I want to make an image for those criteria with a color code, so I end up with an NxNx4 boolean array mask
which holds, for every pixel, the flags (background, color 1, color 2, color 3).
- If none of the criteria is fulfilled for a pixel, I set it to [T, F, F, F], i.e. background.
- If any of the criteria are fulfilled (e.g. [F, T, F, T]), then the background is automatically False and I want to randomly assign one of the corresponding colors, here color1 or color3. If just one criterion is True, that's the one that gets picked.
In other words: If a pixel fulfills one criterion, it gets the corresponding color. If it matches several criteria, one of them is randomly picked.
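For reference, here is a minimal sketch of how such a mask and color table could look (the criteria c1/c2/c3 and the thresholds are made up, just to make the setup concrete):

import numpy as np

N = 512
values = np.random.rand(N, N)

# Three made-up criteria on the 2D data
c1 = values > 0.8
c2 = values < 0.1
c3 = (values > 0.4) & (values < 0.5)
background = ~(c1 | c2 | c3)

# NxNx4 boolean mask: (background, color 1, color 2, color 3)
mask = np.stack([background, c1, c2, c3], axis=-1)

# One RGBA color per mask channel
colors = np.array([[  0,   0,   0, 255],   # background
                   [255,   0,   0, 255],   # color 1
                   [  0, 255,   0, 255],   # color 2
                   [  0,   0, 255, 255]],  # color 3
                  dtype=np.uint8)
img = np.empty((N, N, 4), dtype=np.uint8)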
Right now, I do
# assign a probability
p = mask / mask.sum(-1)[:, :, None]
# the indices of the colors
lst = np.arange(len(colors))
for i in range(mask.shape[0]):
    for j in range(mask.shape[1]):
        # randomly pick a color index where the mask is true
        idx = np.random.choice(lst, p=p[i, j])
        # assign that color to the pixel
        img[i, j, :] = colors[idx, :]
I can do everything up to that point fast with numpy expressions, but I cannot do this assignment without a double loop. Is there any way to avoid the double loop with some numpy magic?
CodePudding user response:
AFAIK, there is no way to efficiently vectorize np.random.choice in this case using only Numpy (though there might be a faster way than 2 nested loops). The main issue with the provided implementation is that np.random.choice takes several dozen microseconds per call, so the overall computation takes at least several minutes. This overhead comes from:
- checks: for example, Numpy checks that the probabilities in p sum to 1 (in O(len(p)));
- a slow Kahan summation in the Numpy implementation (which might be optimized).
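For reference, the per-call overhead can be measured with a quick (machine-dependent) micro-benchmark along these lines:

import timeit
import numpy as np

p = np.array([0.0, 0.5, 0.0, 0.5])   # probabilities of one pixel
lst = np.arange(p.size)
n = 10_000
t = timeit.timeit(lambda: np.random.choice(lst, p=p), number=n)
print(f'{t / n * 1e6:.1f} µs per np.random.choice call')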
A simple solution is to use Numba to get rid of most of this overhead, but Numba does not support the p parameter of np.random.choice. It is not easy to make the computation efficient with large p arrays. One solution is to compute the cumulative sum of p and pick a random value (between 0 and 1) so as to select an item of lst with the probability given by p, using a binary search. In fact, it is not even needed to select an item of lst since lst[i] == i. Here is a Numba-based implementation:
import numpy as np
import numba as nb

# Faster equivalent of `np.random.choice(np.arange(len(p)), p=p)`
@nb.njit('(float64[::1],)')
def randomInt(p):
    assert p.size > 0
    if p.size == 1:
        return 0
    tmp = np.empty(p.size, np.float64)
    # Cumulative sum
    s = 0.0
    for i in range(p.size):
        s += p[i]
        tmp[i] = s
    # Renormalization (to increase the accuracy)
    rnd = np.random.rand() * s
    idx = np.searchsorted(tmp, rnd)
    return idx
This function is about 22 times faster than the Numpy one. However, it is slightly less accurate if the standard deviation of the probabilities is huge (e.g. most items set to 0.00001 and a few to 0.2) or if the number of items is huge (e.g. >10_000). That being said, the re-normalization should be enough to give very good results for thousands of items (and non-extreme standard deviations). Note that p is automatically re-normalized.
You can use this function in pure-Python code, but it is faster to also use Numba to speed up the two nested loops (by putting them in a Numba function).
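For example, a sketch of such a driver (using the p, colors and img arrays from the question; the function name fillImage is mine) could look like this:

@nb.njit
def fillImage(p, colors, img):
    # p must be a C-contiguous float64 array so that p[i, j]
    # matches the signature of randomInt
    for i in range(p.shape[0]):
        for j in range(p.shape[1]):
            idx = randomInt(p[i, j])
            img[i, j, :] = colors[idx, :]

fillImage(p, colors, img)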
Further optimizations
On most processors, the speed of randomInt is limited by the dependency chain of the += operations on floating-point numbers. Indeed, a floating-point addition takes about 4 cycles on Skylake-related processors, 3 on previous Intel architectures and also 3 on AMD Zen-based processors. This means that for 1000 items, the cumulative sum needs at least 4000 cycles on Skylake (i.e. 1 µs on a 4 GHz processor). An optimization is to break the dependency chain by computing the even/odd parts of the cumulative sum with more independent instructions, so the processor can better overlap their latency.
Another optimization is to parallelize the nested loops in Numba using prange and the decorator argument parallel=True. That being said, the scalability can be impacted by contention in the allocator due to the many np.empty calls. This issue can be fixed by pre-allocating tmp in the outer loop and using prange only for this loop (a sketch of such a parallel driver is given at the end of this answer). The signature of randomInt needs to change so as to take the temporary tmp array as a parameter.
Here is an optimized Numba function (untested):
@nb.njit('(float64[::1], float64[::1])')
def randomIntOpt(p, tmp):
    assert p.size > 0
    assert tmp.size == p.size
    if p.size == 1:
        return 0
    s1, s2 = p[0], p[1]
    tmp[0], tmp[1] = s1, s1 + s2
    # Optimized cumulative sum
    # (unrolled so as to break the dependency chain)
    i1 = 2
    i2 = 3
    iMax = p.size
    while i2 < iMax:
        s1 += p[i1]
        tmp[i1] = s1 + s2
        s2 += p[i2]
        tmp[i2] = s1 + s2
        i1 += 2
        i2 += 2
    # Remainder
    if p.size % 2 == 1:
        s1 += p[i1]
        tmp[i1] = s1 + s2
    s = s1 + s2
    # Renormalization (to increase the accuracy)
    rnd = np.random.rand() * s
    idx = np.searchsorted(tmp, rnd)
    return idx
This function is about 32 times faster in sequential when called from CPython on my machine. When called from Numba, it is about 40 times faster. Using multiple threads should improve the speedup much further (>100 on mainstream PC processors). This should be enough for the computation to take only a few seconds. I expect the final computation to be clearly memory-bound due to the huge size of p (typically >200 GiB). Consider reducing its size for better performance (it does not seem reasonable to keep a >10 KiB array per pixel).
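For completeness, a possible parallel driver for randomIntOpt based on the above description (again a sketch, not benchmarked here) could look like this:

@nb.njit(parallel=True)
def fillImageParallel(p, colors, img):
    for i in nb.prange(p.shape[0]):
        # One temporary buffer per outer iteration instead of one per pixel,
        # to limit the contention of the allocator
        tmp = np.empty(p.shape[2], np.float64)
        for j in range(p.shape[1]):
            idx = randomIntOpt(p[i, j], tmp)
            img[i, j, :] = colors[idx, :]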