numpy array: fast assign short array to large array with index


I want to assign values to a large array from short arrays using indexing. A simple version of the code is as follows:

import numpy as np

def assign_x():
    a = np.zeros((int(3e6), 20))
    index_a = np.random.randint(int(3e6), size=(int(3e6), 20))
    b = np.random.randn(1000, 20)
    for i in range(20):
        index_b = np.random.randint(1000, size=int(3e6))
        a[index_a[:, i], i] = b[index_b, i]
    return a

%timeit x = assign_x()
# 2.79 s ± 18.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I have tried other approaches which may be relevant, for example np.take and Numba's jit, but the above seems to be the fastest way. It could possibly also be sped up using multiprocessing. I have profiled the code; most of the time is spent on the line below, as it runs many times (20 here):

a[index_a[:, i], i] = b[index_b, i]

Any chance I can make this faster before using multiprocessing?

CodePudding user response:

Why this is slow

This is slow because the memory access pattern is very inefficient. Random accesses are slow because the processor cannot predict them, so they cause expensive cache misses (whenever the array does not fit in the L1/L2 cache) that cannot be avoided by prefetching data ahead of time. The problem is that the arrays are too big to fit in caches: index_a and a each take 457 MiB, while b takes 156 KiB. As a result, accesses to b typically hit the L2 cache with a higher latency, and accesses to the two other arrays go to RAM. This is slow because current DDR RAM has a huge latency of 60-100 ns on a typical PC. Even worse: this latency is not likely to get much smaller in the near future, since RAM latency has barely changed over the last two decades. This is called the Memory wall.

Note also that modern processors fetch a full cache line of usually 64 bytes from RAM when a value at a random location is requested, so 56/64 = 87.5% of the bandwidth is wasted. Finally, generating random numbers is a fairly expensive process, especially for large integers, and np.random.randint can generate either 32-bit or 64-bit integers depending on the target platform.
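
To put concrete numbers on those sizes, here is a quick sanity check (a minimal sketch, assuming the default float64/int64 dtypes of a 64-bit platform):

import numpy as np

a = np.zeros((int(3e6), 20))                                 # float64
index_a = np.random.randint(int(3e6), size=(int(3e6), 20))   # int64 on most 64-bit platforms, int32 on some (e.g. Windows)
b = np.random.randn(1000, 20)                                 # float64

print(a.nbytes / 2**20)        # ~457.8 MiB
print(index_a.nbytes / 2**20)  # ~457.8 MiB with int64 (half that with int32)
print(b.nbytes / 2**10)        # ~156.2 KiB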

How to improve this

The first improvement is to put the indirection on the most contiguous dimension, which is generally the last one, since a[:,i] is slower than a[i,:]. You can transpose the arrays and swap the indexed values. However, the Numpy transposition function only returns a view and does not actually transpose the array in memory, so an explicit copy is currently required. The best option here is simply to generate the arrays directly in the transposed layout so that accesses are efficient (rather than paying for expensive transpositions). Note that you can also use single precision so the arrays fit better in caches, at the expense of a lower precision.
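
For instance, a minimal sketch showing that .T only returns a view and how to force an actual in-memory copy (with the single-precision variant shown as well):

import numpy as np

a = np.zeros((int(3e6), 20)))
print(a.T.flags['C_CONTIGUOUS'])   # False: .T is just a view, the memory layout is unchanged
a_t = np.ascontiguousarray(a.T)    # explicit copy with a real (20, 3e6) layout in memory
print(a_t.flags['C_CONTIGUOUS'])   # True

# Optional: single precision halves the memory footprint (at the cost of accuracy)
a32 = np.zeros((20, int(3e6)), dtype=np.float32)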

Here is an example that returns a transposed array:

import numpy as np

def assign_x():
    a = np.zeros((20, int(3e6)))
    index_a = np.random.randint(int(3e6), size=(20, int(3e6)))
    b = np.random.randn(20, 1000)
    for i in range(20):
        index_b = np.random.randint(1000, size=int(3e6))
        a[i, index_a[i, :]] = b[i, index_b]
    return a

%timeit x = assign_x()

The code can be improved further using Numba so as to run it in parallel (one core should not be enough to saturate the memory because of the RAM latency, but many cores can use it better since multiple fetches can be done concurrently). Moreover, it helps avoid the creation of big temporary arrays.

Here is an optimized Numba version:

import numpy as np
import numba as nb
import random

@nb.njit('float64[:,:]()', parallel=True)
def assign_x():
    a = np.zeros((20, 3_000_000))
    b = np.random.randn(20, 1000)
    # Parallelize over the 20 rows: each thread works on its own row of `a`.
    for i in nb.prange(20):
        for j in range(3_000_000):
            # random.randint includes both bounds, hence the -1
            index_a = random.randint(0, 3_000_000 - 1)
            index_b = random.randint(0, 1000 - 1)
            a[i, index_a] = b[i, index_b]
    return a

%timeit x = assign_x()

Here are results on a 10-core Skylake Xeon processor:

Initial code:                  2798 ms
Better memory access pattern:  1741 ms
With Numba:                     318 ms

Note that parallelizing the inner-most loop would theoretically be faster because one line of a is more likely to fit in the last-level cache. However, doing this causes a race condition that can only be fixed efficiently with atomic stores, which are not yet available in Numba (on CPU).
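
For reference, here is a rough sketch of what that inner-loop parallelization would look like (illustrative only: as explained above, two iterations can draw the same index_a, so the plain stores race):

import numpy as np
import numba as nb

@nb.njit(parallel=True)
def assign_x_racy():
    a = np.zeros((20, 3_000_000))
    b = np.random.randn(20, 1000)
    for i in range(20):
        # Parallel inner loop: several threads may write a[i, index_a] for the
        # same index_a concurrently with non-atomic stores (data race).
        for j in nb.prange(3_000_000):
            index_a = np.random.randint(0, 3_000_000)  # high bound is exclusive
            index_b = np.random.randint(0, 1000)
            a[i, index_a] = b[i, index_b]
    return a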

Note that the final code does not scale well because it is memory-bound. This is due to 87.5% of the memory throughput being wasted, as explained before. Additionally, on many processors (like all Intel and AMD Zen processors) the write-allocate cache policy forces data to be read from memory for each store in this case. This makes the computation even more inefficient, raising the wasted throughput to 93.7%... AFAIK, there is no way to prevent this in Python. In C/C++, the write-allocate issue can be fixed using low-level (non-temporal store) instructions. The rule of thumb: avoid random memory access patterns on big arrays like the plague.
