Parallel execution speed of a vectorized function using numba

Time:06-19

Years ago I wrote a math library with numba to do vectorized operations on large arrays of elements (for example, multiplying two arrays of matrices). I recently opened it up again to experiment with parallelism (a feature which did not exist at the time). Doing so gave a speed improvement, but not as much as I expected.

To further understand, I wrote this basic example:

import numpy as np
from numba import njit, prange, float64

@njit(float64[:](float64[:], float64[:]), parallel=True)
def array_sum(array0, array1):
    """
    Sum elements of array0 to array1, similar to doing array0 + array1 on two numpy arrays
    """

    result = np.empty(array0.shape)

    for i in prange(array0.shape[0]):
        result[i] = array0[i] + array1[i]

    return result

And then tested with:

import time
v0 = np.random.random(10**7) # 10 million floats
v1 = np.random.random(10**7) # 10 million floats
array_sum(v0,v1) # run this once so the code compiles

#--- test start ---#
start_time = time.time()
array_sum(v0,v1) # do the work
end_time = time.time()
delta = end_time-start_time
print(delta)

On my machine I get the following results:

# direct numpy array0 + array1 ---> 0.016865015029907227
# parallel=False             ---> 0.016468048095703125
# parallel=True              ---> 0.010709047317504883

With parallel=False the speed I get matches numpy's, which is what I expected. With parallel=True I get a ~1.54x speedup, which is nice, but for a simple operation like this I was expecting a larger impact on my 8-core machine.
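As an aside, one-shot wall-clock timings like the snippet above are noisy. A common refinement (a sketch, not part of the original test) is to repeat the call and take the minimum, using time.perf_counter, which has higher resolution than time.time:

```python
import time

def best_time(fn, *args, repeats=10):
    """Return the minimum wall-clock time of fn(*args) over several runs.

    The minimum filters out runs perturbed by other processes, so it is a
    better estimate of the intrinsic cost than a single measurement.
    """
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        times.append(time.perf_counter() - t0)
    return min(times)
```

With this helper, the comparison would be `best_time(array_sum, v0, v1)` versus `best_time(np.add, v0, v1)`.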

QUESTION

Can parallelism be used in a more effective way to do vectorized operations similar to the example above?

CodePudding user response:

If I read the documentation https://numba.pydata.org/numba-doc/dev/user/parallel.html correctly, the parallel option does not exploit your multiple cores: it mostly seems to generate SIMD (AVX, what-have-you) instructions. And in that case your operation is not likely to be sped up much further, since it is bandwidth-bound, so there is not much to be gained.

CodePudding user response:

prange does use multiple threads when parallel=True is set, and therefore multiple cores (if available) of the target machine.

That being said, your computation is memory-bound. For large arrays, the data has to be read from and written to RAM, which is a shared resource. Thus, a few cores are generally enough to saturate the RAM bandwidth (and this is unlikely to improve in the future, since the gap between compute and memory speed has been widening for the last three decades). This is especially true on PCs. On servers the situation is often more complex, due to NUMA architectures.
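A back-of-the-envelope check supports this. Summing two arrays of 10**7 float64 values reads two arrays and writes one, about 240 MB of memory traffic; assuming a sustained RAM bandwidth of roughly 20 GB/s (an illustrative figure for a typical desktop, not a measurement of the asker's machine), the transfers alone take about 12 ms, in the same ballpark as the timings above:

```python
n = 10**7                            # elements per array
bytes_per_elem = 8                   # float64
traffic = 3 * n * bytes_per_elem     # read array0, read array1, write result
bandwidth = 20e9                     # assumed ~20 GB/s sustained RAM bandwidth
print(traffic / bandwidth)           # -> 0.012 (seconds)
```

If the lower bound set by memory traffic is already close to the measured time, adding cores cannot help much.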

The only way to improve the scalability is to make the code less memory-bound. For example, you can operate on chunks that fit in the L1/L2 cache rather than in RAM (i.e. better memory locality). But this requires computing more than a sum of two arrays; it is mainly useful when you want to fuse several basic operations into one. Also note that in-place sums can be faster on most architectures (typically the ones with a write-allocate cache policy, like x86-64).
