According to the snippet below, performing an in-place addition with a numba jit-compiled function is ~10 times faster than with numpy's ufunc.
This would be understandable for a function performing multiple numpy operations, as explained in this question.
But here the improvement concerns a single, simple numpy ufunc... So why is numba so much faster? I'm (perhaps naively) expecting that the numpy ufunc internally uses compiled code and that a task as simple as an addition would already be close to optimally implemented.
More generally: should I expect such dramatic performance differences for other numpy functions? Is there a way to predict when it is worth rewriting a function and numba-jitting it?
The code:
import numpy as np
import timeit
import numba
N = 200
target1 = np.ones( N )
target2 = np.ones( N )
# we're going to add these values :
addedValues = np.random.uniform( size=1000000 )
# into these positions :
indices = np.random.randint(N,size=1000000)
@numba.njit
def addat(target, index, tobeadded):
    # accumulate into target, like np.add.at, rather than overwrite
    for i in range(index.size):
        target[index[i]] += tobeadded[i]
# pre-run to jit compile the function
addat(target2, indices, addedValues)
target2 = np.ones(N)  # reset
npaddat = np.add.at
t1 = timeit.timeit( "npaddat( target1, indices, addedValues)", number=3, globals=globals())
t2 = timeit.timeit( "addat( target2, indices, addedValues)", number=3,globals=globals())
assert( (target1==target2).all() )
print("np.add.at time=",t1, )
print("jit-ed addat time =",t2 )
On my computer I get:
np.add.at time= 0.21222890191711485
jit-ed addat time = 0.003389443038031459
So more than a factor of 10 improvement (roughly 60x here)...
CodePudding user response:
The ufunc.add.at() is much more generic than your addat(). It iterates over the array elements and calls a unit operation function for each of them. Call that unit operation function add_vectors(): it adds two input vectors, where a vector means array elements that are C-contiguous and aligned, and it uses SIMD operations when possible.
Because ufunc.add.at() accesses the elements randomly (not sequentially), add_vectors() has to be called separately for each pair of input elements, instead of once for a long contiguous run. Your addat() does not pay this penalty, because Numba generates machine code that accesses the NumPy array elements directly.
You can see the overhead in the Numpy source at this and this for example.
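To see that the cost comes from the scattered, per-element path rather than from the addition itself, you can compare a plain contiguous np.add call with np.add.at doing the same number of additions. A rough sketch (variable names are my own, and the exact timings will of course vary by machine):

import numpy as np
import timeit

M = 1_000_000
vals = np.random.uniform(size=M)
idx = np.random.randint(200, size=M)
contiguous = np.zeros(M)   # same number of additions, but sequential
scattered = np.zeros(200)  # target for the indexed accumulation

# one vectorized call over a contiguous buffer
t_contig = timeit.timeit(lambda: np.add(contiguous, vals, out=contiguous), number=3)
# scattered, per-element accumulation through ufunc.at
t_at = timeit.timeit(lambda: np.add.at(scattered, idx, vals), number=3)
print("contiguous np.add:", t_contig, "np.add.at:", t_at)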
For your second question, on the performance of other NumPy functions, I recommend experimenting yourself, for example with a micro-benchmark like the sketch below, because both NumPy and Numba do quite complex things behind the scenes. (My naive opinion is that a well-written Numba implementation of a ufunc operation will perform better than the NumPy implementation, because Numba also utilizes SIMD operations when possible.)
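As a starting point for such an experiment, here is a rough sketch comparing np.add with a hand-written, jitted element-wise add (numba_add is a hypothetical name, and fastmath=True is an optional compiler flag; results depend on your machine and array sizes):

import numpy as np
import numba
import timeit

# hand-written element-wise add; fastmath=True lets LLVM vectorize more aggressively
@numba.njit(fastmath=True)
def numba_add(a, b, out):
    for i in range(a.size):
        out[i] = a[i] + b[i]

a = np.random.uniform(size=1_000_000)
b = np.random.uniform(size=1_000_000)
out = np.empty_like(a)

numba_add(a, b, out)  # warm-up run to trigger JIT compilation

t_np = timeit.timeit(lambda: np.add(a, b, out=out), number=100)
t_nb = timeit.timeit(lambda: numba_add(a, b, out), number=100)
print("np.add:", t_np, "numba_add:", t_nb)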