I'm trying to speed up a piece of code that convolves a 1D array (a filter) over each column of a 2D array. Somehow, when I run it with numba's njit, I get a 7x slowdown. My thoughts:
- Maybe column indexing is slowing it down, but switching to row indexing didn't affect performance
- Maybe slice indexing the results of the convolution is slow, but removing it didn't change anything
- I've checked that numba understands all the types properly
(tested on Windows 10, python 3.9.4 from conda, numpy 1.20.2, numba 0.53.1)
Can anyone tell me why this code is slow?
import numpy as np
from numba import njit

def f1(a1, filt):
    l2 = filt.size // 2
    res = np.empty(a1.shape)
    for i in range(a1.shape[1]):
        res[:, i] = np.convolve(a1[:, i], filt)[l2:-l2]
    return res

@njit
def f1_jit(a1, filt):
    l2 = filt.size // 2
    res = np.empty(a1.shape)
    for i in range(a1.shape[1]):
        res[:, i] = np.convolve(a1[:, i], filt)[l2:-l2]
    return res
a1 = np.random.random((6400, 1000))
filt = np.random.random((65))
f1(a1, filt)
f1_jit(a1, filt)
%timeit f1(a1, filt) # 404 ms ± 19.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit f1_jit(a1, filt) # 2.8 s ± 66.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
CodePudding user response:
The problem comes from the Numba implementation of np.convolve. This is a known issue. It turns out that the current Numba implementation is much slower than NumPy's (tested with Numba <= 0.54.1 on Windows).
Under the hood
On one hand, the NumPy implementation calls correlate, which performs a dot product that should be implemented by the fast BLAS library available on your system. On the other hand, the Numba implementation calls _get_inner_prod, which uses np.dot, which should also use the same BLAS library (assuming a BLAS is detected, which should be the case)...
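For reference (my addition, not from the original answer): NumPy's convolution is just a correlation with the filter reversed, which is why both ultimately reduce to sliding dot products. A quick check:

import numpy as np

a = np.random.random(100)
b = np.random.random(9)

# Convolution flips the filter; each output element is then a dot product
# between a window of `a` and the flipped filter.
print(np.allclose(np.convolve(a, b), np.correlate(a, b[::-1], mode='full')))  # True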
That being said, there are multiple issues related to the dot product:
First of all, if the internal variable _HAVE_BLAS of numba/np/arraymath.py is manually disabled, Numba uses a fallback implementation of the dot product that is supposed to be significantly slower. However, it turns out that using this fallback dot product in np.convolve results in a 5 times faster execution than with the BLAS wrapper on my machine! Additionally, using the parameter fastmath=True in the njit decorator results in an overall 8.7 times faster execution! Here is the testing code:
import numpy as np
import numba as nb

def npConvolve(a, b):
    return np.convolve(a, b)

@nb.njit('float64[:](float64[:], float64[:])')
def nbConvolveUncont(a, b):
    return np.convolve(a, b)

@nb.njit('float64[::1](float64[::1], float64[::1])')
def nbConvolveCont(a, b):
    return np.convolve(a, b)

a = np.random.random(6400)
b = np.random.random(65)
%timeit -n 100 npConvolve(a, b)
%timeit -n 100 nbConvolveUncont(a, b)
%timeit -n 100 nbConvolveCont(a, b)
Here are the raw results:
With _HAVE_BLAS=True (default):
126 µs ± 292 ns per loop
1.6 ms ± 21.3 µs per loop
1.6 ms ± 18.5 µs per loop
With _HAVE_BLAS=False:
125 µs ± 359 ns per loop
311 µs ± 1.18 µs per loop
268 µs ± 4.26 µs per loop
With _HAVE_BLAS=False and fastmath=True:
125 µs ± 757 ns per loop
327 µs ± 3.69 µs per loop
183 µs ± 654 ns per loop
Moreover, Numba's np_convolve internally flips one of the array parameters and then performs the dot product using the flipped array, which has a non-trivial stride (i.e. not 1). Such a non-trivial stride may have an impact on dot-product performance. More generally, any transformation preventing the compiler from knowing that the arrays are contiguous will certainly strongly impact performance. Indeed, the following test shows the impact of working on a contiguous array with the dot-product implementation of Numba:
import numpy as np
import numba as nb

def np_dot(a, b):
    return np.dot(a, b)

@nb.njit('float64(float64[::1], float64[::1])')
def nb_dot_cont(a, b):
    return np.dot(a, b)

@nb.njit('float64(float64[::1], float64[:])')
def nb_dot_stride(a, b):
    return np.dot(a, b)

v = np.random.random(128*1024)
%timeit -n 200 np_dot(v, v)        # 36.5 µs ± 4.9 µs per loop
%timeit -n 200 nb_dot_stride(v, v) # 361.0 µs ± 17.1 µs per loop (x10 !!!)
%timeit -n 200 nb_dot_cont(v, v)   # 34.1 µs ± 2.9 µs per loop
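One way to sidestep the stride penalty (my suggestion, worth benchmarking, rather than something from the original answer) is to copy the strided operand into a contiguous buffer before calling np.dot:

import numpy as np
import numba as nb

# Sketch: np.ascontiguousarray copies the strided view into a contiguous
# buffer, letting Numba dispatch to the fast contiguous path. The copy has
# a cost, so this only pays off when the dot product dominates.
@nb.njit('float64(float64[::1], float64[:])')
def nb_dot_copy(a, b):
    return np.dot(a, np.ascontiguousarray(b))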
Some general notes about Numpy and Numba
Note that Numba can hardly accelerate NumPy calls operating on pretty big arrays, since Numba re-implements NumPy functions mostly in Python and uses a JIT compiler (llvmlite) to speed them up, while NumPy is mostly implemented in plain C (with rather slow Python wrapping code). The NumPy code uses low-level processor features like SIMD instructions to greatly speed up the execution of many functions. Both appear to use BLAS libraries known to be highly optimized. NumPy tends to be more optimized overall, since NumPy is currently more mature than Numba: it has had many more contributors working over a longer period of time.
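Finally, as a practical workaround (my addition, not part of the original answer): since Numba compiles plain loops to tight native code, the slow np.convolve path can be bypassed entirely by writing the convolution by hand. A minimal sketch matching the question's trimmed output, assuming an odd-length filter and zero padding at the edges:

import numpy as np
from numba import njit, prange

@njit(fastmath=True, parallel=True)
def f1_manual(a1, filt):
    # Hand-written equivalent of np.convolve(a1[:, i], filt)[l2:-l2] per
    # column, with zeros assumed outside the array (as in a 'full' convolution).
    n, m = a1.shape
    f = filt.size
    l2 = f // 2
    res = np.empty((n, m))
    for i in prange(m):  # columns are independent, so they parallelize well
        for j in range(n):
            s = 0.0
            for k in range(f):
                src = j + l2 - k  # source row in the current column
                if 0 <= src < n:
                    s += a1[src, i] * filt[k]
            res[j, i] = s
    return res

Besides avoiding the slow convolve path, the explicit loop keeps all accesses on contiguous data and lets fastmath vectorize the inner reduction.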