I have the following snippet of code where I've used Numba in order to speed it up:
import numpy as np
from numpy import linalg
from numba import jit
Sigma = np.array([
[1, 1, 0.5, 0.25],
[1, 2.5, 1, 0.5],
[0.5, 1, 0.5, 0.25],
[0.25, 0.5, 0.25, 0.25]
])
Z = np.array([0.111, 0.00658])
@jit(nopython=True)
def mean(Sigma, Z):
    return np.dot(np.dot(Sigma[0:2, 2:4], linalg.inv(Sigma[2:4, 2:4])), Z)
print(mean(Sigma, Z))
However, Numba is complaining:
NumbaPerformanceWarning: np.dot() is faster on contiguous arrays, called on (array(float64, 2d, A), array(float64, 2d, F))
return np.dot(np.dot(Sigma[0:2, 2:4], linalg.inv(Sigma[2:4, 2:4])), Z)
If I'm not mistaken (after reading this), the contiguous structure of the NumPy arrays is broken by the slicing of sub-matrices from Sigma (i.e., "Sigma[0:2, 2:4]"). Is this correct? If so, is there any way to resolve this warning? I believe resolving it would help speed up my code, which is my main goal. Thanks.
CodePudding user response:
You get this warning because dot and inv are optimized for contiguous arrays. However, given the small input size, this is not a huge problem. Still, you can at least specify that the input arrays are contiguous, using the signature 'float64[:](float64[:,::1], float64[::1])' in the @jit(...) decorator. This also causes the function to be compiled eagerly.
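For illustration, here is a minimal sketch of that suggestion (this block is my own addition, not part of the original answer): the mean function from the question is left as-is apart from the explicit signature, and the flag check at the end only confirms that the sliced sub-matrix is still a non-contiguous view, which is what np.dot warns about.
import numpy as np
from numba import jit

Sigma = np.array([
    [1, 1, 0.5, 0.25],
    [1, 2.5, 1, 0.5],
    [0.5, 1, 0.5, 0.25],
    [0.25, 0.5, 0.25, 0.25]
])
Z = np.array([0.111, 0.00658])

# Explicit signature: 'float64[:,::1]' and 'float64[::1]' declare C-contiguous
# inputs, and providing a signature makes Numba compile eagerly at decoration time.
@jit('float64[:](float64[:,::1], float64[::1])', nopython=True)
def mean(Sigma, Z):
    return np.dot(np.dot(Sigma[0:2, 2:4], np.linalg.inv(Sigma[2:4, 2:4])), Z)

# The sliced 2x2 block is a strided view, neither C- nor F-contiguous,
# which is why np.dot can still emit the performance warning.
sub = Sigma[0:2, 2:4]
print(sub.flags['C_CONTIGUOUS'], sub.flags['F_CONTIGUOUS'])  # False False

print(mean(Sigma, Z))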
The biggest performance issue in this function is the creation of a few temporary arrays and the call to linalg.inv, which is not designed to be fast for very small matrices. The inverse can instead be computed from a simple closed-form expression based on the determinant: for a 2x2 matrix [[a, b], [c, d]], the inverse is 1/(a*d - b*c) * [[d, -b], [-c, a]].
Here is the resulting code:
import numpy as np
import numba as nb

@nb.njit('float64[:](float64[:,::1], float64[::1])')
def fast_mean(Sigma, Z):
    # Compute the inverse of the lower-right 2x2 block via its determinant
    mat_a = Sigma[2, 2]
    mat_b = Sigma[2, 3]
    mat_c = Sigma[3, 2]
    mat_d = Sigma[3, 3]
    invDet = 1.0 / (mat_a*mat_d - mat_b*mat_c)
    inv_a = invDet * mat_d
    inv_b = -invDet * mat_b
    inv_c = -invDet * mat_c
    inv_d = invDet * mat_a
    # Multiply the upper-right 2x2 block by the inverse
    mat_a = Sigma[0, 2]
    mat_b = Sigma[0, 3]
    mat_c = Sigma[1, 2]
    mat_d = Sigma[1, 3]
    tmp_a = mat_a*inv_a + mat_b*inv_c
    tmp_b = mat_a*inv_b + mat_b*inv_d
    tmp_c = mat_c*inv_a + mat_d*inv_c
    tmp_d = mat_c*inv_b + mat_d*inv_d
    # Final dot product with Z
    z0, z1 = Z
    result = np.empty(2, dtype=np.float64)
    result[0] = tmp_a*z0 + tmp_b*z1
    result[1] = tmp_c*z0 + tmp_d*z1
    return result
This is about 3 times faster on my machine. Note that >60% of the time is spent in the overhead of calling the Numba function and creating the temporary output array. Thus, it is probably wise to use Numba in the caller functions too, so as to remove this overhead.
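As a rough illustration of that point (the process_many function and the batch layout of Zs are assumptions of mine, not part of the original answer), a jitted caller could look like this:
import numpy as np
import numba as nb

# Hypothetical jitted caller: since fast_mean is itself a Numba function,
# calling it from another @njit function keeps the whole loop in nopython
# mode, so the Python call overhead is paid only once for the entire batch.
@nb.njit
def process_many(Sigma, Zs):
    out = np.empty((Zs.shape[0], 2), dtype=np.float64)
    for i in range(Zs.shape[0]):
        r = fast_mean(Sigma, Zs[i])
        out[i, 0] = r[0]
        out[i, 1] = r[1]
    return out

# Example usage with Sigma from the question and a batch of 1000 Z vectors
Zs = np.random.rand(1000, 2)
means = process_many(Sigma, Zs)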
You can also pass the result array as a parameter so as to avoid creating the output array, which is quite expensive, as pointed out by @max9111. This is only useful if you can preallocate the output buffer in the caller functions (once if possible). With this, the function is nearly 6 times faster.
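Here is a minimal sketch of what that variant could look like; the fast_mean_out name and the void signature are my assumptions, not code from the original answer, and Sigma and Z are reused from the question:
import numpy as np
import numba as nb

# Hypothetical variant of fast_mean that writes into a caller-provided buffer
# instead of allocating a new output array on every call.
@nb.njit('void(float64[:,::1], float64[::1], float64[::1])')
def fast_mean_out(Sigma, Z, out):
    # Inverse of the lower-right 2x2 block via its determinant
    a, b = Sigma[2, 2], Sigma[2, 3]
    c, d = Sigma[3, 2], Sigma[3, 3]
    invDet = 1.0 / (a*d - b*c)
    inv_a, inv_b = invDet*d, -invDet*b
    inv_c, inv_d = -invDet*c, invDet*a
    # Product of the upper-right 2x2 block with the inverse
    e, f = Sigma[0, 2], Sigma[0, 3]
    g, h = Sigma[1, 2], Sigma[1, 3]
    tmp_a = e*inv_a + f*inv_c
    tmp_b = e*inv_b + f*inv_d
    tmp_c = g*inv_a + h*inv_c
    tmp_d = g*inv_b + h*inv_d
    # Final dot product written into the preallocated buffer
    out[0] = tmp_a*Z[0] + tmp_b*Z[1]
    out[1] = tmp_c*Z[0] + tmp_d*Z[1]

# Preallocate the buffer once in the caller and reuse it across calls.
result = np.empty(2, dtype=np.float64)
fast_mean_out(Sigma, Z, result)
print(result)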