I am trying to find the fastest way to sum 2 matrices of the same size using Numba. I came up with 3 different approaches but none of them could beat Numpy. Here is my code:
import numpy as np
from numba import njit,vectorize, prange,float64
import timeit
import time
# function 1:
def sum_numpy(A,B):
return A B
# function 2:
sum_numba_simple= njit(cache=True,fastmath=True) (sum_numpy)
# function 3:
@vectorize([float64(float64, float64)])
def sum_numba_vectorized(A,B):
return A B
# function 4:
@njit('(float64[:,:],float64[:,:])', cache=True, fastmath=True, parallel=True)
def sum_numba_loop(A,B):
n=A.shape[0]
m=A.shape[1]
C = np.empty((n, m), A.dtype)
for i in prange(n):
for j in prange(m):
C[i,j]=A[i,j] B[i,j]
return C
#Test the functions with 2 matrices of size 1,000,000x3:
N=1000000
np.random.seed(123)
A=np.random.uniform(low=-10, high=10, size=(N,3))
B=np.random.uniform(low=-5, high=5, size=(N,3))
t1=min(timeit.repeat(stmt='sum_numpy(A,B)',timer=time.perf_counter,repeat=3, number=100,globals=globals()))
t2=min(timeit.repeat(stmt='sum_numba_simple(A,B)',timer=time.perf_counter,repeat=3, number=100,globals=globals()))
t3=min(timeit.repeat(stmt='sum_numba_vectorized(A,B)',timer=time.perf_counter,repeat=3, number=100,globals=globals()))
t4=min(timeit.repeat(stmt='sum_numba_loop(A,B)',timer=time.perf_counter,repeat=3, number=100,globals=globals()))
print("function 1 (sum_numpy): t1= ",t1,"\n")
print("function 2 (sum_numba_simple): t2= ",t2,"\n")
print("function 3 (sum_numba_vectorized): t3= ",t3,"\n")
print("function 4 (sum_numba_loop): t4= ",t4,"\n")
Here are the results:
function 1 (sum_numpy): t1= 0.1655790419999903
function 2 (sum_numba_simple): t2= 0.3019776669998464
function 3 (sum_numba_vectorized): t3= 0.16486266700030683
function 4 (sum_numba_loop): t4= 0.1862256660001549
As you can see, the results show that there isn't any advantage in using Numba in this case. Therefore, my question is:
Is there any other implementation that would increase the speed of the summation ?
CodePudding user response:
Your code is bound by page-faults (see here, here and there for more information about this). Page-faults happens because the array is newly allocated. A solution is to preallocate it and then write within it so to no cause pages to be remapped in physical memory. np.add(A, B, out=C)
does this as indicated by @August in the comments. Another solution could be to adapt the standard allocator so not to give the memory back to the OS at the expense of a significant memory usage overhead (AFAIK TC-Malloc can do that for example).
There is another issue on most platforms (especially x86 ones): the cache-line write allocations of write-back caches are expensive during writes. The typical solution to avoid this is to do non-temporal store (if available on the target processor, which is the case on x86-64 one but maybe not others). That being said, neither Numpy nor Numba are able to do that yet. For Numba, I filled an issue covering a simple use-case. Compilers themselves (GCC for Numpy and Clang for Numba) tends not to generate such instructions because they can be detrimental in performance when arrays fit in cache and compilers do not know the size of the array at compile time (they could generate a specific code when they can evaluate the amount of data computed but this is not easy and can slow-down some other codes). AFAIK, the only possible way to fix this is to write a C code and use low-level instructions or to use compiler directives. In your case, about 25% of the bandwidth is lost due to this effect, causing a slowdown up to 33%.
Using multiple threads do not always make memory-bound code faster. In fact, it generally barely scale because using more core do not speed up the execution when the RAM is already saturated. Few cores are generally required so to saturate the RAM on most platforms. Page faults can benefit from using multiple cores regarding the target system (Linux does that in parallel quite well, Windows generally does not scale well, IDK for MacOS).
Finally, there is another issue: the code is not vectorized (at least not on my machine while it can be). On solution is to flatten the array view and do one big loop that the compiler can more easily vectorize (the j-based loop is too small for SIMD instructions to be effective). The contiguity of the input array should also be specified for the compiler to generate a fast SIMD code. Here is the resulting Numba code:
@njit('(float64[:,::1], float64[:,::1], float64[:,::1])', cache=True, fastmath=True, parallel=True)
def sum_numba_fast_loop(A, B, C):
n, m = A.shape
assert C.shape == A.shape
A_flat = A.reshape(n*m)
B_flat = B.reshape(n*m)
C_flat = C.reshape(n*m)
for i in prange(n*m):
C_flat[i]=A_flat[i] B_flat[i]
return C
Here are results on my 6-core i5-9600KF processor with a ~42 GiB/s RAM:
sum_numpy: 0.642 s 13.9 GiB/s
sum_numba_simple: 0.851 s 10.5 GiB/s
sum_numba_vectorized: 0.639 s 14.0 GiB/s
sum_numba_loop serial: 0.759 s 11.8 GiB/s
sum_numba_loop parallel: 0.472 s 18.9 GiB/s
Numpy "np.add(A, B, out=C)": 0.281 s 31.8 GiB/s <----
Numba fast: 0.288 s 31.0 GiB/s <----
Optimal time: 0.209 s 32.0 GiB/s
The Numba code and the Numpy one saturate my RAM. Using more core does not help (in fact it is a bit slower certainly due to the contention of the memory controller). Both are sub-optimal since they do not use non-temporal store instructions that can prevent cache-line write allocations (causing data to be fetched from the RAM before being written back). The optimal time is the one expected using such instruction. Note that it is expected to reach only 65-80% of the RAM bandwidth because of RAM mixed read/writes. Indeed, interleaving reads and writes cause low-level overheads preventing the RAM to be saturated. For more information about how RAM works, please consider reading Introduction to High Performance Scientific Computing -- Chapter 1.3 and What Every Programmer Should Know About Memory (and possibly this).