I have 2 arrays, one is of size:
A = np.random.uniform(size=(48, 1000000, 2))
and the other is
B = np.random.uniform(size=(48))
I want to do the following summation:
np.einsum("i, ijk -> jk", B, A)
as fast as possible.
The summation needs to be done tens of millions of times, so speed matters. At each iteration B changes while A stays the same. I have tried two options (see below), but they perform comparably. Is there a way to speed it up?
Code:
import numpy as np

def computation_einsum(x, y):
    z = np.einsum("i, ijk -> jk", y, x)
    return z

def computation_dot(x, y):
    z = y @ x
    return z

A = np.random.uniform(size=(48, 1000000, 2))
B = np.random.uniform(size=(48))
C = A.transpose(1, 0, 2)
Timings:
%timeit -n 10 computation_einsum(A, B)
100 ms ± 238 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit -n 10 computation_dot(C, B)
107 ms ± 2.11 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
I know that for larger matrices there are options which would make a difference, but I have these specific shapes and sizes.
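For completeness, the same contraction can also be written as a plain 2-D matrix-vector product over a reshaped view of A; a sketch, assuming A stays C-contiguous so the reshape is a view rather than a copy:
# Sketch: express the contraction as a 2-D matrix-vector product.
# A.reshape(48, -1) is a view because A is C-contiguous, so only the
# (48,) x (48, 2000000) product is actually computed.
def computation_flat(x, y):
    out = y @ x.reshape(x.shape[0], -1)          # shape (2000000,)
    return out.reshape(x.shape[1], x.shape[2])   # shape (1000000, 2)

# np.allclose(computation_flat(A, B), computation_einsum(A, B)) should be True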
CodePudding user response:
We can improve matmul by giving it arrays that it can pass 'intact' to the compiled code. A transpose reorders the shape and strides, but does not change the underlying data. A copy takes time, but may be worth it if you are reusing A many times.
In [213]: C = A.transpose(1,0,2)
In [214]: D = C.copy()
An alternate transpose, putting the size-48 axis last:
In [215]: E = A.transpose(1,2,0).copy()
Timings:
In [217]: timeit B@C
486 ms ± 347 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [218]: timeit B@D
272 ms ± 14.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [219]: timeit E@B
155 ms ± 1.83 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
I'm a little surprised that this einsum is faster, since I thought it passed the calculation on to matmul in simple cases:
In [223]: timeit np.einsum('jki,i->jk', E,B)
127 ms ± 1.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
And for reference, on my machine:
In [228]: timeit np.einsum('i,ijk->jk', B,A)
361 ms ± 639 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
It's worth playing with the optimize parameter. The effect seems to vary with the numpy version, so I can't predict when it will help, but here:
In [229]: timeit np.einsum('i,ijk->jk', B,A, optimize=True)
195 ms ± 3.74 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
It doesn't help in the E case:
In [232]: timeit np.einsum('jki,i->jk', E,B,optimize=True)
129 ms ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
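Since A stays the same while B changes on every iteration, the one-time cost of building a contiguous copy is amortized over millions of calls; a rough sketch of how this could be used (n_iterations and next_weights are placeholders, not from the question):
# One-time preparation: put the size-48 axis last and make it contiguous.
E = A.transpose(1, 2, 0).copy()              # shape (1000000, 2, 48)

for _ in range(n_iterations):                # placeholder loop count
    B = next_weights()                       # hypothetical update of B
    result = np.einsum('jki,i->jk', E, B)    # fastest variant above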
CodePudding user response:
By default, np.einsum does not optimize the computation, since the cost of searching for a better contraction strategy can sometimes be bigger than the computation itself (especially for small arrays). You can solve this using:
np.einsum("i, ijk -> jk", B, A, optimize='optimal')
With this, np.einsum internally uses a BLAS implementation (OpenBLAS by default on most systems, including mine), which is far more efficient: it should run in parallel and use faster SIMD instructions (especially AVX instead of SSE on x86).
This operation is mostly memory bound, and the above solution reaches a memory throughput of ~30 GiB/s, while the peak is about ~40 GiB/s. This is very good (for generic NumPy code) but sub-optimal.
Note that the solution of @MichaelSzczesny gives similar performance results (with 64-bit floats).
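Since the contraction is repeated tens of millions of times with unchanged shapes, it may also be worth computing the contraction path once with np.einsum_path and reusing it on every call, so the path search is not redone each time; a minimal sketch (whether this helps depends on the NumPy version):
# Compute the contraction path once (A's shape never changes).
path, info = np.einsum_path("i, ijk -> jk", B, A, optimize='optimal')

# Reuse the precomputed path on every call instead of re-optimizing.
result = np.einsum("i, ijk -> jk", B, A, optimize=path)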
Defeating OpenBLAS
While OpenBLAS is sub-optimal here, writing faster code is far from easy, as such libraries are aggressively optimized to reach high performance (AFAIK, they use architecture-specific kernels and even some hand-written assembly). That being said, they are written for the generic case and assume the computation is mostly compute bound, which is not the case here.
A solution to defeat OpenBLAS is to write parallel Numba code tailored to your specific case, where B has a small fixed size, the last dimension of A is always 2, and A is relatively big. Numba is a Python module that can compile specific Python code to native binaries using a just-in-time compiler.
The main idea is to virtually split A into blocks along the second dimension and then perform a cache-friendly weighted sum along the first dimension (which is not contiguously stored in memory). The code uses assertions and compile-time constants to help the compiler unroll and vectorize the loops aggressively. The output array is passed as a parameter to the function so as not to pay the cost of expensive page faults. Here is the final code:
import numpy as np
import numba as nb

block_size = 1024
B_size = 48

@nb.njit(fastmath=True, inline='never')
def compute_block(A, B, out, iStart, iEnd, complete):
    n = A.shape[1]

    # Check everything is Ok and help the compiler
    # to optimize the loops more aggressively.
    assert A.shape[0] == B_size
    assert A.shape[2] == 2
    assert out.shape[1] == 2
    assert B.size == B_size
    assert out.shape[0] == n

    for i in range(iStart, iEnd):
        out[i, 0] = out[i, 1] = 0.0

    for j in range(B_size):
        factor = B[j]

        # Help the compiler to perform (better) loop unrolling
        if complete:
            for i in range(iStart, iStart + block_size):
                out[i, 0] += A[j, i, 0] * factor
                out[i, 1] += A[j, i, 1] * factor
        else:
            for i in range(iStart, iEnd):
                out[i, 0] += A[j, i, 0] * factor
                out[i, 1] += A[j, i, 1] * factor

@nb.njit('void(float64[:,:,::1], float64[::1], float64[:,::1])', parallel=True)
def compute(A, B, out):
    n = A.shape[1]

    # Parallel computation of many blocks
    for bi in nb.prange(n // block_size):
        iStart = bi * block_size
        iEnd = iStart + block_size
        compute_block(A, B, out, iStart, iEnd, True)

    # Remaining part
    bi = n // block_size
    iStart = bi * block_size
    if iStart < n:
        compute_block(A, B, out, iStart, n, False)

# Example of use:
A = np.random.uniform(size=(48, 1000000, 2))
B = np.random.uniform(size=(48))
C = np.empty((A.shape[1], 2), dtype=np.float64)
compute(A, B, C)
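A quick way to sanity-check the Numba kernel against the reference einsum (a sketch, not part of the original code):
# Compare the Numba output with the reference NumPy result.
reference = np.einsum("i, ijk -> jk", B, A)
print(np.allclose(C, reference))    # expected: True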
Benchmark
Here are results on my i5-9600KF processor with ~40 GiB/s DRAM:
np.einsum("i, ijk -> jk", B, A): 54.4 ms
np.einsum("i, ijk -> jk", B, A, optimize='optimal'): 25.2 ms
compute(A, B, C) 19.5 ms
Theoretical optimal time (lower bound): 18.6 ms
The final implementation reaches a memory throughput of 38.2 GiB/s, which is very close to the optimum on my machine.
Note that, as pointed out by @MichaelSzczesny, using 32-bit floats gives roughly 2x faster timings, since 32-bit floats take half the space in memory and the operation is memory bound. 32-bit floats are less precise, though.
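If the reduced precision is acceptable, a minimal sketch of the float32 variant (the one-time astype conversion is an extra cost, and the Numba signature above would also need float32 types):
# One-time conversion: halves the memory traffic of every later call.
A32 = A.astype(np.float32)
B32 = B.astype(np.float32)

# Same contraction in 32-bit; roughly 2x faster is expected since the
# operation is memory bound.
result32 = np.einsum("i, ijk -> jk", B32, A32, optimize='optimal')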