I have 2 arrays, one is of size:
A = np.random.uniform(size=(48, 1000000, 2))
and the other is
B = np.random.uniform(size=(48))
I want to do the following summation:
np.einsum("i, ijk -> jk", B, A)
as fast as possible.
The summation needs to be done tens of millions of times, so speed matters. At each iteration B changes while A stays the same. I have tried two options (see below), but they perform comparably. Is there a way to speed it up?
Code:
import numpy as np

def computation_einsum(x, y):
    z = np.einsum("i, ijk -> jk", y, x)
    return z

def computation_dot(x, y):
    z = y @ x
    return z

A = np.random.uniform(size=(48, 1000000, 2))
B = np.random.uniform(size=(48))
C = A.transpose(1, 0, 2)
Timings:
%timeit -n 10 computation_einsum(A, B)
100 ms ± 238 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit -n 10 computation_dot(C, B)
107 ms ± 2.11 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
I know that for larger matrices there are options which would make a difference, but I have these specific shapes and sizes.
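For completeness, the same contraction can also be written as a plain 2-D matrix-vector product over a reshaped view of A; a sketch, assuming A stays C-contiguous so the reshape is a view rather than a copy:
# Sketch: express the contraction as a 2-D matrix-vector product.
# A.reshape(48, -1) is a view because A is C-contiguous, so only the
# (48,) x (48, 2000000) product is actually computed.
def computation_flat(x, y):
    out = y @ x.reshape(x.shape[0], -1)          # shape (2000000,)
    return out.reshape(x.shape[1], x.shape[2])   # shape (1000000, 2)

# np.allclose(computation_flat(A, B), computation_einsum(A, B)) should be True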
CodePudding user response:
We can improve matmul by giving it arrays that it can pass 'intact' to the compiled code. A transpose reorders the shape and strides, but does not change the underlying data. A copy takes time, but may be worth it if you are reusing A many times.
In [213]: C = A.transpose(1,0,2)
In [214]: D = C.copy()
An alternate transpose, putting the size-48 axis last:
In [215]: E = A.transpose(1,2,0).copy()
Timings:
In [217]: timeit B@C
486 ms ± 347 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [218]: timeit B@D
272 ms ± 14.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [219]: timeit E@B
155 ms ± 1.83 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
I'm a little surprised that this einsum is faster, since I thought it passed the calculation on to matmul in simple cases:
In [223]: timeit np.einsum('jki,i->jk', E,B)
127 ms ± 1.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
And for reference, on my machine:
In [228]: timeit np.einsum('i,ijk->jk', B,A)
361 ms ± 639 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
It's worth playing with the optimize parameter. The effect seems to vary with the numpy version, so I can't predict when it will help, but here:
In [229]: timeit np.einsum('i,ijk->jk', B,A, optimize=True)
195 ms ± 3.74 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
It doesn't help in the E case:
In [232]: timeit np.einsum('jki,i->jk', E,B,optimize=True)
129 ms ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
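Since A stays the same while B changes on every iteration, the one-time cost of building a contiguous copy is amortized over millions of calls; a rough sketch of how this could be used (n_iterations and next_weights are placeholders, not from the question):
# One-time preparation: put the size-48 axis last and make it contiguous.
E = A.transpose(1, 2, 0).copy()              # shape (1000000, 2, 48)

for _ in range(n_iterations):                # placeholder loop count
    B = next_weights()                       # hypothetical update of B
    result = np.einsum('jki,i->jk', E, B)    # fastest variant above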
CodePudding user response:
By default, np.einsum does not optimize the computation, since the cost of searching for a better contraction strategy can sometimes be bigger than the computation itself (especially for small arrays). You can solve this using:
np.einsum("i, ijk -> jk", B, A, optimize='optimal')
With this, np.einsum internally uses a BLAS implementation (OpenBLAS by default on most systems, including mine), which is far more efficient: it should run in parallel and use faster SIMD instructions (especially AVX instead of SSE on x86).
This operation is mostly memory bound, and the above solution reaches a memory throughput of ~30 GiB/s, while the peak is about ~40 GiB/s. This is very good (for generic NumPy code) but sub-optimal.
Note that the solution of @MichaelSzczesny gives similar performance results (with 64-bit floats).
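Since the contraction is repeated tens of millions of times with unchanged shapes, it may also be worth computing the contraction path once with np.einsum_path and reusing it on every call, so the path search is not redone each time; a minimal sketch (whether this helps depends on the NumPy version):
# Compute the contraction path once (A's shape never changes).
path, info = np.einsum_path("i, ijk -> jk", B, A, optimize='optimal')

# Reuse the precomputed path on every call instead of re-optimizing.
result = np.einsum("i, ijk -> jk", B, A, optimize=path)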
Defeating OpenBLAS
While OpenBLAS is sub-optimal here, writing faster code is far from easy, as such libraries are aggressively optimized to reach high performance (AFAIK, they use architecture-specific kernels and even some hand-written assembly). That being said, they are written for the generic case and assume the computation is mostly compute bound, which is not the case here.
A solution to defeat OpenBLAS is to write parallel Numba code tailored to your specific case, where B has a small fixed size, the last dimension of A is always 2, and A is relatively big. Numba is a Python module that can compile specific Python code to native binaries using a just-in-time compiler.
The main idea is to virtually split A into blocks along the second dimension and then perform a cache-friendly weighted sum along the first dimension (which is not contiguously stored in memory). The code uses assertions and compile-time constants to help the compiler unroll and vectorize the loops aggressively. The output array is passed as a parameter to the function so as not to pay the cost of expensive page faults. Here is the final code:
import numpy as np
import numba as nb

block_size = 1024
B_size = 48

@nb.njit(fastmath=True, inline='never')
def compute_block(A, B, out, iStart, iEnd, complete):
    n = A.shape[1]

    # Check everything is Ok and help the compiler
    # to optimize the loops more aggressively.
    assert A.shape[0] == B_size
    assert A.shape[2] == 2
    assert out.shape[1] == 2
    assert B.size == B_size
    assert out.shape[0] == n

    for i in range(iStart, iEnd):
        out[i, 0] = out[i, 1] = 0.0

    for j in range(B_size):
        factor = B[j]

        # Help the compiler to perform (better) loop unrolling
        if complete:
            for i in range(iStart, iStart + block_size):
                out[i, 0] += A[j, i, 0] * factor
                out[i, 1] += A[j, i, 1] * factor
        else:
            for i in range(iStart, iEnd):
                out[i, 0] += A[j, i, 0] * factor
                out[i, 1] += A[j, i, 1] * factor

@nb.njit('void(float64[:,:,::1], float64[::1], float64[:,::1])', parallel=True)
def compute(A, B, out):
    n = A.shape[1]

    # Parallel computation of many blocks
    for bi in nb.prange(n // block_size):
        iStart = bi * block_size
        iEnd = iStart + block_size
        compute_block(A, B, out, iStart, iEnd, True)

    # Remaining part
    bi = n // block_size
    iStart = bi * block_size
    if iStart < n:
        compute_block(A, B, out, iStart, n, False)

# Example of use:
A = np.random.uniform(size=(48, 1000000, 2))
B = np.random.uniform(size=(48))
C = np.empty((A.shape[1], 2), dtype=np.float64)
compute(A, B, C)
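A quick way to sanity-check the Numba kernel against the reference einsum (a sketch, not part of the original code):
# Compare the Numba output with the reference NumPy result.
reference = np.einsum("i, ijk -> jk", B, A)
print(np.allclose(C, reference))    # expected: True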
Benchmark
Here are results on my i5-9600KF processor with ~40 GiB/s DRAM:
np.einsum("i, ijk -> jk", B, A): 54.4 ms
np.einsum("i, ijk -> jk", B, A, optimize='optimal'): 25.2 ms
compute(A, B, C) 19.5 ms
Theoretical optimal time (lower bound): 18.6 ms
The final implementation reaches a memory throughput of 38.2 GiB/s, which is very close to the optimum on my machine.
Note that, as pointed out by @MichaelSzczesny, using 32-bit floats gives roughly 2x faster timings, since 32-bit floats take half the space in memory and the operation is memory bound. 32-bit floats are less precise, though.
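If the reduced precision is acceptable, a minimal sketch of the float32 variant (the one-time astype conversion is an extra cost, and the Numba signature above would also need float32 types):
# One-time conversion: halves the memory traffic of every later call.
A32 = A.astype(np.float32)
B32 = B.astype(np.float32)

# Same contraction in 32-bit; roughly 2x faster is expected since the
# operation is memory bound.
result32 = np.einsum("i, ijk -> jk", B32, A32, optimize='optimal')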