Translating np.einsum to something more performant


Using python/numpy, I have the following np.einsum:

np.einsum('abde,abc->bcde', X, Y)

Y is sparse: for each [a, b], exactly one c has value 1 and all the others are 0. To give a sense of the relative sizes of the axes, X.shape is on the order of (1000, 5, 30, 30), and correspondingly Y.shape is (1000, 5, 300).

This operation is extremely costly, and I want to make it more performant. For one thing, einsum is not parallelized. For another, because Y is sparse, I'm effectively computing 300x as many multiplications as I need to. In fact, when I wrote the equivalent of this einsum using a loop over n, I got a speed-up of around 3x. But that's clearly not very good.

How should I approach making this more performant? I've tried using np.tensordot, but I could not figure out how to get what I want from it (and I still run into the sparse/dense problem).
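To be concrete, here is a slow reference formulation of what the einsum computes (just to pin down the semantics; this is not the faster loop mentioned above): out[b, c, d, e] is the sum over a of X[a, b, d, e] * Y[a, b, c], so the zero entries of Y can simply be skipped.

import numpy as np

# Explicit-loop reference for np.einsum('abde,abc->bcde', X, Y); a sparse-aware
# implementation would skip the zero entries of Y, as done here.
def einsum_reference(X, Y):
    A, B, D, E = X.shape
    C = Y.shape[2]
    out = np.zeros((B, C, D, E), dtype=np.result_type(X, Y))
    for a in range(A):
        for b in range(B):
            for c in range(C):
                if Y[a, b, c] != 0:
                    out[b, c] += X[a, b] * Y[a, b, c]
    return out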

CodePudding user response:

You can do that pretty easily with Numba:

import numpy as np
import numba

# Eagerly compile for C-contiguous float64 arrays; parallelize over the b axis.
@numba.njit('float64[:,:,:,::1](float64[:,:,:,::1], float64[:,:,::1])', fastmath=True, parallel=True)
def compute(x, y):
    na, nb, nd, ne = x.shape
    nc = y.shape[2]
    assert y.shape == (na, nb, nc)

    out = np.zeros((nb, nc, nd, ne))

    for b in numba.prange(nb):
        for a in range(na):
            for c in range(nc):
                yVal = y[a, b, c]
                # Skip the two inner loops entirely when y[a, b, c] is zero (Y is sparse)
                if np.abs(yVal) != 0:
                    for d in range(nd):
                        for e in range(ne):
                            out[b, c, d, e] += x[a, b, d, e] * yVal

    return out

Note that it is faster to iterate over a and then b in a sequential code. That being said, for the code to be parallel, the loops have been swapped and the parallelization is performed over b (which is a small axis). A parallel reduction over the a axis would be more efficient, but this is unfortunately not easy to do with Numba (one needs to split the matrices into multiple chunks since there is no simple way to create thread-local matrices). A rough sketch of this chunking workaround is shown below.
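The following sketch only illustrates the chunking idea under the assumptions stated in the comments; it is not a tuned implementation. The a axis is split into n_chunks pieces, each chunk accumulates into its own partial output, and the partial outputs are summed at the end.

import numpy as np
import numba

# Sketch of a parallel reduction over the a axis via per-chunk partial outputs.
# n_chunks is an assumed tuning parameter (e.g. the number of cores); the partial
# array costs n_chunks times the memory of the final output.
@numba.njit(fastmath=True, parallel=True)
def compute_partials(x, y, n_chunks):
    na, nb, nd, ne = x.shape
    nc = y.shape[2]
    partial = np.zeros((n_chunks, nb, nc, nd, ne))
    chunk = (na + n_chunks - 1) // n_chunks
    for t in numba.prange(n_chunks):
        for a in range(t * chunk, min((t + 1) * chunk, na)):
            for b in range(nb):
                for c in range(nc):
                    yVal = y[a, b, c]
                    if yVal != 0.0:
                        for d in range(nd):
                            for e in range(ne):
                                partial[t, b, c, d, e] += x[a, b, d, e] * yVal
    return partial

# Final reduction over the chunks (with x and y as in the testing code below):
# res = compute_partials(x, y, 10).sum(axis=0)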

Note that you can replace values like nd and ne by the actual value (i.e. 30) so that the compiler generates faster code specialized for this matrix size, for example:
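Here is a sketch of that size-specialized variant (it assumes the (1000, 5, 30, 30) and (1000, 5, 300) shapes from the question; adjust the literals for other inputs):

@numba.njit('float64[:,:,:,::1](float64[:,:,:,::1], float64[:,:,::1])', fastmath=True, parallel=True)
def compute_fixed(x, y):
    na = x.shape[0]
    out = np.zeros((5, 300, 30, 30))
    for b in numba.prange(5):
        for a in range(na):
            for c in range(300):
                yVal = y[a, b, c]
                if yVal != 0.0:
                    for d in range(30):        # literal bounds let the compiler unroll/vectorize
                        for e in range(30):
                            out[b, c, d, e] += x[a, b, d, e] * yVal
    return out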

Here is the testing code:

np.random.seed(0)
x = np.random.rand(1000, 5, 30, 30)
y = np.random.rand(1000, 5, 300)
y[np.random.rand(*y.shape) > 0.1] = 0.0 # Make it sparse (90% of 0)

%time res = np.einsum('abde,abc->bcde', x, y)  # 2.350 s
%time res2 = compute(x, y)                     # 0.074 s (0.061 s with hand-written sizes)
print(np.allclose(res, res2))

This is about 32 times faster on a 10-core Intel Skylake Xeon processor, and it reaches a 38x speed-up with hand-written sizes. It does not scale very well due to the parallelization over the b axis, but parallelizing over another axis would cause less efficient memory accesses.

If this is not enough, it may be a good idea to transpose x and y first so as to improve data locality (thanks to a more contiguous access pattern along the a axis) and to get better scaling (by parallelizing over both the b and c axes). That being said, transpositions are generally expensive, so one certainly needs to optimize them to get an even better speed-up. A rough sketch of this transposed approach is shown below.
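This sketch only illustrates the idea (not a tuned implementation): x is rearranged to (b, d, e, a) and y to (b, c, a) so the reduction axis a is contiguous, and the work is parallelized over the fused (b, c) axis; note that it trades the explicit zero-skip for contiguous reads.

@numba.njit('float64[:,:,:,::1](float64[:,:,:,::1], float64[:,:,::1])', fastmath=True, parallel=True)
def compute_transposed(xT, yT):
    nb, nd, ne, na = xT.shape
    nc = yT.shape[1]
    out = np.zeros((nb, nc, nd, ne))
    for bc in numba.prange(nb * nc):   # parallelize over the fused (b, c) axis
        b = bc // nc
        c = bc % nc
        for d in range(nd):
            for e in range(ne):
                s = 0.0
                for a in range(na):    # contiguous along a in both xT and yT
                    s += xT[b, d, e, a] * yT[b, c, a]
                out[b, c, d, e] = s
    return out

xT = np.ascontiguousarray(x.transpose(1, 2, 3, 0))  # (b, d, e, a)
yT = np.ascontiguousarray(y.transpose(1, 2, 0))     # (b, c, a)
res3 = compute_transposed(xT, yT)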

CodePudding user response:

If Y only contains 1 and 0 then the einsum basically does this:

result = np.zeros(Y.shape[1:] + X.shape[2:], X.dtype)
I, J, K = np.nonzero(Y)
result[J, K] += X[I, J]

But this doesn't give the correct result due to duplicate j, k indices. I couldn't get numpy.add.at to work, but a loop over just these indices is still pretty fast, at least for the given shapes and sparsity.

result = np.zeros(Y.shape[1:] + X.shape[2:], X.dtype)
for i, j, k in zip(*np.nonzero(Y)):
    result[j, k] += X[i, j]

This is the test code that I used:

import numpy as np

a, b, c, d, e = 1000, 5, 300, 30, 30
X = np.random.randint(10, size=(a, b, d, e))
R = np.random.rand(a, b, c)
K = np.argmax(R, axis=2)               # pick one "hot" c for each (a, b)
I, J = np.indices((a, b), sparse=True)
Y = np.zeros((a, b, c), int)
Y[I, J, K] = 1                          # Y is one-hot along the c axis
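A quick sanity check (using the X and Y from the test code above) is to compare the loop-based result against the original einsum:

result = np.zeros(Y.shape[1:] + X.shape[2:], X.dtype)
for i, j, k in zip(*np.nonzero(Y)):
    result[j, k] += X[i, j]
print(np.allclose(result, np.einsum('abde,abc->bcde', X, Y)))  # should print True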