I have an odd case where numpy.einsum speeds up a computation, but einsum_path shows no corresponding gain. I'd like to quantify and explain this speed-up, but I'm missing something somewhere...
In short, I have a matrix multiplication where only the diagonal of the final product is needed.
import numpy as np

a = np.arange(9).reshape(3, 3)
print('input array')
print(a)
print('normal method')
print(np.diag(a.dot(a)))
print('einsum method')
print(np.einsum('ij,ji->i', a, a))
which produces the output:
input array
[[0 1 2]
 [3 4 5]
 [6 7 8]]
normal method
[ 15  54 111]
einsum method
[ 15  54 111]
When run on large matrices, numpy.einsum is substantially faster.
A = np.random.randn(2000, 300)
B = np.random.randn(300, 2000)
print('normal method')
%timeit np.diag(A.dot(B))
print('einsum method')
%timeit np.einsum('ij,ji->i', A, B)
which produces:
normal method
17.2 ms ± 131 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
einsum method
1.02 ms ± 7.82 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
My intuition is that this speed-up is possible because numpy.einsum can skip the computations that taking the diagonal would discard anyway - but, if I'm reading it correctly, the output of numpy.einsum_path shows no speed-up at all.
print(np.einsum_path('ij,ji->i', A, B, optimize=True)[1])
  Complete contraction:  ij,ji->i
         Naive scaling:  2
     Optimized scaling:  2
      Naive FLOP count:  1.200e+06
  Optimized FLOP count:  1.200e+06
   Theoretical speedup:  1.000
  Largest intermediate:  2.000e+03 elements
--------------------------------------------------------------------------
scaling                  current                                remaining
--------------------------------------------------------------------------
   2                     ji,ij->i                                     i->i
Questions
- Why do I see a practical speed-up that isn't reflected in the contraction path?
- Is there a way to quantify the speed-up of the 'ij,ji->i' contraction in numpy.einsum?
CodePudding user response:
That path report only looks at alternative contraction orders, which come into play with more than 2 arguments. With just 2 arguments there is nothing to reorder, so that analysis does nothing. For contrast, here is a minimal sketch (with made-up shapes) of a 3-operand contraction where the order does matter, so einsum_path reports a real theoretical speedup:
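# Sketch with hypothetical shapes: contracting 'ij,jk' first keeps the
# intermediate at (10,10) instead of (1000,1000), so the order matters.
X = np.random.randn(10, 1000)
Y = np.random.randn(1000, 10)
Z = np.random.randn(10, 1000)
print(np.einsum_path('ij,jk,kl->il', X, Y, Z, optimize='optimal')[1])

Your diag(dot) is: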
In [113]: np.diag(a.dot(a))
Out[113]: array([ 15, 54, 111])
The equivalent using einsum is:
In [115]: np.einsum('ii->i',np.einsum('ij,jk->ik',a,a))
Out[115]: array([ 15, 54, 111])
But we can skip the intermediate step:
In [116]: np.einsum('ij,ji->i',a,a)
Out[116]: array([ 15, 54, 111])
The indexed notation is flexible enough that it doesn't need to go through the full dot calculation.
Another way to get the same result is elementwise multiplication: since (a*a.T)[i,j] equals a[i,j]*a[j,i], summing over the second axis gives the diagonal:
In [117]: (a*a.T).sum(axis=1)
Out[117]: array([ 15, 54, 111])
With matmul we can do the calculation without the diag, treating the first dimension as a 'batch', but it requires some reshaping first:
In [121]: a[:,None,:]@a.T[:,:,None]
Out[121]:
array([[[ 15]],
[[ 54]],
[[111]]])
In [122]: np.squeeze(a[:,None,:]@a.T[:,:,None])
Out[122]: array([ 15, 54, 111])
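All of these produce the same values; a quick sanity check (sketch, reusing your A and B):

# Sanity check (sketch): the four formulations agree on the diagonal.
ref = np.diag(A.dot(B))
assert np.allclose(ref, np.einsum('ij,ji->i', A, B))
assert np.allclose(ref, (A * B.T).sum(axis=1))
assert np.allclose(ref, np.squeeze(A[:, None, :] @ B.T[:, :, None]))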
My times:
normal method
135 ms ± 3.87 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
einsum method
3.09 ms ± 46.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [130]: timeit (A*B.T).sum(axis=1)
8.06 ms ± 78.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [131]: timeit np.squeeze(A[:,None,:]@B.T[:,:,None])
3.52 ms ± 195 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
dot creates a full (2000,2000) array, and then diag extracts just 2000 elements from it.
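That waste can be quantified with a rough FLOP count (a sketch, assuming one multiply and one add per element pair):

# Rough FLOP accounting (sketch): the full product computes every entry;
# the diagonal-only contraction computes just the 2000 needed sums.
m, k = 2000, 300
full_dot  = 2 * m * k * m    # all (2000,2000) entries: ~2.4e9 FLOPs
diag_only = 2 * m * k        # 2000 length-300 dot products: ~1.2e6 FLOPs
print(full_dot // diag_only) # 2000x less arithmetic for the diagonal

# einsum_path reports these counts too, but only within each signature;
# it never compares 'ij,jk->ik' against 'ij,ji->i':
print(np.einsum_path('ij,jk->ik', A, B, optimize=True)[1])
print(np.einsum_path('ij,ji->i', A, B, optimize=True)[1])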
In terms of elementwise multiplication, the dot is:
In [136]: (a[:,:,None]*a[None,:,:]).sum(axis=1)
Out[136]:
array([[ 15, 18, 21],
[ 42, 54, 66],
[ 69, 90, 111]])
With A and B, the broadcast intermediate would be (2000,300,2000), which is summed down to (2000,2000). The einsum (effectively) does 2000 dot products on length-300 vector pairs, each reduced to a scalar.
The einsum is closer to this calculation than the diag/dot, treating the size-2000 dimension as a 'batch' of 1d dot calculations:
In [140]: timeit np.array([A[i,:].dot(B[:,i]) for i in range(2000)])
9.46 ms ± 270 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)