I'm using NumPy (and ideally Numba) to perform a tensor contraction involving three tensors, one of which is a vector that should multiply only a single index of the other two. For example:
A = np.random.normal(size=(20,20,20,20))
B = np.random.normal(size=(20,20,20,20))
v = np.sqrt(np.arange(20))
# e.g. v applied to the 3rd index (axis 2)
>>> %timeit np.vdot(A * v[None, None, :, None], B)
125 µs ± 5.14 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
Compare with multiplying by a full tensor:
C = np.random.normal(size=(20,20,20,20))
>>> %timeit np.vdot(A * C, B)
76.8 µs ± 4.25 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
Is there a more efficient way of including the product with v? It feels wrong that it should be slower than multiplying by the full tensor C.
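For what it's worth, the whole contraction can also be written as a single einsum call; a minimal sketch of the equivalent formulation (whether it helps presumably depends on the contraction path, so optimize=True is passed to let einsum pick one):
# sum over all i,j,k,l of A[i,j,k,l] * v[k] * B[i,j,k,l]
res = np.einsum('ijkl,k,ijkl->', A, v, B, optimize=True)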
CodePudding user response:
I could squeeze out some performance by using Numba with parallel=True:
import numba as nb
import numpy as np
N = 50
@nb.njit('float64(float64[:,:,:,:], float64[:,:,:,:], float64[:])', parallel=True)
def dotABv(a, b, vv):
    res = 0.0
    for i in nb.prange(a.shape[0]):  # parallel reduction over the first axis
        for j in range(a.shape[1]):
            for k in range(a.shape[2]):
                # accumulate vv[k] times the dot product of the trailing axes
                res += vv[k] * np.dot(a[i, j, k, :], b[i, j, k, :])
    return res
v = np.sqrt(np.arange(N))
A = np.random.normal(size=(N,N,N,N))
B = np.random.normal(size=(N,N,N,N))
C = np.random.normal(size=(N,N,N,N))
%timeit dotABv(A, B, v)
%timeit np.vdot(A, B)  # just to compare with the plain dot product
%timeit np.vdot(A * v[None, None, :, None], B)
# Output:
# dotABv:                  501 µs ± 1.42 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# np.vdot(A, B):           1.1 ms ± 111 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# np.vdot(A * v[...], B):  14.7 ms ± 239 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
print(dotABv(A, B, v), np.vdot(A * v[None, None, :, None], B))
# 5105.504508154087 5105.5045081541075  (the two agree to floating-point precision)
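On the pure-NumPy side, note that A * v[None, None, :, None] allocates a fresh temporary the size of A on every call, which is likely part of the gap. A minimal sketch that reuses a preallocated buffer instead (buf is a name introduced here for illustration):
buf = np.empty_like(A)                           # scratch array, allocated once up front
np.multiply(A, v[None, None, :, None], out=buf)  # broadcasted product written in place
print(np.vdot(buf, B))                           # same contraction, without the per-call allocation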