I have written the following code, which does what I need, but I want to optimize it.
import numpy as np

N_rs = 100
N_thetas = 30

Nlm_container = np.ones((N_thetas, 2*N_thetas - 1), dtype=np.complex128)
lpmv_container = np.ones((N_thetas, 2*N_thetas - 1, N_thetas), dtype=np.complex128)
Psi_m_res1 = np.zeros((2*N_thetas - 1, N_rs - 1, N_thetas), dtype=np.complex128)
Psi_m_res2 = np.zeros((2*N_thetas - 1, N_rs - 1, N_thetas), dtype=np.complex128)
Psi_m_res3 = np.zeros((2*N_thetas - 1, N_rs - 1, N_thetas), dtype=np.complex128)
Psi_lm = np.ones((N_rs - 1, 2*N_thetas - 1, N_thetas), dtype=np.complex128)
for m in range(-N_thetas + 1, N_thetas, 1):
    for j in range(N_rs - 1):
        for k in range(N_thetas):
            # Reference result: sum over l = |m| .. N_thetas-1 for each (m, j, k).
            Psi_m_res1[m + (N_thetas-1), j, k] = np.sum( Psi_lm[j, m + (N_thetas-1), np.abs(m):] * Nlm_container[np.abs(m):, m + (N_thetas-1)] * lpmv_container[np.abs(m):, m + (N_thetas-1), k] )
I want to avoid the triply nested for loop.
I thought that it would be beneficial to use slicing with :, rather than the loop over j and the loop over k. I thus wrote (as a first step):
for m in range(-N_thetas + 1, N_thetas, 1):
    for k in range(N_thetas):
        Psi_m_res2[m + (N_thetas-1), :, k] = np.sum( Psi_lm[:, m + (N_thetas-1), np.abs(m):] * Nlm_container[np.abs(m):, m + (N_thetas-1)] * lpmv_container[np.abs(m):, m + (N_thetas-1), k] )
Still, this does not produce what I want:
print(np.testing.assert_allclose(Psi_m_res1, Psi_m_res2))
Mismatched elements: 175230 / 175230 (100%)
Max absolute difference: 2940.
Max relative difference: 0.98989899
Even worse, when I also eliminate the for loop over the k index, the program does not compute anything at all, as the shapes mismatch:
for m in range(-N_thetas + 1, N_thetas, 1):
    Psi_m_res3[m + (N_thetas-1), :, :] = np.sum( Psi_lm[:, m + (N_thetas-1), np.abs(m):] * Nlm_container[np.abs(m):, m + (N_thetas-1)] * lpmv_container[np.abs(m):, m + (N_thetas-1), :] )
ValueError: operands could not be broadcast together with shapes (99,2) (2,30)
I could create threads, send pieces of Psi_lm, Nlm_container and lpmv_container to them, have them perform the work, and then reunite the results (see the sketch below), but I thought maybe numpy magic would do better.
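For illustration, a rough sketch of what I had in mind (work_on_m is just a name I picked; each m writes to a disjoint slice of Psi_m_res1, so there are no races), though I suspect the Python-level inner loops would still dominate the runtime:

from concurrent.futures import ThreadPoolExecutor

def work_on_m(m):
    # Same computation as the triple loop above, restricted to one value of m.
    idx = m + (N_thetas - 1)
    for j in range(N_rs - 1):
        for k in range(N_thetas):
            Psi_m_res1[idx, j, k] = np.sum(
                Psi_lm[j, idx, np.abs(m):]
                * Nlm_container[np.abs(m):, idx]
                * lpmv_container[np.abs(m):, idx, k]
            )

with ThreadPoolExecutor() as pool:
    # Consume the iterator so the work runs and any exception surfaces.
    list(pool.map(work_on_m, range(-N_thetas + 1, N_thetas)))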
In practice, N_rs and N_thetas would be higher: they can go up to N_rs = 1600 and N_thetas = 100, both being set before the code starts to run and fixed throughout its runtime.
Is there a way for this to work using numpy magic?
Thank you!
Answer:
You can vectorize the two inner loops using np.einsum to get much faster (and simpler) code. Your second attempt failed because np.sum without an axis argument reduces over all remaining axes at once, collapsing everything to a single scalar; einsum instead specifies the contraction explicitly. Here is the code:
for m in range(-N_thetas + 1, N_thetas, 1):
    m_abs = np.abs(m)
    idx = m + (N_thetas - 1)  # map m from [-N_thetas+1, N_thetas-1] to a non-negative index
    Psi_m_res1[idx] = np.einsum('ji,i,ik->jk', Psi_lm[:, idx, m_abs:], Nlm_container[m_abs:, idx], lpmv_container[m_abs:, idx, :], optimize='optimal')
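As a quick sanity check, you can write the einsum result into a separate array (Psi_m_res1_einsum is just a name I picked) and compare it against the result of your original triple loop:

Psi_m_res1_einsum = np.zeros_like(Psi_m_res1)
for m in range(-N_thetas + 1, N_thetas, 1):
    m_abs = np.abs(m)
    idx = m + (N_thetas - 1)
    Psi_m_res1_einsum[idx] = np.einsum('ji,i,ik->jk',
                                       Psi_lm[:, idx, m_abs:],
                                       Nlm_container[m_abs:, idx],
                                       lpmv_container[m_abs:, idx, :],
                                       optimize='optimal')

# Assuming Psi_m_res1 still holds the triple-loop result, this should pass.
np.testing.assert_allclose(Psi_m_res1_einsum, Psi_m_res1)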
The einsum version takes 4.7 ms on my machine instead of 956 ms, i.e. it is about 200 times faster! Vectorizing the outer loop efficiently is very hard, if not impossible, because the sliced arrays have a different size for each m. Note that np.einsum is smart enough to use a matrix multiplication internally in this case (effectively Psi_lm[:, idx, m_abs:] @ (Nlm_container[m_abs:, idx, None] * lpmv_container[m_abs:, idx, :])). If you want even faster code, then I advise you to use multi-threaded Numba code, as sketched below.
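For reference, a minimal sketch of what such a Numba version could look like (compute_psi_m is my own name; this parallelizes over m with prange, and each m writes a disjoint slice of the output):

from numba import njit, prange

@njit(parallel=True)
def compute_psi_m(Psi_lm, Nlm_container, lpmv_container, out, N_thetas):
    # Parallel over m; each iteration writes a disjoint slice of `out`.
    for m in prange(-N_thetas + 1, N_thetas):
        idx = m + (N_thetas - 1)
        m_abs = abs(m)
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                s = 0j
                for l in range(m_abs, N_thetas):
                    s += Psi_lm[j, idx, l] * Nlm_container[l, idx] * lpmv_container[l, idx, k]
                out[idx, j, k] = s

compute_psi_m(Psi_lm, Nlm_container, lpmv_container, Psi_m_res1, N_thetas)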