I have code which heavily uses np.einsum calls.
I have installed intel-numpy in a Python virtual environment using:
$ pip install mkl
$ pip install intel-numpy
np.show_config() prints:
blas_mkl_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/home/sat_bot/base/conda-bld/numpy_and_dev_1643279478844/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/home/sat_bot/base/conda-bld/numpy_and_dev_1643279478844/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/include']
blas_opt_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/home/sat_bot/base/conda-bld/numpy_and_dev_1643279478844/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/home/sat_bot/base/conda-bld/numpy_and_dev_1643279478844/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/include']
lapack_mkl_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/home/sat_bot/base/conda-bld/numpy_and_dev_1643279478844/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/home/sat_bot/base/conda-bld/numpy_and_dev_1643279478844/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/include']
lapack_opt_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/home/sat_bot/base/conda-bld/numpy_and_dev_1643279478844/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/home/sat_bot/base/conda-bld/numpy_and_dev_1643279478844/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/include']
Supported SIMD extensions in this NumPy install:
baseline = SSE,SSE2,SSE3
found = SSSE3,SSE41,POPCNT,SSE42,AVX,F16C,FMA3,AVX2
not found = AVX512F,AVX512CD,AVX512_KNL,AVX512_KNM,AVX512_SKX,AVX512_CLX,AVX512_CNL,AVX512_ICL
I have an Intel chip with 16 logical processors (8 cores) in 1 socket. I want to make sure that the NumPy calls in my Python code use all the cores on my machine. What I have tried:
Inside the WSL shell I use to run the Python scripts, I wrote
export MKL_NUM_THREADS=8
then ran my script normally as python script.py.
Inside script.py, before importing numpy, I wrote:
import os
os.environ["MKL_NUM_THREADS"] = '8'  # or 16
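A minimal version of that setup, with an optional sanity check using the threadpoolctl package (an extra pip install; the check just reports how many threads the loaded MKL says it will use), would look like this:
import os
# must be set before numpy (and therefore MKL) is first imported
os.environ["MKL_NUM_THREADS"] = "8"   # or "16"
import numpy as np

from threadpoolctl import threadpool_info   # optional: pip install threadpoolctl
for pool in threadpool_info():
    print(pool["internal_api"], pool["num_threads"])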
Neither of these attempts resulted in visible usage of all the cores when I monitor them in the terminal using htop.
There is always one core at 100% utilization, and from time to time (when inside an np.einsum call) a few other cores (usually two or three, seemingly chosen at random from the ones other than the constantly busy one) rise to about 3% utilization for a few seconds, then drop back to 0%.
Does anyone have any idea whether this is normal behavior, or am I doing something wrong?
Thank you!
EDIT TO INCLUDE THE ACTUAL np.einsum CALLS (MWE)
import numpy as np

N_thetas = 20
N_rs = 600
N_phis = 18
Legendre_matr = np.random.rand(N_thetas, 2*N_thetas-1, N_thetas)
after_first_int = np.random.rand(N_rs - 1, 2*N_thetas - 1, N_thetas)
eigenvectors_memory = np.random.rand(N_rs - 1, N_rs - 1, N_thetas)
normaliz = np.random.rand(N_rs - 1, )
Cilmoft = np.random.rand(2*N_thetas-1, N_rs-1, N_thetas)
Psi_m = np.random.rand( 2*N_thetas - 1, N_rs - 1, N_thetas )
arange_for_m_values = np.arange(-N_thetas + 1, N_thetas)
phis = np.linspace(0.0, 2*np.pi, N_phis)
after_second_int = np.einsum('ijk,rjk->rij', Legendre_matr, after_first_int, optimize=True)
Psi_lm = np.einsum('jkl,j,mkl->jml', eigenvectors_memory, normaliz , Cilmoft, optimize=True)
Psi_on_grid_from_Cilmoft = np.einsum('xjk,xn->jkn', Psi_m, np.exp(1j * np.outer(arange_for_m_values, phis)), optimize=True )
CodePudding user response:
In [3]: Legendre_matr.shape, after_first_int.shape
Out[3]: ((20, 39, 20), (599, 39, 20))
Your first einsum:
In [4]: np.einsum('ijk,rjk->rij', Legendre_matr, after_first_int).shape
Out[4]: (599, 20, 39)
Timing: optimize doesn't help (it's a bit slower because of the extra time taken to evaluate a path):
In [5]: timeit np.einsum('ijk,rjk->rij', Legendre_matr, after_first_int)
14.1 ms ± 13.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [6]: timeit np.einsum('ijk,rjk->rij', Legendre_matr, after_first_int, optimize=True)
14.3 ms ± 13 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
My reading is that the optimize parameter matters when there are more than 2 arguments and the order of evaluation matters. There's nothing in the docs about whether BLAS is used or not.
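For contrast, with the three-operand contraction from the question the path machinery does have an ordering decision to make; a quick way to see the chosen sequence (reusing the arrays defined in the question) is:
# Three operands: einsum_path reports the pairwise contractions it settles on
# (the cheap scaling by the 1-D `normaliz` can be folded in before the
# expensive contraction).
print(np.einsum_path('jkl,j,mkl->jml',
                     eigenvectors_memory, normaliz, Cilmoft,
                     optimize='optimal')[1])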
In [9]: print(np.einsum_path('ijk,rjk->rij', Legendre_matr, after_first_int)[1])
Complete contraction: ijk,rjk->rij
Naive scaling: 4
Optimized scaling: 4
Naive FLOP count: 1.869e+07
Optimized FLOP count: 1.869e+07
Theoretical speedup: 1.000
Largest intermediate: 4.672e+05 elements
--------------------------------------------------------------------------
scaling current remaining
--------------------------------------------------------------------------
4 rjk,ijk->rij rij->rij
matmul
By adding some dimensions, I can do the same thing with matmul:
In [19]: after_first_int[:,None,:,None].shape,Legendre_matr[None,:,:,:,None].shape
Out[19]: ((599, 1, 39, 1, 20), (1, 20, 39, 20, 1))
In [21]: (after_first_int[:,None,:,None]@Legendre_matr[None,:,:,:,None]).squeeze().shape
Out[21]: (599, 20, 39)
time is essentially the same:
In [22]: timeit (after_first_int[:,None,:,None]@Legendre_matr[None,:,:,:,None]).squeeze().shape
14.6 ms ± 16.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Values match as well as shape:
In [27]: x1=np.einsum('ijk,rjk->rij', Legendre_matr, after_first_int)
In [28]: x2=(after_first_int[:,None,:,None]@Legendre_matr[None,:,:,:,None]).squeeze()
In [29]: np.allclose(x1,x2)
Out[29]: True
I'm not in a position to evaluate the use of cores (I only have 2 cores).
third einsum
This one has the sum-of-products axis first, which requires a transpose to use matmul:
In [56]: x1 = np.einsum('xjk,xn->jkn', Psi_m, np.exp(1j * np.outer(arange_for_m_values, phis)))
...:
In [57]: x2=(Psi_m.transpose(1,2,0)@np.exp(1j * np.outer(arange_for_m_values, phis)))
In [58]: np.allclose(x1,x2)
Out[58]: True
timings:
In [59]: timeit x2=(Psi_m.transpose(1,2,0)@np.exp(1j * np.outer(arange_for_m_values, phis)))
13.6 ms ± 16.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [60]: timeit np.einsum('xjk,xn->jkn', Psi_m, np.exp(1j * np.outer(arange_for_m_values, phis)
...: ), optimize=True)
9.37 ms ± 17.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [61]: timeit np.einsum('xjk,xn->jkn', Psi_m, np.exp(1j * np.outer(arange_for_m_values, phis)
...: ), optimize=False)
133 ms ± 3.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Interesting that the optimize=True einsum case is faster than the matmul. The optimize=False einsum case is significantly slower, indicating a much slower evaluation route (not using BLAS at all). The einsum_path output tells us nothing, since there are only two arguments.
Years ago I contributed a patch to einsum, enabling correct use of the ellipsis. At that time all calculations went through nditer in cython code. Since then it has gone through several patches, adding for example the optimize parameter and some sort of link to matmul. The default value for optimize has flip-flopped a few times.
Timings such as what I've done give us some idea of what's happening, but it's not the same as actually studying the code (a mix of python and compiled).
CodePudding user response:
As suggested earlier, one way is to break the work into chunks and use einsumt (a standard-library sketch of that chunking idea is shown after the 1st einsum below). But I still feel that you can dispatch into multiple dot products and squeeze out some more performance using Numba.
import numpy as np
N_thetas = 20
N_rs = 600
N_phis = 18
Legendre_matr = np.random.rand(N_thetas, 2*N_thetas-1, N_thetas)
after_first_int = np.random.rand(N_rs - 1, 2*N_thetas - 1, N_thetas)
eigenvectors_memory = np.random.rand(N_rs - 1, N_rs - 1, N_thetas)
normaliz = np.random.rand(N_rs - 1, )
Cilmoft = np.random.rand(2*N_thetas-1, N_rs-1, N_thetas)
Psi_m = np.random.rand( 2*N_thetas - 1, N_rs - 1, N_thetas )
arange_for_m_values = np.arange(-N_thetas + 1, N_thetas)
phis = np.linspace(0.0, 2*np.pi, N_phis)
1st einsum:
from numba import njit, prange
@njit(parallel=True, cache=True, nogil=True)
def first_einsum(a,b):
c= np.zeros((b.shape[0],a.shape[0],a.shape[1]))
for r in prange(b.shape[0]):
for i in prange(a.shape[0]):
for j in prange(a.shape[1]):
c[r,i,j] = np.dot(a[i,j,:],b[r,j,:])
return c
%timeit np.einsum('ijk,rjk->rij', Legendre_matr, after_first_int, optimize=True)
# 30 ms ± 1.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit first_einsum(Legendre_matr, after_first_int)
# 11.2 ms ± 544 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Clearly I could squeeze some performance out of it.
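As an aside, the chunking idea mentioned at the top can be sketched with just the standard library: split the r axis into blocks and contract each block in a thread pool (this is roughly what einsumt automates; the chunk/thread counts below are illustrative, and the benefit depends on how much of the einsum kernel releases the GIL).
from concurrent.futures import ThreadPoolExecutor

def chunked_first_einsum(a, b, n_chunks=8):
    # blocks along the 'r' axis are independent, so they can run concurrently
    blocks = np.array_split(b, n_chunks, axis=0)
    with ThreadPoolExecutor(max_workers=n_chunks) as ex:
        parts = list(ex.map(lambda blk: np.einsum('ijk,rjk->rij', a, blk, optimize=True),
                            blocks))
    return np.concatenate(parts, axis=0)

# same values as the single call
ref = np.einsum('ijk,rjk->rij', Legendre_matr, after_first_int, optimize=True)
assert np.allclose(chunked_first_einsum(Legendre_matr, after_first_int), ref)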
2nd einsum:
You can write the second einsum in terms of the first_einsum function:
def second_einsum(a,b,c):
temp_arr = np.einsum('jkl,j->jlk',a,b) # 'jkl,j->jlk'
return first_einsum(np.ascontiguousarray(c.transpose(0,2,1)),temp_arr) # 'mlk,jlk->jml'
%timeit np.einsum('jkl,j,mkl->jml', eigenvectors_memory, normaliz , Cilmoft, optimize=True)
# 676 ms ± 13.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit second_einsum(eigenvectors_memory, normaliz , Cilmoft)
# 287 ms ± 10.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
3rd einsum
When optimize=True, it dispatches to BLAS, so it is already well optimized.
np.einsum('xjk,xn->jkn', Psi_m, np.exp(1j * np.outer(arange_for_m_values, phis)), optimize=True )
%timeit np.einsum('xjk,xn->jkn', Psi_m, np.exp(1j * np.outer(arange_for_m_values, phis)), optimize=True )
# 16.3 ms ± 754 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
To disprove @hpaulj's statement that BLAS is slower than einsum, you can also write the above einsum as matrix multiplications:
(np.exp(1j * np.outer(arange_for_m_values, phis)).T @ Psi_m.reshape(Psi_m.shape[0],-1)).reshape(-1,Psi_m.shape[1],Psi_m.shape[2]).transpose(1,2,0)
%timeit (np.exp(1j * np.outer(arange_for_m_values, phis)).T @ Psi_m.reshape(Psi_m.shape[0],-1)).reshape(-1,Psi_m.shape[1],Psi_m.shape[2]).transpose(1,2,0)
# 15.1 ms ± 151 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Clearly BLAS WINS by a small margin, due to the overhead of einsum when optimize=True.
How do I know if einsum dispatches to BLAS when optimize=True?
In [37]: np.einsum_path('xjk,xn->jkn', Psi_m, np.exp(1j * np.outer(arange_for_m_values, phis)))
Out[37]:
(['einsum_path', (0, 1)],
' Complete contraction: xjk,xn->jkn\n Naive scaling: 4\n Optimized scaling: 4\n Naive FLOP count: 1.682e+07\n Optimized FLOP count: 1.682e+07\n Theoretical speedup: 1.000\n Largest intermediate: 2.156e+05 elements\n--------------------------------------------------------------------------\nscaling current remaining\n--------------------------------------------------------------------------\n 4 xn,xjk->jkn jkn->jkn')
In [38]: np.einsum_path('xjk,xn->jkn', Psi_m, np.exp(1j * np.outer(arange_for_m_values, phis)),optimize=True)
Out[38]:
(['einsum_path', (0, 1)],
' Complete contraction: xjk,xn->jkn\n Naive scaling: 4\n Optimized scaling: 4\n Naive FLOP count: 1.682e+07\n Optimized FLOP count: 1.682e+07\n Theoretical speedup: 1.000\n Largest intermediate: 2.156e+05 elements\n--------------------------------------------------------------------------\nscaling current remaining\n--------------------------------------------------------------------------\n 4 xn,xjk->jkn jkn->jkn')
Clearly einsum_path doesn't give you any information about whether it is calling BLAS.
The evaluation order in einsum is mostly taken from the opt_einsum library, and that library can tell you whether the contraction uses BLAS:
In [45]: import opt_einsum as oe
In [46]: path_info = oe.contract_path('xjk,xn->jkn', Psi_m, np.exp(1j * np.outer(arange_for_m_values, phis)))
In [47]: print(path_info)
([(0, 1)], Complete contraction: xjk,xn->jkn
Naive scaling: 4
Optimized scaling: 4
Naive FLOP count: 1.682e+7
Optimized FLOP count: 1.682e+7
Theoretical speedup: 1.000e+0
Largest intermediate: 2.156e+5 elements
--------------------------------------------------------------------------------
scaling BLAS current remaining
--------------------------------------------------------------------------------
4 GEMM xn,xjk->jkn jkn->jkn)
And you see a GEMM call. It is very likely that einsum does the same, as the implementation is more or less borrowed from opt_einsum.
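If you want to confirm the threading from within Python rather than from htop, one way (assuming the threadpoolctl package is installed) is to time the call with the BLAS thread pool capped at a single thread and compare against the default; a clear gap suggests multi-threaded BLAS is being used:
import time
from threadpoolctl import threadpool_limits

coeff = np.exp(1j * np.outer(arange_for_m_values, phis))

t0 = time.perf_counter()
np.einsum('xjk,xn->jkn', Psi_m, coeff, optimize=True)
t_default = time.perf_counter() - t0

with threadpool_limits(limits=1, user_api='blas'):   # cap MKL/BLAS to one thread
    t0 = time.perf_counter()
    np.einsum('xjk,xn->jkn', Psi_m, coeff, optimize=True)
    t_single = time.perf_counter() - t0

print(f"default threads: {t_default:.4f} s, single BLAS thread: {t_single:.4f} s")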