usage of all cores in numpy einsum


I have code which heavily uses np.einsum calls. I have installed intel-numpy in a Python virtual environment using:

$ pip install mkl
$ pip install intel-numpy 

np.show_config() prints:

blas_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/home/sat_bot/base/conda-bld/numpy_and_dev_1643279478844/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/sat_bot/base/conda-bld/numpy_and_dev_1643279478844/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/include']
blas_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/home/sat_bot/base/conda-bld/numpy_and_dev_1643279478844/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/sat_bot/base/conda-bld/numpy_and_dev_1643279478844/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/include']
lapack_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/home/sat_bot/base/conda-bld/numpy_and_dev_1643279478844/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/sat_bot/base/conda-bld/numpy_and_dev_1643279478844/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/include']
lapack_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/home/sat_bot/base/conda-bld/numpy_and_dev_1643279478844/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/sat_bot/base/conda-bld/numpy_and_dev_1643279478844/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/include']
Supported SIMD extensions in this NumPy install:
    baseline = SSE,SSE2,SSE3
    found = SSSE3,SSE41,POPCNT,SSE42,AVX,F16C,FMA3,AVX2
    not found = AVX512F,AVX512CD,AVX512_KNL,AVX512_KNM,AVX512_SKX,AVX512_CLX,AVX512_CNL,AVX512_ICL

I have an Intel chip with 16 logical processors (8 physical cores) in 1 socket. I want to make sure that the Python code's numpy calls use all the cores on my machine. What I have tried:

  • inside the WSL shell I use to run the Python scripts, I ran export MKL_NUM_THREADS=8 and then ran my script normally as python script.py.

  • inside script.py, before importing numpy, I set the environment variable (a fuller sketch of this approach follows this list):

    import os
    os.environ["MKL_NUM_THREADS"] = '8'  # or 16
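
For reference, here is a minimal sketch of what I mean; the threadpoolctl check at the end is optional and assumes that package is installed:

import os
# thread caps must be set before numpy (and hence MKL) is first imported
os.environ["MKL_NUM_THREADS"] = "8"
os.environ["OMP_NUM_THREADS"] = "8"   # generic fallback that MKL also respects

import numpy as np

# optional verification: which BLAS library is loaded and how many threads it may use
try:
    from threadpoolctl import threadpool_info
    for pool in threadpool_info():
        print(pool["internal_api"], pool.get("num_threads"))
except ImportError:
    pass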

Neither of these attempts resulted in visible usage of all the cores when I monitor them in the terminal with htop.

There is always one core at 100% utilization, and from time to time (when inside an np.einsum call) a few other cores (usually two or three, seemingly picked at random from the rest) rise to about 3% utilization for a few seconds, then drop back to 0%.

Does anyone have any idea whether this is normal behavior, or am I doing something wrong?

Thank you!

EDIT TO INCLUDE THE ACTUAL np.einsum CALLS (MWE)

N_thetas = 20
N_rs = 600
N_phis = 18

Legendre_matr = np.random.rand(N_thetas,  2*N_thetas-1,  N_thetas)
after_first_int = np.random.rand(N_rs - 1, 2*N_thetas - 1, N_thetas)
eigenvectors_memory = np.random.rand(N_rs - 1, N_rs - 1, N_thetas)
normaliz = np.random.rand(N_rs - 1, )
Cilmoft = np.random.rand(2*N_thetas-1, N_rs-1, N_thetas)
Psi_m = np.random.rand(2*N_thetas - 1, N_rs - 1, N_thetas)
arange_for_m_values = np.arange(-N_thetas + 1, N_thetas)
phis = np.linspace(0.0, 2*np.pi, N_phis)    

after_second_int = np.einsum('ijk,rjk->rij', Legendre_matr, after_first_int, optimize=True)


Psi_lm = np.einsum('jkl,j,mkl->jml', eigenvectors_memory, normaliz , Cilmoft, optimize=True) 

Psi_on_grid_from_Cilmoft = np.einsum('xjk,xn->jkn',   Psi_m,  np.exp(1j * np.outer(arange_for_m_values, phis)), optimize=True )

CodePudding user response:

In [3]: Legendre_matr.shape, after_first_int.shape
Out[3]: ((20, 39, 20), (599, 39, 20))

Your first einsum:

In [4]: np.einsum('ijk,rjk->rij', Legendre_matr, after_first_int).shape
Out[4]: (599, 20, 39)

Timing: optimize doesn't help here (it is a bit slower because of the extra time taken to evaluate a path):

In [5]: timeit np.einsum('ijk,rjk->rij', Legendre_matr, after_first_int)
14.1 ms ± 13.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [6]: timeit np.einsum('ijk,rjk->rij', Legendre_matr, after_first_int, optimize=True)
14.3 ms ± 13 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

My reading is that the optimize parameter matters when there are more than 2 arguments and the order of evaluation matters. There's nothing in the docs about whether BLAS is used or not.

In [9]: print(np.einsum_path('ijk,rjk->rij', Legendre_matr, after_first_int)[1])
  Complete contraction:  ijk,rjk->rij
         Naive scaling:  4
     Optimized scaling:  4
      Naive FLOP count:  1.869e+07
  Optimized FLOP count:  1.869e+07
   Theoretical speedup:  1.000
  Largest intermediate:  4.672e+05 elements
--------------------------------------------------------------------------
scaling                  current                                remaining
--------------------------------------------------------------------------
   4                rjk,ijk->rij                                rij->rij
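
Where optimize can actually matter is the three-operand call from the question. A quick way to see what contraction order it would pick (output not reproduced here):

print(np.einsum_path('jkl,j,mkl->jml',
                     eigenvectors_memory, normaliz, Cilmoft,
                     optimize='optimal')[1])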

matmul

By adding some dimensions, I can do the same thing with matmul:

In [19]: after_first_int[:,None,:,None].shape,Legendre_matr[None,:,:,:,None].shape
Out[19]: ((599, 1, 39, 1, 20), (1, 20, 39, 20, 1))
In [21]: (after_first_int[:,None,:,None]@Legendre_matr[None,:,:,:,None]).squeeze().shape
Out[21]: (599, 20, 39)

time is essentially the same:

In [22]: timeit (after_first_int[:,None,:,None]@Legendre_matr[None,:,:,:,None]).squeeze().shape
14.6 ms ± 16.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Values match as well as shape:

In [27]: x1=np.einsum('ijk,rjk->rij', Legendre_matr, after_first_int)
In [28]: x2=(after_first_int[:,None,:,None]@Legendre_matr[None,:,:,:,None]).squeeze()
In [29]: np.allclose(x1,x2)
Out[29]: True

I'm not in a position to evaluate the use of cores (I only have 2 cores).

third einsum

This one has the summed (sum-of-products) axis first. That requires a transpose to use matmul:

In [56]: x1 = np.einsum('xjk,xn->jkn',   Psi_m,  np.exp(1j * np.outer(arange_for_m_values, phis)))
    ...: 
In [57]: x2=(Psi_m.transpose(1,2,0)@np.exp(1j * np.outer(arange_for_m_values, phis)))
In [58]: np.allclose(x1,x2)
Out[58]: True

timings:

In [59]: timeit x2=(Psi_m.transpose(1,2,0)@np.exp(1j * np.outer(arange_for_m_values, phis)))
13.6 ms ± 16.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [60]: timeit np.einsum('xjk,xn->jkn',   Psi_m,  np.exp(1j * np.outer(arange_for_m_values, phis)
    ...: ), optimize=True)
9.37 ms ± 17.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [61]: timeit np.einsum('xjk,xn->jkn',   Psi_m,  np.exp(1j * np.outer(arange_for_m_values, phis)
    ...: ), optimize=False)
133 ms ± 3.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Interesting that the optimize=True einsum case is faster than the matmul. The optimize=False case is significantly slower, indicating a much slower evaluation route (not using BLAS at all). einsum_path tells us nothing here since there are only two arguments.

Years ago I contributed a patch to einsum, enabling correct use of the ellipsis. At that time all calculations went through nditer in cython code. Since then it has gone through several patches, adding for example the optimize parameter and some sort of link to matmul. The default value for optimize has flip-flopped a few times.

Timings like these give us some idea of what's happening, but they aren't the same as actually studying the code (a mix of Python and compiled code).

CodePudding user response:

As suggested earlier, one way is to break the work into chunks and use einsumt (a short sketch follows the setup code below). But I still feel that you can dispatch to multiple dot products and squeeze out some more performance using Numba too.

import numpy as np

N_thetas = 20
N_rs = 600
N_phis = 18

Legendre_matr = np.random.rand(N_thetas,  2*N_thetas-1,  N_thetas)
after_first_int = np.random.rand(N_rs - 1, 2*N_thetas - 1, N_thetas)
eigenvectors_memory = np.random.rand(N_rs - 1, N_rs - 1, N_thetas)
normaliz = np.random.rand(N_rs - 1, )
Cilmoft = np.random.rand(2*N_thetas-1, N_rs-1, N_thetas)
Psi_m = np.random.rand( 2*N_thetas - 1,    N_rs - 1,    N_thetas )
arange_for_m_values = np.arange(-N_thetas + 1, N_thetas)
phis = np.linspace(0.0, 2*np.pi, N_phis) 
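
For the chunked einsumt route mentioned above, a minimal sketch — this assumes the einsumt package is installed and that its einsumt() function is a drop-in, multi-threaded replacement for np.einsum, as its documentation describes (not benchmarked here):

from einsumt import einsumt

# same subscripts as the first call in the question; the work is split over a thread pool
after_second_int_mt = einsumt('ijk,rjk->rij', Legendre_matr, after_first_int)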

1st einsum:

from numba import njit, prange

@njit(parallel=True, cache=True, nogil=True)
def first_einsum(a, b):
    # c[r, i, j] = sum_k a[i, j, k] * b[r, j, k]  -- equivalent to 'ijk,rjk->rij'
    c = np.zeros((b.shape[0], a.shape[0], a.shape[1]))
    for r in prange(b.shape[0]):
        for i in prange(a.shape[0]):
            for j in prange(a.shape[1]):
                c[r, i, j] = np.dot(a[i, j, :], b[r, j, :])
    return c

%timeit np.einsum('ijk,rjk->rij', Legendre_matr, after_first_int, optimize=True)
# 30 ms ± 1.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit first_einsum(Legendre_matr, after_first_int)
# 11.2 ms ± 544 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Clearly I could squeeze some performance out of it.

2nd einsum:

You can write the second einsum in terms of the first_einsum function:


def second_einsum(a,b,c):
    temp_arr = np.einsum('jkl,j->jlk',a,b) # 'jkl,j->jlk'
    return first_einsum(np.ascontiguousarray(c.transpose(0,2,1)),temp_arr) # 'mlk,jlk->jml'
   
%timeit np.einsum('jkl,j,mkl->jml', eigenvectors_memory, normaliz , Cilmoft, optimize=True)
# 676 ms ± 13.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit second_einsum(eigenvectors_memory, normaliz , Cilmoft)
# 287 ms ± 10.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
 

3rd einsum

When optimize=True it dispatches to BLAS, so it is already well optimized.

np.einsum('xjk,xn->jkn',   Psi_m,  np.exp(1j * np.outer(arange_for_m_values, phis)), optimize=True )
%timeit np.einsum('xjk,xn->jkn',   Psi_m,  np.exp(1j * np.outer(arange_for_m_values, phis)), optimize=True )

# 16.3 ms ± 754 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

To disprove @hpaulj's statement that BLAS is slower than einsum, you can also write the above einsum as a matrix multiplication:

(np.exp(1j * np.outer(arange_for_m_values, phis)).T @ Psi_m.reshape(Psi_m.shape[0],-1)).reshape(-1,Psi_m.shape[1],Psi_m.shape[2]).transpose(1,2,0)

%timeit (np.exp(1j * np.outer(arange_for_m_values, phis)).T @ Psi_m.reshape(Psi_m.shape[0],-1)).reshape(-1,Psi_m.shape[1],Psi_m.shape[2]).transpose(1,2,0)

# 15.1 ms ± 151 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Clearly BLAS wins by a small margin, due to the overhead of einsum when optimize=True.

How do I know if einsum dispatches to BLAS when optimize=True?

In [37]: np.einsum_path('xjk,xn->jkn',   Psi_m,  np.exp(1j * np.outer(arange_for_m_values, phis)))
Out[37]:
(['einsum_path', (0, 1)],
 '  Complete contraction:  xjk,xn->jkn\n         Naive scaling:  4\n     Optimized scaling:  4\n      Naive FLOP count:  1.682e+07\n  Optimized FLOP count:  1.682e+07\n   Theoretical speedup:  1.000\n  Largest intermediate:  2.156e+05 elements\n--------------------------------------------------------------------------\nscaling                  current                                remaining\n--------------------------------------------------------------------------\n   4                 xn,xjk->jkn                                 jkn->jkn')

In [38]: np.einsum_path('xjk,xn->jkn',   Psi_m,  np.exp(1j * np.outer(arange_for_m_values, phis)),optimize=True)
Out[38]:
(['einsum_path', (0, 1)],
 '  Complete contraction:  xjk,xn->jkn\n         Naive scaling:  4\n     Optimized scaling:  4\n      Naive FLOP count:  1.682e+07\n  Optimized FLOP count:  1.682e+07\n   Theoretical speedup:  1.000\n  Largest intermediate:  2.156e+05 elements\n--------------------------------------------------------------------------\nscaling                  current                                remaining\n--------------------------------------------------------------------------\n   4                 xn,xjk->jkn                                 jkn->jkn')

So einsum_path doesn't give you any information about whether it is calling BLAS.

The evaluation order in einsum is mostly taken from the opt_einsum library, and opt_einsum can tell you whether einsum is actually using BLAS:

In [45]: import opt_einsum as oe

In [46]: path_info = oe.contract_path('xjk,xn->jkn',   Psi_m,  np.exp(1j * np.outer(arange_for_m_values, phis)))

In [47]: print(path_info)
([(0, 1)],   Complete contraction:  xjk,xn->jkn
         Naive scaling:  4
     Optimized scaling:  4
      Naive FLOP count:  1.682e+7
  Optimized FLOP count:  1.682e+7
   Theoretical speedup:  1.000e+0
  Largest intermediate:  2.156e+5 elements
--------------------------------------------------------------------------------
scaling        BLAS                current                             remaining
--------------------------------------------------------------------------------
   4           GEMM            xn,xjk->jkn                              jkn->jkn)

And you see a GEMM call. It is very likely that einsum does the same, since the implementation is more or less borrowed from opt_einsum.
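
If you want to be sure the GEMM route is taken, you can also evaluate the contraction with opt_einsum itself; oe.contract has the same call signature as np.einsum:

import opt_einsum as oe

# evaluates the contraction along the BLAS-backed path reported by contract_path
result = oe.contract('xjk,xn->jkn', Psi_m,
                     np.exp(1j * np.outer(arange_for_m_values, phis)))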
