Fastest way to do loop over 2D arrays in Cython-CodePudding

I am trying to loop over 2 2d arrays in Cython. The arrays have the following shape: ranges_1 is a 6000x3 array of int64, while ranges_2 is a 2000x2 of int64. This iteration needs to be perfomed around 10000 times. This would mean that the total number of calculations inside the nested for loop would be around 2000x6000x10000 = 120 billions times.

This is the code I am using to generate the "dummy" data:

import numpy as np
ranges_1 = np.stack([np.random.randint(0, 10_000, 6_000), np.random.randint(0, 10_000, 6_000), np.arange(0, 6_000)], axis=1)
ranges_2 = np.stack([np.random.randint(0, 10_000, 2_000), np.random.randint(0, 10_000, 2_000)], axis=1)

Which gives 2 arrays like these:

array([[6131, 1478,    0],
       [9317, 7263,    1],
       [7938, 6249,    2],
       ...,
       [5153,  426, 5997],
       [9164, 9211, 5998],
       [1695, 1792, 5999]])

and:

array([[ 433,  558],
       [3420, 2494],
       [6367, 7916],
       ...,
       [8693, 1692],
       [1256, 9013],
       [4096, 1860]])

The first implementation I tried is a "naive" version and it's the following (the function inside is just a test function which uses all the data in the array):

import numpy as np
cimport numpy as np
cimport cython
ctypedef np.int_t DTYPE_t

def test_func(np.ndarray[DTYPE_t, ndim = 2] ranges_1, np.ndarray[DTYPE_t, ndim=2] ranges_2,  int n ):
    k = 0 
    for i in range(n):

        for j in range(len(ranges_1)):
            r1 = ranges_1[j]
            a = r1[0]
            b = r1[1]
            c = r1[2]

            for f in range(len(ranges_2)):
                r2 = ranges_2[f]
                d = r2[0]
                e = r2[1]

                k = (a   b   c   d   e)/(d e)
            
    return k

This takes about 5 seconds per each of the 10_000 outside loops. So I then tried flatteing out the arrays and, since I know the dimension on the other axis, accessing the items like this:

import numpy as np
cimport numpy as np
cimport cython
ctypedef np.int_t DTYPE_t

def test_func_flattened(np.ndarray[DTYPE_t, ndim = 1] ranges_1_, np.ndarray[DTYPE_t, ndim=1] ranges_2_,  int n ):
    k = 0 
    for i in range(n):

        for j in range(0, len(ranges_1_), 3):
            a = ranges_1_[j]
            b = ranges_1_[j 1]
            c = ranges_1_[j 2]
                
            for f in range(0, len(ranges_2_), 2):
                d = ranges_2_[f]
                e = ranges_2_[f 1]

                k = (a   b   c   d   e)/(d e)
            
    return k

But that provided no speed up at all. The time to perform one single iteration of the 10_000 seems too high, considered that for one single iteration it's just 12_000_000 operations inside the loop. I also tried implementing a much simpler example both in Cython and in python which was then compiled with numba:

import numpy as np
cimport numpy as np
cimport cython
ctypedef np.int_t DTYPE_t

def test_1(int n ):
    cdef k = 0 
    cdef a = 0
    for i in range(n):
        a =  i  1
            
    return a

This took 15s to run with n = 1_000_000_000.

While with numba:

def test_1_python(n ):
    k = 0 
    a = 0
    
    for i in range(n):
        if i % 2 == 0:
            a = a   1
        else:
            a = a - 1
            
    return a

test_1_numba= numba.jit(test_1_python)

%%time
test_1_numba(120_000_000_000)

The full run with n = 120bln took about 6s, (albeit the function inside is simpler) this means that it would be 500 times faster than Cython, could this be possible?

I am new to Cython so I probably am missing something obvious, but since the numba version (without the array accessing) is tha much faster I think that the difference in speed might come from the overhead associated with accessing the items in the array.

Is this a wrong assumption?

If not, what would be the best of going about looping over a 2D list of integers in Cython?

CodePudding user response：

What you measure in your benchmarks is mainly compilation artefacts and overheads.

First of all, Cython use the default compiler installed on your machine preferred by the Python stack. On Linux, it should be GCC. On Windows, it is certainly MSVC if installed, otherwise MinGW (if any). Meanwhile, Numba is based on LLVM-Lite which is based on the LLVM stack like Clang. Thus, in your case, it is very likely that different compilers are used resulting in different binary with different performance. If you want to make a fair benchmark, you need to use Clang to build you Cython program.

Additionally, the default optimization for Cython is -O2 while it is -O3 for Numba. The former should not enable auto-vectorization while the later does (this is dependent of the target compilers -- newer version of GCC changes this behaviour). Furthermore, Cython does not enable machine-specific non-portable optimizations by default (since binaries may be packaged for other machines like with pip). This means Cython can only use the old SSE2 SIMD instruction set by default on x86-64 processors. Meanwhile, LLVM-JIT can use of the much faster AVX2/AVX-512 SIMD instruction set. You need to enable such optimization manually with Cython so for the benchmark to be fair (ie. -march=native on GCC/Clang).

In fact, on my x86-64 mainstream Intel machine, Numba does use the AVX2 instruction set and Cython does not on your last benchmark. Here is for example the main loop generated by the Numba JIT:

.LBB0_7:
        vptestnmq       %ymm5, %ymm4, %k1
        vpblendmq       %ymm5, %ymm6, %ymm18 {%k1}
        vpaddq  %ymm0, %ymm18, %ymm0
        vpaddq  %ymm1, %ymm18, %ymm1
        vpaddq  %ymm2, %ymm18, %ymm2
        vpaddq  %ymm3, %ymm18, %ymm3
        vpaddq  %ymm16, %ymm4, %ymm18
        vptestnmq       %ymm5, %ymm18, %k1
        vpblendmq       %ymm5, %ymm6, %ymm18 {%k1}
        vpaddq  %ymm0, %ymm18, %ymm0
        vpaddq  %ymm1, %ymm18, %ymm1
        vpaddq  %ymm2, %ymm18, %ymm2
        vpaddq  %ymm3, %ymm18, %ymm3
        vpaddq  %ymm17, %ymm4, %ymm4
        addq    $-2, %rdx
        jne     .LBB0_7

For the benchmark doing a = i 1 in a loop, it is flawed as a good compiler can just optimize out the whole loop (ie. remove it) and replace it with only one assignment since only the last iteration matter. In fact, the same thing applies with k = (a b c d e)/(d e): only the last iteration matters. The i variable of for i in range(n) is not even used. Clang and GCC often do such kind of optimizations.

Finally, the speed of your initial first code will be memory-bound if it is modified to compute something meaningful in a real-world use-case and multiple threads are used.

Note that division are very expensive and you can precompute the reciprocal so to perform multiplications instead in your main loop.