When using Numba to access elements of multiple 2D NumPy arrays, is it better to use indices or to iterate the arrays directly? I'm finding that a combination of the two is the fastest, which seems counterintuitive to me. Or is there another, better way to do it?
For context, I am trying to speed up the implementation of the ray-tracing approach in this paper: https://iopscience.iop.org/article/10.1088/1361-6560/ac1f38/pdf.
I have a function which takes the intensity before propagation and the displacement maps that result from the propagation. The resulting intensity is then the original intensity displaced by the displacement maps pixel by pixel, with sub-pixel displacements being shared proportionately between the adjacent pixels. On a side note, can this be implemented directly in NumPy or in another library? I've noticed it is similar to OpenCV's remap function.
import numpy as np
from numba import njit

@njit
def raytrace_range(intensity_0, d_y, d_x):
    """
    Args:
        intensity_0 (2d numpy array): intensity before propagation
        d_y (2d numpy array): Displacement along y in pixels
        d_x (2d numpy array): Displacement along x in pixels
    Returns:
        intensity_z (2d numpy array): intensity after propagation
    """
    n_y, n_x = intensity_0.shape
    intensity_z = np.zeros((n_y, n_x), dtype=np.float64)
    for i in range(n_x):
        for j in range(n_y):
            i_ij = intensity_0[i, j]
            dx_ij = d_x[i, j]
            dy_ij = d_y[i, j]
            # Always the same from here down
            if not dx_ij and not dy_ij:
                intensity_z[i, j] += i_ij
                continue
            i_new = i
            j_new = j
            # Calculating displacement bigger than a pixel
            if np.abs(dx_ij) > 1:
                x = np.floor(dx_ij)
                i_new = int(i + x)
                dx_ij = dx_ij - x
            if np.abs(dy_ij) > 1:
                y = np.floor(dy_ij)
                j_new = int(j + y)
                dy_ij = dy_ij - y
            # Calculating sub-pixel displacement
            if 0 <= i_new and i_new < n_y and 0 <= j_new and j_new < n_x:
                intensity_z[i_new, j_new] += i_ij*(1-np.abs(dx_ij))*(1-np.abs(dy_ij))
                if i_new < n_y-1 and dx_ij >= 0:
                    if j_new < n_y-1 and dy_ij >= 0:
                        intensity_z[i_new+1, j_new] += i_ij*dx_ij*(1-dy_ij)
                        intensity_z[i_new+1, j_new+1] += i_ij*dx_ij*dy_ij
                        intensity_z[i_new, j_new+1] += i_ij*(1-dx_ij)*dy_ij
                    if j_new and dy_ij < 0:
                        intensity_z[i_new+1, j_new] += i_ij*dx_ij*(1-np.abs(dy_ij))
                        intensity_z[i_new+1, j_new-1] += i_ij*dx_ij*np.abs(dy_ij)
                        intensity_z[i_new, j_new-1] += i_ij*(1-dx_ij)*np.abs(dy_ij)
                if i_new and dx_ij < 0:
                    if j_new < n_x-1 and dy_ij >= 0:
                        intensity_z[i_new-1, j_new] += i_ij*np.abs(dx_ij)*(1-dy_ij)
                        intensity_z[i_new-1, j_new+1] += i_ij*np.abs(dx_ij)*dy_ij
                        intensity_z[i_new, j_new+1] += i_ij*(1-np.abs(dx_ij))*dy_ij
                    if j_new and dy_ij < 0:
                        intensity_z[i_new-1, j_new] += i_ij*np.abs(dx_ij)*(1-np.abs(dy_ij))
                        intensity_z[i_new-1, j_new-1] += i_ij*dx_ij*dy_ij
                        intensity_z[i_new, j_new-1] += i_ij*(1-np.abs(dx_ij))*np.abs(dy_ij)
    return intensity_z
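For the side note above: OpenCV's remap is a gather (it pulls source values from given coordinates), while this operation is a scatter that pushes each pixel's intensity to its displaced position, so remap is not a drop-in replacement. The closest pure-NumPy version I can think of is the textbook bilinear scatter via np.add.at. This is only a sketch of the standard formulation, not the exact branch structure above, so border handling may differ slightly, and raytrace_range pairs d_x with the first index, so swap the maps to match it exactly:

import numpy as np

def raytrace_numpy(intensity_0, d_y, d_x):
    # Sketch: standard bilinear "splatting" with np.add.at.
    # Assumes d_y displaces rows (axis 0) and d_x displaces
    # columns (axis 1).
    n_y, n_x = intensity_0.shape
    i, j = np.meshgrid(np.arange(n_y), np.arange(n_x), indexing="ij")
    y, x = i + d_y, j + d_x
    y0 = np.floor(y).astype(np.int64)
    x0 = np.floor(x).astype(np.int64)
    wy, wx = y - y0, x - x0
    out = np.zeros((n_y, n_x), dtype=np.float64)
    # Deposit each pixel's intensity into the four surrounding cells
    for oy, ox, w in ((0, 0, (1 - wy) * (1 - wx)),
                      (0, 1, (1 - wy) * wx),
                      (1, 0, wy * (1 - wx)),
                      (1, 1, wy * wx)):
        ty, tx = y0 + oy, x0 + ox
        m = (ty >= 0) & (ty < n_y) & (tx >= 0) & (tx < n_x)
        np.add.at(out, (ty[m], tx[m]), (intensity_0 * w)[m])
    return out

Note that np.add.at is correct for repeated indices but notoriously slow, so the Numba loop will likely still win.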
I've tried a few other approaches, of which this is the fastest. Each includes the code from above after the comment # Always the same from here down, which I've omitted to keep the question relatively short:
@njit
def raytrace_enumerate(intensity_0, d_y, d_x):
    n_y, n_x = intensity_0.shape
    intensity_z = np.zeros((n_y, n_x), dtype=np.float64)
    for i, i_i in enumerate(intensity_0):
        for j, i_ij in enumerate(i_i):
            dx_ij = d_x[i, j]
            dy_ij = d_y[i, j]
            # Always the same from here down

@njit
def raytrace_npndenumerate(intensity_0, d_y, d_x):
    n_y, n_x = intensity_0.shape
    intensity_z = np.zeros((n_y, n_x), dtype=np.float64)
    for (i, j), i_ij in np.ndenumerate(intensity_0):
        dx_ij = d_x[i, j]
        dy_ij = d_y[i, j]
        # Always the same from here down

@njit
def raytrace_zip(intensity_0, d_y, d_x):
    n_y, n_x = intensity_0.shape
    intensity_z = np.zeros((n_y, n_x), dtype=np.float64)
    for i, (i_i, dy_i, dx_i) in enumerate(zip(intensity_0, d_y, d_x)):
        for j, (i_ij, dy_ij, dx_ij) in enumerate(zip(i_i, dy_i, dx_i)):
            # Always the same from here down

@njit
def raytrace_stack1(idydx):
    n_y, _, n_x = idydx.shape
    intensity_z = np.zeros((n_y, n_x), dtype=np.float64)
    for i, (i_i, dy_i, dx_i) in enumerate(idydx):
        for j, (i_ij, dy_ij, dx_ij) in enumerate(zip(i_i, dy_i, dx_i)):
            # Always the same from here down

@njit
def raytrace_stack2(idydx):
    n_y, n_x, _ = idydx.shape
    intensity_z = np.zeros((n_y, n_x), dtype=np.float64)
    for i, k in enumerate(idydx):
        for j, (i_ij, dy_ij, dx_ij) in enumerate(k):
            # Always the same from here down
Make up some test data and time:
import timeit
rng = np.random.default_rng()
size = (2010, 2000)
margin = 10
test_data = np.pad(10000*rng.random(size=size), margin)
dx = np.pad(10*(rng.random(size=size)-0.5), margin)
dy = np.pad(10*(rng.random(size=size)-0.5), margin)

# Check results are the same
L = [raytrace_range(test_data, dy, dx),
     raytrace_enumerate(test_data, dy, dx),
     raytrace_npndenumerate(test_data, dy, dx),
     raytrace_zip(test_data, dy, dx),
     raytrace_stack1(np.stack([test_data, dy, dx], axis=1)),
     raytrace_stack2(np.stack([test_data, dy, dx], axis=2))]
print((np.diff(np.vstack(L).reshape(len(L), -1), axis=0) == 0).all())

%timeit raytrace_range(test_data, dy, dx)
%timeit raytrace_enumerate(test_data, dy, dx)
%timeit raytrace_npndenumerate(test_data, dy, dx)
%timeit raytrace_zip(test_data, dy, dx)
%timeit raytrace_stack1(np.stack([test_data, dy, dx], axis=1))  # Note: this would be the fastest if the arrays were pre-stacked
%timeit raytrace_stack2(np.stack([test_data, dy, dx], axis=2))
Output:
True
40.4 ms ± 233 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
37.5 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
46.8 ms ± 112 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
38.6 ms ± 243 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
42 ms ± 234 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) #Note this would be the fastest if the arrays were pre-stacked
47.4 ms ± 203 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
CodePudding user response:
Edit 3: Turns out that removing if statements makes range faster than enumerate. See Edit 2 below.
Interestingly, on my machine the times get awful in the stack1 and stack2 options, and indeed enumerate seems to be the fastest. Maybe thanks to enumerate, Numba understands it is a looping variable?:
In [1]: %timeit raytrace_range(test_data, dy, dx)
   ...: %timeit raytrace_enumerate(test_data, dy, dx)
   ...: %timeit raytrace_npndenumerate(test_data, dy, dx)
   ...: %timeit raytrace_zip(test_data, dy, dx)
   ...: %timeit raytrace_stack1(np.stack([test_data, dy, dx], axis=1))  # Note: this would be the fastest if the arrays were pre-stacked
   ...: %timeit raytrace_stack2(np.stack([test_data, dy, dx], axis=2))
61 ms ± 785 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
53.9 ms ± 998 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
69.9 ms ± 471 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
57.5 ms ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
109 ms ± 478 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
146 ms ± 1.67 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Edit: Using fastmath=True did not shave off much time, only ~3 ms.
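For reference, that flag just goes on the decorator, e.g.:

@njit(fastmath=True)
def raytrace_range(intensity_0, d_y, d_x):
    ...  # body unchanged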
Edit 2: Although it is not related to the OP's question, after playing a bit with the functions it turns out that removing the "superfluous"(*) conditional statements makes them notably faster, around 20% on my machine. The implementation works without them (at least the supplied test returns True):
(*) The operations seem to work regardless, as those cases are "caught" by the operations further down. At least, the provided test vector did not report any issues.
#! Using this it is faster:
# Always the same from here down
# if dx_ij == 0 and dy_ij == 0:
#     intensity_z[i, j] += i_ij
#     continue
# Calculating displacement bigger than a pixel
x = np.floor(dx_ij)
i_new = int(i + x)
dx_ij = dx_ij - x
y = np.floor(dy_ij)
j_new = int(j + y)
dy_ij = dy_ij - y
# Calculating sub-pixel displacement
In [2]: %timeit raytrace_range(test_data, dy, dx)
...: %timeit raytrace_range2(test_data, dy, dx)
...: %timeit raytrace_enumerate(test_data, dy, dx)
64.8 ms ± 1.37 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
52.9 ms ± 1.34 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
56.1 ms ± 1.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
CodePudding user response:
In general, the fastest way to iterate over an array is a basic low-level integer loop. Such a pattern causes the minimum number of transformations in Numba, so the compiler should be able to optimize the code pretty well. Functions like zip and enumerate often add additional overhead due to indirect code transformations that are not perfectly optimized out.
Here is a basic example:
import numpy as np
import numba as nb

@nb.njit('(int_[::1],)')
def test(arr):
    s1 = s2 = 0
    for i in range(arr.shape[0]):
        s1 += i
        s2 += arr[i]
    return (s1, s2)

arr = np.arange(200_000)
test(arr)
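If you want to see what Numba actually generated here (e.g. whether the index handling was removed), you can dump the IR of the compiled dispatcher; inspect_llvm() with no argument returns a dict keyed by signature:

# Eyeball the LLVM IR of each compiled specialization of `test`
for sig, ir in test.inspect_llvm().items():
    print(sig)
    print(ir[:2000])  # the IR is long; print only the beginning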
However, things are more complex when you read/write multiple arrays simultaneously (which is your case). Indeed, NumPy arrays can be indexed with negative indices, so Numba needs to perform bound checking every time. This check is expensive compared to the actual access, and it can even break some other optimizations (e.g. vectorization).
Consequently, Numba has been optimized to analyse the code, detect cases where bound checking is not needed, and avoid adding expensive checks at runtime. This is the case in the above code, but not in your raytrace_range function. enumerate and zip can help a lot to remove bound checking because Numba can easily prove that the index lies within the bounds of the array (theoretically it could prove this for raytrace_range too, but the current implementation is unfortunately not smart enough).
You can mostly solve this problem using assertions. They are not only good for optimization but also make your code more robust!
Moreover, the indexing of multidimensional arrays is sometimes not perfectly optimized by the underlying JIT (LLVM-Lite). There is no reason for it not to be optimized, but compilers use heuristics to optimize the code that are far from perfect (though pretty good on average). You can help by computing views of rows. This generally results in a tiny improvement.
Here is the improved code:
@njit
def raytrace_range_opt(intensity_0, d_y, d_x):
    n_y, n_x = intensity_0.shape
    assert intensity_0.shape == d_y.shape
    assert intensity_0.shape == d_x.shape
    intensity_z = np.zeros((n_y, n_x), dtype=np.float64)
    for i in range(n_x):
        row_intensity_0 = intensity_0[i, :]
        row_d_x = d_x[i, :]
        row_d_y = d_y[i, :]
        for j in range(n_y):
            assert j >= 0  # Crazy optimization (see later)
            i_ij = row_intensity_0[j]
            dx_ij = row_d_x[j]
            dy_ij = row_d_y[j]
            # Always the same from here down
            if not dx_ij and not dy_ij:
                intensity_z[i, j] += i_ij
                continue
            # Remaining code left unmodified
Notes
Note that I think the indexing of the function raytrace_range is bogus: it should be for i in range(n_y): and for j in range(n_x): instead, since the accesses are done with intensity_0[i, j] and you wrote n_y, n_x = intensity_0.shape. Note that swapping the axes also gives correct results based on your validation function (which is suspicious).
The assert j >= 0 instruction alone results in an 8% speed-up, which is crazy since the integer iterator j is guaranteed to be non-negative: it comes from a range bounded by a shape, and shapes are never negative! This is clearly a missed optimization of Numba that LLVM-Lite cannot perform (since LLVM-Lite does not know what a shape is, nor that shapes are always non-negative). This apparently missing assumption in the Numba code causes additional bound checking (on each of the three arrays), which is pretty expensive.
Benchmark
Here are results on my machine:
raytrace_range: 47.8 ms ± 265 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
raytrace_enumerate: 38.9 ms ± 208 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
raytrace_npndenumerate: 54.1 ms ± 363 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
raytrace_zip: 41 ms ± 657 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
raytrace_stack1: 86.7 ms ± 268 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
raytrace_stack2: 84 ms ± 432 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
raytrace_range_opt: 38.6 ms ± 421 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
As you can see, raytrace_range_opt is the fastest implementation on my machine.