Iterating over regions of numpy array in parallel


I have a 3D array and need to iterate over it, extracting each 2x2x2-voxel region and checking whether any voxel in it is non-zero. For those locations, I need the unique elements of the region and the index:

import time
import numpy as np
np.random.seed(1234)

def _naive_iterator(array):
    lookup = np.pad(array, (1, 1), 'constant')  # Border is filled with 0
    nx, ny, nz = lookup.shape
    for i in range(nx - 1):
        for j in range(ny - 1):
            for k in range(nz - 1):
                n = lookup[i:i + 2, j:j + 2, k:k + 2]
                if n.any():  # check if any value in the region is non-zero
                    yield n.ravel(), i, j, k
                    # yield set(n.ravel()), i, j, k  # `set()` alone takes some time - for testing purposes exclude this

# arrays that shall be used here are in the region of (1000, 1000, 1000) or larger
# arrays are asserted to contain only integer values >= 0.
array = np.random.randint(0, 2, (200, 200, 200), dtype=np.uint8)

for fun in (_naive_iterator, ):
    print(f"* {fun}")
    for _ in range(2):
        tic = time.time()
        [x for x in fun(array)]
        print(f" ** execution took {time.time() - tic}")

On my PC, this loop takes about 24 s to run. (Interesting side note: without the n.any(), the loop needs only 8 s, so maybe there is some optimization potential there as well?)

I thought about how I could make this faster, potentially by running it in parallel, but I cannot figure out how to do that without pre-generating all the 2x2x2 arrays. I also thought about using scipy.ndimage.generic_filter, but with that I only get an image that has, for example, a 1 on every voxel I want to include, and I would still have to iterate over the original image to get n.ravel(). (Ideally, one would use generic_filter directly, but I cannot get the index inside the called function.) Roughly, what I mean is the sketch below.
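
A rough, untested sketch of that generic_filter idea (it only produces a mask, and the callback never sees its own position):

from scipy import ndimage

# Mark every position whose 2x2x2 neighbourhood contains a non-zero voxel.
# The callback only receives the flattened window values, not the window index.
mask = ndimage.generic_filter(array, lambda n: n.any(), size=2,
                              mode='constant', cval=0)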

How can I speed up this loop, potentially by parallelizing the iteration?

CodePudding user response:

Whenever you're working with numpy, you should try to avoid explicit loops. These loops are written in Python and are therefore usually slower than anything you can do with vectorization, which defers the looping to the underlying C functions that are about as fast as anything can be. So I would approach your problem with something like the following. This function does roughly the same thing as your _naive_iterator, but in a vectorized manner without any Python loops:

from numpy.lib.stride_tricks import sliding_window_view
def get_windows_and_indices(array):
    lookup = np.pad(array, (1, 1), 'constant')  # Border is filled with 0
    nx, ny, nz = lookup.shape
    x, y, z = np.mgrid[0:nx, 0:ny, 0:nz]
    lookup = np.stack([lookup, x, y, z])
    out = sliding_window_view(lookup, (2, 2, 2), axis=(1, 2, 3)).reshape(4, -1, 2, 2, 2)
    windows = out[0, ...]
    ic = out[1, ..., 0, 0, 0]
    jc = out[2, ..., 0, 0, 0]
    kc = out[3, ..., 0, 0, 0]
    mask = windows.any(axis=(1, 2, 3))
    return windows[mask], ic[mask], jc[mask], kc[mask]

Of course you will also need to think about the rest of your code a little bit differently, but vectorization is really something you need to get used to if you want to work with numpy efficiently. Also, I'm pretty sure that even the function above is not optimal and can definitely be improved further. A minimal usage sketch follows.
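
This is just illustrative, assuming the same 200x200x200 test array as in the question:

windows, ic, jc, kc = get_windows_and_indices(array)
print(windows.shape)           # (number of non-empty windows, 2, 2, 2)
print(ic[:3], jc[:3], kc[:3])  # corner indices of the first few windows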

CodePudding user response:

without the n.any(), the loop needs only 8s, so maybe there is some optimization potential as well?

This is because Numpy functions have a big overhead for very small arrays like 2x2x2 ones. The overhead of a Numpy function call is about a few microseconds, while the actual n.any() computation should take no more than a dozen nanoseconds on a mainstream processor. The usual solution is to vectorize the operation so as to avoid many Numpy calls. You can use Numba to speed up this code and remove most of the CPython/Numpy overheads. Note that Numba does not currently support all functions (np.pad, for example), so a workaround is needed. Here is the resulting code:

import time
import numpy as np
import numba as nb
np.random.seed(1234)

@nb.njit('(uint8[:,:,::1],)')
def numba_iterator(lookup):
    nx, ny, nz = lookup.shape
    for i in range(nx - 1):
        for j in range(ny - 1):
            for k in range(nz - 1):
                n = lookup[i:i + 2, j:j + 2, k:k + 2]
                if n.any():
                    yield n.ravel(), i, j, k

array = np.random.randint(0, 2, (200, 200, 200), dtype=np.uint8)

for fun in (numba_iterator, ):
    print(f"* {fun}")
    for _ in range(2):
        tic = time.time()
        lookup = np.pad(array, (1, 1), 'constant')  # Border is filled with 0
        [x for x in fun(lookup)]
        print(f" ** execution took {time.time() - tic}")

This is significantly faster on my machine (but still quite slow).

I thought about how I could make this faster, potentially by running it in parallel.

This is not possible as long as the yield is used since generators are inherently sequential.

How can I speed up this loop

One solution could be to generate the whole output as a Numpy array in Numba, so as to avoid the creation of 8 million Numpy objects stored in a CPython list, which is the main source of slowdown of the code once optimized with Numba (each call to n.ravel creates a new array). Note that generators are generally slow since they often require a context switch (of a kind of lightweight thread / coroutine). The best solution in terms of performance is to compute the data on the fly in the loop.

Additionally, n.any and n.ravel can be manually rewritten in Numba so as to be more efficient. Indeed, the n array views are very small, and using 3 nested loops with a constant compile-time bound helps the compiler to produce fast code (i.e. it can unroll the loops and generate only a few instructions that the processor can execute very efficiently).

Here is the modified, improved code (which computes the padded array manually):

@nb.njit('(uint8[:,:,::1],)')
def fast_compute(array):
    nx, ny, nz = array.shape

    # Padding (with zeros)

    lookup = np.zeros((nx + 2, ny + 2, nz + 2), dtype=np.uint8)
    for i in range(nx):
        for j in range(ny):
            for k in range(nz):
                lookup[i + 1, j + 1, k + 1] = array[i, j, k]

    # Actual computation

    size = (nx + 1) * (ny + 1) * (nz + 1)
    result = np.empty((size, 8), dtype=np.uint8)
    indices = np.empty((size, 3), dtype=np.uint32)
    cur = 0

    for i in range(nx + 1):
        for j in range(ny + 1):
            for k in range(nz + 1):
                n = lookup[i:i + 2, j:j + 2, k:k + 2]

                # Fast manual n.any()
                found = False
                for i2 in range(2):
                    for j2 in range(2):
                        for k2 in range(2):
                            found |= n[i2, j2, k2]

                if found:
                    # Fast manual n.ravel()
                    cur2 = 0
                    for i2 in range(2):
                        for j2 in range(2):
                            for k2 in range(2):
                                result[cur, cur2] = n[i2, j2, k2]
                                cur2 += 1

                    indices[cur, 0] = i
                    indices[cur, 1] = j
                    indices[cur, 2] = k
                    cur += 1

    return result[:cur].reshape(cur, 2, 2, 2), indices[:cur]

The resulting code is quite big, but this is the price to pay for high-performance computing. A short usage sketch follows.
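
This is just illustrative, using the same test array as in the question:

result, indices = fast_compute(array)
print(result.shape)    # (number of non-empty windows, 2, 2, 2)
print(indices.shape)   # (number of non-empty windows, 3)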


Benchmark

_naive_iterator:           21.25 s
numba_iterator:             8.10 s
get_windows_and_indices:    1.35 s
fast_compute:               0.13 s

The last Numba function is 163 times faster than the initial one and 10 times faster than the vectorized Numpy implementation of @flawr.

The Numba implementation could certainly be multi-threaded, but it is not easy to do since the threads need to write to the output and the location of the written value depends on the other threads. Moreover, it would make the code significantly more complex. One possible two-pass scheme is sketched below.
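
This is only an untested sketch of the idea (count the non-empty windows of each i-slab in parallel, compute the write offsets with a prefix sum, then fill the output in parallel); the names fast_compute_parallel, counts and offsets are mine, and it is not a tuned implementation:

@nb.njit('(uint8[:,:,::1],)', parallel=True)
def fast_compute_parallel(array):
    nx, ny, nz = array.shape

    # Padding (with zeros), as in fast_compute
    lookup = np.zeros((nx + 2, ny + 2, nz + 2), dtype=np.uint8)
    lookup[1:nx + 1, 1:ny + 1, 1:nz + 1] = array

    # First pass: count the non-empty windows of each i-slab in parallel
    counts = np.zeros(nx + 1, dtype=np.int64)
    for i in nb.prange(nx + 1):
        c = 0
        for j in range(ny + 1):
            for k in range(nz + 1):
                if lookup[i:i + 2, j:j + 2, k:k + 2].any():
                    c += 1
        counts[i] = c

    # Exclusive prefix sum: write offset of each slab in the output
    offsets = np.zeros(nx + 2, dtype=np.int64)
    for i in range(nx + 1):
        offsets[i + 1] = offsets[i] + counts[i]
    total = offsets[nx + 1]

    result = np.empty((total, 2, 2, 2), dtype=np.uint8)
    indices = np.empty((total, 3), dtype=np.uint32)

    # Second pass: each slab writes into its own disjoint part of the output
    for i in nb.prange(nx + 1):
        cur = offsets[i]
        for j in range(ny + 1):
            for k in range(nz + 1):
                n = lookup[i:i + 2, j:j + 2, k:k + 2]
                if n.any():
                    for i2 in range(2):
                        for j2 in range(2):
                            for k2 in range(2):
                                result[cur, i2, j2, k2] = n[i2, j2, k2]
                    indices[cur, 0] = i
                    indices[cur, 1] = j
                    indices[cur, 2] = k
                    cur += 1

    return result, indices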
