Neighbors sum of numpy array with mask

I have two large arrays: one containing values, and one that is essentially a mask. The code below shows the function I want to implement.

from scipy.signal import convolve2d
import numpy as np

sample = np.array([[6, 4, 5, 5, 5],
                   [7, 1, 0, 8, 3],
                   [2, 5, 4, 8, 4],
                   [2, 0, 2, 6, 0],
                   [5, 7, 2, 3, 2]])

mask = np.array([[1, 0, 1, 1, 0],
                 [0, 0, 1, 0, 1],
                 [0, 1, 0, 0, 0],
                 [0, 0, 0, 1, 0],
                 [1, 1, 0, 0, 1]])

neighbors_sum = convolve2d(sample, np.ones((3,3), dtype=int), mode='same', boundary='wrap')

# neighbors_sum = np.array([[40, 37, 35, 33, 44],
#                           [37, 34, 40, 42, 48],
#                           [24, 23, 34, 35, 40],
#                           [27, 29, 37, 31, 32],
#                           [31, 33, 34, 30, 34]])

result = np.where(mask, neighbors_sum, 0)

print(result)

This code works and gets me what I expect:

np.array([[40,  0, 35, 33,  0],
          [ 0,  0, 40,  0, 48],
          [ 0, 23,  0,  0,  0],
          [ 0,  0,  0, 31,  0],
          [31, 33,  0,  0, 34]])

So far, so good. However, I run into a serious issue when I increase the size of the arrays. In my case, instead of a 5x5 input and a 3x3 summing mask, I need a 50,000x20,000 input and a 100x100 summing mask, and at that scale convolve2d is in all kinds of trouble and the calculation takes extremely long.
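For scale, a direct 2D convolution costs on the order of n·m·k² multiply-adds, which here is about 50,000 × 20,000 × 100 × 100 = 10^13 operations, so a direct approach is bound to be very slow.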

Given that I only care about the masked result, and thus only need the convolve2d sums at those points, can anyone think of a smart approach to take here? Falling back to a for loop over just the points of interest would lose the speed advantage of vectorization, so I'm not convinced it would be worth it.

Any suggestion welcome!

CodePudding user response:

convolve2d is very inefficient in this case. Since the kernel is all ones, you can split the filter in two trivial ones thanks to separable filtering: one np.ones((100, 1)) filter and one np.ones((1, 100)) filter. Moreover, a rolling sum can be used to speed up the computation even more.

Here is a simple solution without a rolling sum:

# Simple faster implementation

tmp = convolve2d(sample, np.ones((1,100), dtype=int), mode='same', boundary='wrap')
neighbors_sum = convolve2d(tmp, np.ones((100,1), dtype=int), mode='same', boundary='wrap')
result = np.where(mask, neighbors_sum, 0)
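
As a quick sanity check (not part of the original answer, and using the 3x3 kernel from the question so the comparison stays small), the separable two-pass result can be compared against the direct 2D convolution:

direct = convolve2d(sample, np.ones((3,3), dtype=int), mode='same', boundary='wrap')
tmp = convolve2d(sample, np.ones((1,3), dtype=int), mode='same', boundary='wrap')
separable = convolve2d(tmp, np.ones((3,1), dtype=int), mode='same', boundary='wrap')

# A circular convolution with a kernel of ones is exactly separable,
# so the two integer results are identical.
assert np.array_equal(direct, separable)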

You can compute the rolling sum efficiently using Numba. The strategy is to split the computation into 3 parts: the horizontal rolling sum, the vertical rolling sum, and the final masking. Each step can be fully parallelized using multiple threads (although parallelizing the vertical rolling sum is harder with Numba). Each part needs to work line by line so as to be cache friendly.
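
To build intuition before the full code, here is a minimal 1D sketch of the circular rolling sum (not part of the original answer): the window sum is updated in O(1) per step by adding the entering element and subtracting the leaving one, so a whole row costs O(m) regardless of the filter size.

def rolling_sum_1d(x, fs):
    # Matches convolve(x, np.ones(fs), mode='same') with wrap-around boundaries
    n = len(x)
    lShift, rShift = fs//2, (fs-1)//2
    out = np.empty(n, dtype=np.float64)
    s = sum(x[j % n] for j in range(-lShift, rShift+1))  # initial window around index 0
    out[0] = s
    for j in range(1, n):
        s += x[(j+rShift) % n] - x[(j-1-lShift) % n]  # slide the window by one step
        out[j] = s
    return out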

# Complex very-fast implementation

import numba as nb

# Numerical results may diverge if the input contains big 
# values with many small ones.
# Does not support inputs containing NaN values or +/- Inf ones.
@nb.njit('float64[:,::1](float64[:,::1], int_)', parallel=True, fastmath=True)
def horizontalRollingSum(sample, filterSize):
    n, m = sample.shape
    fs = filterSize

    # Make the wrapping part of the rolling sum much simpler
    assert fs >= 1
    assert n >= fs and m >= fs

    # Horizontal rolling sum.
    tmp = np.empty((n, m), dtype=np.float64)
    for i in nb.prange(n):
        s = 0.0
        lShift = fs//2
        rShift = (fs-1)//2
        for j in range(m-lShift, m):
            s += sample[i, j]
        for j in range(0, rShift+1):
            s += sample[i, j]
        tmp[i, 0] = s
        for j in range(1, m):
            jLeft, jRight = (j-1-lShift)%m, (j+rShift)%m
            s += sample[i, jRight] - sample[i, jLeft]
            tmp[i, j] = s

    return tmp

@nb.njit('float64[:,::1](float64[:,::1], int_)', fastmath=True)
def verticalRollingSum(sample, filterSize):
    n, m = sample.shape
    fs = filterSize

    # Make the wrapping part of the rolling sum much simpler
    assert fs >= 1
    assert n >= fs and m >= fs

    # Vertical rolling sum.
    tmp = np.empty((n, m), dtype=np.float64)
    tShift = fs//2
    bShift = (fs-1)//2
    for j in range(m):
        tmp[0, j] = 0.0
    for i in range(n-tShift, n):
        for j in range(m):
            tmp[0, j] += sample[i, j]
    for i in range(0, bShift+1):
        for j in range(m):
            tmp[0, j] += sample[i, j]
    for i in range(1, n):
        iTop = (i-1-tShift)%n
        iBot = (i+bShift)%n
        for j in range(m):
            tmp[i, j] = tmp[i-1, j] + (sample[iBot, j] - sample[iTop, j])

    return tmp

@nb.njit('float64[:,::1](float64[:,::1], int_[:,::1], int_)', parallel=True, fastmath=True)
def compute(sample, mask, filterSize):
    n, m = sample.shape
    tmp = horizontalRollingSum(sample, filterSize)
    neighbors_sum = verticalRollingSum(tmp, filterSize)
    res = np.empty((n, m), dtype=np.float64)

    for i in nb.prange(n):
        for j in range(m):
            res[i, j] = neighbors_sum[i, j] * mask[i, j]

    return res
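
For reference, here is a usage sketch on the question's small 5x5 arrays (with a 3x3 window, since the asserts require the filter to fit inside the input). The dtypes must match the Numba signatures, so casts are needed:

sample_f = sample.astype(np.float64)  # float64[:,::1], as the signature requires
mask_i = mask.astype(np.int_)         # int_[:,::1]; may need np.int64 on some platforms
print(compute(sample_f, mask_i, 3))   # should reproduce the masked result above, as floats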

Benchmark & Notes

Here is the testing code:

n, m = 5000, 2000
sample = np.random.rand(n, m)
mask = (np.random.rand(n, m) < 0.05).astype(int)
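
For reproduction, a minimal timing harness along these lines can be used (a sketch, not from the original answer; the first call is measured separately because it includes the JIT compilation time):

import time

compute(sample, mask, 100)  # warm-up call: triggers the JIT compilation

start = time.perf_counter()
res = compute(sample, mask, 100)
print(f'Numba solution: {(time.perf_counter() - start) * 1000:.1f} ms')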

Here are the results on my 6-core machine:

Initial solution:           174366 ms   (x1)
With separate filters:        5710 ms   (x31)
Final Numba solution:           40 ms   (x4359)
Optimal theoretical time:       10 ms   (optimistic)

Thus, the Numba implementation is 4359 times faster than the initial one.

That being said, be careful about the numerical issues this last implementation can have depending on the input array (see the comments in the code). It should be fine as long as np.std(sample) is relatively small and np.all(np.isfinite(sample)) is true.

Note that the code can be further optimized: the vertical rolling sum can be parallelized; modulus operations can be avoided in the horizontal rolling sum; the vertical rolling sum and the masking steps can be merged together (i.e. computing res on the fly instead of storing tmp); tiling can be used to compute all the steps simultaneously in a more cache-friendly way. However, these optimizations make the code more complex, and some of them are very hard to perform (especially the last one with Numba).

Note that using a boolean mask (instead of an integer-based one) should make the algorithm faster since it takes less memory and processors can fetch values faster.
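
For example, a boolean variant of the final step could look like this sketch (compute_bool is a hypothetical name, and the 'boolean[:,::1]' signature is an assumption to check against your Numba version):

@nb.njit('float64[:,::1](float64[:,::1], boolean[:,::1], int_)', parallel=True, fastmath=True)
def compute_bool(sample, mask, filterSize):
    n, m = sample.shape
    tmp = horizontalRollingSum(sample, filterSize)
    neighbors_sum = verticalRollingSum(tmp, filterSize)
    res = np.empty((n, m), dtype=np.float64)
    for i in nb.prange(n):
        for j in range(m):
            # A boolean mask takes 1 byte per element instead of 8 for int64
            res[i, j] = neighbors_sum[i, j] if mask[i, j] else 0.0
    return res

res = compute_bool(sample, mask.astype(bool), 100)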
