Problem statement:
As the title states, I want to remove parts of a 1D array that consist of consecutive zeros and whose length is equal to or above a threshold.
My solution:
I produced the solution shown in the following MRE:
import numpy as np
THRESHOLD = 4
a = np.array((1,1,0,1,0,0,0,0,1,1,0,0,0,1,0,0,0,0,0,1))
print("Input: " str(a))
# Find the indices of the parts that meet threshold requirement
gaps_above_threshold_inds = np.where(np.diff(np.nonzero(a)[0]) - 1 >= THRESHOLD)[0]
# Delete these parts from array
for idx in gaps_above_threshold_inds:
    # Deleting only zeros keeps the ordering of the nonzero elements stable,
    # so the recomputed nonzero indices remain valid after each deletion
    a = np.delete(a, list(range(np.nonzero(a)[0][idx] + 1, np.nonzero(a)[0][idx + 1])))
print("Output: " + str(a))
Output:
Input: [1 1 0 1 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 1]
Output: [1 1 0 1 1 1 0 0 0 1 1]
Question:
Is there a less complicated and more efficient way to do this on a numpy array?
Edit:
Based on @mozway's comments, I'm editing my question to provide some more information.
Basically, the problem domain is:
- I have 1D signals of length ~20,000 samples
- Some parts of the signals have been zeroed due to noise
- The rest of the signal has non-zero values, in the range ~[50, 250]
My goal, as already stated, is to remove the zeroed parts whose length is at or above a threshold.
More detailed questions:
- As far as efficient numpy handling is concerned, is there a better solution than the one above?
- As far as efficient signal processing techniques are concerned, is there a more suitable way to achieve the desired goal than using numpy?
CodePudding user response:
A quite efficient method is to use itertools.groupby and itertools.chain:
from itertools import groupby, chain
a2 = np.array(list(chain(*(l for k, g in groupby(a)
                           if len(l := list(g)) < THRESHOLD or k))))
output:
array([1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1])
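To see why this works: groupby splits the array into runs of equal values, and the filter keeps a run if it is shorter than the threshold (len(l := list(g)) < THRESHOLD) or if its value is non-zero (or k). A minimal illustration of the run decomposition, using the same a as above (the int() casts are only there to keep the printed output clean):
from itertools import groupby
runs = [(int(k), len(list(g))) for k, g in groupby(a)]
print(runs)
# [(1, 2), (0, 1), (1, 1), (0, 4), (1, 2), (0, 3), (1, 1), (0, 5), (1, 1)]
Only the (0, 4) and (0, 5) runs are dropped, which matches the output above.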
This works relatively fast, for instance on 1 million items:
# a = np.random.randint(2, size=1000000)
%%timeit
np.array(list(chain(*(l for k, g in groupby(a)
                      if len(l := list(g)) < THRESHOLD or k))))
# 254 ms ± 3.03 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
CodePudding user response:
np.delete creates a new array every time it is called, which is very inefficient. A faster approach is to mark all the values to keep in a mask/boolean array and then filter the input array in one go. However, this will still likely require a pure-Python loop if done only with Numpy (over the zero-runs rather than over every element).
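As a minimal sketch of that mask-based variant (the helper name filter_zeros_mask and the padding trick are my own illustration, not from the original post):
import numpy as np

def filter_zeros_mask(arr, threshold):
    # Locate every run of zeros by diffing a padded boolean view:
    # +1 marks a run start, -1 marks one past a run end
    is_zero = np.concatenate(([False], arr == 0, [False]))
    edges = np.diff(is_zero.astype(np.int8))
    starts = np.flatnonzero(edges == 1)
    ends = np.flatnonzero(edges == -1)
    # Keep everything, then clear only the runs that meet the threshold
    keep = np.ones(arr.shape[0], dtype=bool)
    for s, e in zip(starts, ends):  # one iteration per zero-run
        if e - s >= threshold:
            keep[s:e] = False
    return arr[keep]
The remaining loop runs once per zero-run rather than once per element, but it is still Python-level. A simpler and faster solution is to use Numba (or Cython) to do the whole filtering in one compiled pass. Here is an implementation: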
import numpy as np
import numba as nb

@nb.njit('int_[:](int_[:], int_)')
def filterZeros(arr, threshold):
    n = len(arr)
    res = np.empty(n, dtype=arr.dtype)
    count = 0  # length of the current run of zeros
    j = 0      # write position in res
    for i in range(n):
        if arr[i] == 0:
            count += 1
        else:
            # A non-zero value ends the run: roll back the zeros
            # just written if the run met the threshold
            if count >= threshold:
                j -= count
            count = 0
        res[j] = arr[i]
        j += 1
    # Handle a qualifying run of zeros at the very end of the array
    if n > 0 and arr[n-1] == 0 and count >= threshold:
        j -= count
    return res[0:j]
a = np.array((1,1,0,1,0,0,0,0,1,1,0,0,0,1,0,0,0,0,0,1))
a = filterZeros(a, 4)
print("Output: " str(a))
Here are the results with a random binary array containing 100,000 items on my machine:
Reference implementation: 5982 ms
Mozway's solution: 23.4 ms
This implementation: 0.11 ms
Thus, this solution is about 54,381 times faster than the initial one and 212 times faster than Mozway's. The code can even be ~30% faster by working in-place (destroying the input array) and by telling Numba the array is contiguous in memory (using ::1 instead of : in the signature).
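For example, the contiguity hint is only a change in the type declaration; a sketch (the in-place variant is not shown here):
@nb.njit('int_[::1](int_[::1], int_)')  # ::1 = C-contiguous 1D array
def filterZeros(arr, threshold):
    ...  # body identical to the version above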