Is there a better way to remove parts with consecutive zeros that have length equal or above a threshold?

Time:11-27

Problem statement:

As stated in the title, I want to remove parts from a 1D array that consist of consecutive zeros and have length equal to or above a threshold.


My solution:

I produced the solution shown in the following MRE:

import numpy as np

THRESHOLD = 4

a = np.array((1,1,0,1,0,0,0,0,1,1,0,0,0,1,0,0,0,0,0,1))

print("Input: "   str(a))

# Find the indices of the parts that meet threshold requirement
gaps_above_threshold_inds = np.where(np.diff(np.nonzero(a)[0]) - 1 >= THRESHOLD)[0]

# Delete these parts from array
for idx in gaps_above_threshold_inds:
    a = np.delete(a, list(range(np.nonzero(a)[0][idx] + 1, np.nonzero(a)[0][idx + 1])))

print("Output: " + str(a))

Output:

Input:  [1 1 0 1 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 1]
Output: [1 1 0 1 1 1 0 0 0 1 1]
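To unpack the one-liner above: `np.nonzero(a)[0]` gives the positions of the non-zero samples, and the gap between consecutive positions minus 1 is the length of the zero run between them. A small illustration on the same input:

```python
import numpy as np

a = np.array((1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1))

nz = np.nonzero(a)[0]    # positions of the non-zero samples
gaps = np.diff(nz) - 1   # zero-run length between consecutive non-zero samples

print(nz)    # [ 0  1  3  8  9 13 19]
print(gaps)  # [0 1 4 0 3 5]
```

Comparing `gaps` against `THRESHOLD` is what selects the runs of length 4 and 5 for deletion.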

Question:

Is there a less complicated and more efficient way to do this on a numpy array?


Edit:

Based on @mozway's comments, I'm editing my question to provide some more information.

Basically, the problem domain is:

  • I have 1D signals of length ~20,000 samples
  • Some parts of the signals have been zeroed due to noise
  • The rest of the signal has non-zero values, in the range ~[50, 250]

My goal is to remove the zero runs at or above the length threshold, as stated above.

More detailed questions:

  • As far as efficient numpy handling is concerned, is there a better solution than the one above?
  • As far as efficient signal processing techniques are concerned, is there a more suitable way to achieve the desired goal than using numpy?

CodePudding user response:

A quite efficient method is to use itertools.groupby + itertools.chain:

from itertools import groupby, chain
a2 = np.array(list(chain(*(l for k,g in groupby(a)
                           if len(l:=list(g))<THRESHOLD or k))))

output:

array([1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1])
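The trick is that `groupby` yields maximal runs of equal values, so a run is dropped exactly when it is all zeros (`k` falsy) and at least `THRESHOLD` long. A small illustration of the grouping step:

```python
from itertools import groupby

a = [1, 1, 0, 0, 0, 0, 1]

# Each run of equal values becomes one (value, run_length) pair
runs = [(k, len(list(g))) for k, g in groupby(a)]
print(runs)  # [(1, 2), (0, 4), (1, 1)]
```

With `THRESHOLD = 4`, only the `(0, 4)` run fails the condition `len(l) < THRESHOLD or k` and is filtered out.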

This works relatively fast, for instance on 1 million items:

# a = np.random.randint(2, size=1000000)
%%timeit
np.array(list(chain(*(l for k,g in groupby(a)
                      if len(l:=list(g))<THRESHOLD or k))))

# 254 ms ± 3.03 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

CodePudding user response:

np.delete creates a new array every time it is called, which is very inefficient. A faster solution is to store all the values to keep in a boolean mask, then filter the input array in one pass. However, building the mask will still likely require a pure-Python loop if done only with Numpy. A simpler and faster solution is to use Numba (or Cython). Here is an implementation:

import numpy as np
import numba as nb

@nb.njit('int_[:](int_[:], int_)')
def filterZeros(arr, threshold):
    n = len(arr)
    res = np.empty(n, dtype=arr.dtype)
    count = 0
    j = 0
    for i in range(n):
        if arr[i] == 0:
            count += 1
        else:
            if count >= threshold:
                j -= count
            count = 0
        res[j] = arr[i]
        j += 1
    if n > 0 and arr[n-1] == 0 and count >= threshold:
        j -= count
    return res[0:j]

a = np.array((1,1,0,1,0,0,0,0,1,1,0,0,0,1,0,0,0,0,0,1))
a = filterZeros(a, 4)
print("Output: "   str(a))

Here are the results with a random binary array containing 100_000 items on my machine:

Reference implementation: 5982    ms
Mozway's solution:          23.4  ms
This implementation:         0.11 ms

Thus, this solution is about 54,381 times faster than the initial solution and 212 times faster than Mozway's. The code can even be ~30% faster by working in-place (destroying the input array) and by telling Numba the array is contiguous in memory (using ::1 instead of : in the signature).
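For completeness, the keep-mask mentioned above can also be built in pure NumPy without an explicit per-run loop, by run-length encoding the zero mask and using a boundary/cumsum trick. The function name `filter_zeros_numpy` is chosen here for illustration; this is a sketch, not the answer's implementation, and it will generally be slower than the Numba version:

```python
import numpy as np

def filter_zeros_numpy(arr, threshold):
    # Pad the zero mask so every zero run has a start and an end boundary
    z = np.concatenate(([False], arr == 0, [False]))
    d = np.diff(z.astype(np.int8))
    starts = np.flatnonzero(d == 1)    # first index of each zero run
    ends = np.flatnonzero(d == -1)     # one past the last index of each run
    long_run = (ends - starts) >= threshold
    # -1 at each long run's start and +1 at its end make the cumulative sum
    # nonzero exactly inside the runs to remove
    delta = np.zeros(len(arr) + 1, dtype=np.int8)
    delta[starts[long_run]] -= 1
    delta[ends[long_run]] += 1
    keep = np.cumsum(delta[:-1]) == 0
    return arr[keep]

a = np.array((1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1))
print(filter_zeros_numpy(a, 4))  # [1 1 0 1 1 1 0 0 0 1 1]
```

All the work happens in vectorized NumPy calls, but it allocates several temporary arrays, which is why a compiled single-pass loop still wins on large inputs.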
