Problem statement:
As the title states, I want to remove parts of a 1D array that consist of consecutive zeros and whose length is equal to or above a threshold.
My solution:
I produced the solution shown in the following MRE:
import numpy as np
THRESHOLD = 4
a = np.array((1,1,0,1,0,0,0,0,1,1,0,0,0,1,0,0,0,0,0,1))
print("Input: " str(a))
# Find the indices of the parts that meet threshold requirement
gaps_above_threshold_inds = np.where(np.diff(np.nonzero(a)[0]) - 1 >= THRESHOLD)[0]
# Delete these parts from array
for idx in gaps_above_threshold_inds:
    # Deleting only zeros keeps the ordering of the nonzero elements stable,
    # so the recomputed nonzero indices remain valid after each deletion
    a = np.delete(a, list(range(np.nonzero(a)[0][idx] + 1, np.nonzero(a)[0][idx + 1])))
print("Output: " + str(a))
Output:
Input: [1 1 0 1 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 1]
Output: [1 1 0 1 1 1 0 0 0 1 1]
Question:
Is there a less complicated and more efficient way to do this on a numpy array?
Edit:
Based on @mozway's comments, I'm editing my question to provide some more information.
Basically, the problem domain is:
- I have 1D signals of length ~20,000 samples
- Some parts of the signals have been zeroed due to noise
- The rest of the signal has non-zero values, in the range ~[50, 250]
My goal, as already stated, is to remove the zeroed parts whose length is at or above a threshold.
More detailed questions:
- As far as efficient numpy handling is concerned, is there a better solution than the one above?
- As far as efficient signal processing techniques are concerned, is there a more suitable way to achieve the desired goal than using numpy?
CodePudding user response:
A quite efficient method is to use itertools.groupby and itertools.chain:
from itertools import groupby, chain
a2 = np.array(list(chain(*(l for k, g in groupby(a)
                           if len(l := list(g)) < THRESHOLD or k))))
output:
array([1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1])
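To see why this works: groupby splits the array into runs of equal values, and the filter keeps a run if it is shorter than the threshold (len(l := list(g)) < THRESHOLD) or if its value is non-zero (or k). A minimal illustration of the run decomposition, using the same a as above (the int() casts are only there to keep the printed output clean):
from itertools import groupby
runs = [(int(k), len(list(g))) for k, g in groupby(a)]
print(runs)
# [(1, 2), (0, 1), (1, 1), (0, 4), (1, 2), (0, 3), (1, 1), (0, 5), (1, 1)]
Only the (0, 4) and (0, 5) runs are dropped, which matches the output above.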
This works relatively fast, for instance on 1 million items:
# a = np.random.randint(2, size=1000000)
%%timeit
np.array(list(chain(*(l for k, g in groupby(a)
                      if len(l := list(g)) < THRESHOLD or k))))
# 254 ms ± 3.03 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
CodePudding user response:
np.delete creates a new array every time it is called, which is very inefficient. A faster approach is to mark all the values to keep in a mask/boolean array and then filter the input array in one go. However, this will still likely require a pure-Python loop if done only with Numpy (over the zero-runs rather than over every element).
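As a minimal sketch of that mask-based variant (the helper name filter_zeros_mask and the padding trick are my own illustration, not from the original post):
import numpy as np

def filter_zeros_mask(arr, threshold):
    # Locate every run of zeros by diffing a padded boolean view:
    # +1 marks a run start, -1 marks one past a run end
    is_zero = np.concatenate(([False], arr == 0, [False]))
    edges = np.diff(is_zero.astype(np.int8))
    starts = np.flatnonzero(edges == 1)
    ends = np.flatnonzero(edges == -1)
    # Keep everything, then clear only the runs that meet the threshold
    keep = np.ones(arr.shape[0], dtype=bool)
    for s, e in zip(starts, ends):  # one iteration per zero-run
        if e - s >= threshold:
            keep[s:e] = False
    return arr[keep]
The remaining loop runs once per zero-run rather than once per element, but it is still Python-level. A simpler and faster solution is to use Numba (or Cython) to do the whole filtering in one compiled pass. Here is an implementation: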
import numpy as np
import numba as nb

@nb.njit('int_[:](int_[:], int_)')
def filterZeros(arr, threshold):
    n = len(arr)
    res = np.empty(n, dtype=arr.dtype)
    count = 0  # length of the current run of zeros
    j = 0      # write position in res
    for i in range(n):
        if arr[i] == 0:
            count += 1
        else:
            # A non-zero value ends the run: roll back the zeros
            # just written if the run met the threshold
            if count >= threshold:
                j -= count
            count = 0
        res[j] = arr[i]
        j += 1
    # Handle a qualifying run of zeros at the very end of the array
    if n > 0 and arr[n-1] == 0 and count >= threshold:
        j -= count
    return res[0:j]
a = np.array((1,1,0,1,0,0,0,0,1,1,0,0,0,1,0,0,0,0,0,1))
a = filterZeros(a, 4)
print("Output: " str(a))
Here are the results with a random binary array containing 100,000 items on my machine:
Reference implementation: 5982 ms
Mozway's solution: 23.4 ms
This implementation: 0.11 ms
Thus, this solution is about 54,381 times faster than the initial one and 212 times faster than Mozway's. The code can even be ~30% faster by working in-place (destroying the input array) and by telling Numba the array is contiguous in memory (using ::1 instead of : in the signature).
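For example, the contiguity hint is only a change in the type declaration; a sketch (the in-place variant is not shown here):
@nb.njit('int_[::1](int_[::1], int_)')  # ::1 = C-contiguous 1D array
def filterZeros(arr, threshold):
    ...  # body identical to the version above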