Given a binary sequence, say: 000110111100011, I wish to mark all sections of contiguous 1s.
000110111100011
0123456789abcdef
I'll use hex notation for the positions, since it keeps each index to a single character.
This should return (3 5) (6 a) (d f)
The first region occupies positions 3 and 4, so (using Python's half-open slice convention) (3, 5) denotes the region.
However, occasionally a bit is corrupted, so I wish to add a tolerance: any gap of fewer than min_gap consecutive 0s should be ignored.
So, setting min_gap=2 in the above example, I would like to get: (3 a) (d f)
i.e. the `5) (6` is removed, as that gap comprises just one zero, and is 'patched over'.
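For reference, here is a plain-Python loop that does what I mean (illustrative only; the name mark_regions is made up, and this is exactly the kind of loop I'd like to avoid):

def mark_regions(bits, mingap=1):
    # Reference implementation of the behaviour described above.
    regions = []
    start = None      # start index of the region currently being built
    zeros = 0         # length of the current run of 0s inside that region
    for i, b in enumerate(bits):
        if b:
            if start is None:
                start = i
            zeros = 0
        elif start is not None:
            zeros += 1
            if zeros >= mingap:              # gap too long: close the region
                regions.append((start, i - zeros + 1))
                start = None
                zeros = 0
    if start is not None:                    # region runs to the end of the sequence
        regions.append((start, len(bits) - zeros))
    return regions

# mark_regions([0,0,0,1,1,0,1,1,1,1,0,0,0,1,1], mingap=1) -> [(3, 5), (6, 10), (13, 15)]
# mark_regions([0,0,0,1,1,0,1,1,1,1,0,0,0,1,1], mingap=2) -> [(3, 10), (13, 15)]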
Seems simple, but I struggled to fully vectorize a solution in numpy.
import numpy as np

def get_segments(segment_of_sums, mingap=1):
    nonempty_padded = np.concatenate(([0], segment_of_sums, [0]))
    b = nonempty_padded > 0
    edges = b[:-1] ^ b[1:]
    indices = np.argwhere(edges)[:, 0]
    # indices[0] will always be the first on-ramp, and indices[-1] will be the last off-ramp
    # if segment ends with a 1, indices[-1] will be len(segment)
    # len(indices) will ALWAYS be an even number
    if mingap > 1 and len(indices) > 2:
        gap_lengths = indices[2::2] - indices[1::2][:-1]
        gap_keeps = gap_lengths >= mingap
        index_keeps = np.zeros_like(indices, dtype=np.bool8)
        index_keeps[[0, -1]] = True
        for i, flag in enumerate(gap_keeps):
            if flag:
                index_keeps[[2*i + 1, 2*i + 2]] = True
        indices = indices[np.argwhere(index_keeps)[:, 0]]
    return indices.reshape((-1, 2))
TEST = True
if TEST:
    for L in [
        [0,1,0, 1,1,1, 0,0,1, 1,1,0],
        [0,1,0, 1,1,1, 0,0,1, 1,1,1],
        [1,1,0, 1,1,1, 0,0,1, 1,1,1],
    ]:
        A = np.array(L, dtype=np.bool8)
        print(
            get_segments(A, mingap=2)
        )
This code correctly prints:
[[ 1 6]
[ 8 11]]
[[ 1 6]
[ 8 12]]
[[ 0 6]
[ 8 12]]
However, the for loop feels clumsy.
Can anyone see a cleaner technique?
CodePudding user response:
Not sure you can easily vectorize this, but here is a solution using itertools.groupby that should be quite fast:
from itertools import groupby

s = '000110111100011'

[('%x' % (x := list(g))[0][0], '%x' % (x[-1][0] + 1))
 for k, g in groupby(enumerate(s), lambda x: x[1])
 if k == '1']
Output:
[('3', '5'), ('6', 'a'), ('d', 'f')]
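This doesn't apply the gap tolerance from the question, but if you produce the segments as plain integer (start, end) pairs (i.e. drop the '%x' formatting), a small merging pass could be layered on top. A sketch (the helper name is made up):

def merge_close_segments(segments, mingap=1):
    # Merge sorted, non-overlapping (start, end) pairs whose separating
    # gap is shorter than mingap.
    merged = []
    for start, end in segments:
        if merged and start - merged[-1][1] < mingap:
            merged[-1] = (merged[-1][0], end)    # patch over the small gap
        else:
            merged.append((start, end))
    return merged

# merge_close_segments([(3, 5), (6, 10), (13, 15)], mingap=2) -> [(3, 10), (13, 15)]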
CodePudding user response:
If you consider another package, pandas offers good vectorized forward-fill and groupby operations, which are ideal for your use case. Plus, the overhead over numpy is small:
import pandas as pd

d = pd.DataFrame({'a': list('0001101111000110'),   # extra `0` at end
                  'b': list('0123456789abcdef')})

thresh = 2
s = d['a'].eq('1')
s = s.where(s).ffill(limit=thresh-1)   # here's where forward fill comes into play

(d.where(s.notna()).groupby(s.isna().cumsum())
  .agg(min_idx=('b', 'first'), max_idx=('b', 'last'))
  .dropna()
)
You would get something like this:
  min_idx max_idx
a
3       3       a
5       d       f
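And if you'd rather have plain (start, end) tuples, the result of the chain above (called res here, just for illustration) can be converted with itertuples:

res = (d.where(s.notna()).groupby(s.isna().cumsum())
        .agg(min_idx=('b', 'first'), max_idx=('b', 'last'))
        .dropna()
      )
segments = list(res.itertuples(index=False, name=None))
# [('3', 'a'), ('d', 'f')]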