numpy vectorization of algorithm to return indices of contiguous sections in binary sequence-CodePudding

Given a binary sequence, say: 000110111100011, I wish to mark all sections of contiguous 1s.

000110111100011
0123456789abcdef

I'll use hex notation, as 16 is a good length for an example sequence.

This should return (3 5) (6 a) (d f) The first region occupies positions 3 and 4, therefore (using Python's convention), (3,5) denotes the region.

However occasionally a bit is corrupted, so I wish to add a tolerance so that it ignores up to k consecutive 0s.

So, setting min_gap=2 in the above example, I would like to get: (3 a) (d f).

i.e. the 5) (6 is removed, as that gap comprises just one zero, and is 'patched over'.

Seems simple, but I struggled to fully vectorize a solution in numpy.

def get_segments(segment_of_sums, mingap=1):
    nonempty_padded = np.concatenate(( [0], segment_of_sums, [0] ))
    b = nonempty_padded > 0
    edges = b[:-1] ^ b[1:]
    indices = np.argwhere(edges)[:,0]
    # indices[0] will always be the first on-ramp, and indices[-1] will be the last off-ramp
    # if segment ends with a 1, indices[-1] will be len(segment)
    # len(indices) will ALWAYS be an even number
 
    if mingap>1 and len(indices) > 2:
        gap_lengths = indices[2::2] - indices[1::2][:-1]
        gap_keeps = gap_lengths >= mingap

        index_keeps = np.zeros_like(indices, dtype=np.bool8)
        index_keeps[[0, -1]] = True
        for i, flag in enumerate(gap_keeps):
            if flag:
                index_keeps[[2*i 1, 2*i 2]] = True
        
        indices = indices[np.argwhere(index_keeps)[:,0]]

    return indices.reshape((-1,2))

TEST = True
if TEST:
    for L in [
        [0,1,0, 1,1,1, 0,0,1, 1,1,0],
        [0,1,0, 1,1,1, 0,0,1, 1,1,1],
        [1,1,0, 1,1,1, 0,0,1, 1,1,1]
    ]:
        A = np.array(L, dtype=np.bool8)
        print(
            get_segments(A, mingap=2)
        )

This code correctly prints:

[[ 1  6]
 [ 8 11]]
[[ 1  6]
 [ 8 12]]
[[ 0  6]
 [ 8 12]]

However the for loop feels clumsy.

Can anyone see a cleaner technique?

CodePudding user response：

Not sure you can easily vectorize this, but here is a solution using itertools.groupby that should be quite fast:

s = '000110111100011'

from itertools import groupby

[('%x' % (x:=list(g))[0][0], '%x' % (x[-1][0] 1))
 for k,g in groupby(enumerate(s), lambda x: x[1])
 if k == '1']

Output:

[('3', '5'), ('6', 'a'), ('d', 'f')]

CodePudding user response：

If you consider another package, pandas offers good vectorized forward fill and groupby operators, which is ideal for your use case. Plus, overhead over numpy is small:

import pandas as pd

d=pd.DataFrame({'a':list('0001101111000110'),  # extra `0` at end
                'b':list('0123456789abcdef')}
              )

thresh = 2
s = d['a'].eq('1')

s = s.where(s).ffill(limit=thresh-1)           # here's where forward fills comes into play

(d.where(s.notna()).groupby(s.isna().cumsum())
  .agg(min_idx=('b','first'), max_idx=('b','last'))
  .dropna()
)

You would get something like this:

  min_idx max_idx
a                
3       3       a
5       d       f