Numpy: Get indices of boundaries in an array where starts of boundaries always start with a particul-CodePudding

Problem:

The most computationally efficient solution to getting the indices of boundaries in an array where starts of boundaries always start with a particular number and non-boundaries are indicated by a different particular number.

Differences between this question and other boundary-based numpy questions on SO:

here are some other boundary based numpy questions

Numpy 1D array - find indices of boundaries of subsequences of the same number

Getting the boundary of numpy array shape with a hole

Extracting boundary of a numpy array

The difference between the question I am asking and other stackoverflow posts in my attempt to search for a solution is that the other boundaries are indicated by a jump in value, or a 'hole' of values.

What seems to be unique to my case is the starts of boundaries always start with a particular number.

Motivation:

This problem is inspired by IOB tagging in natural language processing. In IOB tagging, the start of a word is tagged with B [beginning] is the tag of the first letter in an entity, I [inside] is the tag for all other characters besides the first character in a word, and [O] is used to tag all non-entity characters

Example:

import numpy as np

a = np.array(
    [
     0, 0, 0, 1, 2, 2, 2, 0, 0, 0, 1, 0, 0, 1, 2, 1, 1, 0, 0, 1, 1, 1
    ]
)

1 is the start of each boundary. If a boundary has a length greater than one, then 2 makes up the rest of the boundary. 0 are non-boundary numbers.

The entities of these boundaries are 1, 2, 2, 2, 1, 1,2, 1, 1, 1, 1, 1

So the desired solution; the indices of the indices boundary values for a are

desired = [[3, 6], [10, 10], [13, 14], [15, 15], [16,16], [19,19], [20,20], [21,21]]

Current Solution:

If flattened, the numbers in the desired solution are in ascending order. So the raw indices numbers can be calculated, sorted, and reshaped later.

I can get the start indices using

starts = np.where(a==1)[0]
starts

array([ 3, 10, 13, 15, 16, 19, 20, 21])

So what's left is 6, 10, 14, 15,16,19,20,21

I can get all except 1 using 3 different conditionals where I can compare a shifted array to the original by decreases in values and the values of the non-shifted array.

first = np.where(a[:-1] - 2 == a[1:])[0]
first

array([6])

second = np.where((a[:-1] - 1 == a[1:]) & 
    ((a[1:]==1) | (a[1:]==0)))[0]
second

array([10, 14, 16])

third = np.where(
    (a[:-1] == a[1:]) &
    (a[1:]==1)
    )[0]
third

array([15, 19, 20])

The last number I need is 21, but since I needed to shorten the length of the array by 1 to do the shifted comparisons, I'm not sure how to get that particular value using logic, so I just used a simple if statement for that.

Using the rest of the retrieved values for the indices, I can concatenate all the values and reshape them.

if (a[-1] == 1) | (a[-1] == 2):
    pen = np.concatenate((
        starts, first, second, third, np.array([a.shape[0]-1])
    ))
else:
    pen = np.concatenate((
        starts, first, second, third, 
    ))
np.sort(pen).reshape(-1,2)

array([[ 3,  6],
       [10, 10],
       [13, 14],
       [15, 15],
       [16, 16],
       [19, 19],
       [20, 20],
       [21, 21]])

Is this the most computationally efficient solution for my answer? I realize the four where statements can be combined with or operators but wanted to have each separate for the reader to see each result in this post. But I am wondering if there is a more computationally efficient solution since I have not mastered all of numpy's functions and am unsure of the computational efficiency of each.

CodePudding user response：

A standard trick for this type of problem is to pad the input appropriately. In this case, it is helpful to append a 0 to the end of the array:

In [55]: a1 = np.concatenate((a, [0]))

In [56]: a1
Out[56]: 
array([0, 0, 0, 1, 2, 2, 2, 0, 0, 0, 1, 0, 0, 1, 2, 1, 1, 0, 0, 1, 1, 1,
       0])

Then your starts calculation still works:

In [57]: starts = np.where(a1 == 1)[0]

In [58]: starts
Out[58]: array([ 3, 10, 13, 15, 16, 19, 20, 21])

The condition for the end is that the value is a 1 or a 2 followed by a value that is not 2. You've already figured out that to handle the "followed by" condition, you can use a shifted version of the array. To implement the and and or conditions, use the bitwise binary operators & and |, respectiveley. In code, it looks like:

In [61]: ends = np.where((a1[:-1] != 0) & (a1[1:] != 2))[0]

In [62]: ends
Out[62]: array([ 6, 10, 14, 15, 16, 19, 20, 21])

Finally, put starts and ends into a single array:

In [63]: np.column_stack((starts, ends))
Out[63]: 
array([[ 3,  6],
       [10, 10],
       [13, 14],
       [15, 15],
       [16, 16],
       [19, 19],
       [20, 20],
       [21, 21]])