How to get every value after a certain character in a list/array

For example, if I have a list

data = ['O', 'O', 'B', 'I', 'I', 'B', 'I', 'O', 'B', 'I']

How can I get every index after B (including B), until it meets another B or O? For example:

output => [[2,3,4],[5,6],[8,9]]

Because the first B, I, I is at indices 2, 3, 4; the second, B, I, is at indices 5 and 6; and the last, B, I, is at indices 8 and 9.

Another example

data = ['B', 'I', 'I', 'O', 'O', 'B', 'I', 'B', 'I', 'I', 'O']
output => [[0, 1, 2], [5, 6], [7, 8, 9]]

I am thinking of iterating over the list and checking the elements one by one. But is there a cleaner and more efficient way to do it? Thank you

CodePudding user response：

You can find the indices of 'B' and slice the list at them. Each sublist can then be cut at the first 'O', and the indices are rebuilt with a simple loop.

def sub_list(lst, index):
    lst = lst[:lst.index('O') if 'O' in lst else len(lst)]
    # or with itertools:
    # lst = list(itertools.takewhile(lambda x: x != 'O', lst))
    return [i + index for i in range(len(lst))]


def get_indices(data):
    indices = [i for i, x in enumerate(data) if x == 'B'] + [len(data)]
    return [sub_list(data[indices[i]:indices[i + 1]], indices[i]) for i in range(len(indices) - 1)]


data1 = ['O', 'O', 'B', 'I', 'I', 'B', 'I', 'O', 'B', 'I']
data2 = ['B', 'I', 'I', 'O', 'O', 'B', 'I', 'B', 'I', 'I', 'O']
print(get_indices(data1)) # [[2, 3, 4], [5, 6], [8, 9]]
print(get_indices(data2)) # [[0, 1, 2], [5, 6], [7, 8, 9]]
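
The itertools alternative mentioned in the comment above can be written out in full. This is only a sketch of that variant: it keeps the same function names, and pairing the start indices with zip is my own choice rather than part of the answer.

```python
from itertools import takewhile


def sub_list(lst, index):
    # Keep the leading elements up to (not including) the first 'O'.
    kept = list(takewhile(lambda x: x != 'O', lst))
    return list(range(index, index + len(kept)))


def get_indices(data):
    # Positions of every 'B', plus the end of the list as a final boundary.
    starts = [i for i, x in enumerate(data) if x == 'B'] + [len(data)]
    # Pair each 'B' position with the next boundary and slice between them.
    return [sub_list(data[a:b], a) for a, b in zip(starts, starts[1:])]


data1 = ['O', 'O', 'B', 'I', 'I', 'B', 'I', 'O', 'B', 'I']
data2 = ['B', 'I', 'I', 'O', 'O', 'B', 'I', 'B', 'I', 'I', 'O']
print(get_indices(data1))  # [[2, 3, 4], [5, 6], [8, 9]]
print(get_indices(data2))  # [[0, 1, 2], [5, 6], [7, 8, 9]]
```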

CodePudding user response：

Normally I would recommend trying to work out a solution using grouping functions from modules such as itertools or more_itertools. However, since you have two different splitter characters, 'B' and 'O', which behave differently, and since you want to produce lists of indices such as [[2,3,4],[5,6],[8,9]] rather than groups of values, such as ['BII', 'BI', 'BI'], it will be a bit cumbersome. You'll have to use enumerate to get the indices along with the values, split depending on the values, do some extra work to differentiate B and O, then discard the values and keep only the indices.

Using `more_itertools.split_before`

Module more_itertools has several functions that are very good for grouping/slicing/splitting/windowing, such as more_itertools.split_before:

from more_itertools import split_before

print(list(split_before('OOBIIBIOBI', lambda c: c in 'BO')))
# [['O'], ['O'], ['B', 'I', 'I'], ['B', 'I'], ['O'], ['B', 'I']]

def split_B_O(seq):
    yield from (next(zip(*l)) for l in split_before(enumerate(seq), lambda p: p[1] in 'BO') if l[0][1] == 'B')

print(list(split_B_O('OOBIIBIOBI')))
# [(2, 3, 4), (5, 6), (8, 9)]

print(list(split_B_O('BIIOOBIBIIO')))
# [(0, 1, 2), (5, 6), (7, 8, 9)]

print(list(split_B_O('OBI-WAN KENOBI')))
# [(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), (12, 13)]

Using `itertools.groupby`

Grouping adjacent values is the purpose of the function itertools.groupby. It groups the values according to a "key", which is a function passed as a parameter. Here we'll write a key that returns a new identifier every time it encounters an 'O' or a 'B', and returns the same identifier as before if it encounters any other character.

from itertools import groupby

def k(c):
    if c[1] in 'OB':
        k.idx += 1
    return k.idx

k.idx = 0

def split_B_O(seq):
    k.idx = 0
    for _,g in groupby(enumerate(seq), k):
        g = list(g)
        if g[0][1] == 'B':
            yield next(zip(*g))

print(list(split_B_O('OOBIIBIOBI')))
# [(2, 3, 4), (5, 6), (8, 9)]

print(list(split_B_O('BIIOOBIBIIO')))
# [(0, 1, 2), (5, 6), (7, 8, 9)]

print(list(split_B_O('OBI-WAN KENOBI')))
# [(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), (12, 13)]
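
To make the key idea more visible, here is a sketch that precomputes the identifiers instead of storing state on the function. The names group_ids and split_B_O_stateless are made up for this illustration, and using itertools.accumulate is my assumption about a convenient way to get a running count; it is not part of the approach above.

```python
from itertools import accumulate, groupby


def group_ids(seq):
    """Running count of 'B'/'O' markers seen so far, one id per element."""
    return list(accumulate(int(c in ('B', 'O')) for c in seq))


def split_B_O_stateless(seq):
    ids = group_ids(seq)
    result = []
    # Elements sharing an id belong to the same run started by a 'B' or an 'O'.
    for _, g in groupby(enumerate(seq), key=lambda p: ids[p[0]]):
        g = list(g)
        if g[0][1] == 'B':  # keep only the runs that start with 'B'
            result.append([i for i, _ in g])
    return result


print(group_ids('OOBIIBIOBI'))
# [1, 2, 3, 3, 3, 4, 4, 5, 6, 6]
print(split_B_O_stateless('OOBIIBIOBI'))
# [[2, 3, 4], [5, 6], [8, 9]]
```

The first printed line shows the identifier each character receives: every 'O' or 'B' bumps the counter, and the following 'I' characters inherit it.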

Writing your own generator

However, since 'B' and 'O' play different roles in the splitting, and most splitting functions found in generic Python modules can't account for two splitter characters with different roles, I think it's easier to write the function yourself with a for-loop and a couple of variables.

def split_B_O(seq):
    inside_group = False
    j = 0
    for i,c in enumerate(seq):
        if c == 'O' and inside_group:
            yield range(j,i)
            inside_group = False
        elif c == 'B' and inside_group:
            yield range(j, i)
            j = i
        elif c == 'B':
            j = i
            inside_group = True
    if inside_group:
        yield range(j, i + 1)

print(list(split_B_O(['O', 'O', 'B', 'I', 'I', 'B', 'I', 'O', 'B', 'I'])))
# [range(2, 5), range(5, 7), range(8, 10)]

print(list(split_B_O(['B', 'I', 'I', 'O', 'O', 'B', 'I', 'B', 'I', 'I', 'O'])))
# [range(0, 3), range(5, 7), range(7, 10)]

print([list(r) for r in split_B_O('OBI-WAN KENOBI')])
# [[1, 2, 3, 4, 5, 6, 7, 8, 9, 10], [12, 13]]

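If for some reason you absolutely dislike range objects, you can wrap each yielded range in list(), for example replacing yield range(j, i + 1) with yield list(range(j, i + 1)) (and likewise for the two yield range(j, i) lines), to get lists of indices instead. But range objects are great, so I recommend against it.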
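
As a rough illustration of why range objects are convenient here, the sketch below turns each range back into the matching slice of the original list. The helper name values_for is hypothetical, and the ranges are simply copied from the first output above; it relies on the ranges having a step of 1.

```python
def values_for(data, r):
    """Return the elements of `data` covered by the index range `r` (step 1)."""
    return data[r.start:r.stop]


data = ['O', 'O', 'B', 'I', 'I', 'B', 'I', 'O', 'B', 'I']
groups = [range(2, 5), range(5, 7), range(8, 10)]  # copied from the output above

for r in groups:
    print(list(r), values_for(data, r), len(r))
# [2, 3, 4] ['B', 'I', 'I'] 3
# [5, 6] ['B', 'I'] 2
# [8, 9] ['B', 'I'] 2
```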

CodePudding user response：

Try this!

data = ['O', 'O', 'B', 'I', 'I', 'B', 'I', 'O', 'B', 'I']

res = []

i = -1
is_first = False
o_found = False
for index, value in enumerate(data):
    if value == 'O':
        o_found=True
    if value != 'B' and o_found:
        continue
    if value == 'B':
        i += 1
        is_first = True
        o_found=False
        res.append([])
    if is_first and o_found is False:
        res[i].append(index)

print(res)  # [[2, 3, 4], [5, 6], [8, 9]]
