For example, if I have a list:
data = ['O', 'O', 'B', 'I', 'I', 'B', 'I', 'O', 'B', 'I']
How can I get every index starting at a B (including the B itself) until the next B or O? For example:
output => [[2,3,4],[5,6],[8,9]]
Because the first B, I, I is at indices 2, 3, 4; the second is B, I, at indices 5 and 6; and the last is B, I, at indices 8 and 9.
Another example:
data = ['B', 'I', 'I', 'O', 'O', 'B', 'I', 'B', 'I', 'I', 'O']
output => [[0, 1, 2], [5, 6], [7, 8, 9]]
I am thinking of iterating over the list and checking the items one by one. But is there a cleaner and more efficient way to do it? Thank you.
CodePudding user response:
You can find the indices of 'B' and slice the list at them. Each sublist can then be cut at the first 'O', and the indices rebuilt with a simple loop:
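def sub_list(lst, index):
    # Keep only the items before the first 'O' (or the whole sublist if there is none)
    lst = lst[:lst.index('O') if 'O' in lst else len(lst)]
    # or with itertools:
    # lst = list(itertools.takewhile(lambda x: x != 'O', lst))
    # Shift the local positions back to positions in the original list
    return [i + index for i in range(len(lst))]

def get_indices(data):
    # Positions of every 'B', plus len(data) as the end of the last slice
    indices = [i for i, x in enumerate(data) if x == 'B'] + [len(data)]
    return [sub_list(data[indices[i]:indices[i + 1]], indices[i]) for i in range(len(indices) - 1)]
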
data1 = ['O', 'O', 'B', 'I', 'I', 'B', 'I', 'O', 'B', 'I']
data2 = ['B', 'I', 'I', 'O', 'O', 'B', 'I', 'B', 'I', 'I', 'O']
print(get_indices(data1)) # [[2, 3, 4], [5, 6], [8, 9]]
print(get_indices(data2)) # [[0, 1, 2], [5, 6], [7, 8, 9]]
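
The commented-out itertools alternative can be written out in full. The following is only a sketch of that variant: the name get_indices_takewhile is made up for illustration, and it assumes the same 'B'/'I'/'O' tags as in the question.

```python
from itertools import takewhile

def get_indices_takewhile(data):
    # Positions of every 'B', plus len(data) as a sentinel end for the last slice
    starts = [i for i, x in enumerate(data) if x == 'B'] + [len(data)]
    result = []
    for start, stop in zip(starts, starts[1:]):
        # Keep the items from this 'B' up to (not including) the first 'O'
        run = list(takewhile(lambda x: x != 'O', data[start:stop]))
        result.append(list(range(start, start + len(run))))
    return result

data1 = ['O', 'O', 'B', 'I', 'I', 'B', 'I', 'O', 'B', 'I']
print(get_indices_takewhile(data1))  # [[2, 3, 4], [5, 6], [8, 9]]
```

Pairing starts with starts[1:] replaces the indices[i] / indices[i + 1] bookkeeping, which is a matter of taste rather than a functional change.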
CodePudding user response:
Normally I would recommend trying to work out a solution using grouping functions from modules such as itertools or more_itertools. However, since you have two different splitter characters, 'B' and 'O', which behave differently, and since you want to produce lists of indices such as [[2,3,4],[5,6],[8,9]] rather than groups of values such as ['BII', 'BI', 'BI'], it will be a bit cumbersome. You'll have to use enumerate to get the indices along with the values, split depending on the values, do some extra work to differentiate B and O, then discard the values and keep only the indices.
Using more_itertools.split_before
The more_itertools module has several functions that are very good for grouping/slicing/splitting/windowing, such as more_itertools.split_before:
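from more_itertools import split_before

print(list(split_before('OOBIIBIOBI', lambda c: c in 'BO')))
# [['O'], ['O'], ['B', 'I', 'I'], ['B', 'I'], ['O'], ['B', 'I']]

def split_B_O(seq):
    # Split the (index, value) pairs before every 'B' or 'O', keep only the
    # groups that start with 'B', and reduce each group to its tuple of indices.
    yield from (next(zip(*l)) for l in split_before(enumerate(seq), lambda p: p[1] in 'BO') if l[0][1] == 'B')
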
print(list(split_B_O('OOBIIBIOBI')))
# [(2, 3, 4), (5, 6), (8, 9)]
print(list(split_B_O('BIIOOBIBIIO')))
# [(0, 1, 2), (5, 6), (7, 8, 9)]
print(list(split_B_O('OBI-WAN KENOBI')))
# [(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), (12, 13)]
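
The next(zip(*l)) part is probably the least obvious step, so here is a small standalone sketch of what it does to one group of (index, value) pairs. It uses only built-ins; the variable names are just for illustration.

```python
seq = 'OOBIIBIOBI'

# One group of (index, value) pairs, as enumerate produces them
pairs = list(enumerate(seq))[2:5]
print(pairs)                    # [(2, 'B'), (3, 'I'), (4, 'I')]

# zip(*pairs) transposes the pairs: one tuple of indices, one tuple of values
indices, values = zip(*pairs)
print(indices)                  # (2, 3, 4)
print(values)                   # ('B', 'I', 'I')

# zip is lazy, so next() takes only the first tuple (the indices)
first_only = next(zip(*pairs))
print(first_only)               # (2, 3, 4)
```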
Using itertools.groupby
Grouping adjacent values is the purpose of the function itertools.groupby. It groups the values according to a "key", which is a function passed as a parameter. Here we'll write a key that returns a different identifier every time it encounters an 'O' or a 'B', and returns the same identifier as previously if it encounters any other character.
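from itertools import groupby

def k(c):
    # c is an (index, value) pair; move to a new group id at every 'O' or 'B'
    if c[1] in 'OB':
        k.idx += 1
    return k.idx
k.idx = 0

def split_B_O(seq):
    k.idx = 0  # reset the counter on every call
    for _, g in groupby(enumerate(seq), k):
        g = list(g)
        if g[0][1] == 'B':
            yield next(zip(*g))
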
print(list(split_B_O('OOBIIBIOBI')))
# [(2, 3, 4), (5, 6), (8, 9)]
print(list(split_B_O('BIIOOBIBIIO')))
# [(0, 1, 2), (5, 6), (7, 8, 9)]
print(list(split_B_O('OBI-WAN KENOBI')))
# [(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), (12, 13)]
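
The key function above keeps its counter in a function attribute (k.idx), which is shared between calls. If that bothers you, one possible variation is to compute the group ids up front with itertools.accumulate. This is only a sketch, and the name split_B_O_ids is made up for illustration.

```python
from itertools import accumulate, groupby

def split_B_O_ids(seq):
    seq = list(seq)
    # Running count of 'B'/'O' markers: every marker starts a new group id
    ids = accumulate(int(c in ('B', 'O')) for c in seq)
    # Group the (id, (index, value)) triples by id
    for _, g in groupby(zip(ids, enumerate(seq)), key=lambda t: t[0]):
        g = [pair for _, pair in g]
        # Keep only the groups that start with 'B', reduced to their indices
        if g[0][1] == 'B':
            yield tuple(i for i, _ in g)

print(list(split_B_O_ids('OOBIIBIOBI')))
# [(2, 3, 4), (5, 6), (8, 9)]
```

The trade-off is that the input is materialized into a list first, so this version is not lazy on its input.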
Writing your own generator
However, since 'B' and 'O' play different roles in the splitting, and most splitting functions found in generic Python modules can't account for two different roles of splitting character, I think it's easier to write the function yourself with a for-loop and some variables.
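def split_B_O(seq):
    inside_group = False
    j = 0  # start index of the current group
    for i, c in enumerate(seq):
        if c == 'O' and inside_group:
            # An 'O' closes the current group
            yield range(j, i)
            inside_group = False
        elif c == 'B' and inside_group:
            # A 'B' closes the current group and opens a new one
            yield range(j, i)
            j = i
        elif c == 'B':
            j = i
            inside_group = True
    # The sequence ended while a group was still open
    if inside_group:
        yield range(j, i + 1)
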
print(list(split_B_O(['O', 'O', 'B', 'I', 'I', 'B', 'I', 'O', 'B', 'I'])))
# [range(2, 5), range(5, 7), range(8, 10)]
print(list(split_B_O(['B', 'I', 'I', 'O', 'O', 'B', 'I', 'B', 'I', 'I', 'O'])))
# [range(0, 3), range(5, 7), range(7, 10)]
print([list(r) for r in split_B_O('OBI-WAN KENOBI')])
# [[1, 2, 3, 4, 5, 6, 7, 8, 9, 10], [12, 13]]
If for some reason you absolutely dislike range objects, you can replace yield range(j, i + 1) with yield list(range(j, i + 1)) (and likewise for the two yield range(j, i) lines) to get lists of indices instead. But range objects are great, so I recommend against it.
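
To illustrate why range objects are convenient here, a short sketch of what you can do with one of the yielded ranges without converting it to a list first (the data list is the one from the question):

```python
data = ['O', 'O', 'B', 'I', 'I', 'B', 'I', 'O', 'B', 'I']
r = range(2, 5)              # the first group, as the generator yields it

print(len(r))                # 3
print(list(r))               # [2, 3, 4]
print(3 in r)                # True
print(data[r.start:r.stop])  # ['B', 'I', 'I']
```

A range stores only its start, stop and step, so it stays small however long the group is, and r.start / r.stop can be used directly to slice the original data.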
CodePudding user response:
Try this!
data = ['O', 'O', 'B', 'I', 'I', 'B', 'I', 'O', 'B', 'I']
res = []
i = -1            # index of the current group in res
is_first = False  # becomes True once the first 'B' has been seen
o_found = False   # True while we are past an 'O' and before the next 'B'
for index, value in enumerate(data):
    if value == 'O':
        o_found = True
    if value != 'B' and o_found:
        continue
    if value == 'B':
        i += 1
        is_first = True
        o_found = False
        res.append([])
    if is_first and o_found is False:
        res[i].append(index)
print(res)
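
The same idea can be wrapped in a function with a bit less state, by keeping a reference to the group currently being filled. This is just one possible sketch, and the name group_indices is made up for illustration.

```python
def group_indices(tags):
    groups = []
    current = None  # the group being filled, or None when outside a group
    for index, tag in enumerate(tags):
        if tag == 'B':
            # A 'B' always opens a new group
            current = [index]
            groups.append(current)
        elif tag == 'O':
            # An 'O' closes the current group, if any
            current = None
        elif current is not None:
            # Any other tag extends the open group
            current.append(index)
    return groups

print(group_indices(['O', 'O', 'B', 'I', 'I', 'B', 'I', 'O', 'B', 'I']))
# [[2, 3, 4], [5, 6], [8, 9]]
print(group_indices(['B', 'I', 'I', 'O', 'O', 'B', 'I', 'B', 'I', 'I', 'O']))
# [[0, 1, 2], [5, 6], [7, 8, 9]]
```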