I have a column with annotations of sentences in IOB format. A row looks roughly like this:
data['labels'][0] = "['O', 'O', 'O', 'B-l1', 'O', 'B-l1', 'I-l2', 'I-l2', 'O', 'I-l2']"
I want to get the unique labels: 'O', 'B-l1', and 'I-l2'. The idea is to remove all rows that are not annotated, meaning the only label in the list is 'O'.
This is my current code:
list(set(data['labels'][0]))
But it returns each character as a separate element:
'O',
'B',
'-',
'l',
'1',
'I',
'2',
','
which is not what I am looking for.
I would appreciate some help here. Thanks.
CodePudding user response:
To filter your rows, you can use set operations:
S = {'O'}
data[[not S.issuperset(l) for l in data['labels']]]
Example input:
import pandas as pd
data = pd.DataFrame({'labels': [['O'], ['O', 'B-l1'], []]})
Output:
      labels
1  [O, B-l1]
Converting from strings
If you have string representations of lists, parse them first (this also reduces each row to its unique labels):
import ast
data['labels'] = [list(set(ast.literal_eval(l))) for l in data['labels']]
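Putting the parsing and the filtering together, here is a minimal sketch. It assumes the column holds string representations of lists, as in the question; the sample frame and the names `parsed`, `unique_labels`, `is_annotated` and `annotated` are made up for illustration.

```python
import ast

import pandas as pd

# Hypothetical sample data: each cell is a *string* that looks like a list.
data = pd.DataFrame({
    'labels': [
        "['O', 'O', 'O']",
        "['O', 'B-l1', 'I-l2', 'O']",
        "['O', 'B-l1', 'O', 'B-l1']",
    ]
})

# Parse each string into a real Python list.
parsed = data['labels'].apply(ast.literal_eval)

# Unique labels per row (sorted so the order is stable).
data['unique_labels'] = parsed.apply(lambda labels: sorted(set(labels)))

# Keep only rows that contain at least one label other than 'O'.
S = {'O'}
is_annotated = [not S.issuperset(labels) for labels in parsed]
annotated = data[is_annotated]
```

Calling `set()` directly on the unparsed string iterates over its characters, which is why the original attempt returned single symbols. Parsing first avoids that.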
CodePudding user response:
Another possible solution, based on numpy.unique:
import numpy as np
lst = ['O', 'O', 'O', 'B-l1', 'O', 'B-l1', 'I-l2', 'I-l2', 'O', 'I-l2']
np.unique(lst).tolist()
Output:
['B-l1', 'I-l2', 'O']
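If the label counts are also useful, `numpy.unique` can return them via `return_counts=True`. The sketch below assumes the row is already a real list, not a string; `label_counts` and `has_annotation` are illustrative names only.

```python
import numpy as np

lst = ['O', 'O', 'O', 'B-l1', 'O', 'B-l1', 'I-l2', 'I-l2', 'O', 'I-l2']

# np.unique returns the sorted unique values and, optionally, their counts.
labels, counts = np.unique(lst, return_counts=True)
label_counts = dict(zip(labels.tolist(), counts.tolist()))

# A row counts as annotated if any label other than 'O' remains.
has_annotation = bool(set(labels.tolist()) - {'O'})
```

Note that `np.unique` sorts its result, so the original order of the labels is not preserved.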