Get unique elements from list row in pandas

I have a column with annotations of sentences in IOB format. A row looks roughly like this:

data['labels'][0] = "['O', 'O', 'O', 'B-l1', 'O', 'B-l1', 'I-l2', 'I-l2', 'O', 'I-l2']"

I want to get the unique labels: 'O', 'B-l1', and 'I-l2'. The idea is to remove all rows that are not annotated, meaning the only label in the list is 'O'.

This is my current code:

list(set(data['labels'][0]))

But it returns the individual characters of the string, each on a new row:

'O',
'B',
'-',
'l',
'1',
'I',
'2',
','

which is not what I am looking for.

I would appreciate some help here. Thanks.

CodePudding user response:

To filter your rows, you can use set operations:

S = {'O'}

data[[not S.issuperset(l) for l in data['labels']]]

Example input:

import pandas as pd

data = pd.DataFrame({'labels': [['O'], ['O', 'B-l1'], []]})

Output:

      labels
1  [O, B-l1]
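
As a rough sketch, the same filter can be written with the boolean mask spelled out (the names `mask` and `annotated` are only illustrative, not part of the answer above). `S.issuperset(l)` is true when every label in `l` is 'O', and it is also true for an empty list, which is why both the first and the last row are dropped:

```python
import pandas as pd

data = pd.DataFrame({'labels': [['O'], ['O', 'B-l1'], []]})

S = {'O'}

# True only for rows that contain at least one label other than 'O'
mask = data['labels'].apply(lambda l: not S.issuperset(l))

# Boolean indexing keeps the rows where the mask is True
annotated = data[mask]
print(annotated)
```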

Converting from strings

If you have string representations of lists, parse them with ast.literal_eval first:

import ast

data['labels'] = [list(set(ast.literal_eval(l))) for l in data['labels']]
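
Putting the two steps together, a minimal end-to-end sketch could look like this, assuming every cell holds the string form of a list. The sample rows and the names `parsed`, `annotated` and `unique_labels` are made up for illustration:

```python
import ast

import pandas as pd

# Hypothetical data: each cell is a string that looks like a list
data = pd.DataFrame({'labels': [
    "['O', 'O', 'B-l1', 'I-l2']",
    "['O', 'O', 'O']",
]})

# Parse each string into a real Python list
parsed = data['labels'].apply(ast.literal_eval)

# Keep only rows with at least one label other than 'O'
S = {'O'}
annotated = data[[not S.issuperset(l) for l in parsed]]

# Unique labels for each remaining row, as sets (element order is not guaranteed)
unique_labels = [set(l) for l in parsed.loc[annotated.index]]
print(unique_labels)
```

Iterating over the unparsed string is what produced the single characters in the question, so the parsing step has to come before any `set` call.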

CodePudding user response:

Another possible solution, based on numpy.unique:

import numpy as np

lst = ['O', 'O', 'O', 'B-l1', 'O', 'B-l1', 'I-l2', 'I-l2', 'O', 'I-l2']

np.unique(lst).tolist()

Output:

['B-l1', 'I-l2', 'O']
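
Note that numpy.unique returns its result sorted, which is why 'O' comes last above. If first-seen order matters, one possible alternative (not from the answer above) is dict.fromkeys, which relies on dicts preserving insertion order in Python 3.7+. A short sketch comparing the two:

```python
import numpy as np

lst = ['O', 'O', 'O', 'B-l1', 'O', 'B-l1', 'I-l2', 'I-l2', 'O', 'I-l2']

# np.unique returns the unique values in sorted order
sorted_unique = np.unique(lst).tolist()

# dict keys keep insertion order, so duplicates are dropped in first-seen order
ordered_unique = list(dict.fromkeys(lst))

print(sorted_unique)
print(ordered_unique)
```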