I created a function to extract sentences from a specific key in a nested file. Now I would like to include in this function a label each time it comes to a new dictionary.
Each time the the value HEADER appears marks the begining of a NEW story. So I would like to label the sentences that belong to the same story. And differentiate those that are different.
The data looks like the following:
sentences = [{'c': 'HEADER', 'a1': {'a': 'Opus dei, la vie en rose.', 'x': 'l'}},
{'d': 'm', 'a1': {'a': 'Ipsum lorem, Suspendisse posuere.', 'x': '4'}},
{'c': 'j', 'a1': {'a': 'Nulla elementum, augue fringilla tincidunt ullamcorper.'}},
{'c':'h', 'b': 'p'},
{'a1': {'a': 'Ut sollicitudin mauris sem, ut ultricies ante accumsan dictum.'}},
{'c': 'HEADER', 'a1': {'a': 'NEW Opus dei, la vie en rose.', 'x': 'l'}},
{'d': 'm', 'a1': {'a': 'NEW Ipsum lorem, Suspendisse posuere.', 'x': '4'}},
{'c': 'j', 'a1': {'a': 'NEW Nulla elementum, augue fringilla tincidunt ullamcorper.'}},
{'c':'h', 'b': 'p'},
{'a1': {'a': 'NEW Ut sollicitudin mauris sem, ut ultricies ante accumsan dictum.'}}]
The function
def prhases_and_labels(data):
a1 = [d for d in data if 'a1' in d]
text = []
for i in a1:
text.append(i['a1']['a'])
df = pd.DataFrame({'text': text})
return df
The result that I would like to obtain (with the labels in a new column)
CodePudding user response:
You can iterate over the records and increment the label every time the c
value is HEADER
.
sentences = [{'c': 'HEADER', 'a1': {'a': 'Opus dei, la vie en rose.', 'x': 'l'}},
{'d': 'm', 'a1': {'a': 'Ipsum lorem, Suspendisse posuere.', 'x': '4'}},
{'c': 'j', 'a1': {'a': 'Nulla elementum, augue fringilla tincidunt ullamcorper.'}},
{'c':'h', 'b': 'p'},
{'a1': {'a': 'Ut sollicitudin mauris sem, ut ultricies ante accumsan dictum.'}},
{'c': 'HEADER', 'a1': {'a': 'NEW Opus dei, la vie en rose.', 'x': 'l'}},
{'d': 'm', 'a1': {'a': 'NEW Ipsum lorem, Suspendisse posuere.', 'x': '4'}},
{'c': 'j', 'a1': {'a': 'NEW Nulla elementum, augue fringilla tincidunt ullamcorper.'}},
{'c':'h', 'b': 'p'},
{'a1': {'a': 'NEW Ut sollicitudin mauris sem, ut ultricies ante accumsan dictum.'}}]
def prhases_and_labels(data):
label = 0
res = {'text':[], 'label': []}
for record in data:
if 'a1' in record:
line = record['a1']['a']
if record.get('c') == 'HEADER':
label = 1
res['text'].append(line)
res['label'].append(label)
return pd.DataFrame(res)
Output:
>>> prhases_and_labels(sentences)
text label
0 Opus dei, la vie en rose. 1
1 Ipsum lorem, Suspendisse posuere. 1
2 Nulla elementum, augue fringilla tincidunt ullamcorper. 1
3 Ut sollicitudin mauris sem, ut ultricies ante accumsan dictum. 1
4 NEW Opus dei, la vie en rose. 2
5 NEW Ipsum lorem, Suspendisse posuere. 2
6 NEW Nulla elementum, augue fringilla tincidunt ullamcorper. 2
7 NEW Ut sollicitudin mauris sem, ut ultricies ante accumsan dictum. 2