How can I convert the sentences column into multiple columns?
import pandas as pd

df = pd.DataFrame(data={'id': [0, 1, 2, 3], 'sentences': [
    {0: ['first sentence0', 'second sentence0', 'label0']},
    {1: ['first sentence1', 'second sentence1', 'label1']},
    {2: ['first sentence2', 'second sentence2', 'label2']},
    {3: ['first sentence3', 'second sentence3', 'label3']}]})
| | id | sentences |
|---:|-----:|:-------------------------------------------------------|
| 0 | 0 | {0: ['first sentence0', 'second sentence0', 'label0']} |
| 1 | 1 | {1: ['first sentence1', 'second sentence1', 'label1']} |
| 2 | 2 | {2: ['first sentence2', 'second sentence2', 'label2']} |
| 3 | 3 | {3: ['first sentence3', 'second sentence3', 'label3']} |
Expected output:
| id | sentences | label |
|-----:|:-----------------|:--------|
| 0 | first sentence0 | label0 |
| 0 | second sentence0 | label0 |
| 1 | first sentence1 | label1 |
| 1 | second sentence1 | label1 |
| 2 | first sentence2 | label2 |
| 2 | second sentence2 | label2 |
| 3 | first sentence3 | label3 |
| 3 | second sentence3 | label3 |
The dataframe has over 20,000 rows and 2 columns. I'm open to any efficient solution, including ones that use loops. Maybe pd.json_normalize could help?
Answer:
One way could be:
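from itertools import product

# Pair the label (last element) with every sentence (all earlier elements),
# giving one list of (label, sentence) tuples per row.
# Assumes each dict holds exactly one key, so the list matches len(df).
(df.assign(sentences=[list(product(v[-1:], v[:-1]))
                      for d in df['sentences'] for v in d.values()])
   .explode('sentences')  # one row per (label, sentence) tuple
   .assign(labels=lambda d: d['sentences'].str[0],     # tuple item 0: label
           sentences=lambda d: d['sentences'].str[1],  # tuple item 1: sentence
          )
)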
Output:
id sentences labels
0 0 first sentence0 label0
0 0 second sentence0 label0
1 1 first sentence1 label1
1 1 second sentence1 label1
2 2 first sentence2 label2
2 2 second sentence2 label2
3 3 first sentence3 label3
3 3 second sentence3 label3
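Since the question is also open to loops, here is a minimal sketch of the same reshaping with a plain list comprehension instead of product and explode. It assumes, as in the sample data, that the last element of each list is the label and everything before it is a sentence. The names rows and out are only illustrative, and the label column is named label to match the expected output in the question.

```python
import pandas as pd

df = pd.DataFrame(data={'id': [0, 1, 2, 3], 'sentences': [
    {0: ['first sentence0', 'second sentence0', 'label0']},
    {1: ['first sentence1', 'second sentence1', 'label1']},
    {2: ['first sentence2', 'second sentence2', 'label2']},
    {3: ['first sentence3', 'second sentence3', 'label3']}]})

# Build one (id, sentence, label) tuple per sentence.
# Assumption: the last list element is the label, the rest are sentences.
rows = [
    (row_id, sentence, values[-1])
    for row_id, d in zip(df['id'], df['sentences'])
    for values in d.values()
    for sentence in values[:-1]
]

out = pd.DataFrame(rows, columns=['id', 'sentences', 'label'])
print(out)
```

Unlike the explode version, this builds a fresh 0..n-1 index rather than repeating the original row index, and it does not require each dict to hold exactly one key. Whether it is faster on 20,000 rows would need to be timed on the real data.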