I have this dataframe df.
tweet_id | tweet_entites |
---|---|
1223395611921305601 | [{'label': 'NORP', 'term': 'Chinese'}, {'label': 'ORG', 'term': 'InnoCare'}, {'label': 'GPE', 'term': 'Hong Kong'}] |
1223395868092465153 | NaN |
1223396204093902849 | [{'label': 'ORG', 'term': 'LIVE Press'}, {'label': 'ORG', 'term': 'Emergency Committee'}] |
1223396269655089154 | [{'label': 'CARDINAL', 'term': '83'}, {'label': 'CARDINAL', 'term': '2019nCoV'}, {'label': 'CARDINAL', 'term': '83'}] |
I want to extract the list-dictionary into separate columns as such:
tweet_id | label | term |
---|---|---|
1223395611921305601 | NORP | Chinese |
1223395611921305601 | ORG | InnoCare |
1223395611921305601 | GPE | Hong Kong |
1223395868092465153 | NaN | NaN |
1223396204093902849 | ORG | LIVE Press |
1223396204093902849 | ORG | Emergency Committee |
1223396269655089154 | CARDINAL | 83 |
1223396269655089154 | CARDINAL | 2019nCoV |
1223396269655089154 | CARDINAL | 83 |
The new columns will be named as label and term. I've seen references but I haven't been able to find one that is similar to the output that I want.
CodePudding user response:
If there are already lists of dicts use nested list comprehension with replace misisng values:
zipped = zip(df['tweet_id'],
df['tweet_entites'].apply(lambda x: [{'label':np.nan}]
if isinstance(x, float)
else x))
df = [{**{'tweet_id': x}, **z} for x, y in zipped for z in y]
df = pd.DataFrame(df)
print (df)
tweet_id label term
0 1223395611921305601 NORP Chinese
1 1223395611921305601 ORG InnoCare
2 1223395611921305601 GPE Hong Kong
3 1223395868092465153 NaN NaN
4 1223396204093902849 ORG LIVE Press
5 1223396204093902849 ORG Emergency Committee
6 1223396269655089154 CARDINAL 83
7 1223396269655089154 CARDINAL 2019nCoV
8 1223396269655089154 CARDINAL 83
If there are strings repr of list use ast.literal_eval
:
import ast
df['tweet_entites'] = df['tweet_entites'].fillna('[{"label":None}]').apply(ast.literal_eval)
zipped = zip(df['tweet_id'], df['tweet_entites'])
df = [{**{'tweet_id': x}, **z} for x, y in zipped for z in y]
df = pd.DataFrame(df)
print (df)
tweet_id label term
0 1223395611921305601 NORP Chinese
1 1223395611921305601 ORG InnoCare
2 1223395611921305601 GPE Hong Kong
3 1223395868092465153 None NaN
4 1223396204093902849 ORG LIVE Press
5 1223396204093902849 ORG Emergency Committee
6 1223396269655089154 CARDINAL 83
7 1223396269655089154 CARDINAL 2019nCoV
8 1223396269655089154 CARDINAL 83
CodePudding user response:
if df['tweet_entites'] is a string then you can use eval to convert it into list:
import pandas as pd
df = df.fillna("[{'label': None, 'term': None}, {'label': None, 'term': None}, {'label': None, 'term': None}]")
frames = []
for row in df.to_dict(orient="records"):
for i in eval(row["tweet_entites"]):
i["tweet_id"] = int(row["tweet_id"])
frames.append(i)
new_df = pd.DataFrame(frames)
print(new_df)