Pandas: Extracting List-Dictionary column into separate columns and rows-CodePudding

I have this dataframe df.

tweet_id	tweet_entites
1223395611921305601	[{'label': 'NORP', 'term': 'Chinese'}, {'label': 'ORG', 'term': 'InnoCare'}, {'label': 'GPE', 'term': 'Hong Kong'}]
1223395868092465153	NaN
1223396204093902849	[{'label': 'ORG', 'term': 'LIVE Press'}, {'label': 'ORG', 'term': 'Emergency Committee'}]
1223396269655089154	[{'label': 'CARDINAL', 'term': '83'}, {'label': 'CARDINAL', 'term': '2019nCoV'}, {'label': 'CARDINAL', 'term': '83'}]

I want to extract the list-dictionary into separate columns as such:

tweet_id	label	term
1223395611921305601	NORP	Chinese
1223395611921305601	ORG	InnoCare
1223395611921305601	GPE	Hong Kong
1223395868092465153	NaN	NaN
1223396204093902849	ORG	LIVE Press
1223396204093902849	ORG	Emergency Committee
1223396269655089154	CARDINAL	83
1223396269655089154	CARDINAL	2019nCoV
1223396269655089154	CARDINAL	83

The new columns will be named as label and term. I've seen references but I haven't been able to find one that is similar to the output that I want.

CodePudding user response：

If there are already lists of dicts use nested list comprehension with replace misisng values:

zipped = zip(df['tweet_id'], 
             df['tweet_entites'].apply(lambda x: [{'label':np.nan}] 
                                                 if isinstance(x, float) 
                                                 else x))

df = [{**{'tweet_id': x}, **z} for x, y in zipped for z in y]

df = pd.DataFrame(df)
print (df)
              tweet_id     label                 term
0  1223395611921305601      NORP              Chinese
1  1223395611921305601       ORG             InnoCare
2  1223395611921305601       GPE            Hong Kong
3  1223395868092465153       NaN                  NaN
4  1223396204093902849       ORG           LIVE Press
5  1223396204093902849       ORG  Emergency Committee
6  1223396269655089154  CARDINAL                   83
7  1223396269655089154  CARDINAL             2019nCoV
8  1223396269655089154  CARDINAL                   83

If there are strings repr of list use ast.literal_eval:

import ast

df['tweet_entites'] = df['tweet_entites'].fillna('[{"label":None}]').apply(ast.literal_eval)


zipped = zip(df['tweet_id'], df['tweet_entites'])

df = [{**{'tweet_id': x}, **z} for x, y in zipped for z in y]
df = pd.DataFrame(df)
print (df)
              tweet_id     label                 term
0  1223395611921305601      NORP              Chinese
1  1223395611921305601       ORG             InnoCare
2  1223395611921305601       GPE            Hong Kong
3  1223395868092465153      None                  NaN
4  1223396204093902849       ORG           LIVE Press
5  1223396204093902849       ORG  Emergency Committee
6  1223396269655089154  CARDINAL                   83
7  1223396269655089154  CARDINAL             2019nCoV
8  1223396269655089154  CARDINAL                   83

CodePudding user response：

if df['tweet_entites'] is a string then you can use eval to convert it into list:

import pandas as pd

df = df.fillna("[{'label': None, 'term': None}, {'label': None, 'term': None}, {'label': None, 'term': None}]")

frames = []
for row in df.to_dict(orient="records"):
  for i in eval(row["tweet_entites"]):
    i["tweet_id"] = int(row["tweet_id"])
    frames.append(i)

new_df = pd.DataFrame(frames)
print(new_df)