Home > Software design >  Pandas: Extracting List-Dictionary column into separate columns and rows
Pandas: Extracting List-Dictionary column into separate columns and rows

Time:03-10

I have this dataframe df.

tweet_id tweet_entites
1223395611921305601 [{'label': 'NORP', 'term': 'Chinese'}, {'label': 'ORG', 'term': 'InnoCare'}, {'label': 'GPE', 'term': 'Hong Kong'}]
1223395868092465153 NaN
1223396204093902849 [{'label': 'ORG', 'term': 'LIVE Press'}, {'label': 'ORG', 'term': 'Emergency Committee'}]
1223396269655089154 [{'label': 'CARDINAL', 'term': '83'}, {'label': 'CARDINAL', 'term': '2019nCoV'}, {'label': 'CARDINAL', 'term': '83'}]

I want to extract the list-dictionary into separate columns as such:

tweet_id label term
1223395611921305601 NORP Chinese
1223395611921305601 ORG InnoCare
1223395611921305601 GPE Hong Kong
1223395868092465153 NaN NaN
1223396204093902849 ORG LIVE Press
1223396204093902849 ORG Emergency Committee
1223396269655089154 CARDINAL 83
1223396269655089154 CARDINAL 2019nCoV
1223396269655089154 CARDINAL 83

The new columns will be named as label and term. I've seen references but I haven't been able to find one that is similar to the output that I want.

CodePudding user response:

If there are already lists of dicts use nested list comprehension with replace misisng values:

zipped = zip(df['tweet_id'], 
             df['tweet_entites'].apply(lambda x: [{'label':np.nan}] 
                                                 if isinstance(x, float) 
                                                 else x))

df = [{**{'tweet_id': x}, **z} for x, y in zipped for z in y]

df = pd.DataFrame(df)
print (df)
              tweet_id     label                 term
0  1223395611921305601      NORP              Chinese
1  1223395611921305601       ORG             InnoCare
2  1223395611921305601       GPE            Hong Kong
3  1223395868092465153       NaN                  NaN
4  1223396204093902849       ORG           LIVE Press
5  1223396204093902849       ORG  Emergency Committee
6  1223396269655089154  CARDINAL                   83
7  1223396269655089154  CARDINAL             2019nCoV
8  1223396269655089154  CARDINAL                   83

If there are strings repr of list use ast.literal_eval:

import ast

df['tweet_entites'] = df['tweet_entites'].fillna('[{"label":None}]').apply(ast.literal_eval)


zipped = zip(df['tweet_id'], df['tweet_entites'])

df = [{**{'tweet_id': x}, **z} for x, y in zipped for z in y]
df = pd.DataFrame(df)
print (df)
              tweet_id     label                 term
0  1223395611921305601      NORP              Chinese
1  1223395611921305601       ORG             InnoCare
2  1223395611921305601       GPE            Hong Kong
3  1223395868092465153      None                  NaN
4  1223396204093902849       ORG           LIVE Press
5  1223396204093902849       ORG  Emergency Committee
6  1223396269655089154  CARDINAL                   83
7  1223396269655089154  CARDINAL             2019nCoV
8  1223396269655089154  CARDINAL                   83

CodePudding user response:

if df['tweet_entites'] is a string then you can use eval to convert it into list:

import pandas as pd

df = df.fillna("[{'label': None, 'term': None}, {'label': None, 'term': None}, {'label': None, 'term': None}]")

frames = []
for row in df.to_dict(orient="records"):
  for i in eval(row["tweet_entites"]):
    i["tweet_id"] = int(row["tweet_id"])
    frames.append(i)

new_df = pd.DataFrame(frames)
print(new_df)
  • Related