Expand selected keys in a json pandas column-CodePudding

I have this sample dataset:

the_df = pd.DataFrame(
    {'id':['AM','AN','AP'],
     'target':[130,60,180],
     'moves':[[{'date':'2022-08-01','amount':285.0,'name':'Cookie'},
               {'name':'Rush','amount':10,'date':'2022-08-02','type':'song'}],
              [{'amount':250.5,'date':'2022-08-01','source':{'data':'bing'}}],[]]})
the_df
    id  target  moves
0   AM  130     [{'date': '2022-08-01', 'amount': 285.0, 'name...
1   AN  60      [{'amount': 250.5, 'date': '2022-08-01', 'sour...
2   AP  180     []

And I want to 'expand' (or 'explode') each value in the json column, but only selecting some columns. This is the expected result:


    id  target  date        amount  name
0   AM  130     2022-08-01  285.0   Cookie
1   AM  130     2022-08-02  10.0    Rush
2   AN  60      2022-08-01  250.5   
3   AP  180

Firstly I tried using json_normalize and iterate over each row (even when the last row has no data), but I have to know before how many rows I'm going to expand:

pd.json_normalize(the_df.moves[0])
    date        amount  name    type
0   2022-08-01  285.0   Cookie  NaN
1   2022-08-02  10.0    Rush    song

pd.json_normalize(the_df.moves[1])
    amount  date    source.data
0   250.5   2022-08-01  bing

pd.json_normalize(the_df.moves[2])

I only want keys date,amount and name. So I tried this:

temp_df = pd.DataFrame(columns=['date','amount','name'])
for i in range(len(the_df)):
    temp_df = temp_df.append(pd.json_normalize(the_df.moves[i]))
temp_df
    date        amount  name    type    source.data
0   2022-08-01  285.0   Cookie  NaN     NaN
1   2022-08-02  10.0    Rush    song    NaN
0   2022-08-01  250.5   NaN     NaN     bing

But my data frame temp_df needs a reference to the original dataset the_df to apply a merge. Please, could you guide me to the right solution? I guess there must be a way to recall the id, or a method in pandas to do this without a for loop.

CodePudding user response：

You can explode the moves column first

df = df.explode('moves', ignore_index=True)
out = df.join(pd.json_normalize(df.pop('moves'))[['date','amount','name']])

print(df)

   id  target        date  amount    name
0  AM     130  2022-08-01   285.0  Cookie
1  AM     130  2022-08-02    10.0    Rush
2  AN      60  2022-08-01   250.5     NaN
3  AP     180         NaN     NaN     NaN

CodePudding user response：

Here are the steps you could follow

(1) define df

df = pd.DataFrame(
    {'id':['AM','AN','AP'],
     'target':[130,60,180],
     'moves':[[{'date':'2022-08-01','amount':285.0,'name':'Cookie'},
               {'name':'Rush','amount':10,'date':'2022-08-02','type':'song'}],
              [{'amount':250.5,'date':'2022-08-01','source':{'data':'bing'}}],[]]})
print(df)

   id  target                                                                                                                              moves
0  AM     130  [{'date': '2022-08-01', 'amount': 285.0, 'name': 'Cookie'}, {'name': 'Rush', 'amount': 10, 'date': '2022-08-02', 'type': 'song'}]
1  AN      60                                                              [{'amount': 250.5, 'date': '2022-08-01', 'source': {'data': 'bing'}}]
2  AP     180                                                                                                                                 []

(2) explode the column 'moves'

df1 = df.explode('moves', ignore_index=True)
print(df1)

   id  target                                                                 moves
0  AM     130             {'date': '2022-08-01', 'amount': 285.0, 'name': 'Cookie'}
1  AM     130  {'name': 'Rush', 'amount': 10, 'date': '2022-08-02', 'type': 'song'}
2  AN      60   {'amount': 250.5, 'date': '2022-08-01', 'source': {'data': 'bing'}}
3  AP     180                                                                   NaN

(3) json_normalize the column 'moves'

df2 = pd.json_normalize(df1['moves'])
print(df2)

         date  amount    name  type source.data
0  2022-08-01   285.0  Cookie   NaN         NaN
1  2022-08-02    10.0    Rush  song         NaN
2  2022-08-01   250.5     NaN   NaN        bing
3         NaN     NaN     NaN   NaN         NaN

(4) concat the 2 df with only the relevant columns

df3 = pd.concat([df1[['id', 'target']], df2[['date', 'amount', 'name']]], axis=1)
print(df3)

   id  target        date  amount    name
0  AM     130  2022-08-01   285.0  Cookie
1  AM     130  2022-08-02    10.0    Rush
2  AN      60  2022-08-01   250.5     NaN
3  AP     180         NaN     NaN     NaN