I have this sample dataset:
the_df = pd.DataFrame(
{'id':['AM','AN','AP'],
'target':[130,60,180],
'moves':[[{'date':'2022-08-01','amount':285.0,'name':'Cookie'},
{'name':'Rush','amount':10,'date':'2022-08-02','type':'song'}],
[{'amount':250.5,'date':'2022-08-01','source':{'data':'bing'}}],[]]})
the_df
id target moves
0 AM 130 [{'date': '2022-08-01', 'amount': 285.0, 'name...
1 AN 60 [{'amount': 250.5, 'date': '2022-08-01', 'sour...
2 AP 180 []
And I want to 'expand' (or 'explode') each value in the json column, but only selecting some columns. This is the expected result:
id target date amount name
0 AM 130 2022-08-01 285.0 Cookie
1 AM 130 2022-08-02 10.0 Rush
2 AN 60 2022-08-01 250.5
3 AP 180
Firstly I tried using json_normalize
and iterate over each row (even when the last row has no data), but I have to know before how many rows I'm going to expand:
pd.json_normalize(the_df.moves[0])
date amount name type
0 2022-08-01 285.0 Cookie NaN
1 2022-08-02 10.0 Rush song
pd.json_normalize(the_df.moves[1])
amount date source.data
0 250.5 2022-08-01 bing
pd.json_normalize(the_df.moves[2])
I only want keys date
,amount
and name
. So I tried this:
temp_df = pd.DataFrame(columns=['date','amount','name'])
for i in range(len(the_df)):
temp_df = temp_df.append(pd.json_normalize(the_df.moves[i]))
temp_df
date amount name type source.data
0 2022-08-01 285.0 Cookie NaN NaN
1 2022-08-02 10.0 Rush song NaN
0 2022-08-01 250.5 NaN NaN bing
But my data frame temp_df
needs a reference to the original dataset the_df
to apply a merge. Please, could you guide me to the right solution? I guess there must be a way to recall the id, or a method in pandas to do this without a for loop.
CodePudding user response:
You can explode
the moves
column first
df = df.explode('moves', ignore_index=True)
out = df.join(pd.json_normalize(df.pop('moves'))[['date','amount','name']])
print(df)
id target date amount name
0 AM 130 2022-08-01 285.0 Cookie
1 AM 130 2022-08-02 10.0 Rush
2 AN 60 2022-08-01 250.5 NaN
3 AP 180 NaN NaN NaN
CodePudding user response:
Here are the steps you could follow
(1) define df
df = pd.DataFrame(
{'id':['AM','AN','AP'],
'target':[130,60,180],
'moves':[[{'date':'2022-08-01','amount':285.0,'name':'Cookie'},
{'name':'Rush','amount':10,'date':'2022-08-02','type':'song'}],
[{'amount':250.5,'date':'2022-08-01','source':{'data':'bing'}}],[]]})
print(df)
id target moves
0 AM 130 [{'date': '2022-08-01', 'amount': 285.0, 'name': 'Cookie'}, {'name': 'Rush', 'amount': 10, 'date': '2022-08-02', 'type': 'song'}]
1 AN 60 [{'amount': 250.5, 'date': '2022-08-01', 'source': {'data': 'bing'}}]
2 AP 180 []
(2) explode the column 'moves'
df1 = df.explode('moves', ignore_index=True)
print(df1)
id target moves
0 AM 130 {'date': '2022-08-01', 'amount': 285.0, 'name': 'Cookie'}
1 AM 130 {'name': 'Rush', 'amount': 10, 'date': '2022-08-02', 'type': 'song'}
2 AN 60 {'amount': 250.5, 'date': '2022-08-01', 'source': {'data': 'bing'}}
3 AP 180 NaN
(3) json_normalize the column 'moves'
df2 = pd.json_normalize(df1['moves'])
print(df2)
date amount name type source.data
0 2022-08-01 285.0 Cookie NaN NaN
1 2022-08-02 10.0 Rush song NaN
2 2022-08-01 250.5 NaN NaN bing
3 NaN NaN NaN NaN NaN
(4) concat the 2 df
with only the relevant columns
df3 = pd.concat([df1[['id', 'target']], df2[['date', 'amount', 'name']]], axis=1)
print(df3)
id target date amount name
0 AM 130 2022-08-01 285.0 Cookie
1 AM 130 2022-08-02 10.0 Rush
2 AN 60 2022-08-01 250.5 NaN
3 AP 180 NaN NaN NaN