I have a DataFrame with a column that contains a dictionary as follows:
df:
                        date  dictionary
0  2021-01-01 00:00:00 00:00  'Total':{'USD':100, 'size':20}, 'country':{'USA': {'income': 20000}, 'fees': {'total': 55}}
1  2021-01-01 00:00:00 00:00  'Total':{'EUR':200, 'size':40}, 'country':{'France': {'income': 10000}, 'fees': {'total': 30}}
1  2021-01-02 00:00:00 00:00  'Total':{'GBP':100, 'size':30}, 'country':{'UK': {'income': 23000}, 'fees': {'total': 24}}
What I want is to set USA as a column name, take the value of total from fees, and set that as the value, to get the following:
df_final:
                        date  USA  France   UK
0  2021-01-01 00:00:00 00:00   55      30  NaN
1  2021-01-02 00:00:00 00:00  NaN     NaN   24
My DataFrame has hundreds of columns. I have tried the following:
df_list = []
for idx, row in df.iterrows():
    for dct in row['dictionary']:
        dct['date'] = row['date']
        df_list.append(dct)
But I get the following error: TypeError: 'str' object does not support item assignment. This happened specifically at dct['date'].
How can this be done?
EDIT: I added a few more rows to my DataFrame to better represent my problem.
CodePudding user response:
A possible solution:
# The column in the question is named 'dictionary', not 'dict'
df.assign(USA=pd.json_normalize(
    df['dictionary'].tolist(), sep='_').loc[:, 'country_fees_total']).drop('dictionary', axis=1)
Output:
date USA
0 2021-01-01 00:00:00 00:00 55
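The snippet above only fills a USA column. Here is a minimal sketch of how the same extraction could be extended to every country. It assumes the column holds real dicts and that each "country" entry contains exactly one country key alongside "fees", as in the question's sample rows. The helper name country_and_fee is hypothetical and not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-01", "2021-01-01", "2021-01-02"]),
    "dictionary": [
        {"Total": {"USD": 100, "size": 20}, "country": {"USA": {"income": 20000}, "fees": {"total": 55}}},
        {"Total": {"EUR": 200, "size": 40}, "country": {"France": {"income": 10000}, "fees": {"total": 30}}},
        {"Total": {"GBP": 100, "size": 30}, "country": {"UK": {"income": 23000}, "fees": {"total": 24}}},
    ],
})


def country_and_fee(record):
    """Return (country name, fees total) for one dictionary (hypothetical helper)."""
    country = record["country"]
    # Assumption: the only key besides 'fees' is the country name.
    name = next(key for key in country if key != "fees")
    return name, country["fees"]["total"]


# Build a long table with one (date, country, fee) row per original row.
pairs = [country_and_fee(record) for record in df["dictionary"]]
long_df = pd.DataFrame(pairs, columns=["country", "fee"], index=df.index)
long_df["date"] = df["date"]

# One row per date, one column per country; missing combinations become NaN.
df_final = (
    long_df.pivot_table(index="date", columns="country", values="fee", aggfunc="first")
    .rename_axis(columns=None)
    .reset_index()
)
```

The country columns come out in alphabetical order (France, UK, USA) rather than in the order shown in the question; reorder them afterwards if that matters.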
CodePudding user response:
1) The first possibility I see is that your DataFrame contains valid JSON strings, like so:
import pandas as pd

df = pd.DataFrame({
    'date': [
        '2021-01-01 00:00:00',
        '2021-01-01 00:00:00',
        '2021-01-02 00:00:00',
    ],
    'dictionary': [
        '{"Total":{"USD":100, "size":20}, "country":{"USA": {"income": 20000}, "fees": {"total": 55}}}',
        '{"Total":{"EUR":200, "size":40}, "country":{"France": {"income": 10000}, "fees": {"total": 30}}}',
        '{"Total":{"GBP":100, "size":30}, "country":{"UK": {"income": 23000}, "fees": {"total": 24}}}',
    ]
})
df.date = pd.to_datetime(df.date)
df
Then you could do:
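import json

for idx, row in df.iterrows():
    # Parse the JSON string into a Python dict (avoid naming it `dict`, which shadows the built-in)
    record = json.loads(row.dictionary)
    # The first key under "country" is the country name, e.g. "USA"
    country_keys = list(record["country"].keys())
    # df.loc[row label, column label]; the country column is created on first use
    df.loc[idx, country_keys[0]] = record["country"]["fees"]["total"]

# Collapse rows that share a calendar date, keeping the first non-null value per column
df_final = df.groupby(df.date.dt.date) \
    .agg('first') \
    .drop(columns=['date', 'dictionary']) \
    .reset_index()
df_final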
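One caveat for this first case: the sample in the question is printed with single quotes, and json.loads only accepts double-quoted strings. If your column holds strings that look like Python dict literals, ast.literal_eval from the standard library may be a better fit. This is a rough sketch, assuming each string is a complete dict literal including the outer braces (the question's printout does not show them):

```python
import ast
import json

# A string as it might appear in the question's column: single-quoted keys.
raw = "{'Total': {'USD': 100, 'size': 20}, 'country': {'USA': {'income': 20000}, 'fees': {'total': 55}}}"

try:
    json.loads(raw)
    parsed_as_json = True
except json.JSONDecodeError:
    # JSON requires double-quoted strings, so this branch is taken.
    parsed_as_json = False

# ast.literal_eval evaluates Python literals only (dicts, lists, numbers, strings).
record = ast.literal_eval(raw)
fee = record["country"]["fees"]["total"]
```

To apply it to the whole column, something like df['dictionary'].map(ast.literal_eval) should give you real dicts, after which the second approach applies.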
2) The second possibility is that your DataFrame contains actual dictionaries, like so:
import pandas as pd

df = pd.DataFrame({
    'date': [
        '2021-01-01 00:00:00',
        '2021-01-01 00:00:00',
        '2021-01-02 00:00:00',
    ],
    'dictionary': [
        {"Total":{"USD":100, "size":20}, "country":{"USA": {"income": 20000}, "fees": {"total": 55}}},
        {"Total":{"EUR":200, "size":40}, "country":{"France": {"income": 10000}, "fees": {"total": 30}}},
        {"Total":{"GBP":100, "size":30}, "country":{"UK": {"income": 23000}, "fees": {"total": 24}}},
    ]
})
df.date = pd.to_datetime(df.date)
df
Then you would:
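for idx, row in df.iterrows():
    # The cell already holds a dict, so no JSON parsing is needed
    record = row.dictionary
    # The first key under "country" is the country name, e.g. "USA"
    country_keys = list(record["country"].keys())
    # df.loc[row label, column label]; the country column is created on first use
    df.loc[idx, country_keys[0]] = record["country"]["fees"]["total"]

# Collapse rows that share a calendar date, keeping the first non-null value per column
df_final = df.groupby(df.date.dt.date) \
    .agg('first') \
    .drop(columns=['date', 'dictionary']) \
    .reset_index()
df_final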
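As for the TypeError in the question: the likely cause is that iterating over a dict yields its keys, not nested dicts, so dct is a string such as 'Total' and cannot take item assignment. (If the cell is a string instead, the loop yields single characters and fails the same way.) A small illustration, using a trimmed-down record made up for this example:

```python
record = {"Total": {"USD": 100, "size": 20}, "country": {"fees": {"total": 55}}}

# Iterating over a dict yields its keys, which are strings here.
items = [item for item in record]

message = None
try:
    for dct in record:
        dct["date"] = "2021-01-01"  # dct is the str 'Total', not a dict
except TypeError as exc:
    message = str(exc)

# Building a new dict with the extra key avoids the error.
row_dict = {**record, "date": "2021-01-01"}
```

So if you want one dict per row with the date attached, add the key to the row's dict itself (or a copy of it) rather than to the items you get by looping over it.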