Home > Net >  Python: Convert JSON from df column into individual df columns
Python: Convert JSON from df column into individual df columns

Time:04-22

I have an excel file that looks something like this:

Header1 Header2 Header3
data data [{"key1":"123","key2":"Value1"},{"key1":"123","key2":"Value2"}, {"key1":"123","key2":"Value3"}]
data data [{"key1":"123","key2":"Value1"},{"key1":"123","key2":"Value2"}, {"key1":"123","key2":"Value3"}]

Header3 contains JSON strings that look like this

[
{"key1":"123","key2":"Value1"},
{"key1":"123","key2":"Value2"}, 
{"key1":"123","key2":"Value3"}
]

I would like to parse the JSON Header3 column and for each key create a column with the name of the key appended with the value of key2, the keys are always the same throughout the file.

The end data frame should look something like this:

Header1 Header2 Key1.Value1 Key2.Value1 Key1.Value2 Key2.Value2
data data 123 Value1 123 Value2
data data 123 Value1 123 Value2

Actual example:

Company JSON
Amazon [{"charge1":"500", "charge2":"200","card":"Visa"},{"charge1":"234", "charge2":"654","card":"Amex"}
Apple [{"charge1":"689", "charge2":"433","card":"Visa"},{"charge1":"25434", "charge2":"6554644","card":"Amex"}]

Needs to become:

Company charge1.Visa charge2.Visa card.Visa charge1.AMEX charge2.AMEX card.AMEX
Amazon 500 200 Visa 234 654 Amex
Apple 689 433 Visa 25434 6554644 Amex

Before getting into fancy stuff I tried to at least normalize the data, but I'm returned with an empty series.

df = pd.read_excel('test.xlsx')
pd.json_normalize(df.JSON)

Output:
0
1
2
3
4
5
...
188 rows x 0 columns

CodePudding user response:

You can't normalize it because it's loaded as string from Excel. Try this:

import json

s = df["JSON"].apply(json.loads).explode()
tmp = (
    pd.DataFrame(s.to_list(), index=s.index)
    .set_index("card", append=True)
    .unstack()
)
tmp.columns = [".".join(col) for col in tmp.columns]

pd.concat([df[["Company"]], tmp], axis=1)

The card.* columns look kinda weird. If you know what column you are referring to, you already know its value so why include it in the output dataframe?

  • Related