I have an excel file that looks something like this:
Header1 | Header2 | Header3 |
---|---|---|
data | data | [{"key1":"123","key2":"Value1"},{"key1":"123","key2":"Value2"}, {"key1":"123","key2":"Value3"}] |
data | data | [{"key1":"123","key2":"Value1"},{"key1":"123","key2":"Value2"}, {"key1":"123","key2":"Value3"}] |
Header3 contains JSON strings that look like this
[
{"key1":"123","key2":"Value1"},
{"key1":"123","key2":"Value2"},
{"key1":"123","key2":"Value3"}
]
I would like to parse the JSON Header3 column and for each key create a column with the name of the key appended with the value of key2, the keys are always the same throughout the file.
The end data frame should look something like this:
Header1 | Header2 | Key1.Value1 | Key2.Value1 | Key1.Value2 | Key2.Value2 |
---|---|---|---|---|---|
data | data | 123 | Value1 | 123 | Value2 |
data | data | 123 | Value1 | 123 | Value2 |
Actual example:
Company | JSON |
---|---|
Amazon | [{"charge1":"500", "charge2":"200","card":"Visa"},{"charge1":"234", "charge2":"654","card":"Amex"} |
Apple | [{"charge1":"689", "charge2":"433","card":"Visa"},{"charge1":"25434", "charge2":"6554644","card":"Amex"}] |
Needs to become:
Company | charge1.Visa | charge2.Visa | card.Visa | charge1.AMEX | charge2.AMEX | card.AMEX |
---|---|---|---|---|---|---|
Amazon | 500 | 200 | Visa | 234 | 654 | Amex |
Apple | 689 | 433 | Visa | 25434 | 6554644 | Amex |
Before getting into fancy stuff I tried to at least normalize the data, but I'm returned with an empty series.
df = pd.read_excel('test.xlsx')
pd.json_normalize(df.JSON)
Output:
0
1
2
3
4
5
...
188 rows x 0 columns
CodePudding user response:
You can't normalize it because it's loaded as string from Excel. Try this:
import json
s = df["JSON"].apply(json.loads).explode()
tmp = (
pd.DataFrame(s.to_list(), index=s.index)
.set_index("card", append=True)
.unstack()
)
tmp.columns = [".".join(col) for col in tmp.columns]
pd.concat([df[["Company"]], tmp], axis=1)
The card.*
columns look kinda weird. If you know what column you are referring to, you already know its value so why include it in the output dataframe?