I have a json with some nested/array items like the one below I'm looking at flattening it before saving it into a csv
{'SKU':'TEST1','name':'test name 1',
'ItemSalesPrices':[{'SourceNumber': 'OEM', 'AssetNumber': 'TEST1', 'UnitPrice': 1600}, {'SourceNumber': 'RRP', 'AssetNumber': 'TEST1', 'UnitPrice': 1500}],
'ItemDiscounts':[{'SourceNumber': 'RETAIL', 'AssetNumber': 'AC', 'LineDiscountPercentage': 30}, {'SourceNumber': 'LARGE ACC', 'AssetNumber': 'AC', 'LineDiscountPercentage': 45}]
},
{'SKU':'TEST2','name':'test name 2',
'ItemSalesPrices':[{'SourceNumber': 'RRP', 'AssetNumber': 'TEST2', 'UnitPrice': 1500}],
'ItemDiscounts':[{'SourceNumber': 'RETAIL', 'AssetNumber': 'AC', 'LineDiscountPercentage': 30}]
}
I'm hoping to flatten it into more columns so when I save it to csv, it will still have 2 rows with set of columns representing the individual Source NUmber items in each array.
SKU | Name | ItemSalesPrices_OEM | ItemSalesPrices_RRP | ItemDiscounts_Retail | ItemDiscounts_LARGE ACC |
---|---|---|---|---|---|
TEST1 | test name 1 | 1600 | 1500 | 30 | 45 |
TEST2 | test name 2 | 1500 | 30 |
At the moment, why I'm doing is to go through each item, loop through the array and do all kinds of checking & comparing against the SourceNumber
I dont think I'm heading in the right direct and hope to hear from you guys to see how this should be tackled
Thank you
CodePudding user response:
Use json_normalize
with DataFrame.pivot
or DataFrame.pivot_table
and then join DataFrames by concat
:
data = [{'SKU':'TEST1','name':'test name 1',
'ItemSalesPrices':[{'SourceNumber': 'OEM', 'AssetNumber': 'TEST1', 'UnitPrice': 1600},
{'SourceNumber': 'RRP', 'AssetNumber': 'TEST1', 'UnitPrice': 1500}],
'ItemDiscounts':[{'SourceNumber': 'RETAIL', 'AssetNumber': 'AC', 'LineDiscountPercentage': 30},
{'SourceNumber': 'LARGE ACC', 'AssetNumber': 'AC', 'LineDiscountPercentage': 45}]
},
{'SKU':'TEST2','name':'test name 2',
'ItemSalesPrices':[{'SourceNumber': 'RRP', 'AssetNumber': 'TEST2', 'UnitPrice': 1500}],
'ItemDiscounts':[{'SourceNumber': 'RETAIL', 'AssetNumber': 'AC', 'LineDiscountPercentage': 30}]
}]
first = ['SKU','name']
df1 = (pd.json_normalize(data,'ItemSalesPrices', first)
.pivot_table(index=first,
columns='SourceNumber',
values='UnitPrice',
aggfunc='sum')
.add_prefix('ItemSalesPrices_'))
df2 = (pd.json_normalize(data,'ItemDiscounts', first)
.pivot_table(index=first,
columns='SourceNumber',
values='LineDiscountPercentage',
aggfunc='sum')
.add_prefix('ItemSalesPrices_'))
df3 = pd.concat([df1, df2], axis=1).rename_axis(None, axis=1).reset_index()
print (df3)
SKU name ItemSalesPrices_OEM ItemSalesPrices_RRP \
0 TEST1 test name 1 1600.0 1500.0
1 TEST2 test name 2 NaN 1500.0
ItemSalesPrices_LARGE ACC ItemSalesPrices_RETAIL
0 45.0 30.0
1 NaN 30.0
If no duplicates here is alternative solution:
from collections import defaultdict
d = defaultdict(dict)
for x in data:
for y in x['ItemSalesPrices']:
d['ItemSalesPrices_' y['SourceNumber']].update({(x['SKU'], x['name']):y['UnitPrice']})
for y in x['ItemDiscounts']:
d['ItemDiscounts_' y['SourceNumber']].update({(x['SKU'], x['name']):y['LineDiscountPercentage']})
df = pd.DataFrame(d).rename_axis(['SKU','name']).reset_index()
print(df)
SKU name ItemSalesPrices_OEM ItemSalesPrices_RRP \
0 TEST1 test name 1 1600.0 1500
1 TEST2 test name 2 NaN 1500
ItemDiscounts_RETAIL ItemDiscounts_LARGE ACC
0 30 45.0
1 30 NaN