flattern nested JSON and retain columns (or Panda json

I have a json with some nested/array items like the one below I'm looking at flattening it before saving it into a csv

{'SKU':'TEST1','name':'test name 1',
    'ItemSalesPrices':[{'SourceNumber': 'OEM', 'AssetNumber': 'TEST1', 'UnitPrice': 1600}, {'SourceNumber': 'RRP', 'AssetNumber': 'TEST1', 'UnitPrice': 1500}],
    'ItemDiscounts':[{'SourceNumber': 'RETAIL', 'AssetNumber': 'AC', 'LineDiscountPercentage': 30}, {'SourceNumber': 'LARGE ACC', 'AssetNumber': 'AC', 'LineDiscountPercentage': 45}]
    },
{'SKU':'TEST2','name':'test name 2',
    'ItemSalesPrices':[{'SourceNumber': 'RRP', 'AssetNumber': 'TEST2', 'UnitPrice': 1500}],
    'ItemDiscounts':[{'SourceNumber': 'RETAIL', 'AssetNumber': 'AC', 'LineDiscountPercentage': 30}]
    }

I'm hoping to flatten it into more columns so when I save it to csv, it will still have 2 rows with set of columns representing the individual Source NUmber items in each array.

SKU	Name	ItemSalesPrices_OEM	ItemSalesPrices_RRP	ItemDiscounts_Retail	ItemDiscounts_LARGE ACC
TEST1	test name 1	1600	1500	30	45
TEST2	test name 2		1500	30

At the moment, why I'm doing is to go through each item, loop through the array and do all kinds of checking & comparing against the SourceNumber

I dont think I'm heading in the right direct and hope to hear from you guys to see how this should be tackled

Thank you

CodePudding user response：

Use json_normalize with DataFrame.pivot or DataFrame.pivot_table and then join DataFrames by concat:

data = [{'SKU':'TEST1','name':'test name 1',
    'ItemSalesPrices':[{'SourceNumber': 'OEM', 'AssetNumber': 'TEST1', 'UnitPrice': 1600}, 
                       {'SourceNumber': 'RRP', 'AssetNumber': 'TEST1', 'UnitPrice': 1500}],
    'ItemDiscounts':[{'SourceNumber': 'RETAIL', 'AssetNumber': 'AC', 'LineDiscountPercentage': 30},
                     {'SourceNumber': 'LARGE ACC', 'AssetNumber': 'AC', 'LineDiscountPercentage': 45}]
    },
{'SKU':'TEST2','name':'test name 2',
    'ItemSalesPrices':[{'SourceNumber': 'RRP', 'AssetNumber': 'TEST2', 'UnitPrice': 1500}],
    'ItemDiscounts':[{'SourceNumber': 'RETAIL', 'AssetNumber': 'AC', 'LineDiscountPercentage': 30}]
    }]

first = ['SKU','name']
df1 = (pd.json_normalize(data,'ItemSalesPrices', first)
        .pivot_table(index=first, 
                     columns='SourceNumber', 
                     values='UnitPrice', 
                     aggfunc='sum')
        .add_prefix('ItemSalesPrices_'))
df2 = (pd.json_normalize(data,'ItemDiscounts', first)
        .pivot_table(index=first, 
                     columns='SourceNumber', 
                     values='LineDiscountPercentage', 
                     aggfunc='sum')
        .add_prefix('ItemSalesPrices_'))      
        
        
df3 = pd.concat([df1, df2], axis=1).rename_axis(None, axis=1).reset_index()
print (df3)
     SKU         name  ItemSalesPrices_OEM  ItemSalesPrices_RRP  \
0  TEST1  test name 1               1600.0               1500.0   
1  TEST2  test name 2                  NaN               1500.0   

   ItemSalesPrices_LARGE ACC  ItemSalesPrices_RETAIL  
0                       45.0                    30.0  
1                        NaN                    30.0

If no duplicates here is alternative solution:

from collections import defaultdict

d = defaultdict(dict)

for x in data:
    for y in x['ItemSalesPrices']:
        d['ItemSalesPrices_'   y['SourceNumber']].update({(x['SKU'], x['name']):y['UnitPrice']})
    for y in x['ItemDiscounts']:
        d['ItemDiscounts_'   y['SourceNumber']].update({(x['SKU'], x['name']):y['LineDiscountPercentage']})

df = pd.DataFrame(d).rename_axis(['SKU','name']).reset_index()
print(df)
     SKU         name  ItemSalesPrices_OEM  ItemSalesPrices_RRP  \
0  TEST1  test name 1               1600.0                 1500   
1  TEST2  test name 2                  NaN                 1500   

   ItemDiscounts_RETAIL  ItemDiscounts_LARGE ACC  
0                    30                     45.0  
1                    30                      NaN