Home > OS >  Python: Extracting a data frame form highly nested JSON file
Python: Extracting a data frame form highly nested JSON file

Time:03-30

I have JSON file that has many nested dictionaries/lists of excess information that I do not want to use when creating my data frame. All the unnecessary fluff I have either deleted or replaced with '---'.

{'ID': 1,
 'SPEC': {'Name': 'STOCK_VAL',
  '---': '---',
  '---': '---',
  'Info': {'---': [{'---': '---', '---': '---', '---': '---'}],
   '---': [{'---': '---', '---': '---', '---': '---'}]},
  '---': '---',
  'RELEVANT_AFTER_ALL': [{'---': '---',
    'Max': 140.00,
    'Min': 100.00,
    '---': '---',
    'Name': 'Calculated',
    'Units': 'USD/D',
    '---': '---',
    'Entries': [{'Timestamp': '2022-03-16T23:00:00Z', 'Value': 100.00},
     {'Timestamp': '2022-03-17T23:00:00Z', 'Value': 120.00},
     {'Timestamp': '2022-03-18T23:00:00Z', 'Value': 140.00}],
    '---': '---'},
   {'---': '---',
    'Max': 160.00,
    'Min': 80.00,
    '---': '---',
    'Name': 'Realised',
    'Units': 'USD/D',
    '---': '---',
    'Entries': [{'Timestamp': '2022-03-16T23:00:00Z', 'Value': 160.00},
     {'Timestamp': '2022-03-17T23:00:00Z', 'Value': 120.00},
     {'Timestamp': '2022-03-18T23:00:00Z', 'Value': 80.00}],
    '---': '---'}]}}

From the data above I want to create the following data frame:

Timestamp STOCK_VAL Calculated STOCK_VAL Realised
2022-03-16T23:00:00Z 100.00 160.00
2022-03-17T23:00:00Z 120.00 120.00
2022-03-18T23:00:00Z 140.00 80.00

I have tried using pandas.json_normalize() but failed to extract the table as I want it to be made in an efficient manner.

Thanks in advance for anyone who knows better!

CodePudding user response:

One of the strings you replaced with '---' is relevant after all.

First we find the array where the data is located. Each item of this array should be a series, from which we can build a dataframe.

import pandas as pd
table_data = data['SPEC']['RELEVANT_AFTER_ALL']
x = pd.DataFrame({
    f"STOCK_VAL {item['Name']}": pd.DataFrame(item['Entries']).set_index('Timestamp').squeeze()
    for item in table_data
})

EDIT: Replaced pd.json_normalize with pd.DataFrame, which suffices in this scenario.

EDIT 2: Added STOCK_VAL to the column names.

  • Related