I have JSON file that has many nested dictionaries/lists of excess information that I do not want to use when creating my data frame. All the unnecessary fluff I have either deleted or replaced with '---'.
{'ID': 1,
'SPEC': {'Name': 'STOCK_VAL',
'---': '---',
'---': '---',
'Info': {'---': [{'---': '---', '---': '---', '---': '---'}],
'---': [{'---': '---', '---': '---', '---': '---'}]},
'---': '---',
'RELEVANT_AFTER_ALL': [{'---': '---',
'Max': 140.00,
'Min': 100.00,
'---': '---',
'Name': 'Calculated',
'Units': 'USD/D',
'---': '---',
'Entries': [{'Timestamp': '2022-03-16T23:00:00Z', 'Value': 100.00},
{'Timestamp': '2022-03-17T23:00:00Z', 'Value': 120.00},
{'Timestamp': '2022-03-18T23:00:00Z', 'Value': 140.00}],
'---': '---'},
{'---': '---',
'Max': 160.00,
'Min': 80.00,
'---': '---',
'Name': 'Realised',
'Units': 'USD/D',
'---': '---',
'Entries': [{'Timestamp': '2022-03-16T23:00:00Z', 'Value': 160.00},
{'Timestamp': '2022-03-17T23:00:00Z', 'Value': 120.00},
{'Timestamp': '2022-03-18T23:00:00Z', 'Value': 80.00}],
'---': '---'}]}}
From the data above I want to create the following data frame:
Timestamp | STOCK_VAL Calculated | STOCK_VAL Realised |
---|---|---|
2022-03-16T23:00:00Z | 100.00 | 160.00 |
2022-03-17T23:00:00Z | 120.00 | 120.00 |
2022-03-18T23:00:00Z | 140.00 | 80.00 |
I have tried using pandas.json_normalize() but failed to extract the table as I want it to be made in an efficient manner.
Thanks in advance for anyone who knows better!
CodePudding user response:
One of the strings you replaced with '---'
is relevant after all.
First we find the array where the data is located. Each item of this array should be a series, from which we can build a dataframe.
import pandas as pd
table_data = data['SPEC']['RELEVANT_AFTER_ALL']
x = pd.DataFrame({
f"STOCK_VAL {item['Name']}": pd.DataFrame(item['Entries']).set_index('Timestamp').squeeze()
for item in table_data
})
EDIT: Replaced pd.json_normalize
with pd.DataFrame
, which suffices in this scenario.
EDIT 2: Added STOCK_VAL
to the column names.