background
I am trying to access data from the following API. This is a huge nested json with nested dicts. I am trying to make more readable. the format of the json file is as per this link (right hand side, click expand all)
what I've tried
I've searched SO and other websites, pd.json_normalize seems to be the answer, but I've tried several ways, and it only ever unpacks one layer.
# Attempt 1
url = 'https://api.opennem.org.au/station/'
response = requests.get(url).json()
df2 = pd.json_normalize(response, max_level=0)
print(df2)
# Attempt 2
url = 'https://api.opennem.org.au/station/'
response = requests.get(url).json()
df = pd.json_normalize(response, record_path=['facilities'])
print(df)
current incorrect output
version ... data
0 3.11.3 ... [{'id': 488, 'code': 'ADP', 'name': 'Adelaide ...
[1 rows x 5 columns]
help requested
Anyone know how I can unpack this large nested json into a dataframe?
CodePudding user response:
You can use json_normalize
with nested lists for normalize data
and facilities
:
df2 = pd.json_normalize(response, ['data',['facilities']])
print(df2.head(3))
id station_id code dispatch_type active capacity_registered \
0 689 488 ADPBA1L LOAD True 6.27
1 690 488 ADPBA1G GENERATOR True 6.27
2 516 372 ALBANY_WF1 GENERATOR True 21.60
network_region unit_number unit_capacity approved network.code \
0 SA1 1.0 6.27 True NEM
1 SA1 1.0 6.27 True NEM
2 WEM NaN NaN True WEM
network.country network.label \
0 au NEM
1 au NEM
2 au WEM
network.regions network.timezone \
0 [{'code': 'NSW1'}, {'code': 'QLD1'}, {'code': ... Australia/Sydney
1 [{'code': 'NSW1'}, {'code': 'QLD1'}, {'code': ... Australia/Sydney
2 [{'code': 'WEM'}] Australia/Perth
network.timezone_database network.offset network.interval_size \
0 AEST 600 5
1 AEST 600 5
2 AWST 480 30
network.interval_shift network.has_interconnectors \
0 5 False
1 5 False
2 0 False
network.intervals_per_hour fueltech.code fueltech.label \
0 12.0 battery_charging Battery (Charging)
1 12.0 battery_discharging Battery (Discharging)
2 2.0 wind Wind
fueltech.renewable status.code status.label registered \
0 True committed Committed NaN
1 True committed Committed NaN
2 True operating Operating 2018-10-12T00:00:00
approved_at emissions_factor_co2 approved_by
0 NaN NaN NaN
1 NaN NaN NaN
2 2020-12-09T15:34:49.465445 00:00 NaN NaN
Bonus:
If need also scalars in network.regions
:
df2 = pd.json_normalize(response, ['data',['facilities']])
df2['network.regions'] = [[y['code'] for y in x] for x in df2['network.regions']]
df2 = df2.explode('network.regions').reset_index(drop=True)