Home > front end >  Unpack nested json into a dataframe?
Unpack nested json into a dataframe?

Time:10-18

background

I am trying to access data from the following API. This is a huge nested json with nested dicts. I am trying to make more readable. the format of the json file is as per this link (right hand side, click expand all)

what I've tried

I've searched SO and other websites, pd.json_normalize seems to be the answer, but I've tried several ways, and it only ever unpacks one layer.

# Attempt 1
url = 'https://api.opennem.org.au/station/'
response = requests.get(url).json()
df2 = pd.json_normalize(response, max_level=0)
print(df2)

# Attempt 2
url = 'https://api.opennem.org.au/station/'
response = requests.get(url).json()
df = pd.json_normalize(response, record_path=['facilities'])
print(df)

current incorrect output

  version  ...                                               data
0  3.11.3  ...  [{'id': 488, 'code': 'ADP', 'name': 'Adelaide ...

[1 rows x 5 columns]

help requested

Anyone know how I can unpack this large nested json into a dataframe?

CodePudding user response:

You can use json_normalize with nested lists for normalize data and facilities:

df2 = pd.json_normalize(response, ['data',['facilities']])

print(df2.head(3))

    id  station_id        code dispatch_type  active  capacity_registered  \
0  689         488     ADPBA1L          LOAD    True                 6.27   
1  690         488     ADPBA1G     GENERATOR    True                 6.27   
2  516         372  ALBANY_WF1     GENERATOR    True                21.60   

  network_region  unit_number  unit_capacity  approved network.code  \
0            SA1          1.0           6.27      True          NEM   
1            SA1          1.0           6.27      True          NEM   
2            WEM          NaN            NaN      True          WEM   

  network.country network.label  \
0              au           NEM   
1              au           NEM   
2              au           WEM   

                                     network.regions  network.timezone  \
0  [{'code': 'NSW1'}, {'code': 'QLD1'}, {'code': ...  Australia/Sydney   
1  [{'code': 'NSW1'}, {'code': 'QLD1'}, {'code': ...  Australia/Sydney   
2                                  [{'code': 'WEM'}]   Australia/Perth   

  network.timezone_database  network.offset  network.interval_size  \
0                      AEST             600                      5   
1                      AEST             600                      5   
2                      AWST             480                     30   

   network.interval_shift  network.has_interconnectors  \
0                       5                        False   
1                       5                        False   
2                       0                        False   

   network.intervals_per_hour        fueltech.code         fueltech.label  \
0                        12.0     battery_charging     Battery (Charging)   
1                        12.0  battery_discharging  Battery (Discharging)   
2                         2.0                 wind                   Wind   

  fueltech.renewable status.code status.label           registered  \
0               True   committed    Committed                  NaN   
1               True   committed    Committed                  NaN   
2               True   operating    Operating  2018-10-12T00:00:00   

                        approved_at  emissions_factor_co2 approved_by  
0                               NaN                   NaN         NaN  
1                               NaN                   NaN         NaN  
2  2020-12-09T15:34:49.465445 00:00                   NaN         NaN  

Bonus:

If need also scalars in network.regions:

df2 = pd.json_normalize(response, ['data',['facilities']])
df2['network.regions'] = [[y['code'] for y in x] for x in df2['network.regions']]
df2 = df2.explode('network.regions').reset_index(drop=True)
  • Related