Home > Mobile >  Flattening nested JSON with pandas
Flattening nested JSON with pandas

Time:08-11

I'm trying to read a json file and put it into a dataframe:

#read JSON from Web Service...
raw_data = pd.read_json('https://ergast.com/api/f1/1996/1/laps/1.json')

Just after reading the complex file, we normalize it:

#normalize JSON
data = pd.json_normalize(raw_data['MRData']['RaceTable']['Races'])

This give us the following result:
(Not quite what we want, as we'd like to have each value of column Laps as a separated column and at the same level of the others, like Circuit.columns):

index season round url raceName date Laps Circuit.circuitId Circuit.url Circuit.circuitName Circuit.Location.lat Circuit.Location.long Circuit.Location.locality Circuit.Location.country
0 1996 1 http://en.wikipedia.org/wiki/1996_Australian_Grand_Prix Australian Grand Prix 1996-03-10 {'number': '1', 'Timings': [{'driverId': 'villeneuve', 'position': '1', 'time': '1:43.702'}, {'driverId': 'damon_hill', 'position': '2', 'time': '1:44.243'}, {'driverId': 'irvine', 'position': '3', 'time': '1:44.981'}, {'driverId': 'michael_schumacher', 'position': '4', 'time': '1:45.188'}, {'driverId': 'alesi', 'position': '5', 'time': '1:46.506'}, {'driverId': 'hakkinen', 'position': '6', 'time': '1:48.430'}, {'driverId': 'barrichello', 'position': '7', 'time': '1:49.058'}, {'driverId': 'frentzen', 'position': '8', 'time': '1:49.468'}, {'driverId': 'berger', 'position': '9', 'time': '1:49.726'}, {'driverId': 'salo', 'position': '10', 'time': '1:50.436'}, {'driverId': 'verstappen', 'position': '11', 'time': '1:50.930'}, {'driverId': 'panis', 'position': '12', 'time': '1:51.529'}, {'driverId': 'fisichella', 'position': '13', 'time': '1:52.690'}, {'driverId': 'katayama', 'position': '14', 'time': '1:53.435'}, {'driverId': 'lamy', 'position': '15', 'time': '1:54.047'}, {'driverId': 'rosset', 'position': '16', 'time': '1:54.831'}, {'driverId': 'diniz', 'position': '17', 'time': '1:56.143'}, {'driverId': 'brundle', 'position': '18', 'time': '1:56.481'}, {'driverId': 'coulthard', 'position': '19', 'time': '1:56.926'}]} albert_park http://en.wikipedia.org/wiki/Melbourne_Grand_Prix_Circuit Albert Park Grand Prix Circuit -37.8497 144.968 Melbourne Australia

I was able to extract the values of the column Laps using:

data = pd.json_normalize(raw_data['MRData']['RaceTable']['Races'][0]['Laps'], record_path = 'Timings', meta = ['number'])

The result is:

index driverId position time number
0 villeneuve 1 1:43.702 1
1 damon_hill 2 1:44.243 1
2 irvine 3 1:44.981 1
3 michael_schumacher 4 1:45.188 1
4 alesi 5 1:46.506 1
5 hakkinen 6 1:48.430 1
6 barrichello 7 1:49.058 1
7 frentzen 8 1:49.468 1
8 berger 9 1:49.726 1
9 salo 10 1:50.436 1
10 verstappen 11 1:50.930 1
11 panis 12 1:51.529 1
12 fisichella 13 1:52.690 1
13 katayama 14 1:53.435 1
14 lamy 15 1:54.047 1
15 rosset 16 1:54.831 1
16 diniz 17 1:56.143 1
17 brundle 18 1:56.481 1
18 coulthard 19 1:56.926 1

Would it be possible to transform the raw_data in a away that in the end we have at least the following example:

season | round | number |   driverId | position |     time |  
  1996 |     1 |      1 | villeneuve |        1 | 1:43.702 |  
  1996 |     1 |      1 | damon_hill |        2 | 1:44.243 |  
  1996 |     1 |      1 |     irvine |        3 | 1:44.981 |  

at the same level?

### EDIT ###

Testing and tweaking the code of the answer I discovered that if we put 0 (zero) on the raw data link we have all results for all laps of a single race. So, based on @Tranbi's clever answer the answer would be:

raw_data = pd.read_json(f'https://ergast.com/api/f1/1996/1/laps/0.json?limit=1000')
df = pd.json_normalize(raw_data['MRData']['RaceTable']['Races'])

df_lap = pd.json_normalize(df['Laps'].values[0], record_path='Timings', meta='number')
df_lap[['season', 'round']] = df.loc[0, ['season', 'round']]
df_lap = df_lap[['season', 'round', 'number', 'driverId', 'position', 'time']]

df_lap

CodePudding user response:

From your URL, it seems you are retrieving laps one by one. I would therefore create a lap dataframe from the 'Laps' column and add 'season' and 'round' from the main df:

laps = []
# enclose the code below in loop(s) to retrieve the desired years/championship/laps
raw_data = pd.read_json('https://ergast.com/api/f1/1996/1/laps/1.json')
df = pd.json_normalize(raw_data['MRData']['RaceTable']['Races'])

df_lap = pd.json_normalize(df['Laps'].values[0], record_path='Timings', meta='number')
df_lap[['season', 'round']] = df.loc[0, ['season', 'round']]
df_lap = df_lap[['season', 'round', 'number', 'driverId', 'position', 'time']]
print(df_lap)
laps.append(df_lap)
# end of your loop

final_df = pd.concat(laps)
  • Related