I'm trying to read a json file and put it into a dataframe:
#read JSON from Web Service...
raw_data = pd.read_json('https://ergast.com/api/f1/1996/1/laps/1.json')
Just after reading the complex file, we normalize it:
#normalize JSON
data = pd.json_normalize(raw_data['MRData']['RaceTable']['Races'])
This give us the following result:
(Not quite what we want, as we'd like to have each value of column Laps as a separated column and at the same level of the others, like Circuit.columns):
index | season | round | url | raceName | date | Laps | Circuit.circuitId | Circuit.url | Circuit.circuitName | Circuit.Location.lat | Circuit.Location.long | Circuit.Location.locality | Circuit.Location.country |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1996 | 1 | http://en.wikipedia.org/wiki/1996_Australian_Grand_Prix | Australian Grand Prix | 1996-03-10 | {'number': '1', 'Timings': [{'driverId': 'villeneuve', 'position': '1', 'time': '1:43.702'}, {'driverId': 'damon_hill', 'position': '2', 'time': '1:44.243'}, {'driverId': 'irvine', 'position': '3', 'time': '1:44.981'}, {'driverId': 'michael_schumacher', 'position': '4', 'time': '1:45.188'}, {'driverId': 'alesi', 'position': '5', 'time': '1:46.506'}, {'driverId': 'hakkinen', 'position': '6', 'time': '1:48.430'}, {'driverId': 'barrichello', 'position': '7', 'time': '1:49.058'}, {'driverId': 'frentzen', 'position': '8', 'time': '1:49.468'}, {'driverId': 'berger', 'position': '9', 'time': '1:49.726'}, {'driverId': 'salo', 'position': '10', 'time': '1:50.436'}, {'driverId': 'verstappen', 'position': '11', 'time': '1:50.930'}, {'driverId': 'panis', 'position': '12', 'time': '1:51.529'}, {'driverId': 'fisichella', 'position': '13', 'time': '1:52.690'}, {'driverId': 'katayama', 'position': '14', 'time': '1:53.435'}, {'driverId': 'lamy', 'position': '15', 'time': '1:54.047'}, {'driverId': 'rosset', 'position': '16', 'time': '1:54.831'}, {'driverId': 'diniz', 'position': '17', 'time': '1:56.143'}, {'driverId': 'brundle', 'position': '18', 'time': '1:56.481'}, {'driverId': 'coulthard', 'position': '19', 'time': '1:56.926'}]} | albert_park | http://en.wikipedia.org/wiki/Melbourne_Grand_Prix_Circuit | Albert Park Grand Prix Circuit | -37.8497 | 144.968 | Melbourne | Australia |
I was able to extract the values of the column Laps using:
data = pd.json_normalize(raw_data['MRData']['RaceTable']['Races'][0]['Laps'], record_path = 'Timings', meta = ['number'])
The result is:
index | driverId | position | time | number |
---|---|---|---|---|
0 | villeneuve | 1 | 1:43.702 | 1 |
1 | damon_hill | 2 | 1:44.243 | 1 |
2 | irvine | 3 | 1:44.981 | 1 |
3 | michael_schumacher | 4 | 1:45.188 | 1 |
4 | alesi | 5 | 1:46.506 | 1 |
5 | hakkinen | 6 | 1:48.430 | 1 |
6 | barrichello | 7 | 1:49.058 | 1 |
7 | frentzen | 8 | 1:49.468 | 1 |
8 | berger | 9 | 1:49.726 | 1 |
9 | salo | 10 | 1:50.436 | 1 |
10 | verstappen | 11 | 1:50.930 | 1 |
11 | panis | 12 | 1:51.529 | 1 |
12 | fisichella | 13 | 1:52.690 | 1 |
13 | katayama | 14 | 1:53.435 | 1 |
14 | lamy | 15 | 1:54.047 | 1 |
15 | rosset | 16 | 1:54.831 | 1 |
16 | diniz | 17 | 1:56.143 | 1 |
17 | brundle | 18 | 1:56.481 | 1 |
18 | coulthard | 19 | 1:56.926 | 1 |
Would it be possible to transform the raw_data in a away that in the end we have at least the following example:
season | round | number | driverId | position | time |
1996 | 1 | 1 | villeneuve | 1 | 1:43.702 |
1996 | 1 | 1 | damon_hill | 2 | 1:44.243 |
1996 | 1 | 1 | irvine | 3 | 1:44.981 |
at the same level?
### EDIT ###
Testing and tweaking the code of the answer I discovered that if we put 0 (zero) on the raw data link we have all results for all laps of a single race. So, based on @Tranbi's clever answer the answer would be:
raw_data = pd.read_json(f'https://ergast.com/api/f1/1996/1/laps/0.json?limit=1000')
df = pd.json_normalize(raw_data['MRData']['RaceTable']['Races'])
df_lap = pd.json_normalize(df['Laps'].values[0], record_path='Timings', meta='number')
df_lap[['season', 'round']] = df.loc[0, ['season', 'round']]
df_lap = df_lap[['season', 'round', 'number', 'driverId', 'position', 'time']]
df_lap
CodePudding user response:
From your URL, it seems you are retrieving laps one by one. I would therefore create a lap dataframe from the 'Laps' column and add 'season' and 'round' from the main df:
laps = []
# enclose the code below in loop(s) to retrieve the desired years/championship/laps
raw_data = pd.read_json('https://ergast.com/api/f1/1996/1/laps/1.json')
df = pd.json_normalize(raw_data['MRData']['RaceTable']['Races'])
df_lap = pd.json_normalize(df['Laps'].values[0], record_path='Timings', meta='number')
df_lap[['season', 'round']] = df.loc[0, ['season', 'round']]
df_lap = df_lap[['season', 'round', 'number', 'driverId', 'position', 'time']]
print(df_lap)
laps.append(df_lap)
# end of your loop
final_df = pd.concat(laps)