I am having trouble with getting distinct values from my dataframe.. Below is the code i currently use, in line 25(last line here) is the issue: I would like to show the top 10 fastest drivers based on their average heat(go-kart heat) time.
Input:
HeatNumber,NumberOfKarts,KartNumber,DriverName,Laptime
334,11,5,Monique,00:53.862
334,11,5,Monique,00:59.070
334,11,5,Monique,00:47.832
334,11,5,Monique,00:47.213
334,11,5,Monique,00:51.975
334,11,5,Monique,00:46.423
334,11,5,Monique,00:49.539
334,11,5,Monique,00:49.935
334,11,5,Monique,00:45.267
334,11,12,Robert-Jan,00:55.606
334,11,12,Robert-Jan,00:52.249
334,11,12,Robert-Jan,00:50.965
334,11,12,Robert-Jan,00:53.878
334,11,12,Robert-Jan,00:48.802
334,11,12,Robert-Jan,00:48.766
334,11,12,Robert-Jan,00:46.003
334,11,12,Robert-Jan,00:46.257
334,11,12,Robert-Jan,00:47.334
334,11,20,Katja,00:56.222
334,11,20,Katja,01:01.005
334,11,20,Katja,00:50.296
334,11,20,Katja,00:48.004
334,11,20,Katja,00:51.203
334,11,20,Katja,00:47.672
334,11,20,Katja,00:50.243
334,11,20,Katja,00:50.453
334,11,20,Katja,01:06.192
334,11,13,Bensu,00:56.332
334,11,13,Bensu,00:54.550
334,11,13,Bensu,00:52.023
334,11,13,Bensu,00:52.518
334,11,13,Bensu,00:50.738
334,11,13,Bensu,00:50.359
334,11,13,Bensu,00:49.307
334,11,13,Bensu,00:49.595
334,11,13,Bensu,00:50.504
334,11,17,Marit,00:56.740
334,11,17,Marit,00:52.534
334,11,17,Marit,00:48.331
334,11,17,Marit,00:56.204
334,11,17,Marit,00:49.066
334,11,17,Marit,00:49.210
334,11,17,Marit,00:45.655
334,11,17,Marit,00:46.261
334,11,17,Marit,00:46.837
334,11,11,Niels,00:58.518
334,11,11,Niels,01:01.562
334,11,11,Niels,00:51.238
334,11,11,Niels,00:48.808
Code:
import pandas as pd
import matplotlib.pyplot as plt
#Data
df = pd.read_csv('dataset_kartanalyser.csv')
df = df.dropna(axis=0, how='any')
df = df.join(df['Laptime'].str.split(':', 1, expand=True).rename(columns={0:'M', 1:'S'}))
df['M'] = df['M'].astype(int)
df['S'] = df['S'].astype(float)
df['Laptime'] = (df['M'] * 60) df['S']
df.drop(['M', 'S'], axis=1, inplace=True)
#Funties
def twee():
print("Het totaal aantal karts = " str(df['KartNumber'].nunique()))
print("Het aantal unique drivers = " str(df['DriverName'].nunique()))
print("Het totaal aantal heats = " str(df['HeatNumber'].nunique()))
def drie():
print("De 10 snelste Drivers obv individuele tijd zijn: ")
print((df.groupby('DriverName')['Laptime'].nsmallest(1)).nsmallest(10))
def vier():
print('De 10 snelste Drivers obv snelste heat gemiddelde:')
print((df.groupby(['DriverName', 'HeatNumber'])['Laptime'].mean().round(3)).nsmallest(10))
print(df)
HeatNumber NumberOfKarts KartNumber DriverName Laptime
0 334 11 5 Monique 53.862
1 334 11 5 Monique 59.070
2 334 11 5 Monique 47.832
3 334 11 5 Monique 47.213
4 334 11 5 Monique 51.975
... ... ... ... ... ...
4053 437 2 20 luuk 39.678
4054 437 2 20 luuk 39.872
4055 437 2 20 luuk 39.454
4056 437 2 20 luuk 39.575
4057 437 2 20 luuk 39.648
Output:
DriverName HeatNumber
giovanni 411 26.233
ryan 411 27.747
giovanni 408 27.938
papa 394 28.075
guus 406 28.998
Rob 427 29.371
Suus 427 29.416
Jan-jullius 394 29.428
Joep 427 29.934
Indy 423 29.991
The output i get i almost correct, expect that the driver "giovanni" occurs twice. I would like to only show the fastest avg heat time for each driver. Anyone who know how to do this?
CodePudding user response:
ok so add drop_duplication on a column like this just need to add sort as well
df.sort_values('B', ascending=True)
.drop_duplicates('A', keep='first')
(df.groupby(['DriverName', 'HeatNumber'])['Laptime'].mean().round(3)sort_values('Laptime', ascending=True).drop_duplicates('DriverName', keep='first')).nsmallest(10))
CodePudding user response:
You group the datas by Drivername and HeatNumber. See the HeatNumbers, one of them is 411 and another is 408. Because of that pandas understand they are exactly different. If you equals them, they will be one.