How to get distinct rows from pandas dataframe?-CodePudding

I am having trouble with getting distinct values from my dataframe.. Below is the code i currently use, in line 25(last line here) is the issue: I would like to show the top 10 fastest drivers based on their average heat(go-kart heat) time.

Input:

HeatNumber,NumberOfKarts,KartNumber,DriverName,Laptime
334,11,5,Monique,00:53.862
334,11,5,Monique,00:59.070
334,11,5,Monique,00:47.832
334,11,5,Monique,00:47.213
334,11,5,Monique,00:51.975
334,11,5,Monique,00:46.423
334,11,5,Monique,00:49.539
334,11,5,Monique,00:49.935
334,11,5,Monique,00:45.267
334,11,12,Robert-Jan,00:55.606
334,11,12,Robert-Jan,00:52.249
334,11,12,Robert-Jan,00:50.965
334,11,12,Robert-Jan,00:53.878
334,11,12,Robert-Jan,00:48.802
334,11,12,Robert-Jan,00:48.766
334,11,12,Robert-Jan,00:46.003
334,11,12,Robert-Jan,00:46.257
334,11,12,Robert-Jan,00:47.334
334,11,20,Katja,00:56.222
334,11,20,Katja,01:01.005
334,11,20,Katja,00:50.296
334,11,20,Katja,00:48.004
334,11,20,Katja,00:51.203
334,11,20,Katja,00:47.672
334,11,20,Katja,00:50.243
334,11,20,Katja,00:50.453
334,11,20,Katja,01:06.192
334,11,13,Bensu,00:56.332
334,11,13,Bensu,00:54.550
334,11,13,Bensu,00:52.023
334,11,13,Bensu,00:52.518
334,11,13,Bensu,00:50.738
334,11,13,Bensu,00:50.359
334,11,13,Bensu,00:49.307
334,11,13,Bensu,00:49.595
334,11,13,Bensu,00:50.504
334,11,17,Marit,00:56.740
334,11,17,Marit,00:52.534
334,11,17,Marit,00:48.331
334,11,17,Marit,00:56.204
334,11,17,Marit,00:49.066
334,11,17,Marit,00:49.210
334,11,17,Marit,00:45.655
334,11,17,Marit,00:46.261
334,11,17,Marit,00:46.837
334,11,11,Niels,00:58.518
334,11,11,Niels,01:01.562
334,11,11,Niels,00:51.238
334,11,11,Niels,00:48.808

Code:

import pandas as pd
import matplotlib.pyplot as plt

#Data
df = pd.read_csv('dataset_kartanalyser.csv')
df = df.dropna(axis=0, how='any')
df = df.join(df['Laptime'].str.split(':', 1, expand=True).rename(columns={0:'M', 1:'S'}))
df['M'] = df['M'].astype(int)
df['S'] = df['S'].astype(float)
df['Laptime'] = (df['M'] * 60)   df['S']
df.drop(['M', 'S'], axis=1, inplace=True)

#Funties
def twee():
    print("Het totaal aantal karts = "   str(df['KartNumber'].nunique())) 
    print("Het aantal unique drivers = "   str(df['DriverName'].nunique()))
    print("Het totaal aantal heats = "   str(df['HeatNumber'].nunique())) 

def drie():
    print("De 10 snelste Drivers obv individuele tijd zijn: ")
    print((df.groupby('DriverName')['Laptime'].nsmallest(1)).nsmallest(10))

def vier():
    print('De 10 snelste Drivers obv snelste heat gemiddelde:')
    print((df.groupby(['DriverName', 'HeatNumber'])['Laptime'].mean().round(3)).nsmallest(10))

print(df)
     HeatNumber  NumberOfKarts KartNumber DriverName  Laptime
0           334             11          5    Monique   53.862
1           334             11          5    Monique   59.070
2           334             11          5    Monique   47.832
3           334             11          5    Monique   47.213
4           334             11          5    Monique   51.975
...         ...            ...        ...        ...      ...
4053        437              2         20       luuk   39.678
4054        437              2         20       luuk   39.872
4055        437              2         20       luuk   39.454
4056        437              2         20       luuk   39.575
4057        437              2         20       luuk   39.648

Output:

DriverName   HeatNumber
giovanni     411           26.233
ryan         411           27.747
giovanni     408           27.938
papa         394           28.075
guus         406           28.998
Rob          427           29.371
Suus         427           29.416
Jan-jullius  394           29.428
Joep         427           29.934
Indy         423           29.991

The output i get i almost correct, expect that the driver "giovanni" occurs twice. I would like to only show the fastest avg heat time for each driver. Anyone who know how to do this?

CodePudding user response：

ok so add drop_duplication on a column like this just need to add sort as well df.sort_values('B', ascending=True)
.drop_duplicates('A', keep='first')

(df.groupby(['DriverName', 'HeatNumber'])['Laptime'].mean().round(3)sort_values('Laptime', ascending=True).drop_duplicates('DriverName', keep='first')).nsmallest(10))

CodePudding user response：

You group the datas by Drivername and HeatNumber. See the HeatNumbers, one of them is 411 and another is 408. Because of that pandas understand they are exactly different. If you equals them, they will be one.