for loop in dataframe in pandas-CodePudding

I have a problem with a "for loop" using a dataframe in pandas, hope somebody can help with that.

I have the following dataframe in a csv file:

,forename,surname,gender,age,100m,200m,400m,800m,1500m
0,Migdalia,Parrish,F,18,11.08,29.0,59.41,122.05,259.11
1,Valerie,Lee,F,10,17.23,46.0,100.02,232.64,480.95
2,John,Debnam,M,17,10.81,25.89,50.6,110.29,232.39
3,Roy,Miller,M,10,19.18,46.74,95.32,201.14,430.27
4,Aida,Aumiller,F,11,15.3,41.83,81.06,189.03,394.9
5,Marcia,Brown,F,19,11.13,24.62,57.59,119.13,256.37
6,Harry,Knows,M,16,12.39,25.94,49.67,106.56,237.14
7,Barry,Lennon,M,14,11.15,23.56,46.46,110.89,230.49
8,Lilia,Armstrong,F,13,8.84,25.09,59.54,128.95,258.47
9,Johnny,Casey,M,15,9.65,22.67,49.46,112.85,233.87
10,Donald,Taylor,M,15,11.74,22.42,49.22,114.62,224.63
11,Martha,Woods,F,14,9.01,24.34,55.25,118.8,254.87
12,Diane,Lauria,F,15,8.99,27.92,54.79,119.89,249.21
13,Yvonne,Pumphrey,F,16,8.84,27.29,57.63,123.13,247.41
14,Betty,Stephenson,F,14,11.04,28.73,59.05,126.29,256.44
15,Lilia,Armstrong,F,12,11.31,34.43,74.28,150.05,321.07

And I have to create a main function that calls another function that, using a "for loop", retrieves the athlete with the fastest time for each age (10,11,12,13,14,15,16) for a specific gender (e.g. 'F') and distance (e.g. '100m').

For example:

Input:
fastest_athletes = find_fastest_athletes(df,"100m","F",[10,11,12,13,14,15,16])
Output:
{
10: {’forename’: 'Valerie’, 'surname’: 'Lee’, 'time’: '17.23’},
11: {’forename’: 'Aida’, 'surname’: 'Aumiller’, 'time’: '15.3’},
12: {’forename’: 'Lilia’, 'surname’: 'Armstrong’, 'time’: '11.31’},
13: {’forename’: 'Lilia’, 'surname’: 'Armstrong’, 'time’: '8.84’},
14: {’forename’: 'Martha’, 'surname’: 'Woods’, 'time’: '9.01’},
15: {’forename’: 'Diane’, 'surname’: 'Lauria’, 'time’: '8.99’},
16: {’forename’: 'Yvonne’, 'surname’: 'Pumphrey’, 'time’: '8.84’}
}

I did the following code:

# Function with the for loop
def find_fastest_athletes(df,distance,gender,ages):
  for age in range(10,16):
    fastest_athletes = df[(df["gender"] == gender) & (df["age"] == age)]
    fastest_athletes_sorted = fastest_athletes.sort_values(distance,ascending=True)
    fastest_athletes_value = fastest_athletes_sorted.iloc[[0]][["forename","surname","100m"]]
    athletes_data = fastest_athletes_value.to_string(index=False, header=False).split('  ')
    athletes_data_dict = {
        'forename': athletes_data[0].strip(),
        'surname': athletes_data[1],
        'time': float(athletes_data[2])
        }
  return athletes_data_dict
  
# Main function
def main(filename='athletes.csv'):
    df = pd.read_csv(filename, index_col=0)
    df['100m'] = df['100m'].astype(float)
    print(find_fastest_athletes(df,'100m','F',[10,11,12,13,14,15,16]))
    return
   
if __name__ == "__main__":
  main()

With my coding I get as output ONLY the fastest athlete for the last age (16 year's old) and not ALL the fastest athletes for each age (10,11,12,13,14,15,16), why is that?

Also how can I add the age at the beginning of each line?

CodePudding user response：

The simplest answer is that you should not be doing your complicated function when what you want is just a groupby away. I don't know why your code doesn't work.

# Melt into a long-form dataframe
df_long = df.melt(id_vars=['forename', 'surname', 'gender', 'age'], var_name='event', value_name='time')

# Get the indices for maximum times in each gender/age/event and then use those indices to get the original rows
df_fastest = df_long.iloc[df_long.groupby(['gender', 'age', 'event'])['time'].idxmax(), :]

For one event:

>>> df_fastest.loc[df_fastest['event'] == '100m', :]

    forename    surname gender  age event   time
1   Valerie Lee F   10  100m    17.23
4   Aida    Aumiller    F   11  100m    15.30
15  Lilia   Armstrong   F   12  100m    11.31
...

CodePudding user response：

I think the reason that your code only return the fastest athlete for the last age is because you keep overriding the variable athletes_data_dict-

 def find_fastest_athletes(df,distance,gender,ages):
   tmp = []
   for age in ages:
     fastest_athletes = df.loc[(df.gender == gender) & (df.age == age)]

     fastest_athletes_sorted = fastest_athletes.sort_values(distance,ascending=True)

     fastest_athletes_value = fastest_athletes_sorted.iloc[[0]][["forename","surname","100m"]]
     athletes_data = fastest_athletes_value.to_string(index=False, header=False).split('  ')
     athletes_data_dict = {
    'forename': athletes_data[0].strip(),
    'surname': athletes_data[1],
    'time': float(athletes_data[2])
    }
    tmp.append(athletes_data_dict)
return tmp

i changed the slicing of your dataframe as well, and the range in the loop. so i used a list to store all the values, works well on my end.