I have a problem with a "for loop" using a dataframe in pandas, hope somebody can help with that.
I have the following dataframe in a csv file:
,forename,surname,gender,age,100m,200m,400m,800m,1500m
0,Migdalia,Parrish,F,18,11.08,29.0,59.41,122.05,259.11
1,Valerie,Lee,F,10,17.23,46.0,100.02,232.64,480.95
2,John,Debnam,M,17,10.81,25.89,50.6,110.29,232.39
3,Roy,Miller,M,10,19.18,46.74,95.32,201.14,430.27
4,Aida,Aumiller,F,11,15.3,41.83,81.06,189.03,394.9
5,Marcia,Brown,F,19,11.13,24.62,57.59,119.13,256.37
6,Harry,Knows,M,16,12.39,25.94,49.67,106.56,237.14
7,Barry,Lennon,M,14,11.15,23.56,46.46,110.89,230.49
8,Lilia,Armstrong,F,13,8.84,25.09,59.54,128.95,258.47
9,Johnny,Casey,M,15,9.65,22.67,49.46,112.85,233.87
10,Donald,Taylor,M,15,11.74,22.42,49.22,114.62,224.63
11,Martha,Woods,F,14,9.01,24.34,55.25,118.8,254.87
12,Diane,Lauria,F,15,8.99,27.92,54.79,119.89,249.21
13,Yvonne,Pumphrey,F,16,8.84,27.29,57.63,123.13,247.41
14,Betty,Stephenson,F,14,11.04,28.73,59.05,126.29,256.44
15,Lilia,Armstrong,F,12,11.31,34.43,74.28,150.05,321.07
And I have to create a main function that calls another function that, using a "for loop", retrieves the athlete with the fastest time for each age (10,11,12,13,14,15,16) for a specific gender (e.g. 'F') and distance (e.g. '100m').
For example:
Input:
fastest_athletes = find_fastest_athletes(df,"100m","F",[10,11,12,13,14,15,16])
Output:
{
10: {’forename’: 'Valerie’, 'surname’: 'Lee’, 'time’: '17.23’},
11: {’forename’: 'Aida’, 'surname’: 'Aumiller’, 'time’: '15.3’},
12: {’forename’: 'Lilia’, 'surname’: 'Armstrong’, 'time’: '11.31’},
13: {’forename’: 'Lilia’, 'surname’: 'Armstrong’, 'time’: '8.84’},
14: {’forename’: 'Martha’, 'surname’: 'Woods’, 'time’: '9.01’},
15: {’forename’: 'Diane’, 'surname’: 'Lauria’, 'time’: '8.99’},
16: {’forename’: 'Yvonne’, 'surname’: 'Pumphrey’, 'time’: '8.84’}
}
I did the following code:
# Function with the for loop
def find_fastest_athletes(df,distance,gender,ages):
for age in range(10,16):
fastest_athletes = df[(df["gender"] == gender) & (df["age"] == age)]
fastest_athletes_sorted = fastest_athletes.sort_values(distance,ascending=True)
fastest_athletes_value = fastest_athletes_sorted.iloc[[0]][["forename","surname","100m"]]
athletes_data = fastest_athletes_value.to_string(index=False, header=False).split(' ')
athletes_data_dict = {
'forename': athletes_data[0].strip(),
'surname': athletes_data[1],
'time': float(athletes_data[2])
}
return athletes_data_dict
# Main function
def main(filename='athletes.csv'):
df = pd.read_csv(filename, index_col=0)
df['100m'] = df['100m'].astype(float)
print(find_fastest_athletes(df,'100m','F',[10,11,12,13,14,15,16]))
return
if __name__ == "__main__":
main()
With my coding I get as output ONLY the fastest athlete for the last age (16 year's old) and not ALL the fastest athletes for each age (10,11,12,13,14,15,16), why is that?
Also how can I add the age at the beginning of each line?
CodePudding user response:
The simplest answer is that you should not be doing your complicated function when what you want is just a groupby away. I don't know why your code doesn't work.
# Melt into a long-form dataframe
df_long = df.melt(id_vars=['forename', 'surname', 'gender', 'age'], var_name='event', value_name='time')
# Get the indices for maximum times in each gender/age/event and then use those indices to get the original rows
df_fastest = df_long.iloc[df_long.groupby(['gender', 'age', 'event'])['time'].idxmax(), :]
For one event:
>>> df_fastest.loc[df_fastest['event'] == '100m', :]
forename surname gender age event time
1 Valerie Lee F 10 100m 17.23
4 Aida Aumiller F 11 100m 15.30
15 Lilia Armstrong F 12 100m 11.31
...
CodePudding user response:
I think the reason that your code only return the fastest athlete for the last age is because you keep overriding the variable athletes_data_dict-
def find_fastest_athletes(df,distance,gender,ages):
tmp = []
for age in ages:
fastest_athletes = df.loc[(df.gender == gender) & (df.age == age)]
fastest_athletes_sorted = fastest_athletes.sort_values(distance,ascending=True)
fastest_athletes_value = fastest_athletes_sorted.iloc[[0]][["forename","surname","100m"]]
athletes_data = fastest_athletes_value.to_string(index=False, header=False).split(' ')
athletes_data_dict = {
'forename': athletes_data[0].strip(),
'surname': athletes_data[1],
'time': float(athletes_data[2])
}
tmp.append(athletes_data_dict)
return tmp
i changed the slicing of your dataframe as well, and the range in the loop. so i used a list to store all the values, works well on my end.