Home > OS >  Pandas Help - Aggregation
Pandas Help - Aggregation

Time:05-05

Here's a Pandas question: I've got a csv that has all athlete data for all of the Olympics, including whether they scored a medal for a specific event, the country they represented, and the year they participated, along with some other extraneous data. I want to reduce down my dataframe to being grouped by country, year, and medal count. (gold, silver, and bronze all counting equally). Right now I'm doing: df_medal = df.groupby(['NOC', 'Year']).agg({'Medal':'count'}).reset_index() but this doesn't seem to be quite right- I'm getting a medal count way higher than should be possible, ex: getting that the USSR scored 496 medals in 1980. Thoughts? It could be that the data set itself is wrong, but I'm a novice with Pandas so I didn't want to arrive to that conclusion before someone checked my work.

The dataset I'm looking at is available here: https://www.kaggle.com/datasets/heesoo37/120-years-of-olympic-history-athletes-and-results

Note: If medals are not earned, they are marked as "NaN".

CodePudding user response:

Your code is right. You can control yourself by trying to get data in a different way. Instead of grouping you just search specifically for that case you mentioned. Therefore you can do:

df.loc[(df['NOC']=='URS') & (df['Year']==1980),'Medal'].size #789 elements with Nan
df.loc[(df['NOC']=='URS') & (df['Year']==1980),'Medal'].dropna().size #496 elements without Nan

That's excactly the number you got as a result. If you want you can also check for each medal

df.loc[(df['NOC']=='URS') & (df['Year']==1980),'Medal'].value_counts()

You will see how many gold, silver and bronze they got.

If your aggregating on the whole data is failing or you doubt about it, just go for one specific case and check it "manually"

  • Related