I have the following dataframe of athletes competing in races.
pos country year round event athlete Battle
1 Jamaica 2012 100m Beijing BOLT 0
2 Jamaica 2012 100m Beijing BLAKE 0
4 USA 2012 100m Beijing GAY 0
3 USA 2012 100m Beijing JOHN 0
6 Jamaica 2012 200m Beijing BOLT 0
7 Jamaica 2012 200m Beijing BLAKE 0
3 USA 2012 200m Beijing GAY 0
4 USA 2012 200m Beijing JOHN 0
I would like to assign values to the 'Battle' column that denotes an athlete got a higher grid spot. Something like this:
pos country year round event athlete Battle
1 Jamaica 2012 100m Beijing BOLT 1
2 Jamaica 2012 100m Beijing BLAKE 0
4 USA 2012 100m Beijing GAY 0
3 USA 2012 100m Beijing JOHN 1
6 Jamaica 2012 200m Beijing BOLT 2
7 Jamaica 2012 200m Beijing BLAKE 0
3 USA 2012 200m Beijing GAY 1
4 USA 2012 200m Beijing JOHN 1
Bolt has a better position than Blake in both races so his battle
value after 2 rounds will be 2 compared to Blake's zero. Gay and John performed better in one of the 2 races, so they have a value of 1 and 1.
How would one compute the Battle
column in pandas?? I am not sure where to begin. I have looked at other questions which include my own e.g. Assigning value to pandas dataframe values for unique values in another column and Compare values in two different pandas columns but these comparisons and/or operations span over two columns, whereas my desired result forces me to consider multiple columns (year, round, event, country)
which must be common, and find unique values in the athlete
column and compare.
I have tried to isolate athletes by country
using df.groupby(['country', 'round', 'pos']).apply(lambda x: x['athlete'])
country round pos athlete
Jamaica 100m 1 BOLT
2 BLAKE
USA 100m 4 GAY
3 JOHN
Jamaica 200m 6 BOLT
7 BLAKE
USA 200m 3 GAY
4 JOHN
While this isolates the important info, I am still confused on how to compare their positions. Any help would be appreciated.
CodePudding user response:
IIUC, you can:
groupby
country and round, and rank the athletes to see how many other athletes they outperformed.groupby
athlete and sum the ranks.
df["rank"] = df.groupby(["country", "round"])["pos"].transform(pd.Series.rank, ascending=False, method="dense")-1
df["Battle"] = df.groupby("athlete")["rank"].transform(sum)
>>> df.drop("rank", axis=1)
pos country year round event athlete Battle
0 1 Jamaica 2012 100m Beijing BOLT 2.0
1 2 Jamaica 2012 100m Beijing BLAKE 0.0
2 4 USA 2012 100m Beijing GAY 1.0
3 3 USA 2012 100m Beijing JOHN 1.0
4 6 Jamaica 2012 200m Beijing BOLT 2.0
5 7 Jamaica 2012 200m Beijing BLAKE 0.0
6 3 USA 2012 200m Beijing GAY 1.0
7 4 USA 2012 200m Beijing JOHN 1.0
~~~