I'd like to figure out the percentile of each row of a dataframe within a given group. For a toy example, consider this dataset of athletes from different sports.
import pandas as pd

df = pd.DataFrame({"name": ["Joe", "Bob", "Susan", "Kate", "Sam", "Shawn"],
                   "sport": ["hockey", "hockey", "hockey", "baseball", "baseball", "baseball"],
                   "points": [1, 2, 3, 1, 4, 9]})
I want to compare the scoring activity of each athlete to athletes from the same sport. It wouldn't be fair to compare baseball and hockey players directly, so I want to see where each hockey player falls compared only to other hockey players. Here's the desired output.
pd.DataFrame({"name": ["Joe", "Bob", "Susan", "Kate", "Sam", "Shawn"],
"sport": ["hockey", "hockey", "hockey", "baseball", "baseball", "baseball"],
"points": [1,2,3,1,4,9],
"percentile": [0,.5,1,0,.5,1]})
My real dataset has thousands of groups and hundreds of thousands of rows.
CodePudding user response:
df['percentile'] = df.groupby(['sport'])['points'].rank(pct=True)
print(df)
Output:
    name     sport  points  percentile
0    Joe    hockey       1    0.333333
1    Bob    hockey       2    0.666667
2  Susan    hockey       3    1.000000
3   Kate  baseball       1    0.333333
4    Sam  baseball       4    0.666667
5  Shawn  baseball       9    1.000000
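For reference, `rank(pct=True)` divides each within-group rank by the group size, which is why the lowest scorer in a group of three gets 1/3 rather than 0. Here is a minimal sketch that checks this on the toy data. The names `grouped`, `ranks`, `sizes`, `manual_pct` and `builtin_pct` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Joe", "Bob", "Susan", "Kate", "Sam", "Shawn"],
    "sport": ["hockey", "hockey", "hockey", "baseball", "baseball", "baseball"],
    "points": [1, 2, 3, 1, 4, 9],
})

grouped = df.groupby("sport")["points"]

ranks = grouped.rank()              # average rank within each sport, starting at 1
sizes = grouped.transform("count")  # number of non-null points in each sport

manual_pct = ranks / sizes            # rank divided by group size
builtin_pct = grouped.rank(pct=True)  # what the one-liner above computes
```

On this data the two series agree up to floating-point rounding.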
CodePudding user response:
To get the expected output, use groupby.rank with rescaling:
For a fixed group size of 3:
df['percentile'] = (df.groupby('sport')['points']
                      .rank(pct=True)
                      .sub(1/3).mul(3/2)
                   )
For arbitrary group sizes:
# transform keeps the result aligned with df's original index
df['percentile'] = (df.groupby('sport')['points']
                      .transform(lambda g: g.rank(pct=True)
                                            .sub(1/len(g))
                                            .mul(len(g)/(len(g)-1) if len(g)>1 else 0))
                   )
Output:
    name     sport  points  percentile
0    Joe    hockey       1         0.0
1    Bob    hockey       2         0.5
2  Susan    hockey       3         1.0
3   Kate  baseball       1         0.0
4    Sam  baseball       4         0.5
5  Shawn  baseball       9         1.0
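Since the real dataset has thousands of groups, it may be worth avoiding a Python lambda per group. The rescaling above simplifies to (rank - 1) / (n - 1), which can be written with built-in groupby operations only. This is a sketch under that assumption, not a benchmarked solution. Filling the 0/0 result for single-member groups with 0 mirrors the `else 0` branch above.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Joe", "Bob", "Susan", "Kate", "Sam", "Shawn"],
    "sport": ["hockey", "hockey", "hockey", "baseball", "baseball", "baseball"],
    "points": [1, 2, 3, 1, 4, 9],
})

grouped = df.groupby("sport")["points"]

rank = grouped.rank()               # 1-based average rank within each sport
count = grouped.transform("count")  # group size, broadcast to every row

# (rank - 1) / (n - 1) maps the lowest rank to 0 and the highest to 1.
# A group with a single row yields 0/0 = NaN, which is filled with 0 here.
# Note that fillna(0) would also fill rows whose points are missing.
df["percentile"] = ((rank - 1) / (count - 1)).fillna(0)
```

On the toy data this gives 0.0, 0.5, 1.0 within each sport, the same as the lambda version.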
CodePudding user response:
As far as I know, pandas has no direct method for calculating group-wise percentiles. This answer suggests using the rank method with pct=True to return percentile ranks. In combination with groupby, you get:
df.groupby("sport")["points"].rank(pct=True)
resulting in
0 0.333333
1 0.666667
2 1.000000
3 0.333333
4 0.666667
5 1.000000
To assign that as a column, do:
df['percentile'] = df.groupby("sport")["points"].rank(pct=True)
CodePudding user response:
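Try this, which scales points by the group maximum rather than by rank (so it assumes each group's minimum is 1):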
df['percentile'] = df['points'].sub(1) / df['points'].groupby(df['sport']).transform('max').sub(1)
Output:
>>> df
    name     sport  points  percentile
0    Joe    hockey       1       0.000
1    Bob    hockey       2       0.500
2  Susan    hockey       3       1.000
3   Kate  baseball       1       0.000
4    Sam  baseball       4       0.375
5  Shawn  baseball       9       1.000
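The expression above subtracts a hard-coded 1, which only works when every group's lowest score is 1. A more general variant subtracts the actual group minimum instead, which is ordinary min-max scaling per group. This is a sketch only, and the column name `scaled` is made up for illustration. It measures position by value, not by rank, so Sam still gets 0.375 rather than the 0.5 asked for in the question.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Joe", "Bob", "Susan", "Kate", "Sam", "Shawn"],
    "sport": ["hockey", "hockey", "hockey", "baseball", "baseball", "baseball"],
    "points": [1, 2, 3, 1, 4, 9],
})

grouped = df.groupby("sport")["points"]
lo = grouped.transform("min")  # lowest score in each athlete's sport
hi = grouped.transform("max")  # highest score in each athlete's sport

# Min-max scaling within each group; a group where hi == lo yields NaN.
df["scaled"] = (df["points"] - lo) / (hi - lo)
```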