Add grouped percentiles to a pandas DataFrame

I'd like to figure out the percentile of each row of a dataframe within a given group. For a toy example, consider this dataset of athletes from different sports.

import pandas as pd

df = pd.DataFrame({"name": ["Joe", "Bob", "Susan", "Kate", "Sam", "Shawn"],
                   "sport": ["hockey", "hockey", "hockey", "baseball", "baseball", "baseball"],
                   "points": [1, 2, 3, 1, 4, 9]})

I want to compare the scoring activity of each athlete to athletes from the same sport. It wouldn't be fair to compare baseball and hockey players directly, so I want to see where each hockey player falls compared only to other hockey players. Here's the desired output.

pd.DataFrame({"name": ["Joe", "Bob", "Susan", "Kate", "Sam", "Shawn"], 
          "sport": ["hockey", "hockey", "hockey", "baseball", "baseball", "baseball"],
         "points": [1,2,3,1,4,9],
         "percentile": [0,.5,1,0,.5,1]})

My real dataset has thousands of groups and hundreds of thousands of rows.

CodePudding user response：

df['percentile'] = df.groupby(['sport'])['points'].rank(pct=True)
print(df)

Output:

    name     sport  points  percentile
0    Joe    hockey       1    0.333333
1    Bob    hockey       2    0.666667
2  Susan    hockey       3    1.000000
3   Kate  baseball       1    0.333333
4    Sam  baseball       4    0.666667
5  Shawn  baseball       9    1.000000
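
These values differ from the desired output because rank(pct=True) divides each within-group rank by the group size, so the lowest row in a group of n gets 1/n rather than 0. A minimal sketch that checks this on the question's toy data (the names `rank`, `count` and `pct` are only illustrative, and the default `method="average"` is assumed):

```python
import pandas as pd

df = pd.DataFrame({"name": ["Joe", "Bob", "Susan", "Kate", "Sam", "Shawn"],
                   "sport": ["hockey", "hockey", "hockey", "baseball", "baseball", "baseball"],
                   "points": [1, 2, 3, 1, 4, 9]})

grouped = df.groupby("sport")["points"]
rank = grouped.rank()                # 1-based average rank within each sport
count = grouped.transform("count")   # non-null rows in each row's sport
pct = grouped.rank(pct=True)         # what the answer above computes

# With the default method, pct=True matches rank / group size.
matches = ((rank / count) - pct).abs().max() < 1e-12
print(matches)
```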

CodePudding user response：

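To get the expected output, rescale the result of groupby.rank. With n rows in a group, rank(pct=True) runs from 1/n to 1, so subtracting 1/n and multiplying by n/(n-1) maps it onto the range 0 to 1.

For a fixed group size of 3: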

df['percentile'] = (df.groupby('sport')['points']
                      .rank(pct=True)
                      .sub(1/3).mul(3/2)
                    )

For groups of any size:

df['percentile'] = (df.groupby('sport')['points']
                      # transform keeps the original row index, so the result aligns with df
                      .transform(lambda g: g.rank(pct=True)
                                 .sub(1/len(g))
                                 .mul(len(g)/(len(g)-1) if len(g)>1 else 0))
                     )

Output:

    name     sport  points  percentile
0    Joe    hockey       1         0.0
1    Bob    hockey       2         0.5
2  Susan    hockey       3         1.0
3   Kate  baseball       1         0.0
4    Sam  baseball       4         0.5
5  Shawn  baseball       9         1.0
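
Since the question mentions thousands of groups, calling a Python lambda once per group may be slow. The same rescaling can be sketched with built-in groupby operations only. This is an illustrative alternative under that assumption, not a measured result; `rank` and `count` are names chosen here.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Joe", "Bob", "Susan", "Kate", "Sam", "Shawn"],
                   "sport": ["hockey", "hockey", "hockey", "baseball", "baseball", "baseball"],
                   "points": [1, 2, 3, 1, 4, 9]})

grouped = df.groupby("sport")["points"]
rank = grouped.rank()                # 1-based average rank within each sport
count = grouped.transform("count")   # non-null rows in each row's sport

# (rank - 1) / (n - 1) maps the lowest row to 0 and the highest to 1.
# Single-row groups would give 0/0 (NaN), so they are set to 0 here,
# mirroring the `else 0` branch of the lambda version.
df["percentile"] = ((rank - 1) / (count - 1)).where(count > 1, 0.0)
print(df)
```

Ties receive the average rank by default, so tied rows share the same percentile.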

CodePudding user response：

As far as I know, pandas has no direct method for computing each row's percentile within a group. One suggested approach is the rank method with pct=True, which returns percentile ranks. Combined with groupby, you get:

df.groupby("sport")["points"].rank(pct=True)

resulting in

0    0.333333
1    0.666667
2    1.000000
3    0.333333
4    0.666667
5    1.000000

To assign that as a column, do:

df['percentile'] = df.groupby("sport")["points"].rank(pct=True)

CodePudding user response：

Try this:

df['percentile'] = df['points'].sub(1) / df['points'].groupby(df['sport']).transform('max').sub(1)

Output:

>>> df
    name     sport  points  percentile
0    Joe    hockey       1       0.000
1    Bob    hockey       2       0.500
2  Susan    hockey       3       1.000
3   Kate  baseball       1       0.000
4    Sam  baseball       4       0.375
5  Shawn  baseball       9       1.000
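
Note that this is min-max scaling of the raw scores rather than a rank-based percentile, which is why Sam gets 0.375 instead of 0.5. The one-liner also assumes the lowest score in every group is 1. A hedged sketch of the same idea using each group's actual minimum (the names `lo`, `hi` and the `scaled` column are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"name": ["Joe", "Bob", "Susan", "Kate", "Sam", "Shawn"],
                   "sport": ["hockey", "hockey", "hockey", "baseball", "baseball", "baseball"],
                   "points": [1, 2, 3, 1, 4, 9]})

grouped = df.groupby("sport")["points"]
lo = grouped.transform("min")   # lowest score in each row's sport
hi = grouped.transform("max")   # highest score in each row's sport

# Position of each score between its group's minimum and maximum.
# A group whose scores are all equal gives 0/0, i.e. NaN.
df["scaled"] = (df["points"] - lo) / (hi - lo)
print(df)
```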