How to calculate the percentile value of a number in a dataframe column grouped by index

I have a dataframe like this:

For every group in the data, I want to find the percentile value of the score 35 (i.e. the percentile at which 35 falls within each group's scores).

I tried several approaches, but none of them worked.

scipy.stats.percentileofscore(df['Score'], 35, kind='weak')
 --> This works, but it gives the percentile over the whole column, not per group

df.groupby('group')['Score'].percentileofscore()
 --> 'SeriesGroupBy' object has no attribute 'percentileofscore'

scipy.stats.percentileofscore(df.groupby('group')[['Score']], 35, kind='strict')
 --> TypeError: '<' not supported between instances of 'str' and 'int'

My ideal output looks like this:

df:
        Score Percentile
group 
  A       50
  C       33

Can anyone suggest an approach that works here?

CodePudding user response：

The inverse quantile function of a sequence at a point X is the proportion of values in the sequence that are less than X, right? So:

In [158]: df["Score"].lt(35).groupby(df["group"]).mean().mul(100)
Out[158]:
group
A    50.000000
C    33.333333
Name: Score, dtype: float64

get a True/False Series marking whether each "Score" is < 35
group this Series by "group"
take the mean of each group
- since True == 1 and False == 0, the mean is effectively the proportion of True values
multiply by 100 to get percentages
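The steps above can be sketched one at a time as follows. The dataframe here is hypothetical sample data (the question's actual data is not shown), chosen only so that the result matches the output above; the intermediate variable names are also just for illustration.

```python
import pandas as pd

# Hypothetical sample data; the question's actual dataframe is not shown.
df = pd.DataFrame({
    "group": ["A", "A", "C", "C", "C"],
    "Score": [20, 50, 30, 35, 40],
})

below = df["Score"].lt(35)             # boolean Series: True where Score < 35
grouped = below.groupby(df["group"])   # group the booleans by the "group" column
share = grouped.mean()                 # mean of booleans = fraction of True values
percent_below = share.mul(100)         # scale the fraction to a percentage

print(percent_below)
```

Note that lt is a strict comparison, so this corresponds to kind='strict' in scipy.stats.percentileofscore. Swapping lt for le would count scores equal to 35 as well, which corresponds to kind='weak'.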

CodePudding user response：

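Assuming you want the top percentile, i.e. the percentage of scores that lie above 35, use 100 - percentileofscore inside agg:

from scipy.stats import percentileofscore

# kind='weak' counts scores <= 35, so 100 minus it is the percentage of scores above 35
df.groupby('group').agg(
    Score_Percentile=('Score', lambda x: 100 - percentileofscore(x, 35, kind='weak'))
).astype(int)  # astype(int) truncates, e.g. 33.33 -> 33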

Output:

       Score_Percentile
group   
A      50
C      33
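To see how the kind argument changes the result, here is a rough sketch that computes the 'strict' and 'weak' percentiles per group side by side. The dataframe is hypothetical sample data, and the column names in summary are made up for this illustration.

```python
import pandas as pd
from scipy.stats import percentileofscore

# Hypothetical sample data; the question's actual dataframe is not shown.
df = pd.DataFrame({
    "group": ["A", "A", "C", "C", "C"],
    "Score": [20, 50, 30, 35, 40],
})

scores = df.groupby("group")["Score"]

# Percentage of scores strictly below 35 in each group.
strict = scores.agg(lambda s: percentileofscore(s, 35, kind="strict"))
# Percentage of scores at or below 35 in each group.
weak = scores.agg(lambda s: percentileofscore(s, 35, kind="weak"))

summary = pd.DataFrame({
    "pct_below_35": strict,
    "pct_at_or_below_35": weak,
    "pct_above_35": 100 - weak,
})
print(summary)
```

The two kinds only differ for groups that contain a score exactly equal to 35 (group C in this sample). If you want rounded rather than truncated integers, call round() before astype(int).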

CodePudding user response：

More generally, you want to perform a custom aggregation on each group, which pandas lets you do with the agg method.

You can define the function yourself or use one from a library:

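import pandas as pd

def percentileofscore(ser: pd.Series) -> float:
    # percentage of scores in the group that are strictly greater than 35
    return 100 * (ser > 35).sum() / ser.size

# agg applies the function to every non-grouping column, here only "Score"
df.groupby("group").agg(percentileofscore)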

Output:

      Score
group   
A     50.000000
C     33.333333
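If the score should not be hard-coded, one possible variation is to pass the threshold in as a parameter and wrap the call in a lambda. This is only a sketch: percent_above is a hypothetical helper name, and the dataframe is made-up sample data because the original is not shown.

```python
import pandas as pd

def percent_above(scores: pd.Series, threshold: float) -> float:
    """Percentage of values in `scores` strictly greater than `threshold`."""
    return 100 * scores.gt(threshold).mean()

# Hypothetical sample data; the question's actual dataframe is not shown.
df = pd.DataFrame({
    "group": ["A", "A", "C", "C", "C"],
    "Score": [20, 50, 30, 35, 40],
})

# Select the column explicitly so other columns cannot end up in the aggregation.
result = df.groupby("group")["Score"].agg(lambda s: percent_above(s, 35))
print(result)
```

Selecting "Score" before calling agg returns a Series indexed by group, and it avoids errors if the dataframe later gains non-numeric columns.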