I have a dataframe like this:
df:
       Score
group
A        100
A         34
A         40
A         30
C         24
C         60
C         35
For every group in the data, I want to find the percentile of the score 35 (i.e. the percentile at which 35 falls within that group's scores).
I tried a few approaches, but none of them worked:
scipy.stats.percentileofscore(df['Score'], 35, kind='weak')
--> This works, but it doesn't give me the percentile per group
df.groupby('group')['Score'].percentileofscore()
--> 'SeriesGroupBy' object has no attribute 'percentileofscore'
scipy.stats.percentileofscore(df.groupby('group')[['Score']], 35, kind='strict')
--> TypeError: '<' not supported between instances of 'str' and 'int'
My ideal output looks like this:
df:
       Score Percentile
group
A                    50
C                    33
Can anyone suggest what would work well here?
CodePudding user response:
The inverse quantile function of a sequence at point X is the proportion of values less than X in the sequence, right? So:
In [158]: df["Score"].lt(35).groupby(df["group"]).mean().mul(100)
Out[158]:
group
A 50.000000
C 33.333333
Name: Score, dtype: float64
- get a True/False Series marking whether each "Score" is < 35
- group this Series over "group"
- take the mean
- since True == 1 and False == 0, it will effectively give the proportion!
- multiply (mul) by 100 to get percentages
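The steps above can be reproduced end to end with the sample data from the question. The following is a minimal sketch, assuming "group" is an ordinary column rather than the index. The `le` variant is an addition that is not part of the answer; it is included only to contrast the strict comparison (< 35) with the weak one (<= 35), which is roughly what kind='strict' and kind='weak' distinguish in scipy.

```python
import pandas as pd

# Sample data from the question; "group" is assumed to be a regular column.
df = pd.DataFrame({
    "group": ["A", "A", "A", "A", "C", "C", "C"],
    "Score": [100, 34, 40, 30, 24, 60, 35],
})

# Strict version: percentage of scores in each group that are < 35.
strict = df["Score"].lt(35).groupby(df["group"]).mean().mul(100)

# Weak version (an assumption, not in the answer): percentage of scores <= 35.
weak = df["Score"].le(35).groupby(df["group"]).mean().mul(100)

print(strict)
print(weak)
```

The two results should differ only for group C, because it is the only group that contains a score of exactly 35.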
CodePudding user response:
Assuming you want to calculate the top percentile, i.e. what percentage of the scores lie above 35, you need to use 100 - percentileofscore inside agg:
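from scipy.stats import percentileofscore

# 100 - (percentage of scores <= 35) = percentage of scores strictly above 35
df.groupby('group').agg(
    Score_Percentile=('Score', lambda x: 100 - percentileofscore(x, 35, kind='weak'))
).astype(int)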
Output:
Score_Percentile
group
A 50
C 33
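One detail worth noting is that astype(int) truncates toward zero rather than rounding, so a value such as 66.67 would become 66. The sketch below is a self-contained variant of the same idea that rounds instead. The helper name `top_share` and its `threshold` parameter are hypothetical, introduced only for illustration, and "group" is assumed to be a regular column.

```python
import pandas as pd
from scipy.stats import percentileofscore

# Sample data from the question; "group" is assumed to be a regular column.
df = pd.DataFrame({
    "group": ["A", "A", "A", "A", "C", "C", "C"],
    "Score": [100, 34, 40, 30, 24, 60, 35],
})

def top_share(scores, threshold=35):
    """Hypothetical helper: percentage of scores strictly above threshold."""
    # kind="weak" counts values <= threshold, so the complement is "> threshold".
    return 100 - percentileofscore(scores, threshold, kind="weak")

# Apply the helper to each group's scores and round instead of truncating.
result = df.groupby("group")["Score"].apply(top_share).round(2)
print(result)
```

Rounding to two decimals keeps the result readable without the silent loss of precision that an integer cast introduces.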
CodePudding user response:
To answer in a more general-purpose way: you're looking to do a custom aggregation on each group, which pandas lets you do with the agg method.
You can define the function yourself or use one from a library:
import pandas as pd

def percentileofscore(ser: pd.Series) -> float:
    return 100 * (ser > 35).sum() / ser.size  # percentage of scores strictly above 35

df.groupby("group").agg(percentileofscore)
Output:
Score
group
A 50.000000
C 33.333333
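The function above hard-codes the score 35. A possible generalization, sketched below, takes the threshold as an argument and passes it in through a lambda. The name `share_above` and its `threshold` parameter are hypothetical and not part of the answer, and "group" is again assumed to be a regular column.

```python
import pandas as pd

# Sample data from the question; "group" is assumed to be a regular column.
df = pd.DataFrame({
    "group": ["A", "A", "A", "A", "C", "C", "C"],
    "Score": [100, 34, 40, 30, 24, 60, 35],
})

def share_above(ser: pd.Series, threshold: float) -> float:
    """Hypothetical helper: percentage of values strictly greater than threshold."""
    return 100 * (ser > threshold).sum() / ser.size

# Custom aggregation per group; the lambda binds the threshold to 35.
result = df.groupby("group")["Score"].agg(lambda s: share_above(s, 35))
print(result)
```

Giving the helper a name that differs from scipy's percentileofscore also avoids shadowing the library function if both are imported in the same session.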