Calculate the average of the lowest n percentile-CodePudding

I have the following dataset. I want to find the average run of the lower 20 percentile. For example: If I divide the runs column into 5 batches then the first two rows will be in the 20 percentile. So the average run of these two rows will be (1 2)/2 = 1.5 How do I divide the data frame into 5 batches (with sorting) and then find the average of that specific group?

I have tried using the following but the output shows 2.8 instead of 3

d.runs.quantile(0.2)

Input:


ODI_runs = {'name': ['Tendulkar', 'Sangakkara', 'Ponting', 
                      'Jayasurya', 'Jayawardene', 'Kohli', 
                      'Haq', 'Kallis', 'Ganguly', 'Dravid'], 
            'runs': [1,2,3,4,5,6,7,8,9,10]} 
d = pd.DataFrame(ODI_runs)  

name            runs
Tendulkar       1
Sangakkara      2
Ponting         3
Jayasurya       4
Jayawardene     5
Kohli           6
Haq             7
Kallis          8
Ganguly         9
Dravid          10

Output:

1.5

CodePudding user response：

You could use the pandas.DataFrame.quantile method: to retrieve the value that separates the first 20% of the data we use df["runs"].quantile(0.2). Then, is all pandas: use loc to target the correct rows and columns, and calculate the .mean() of thos values:

>> df.loc[df["runs"] <= df["runs"].quantile(0.2), "runs"].mean()
1.5

CodePudding user response：

Try:

>>> df['runs'].sort_values().head(round(0.2*len(df))).mean()
1.5

If you want to get all results, use pd.cut:

labels = ['1st', '2nd', '3rd', '4th', '5th']
out = df['runs'].groupby(pd.cut(df['runs'], bins=5, labels=labels)).mean()
print(out)

# Output
runs
1st    1.5
2nd    3.5
3rd    5.5
4th    7.5
5th    9.5
Name: runs, dtype: float64