I have the following dataset. I want to find the average run of the lower 20 percentile. For example: If I divide the runs column into 5 batches then the first two rows will be in the 20 percentile. So the average run of these two rows will be (1 2)/2 = 1.5 How do I divide the data frame into 5 batches (with sorting) and then find the average of that specific group?
I have tried using the following but the output shows 2.8 instead of 3
d.runs.quantile(0.2)
Input:
ODI_runs = {'name': ['Tendulkar', 'Sangakkara', 'Ponting',
'Jayasurya', 'Jayawardene', 'Kohli',
'Haq', 'Kallis', 'Ganguly', 'Dravid'],
'runs': [1,2,3,4,5,6,7,8,9,10]}
d = pd.DataFrame(ODI_runs)
name runs
Tendulkar 1
Sangakkara 2
Ponting 3
Jayasurya 4
Jayawardene 5
Kohli 6
Haq 7
Kallis 8
Ganguly 9
Dravid 10
Output:
1.5
CodePudding user response:
You could use the pandas.DataFrame.quantile
method: to retrieve the value that separates the first 20% of the data we use df["runs"].quantile(0.2)
. Then, is all pandas: use loc to target the correct rows and columns, and calculate the .mean()
of thos values:
>> df.loc[df["runs"] <= df["runs"].quantile(0.2), "runs"].mean()
1.5
CodePudding user response:
Try:
>>> df['runs'].sort_values().head(round(0.2*len(df))).mean()
1.5
If you want to get all results, use pd.cut
:
labels = ['1st', '2nd', '3rd', '4th', '5th']
out = df['runs'].groupby(pd.cut(df['runs'], bins=5, labels=labels)).mean()
print(out)
# Output
runs
1st 1.5
2nd 3.5
3rd 5.5
4th 7.5
5th 9.5
Name: runs, dtype: float64