import pandas as pd
import csv
df_orig = pd.read_csv('test_sample.csv')
df_orig = df_orig[(df_orig['number']>0)]
decile_stats = df_orig.groupby(pd.qcut(df_orig.number, 5))['number'].mean()
print(decile_stats)
I'm trying to use python to calculate statistics for deciles of my dataset. I can calculate the mean of each decile using qcut, but I want to group my numbers by the values in a second column. This way the deciles are calculated and reported on values according to their value in the family column.
family number
0 1000 0.04
1 1000 0.20
2 1000 0.04
3 1000 0.16
4 1000 0.08
5 1000 0.02
6 1000 0.02
7 1000 0.02
8 1000 0.64
9 1000 0.04
My desired output would be:
Q1 1000 0.028617
Q2 1000 0.105060
Q3 1000 0.452467
Q4 1000 2.644886
Q5 1000 141.749797...
etc. with each 'family' shown, 1000, 2000, 3000 in this case.
CodePudding user response:
IIUC, you can use:
N = 3
labels = [f'Q{i}' for i in range(1, N 1)]
decile = lambda x: x.groupby(pd.qcut(x['number'], N, labels=labels)).mean()
out = df.groupby('family').apply(decile)['number'].rename('mean').reset_index()
Output:
>>> out
family number mean
0 1000 Q1 0.030000
1 1000 Q2 0.080000
2 1000 Q3 0.333333