for example I have this data frame
count
A 20
B 20
C 15
D 10
E 10
F 8
G 7
H 5
I 5
and if I want to cut it into several group (biggest 75%, 15%, and last 10%) it would be
count Class
A 20 Top1
B 20 Top1
C 15 Top1
D 10 Top1
E 10 Top1
F 8 Top2
G 7 Top2
H 5 Top3
I 5 Top3
How do I achieve that ? I know there is a pd.cut but in pd.cut[target]
it cuts the information about the count (Only produces A,B,C,etc ).
Thanks
CodePudding user response:
In theory, you can use qcut
:
df['Class'] = pd.qcut(df['count'], q=[0, 0.1, 0.15, 1],
labels=['Top3', 'Top2', 'Top1'])
Output (note the slightly different example):
count Class
A 20 Top1
B 20 Top1
C 15 Top1
D 10 Top1
E 10 Top1
F 8 Top1
G 7 Top1
H 5 Top2
I 1 Top3 # this value was changed to ensure having 3 bins
But qcut
is quite sensitive to empty bins and this won't work in your current case. What you can do is define the labels dynamically:
s = pd.qcut(df['count'], q=[0, 0.1, 0.15, 1], duplicates='drop')
df['Class'] = s.cat.codes.rsub(len(s.cat.categories)).astype(str).radd('Top')
# for debugging
df['qcut'] = s
df['rank'] = df['count'].rank(pct=True) # not exactly the same computation
Output:
count Class qcut rank
A 20 Top1 (5.4, 20.0] 0.944444
B 20 Top1 (5.4, 20.0] 0.944444
C 15 Top1 (5.4, 20.0] 0.777778
D 10 Top1 (5.4, 20.0] 0.611111
E 10 Top1 (5.4, 20.0] 0.611111
F 8 Top1 (5.4, 20.0] 0.444444
G 7 Top1 (5.4, 20.0] 0.333333
H 5 Top2 (4.999, 5.4] 0.166667
I 5 Top2 (4.999, 5.4] 0.166667