Panda dataframe percentage clustering-CodePudding

for example I have this data frame

  count
A 20
B 20
C 15
D 10
E 10
F  8 
G  7
H  5
I  5

and if I want to cut it into several group (biggest 75%, 15%, and last 10%) it would be

    count   Class
A     20     Top1       
B     20     Top1
C     15     Top1
D     10     Top1
E     10     Top1
F      8     Top2 
G      7     Top2
H      5     Top3
I      5     Top3

How do I achieve that ? I know there is a pd.cut but in pd.cut[target] it cuts the information about the count (Only produces A,B,C,etc ).

Thanks

CodePudding user response：

In theory, you can use qcut:

df['Class'] = pd.qcut(df['count'], q=[0, 0.1, 0.15, 1],
                      labels=['Top3', 'Top2', 'Top1'])

Output (note the slightly different example):

   count Class
A     20  Top1
B     20  Top1
C     15  Top1
D     10  Top1
E     10  Top1
F      8  Top1
G      7  Top1
H      5  Top2
I      1  Top3 # this value was changed to ensure having 3 bins

But qcut is quite sensitive to empty bins and this won't work in your current case. What you can do is define the labels dynamically:

s = pd.qcut(df['count'], q=[0, 0.1, 0.15, 1], duplicates='drop')
df['Class'] = s.cat.codes.rsub(len(s.cat.categories)).astype(str).radd('Top')

# for debugging
df['qcut'] = s
df['rank'] = df['count'].rank(pct=True) # not exactly the same computation

Output:

   count Class          qcut      rank
A     20  Top1   (5.4, 20.0]  0.944444
B     20  Top1   (5.4, 20.0]  0.944444
C     15  Top1   (5.4, 20.0]  0.777778
D     10  Top1   (5.4, 20.0]  0.611111
E     10  Top1   (5.4, 20.0]  0.611111
F      8  Top1   (5.4, 20.0]  0.444444
G      7  Top1   (5.4, 20.0]  0.333333
H      5  Top2  (4.999, 5.4]  0.166667
I      5  Top2  (4.999, 5.4]  0.166667