For example I have this dataframe :
count
A 20
B 20
C 15
D 10
E 10
F 8
G 7
H 5
I 5
And if I want to make a group based on biggest 75%, 15%, 10%. I expect this :
count Class
A 20 Top75
B 20 Top75
C 15 Top75
D 10 Top75
E 10 Top75
F 8 Top15
G 7 Top15
H 5 Top10
I 5 Top10
it has been anwered using np.cut
with target 75%,15%,10%. It categorizes correctly but it removes the 'count' column. Using np.qcut
it divides differently.
So, I want to use np.cut but without removing count value .
*Note the count could be any numbers, and I want to cut based on the percentage, cut of first 75%, 15% and the last 10%.
CodePudding user response:
df = pd.DataFrame(dict(count=[20,20,15,10,8,5]))
df['class'] = pd.cut(df['count'], [0, 5, 15, 20], labels=['Top10', 'Top15', 'Top75'])
| | count | class |
|---:|--------:|:--------|
| 0 | 20 | Top75 |
| 1 | 20 | Top75 |
| 2 | 15 | Top15 |
| 3 | 10 | Top15 |
| 4 | 8 | Top15 |
| 5 | 5 | Top10 |
CodePudding user response:
I don't know exactly what you want, but I think this might be desired
min_c = df['count'].min()
max_c = df['count'].max()
bins = [min_c-0.001, 0.9 * min_c 0.1 * max_c, 0.75 * min_c 0.25 * max_c, max_c]
labels = ['Top10', 'Top25', 'Top75']
df.assign(Class=pd.cut(df['count'], bins=bins, labels=labels))
Output:
count Class
A 20 Top75
B 20 Top75
C 15 Top75
D 10 Top75
E 10 Top75
F 8 Top25
G 7 Top25
H 5 Top10
I 5 Top10
CodePudding user response:
The exact logic is still unclear to me, but assuming that the count always sums up to 100, is in decreasing order. You can use a cumulated sum:
bins = [10, 15, 75]
df['Class'] = pd.cut(df.loc[::-1, 'count'].cumsum(),
np.cumsum([0] bins),
labels=[f'Top{n}' for n in bins])
Output:
count Class
A 20 Top75
B 20 Top75
C 15 Top75
D 10 Top75
E 10 Top75
F 8 Top15
G 7 Top15
H 5 Top10
I 5 Top10
Intermediates:
count cumsum bin Class
A 20 100 (25, 100] Top75
B 20 80 (25, 100] Top75
C 15 60 (25, 100] Top75
D 10 45 (25, 100] Top75
E 10 35 (25, 100] Top75
F 8 25 (10, 25] Top15
G 7 17 (10, 25] Top15
H 5 10 (0, 10] Top10
I 5 5 (0, 10] Top10