Home > Blockchain >  Numpy cut without removing other column
Numpy cut without removing other column

Time:11-19

For example I have this dataframe :

   count
A   20
B   20
C   15
D   10
E   10
F    8
G    7
H    5
I    5

And if I want to make a group based on biggest 75%, 15%, 10%. I expect this :

   count   Class
A   20     Top75
B   20     Top75 
C   15     Top75
D   10     Top75
E   10     Top75
F    8     Top15
G    7     Top15
H    5     Top10
I    5     Top10

it has been anwered using np.cut with target 75%,15%,10%. It categorizes correctly but it removes the 'count' column. Using np.qcut it divides differently. So, I want to use np.cut but without removing count value . *Note the count could be any numbers, and I want to cut based on the percentage, cut of first 75%, 15% and the last 10%.

CodePudding user response:

df = pd.DataFrame(dict(count=[20,20,15,10,8,5]))
df['class'] = pd.cut(df['count'], [0, 5, 15, 20], labels=['Top10', 'Top15', 'Top75'])

|    |   count | class   |
|---:|--------:|:--------|
|  0 |      20 | Top75   |
|  1 |      20 | Top75   |
|  2 |      15 | Top15   |
|  3 |      10 | Top15   |
|  4 |       8 | Top15   |
|  5 |       5 | Top10   |

CodePudding user response:

I don't know exactly what you want, but I think this might be desired

min_c = df['count'].min()
max_c = df['count'].max()
bins = [min_c-0.001, 0.9 * min_c   0.1 * max_c, 0.75 * min_c   0.25 * max_c, max_c]
labels = ['Top10', 'Top25', 'Top75']
df.assign(Class=pd.cut(df['count'], bins=bins, labels=labels))

Output:

    count   Class
A   20      Top75
B   20      Top75
C   15      Top75
D   10      Top75
E   10      Top75
F   8       Top25
G   7       Top25
H   5       Top10
I   5       Top10

CodePudding user response:

The exact logic is still unclear to me, but assuming that the count always sums up to 100, is in decreasing order. You can use a cumulated sum:

bins = [10, 15, 75]
df['Class'] = pd.cut(df.loc[::-1, 'count'].cumsum(),
                     np.cumsum([0] bins),
                     labels=[f'Top{n}' for n in bins])

Output:

   count  Class
A     20  Top75
B     20  Top75
C     15  Top75
D     10  Top75
E     10  Top75
F      8  Top15
G      7  Top15
H      5  Top10
I      5  Top10

Intermediates:

   count  cumsum        bin  Class
A     20     100  (25, 100]  Top75
B     20      80  (25, 100]  Top75
C     15      60  (25, 100]  Top75
D     10      45  (25, 100]  Top75
E     10      35  (25, 100]  Top75
F      8      25   (10, 25]  Top15
G      7      17   (10, 25]  Top15
H      5      10    (0, 10]  Top10
I      5       5    (0, 10]  Top10
  • Related