Home > Software engineering >  How to create bins and assign labels based on a given condition pandas
How to create bins and assign labels based on a given condition pandas

Time:09-23

I have a pandas Dataframe and I want to create a new columns (new1,new2,new3,new4,new5,new6,new7,new8,new9,new10) from the original columns(A-J). The condition is to create new columns by assigning bins to the values in columns (A-J).

bins to be assigned:

0
1.0 - 1.19
1.21.39
1.41.59
1.61.79
1.81.99
2.02.99
3.03.99
4.04.99
5.05.99
6.06.99
7.07.99
8.08.99
9.09.99
10.0 

DataFrame:

    A         B       C        D      E      F       G       H       I       J
0   34.00   5.18    19.10   70.00   0.00    1.10    3.30    2.96    0.00    1.45
1   3.65    7.20    0.00    3.15    14.00   0.00    8.20    13.50   0.00    0.00
2   6.60    12.00   104.94  3.60    24.00   6.54    23.00   4.83    0.00    0.00
3   2.16    5.30    0.00    8.40    31.48   0.00    6.40    26.44   0.00    0.00
4   4.66    268.86  160.00  19.14   1.58    17.77   47.57   88.01   0.00    0.00
5   7.40    2.64    0.00    4.90    3.53    0.00    46.00   0.00    0.00    0.00
6   6.18    60.00   8.06    13.50   6.20    3.05    9.00    38.90   0.00    0.00
7   6.51    5.30    6.20    17.00   27.00   15.00   4.51    11.50   0.00    0.00
8   3.25    40.72   4.37    8.02    0.00    110.00  16.25   3.95    75.00   0.00

The new columns to be added to the Dataframe should have the form. Expected Output form

new1          new2     new3  new4      new5    new6       new7      new8     new9   new10

10          5.05.99   10    10        0.00   1.0-1.19  3.03.99    2.02.99  0.00  1.41.59
3.03.99    7.07.99  0.00  3.03.99   10     0.00      8.08.99  10  0.00  0.00  0.00

I tried to create the new DataFrame by using the pd.cut function in pandas, and also assign labels to this bin. but the number of bins created was up to 30 instead of 15. I also tried to assign labels to the bin but I get an error in the process

labels = ['0','1.0 - 1.19' ,'1.2 – 1.39','1.4 – 1.59', '1.6 – 1.79','1.8 – 1.99','2.0 – 2.99','3.0 – 3.99','4.0 – 4.99','5.0 – 5.99',\
          '6.0 – 6.99','7.0 – 7.99','8.0 – 8.99','9.0 – 9.99','10.0 ']
bins = [0,1.0,1.19,1.2,1.39,1.4,1.59,1.6,1.79,1.8,1.99,2.0,2.99,3.0,3.99,4.0,4.99,5.0,5.99,6.0,6.99,7.0,7.99,8.0,8.99,9.0,9.99,10.0,np.inf]

df[['new1','new2','new3','new4','new5','new6','new7','new8','new9','new10']] = \
                          df[['A', 'B', 'C','D','E','F','G','H','I','J']].apply(lambda x: \
                         pd.cut(x,bins),axis=1).apply(lambda x: x.cat.add_categories([0.00])).fillna(0.00)
     

ValueError: Bin labels must be one fewer than the number of bin edges

CodePudding user response:

The solution involves editing your bins list:

# Same labels as yours
labels = ['0', '1.0 - 1.19', '1.2 – 1.39', '1.4 – 1.59', '1.6 – 1.79', 
          '1.8 – 1.99', '2.0 – 2.99', '3.0 – 3.99', '4.0 – 4.99', '5.0 – 5.99', 
          '6.0 – 6.99', '7.0 – 7.99', '8.0 – 8.99', '9.0 – 9.99', '10.0 ']

# Define the edges between bins
bins = [0, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 3.0, 4.0, 
        5.0, 6.0, 7.0, 8.0, 9.0, 10.0, np.inf]

# pd.cut each column, with each bin closed on left and open on right
res = df.apply(lambda x: pd.cut(x, bins=bins, labels=labels, right=False))

# rename columns and print result
res.columns = [f'new{i 1}' for i in range(df.shape[1])]

print(res)

         new1        new2        new3        new4        new5        new6        new7        new8   new9       new10
0       10.0   5.05.99       10.0        10.0            0  1.0 - 1.19  3.03.99  2.02.99      0  1.41.59
1  3.03.99  7.07.99           0  3.03.99       10.0            0  8.08.99       10.0       0           0
2  6.06.99       10.0        10.0   3.03.99       10.0   6.06.99       10.0   4.04.99      0           0
3  2.02.99  5.05.99           0  8.08.99       10.0            0  6.06.99       10.0       0           0
4  4.04.99       10.0        10.0        10.0   1.41.59       10.0        10.0        10.0       0           0
5  7.07.99  2.02.99           0  4.04.99  3.03.99           0       10.0            0      0           0
6  6.06.99       10.0   8.08.99       10.0   6.06.99  3.03.99  9.09.99       10.0       0           0
7  6.06.99  5.05.99  6.06.99       10.0        10.0        10.0   4.04.99       10.0       0           0
8  3.03.99       10.0   4.04.99  8.08.99           0       10.0        10.0   3.03.99  10.0            0

Explanation

The sequence of scalars passed as bins to pd.cut() "defines the bin edges allowing for non-uniform width": https://pandas.pydata.org/docs/reference/api/pandas.cut.html.

By default, each bin is open on the left and closed on the right. To switch this, pass right=False (which also closes the left edge of each bin).

For example, bins=[0, 1.0, 1.19, 1.2] causes pd.cut to make 3 intervals: [0.0, 1.0) < [1.0, 1.19) < [1.19, 2.0).

  • Related