I have a pandas Dataframe and I want to create a new columns (new1,new2,new3,new4,new5,new6,new7,new8,new9,new10) from the original columns(A-J). The condition is to create new columns by assigning bins to the values in columns (A-J).
bins to be assigned:
0
1.0 - 1.19
1.2 – 1.39
1.4 – 1.59
1.6 – 1.79
1.8 – 1.99
2.0 – 2.99
3.0 – 3.99
4.0 – 4.99
5.0 – 5.99
6.0 – 6.99
7.0 – 7.99
8.0 – 8.99
9.0 – 9.99
10.0
DataFrame:
A B C D E F G H I J
0 34.00 5.18 19.10 70.00 0.00 1.10 3.30 2.96 0.00 1.45
1 3.65 7.20 0.00 3.15 14.00 0.00 8.20 13.50 0.00 0.00
2 6.60 12.00 104.94 3.60 24.00 6.54 23.00 4.83 0.00 0.00
3 2.16 5.30 0.00 8.40 31.48 0.00 6.40 26.44 0.00 0.00
4 4.66 268.86 160.00 19.14 1.58 17.77 47.57 88.01 0.00 0.00
5 7.40 2.64 0.00 4.90 3.53 0.00 46.00 0.00 0.00 0.00
6 6.18 60.00 8.06 13.50 6.20 3.05 9.00 38.90 0.00 0.00
7 6.51 5.30 6.20 17.00 27.00 15.00 4.51 11.50 0.00 0.00
8 3.25 40.72 4.37 8.02 0.00 110.00 16.25 3.95 75.00 0.00
The new columns to be added to the Dataframe should have the form. Expected Output form
new1 new2 new3 new4 new5 new6 new7 new8 new9 new10
10 5.0–5.99 10 10 0.00 1.0-1.19 3.0–3.99 2.0–2.99 0.00 1.4–1.59
3.0–3.99 7.0–7.99 0.00 3.0–3.99 10 0.00 8.0 – 8.99 10 0.00 0.00 0.00
I tried to create the new DataFrame by using the pd.cut function in pandas, and also assign labels to this bin. but the number of bins created was up to 30 instead of 15. I also tried to assign labels to the bin but I get an error in the process
labels = ['0','1.0 - 1.19' ,'1.2 – 1.39','1.4 – 1.59', '1.6 – 1.79','1.8 – 1.99','2.0 – 2.99','3.0 – 3.99','4.0 – 4.99','5.0 – 5.99',\
'6.0 – 6.99','7.0 – 7.99','8.0 – 8.99','9.0 – 9.99','10.0 ']
bins = [0,1.0,1.19,1.2,1.39,1.4,1.59,1.6,1.79,1.8,1.99,2.0,2.99,3.0,3.99,4.0,4.99,5.0,5.99,6.0,6.99,7.0,7.99,8.0,8.99,9.0,9.99,10.0,np.inf]
df[['new1','new2','new3','new4','new5','new6','new7','new8','new9','new10']] = \
df[['A', 'B', 'C','D','E','F','G','H','I','J']].apply(lambda x: \
pd.cut(x,bins),axis=1).apply(lambda x: x.cat.add_categories([0.00])).fillna(0.00)
ValueError: Bin labels must be one fewer than the number of bin edges
CodePudding user response:
The solution involves editing your bins
list:
# Same labels as yours
labels = ['0', '1.0 - 1.19', '1.2 – 1.39', '1.4 – 1.59', '1.6 – 1.79',
'1.8 – 1.99', '2.0 – 2.99', '3.0 – 3.99', '4.0 – 4.99', '5.0 – 5.99',
'6.0 – 6.99', '7.0 – 7.99', '8.0 – 8.99', '9.0 – 9.99', '10.0 ']
# Define the edges between bins
bins = [0, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 3.0, 4.0,
5.0, 6.0, 7.0, 8.0, 9.0, 10.0, np.inf]
# pd.cut each column, with each bin closed on left and open on right
res = df.apply(lambda x: pd.cut(x, bins=bins, labels=labels, right=False))
# rename columns and print result
res.columns = [f'new{i 1}' for i in range(df.shape[1])]
print(res)
new1 new2 new3 new4 new5 new6 new7 new8 new9 new10
0 10.0 5.0 – 5.99 10.0 10.0 0 1.0 - 1.19 3.0 – 3.99 2.0 – 2.99 0 1.4 – 1.59
1 3.0 – 3.99 7.0 – 7.99 0 3.0 – 3.99 10.0 0 8.0 – 8.99 10.0 0 0
2 6.0 – 6.99 10.0 10.0 3.0 – 3.99 10.0 6.0 – 6.99 10.0 4.0 – 4.99 0 0
3 2.0 – 2.99 5.0 – 5.99 0 8.0 – 8.99 10.0 0 6.0 – 6.99 10.0 0 0
4 4.0 – 4.99 10.0 10.0 10.0 1.4 – 1.59 10.0 10.0 10.0 0 0
5 7.0 – 7.99 2.0 – 2.99 0 4.0 – 4.99 3.0 – 3.99 0 10.0 0 0 0
6 6.0 – 6.99 10.0 8.0 – 8.99 10.0 6.0 – 6.99 3.0 – 3.99 9.0 – 9.99 10.0 0 0
7 6.0 – 6.99 5.0 – 5.99 6.0 – 6.99 10.0 10.0 10.0 4.0 – 4.99 10.0 0 0
8 3.0 – 3.99 10.0 4.0 – 4.99 8.0 – 8.99 0 10.0 10.0 3.0 – 3.99 10.0 0
Explanation
The sequence of scalars passed as bins
to pd.cut()
"defines the bin edges allowing for non-uniform width": https://pandas.pydata.org/docs/reference/api/pandas.cut.html.
By default, each bin is open on the left and closed on the right. To switch this, pass right=False
(which also closes the left edge of each bin).
For example, bins=[0, 1.0, 1.19, 1.2]
causes pd.cut
to make 3 intervals: [0.0, 1.0) < [1.0, 1.19) < [1.19, 2.0)
.