I have a DataFrame df that has an Age column with continuous values. I would like to create a new DataFrame new_df, replacing the original continuous values with categorical values that I created by binning.
Is there a way to do this?
DataFrame (df):
Customer_ID Gender Age
0 0002-ORFBO Female 37
1 0003-MKNFE Male 46
2 0004-TLHLJ Male 50
3 0011-IGKFF Male 78
4 0013-EXCHZ Female 75
5 0013-MHZWF Female 23
6 0013-SMEOE Female 67
7 0014-BMAQU Male 52
8 0015-UOCOJ Female 68
9 0016-QLJIS Female 43
10 0017-DINOC Male 47
11 0017-IUDMW Female 25
12 0018-NYROU Female 58
13 0019-EFAEP Female 32
14 0019-GFNTW Female 39
15 0020-INWCK Female 58
16 0020-JDNXP Female 52
17 0021-IKXGC Female 72
18 0022-TCJCI Male 79
My code:
# Ages 0 to 3: Toddler
# Ages 4 to 17: Child
# Ages 18 to 25: Young Adult
# Ages 26 to 64: Adult
# Ages 65 to 99: Elder
pd.cut(df.Age,bins=[0,3,17,25,64,99], labels=['Toddler', 'Child', 'Young Adult', 'Adult', 'Elder'])
CodePudding user response:
You're almost there already. There is no need to create a new DataFrame new_df; you only need to add a new column to df, here called age_category:
import pandas as pd

# data.csv is assumed to hold the data shown in the question
df = pd.read_csv('data.csv')
df['age_category'] = pd.cut(df['Age'], bins=[0,3,17,25,64,99], labels=['Toddler', 'Child', 'Young Adult', 'Adult', 'Elder'])
# print(df)
print(df[['Customer_ID', 'Gender', 'Age', 'age_category']])
Output:
Customer_ID Gender Age age_category
0 0002-ORFBO Female 37 Adult
1 0003-MKNFE Male 46 Adult
2 0004-TLHLJ Male 50 Adult
3 0011-IGKFF Male 78 Elder
4 0013-EXCHZ Female 75 Elder
5 0013-MHZWF Female 23 Young Adult
6 0013-SMEOE Female 67 Elder
7 0014-BMAQU Male 52 Adult
8 0015-UOCOJ Female 68 Elder
9 0016-QLJIS Female 43 Adult
10 0017-DINOC Male 47 Adult
11 0017-IUDMW Female 25 Young Adult
12 0018-NYROU Female 58 Adult
13 0019-EFAEP Female 32 Adult
14 0019-GFNTW Female 39 Adult
15 0020-INWCK Female 58 Adult
16 0020-JDNXP Female 52 Adult
17 0021-IKXGC Female 72 Elder
18 0022-TCJCI Male 79 Elder
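To check the bin edges without a CSV file, here is a minimal sketch that builds a few of the question's rows in memory. The four-row sample and the bins and labels variable names are only for illustration. By default pd.cut uses right-closed intervals, so an age of 25 falls into Young Adult and the result is a categorical column.

```python
import pandas as pd

# Small in-memory sample taken from the question's data (no CSV needed).
df = pd.DataFrame({
    "Customer_ID": ["0002-ORFBO", "0013-MHZWF", "0017-IUDMW", "0011-IGKFF"],
    "Gender": ["Female", "Female", "Female", "Male"],
    "Age": [37, 23, 25, 78],
})

bins = [0, 3, 17, 25, 64, 99]
labels = ["Toddler", "Child", "Young Adult", "Adult", "Elder"]

# Intervals are right-closed by default:
# (0, 3], (3, 17], (17, 25], (25, 64], (64, 99]
df["age_category"] = pd.cut(df["Age"], bins=bins, labels=labels)

print(df)
print(df["age_category"].dtype)  # categorical dtype
```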
CodePudding user response:
If you really want it to be another DataFrame, make a copy of the original and then overwrite the Age column with the binned result:
new_df = df.copy()
new_df['Age'] = pd.cut(new_df['Age'], bins=[0,3,17,25,64,99], labels=['Toddler', 'Child', 'Young Adult', 'Adult', 'Elder'])
print(new_df)
# Output:
Customer_ID Gender Age
0 0002-ORFBO Female Adult
1 0003-MKNFE Male Adult
2 0004-TLHLJ Male Adult
3 0011-IGKFF Male Elder
4 0013-EXCHZ Female Elder
5 0013-MHZWF Female Young Adult
6 0013-SMEOE Female Elder
7 0014-BMAQU Male Adult
8 0015-UOCOJ Female Elder
9 0016-QLJIS Female Adult
10 0017-DINOC Male Adult
11 0017-IUDMW Female Young Adult
12 0018-NYROU Female Adult
13 0019-EFAEP Female Adult
14 0019-GFNTW Female Adult
15 0020-INWCK Female Adult
16 0020-JDNXP Female Adult
17 0021-IKXGC Female Elder
18 0022-TCJCI Male Elder
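A possible alternative to the copy-then-overwrite approach is DataFrame.assign, which returns a new DataFrame and leaves the original untouched. The sketch below uses a three-row sample from the question; the bins and labels variable names are only for illustration.

```python
import pandas as pd

# Three rows from the question's data, built in memory for illustration.
df = pd.DataFrame({
    "Customer_ID": ["0002-ORFBO", "0013-MHZWF", "0011-IGKFF"],
    "Gender": ["Female", "Female", "Male"],
    "Age": [37, 23, 78],
})

bins = [0, 3, 17, 25, 64, 99]
labels = ["Toddler", "Child", "Young Adult", "Adult", "Elder"]

# assign returns a new DataFrame; passing Age=... replaces that column in the result only.
new_df = df.assign(Age=pd.cut(df["Age"], bins=bins, labels=labels))

print(new_df)
print(df["Age"].tolist())  # the original ages are still numeric
```

This does the copy and the overwrite in one expression, which can be convenient inside a method chain.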
CodePudding user response:
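You can add the include_lowest=True argument so that an age of 0 is included in the Toddler bin. By default the first bin is open on the left, so 0 would become NaN. In the output below, the test Age column runs from -1 to 99.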
out = df.join(
    pd.cut(
        df.pop('Age'),
        bins=[0, 3, 17, 25, 64, 99],
        labels=['Toddler', 'Child', 'Young Adult', 'Adult', 'Elder'],
        include_lowest=True,
    ).to_frame('label')
)
print(out)
label
0 NaN
1 Toddler
2 Toddler
3 Toddler
4 Toddler
5 Child
6 Child
7 Child
8 Child
9 Child
10 Child
11 Child
12 Child
13 Child
14 Child
15 Child
16 Child
17 Child
18 Child
19 Young Adult
20 Young Adult
21 Young Adult
22 Young Adult
23 Young Adult
24 Young Adult
25 Young Adult
26 Young Adult
27 Adult
28 Adult
29 Adult
30 Adult
31 Adult
32 Adult
33 Adult
34 Adult
35 Adult
36 Adult
37 Adult
38 Adult
39 Adult
40 Adult
41 Adult
42 Adult
43 Adult
44 Adult
45 Adult
46 Adult
47 Adult
48 Adult
49 Adult
50 Adult
51 Adult
52 Adult
53 Adult
54 Adult
55 Adult
56 Adult
57 Adult
58 Adult
59 Adult
60 Adult
61 Adult
62 Adult
63 Adult
64 Adult
65 Adult
66 Elder
67 Elder
68 Elder
69 Elder
70 Elder
71 Elder
72 Elder
73 Elder
74 Elder
75 Elder
76 Elder
77 Elder
78 Elder
79 Elder
80 Elder
81 Elder
82 Elder
83 Elder
84 Elder
85 Elder
86 Elder
87 Elder
88 Elder
89 Elder
90 Elder
91 Elder
92 Elder
93 Elder
94 Elder
95 Elder
96 Elder
97 Elder
98 Elder
99 Elder
100 Elder
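Alternatively, add the new column to the original df. Run this instead of the snippet above, because df.pop removes the Age column from df in place: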
df['label'] = pd.cut(df['Age'],
                     bins=[0, 3, 17, 25, 64, 99],
                     labels=['Toddler', 'Child', 'Young Adult', 'Adult', 'Elder'],
                     include_lowest=True)
print(df)
Age label
0 -1 NaN
1 0 Toddler
2 1 Toddler
3 2 Toddler
4 3 Toddler
5 4 Child
6 5 Child
7 6 Child
8 7 Child
9 8 Child
10 9 Child
11 10 Child
12 11 Child
13 12 Child
14 13 Child
15 14 Child
16 15 Child
17 16 Child
18 17 Child
19 18 Young Adult
20 19 Young Adult
21 20 Young Adult
22 21 Young Adult
23 22 Young Adult
24 23 Young Adult
25 24 Young Adult
26 25 Young Adult
27 26 Adult
28 27 Adult
29 28 Adult
30 29 Adult
31 30 Adult
32 31 Adult
33 32 Adult
34 33 Adult
35 34 Adult
36 35 Adult
37 36 Adult
38 37 Adult
39 38 Adult
40 39 Adult
41 40 Adult
42 41 Adult
43 42 Adult
44 43 Adult
45 44 Adult
46 45 Adult
47 46 Adult
48 47 Adult
49 48 Adult
50 49 Adult
51 50 Adult
52 51 Adult
53 52 Adult
54 53 Adult
55 54 Adult
56 55 Adult
57 56 Adult
58 57 Adult
59 58 Adult
60 59 Adult
61 60 Adult
62 61 Adult
63 62 Adult
64 63 Adult
65 64 Adult
66 65 Elder
67 66 Elder
68 67 Elder
69 68 Elder
70 69 Elder
71 70 Elder
72 71 Elder
73 72 Elder
74 73 Elder
75 74 Elder
76 75 Elder
77 76 Elder
78 77 Elder
79 78 Elder
80 79 Elder
81 80 Elder
82 81 Elder
83 82 Elder
84 83 Elder
85 84 Elder
86 85 Elder
87 86 Elder
88 87 Elder
89 88 Elder
90 89 Elder
91 90 Elder
92 91 Elder
93 92 Elder
94 93 Elder
95 94 Elder
96 95 Elder
97 96 Elder
98 97 Elder
99 98 Elder
100 99 Elder
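The effect of include_lowest is easiest to see side by side. The sketch below bins a handful of boundary ages with and without the argument; the ages Series and the comparison frame are made up for illustration. Values outside the outermost edges (here -1 and 100) stay NaN either way.

```python
import pandas as pd

# Boundary ages chosen for illustration, including values outside 0..99.
ages = pd.Series([-1, 0, 3, 4, 99, 100], name="Age")

bins = [0, 3, 17, 25, 64, 99]
labels = ["Toddler", "Child", "Young Adult", "Adult", "Elder"]

# Default: the first interval is (0, 3], so 0 is not binned.
default = pd.cut(ages, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left, so 0 is binned.
with_lowest = pd.cut(ages, bins=bins, labels=labels, include_lowest=True)

comparison = pd.DataFrame({
    "Age": ages,
    "default": default,
    "include_lowest": with_lowest,
})
print(comparison)
```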