Let's say I have this example dataset:
import numpy as np
import pandas as pd
test = pd.DataFrame({'Education': ['High School', 'Uneducated', 'Graduate', 'College', np.nan, 'High School'],
                     'Gender': ['M', 'F', 'M', 'F', 'M', 'F']})
which looks like this:
Education Gender
High School M
Uneducated F
Graduate M
College F
NaN M
High School F
All I want to do is encode the 'Education' column as ordinal, in the order I specify, with this code:
from sklearn.preprocessing import OrdinalEncoder
edu = ['Uneducated', 'High School', 'College', 'Graduate']
oe_edu = OrdinalEncoder(categories=[edu])
test['Education'] = oe_edu.fit_transform(test[['Education']])
but I have a problem with the NaN values. I still want to keep them as NaN so that I can impute them later
(my scikit-learn version is 1.0.2, so it can handle NaN when categories is left at its default).
So the final output should look like this:
Education Gender
1.0 M
0.0 F
3.0 M
2.0 F
NaN M
1.0 F
Maybe it will work if I include the parameters 'handle_unknown' and 'unknown_value', but I'm not sure how to use them.
CodePudding user response:
Never mind, I figured it out myself:
edu = ['Uneducated','High School', 'College', 'Graduate']
oe_edu = OrdinalEncoder(categories=[edu], handle_unknown='use_encoded_value', unknown_value=np.nan)
test['Education'] = oe_edu.fit_transform(test[['Education']])
CodePudding user response:
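You can do it in pandas using map:
# df is the DataFrame from the question (called test there); edu is the ordered list of categories
mapping = {k: v for v, k in enumerate(edu)}
df['Education'] = df['Education'].map(mapping)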
Output:
Education Gender
0 1.0 M
1 0.0 F
2 3.0 M
3 2.0 F
4 NaN M
5 1.0 F
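A self-contained version of the map approach, as a rough sketch that rebuilds the question's data under the name `df`. NaN is not a key in the mapping, so `map` leaves it as NaN and the column comes out as float.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Education': ['High School', 'Uneducated', 'Graduate', 'College', np.nan, 'High School'],
    'Gender': ['M', 'F', 'M', 'F', 'M', 'F'],
})

edu = ['Uneducated', 'High School', 'College', 'Graduate']

# Build {label: position}, e.g. 'Uneducated' -> 0, 'Graduate' -> 3.
mapping = {label: code for code, label in enumerate(edu)}

# Values without an entry in the mapping (including NaN) become NaN.
df['Education'] = df['Education'].map(mapping)
print(df)
```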
CodePudding user response:
You can use pandas.Categorical:
edu = ['Uneducated','High School', 'College', 'Graduate']
test['cat'] = pd.Categorical(test['Education'], categories=edu, ordered=True)
print(test)
     Education Gender          cat
0  High School      M  High School
1   Uneducated      F   Uneducated
2     Graduate      M     Graduate
3      College      F      College
4          NaN      M          NaN
5  High School      F  High School
print(test['cat'].cat.codes)
0 1
1 0
2 3
3 2
4 -1
5 1
dtype: int8
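Note that the codes use -1 for the missing value, while the question asks for NaN. One possible way to bridge that, sketched below, is to cast the codes to float and mask the negative ones. The names `education`, `codes` and `encoded` are only for illustration.

```python
import numpy as np
import pandas as pd

edu = ['Uneducated', 'High School', 'College', 'Graduate']
education = pd.Series(['High School', 'Uneducated', 'Graduate',
                       'College', np.nan, 'High School'])

cat = pd.Categorical(education, categories=edu, ordered=True)

# cat.codes is an int8 array in which -1 marks a missing value.
codes = pd.Series(cat.codes, index=education.index)

# Cast to float first so the column can hold NaN, then blank out the -1 entries.
encoded = codes.astype('float64').mask(codes < 0)
print(encoded)
```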