Ordinal Encoder with a specific order, including NaN

Let's say I have this example dataset:

import numpy as np
import pandas as pd
test = pd.DataFrame({'Education': ['High School', 'Uneducated', 'Graduate', 'College', np.nan, 'High School'],
                     'Gender': ['M', 'F', 'M', 'F', 'M', 'F']})

which looks like this:

    Education   Gender
    High School M
    Uneducated  F
    Graduate    M
    College     F
    NaN         M
    High School F

All I want is to encode the 'Education' column as ordinal, using this code:

from sklearn.preprocessing import OrdinalEncoder

edu = ['Uneducated', 'High School', 'College', 'Graduate']  # lowest to highest
oe_edu = OrdinalEncoder(categories=[edu])
test['Education'] = oe_edu.fit_transform(test[['Education']])

But I run into a problem with the NaN values. I still want to keep them as NaN so that I can impute them later.
(My scikit-learn version is 1.0.2, so it can handle NaN when the categories are left at the default.)

The final output should look like this:

    Education   Gender
    1.0         M
    0.0         F
    3.0         M
    2.0         F
    NaN         M
    1.0         F

Maybe it would work with the 'handle_unknown' and 'unknown_value' parameters, but I'm not sure how to use them.

CodePudding user response：

Never mind, I figured it out myself:

edu = ['Uneducated','High School', 'College', 'Graduate']
oe_edu = OrdinalEncoder(categories=[edu], handle_unknown='use_encoded_value', unknown_value=np.nan)
test['Education'] = oe_edu.fit_transform(test[['Education']])

CodePudding user response：

You can do it in pandas using map:

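# df is the example DataFrame; edu is the ordered category list from the question
mapping = {k: v for v, k in enumerate(edu)}  # {'Uneducated': 0, 'High School': 1, ...}
df['Education'] = df['Education'].map(mapping)  # values without a key, including NaN, become NaN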

Output:

   Education Gender
0        1.0      M
1        0.0      F
2        3.0      M
3        2.0      F
4        NaN      M
5        1.0      F
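
A self-contained sketch of the map approach, assuming the example data from the question. Because map returns NaN for any value without a key, a misspelled label would silently become NaN as well, so this version collects such labels first. The name `unmapped` is only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Education': ['High School', 'Uneducated', 'Graduate', 'College', np.nan, 'High School'],
    'Gender': ['M', 'F', 'M', 'F', 'M', 'F'],
})

edu = ['Uneducated', 'High School', 'College', 'Graduate']  # lowest to highest
mapping = {label: rank for rank, label in enumerate(edu)}

# Non-missing labels that have no entry in the mapping (typos, new categories).
unmapped = set(df['Education'].dropna()) - set(mapping)
print('Labels without a mapping:', unmapped)

# NaN has no key in the mapping, so it stays NaN after map.
df['Education'] = df['Education'].map(mapping)
print(df)
```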

CodePudding user response：

You can use pandas.Categorical:

edu = ['Uneducated','High School', 'College', 'Graduate']
test['cat'] = pd.Categorical(test['Education'], categories=edu, ordered=True)

print(test)
     Education Gender          cat
0  High School      M  High School
1   Uneducated      F   Uneducated
2     Graduate      M     Graduate
3      College      F      College
4          NaN      M          NaN
5  High School      F  High School

print(test['cat'].cat.codes)
0    1
1    0
2    3
3    2
4   -1
5    1
dtype: int8
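
Note that the codes use -1 for missing values rather than NaN. If the goal is the float column with NaN shown in the question, one possible sketch (assuming the same example data) is to cast the codes to float and mask the -1 entries:

```python
import numpy as np
import pandas as pd

test = pd.DataFrame({
    'Education': ['High School', 'Uneducated', 'Graduate', 'College', np.nan, 'High School'],
    'Gender': ['M', 'F', 'M', 'F', 'M', 'F'],
})

edu = ['Uneducated', 'High School', 'College', 'Graduate']  # lowest to highest
cat = pd.Categorical(test['Education'], categories=edu, ordered=True)

# Categorical codes are integers, with -1 marking missing values.
codes = pd.Series(cat.codes, index=test.index)

# Cast to float, then keep only the non-negative codes; the rest become NaN.
test['Education'] = codes.astype('float64').where(codes >= 0)
print(test)
```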