Missing categorical data should be encoded with an all-zero one-hot vector

I am working on a machine learning project with very sparsely labeled data. There are several categorical features, with roughly one hundred different classes in total across those features.

For example:

0    red
1    blue
2    <missing>

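import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
color_cat = pd.DataFrame(['red', 'blue', np.nan])
color_enc = OneHotEncoder(handle_unknown='ignore')  # sparse output is the default
color_one_hot = color_enc.fit_transform(color_cat)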

After I put these through scikit-learn's OneHotEncoder, I expect the missing value to be encoded as 00, since the docs state that handle_unknown='ignore' causes the encoder to return an all-zero array. Substituting another value, for example with SimpleImputer, is not an option for me.

What I expect:

0    10
1    01
2    00

Instead, OneHotEncoder treats the missing value as another category.

What I get:

0    100
1    010
2    001
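
For reference, here is a minimal way to see where the third column comes from. It is only a sketch and assumes scikit-learn 0.24 or later, where NaN is accepted as input: the fitted encoder's categories_ attribute appears to list NaN as a category of its own.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

color_cat = pd.DataFrame(['red', 'blue', np.nan])
color_enc = OneHotEncoder(handle_unknown='ignore')
color_one_hot = color_enc.fit_transform(color_cat)

# Categories learned during fit; NaN is listed next to 'blue' and 'red'
print(color_enc.categories_[0])

# Dense view of the encoding: one column per learned category
print(color_one_hot.toarray())
```

So the NaN seen during fit is not "unknown" to the encoder, which would explain why handle_unknown='ignore' does not apply to it.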

I have seen the related question "How to handle missing values (NaN) in categorical data when using scikit-learn OneHotEncoder?", but the solutions there do not work for me. I explicitly require a zero vector.

CodePudding user response:

I have never really worked with sparse matrices, but one way is to remove the column corresponding to your NaN value. Get categories_ from the fitted encoder, build a Boolean mask that is True where the category is not NaN (I use pd.Series.notna, but there are probably other ways), and use it to select columns into a new sparse matrix (or reassign the existing one). Basically, add this to your code:

# currently you have
color_one_hot
# <3x3 sparse matrix of type '<class 'numpy.float64'>'
#   with 3 stored elements in Compressed Sparse Row format>

# line of code to add: keep only the columns whose category is not NaN
new_color_one_hot = color_one_hot[:, pd.Series(color_enc.categories_[0]).notna().to_numpy()]

# and now you have
new_color_one_hot
# <3x2 sparse matrix of type '<class 'numpy.float64'>'
#   with 2 stored elements in Compressed Sparse Row format>

# and
new_color_one_hot.todense()
# matrix([[0., 1.],
#         [1., 0.],
#         [0., 0.]])

Edit: pd.get_dummies also gives a similar result, since by default it does not create a column for NaN: pd.get_dummies(color_cat[0], sparse=True)
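
To make the get_dummies variant concrete, here is a small sketch using the same three example values. The variable names dummies and dense are just for illustration.

```python
import numpy as np
import pandas as pd

color_cat = pd.DataFrame(['red', 'blue', np.nan])

# dummy_na defaults to False, so NaN gets no column of its own
dummies = pd.get_dummies(color_cat[0], sparse=True)
print(list(dummies.columns))  # ['blue', 'red']

# Densify and cast to int only to inspect the rows
dense = dummies.sparse.to_dense().to_numpy().astype(int)
print(dense)
# [[0 1]
#  [1 0]
#  [0 0]]
```

Keep in mind that get_dummies has no fitted state, so the columns depend on whatever values are in the frame you pass in. If you need the same columns for later data, the encoder-based approaches are probably the safer choice. If you need a SciPy sparse matrix rather than a DataFrame, dummies.sparse.to_coo() should give you one.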

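EDIT: After looking a bit more, you can also pass the categories parameter to OneHotEncoder. With the known categories listed explicitly, NaN counts as unknown and handle_unknown='ignore' encodes it as all zeros: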

color_cat = pd.DataFrame(['red', 'blue', np.nan])
# list the non-missing categories explicitly so NaN gets no column of its own
color_enc = OneHotEncoder(categories=[color_cat[0].dropna().unique()],
                          handle_unknown='ignore')  # sparse output is the default
color_one_hot = color_enc.fit_transform(color_cat)
color_one_hot.todense()
# matrix([[1., 0.],
#         [0., 1.],
#         [0., 0.]])
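
As a quick check that this also holds for data seen after fitting, here is a sketch that reuses the fitted encoder on a new batch. The new_colors frame and the value 'green' are made up for illustration, and the sketch assumes scikit-learn 0.24 or later.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

color_cat = pd.DataFrame(['red', 'blue', np.nan])
known = color_cat[0].dropna().unique()  # array(['red', 'blue'], dtype=object)

color_enc = OneHotEncoder(categories=[known], handle_unknown='ignore')
color_enc.fit(color_cat)

# Hypothetical new batch: a known color, a missing value, an unseen color
new_colors = pd.DataFrame(['blue', np.nan, 'green'])
encoded = color_enc.transform(new_colors).toarray()
print(encoded)
# [[0. 1.]
#  [0. 0.]
#  [0. 0.]]
```

Note that with this setup a missing value and a never-seen category are indistinguishable in the output, since both become the zero vector.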