I am working on a machine learning project with very sparsely labeled data. There are several categorical features, with roughly one hundred different classes in total across the features.
For example:
0 red
1 blue
2 <missing>
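import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
color_cat = pd.DataFrame(['red', 'blue', np.nan])
color_enc = OneHotEncoder(sparse=True, handle_unknown='ignore')  # scikit-learn >= 1.2 names this sparse_output
color_one_hot = color_enc.fit_transform(color_cat)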
After I put these through scikit's OneHotEncoder, I am expecting the missing data to be encoded as 00, since the docs state that handle_unknown='ignore' causes the encoder to return an all-zero array. Substituting another value, such as with SimpleImputer, is not an option for me.
What I expect:
0 10
1 01
2 00
Instead, OneHotEncoder treats the missing values as another category.
What I get:
0 100
1 010
2 001
I have seen the related question: How to handle missing values (NaN) in categorical data when using scikit-learn OneHotEncoder? But the solutions do not work for me. I explicitly require a zero vector.
Answer:
I have never really worked with sparse matrices, but one way is to remove the column corresponding to your nan value. Get categories_ from the fitted encoder, build a Boolean mask that is True where the category is not nan (I use pd.Series.notna, but there are probably other ways), and use it to create a new sparse matrix (or reassign the existing one). Basically, add this to your code:
# currently you have
color_one_hot
# <3x3 sparse matrix of type '<class 'numpy.float64'>'
# with 3 stored elements in Compressed Sparse Row format>
# line of code to add
new_color_one_hot = color_one_hot[:,pd.Series(color_enc.categories_[0]).notna().to_numpy()]
# and now you have
new_color_one_hot
# <3x2 sparse matrix of type '<class 'numpy.float64'>'
# with 2 stored elements in Compressed Sparse Row format>
# and
new_color_one_hot.todense()
# matrix([[0., 1.],
# [1., 0.],
# [0., 0.]])
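Since the question mentions several categorical features, here is a minimal sketch of the same column-mask idea applied to more than one column. The helper name drop_missing_columns and the second feature "size" are made up for illustration, and it assumes a scikit-learn version where OneHotEncoder accepts NaN and stores it as a category (as in the question), with no drop option set on the encoder.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder


def drop_missing_columns(encoded, encoder):
    """Drop the one-hot columns that stand for a missing-value category.

    Hypothetical helper: the output columns follow the order of
    encoder.categories_, one block of columns per input feature.
    """
    keep = np.concatenate([~pd.isna(cats) for cats in encoder.categories_])
    return encoded[:, keep]


# Illustrative data: "size" is an assumed second feature.
df = pd.DataFrame(
    {"color": ["red", "blue", np.nan], "size": ["S", np.nan, "L"]},
    dtype=object,
)

# Sparse output is the default, so no sparse argument is passed here.
enc = OneHotEncoder(handle_unknown="ignore")
encoded = enc.fit_transform(df)          # 3 x 6: blue, red, nan, L, S, nan
trimmed = drop_missing_columns(encoded, enc)  # 3 x 4: blue, red, L, S
dense = trimmed.toarray()
```

Each row with a missing value ends up with all zeros in that feature's block of columns, and the result is still a sparse matrix.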
Edit: get_dummies also gives a similar result: pd.get_dummies(color_cat[0], sparse=True)
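A small self-contained sketch of the get_dummies route, in case it helps. It relies on the default dummy_na=False, which is what leaves the NaN row without any indicator set. The variable names are only for illustration.

```python
import numpy as np
import pandas as pd

colors = pd.Series(["red", "blue", np.nan], dtype=object)

# dummy_na defaults to False, so NaN gets no column of its own
# and its row contains no set indicator.
dummies = pd.get_dummies(colors, sparse=True)

# Densify only to inspect the result; columns are "blue", "red".
dense = dummies.sparse.to_dense().to_numpy()
```

One thing to keep in mind: get_dummies does not remember the categories, so new data has to be aligned to the same columns by hand, whereas a fitted OneHotEncoder can be reused with transform.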
EDIT: After looking a bit more, you can specify the parameter categories in OneHotEncoder, so if you do:
color_cat = pd.DataFrame(['red', 'blue', np.nan])
color_enc = OneHotEncoder(categories=[color_cat[0].dropna().unique()],  # NaN is left out of the known categories
                          sparse=True, handle_unknown='ignore')  # scikit-learn >= 1.2 names this sparse_output
color_one_hot = color_enc.fit_transform(color_cat)
color_one_hot.todense()
# matrix([[1., 0.],
# [0., 1.],
# [0., 0.]])
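The same idea can probably be extended to several features by passing one array of known categories per column. Below is a rough sketch under that assumption. The "size" column and the unseen value "green" are made up for illustration, and the sparse argument is left out because sparse output is the default.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Illustrative training data: "size" is an assumed second feature.
train = pd.DataFrame(
    {"color": ["red", "blue", np.nan], "size": ["S", np.nan, "L"]},
    dtype=object,
)

# One array of known (non-missing) categories per column, in column order.
known = [train[col].dropna().unique() for col in train.columns]

enc = OneHotEncoder(categories=known, handle_unknown="ignore")
train_dense = enc.fit_transform(train).toarray()  # columns: red, blue, S, L

# At transform time, NaN and never-seen values are both treated as unknown
# and get an all-zero block for that feature.
new = pd.DataFrame({"color": ["green", np.nan], "size": ["L", "S"]}, dtype=object)
new_dense = enc.transform(new).toarray()
```

Compared with slicing columns off afterwards, this keeps the zero-vector behaviour inside the fitted encoder, so it carries over to new data and to pipelines without an extra step.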