Inconsistent results from OrdinalEncoder with np.nan in input array


I wish to use OrdinalEncoder to encode some ordinal data in a format like ["6-10","11-15","1-5",...,np.nan], with the encoding order specified via the categories parameter as ["1-5","6-10","11-15",...] and np.nan ignored (I want to encode the given features first and fill the NaNs afterwards).

According to the user guide, sklearn's OrdinalEncoder should ignore np.nan in the input array:

(From https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-categorical-features)

but I get inconsistent results depending on whether the input is a plain list, a np.array, or whether the categories parameter is specified:

!pip install -U scikit-learn
!pip install -U numpy

import sklearn
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

print(sklearn.__version__)

dummy_array = [["1-5"],["6-10"],["10-15"],["6-10"],["10-15"],["10-15"],["1-5"],[np.nan]]
dummy_array2 = np.array(["1-5","6-10","10-15","6-10","10-15","10-15","1-5",np.nan])
enc_order = ["1-5","6-10","10-15"]
enc1 = OrdinalEncoder()
enc2 = OrdinalEncoder()
enc3 = OrdinalEncoder(categories=[enc_order])
print(enc1.fit_transform(dummy_array))
print(enc2.fit_transform(dummy_array2.reshape(-1,1)))
print(enc3.fit_transform(dummy_array))

Requirement already satisfied: scikit-learn in /usr/local/lib/python3.7/dist-packages (1.0.1)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.7/dist-packages (from scikit-learn) (1.0.1)
Requirement already satisfied: numpy>=1.14.6 in /usr/local/lib/python3.7/dist-packages (from scikit-learn) (1.21.3)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.7/dist-packages (from scikit-learn) (3.0.0)
Requirement already satisfied: scipy>=1.1.0 in /usr/local/lib/python3.7/dist-packages (from scikit-learn) (1.4.1)
Requirement already satisfied: numpy in /usr/local/lib/python3.7/dist-packages (1.21.3)
1.0.1
[[ 0.]
 [ 2.]
 [ 1.]
 [ 2.]
 [ 1.]
 [ 1.]
 [ 0.]
 [nan]]
[[0.]
 [2.]
 [1.]
 [2.]
 [1.]
 [1.]
 [0.]
 [3.]]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-c460949a3bd3> in <module>()
     16 print(enc1.fit_transform(dummy_array))
     17 print(enc2.fit_transform(dummy_array2.reshape(-1,1)))
---> 18 print(enc3.fit_transform(dummy_array))

2 frames
/usr/local/lib/python3.7/dist-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
    845         if y is None:
    846             # fit method of arity 1 (unsupervised transformation)
--> 847             return self.fit(X, **fit_params).transform(X)
    848         else:
    849             # fit method of arity 2 (supervised transformation)

/usr/local/lib/python3.7/dist-packages/sklearn/preprocessing/_encoders.py in fit(self, X, y)
    884 
    885         # `_fit` will only raise an error when `self.handle_unknown="error"`
--> 886         self._fit(X, handle_unknown=self.handle_unknown, force_all_finite="allow-nan")
    887 
    888         if self.handle_unknown == "use_encoded_value":

/usr/local/lib/python3.7/dist-packages/sklearn/preprocessing/_encoders.py in _fit(self, X, handle_unknown, force_all_finite)
    114                             " during fit".format(diff, i)
    115                         )
--> 116                         raise ValueError(msg)
    117             self.categories_.append(cats)
    118 

ValueError: Found unknown categories [nan] in column 0 during fit

As I don't have much experience with numpy and sklearn, I am not sure what causes the different results in these three cases. From my understanding, the first two cases should both give the following result, and the third case should not raise an error:

[[ 0.]
 [ 2.]
 [ 1.]
 [ 2.]
 [ 1.]
 [ 1.]
 [ 0.]
 [nan]] 

Any help would be appreciated, thank you!

CodePudding user response:

You need to be explicit about what to do with unknown (missing) values:

import numpy as np
from sklearn.preprocessing import OrdinalEncoder

dummy_array = [["1-5"],["6-10"],["10-15"],["6-10"],["10-15"],["10-15"],["1-5"],[np.nan]]
enc_order = ["1-5","6-10","10-15"]

# unknown_value is required when handle_unknown='use_encoded_value'
enc3 = OrdinalEncoder(categories=[enc_order], 
                      handle_unknown='use_encoded_value', 
                      unknown_value=np.nan)  

enc3.fit_transform(dummy_array)

yields

array([[ 0.],
       [ 1.],
       [ 2.],
       [ 1.],
       [ 2.],
       [ 2.],
       [ 0.],
       [nan]])

The default for handle_unknown is "error", which is why your third case raised the ValueError you saw.

The documentation states:

handle_unknown: {‘error’, ‘use_encoded_value’}, default=’error’

When set to ‘error’ an error will be raised in case an unknown categorical feature is present during transform. When set to ‘use_encoded_value’, the encoded value of unknown categories will be set to the value given for the parameter unknown_value

And the help for unknown_value is:

unknown_value : int or np.nan, default=None

When the parameter handle_unknown is set to ‘use_encoded_value’, this parameter is required and will set the encoded value of unknown categories. It has to be distinct from the values used to encode any of the categories in fit. If set to np.nan, the dtype parameter must be a float dtype.
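
If you would rather not introduce NaNs into the encoded column, the same mechanism works with an integer sentinel. Below is a minimal sketch of that variant (the name enc4 and the sentinel -1 are arbitrary choices of mine, not from the original post); the sentinel just has to be distinct from the codes assigned to the known categories:

import numpy as np
from sklearn.preprocessing import OrdinalEncoder

dummy_array = [["1-5"],["6-10"],["10-15"],["6-10"],["10-15"],["10-15"],["1-5"],[np.nan]]
enc_order = ["1-5","6-10","10-15"]

# -1 is an arbitrary sentinel; it only has to differ from the codes 0, 1, 2
enc4 = OrdinalEncoder(categories=[enc_order],
                      handle_unknown='use_encoded_value',
                      unknown_value=-1)
print(enc4.fit_transform(dummy_array))
# the missing row should come out as -1. instead of nan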


The reason your dummy_array2 comes out with every value encoded, including the NaN, is that the input is a NumPy array of strings: since the other elements are strings and a NumPy array requires a single dtype (here "<U32"), np.nan is converted to the string 'nan'. As a result, 'nan' is just another category, and all values are encoded to integers (well, floats).
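
You can check the coercion directly; this is just an illustrative snippet, and the dtype=object variant at the end is my assumption of a workaround (the real np.nan survives, so the encoder should behave like the plain-list case and pass the NaN through):

import numpy as np

mixed = np.array(["1-5", "6-10", np.nan])
print(mixed.dtype)   # <U32 -- everything has been coerced to strings
print(mixed[-1])     # 'nan', a three-character string, not a missing value

# with dtype=object the real np.nan is kept as a float
mixed_obj = np.array(["1-5", "6-10", np.nan], dtype=object)
print(type(mixed_obj[-1]))   # <class 'float'>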
