I want to use OrdinalEncoder to encode ordinal data in a format like ["6-10","11-15","1-5",...,np.nan], with the encoding order specified in the categories parameter as ["1-5","6-10","11-15",...] and with np.nan ignored (I want to encode the given features first, before filling the NaNs).
According to the user guide, sklearn's OrdinalEncoder should ignore np.nan in the input array:
[From https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-categorical-features][1]
However, I get inconsistent results depending on whether the input is a plain list, an np.array, or a list with the categories parameter specified:
!pip install -U scikit-learn
!pip install -U numpy
import sklearn
import numpy as np
from sklearn.preprocessing import OrdinalEncoder
print(sklearn.__version__)
dummy_array = [["1-5"],["6-10"],["10-15"],["6-10"],["10-15"],["10-15"],["1-5"],[np.nan]]
dummy_array2 = np.array(["1-5","6-10","10-15","6-10","10-15","10-15","1-5",np.nan])
enc_order = ["1-5","6-10","10-15"]
enc1 = OrdinalEncoder()
enc2 = OrdinalEncoder()
enc3 = OrdinalEncoder(categories=[enc_order])
print(enc1.fit_transform(dummy_array))
print(enc2.fit_transform(dummy_array2.reshape(-1,1)))
print(enc3.fit_transform(dummy_array))
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.7/dist-packages (1.0.1)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.7/dist-packages (from scikit-learn) (1.0.1)
Requirement already satisfied: numpy>=1.14.6 in /usr/local/lib/python3.7/dist-packages (from scikit-learn) (1.21.3)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.7/dist-packages (from scikit-learn) (3.0.0)
Requirement already satisfied: scipy>=1.1.0 in /usr/local/lib/python3.7/dist-packages (from scikit-learn) (1.4.1)
Requirement already satisfied: numpy in /usr/local/lib/python3.7/dist-packages (1.21.3)
1.0.1
[[ 0.]
[ 2.]
[ 1.]
[ 2.]
[ 1.]
[ 1.]
[ 0.]
[nan]]
[[0.]
[2.]
[1.]
[2.]
[1.]
[1.]
[0.]
[3.]]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-3-c460949a3bd3> in <module>()
16 print(enc1.fit_transform(dummy_array))
17 print(enc2.fit_transform(dummy_array2.reshape(-1,1)))
---> 18 print(enc3.fit_transform(dummy_array))
2 frames
/usr/local/lib/python3.7/dist-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
845 if y is None:
846 # fit method of arity 1 (unsupervised transformation)
--> 847 return self.fit(X, **fit_params).transform(X)
848 else:
849 # fit method of arity 2 (supervised transformation)
/usr/local/lib/python3.7/dist-packages/sklearn/preprocessing/_encoders.py in fit(self, X, y)
884
885 # `_fit` will only raise an error when `self.handle_unknown="error"`
--> 886 self._fit(X, handle_unknown=self.handle_unknown, force_all_finite="allow-nan")
887
888 if self.handle_unknown == "use_encoded_value":
/usr/local/lib/python3.7/dist-packages/sklearn/preprocessing/_encoders.py in _fit(self, X, handle_unknown, force_all_finite)
114 " during fit".format(diff, i)
115 )
--> 116 raise ValueError(msg)
117 self.categories_.append(cats)
118
ValueError: Found unknown categories [nan] in column 0 during fit
As I don't have much experience with numpy and sklearn, I am not sure why these three cases give different results. From my understanding, the first two cases should both give the following result, and the third case should not raise an error:
[[ 0.]
[ 2.]
[ 1.]
[ 2.]
[ 1.]
[ 1.]
[ 0.]
[nan]]
Any help would be appreciated, thank you!

[1]: https://i.stack.imgur.com/Gba8X.png
Answer:
You need to be explicit about what to do with unknown (missing) values:
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

dummy_array = [["1-5"],["6-10"],["10-15"],["6-10"],["10-15"],["10-15"],["1-5"],[np.nan]]
enc_order = ["1-5","6-10","10-15"]
# unknown_value is mandatory when handle_unknown='use_encoded_value'
enc3 = OrdinalEncoder(categories=[enc_order],
                      handle_unknown='use_encoded_value',
                      unknown_value=np.nan)
enc3.fit_transform(dummy_array)
yields
array([[ 0.],
[ 1.],
[ 2.],
[ 1.],
[ 2.],
[ 2.],
[ 0.],
[nan]])
The default for handle_unknown is "error", which is why you got the ValueError.
The documentation states:
handle_unknown : {‘error’, ‘use_encoded_value’}, default=’error’
When set to ‘error’ an error will be raised in case an unknown categorical feature is present during transform. When set to ‘use_encoded_value’, the encoded value of unknown categories will be set to the value given for the parameter unknown_value
And the help for unknown_value is:
unknown_value : int or np.nan, default=None
When the parameter handle_unknown is set to ‘use_encoded_value’, this parameter is required and will set the encoded value of unknown categories. It has to be distinct from the values used to encode any of the categories in fit. If set to np.nan, the dtype parameter must be a float dtype.
The reason your dummy_array2 comes out with every value encoded, including the NaN, is that the input is a NumPy array of strings: np.nan is converted to the string 'nan', because the other elements are strings and a NumPy array has a single dtype. In this case, the dtype is "U32". As a result, 'nan' is treated as just another category, and all values are encoded to integers (well, floats).
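The coercion can be checked with NumPy alone. The following is a small sketch (the names `values`, `as_str` and `as_obj` are illustrative) that compares the default conversion with an explicit `dtype=object`, which keeps np.nan as a float instead of turning it into text.

```python
import numpy as np

values = ["1-5", "6-10", "10-15", np.nan]

# Default conversion: every element is coerced to a unicode string.
as_str = np.array(values)
print(as_str.dtype, repr(as_str[-1]))

# dtype=object stores the original Python objects, so NaN stays a float.
as_obj = np.array(values, dtype=object)
print(as_obj.dtype, repr(as_obj[-1]))

# Only the object array still holds a real missing value.
print(as_str[-1] == "nan", isinstance(as_obj[-1], float))
```

If the data has to live in a NumPy array before encoding, building it with `dtype=object` is one way to keep the missing value recognizable as NaN.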