How does Sklearn's Normalizer class's fit function behave?

Time:08-12

According to the docs, the fit method:

fit(X, y=None) : Do nothing and return the estimator unchanged.

This method is just there to implement the usual API and hence work in pipelines.

However, when I use the fit function, the Normalizer "fits" to the shape of the X passed, and expects the same number of features when the transform function is used thereafter.

For example:

import numpy as np

A = np.random.rand(1, 7)
B = np.random.rand(1, 5)
print("A :", A, "\n", "B :", B)

>>> A : [[0.56973872 0.74769087 0.81626309 0.03873601 0.71216399 0.31807755 0.96527768]] 
    B : [[0.49805279 0.73939067 0.85949423 0.79824846 0.52750957]]

from sklearn.preprocessing import Normalizer
normalizer = Normalizer()

#normalizer.fit(A)
a = normalizer.transform(A)
b = normalizer.transform(B)
print("a :",a,"\n","b :",b)

>>> a : [[0.32403221 0.42524041 0.46424006 0.02203065 0.40503491 0.18090287
        0.54899035]] 
    b : [[0.3182623  0.47248039 0.54922815 0.5100913  0.33708558]]

However, when fit is called first, a ValueError is raised:

from sklearn.preprocessing import Normalizer
normalizer = Normalizer()

normalizer.fit(A)
a = normalizer.transform(A)
b = normalizer.transform(B)
print("a :",a,"\n","b :",b)

>>> ---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [28], in <module>
      4 normalizer.fit(A)
      5 a = normalizer.transform(A)
----> 6 b = normalizer.transform(B)
      7 print("a :",a,"\n","b :",b)

File ~\PycharmProjects\notebookWith_LSP\venv\lib\site-packages\sklearn\preprocessing\_data.py:1948, in Normalizer.transform(self, X, copy)
   1931 """Scale each non zero row of X to unit norm.
   1932 
   1933 Parameters
   (...)
   1945     Transformed array.
   1946 """
   1947 copy = copy if copy is not None else self.copy
-> 1948 X = self._validate_data(X, accept_sparse="csr", reset=False)
   1949 return normalize(X, norm=self.norm, axis=1, copy=copy)

File ~\PycharmProjects\notebookWith_LSP\venv\lib\site-packages\sklearn\base.py:600, in BaseEstimator._validate_data(self, X, y, reset, validate_separately, **check_params)
    597     out = X, y
    599 if not no_val_X and check_params.get("ensure_2d", True):
--> 600     self._check_n_features(X, reset=reset)
    602 return out

File ~\PycharmProjects\notebookWith_LSP\venv\lib\site-packages\sklearn\base.py:400, in BaseEstimator._check_n_features(self, X, reset)
    397     return
    399 if n_features != self.n_features_in_:
--> 400     raise ValueError(
    401         f"X has {n_features} features, but {self.__class__.__name__} "
    402         f"is expecting {self.n_features_in_} features as input."
    403     )

ValueError: X has 5 features, but Normalizer is expecting 7 features as input.

What exactly am I missing here?

CodePudding user response:

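The fit method doesn't learn anything, but it still validates the data by calling self._validate_data(X). By default (reset=True), that validation records the number of input features in n_features_in_. Later calls made with reset=False, such as the one inside transform shown in your traceback, check their input against that recorded value and raise if it differs.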

See https://github.com/scikit-learn/scikit-learn/blob/17df37aee774720212c27dbc34e6f1feef0e2482/sklearn/base.py

In _validate_data function:

reset : bool, default=True
            Whether to reset the `n_features_in_` attribute.
            If False, the input will be checked for consistency with data
            provided when reset was last True
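
A minimal sketch of that behaviour, assuming a scikit-learn version in which fitted estimators expose n_features_in_ (the attribute named in the traceback above). It only inspects the attribute before and after fit, and shows that calling fit again overwrites the recorded value:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

A = np.random.rand(1, 7)
B = np.random.rand(1, 5)

normalizer = Normalizer()

# Before fit, nothing has been recorded yet.
fitted_before = hasattr(normalizer, "n_features_in_")
print(fitted_before)  # False

# fit learns no parameters, but validation stores the feature count.
normalizer.fit(A)
n_after_A = normalizer.n_features_in_
print(n_after_A)  # 7

# Calling fit again validates with reset=True, so the count is overwritten.
normalizer.fit(B)
n_after_B = normalizer.n_features_in_
print(n_after_B)  # 5
```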

Unfortunately, Normalizer.fit doesn't seem to forward keyword arguments to _validate_data.
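
Two possible workarounds, offered as a sketch and not as the library's recommended approach: call the stateless normalize function (the same function transform delegates to in the traceback above), or refit the Normalizer before transforming an array with a different width. The variable names here are only for illustration.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, normalize

A = np.random.rand(1, 7)
B = np.random.rand(1, 5)

# Option 1: the plain function keeps no state, so widths may differ freely.
a = normalize(A, norm="l2", axis=1)
b = normalize(B, norm="l2", axis=1)

# Option 2: refit before each transform so the recorded feature count is reset.
normalizer = Normalizer()
a2 = normalizer.fit(A).transform(A)
b2 = normalizer.fit(B).transform(B)

# Both routes scale each row to unit L2 norm.
print(np.linalg.norm(a, axis=1), np.linalg.norm(b, axis=1))
print(np.allclose(a, a2), np.allclose(b, b2))
```

Option 1 is the simpler choice outside a pipeline. Option 2 keeps the estimator interface if the surrounding code expects one.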
