How to normalize test and train data when they have different numbers of rows


Basically, I have a dataframe d and one of its columns is "price" (numerical), with 109248 rows. I split the data into two parts, d_train and d_test. d_train has 73196 values and d_test has 36052 values. To normalize d_train['price'] and d_test['price'] I did something like this:

from sklearn.preprocessing import Normalizer

price_scalar = Normalizer()
X_train_price = price_scalar.fit_transform(d_train['price'].values.reshape(1, -1))
X_test_price = price_scalar.transform(d_test['price'].values.reshape(1, -1))

Now I'm getting this error:

ValueError                                Traceback (most recent call last)
<ipython-input-20-ba623ca7bafa> in <module>()
3 X_train_price = price_scalar.fit_transform(X_train['price'].values.reshape(1, -1))
----> 4 X_test_price = price_scalar.transform(X_test['price'].values.reshape(1, -1))
/usr/local/lib/python3.7/dist-packages/sklearn/base.py in _check_n_features(self, X, reset)
394         if n_features != self.n_features_in_:
395             raise ValueError(
397                 f"is expecting {self.n_features_in_} features as input."
398             )
ValueError: X has 36052 features, but Normalizer is expecting 73196 features as input.

Changing reshape(1, -1) to reshape(-1, 1) runs without the error, but it turns every row value of "price" into 1.

CodePudding user response:

reshape(-1, 1) is OK. The result of all 1s is exactly what is expected if you use Normalizer from sklearn: each sample (i.e., each row of the data matrix) with at least one non-zero component is rescaled independently of the other samples so that its norm (l1, l2 or inf) equals one.
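
A minimal sketch (with made-up prices, not the asker's data) of why that happens: when the column is reshaped to (-1, 1), every row contains a single value, and dividing that value by its own norm gives 1.

import numpy as np
from sklearn.preprocessing import Normalizer

# Toy price column shaped (n_samples, 1): one feature per row
prices = np.array([[10.0], [250.0], [3.5]])

# Normalizer rescales each ROW to unit (l2) norm; a row holding a single
# positive value divided by its own norm is always 1
print(Normalizer().fit_transform(prices))
# [[1.]
#  [1.]
#  [1.]]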

CodePudding user response:

scikit-learn always assumes that the data is organized with shape (n_points, n_features) (i.e., each row is a data point). Also, from the documentation, Normalizer normalizes "samples individually to unit norm". This means that each data point (i.e., row) is normalized, rather than along the column (i.e., all price values).
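
A small sketch (array sizes taken from the question) of how the two reshapes differ, and why the original reshape(1, -1) leads to the ValueError above: fitting on a (1, 73196) array makes the scaler expect 73196 features, so a (1, 36052) test array no longer matches.

import numpy as np

prices = np.arange(73196, dtype=float)  # stand-in for d_train['price'].values
print(prices.reshape(-1, 1).shape)      # (73196, 1) -> 73196 samples, 1 feature
print(prices.reshape(1, -1).shape)      # (1, 73196) -> 1 sample, 73196 features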

To normalize the values to the [0, 1] range, you should use the MinMaxScaler with the data reshaped into a column. That is,

from sklearn.preprocessing import MinMaxScaler
price_scalar = MinMaxScaler()
X_train_price = price_scalar.fit_transform(d_train['price'].values.reshape(-1, 1))
X_test_price = price_scalar.transform(d_test['price'].values.reshape(-1, 1))

It is worth noting that this does not guarantee that the price values in the test set all end up within the [0, 1] range: a test price outside the training minimum and maximum is mapped below 0 or above 1. That is how it should be when fitting an ML model (the scaler is fit on the training data only), but keep it in mind.
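
For example, a minimal sketch (with made-up prices) showing how a test price outside the training range is mapped outside [0, 1]:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# The scaler only ever sees the training range 10..20
scaler = MinMaxScaler()
scaler.fit(np.array([[10.0], [15.0], [20.0]]))

# Test prices outside that range scale to values outside [0, 1]
print(scaler.transform(np.array([[25.0], [5.0]])))
# [[ 1.5]
#  [-0.5]]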

CodePudding user response:

Here, you can call the fit_transform() function directly, instead of calling the fit() and transform() functions separately.

from sklearn.preprocessing import Normalizer

price_scalar = Normalizer()
X_train_price = price_scalar.fit_transform(d_train['price'].values.reshape(1, -1))
X_test_price = price_scalar.fit_transform(d_test['price'].values.reshape(1, -1))