from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
What I know is that the fit() method calculates the mean and standard deviation of the feature, and the transform() method then uses them to scale the feature. fit_transform() is nothing but calling fit() and transform() in a single line.
But why are we calling fit() only on the training data and not on the test data? Does that mean we are using the mean and standard deviation of the training data to transform our test data?
CodePudding user response:
fit
computes the mean and stdev to be used for later scaling; note it's just a computation, no scaling is done.
transform
uses the previously computed mean and stdev to scale the data (subtract the mean from all values and then divide by the stdev).
fit_transform
does both at the same time, so you can do it with just one line of code.
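To make the equivalence concrete, here is a minimal sketch (with a made-up toy array) showing that fit() followed by transform() produces the same result as a single fit_transform() call:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])  # toy data for illustration

# fit() then transform() in two steps
sc1 = StandardScaler()
sc1.fit(X)                       # only computes mean_ and scale_, no scaling yet
X_two_step = sc1.transform(X)    # now applies (x - mean_) / scale_

# fit_transform() does both in one call
sc2 = StandardScaler()
X_one_step = sc2.fit_transform(X)

print(np.allclose(X_two_step, X_one_step))  # True
print(sc1.mean_)                            # [2.5]
```

After fitting, the learned statistics are exposed as the `mean_` and `scale_` attributes of the scaler.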
For the X_train dataset, we do fit_transform because we need to compute the mean and stdev, and then use them to scale X_train. For the X_test dataset, since we already have the mean and stdev, we only do the transformation part.
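A quick sketch (with a hypothetical tiny train/test split) confirms that transform() on the test set reuses the statistics learned from the training set:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# hypothetical tiny split, just for illustration
X_train = np.array([[10.0], [20.0], [30.0]])
X_test = np.array([[25.0]])

sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)  # learns train mean and stdev, then scales train
X_test_scaled = sc.transform(X_test)        # reuses the TRAIN statistics on test

# manual check: (x - train_mean) / train_stdev
manual = (X_test - sc.mean_) / sc.scale_
print(np.allclose(X_test_scaled, manual))  # True
```

Note that `sc.mean_` here is the training mean (20.0), not the test mean, which is exactly the behavior the question is asking about.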
Edit: the X_test data should be totally unseen and unknown (i.e., no information is extracted from it), so we can only derive info from X_train. The reason we apply the mean and stdev derived from X_train to transform X_test as well is to keep an "apples-to-apples" comparison between y_test and y_pred.
By the way, if the train/test split is done properly and without bias, and the data is sufficiently large, both datasets will approximate the same population mean and stdev anyway.
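This last point can be sketched with synthetic data (a made-up normal distribution, purely for illustration): with a large unbiased split, the train and test statistics come out nearly identical.

```python
import numpy as np

# synthetic data drawn from a known distribution
rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=100_000)

# unbiased 80/20 split (samples are already i.i.d., so slicing is fine here)
train, test = data[:80_000], data[80_000:]

# the two sets approximate the same population mean and stdev
print(abs(train.mean() - test.mean()))  # tiny difference
print(abs(train.std() - test.std()))    # tiny difference
```

So for large, well-split data the distinction matters less numerically, but fitting only on the training set is still the methodologically correct habit.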