I am used to running sklearn's StandardScaler the following way:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(X_train)
scaled_X_train = scaler.transform(X_train)
where X_train is an array containing the features in my training dataset. I may then use the same scaler to scale the features in my test dataset X_test:
scaled_X_test = scaler.transform(X_test)
I know that I may also "bake" the scaler into the model using sklearn's make_pipeline:
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
clf = make_pipeline(StandardScaler(), RandomForestClassifier(n_estimators=100))
But then how do I use the scaler? Is it enough to call the model like I normally would, i.e.:
clf.fit(X_train, y_train)
And then:
y_pred = clf.predict(X_test)
?
Answer:
Yes, that is correct. Baking the preprocessing into a pipeline is also a good idea because it avoids the common pitfall of fitting the scaler on the training and test sets independently.
When you call clf.fit(X_train, y_train), the pipeline fits the scaler on X_train and transforms it before training the classifier; when you later call clf.predict(X_test), that same fitted scaler is used to preprocess X_test.
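For illustration, here is a minimal self-contained sketch of that workflow; the toy dataset from make_classification is only there so the example runs end to end, and the variable names mirror the ones in your question:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data so the sketch is runnable; replace with your own train/test split
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = make_pipeline(StandardScaler(), RandomForestClassifier(n_estimators=100))

# fit() first fits the scaler on X_train, transforms X_train, then trains the forest
clf.fit(X_train, y_train)

# predict() reuses the scaler fitted on X_train to transform X_test before predicting
y_pred = clf.predict(X_test)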
See the example at the beginning of scikit-learn's "Common pitfalls and recommended practices" documentation, which also advises: "We recommend using a Pipeline, which makes it easier to chain transformations with estimators, and reduces the possibility of forgetting a transformation."
So the fact that you don't "use" the scaler yourself is by design.
That said, if for some reason you want to access the scaler inside the pipeline on its own, for example to inspect its fitted values, you can do so:
clf.fit(X_train, y_train)
# steps[0] is the first (name, estimator) pair in the pipeline;
# [1] picks out the estimator itself, i.e. the fitted StandardScaler
clf.steps[0][1].scale_
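Equivalently, since make_pipeline names each step after its lowercased class name, you can look the scaler up by name via named_steps, or (on scikit-learn 0.21 and later) index the pipeline directly; both return the same fitted StandardScaler:
# Look up the step by its auto-generated name
clf.named_steps["standardscaler"].mean_
# Or index the pipeline by position
clf[0].scale_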