I am creating a decision tree model using scikit-learn and I need to split the data BEFORE scaling it with StandardScaler(). However, I also want to use the cross_val_score() method.
I first encoded some of my categorical data using OneHotEncoder() within make_column_transformer() as below:
transformer = sklearn.compose.make_column_transformer(
    (sklearn.preprocessing.OneHotEncoder(handle_unknown='ignore'),
     ['SoilDrainage', 'Geology', 'LU2016']),
    remainder='passthrough')
I then instantiate my model and scaler:
model = sklearn.tree.DecisionTreeClassifier()
scaler = sklearn.preprocessing.StandardScaler()
I add them to my pipeline:
pipe = sklearn.pipeline.make_pipeline(transformer, scaler, model)
Finally, I pass the pipeline to cross_val_score():
sklearn.model_selection.cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()
I don't get any errors when I do this, but because the split is done within the cross_val_score() method, I'm not sure how to verify whether the scaler is applied before or after the data has been split.
CodePudding user response:
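If you look at the documentation for _fit_and_score(), it is explicitly written that the estimator (= your pipeline) is fitted and scored on a given split of the dataset. The scaler inside the pipeline is therefore fitted on the training part of each split only, i.e. after the data has been split.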
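To make the per-split behaviour concrete, here is a minimal sketch of a hand-written loop that should give the same scores as cross_val_score(). It rests on a few assumptions that are not in the question: the data is a synthetic, numeric-only stand-in from make_classification (so the one-hot transformer is left out), the tree gets a fixed random_state so results are reproducible, and cv=5 is spelled out as an unshuffled StratifiedKFold, which is what an integer cv resolves to for a classifier.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; the original X and y are not shown).
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Fixed random_state (an assumption) so both runs build identical trees.
pipe = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=0))

# For a classifier, cv=5 resolves to an unshuffled 5-fold StratifiedKFold.
cv = StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, test_idx in cv.split(X, y):
    fold_pipe = clone(pipe)                     # fresh, unfitted copy per fold
    fold_pipe.fit(X[train_idx], y[train_idx])   # scaler statistics come from training rows only
    y_pred = fold_pipe.predict(X[test_idx])     # test rows are transformed, never fitted on
    manual_scores.append(accuracy_score(y[test_idx], y_pred))

library_scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")

# Prints whether the manual loop and cross_val_score() agree fold by fold.
print(np.allclose(manual_scores, library_scores))
```

Because the scaler lives inside the pipeline, calling fit() on the training indices is the only place its mean and variance are computed; the test rows only ever pass through transform() as part of predict().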
Furthermore, in cross_validate(), which cross_val_score() calls internally, the estimator is cloned for each split, so nothing fitted on one fold carries over to the next.
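If you want to check this yourself rather than rely on the documentation, one option is to ask cross_validate() to return the fitted pipeline of every fold (return_estimator=True) and compare each scaler's learned mean_ with the mean of that fold's training rows. The sketch below is illustrative only: it assumes synthetic numeric data instead of your X and y, leaves out the one-hot transformer, and passes an explicit StratifiedKFold so the same splits can be regenerated for the comparison.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; the original X and y are not shown).
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

pipe = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=0))

# Unshuffled splitter, so calling split() again yields the same folds.
cv = StratifiedKFold(n_splits=5)

results = cross_validate(pipe, X, y, cv=cv, scoring="accuracy",
                         return_estimator=True)

full_mean = X.mean(axis=0)
matches_train_fold = []
matches_full_data = []
for (train_idx, _), fitted in zip(cv.split(X, y), results["estimator"]):
    # make_pipeline names each step after its lowercased class name.
    scaler_mean = fitted.named_steps["standardscaler"].mean_
    matches_train_fold.append(np.allclose(scaler_mean, X[train_idx].mean(axis=0)))
    matches_full_data.append(np.allclose(scaler_mean, full_mean))

# First value: every scaler matches its own training fold.
# Second value: whether any scaler matches the statistics of the full dataset.
print(all(matches_train_fold), any(matches_full_data))
```

If the scaler had been fitted before the split, each mean_ would equal the full-data mean instead of the training-fold mean.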