How do I know if the data split has been done before or after my scaler with scikit-learn?

Time:07-27

I am creating a decision tree model using scikit-learn, and I need to split the data BEFORE scaling it with StandardScaler(). However, I also want to use the cross_val_score() function.

I first encoded some of my categorical data using OneHotEncoder() within make_column_transformer(), as below:

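import sklearn.compose
import sklearn.preprocessing

# One-hot encode the categorical columns; pass the remaining columns through unchanged.
transformer = sklearn.compose.make_column_transformer(
    (sklearn.preprocessing.OneHotEncoder(handle_unknown='ignore'), ['SoilDrainage', 'Geology', 'LU2016']),
    remainder='passthrough')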

I then instantiate my model and scaler:

import sklearn.tree

model = sklearn.tree.DecisionTreeClassifier()

scaler = sklearn.preprocessing.StandardScaler()

I add them to my pipeline:

import sklearn.pipeline

pipe = sklearn.pipeline.make_pipeline(transformer, scaler, model)

Finally, I pass the pipeline to cross_val_score():

import sklearn.model_selection

sklearn.model_selection.cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()

I don't get any errors when I do this, but because the split is done inside cross_val_score(), I'm not sure how to verify whether the scaler is applied before or after the data has been split.

CodePudding user response:

https://github.com/scikit-learn/scikit-learn/blob/baf0ea25d/sklearn/model_selection/_validation.py#L381

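If you look at the documentation for _fit_and_score(), it states explicitly that the estimator (here, your pipeline) is fit and scored for a given split of the dataset: it is fit on the training portion of that split and scored on the test portion. Because the scaler is a step inside the pipeline, it is fit after the split, on the training data only.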
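One way to check this yourself, rather than reading the source, is to ask cross_validate() to return the fitted pipelines and compare each scaler's learned mean with the mean of that fold's training rows. The sketch below is only an illustration under assumptions: it uses synthetic data from make_classification in place of your X and y, leaves out the column transformer, and passes an explicit KFold so the same splits can be regenerated for the comparison.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real X and y (an assumption for this sketch).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# An explicit, unshuffled KFold is deterministic, so split() below
# reproduces the exact folds that cross_validate() used.
cv = KFold(n_splits=5)
pipe = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=0))

# return_estimator=True keeps the pipeline fitted on each fold.
result = cross_validate(pipe, X, y, cv=cv, scoring="accuracy",
                        return_estimator=True)

fold_checks = []
for (train_idx, _), fitted in zip(cv.split(X, y), result["estimator"]):
    fold_scaler = fitted.named_steps["standardscaler"]
    # If scaling happened after the split, the scaler's mean equals the
    # mean of the training rows of this fold, not of the full dataset.
    fold_checks.append(
        bool(np.allclose(fold_scaler.mean_, X[train_idx].mean(axis=0)))
    )

print(fold_checks)
```

If every entry is True, each scaler only ever saw the training rows of its own fold.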

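Furthermore, in cross_validate(), which cross_val_score() calls internally, the estimator is cloned for each split, so no fitted state (such as the scaler's mean and variance) carries over from one fold to the next.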
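A visible side effect of the cloning is that the pipeline object you pass in is never fitted itself. The following minimal sketch illustrates this, again assuming synthetic data in place of your X and y and a pipeline without the column transformer.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real X and y (an assumption for this sketch).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
pipe = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=0))

scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")

# Only the per-fold clones were fitted. The original scaler has no
# learned attributes such as mean_, which StandardScaler sets in fit().
original_scaler = pipe.named_steps["standardscaler"]
original_is_fitted = hasattr(original_scaler, "mean_")
print(original_is_fitted)
```

To get a fitted pipeline for later predictions, call pipe.fit() on your training data after cross-validation.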
