Avoiding data leakage when using BaggingClassifier (Regressor) with feature scaling (StandardScaler)


I am running bagging with LogisticRegression. Because the latter uses regularization, the features must be scaled. And because bagging draws a sample (with replacement) from the original training set, scaling should happen after that sampling: scaling the original data set first and then drawing a sample amounts to data leakage. This is analogous to how scaling is often misused with cross-validation: it is wrong to scale the whole data set and then feed it to CV.
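
To make the analogy concrete, here is a minimal sketch of the CV case (X and y are placeholder data, not from my actual code): scaling inside a Pipeline is re-fit on each training fold, whereas scaling up front leaks information from the validation folds.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Wrong: the scaler sees the whole data set, so every fold's validation
# part has already influenced the fitted scaler.
X_scaled = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(LogisticRegression(), X_scaled, y, cv=5)

# Right: the scaler is a pipeline step, so it is fit only on the
# training portion of each fold.
pipe = Pipeline([('scaler', StandardScaler()), ('logit', LogisticRegression())])
clean_scores = cross_val_score(pipe, X, y, cv=5)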

It appears that there are no built-in tools to avoid this leakage with bagging (see the code below), but I may be wrong. Any help would be appreciated.

import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

single_log_reg = LogisticRegression(solver="liblinear", random_state=np.random.RandomState(18))

bagged_logistic = BaggingClassifier(single_log_reg, n_estimators=100, random_state=np.random.RandomState(42))

# The scaler is fit on the full training set *before* bagging draws its samples.
logit_bagged_pipeline = Pipeline(steps=[
    ('scaler', StandardScaler(with_mean=False)),
    ('bagged_logit', bagged_logistic)
])

# c_values, skf, all_model_features and y_train are defined elsewhere.
logit_bagged_grid = {'bagged_logit__base_estimator__C': c_values,
                     'bagged_logit__max_features': [100, 200, 400, 600, 800, 1000]}
logit_bagged_searcher = GridSearchCV(estimator=logit_bagged_pipeline, param_grid=logit_bagged_grid, cv=skf,
                                     scoring="roc_auc", n_jobs=6, verbose=4)
logit_bagged_searcher.fit(all_model_features, y_train)

CodePudding user response:

The leakage you mention is really only a major concern if you intend to use out-of-bag performance estimates. Otherwise, each of your models gets a little information from the scaling as to how its bag compares to the rest of the data, which might lead to slight overfitting, but your test scores will be fine.

That said, it is relatively straightforward to do this in sklearn: you just need to tie the scaler to the logistic regression inside the bagging estimator:

# Imports and c_values, skf, all_model_features, y_train as in the question.
single_log_reg = LogisticRegression(solver="liblinear", random_state=18)

# The scaler and the logistic regression form one base estimator, so the
# scaler is re-fit on each bootstrap sample rather than on the full training set.
logit_scaled_pipeline = Pipeline(steps=[
    ('scaler', StandardScaler(with_mean=False)),
    ('logit', single_log_reg),
])

bagged_logsc = BaggingClassifier(logit_scaled_pipeline, n_estimators=100, random_state=42)

# bagged_logsc is passed to GridSearchCV directly, so the parameter names are
# relative to the BaggingClassifier itself (no 'bagged_logsc__' prefix).
logit_bagged_grid = {
    'base_estimator__logit__C': c_values,
    'max_features': [100, 200, 400, 600, 800, 1000],
}
logit_bagged_searcher = GridSearchCV(estimator=bagged_logsc, param_grid=logit_bagged_grid, cv=skf,
                                     scoring="roc_auc", n_jobs=6, verbose=4)
logit_bagged_searcher.fit(all_model_features, y_train)
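
And if you do want out-of-bag estimates, a minimal sketch (reusing logit_scaled_pipeline above and assuming the question's all_model_features / y_train) is to set oob_score=True on the bagging estimator; with the scaler nested inside each base estimator, the OOB estimate no longer sees pre-scaled data:

# Out-of-bag scoring with the scaler nested inside each bag.
bagged_oob = BaggingClassifier(
    logit_scaled_pipeline,
    n_estimators=100,
    oob_score=True,      # score each sample only with estimators that did not see it
    random_state=42,
)
bagged_oob.fit(all_model_features, y_train)
print(bagged_oob.oob_score_)  # leakage-free out-of-bag accuracy estimate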

On random states, see https://stackoverflow.com/a/69756672/10495893.
