StratifiedKFold and Over-Sampling together

I have a machine learning model and a dataset with 15 features about breast cancer. I want to predict the status of a person (alive or dead). The classes are imbalanced: 85% of the cases are alive and only 15% are dead. So I want to use over-sampling to deal with this problem and combine it with stratified k-fold. I wrote this code and it seems to work, but I don't know if I put the steps in the right order:

from sklearn.model_selection import StratifiedKFold
from imblearn.over_sampling import RandomOverSampler

skf = StratifiedKFold(n_splits=10, random_state=None)
skf.get_n_splits(x, y)

# Over-sample the whole dataset, then split it
ros = RandomOverSampler(sampling_strategy="not majority")
x_res, y_res = ros.fit_resample(x, y)

for train_index, test_index in skf.split(x_res, y_res):
    x_train, x_test = x_res.iloc[train_index], x_res.iloc[test_index]
    y_train, y_test = y_res.iloc[train_index], y_res.iloc[test_index]

Is it correct this way? Or should I apply over-sampling before the stratified k-fold split?

CodePudding user response:

Careful: resampling before splitting can cause data leakage, where training data leaks into the test data (see the common pitfalls section of the imbalanced-learn docs).

Put the sampler and the estimator in an imblearn pipeline, then pass it to cross_validate with StratifiedKFold; that way the over-sampling is applied only to the training folds:

from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# The sampler is only applied when the pipeline is fitted, so each
# training fold gets over-sampled while the test fold stays untouched.
model = make_pipeline(
    RandomOverSampler(sampling_strategy="not majority"),
    LogisticRegression(),
)

print(cross_validate(model, X, y, cv=StratifiedKFold())["test_score"].mean())
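
If you also want a metric that is not dominated by the majority class, you can pass a scoring argument to cross_validate; scikit-learn's built-in "balanced_accuracy" scorer is one option (just a sketch, any suitable metric works):

# Same pipeline as above, scored with balanced accuracy, which averages
# recall over both classes and is less misleading on imbalanced data
# than plain accuracy.
scores = cross_validate(
    model, X, y,
    cv=StratifiedKFold(n_splits=10),
    scoring="balanced_accuracy",
)
print(scores["test_score"])         # one score per fold
print(scores["test_score"].mean())  # average over the folds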

CodePudding user response:

Is it correct this way? Or should I apply over-sampling before the stratified k-fold split?

Note that this is exactly what your code does: you apply over-sampling with ros.fit_resample(x, y) before the k-fold split skf.split(x_res, y_res).

You should apply over-sampling after the k-fold split. If you over-sample before the split, there is a chance that copies of the same data point end up in both train and test within the same fold (this is called data leakage), which would give you overly optimistic test scores.

A correct version of your code would look like this:

from sklearn.model_selection import StratifiedKFold
from imblearn.over_sampling import RandomOverSampler

skf = StratifiedKFold(n_splits=10, random_state=None)
ros = RandomOverSampler(sampling_strategy="not majority")

for train_index, test_index in skf.split(x, y):
    # Split first on the original data ...
    x_train_unsampled, x_test = x.iloc[train_index], x.iloc[test_index]
    y_train_unsampled, y_test = y.iloc[train_index], y.iloc[test_index]
    # ... then over-sample only the training part of the fold
    x_train, y_train = ros.fit_resample(x_train_unsampled, y_train_unsampled)
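
If you want the loop to actually produce scores, you would fit a model on the over-sampled training fold and evaluate it on the untouched test fold. A minimal sketch (LogisticRegression and balanced accuracy are just placeholders here, use whatever model and metric you already have):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score

fold_scores = []
for train_index, test_index in skf.split(x, y):
    x_train_unsampled, x_test = x.iloc[train_index], x.iloc[test_index]
    y_train_unsampled, y_test = y.iloc[train_index], y.iloc[test_index]
    x_train, y_train = ros.fit_resample(x_train_unsampled, y_train_unsampled)

    clf = LogisticRegression(max_iter=1000)   # placeholder model, swap in your own
    clf.fit(x_train, y_train)                 # train on the over-sampled fold
    y_pred = clf.predict(x_test)              # predict on the untouched test fold
    fold_scores.append(balanced_accuracy_score(y_test, y_pred))

print(sum(fold_scores) / len(fold_scores))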

However, I encourage you to use a pipeline and cross_validate instead of writing all the boilerplate yourself, as Alexander suggested in his answer. This will save you time and effort and minimize the risk of introducing bugs.

A few other notes:

  1. get_n_splits() does nothing except return the number of splits you set on the line above; it does not touch the data at all, so you can simply remove it from your code (see the short demo after this list).
  2. Notice that I over-sample only the training pool. You usually want the test pool to keep its original class distribution so the evaluation reflects real data.
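
To illustrate note 1 (just a quick demonstration): get_n_splits() only echoes the number of splits you configured, and any data you pass it is ignored.

from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=10)
print(skf.get_n_splits())      # 10
print(skf.get_n_splits(x, y))  # still 10 -- the arguments are ignored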