Data leakage related to scikit learn and train-test split


I have a question about data leakage. I have learned most of my data science and machine learning skills from the “2022 Python for Machine Learning & Data Science Masterclass” course on Udemy. The standard procedure there has always been to do data cleaning (such as filling in missing values), feature selection, feature engineering, and EDA first, and only then do the train-test split. After the split, any normalization or scaling, such as StandardScaler, is applied. I have been practicing this way too.

I have read that data cleaning should be done separately on the train and test sets (first, you split your data into train and test sets, and then start doing any data cleaning techniques).

I’m confused, because this is not the way that I have learned, and now I’m wondering if my current routine is wrong. I’m confused why you wouldn’t want to apply any preprocessing steps first, because then it seems like you are doing double work. I understand that scaling and normalization should be done after the train-test split, but I don’t understand why the preprocessing steps that I have learned in the course are applied separately.

Can someone provide some clarification please? Any suggestions and advice are extremely appreciated!

Thank you in advance!

CodePudding user response:

That's a good question.

I think the scikit-learn approach is to treat the test set as data that is not accessible during development and is only used to validate the performance of the model.

By splitting the data before preprocessing, you guarantee that none of the preprocessing steps have been able to capture information from the test set. Data leakage can occur when parametric transformers (StandardScaler, TfidfVectorizer, etc.) are fitted on test-set data.

Furthermore, relying on the entire dataset to perform the data cleaning could itself be considered data leakage, because we would be using information from the test set to make those cleaning decisions.

To avoid duplicate work, you can encapsulate the cleaning and preprocessing steps together with the model using a scikit-learn Pipeline. This has the advantage of packaging all of these steps and the model into a single object that can then be serialized and reused in production.
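A minimal sketch of that Pipeline idea, with made-up toy data: every step that learns statistics (imputation, scaling) is fitted on the training portion only, and the whole object can be pickled for production.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy data with missing values, purely for illustration
X = np.array([[1.0, 2.0], [np.nan, 3.0], [2.0, np.nan],
              [3.0, 1.0], [4.0, 2.5], [5.0, 0.5]])
y = np.array([0, 0, 0, 1, 1, 1])

# Split FIRST, so no preprocessing step ever sees the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=42)

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])

pipe.fit(X_train, y_train)        # all statistics come from X_train only
score = pipe.score(X_test, y_test)  # test set used purely for evaluation
print(score)
```

Calling `pipe.fit` fits the imputer, scaler, and model in sequence on the training data; `pipe.score` (or `pipe.predict`) then applies the already-fitted transformers to the test data automatically.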

CodePudding user response:

Let's consider a scenario where you want to fill in missing values. Generally the mean is taken as the fill value. Now, how do you calculate that mean: from the entire dataset, or only from the training part? If you take the entire dataset (because you haven't split it yet), then, as you can see, there is already some leakage.

Therefore, if you keep the two processes separate, there is no chance of unexpected leakage.
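A small sketch of the point above, with made-up numbers: the imputer learns its mean from the training rows only, so the extreme test value cannot influence the training data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0], [2.0], [np.nan], [3.0]])
X_test = np.array([[100.0], [np.nan]])

imputer = SimpleImputer(strategy="mean")
imputer.fit(X_train)               # mean = (1 + 2 + 3) / 3 = 2.0

X_train_filled = imputer.transform(X_train)  # NaN in train becomes 2.0
X_test_filled = imputer.transform(X_test)    # NaN in test also becomes 2.0

# Leaky version: fitting on train + test would give a mean of 26.5,
# silently letting the test value 100.0 shape the training data.
print(imputer.statistics_)
```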

CodePudding user response:

To add to @AntioneDubius's points, I will share from the perspective of what can and cannot safely be done together. Short answer: apply steps to the combined data only when they cannot result in data leakage.

These preprocessing steps can be done together on train/test sets:

  • data cleaning: stripping white spaces, removing unwanted characters
  • raw data computation: calculating age from date of birth, subtracting dates to get durations, etc
  • convert categorical variables: one-hot encoding, binary encoding, label encoding, get_dummies, etc

Splitting the data with train_test_split() is also an art. Be sure that every target class appears in both splits in the right ratio; I usually use the stratify=y parameter for this. See this post for more details.

Even in cross-validation, there are stratified variants: StratifiedKFold and StratifiedShuffleSplit.

This applies not only to target classes: each category of a categorical feature should also be represented in both the train and test sets. We do not want a situation where a category appears only in the test set and never in training.
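A quick sketch of stratified splitting on an invented imbalanced label vector: with stratify=y, the 80/20 class ratio is preserved exactly in both portions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([0] * 80 + [1] * 20)   # imbalanced: 80% class 0, 20% class 1
X = np.arange(100).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Both portions keep the original 20% minority-class ratio
print(y_train.mean(), y_test.mean())
```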

Take extra care when doing transformations like scaling and normalization. You should not call fit_transform() separately on the training and test sets. The right way is to fit on the training set, then transform both the training set and the test set with the fitted transformer. See this post for an example.
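That fit-on-train-only pattern, sketched with toy numbers: the scaler's mean and standard deviation come from the training set, and the same learned parameters are then applied to the test set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test = np.array([[10.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean/std from train
X_test_scaled = scaler.transform(X_test)        # reuses the train mean/std

# Wrong: scaler.fit_transform(X_test) would re-learn statistics from the
# test set, making its scaled values incomparable to the training set.
print(scaler.mean_)       # mean learned from the training data only
print(X_test_scaled)
```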

Edit: to answer @derFotik's question about using the mean to fill in missing data: my take is to impute missing data with the mean separately for the train and test sets, provided that:

  • missing data is minimal (I'm talking about small like <2%)
  • there is a sufficient amount of data; in statistics we learn that for large samples, both the train sample mean and the test sample mean will approximate the population mean