What's wrong with these seemingly perfect ML models?


I wanted to find an optimal model for the assigned classification problem. Everything went smoothly until I applied the pd.get_dummies() function to preprocess the data, after which the experiment showed an impossibly perfect result. I know this is unlikely to happen, but I do not know why. Any help would be highly appreciated.

The code for preprocessing the data is below:

import pandas as pd

# Encode the booking status as the binary target
status_dict = {'Not_Canceled': 1, 'Canceled': 0}
df.booking_status = df.booking_status.map(status_dict)

# Drop the identifier column and rows with missing values
df.drop('Booking_ID', axis=1, inplace=True)
df = df.dropna()

# One-hot encode the remaining categorical columns
df = pd.get_dummies(df)

# Standardizing Data
from sklearn.preprocessing import StandardScaler
import numpy as np

# Take the last column as the target and the rest as features
X = df.iloc[:, 0:-1]
y = df.iloc[:, -1]

scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)
np.set_printoptions(precision=3)
print(rescaledX[0:5, :])

Then I split my data into training and test sets with a test proportion of 0.3:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(rescaledX, y, test_size=0.3, random_state=15)

I used several models and they all gave this amazingly perfect result: [results screenshot omitted]

Simple code, stupid me. By the way, I'm just a beginner in the ML field. Any advice on how to master it?

CodePudding user response:

This is caused by data leakage. You must split your data first, before any pre-processing step that is fitted on the data. For example:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=15)

Then fit the scaler on the training data only and apply it to the training and test sets separately:

scaler = StandardScaler().fit(X_train)
rescaledX_train = scaler.transform(X_train)
rescaledX_test = scaler.transform(X_test)

You could also use a Pipeline to avoid data leakage:

# correct data preparation for model evaluation with k-fold cross-validation
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
# define the pipeline
steps = list()
steps.append(('scaler', MinMaxScaler()))
steps.append(('model', LogisticRegression()))
pipeline = Pipeline(steps=steps)
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model using cross-validation
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores)*100, std(scores)*100))
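
Because the scaler lives inside the pipeline, cross_val_score re-fits it on the training folds of each split, so the statistics of the held-out fold never leak into what the model sees during training.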

Ref: https://machinelearningmastery.com/data-preparation-without-data-leakage/

CodePudding user response:

A perfect score like that can have several causes:

One-hot encoding and standardization applied before the split: pd.get_dummies() derives the dummy columns from the whole dataset, and StandardScaler() computes its mean and standard deviation from the whole dataset, so the test rows influence the pre-processing the model is trained with, which is a form of data leakage. (If the data is already standardized, the scaler itself will not change much, but the leakage issue remains.)

Overfitting: if the dataset is too small or too simple for the classification problem, the model can memorize the training data and reach 100% accuracy on the training set while failing to generalize to new, unseen data. To check for overfitting, split your data into training and test sets and evaluate on the test set; good training performance combined with poor test performance is the telltale sign.

Class imbalance: if the classes are unbalanced, the model can achieve high accuracy simply by always predicting the majority class. Check the class distribution with value_counts() on y, and if it is skewed, use oversampling or undersampling to balance the classes (see the sketch below).

In any case, always validate performance on unseen data, using a train-test split or cross-validation, to make sure the model is generalizing well.
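
As a minimal sketch of those two checks (reusing y, y_train/y_test and the rescaledX_train/rescaledX_test arrays from the previous answer, with LogisticRegression standing in for whichever classifier you actually used):

# Check the class balance of the target; a heavily skewed split can inflate accuracy
print(y.value_counts(normalize=True))

# Evaluate on the held-out test set rather than the training data
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

model = LogisticRegression(max_iter=1000)  # stand-in for your own model
model.fit(rescaledX_train, y_train)
print(classification_report(y_test, model.predict(rescaledX_test)))

If value_counts() reveals a strong imbalance, oversampling or undersampling the training split (for example with RandomOverSampler from the imbalanced-learn package) is a common way to rebalance it before fitting.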

Hope this is clear and helpful.
