How to properly use Smote in Classification models

Time:05-25

I am using SMOTE to balance the output (y) for model training only, but I want to test the model on the original data, since it doesn't make sense to test a model against SMOTE-created outputs. Please ask anything for clarification if I haven't explained it well. This is my first question on Stack Overflow.

from imblearn.over_sampling import SMOTE

# Oversampling the full dataset before splitting
oversample = SMOTE()
X_sm, y_sm = oversample.fit_resample(X, y)

# Splitting Dataset into Train and Test (Smote)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_sm, y_sm, test_size=0.2, random_state=42)

Here I applied the Random Forest classifier to my data:

from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

RF = RandomForestClassifier(n_estimators=10)
RF.fit(X_train, y_train.values.ravel())

If I predict like this, X still contains the rows that were used for training. How can I exclude the rows the model has already been trained on?

y_pred = RF.predict(X)
print(metrics.classification_report(y,y_pred))

CodePudding user response:

I have used SMOTE in the past and found it suboptimal; researchers have since demonstrated flaws in the distribution generated by the Synthetic Minority Oversampling Technique (SMOTE). Sometimes we have no choice about unbalanced classes, but you can use sklearn.ensemble.RandomForestClassifier and set an appropriate class_weight to handle the unbalanced-class problem.
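As a minimal sketch of that idea, the snippet below uses a synthetic imbalanced dataset from make_classification (an assumption standing in for the asker's data) and passes class_weight="balanced", which reweights each class inversely to its frequency instead of generating synthetic samples:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Toy 90/10 imbalanced dataset (illustrative stand-in for the real data)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# class_weight="balanced" scales sample weights by n_samples / (n_classes * class count)
RF = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                            random_state=42)
RF.fit(X_train, y_train)
print(classification_report(y_test, RF.predict(X_test)))
```

No resampling step is needed here, so the test set never contains synthetic rows.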

See the scikit-learn documentation for RandomForestClassifier and its class_weight parameter.

CodePudding user response:

I agree with razimbres about using class_weight. Another option would be to split the dataset into train and test sets first, then keep the test set aside and apply SMOTE only to the training set:

X_sm, y_sm = oversample.fit_resample(X_train, y_train)
.
.
.