Steps of multiclass classification problem [closed]


This question is more theoretical than practical. I have a dataframe with four classes of car body types (e.g. sedan, hatchback, etc.) and various features (doors, seats, maximum speed, etc.). The goal is to select features based on the results of Pearson correlation, Chi-squared, RFE, logistic regression, and XGBoost, and then build a predictive model. I have already label-encoded the body-type classes as 0, 1, 2, 3. Then I noticed that the data is imbalanced (if I'm interpreting the histogram correctly). The question is: should I fix the imbalance after feature selection and before building the model, or does the imbalance invalidate the feature selection, meaning I must go one step back, balance the data, then perform selection, and only then build the model?

UPD: the class distribution (fraction of samples per class) is below:

1    0.512228
2    0.282609
0    0.118207
3    0.086957
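A distribution like the one above can be checked with the standard library alone; the label counts below are hypothetical, chosen to roughly match the reported fractions:

```python
from collections import Counter

# Hypothetical label column; counts chosen so class 1 holds just
# over half the samples, as in the distribution above.
labels = [1] * 471 + [2] * 260 + [0] * 109 + [3] * 80

counts = Counter(labels)
total = len(labels)
for cls, c in counts.most_common():
    print(cls, round(c / total, 6))
```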

CodePudding user response:

There are a few ways to handle data imbalance. Let me outline the approaches below:

  1. Decompose the multi-class problem into a hierarchy of binary classification problems. E.g., if your data has 10 samples of class A, 20 samples of class B, and 50 samples of class C, first train a binary classifier that separates class C from the combined classes A and B. Then repeat the step one level down: train a second binary classifier that separates class A from class B. Each binary sub-problem is less imbalanced than the original three-way split.
  2. Data augmentation - is there a way you can generate additional samples for the minority classes? E.g., if you are dealing with image data, flipping, random cropping, and rotation can augment the classes with fewer samples.
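For tabular data like the car dataset in the question, image-style augmentation does not apply, but random oversampling of the minority classes serves the same purpose. A minimal standard-library sketch (the helper name and toy data are illustrative, not from a library):

```python
import random
from collections import Counter

def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class rows at random until every class
    matches the size of the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for row, y in zip(samples, labels):
        by_class.setdefault(y, []).append(row)
    target = max(len(rows) for rows in by_class.values())
    out_rows, out_labels = [], []
    for y, rows in by_class.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for row in rows + extra:
            out_rows.append(row)
            out_labels.append(y)
    return out_rows, out_labels

# Toy imbalanced data: class 1 dominates, as in the question.
X = [[i] for i in range(10)]
y = [1] * 6 + [2] * 2 + [0] * 1 + [3] * 1
Xb, yb = random_oversample(X, y)
print(Counter(yb))  # every class now has 6 samples
```

Note that oversampling should be applied to the training split only, never before the train/test split, or the evaluation will leak duplicated rows.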

When dealing with imbalanced data, feature selection performed before balancing is unreliable, because the model will be skewed towards the classes with more samples. E.g., if your data has 9 samples of class A and 1 sample of class B, the model will learn to classify everything as class A: it sees little incentive to get class B right, since predicting A for everything still achieves 90% accuracy. Another way to fix the imbalance is weighted penalization: every time the model misclassifies a sample from a minority class, you penalize it more heavily than you would a misclassification of a majority-class sample.
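The weighted-penalization idea can be sketched as follows. The `balanced_class_weights` helper is illustrative, implementing the common n_samples / (n_classes * n_class_samples) heuristic (the same formula scikit-learn uses for `class_weight='balanced'`):

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency:
    n_samples / (n_classes * n_class_samples)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

# 9 samples of class 'A' vs 1 of class 'B', as in the example above:
# a mistake on the single 'B' sample costs 9x more than one on 'A'.
y = ['A'] * 9 + ['B']
weights = balanced_class_weights(y)
print(weights)  # {'A': 0.555..., 'B': 5.0}
```

These weights would then be passed to the loss function (e.g. via `class_weight` in scikit-learn estimators or `scale_pos_weight`/sample weights in XGBoost) so the optimizer pays proportionally more attention to minority-class errors.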

CodePudding user response:

First, do the data preparation and normalize the data. Check whether the dataset is imbalanced; if it is, address the imbalance with sampling techniques and use stratified k-fold cross-validation so each fold preserves the class proportions. After that, perform feature selection and build the model. This order tends to give better accuracy and, overall, a more robust model.
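The stratified k-fold split suggested above can be sketched with the standard library alone (the helper below is illustrative; in practice `sklearn.model_selection.StratifiedKFold` does this):

```python
import random
from collections import defaultdict

def stratified_kfold_indices(labels, k=5, seed=0):
    """Split sample indices into k folds so that each fold roughly
    preserves the overall class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)           # randomize within each class
        for j, i in enumerate(idxs):
            folds[j % k].append(i)  # deal indices round-robin
    return folds

# Imbalanced toy labels mimicking the question's four classes.
y = [1] * 50 + [2] * 30 + [0] * 12 + [3] * 8
folds = stratified_kfold_indices(y, k=5)
for f in folds:
    print(sorted(y[i] for i in f))  # each fold keeps roughly the overall mix
```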
