Dataset: train.csv
Approach
I have four classes to predict and they are heavily imbalanced, so I tried SMOTE with a feed-forward network. However, training on the SMOTE-resampled data gives much worse results on the test data than training on the original dataset.
Model architecture
import tensorflow as tf
from tensorflow.keras.layers import Dense, BatchNormalization

# Feed-forward classifier: 7 input features -> 4-class softmax output
model = tf.keras.Sequential()
model.add(Dense(512, activation='relu', input_shape=(7,)))
model.add(BatchNormalization())
model.add(Dense(256, activation='relu'))
model.add(BatchNormalization())
model.add(Dense(128, activation='relu'))
model.add(BatchNormalization())
model.add(Dense(64, activation='relu'))
model.add(Dense(4, activation='softmax'))

# Stop when validation loss plateaus and roll back to the best weights
earlystopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=40,
    mode="auto",
    restore_best_weights=True,
)

# categorical_crossentropy expects one-hot encoded labels
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
So how should I approach this problem and increase the F1-score on the test dataset? Any help is appreciated.
CodePudding user response:
Below is an explanation of what I believe is the best approach for your case.
SMOTE
- SMOTE balances the data by oversampling the minority classes: with a distribution like Class A at 15,000 records and Class B at 200, it brings Class B up to 15,000 as well. The new points are not exact duplicates but synthetic samples interpolated between existing Class B records and their nearest neighbors.
- Generating 15,000 samples from only 200 real records adds almost no new information; the synthetic points crowd the feature space with near-copies of the originals, which can make it much harder for the model to learn a genuine boundary between classes. A minimal example follows this list.
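For concreteness, here is a minimal sketch of how SMOTE is typically applied with the imbalanced-learn package. The synthetic dataset below is only a stand-in for your train.csv, and SMOTE should only ever be fit on the training split, never the test data.

from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic stand-in for an imbalanced 4-class dataset (replace with your own data)
X, y = make_classification(n_samples=2000, n_features=7, n_informative=5,
                           n_classes=4, weights=[0.7, 0.15, 0.1, 0.05],
                           random_state=42)

sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X, y)

# Every minority class is upsampled to the majority class count
print(Counter(y), '->', Counter(y_res))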
Possible Solutions
- Instead of SMOTE, I would recommend a stratified train/test split, so that every class appears in the same proportion in both sets, and then building the model on top of that (see the first sketch after this list).
- Passing class weights is another strong option, and it is supported by almost every ML library. In Keras you can supply a class_weight dictionary to model.fit (see the second sketch after this list); it could be very helpful in your case.
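A minimal sketch of a stratified split with scikit-learn, assuming a hypothetical target column named 'label' in train.csv:

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('train.csv')
X = df.drop(columns=['label']).values   # 'label' is a hypothetical column name
y = df['label'].values

# stratify=y keeps every class at the same proportion in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)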
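And a sketch of class weights in Keras: scikit-learn's compute_class_weight with 'balanced' weights each class inversely to its frequency, and the resulting dictionary is passed to model.fit. This assumes integer labels y_train (0-3) from the split above, plus the model and earlystopping objects from your code.

import numpy as np
import tensorflow as tf
from sklearn.utils.class_weight import compute_class_weight

# 'balanced' weights each class by n_samples / (n_classes * class_count)
classes = np.unique(y_train)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_train)
class_weight = {int(c): w for c, w in zip(classes, weights)}

# Keras scales each sample's contribution to the loss by its class weight
model.fit(X_train, tf.keras.utils.to_categorical(y_train, num_classes=4),
          validation_split=0.2, epochs=200,
          callbacks=[earlystopping], class_weight=class_weight)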