I'm using an XGBoost model to predict attacks, but I get 100% accuracy. I tried Random Forest as well and I get 100% there too. How can I handle this overfitting problem? The steps I followed are:

Data cleaning
Data splitting
Feature scaling
Feature selection

I even tried changing this order, but I still get the same result. Do you have any idea how to handle this? Thanks
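For reference, the order in code was roughly like the following sketch (placeholder names X and y for the cleaned features and labels; scikit-learn and k=10 selected features are only assumptions):

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif

# Data cleaning is assumed to have already produced features X and labels y
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)

# Feature scaling, fitted on the training split only
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Feature selection, again fitted on the training split only
selector = SelectKBest(f_classif, k=10).fit(X_train_s, y_train)
X_train_sel = selector.transform(X_train_s)
X_test_sel = selector.transform(X_test_s)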
CodePudding user response:
Overfitting occurs when your model becomes too complex for its task. Simply put, instead of learning general patterns in your data, the model memorizes every case it is shown in the training set.
To avoid this, choose a less complex model; in your case, reduce the depth of your trees. Split your data into separate train, validation and test sets, then train models of different complexities. When you evaluate them, you will notice that performance on the training set keeps increasing with complexity. Performance on the validation set will initially follow, until a point is reached where no further improvement on the validation set can be achieved; beyond this point it will likely decrease, because you are starting to overfit.
Use these results to pick a suitable model, then evaluate the model you decided on with the test set you have kept aside until now.
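As a minimal sketch of that procedure (X and y are placeholder names for the cleaned data, scikit-learn and the xgboost sklearn wrapper are assumed, and the depth values are only examples):

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

# Hold out a validation set for model selection and a test set for the final check
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42)

# Train models of increasing complexity and compare train vs. validation accuracy
for depth in (1, 2, 3, 5, 8):
    clf = XGBClassifier(max_depth=depth, n_estimators=100, learning_rate=0.1)
    clf.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, clf.predict(X_train))
    val_acc = accuracy_score(y_val, clf.predict(X_val))
    print(f"max_depth={depth}: train={train_acc:.3f} val={val_acc:.3f}")

# Pick the depth where validation accuracy stops improving, then score that
# single chosen model once on X_test / y_test

The point is to choose the complexity from the validation curve, not from the training accuracy, and to touch the test set only once at the very end.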
CodePudding user response:
Thank you for the clarification. I solved the problem by tuning the hyperparameters eta and max_depth:
import xgboost as xgb

# D_train and D_test are xgb.DMatrix objects built from the train/test split
param = {
    'eta': 0.1,                     # learning rate (shrinkage)
    'max_depth': 1,                 # shallow trees to limit overfitting
    'objective': 'multi:softprob',
    'num_class': 3}

steps = 20  # The number of training iterations (boosting rounds)

model = xgb.train(param, D_train, steps)
preds = model.predict(D_test)
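For completeness, D_train and D_test above are xgboost DMatrix objects; a minimal sketch of how they might be built from the split arrays (X_train, y_train, X_test, y_test are assumed names from the earlier split):

D_train = xgb.DMatrix(X_train, label=y_train)
D_test = xgb.DMatrix(X_test, label=y_test)

Note that with 'multi:softprob' the predictions are per-class probabilities, so the predicted label is the argmax over the class axis, e.g. np.argmax(preds, axis=1).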