How to do cross-validation on random forest?

I am working on a binary classification using random forest. My dataset is imbalanced with a 77:23 ratio, and its shape is (977, 7).

I initially tried the following:

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(class_weight='balanced', max_depth=5,
                               max_features='sqrt', n_estimators=300,
                               random_state=24)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

However, now I want to apply cross-validation during my random forest training and then use that model to predict the y values for the test data. So, I did the below:

model = RandomForestClassifier(class_weight='balanced',max_depth=5,max_features='sqrt',n_estimators=300,random_state=24)
scores = cross_val_score(model,X_train, y_train,cv=10, scoring='f1')
y_pred = cross_val_predict(model,X_test,cv=10)

As you can see, this is incorrect. How can I apply cross-validation while training the random forest and then use that cross-validated model to predict y_pred correctly?

CodePudding user response:

The purpose of cross-validation is model checking, not model building.

Once cross-validation has confirmed that you obtain similar metrics on every split, retrain your model on all of your training data and use that final model to predict the test set.
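A minimal sketch of that workflow, assuming the X_train/y_train/X_test splits from the question (StratifiedKFold is my addition here, since it preserves the 77:23 class ratio in every fold):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

model = RandomForestClassifier(class_weight='balanced', max_depth=5,
                               max_features='sqrt', n_estimators=300,
                               random_state=24)

# 1) Model checking: estimate out-of-sample F1 with CV on the training data only.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=24)
scores = cross_val_score(model, X_train, y_train, cv=cv, scoring='f1')
print('CV F1: %.3f +/- %.3f' % (scores.mean(), scores.std()))

# 2) Model building: refit on ALL the training data, then predict the test set once.
model.fit(X_train, y_train)
y_pred = model.predict(X_test)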

CodePudding user response:

Neither cross_val_score nor cross_val_predict gives you back a fitted model after cross-validation. Instead, you can use the code block below to calculate the F1 score at each fold, both on the held-out validation fold and on your test data.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold
from sklearn.metrics import f1_score

k = 10
kf_10 = KFold(n_splits=k, shuffle=True, random_state=24)  # shuffle must be True when random_state is set
model_rfc = RandomForestClassifier(class_weight='balanced', max_depth=5,
                                   max_features='sqrt', n_estimators=300,
                                   random_state=24)
rfc_f1_CV_list = []
rfc_f1_test_list = []

# Split the training data (not the full X/y) so the test split stays untouched.
# Indexing below assumes numpy arrays; use .iloc[...] for pandas objects.
for train_index, val_index in kf_10.split(X_train):
    X_train_CV, X_val_CV = X_train[train_index], X_train[val_index]
    y_train_CV, y_val_CV = y_train[train_index], y_train[val_index]
    model_rfc.fit(X_train_CV, y_train_CV)

    # Target prediction & F1 score on the fold held out of training.
    y_pred_CV = model_rfc.predict(X_val_CV)
    rfc_f1_CV = f1_score(y_val_CV, y_pred_CV)
    rfc_f1_CV_list.append(rfc_f1_CV)

    # Target prediction & F1 score on the rows from your test split.
    y_pred_test = model_rfc.predict(X_test)
    rfc_f1_test = f1_score(y_test, y_pred_test)
    rfc_f1_test_list.append(rfc_f1_test)
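After the loop you can summarize the two lists, for example with numpy (which sklearn already depends on):

import numpy as np

print('Mean F1 on the CV folds: %.3f +/- %.3f' % (np.mean(rfc_f1_CV_list), np.std(rfc_f1_CV_list)))
print('Mean F1 on the test set: %.3f +/- %.3f' % (np.mean(rfc_f1_test_list), np.std(rfc_f1_test_list)))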

You can also modify the above code to save the model fitted at a given fold (see the sketch after the next snippet). Alternatively, cross_val_predict gives you cross-validated predictions that you can score in one go:

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import f1_score

# cross_val_predict returns the cross-validated predictions; score them separately.
y_pred = cross_val_predict(model, X_test, y_test, cv=10)
f1_score(y_test, y_pred, average='binary')
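If you do want to keep a fitted model from the CV loop above, one sketch is to snapshot the estimator at the best-scoring fold (the best_f1/best_model names are illustrative):

import copy
from sklearn.metrics import f1_score

best_f1, best_model = -1.0, None

# Inside the CV loop, right after computing the fold's F1 score:
#     if rfc_f1_CV > best_f1:
#         best_f1 = rfc_f1_CV
#         best_model = copy.deepcopy(model_rfc)  # snapshot the fitted estimator

# After the loop, reuse the saved model on the test split:
y_pred = best_model.predict(X_test)
f1_score(y_test, y_pred, average='binary')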