I am working on a binary classification problem using a random forest. My dataset is imbalanced with a 77:23 class ratio, and its shape is (977, 7).
I initially tried the following:
model = RandomForestClassifier(class_weight='balanced', max_depth=5, max_features='sqrt', n_estimators=300, random_state=24)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
However, I now want to apply cross-validation during random forest training and then use that model to predict the y values for the test data. So, I did the following:
model = RandomForestClassifier(class_weight='balanced', max_depth=5, max_features='sqrt', n_estimators=300, random_state=24)
scores = cross_val_score(model, X_train, y_train, cv=10, scoring='f1')
y_pred = cross_val_predict(model, X_test, cv=10)
As you can see, this is incorrect. How can I apply cross-validation while training the random forest, and then use that cross-validated model to predict the y values for the test data correctly?
CodePudding user response:
The purpose of cross-validation is model checking, not model building.
Once you have checked via cross-validation that you obtain similar metrics for every split, train your final model on all of your training data.
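In code, that workflow might look like the following sketch, reusing the estimator and hyperparameters from the question (X_train, y_train, X_test are assumed to be defined as in your setup):
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

model = RandomForestClassifier(class_weight='balanced', max_depth=5, max_features='sqrt', n_estimators=300, random_state=24)

# Model checking: estimate generalization across 10 splits of the training data.
scores = cross_val_score(model, X_train, y_train, cv=10, scoring='f1')
print(scores.mean(), scores.std())

# Model building: refit on all the training data, then predict the held-out test set.
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
Note that when you pass an integer cv with a classifier, cross_val_score uses stratified folds, which is what you want with a 77:23 class ratio.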
CodePudding user response:
You can't use cross_val_score or cross_val_predict to get a fitted model back after cross-validation. Instead, you can use the code block below to compute the F1 score at each fold, both on the held-out validation fold and on your test data.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold
from sklearn.metrics import f1_score

k = 10
# random_state only takes effect when shuffle=True; without it, newer
# scikit-learn versions raise a ValueError.
kf_10 = KFold(n_splits=k, shuffle=True, random_state=24)
model_rfc = RandomForestClassifier(class_weight='balanced', max_depth=5, max_features='sqrt', n_estimators=300, random_state=24)

rfc_f1_CV_list = []
rfc_f1_test_list = []

# Assumes X_train and y_train are NumPy arrays; use .iloc for pandas objects.
for train_index, val_index in kf_10.split(X_train):
    X_train_CV, X_val_CV = X_train[train_index], X_train[val_index]
    y_train_CV, y_val_CV = y_train[train_index], y_train[val_index]
    model_rfc.fit(X_train_CV, y_train_CV)

    # Target prediction & F1 score on the fold held out of training.
    y_pred_CV = model_rfc.predict(X_val_CV)
    rfc_f1_CV_list.append(f1_score(y_val_CV, y_pred_CV))

    # Target prediction & F1 score on the rows from your test split.
    y_pred_test = model_rfc.predict(X_test)
    rfc_f1_test_list.append(f1_score(y_test, y_pred_test))
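To summarize the per-fold results, you could add something like the following after the loop (an illustrative addition using NumPy):
import numpy as np
print('CV F1: %.3f +/- %.3f' % (np.mean(rfc_f1_CV_list), np.std(rfc_f1_CV_list)))
print('Test F1: %.3f +/- %.3f' % (np.mean(rfc_f1_test_list), np.std(rfc_f1_test_list)))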
You can modify the above code to save the fitted model from a chosen fold, for example the one with the highest validation F1, and then use that saved model (call it best_model) on your test split:
y_pred = best_model.predict(X_test)
f1_score(y_test, y_pred, average='binary')
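Here is a minimal sketch of how such a best_model could be tracked inside the loop above; best_f1 and best_model are illustrative names, not part of the original code, and X_train/y_train are again assumed to be NumPy arrays:
from copy import deepcopy

best_f1, best_model = -1.0, None
for train_index, val_index in kf_10.split(X_train):
    model_rfc.fit(X_train[train_index], y_train[train_index])
    fold_f1 = f1_score(y_train[val_index], model_rfc.predict(X_train[val_index]))
    if fold_f1 > best_f1:
        # Snapshot the fitted estimator for the best-scoring fold.
        best_f1, best_model = fold_f1, deepcopy(model_rfc)
That said, as the first answer points out, the more common practice is to refit a final model on all of your training data rather than reuse the model from a single fold.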