How can we include a prediction column in the initial dataset/dataframe after performing K-Fold cross-validation?

I would like to run K-fold cross-validation on my data using a classifier. What I would like to produce, however, is a prediction column, or predict_proba columns, for each sample directly in the initial dataset/dataframe. Any ideas?

from sklearn.metrics import accuracy_score, roc_auc_score
import pandas as pd
from sklearn.model_selection import KFold

k = 5
kf = KFold(n_splits=k, random_state=None)

acc_score = []
auroc_score = []

for train_index , test_index in kf.split(X):
    X_train , X_test = X.iloc[train_index,:],X.iloc[test_index,:]
    y_train , y_test = y[train_index] , y[test_index]

    model.fit(X_train, y_train)
    pred_values = model.predict(X_test)
    # Pass the DataFrame itself so the input type matches what the model was fitted on
    predict_prob = model.predict_proba(X_test)[:,1]

    auroc = roc_auc_score(y_test, predict_prob)
    acc = accuracy_score(pred_values , y_test)

    auroc_score.append(auroc)
    acc_score.append(acc)

avg_acc_score = sum(acc_score)/k
print('accuracy of each fold - {}'.format(acc_score))
print('Avg accuracy : {}'.format(avg_acc_score))
print('AUROC of each fold - {}'.format(auroc_score))
print('Avg AUROC : {}'.format(sum(auroc_score)/k))

Given this code, how could I add a prediction column or, even better, the predicted-probability columns for each sample to the initial dataset?

EDIT 1.0:

Here is an example. Say I have a dataset with 4 feature columns and one column holding the target (i.e., my class labels). I am running K-fold cross-validation on it with a scikit-learn model. What I want at the end is the initial dataset (the 4 features plus the class column) together with a prediction column holding the cross-validation prediction for each sample. From there I will easily be able to add further output columns, such as the probability of class "1" and the probability of class "0" for every sample.

K=10:

In 10-fold cross-validation, each example (sample) is used exactly once in a test set and 9 times in a training set. So, after 10-fold cross-validation, the result should be a dataframe that holds the predicted class for ALL examples in the dataset. Each example will have its 4 initial features, its labelled class, and the class predicted in the cross-validation fold where that example was in the test set.
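
The property described above can be checked directly. The following is a minimal sketch, assuming a toy array of 20 samples with 4 features; the array and the variable names are illustrative and not part of the question's data.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 20 samples, 4 features (illustrative only).
X_demo = np.arange(80).reshape(20, 4)

kf_demo = KFold(n_splits=10)

# Count how often each sample lands in a test fold and in a training fold.
test_counts = np.zeros(len(X_demo), dtype=int)
train_counts = np.zeros(len(X_demo), dtype=int)
for train_idx, test_idx in kf_demo.split(X_demo):
    test_counts[test_idx] += 1
    train_counts[train_idx] += 1

print(test_counts)   # every entry is 1
print(train_counts)  # every entry is 9
```

Because the test folds partition the data, writing each fold's predictions at its test positions fills the prediction column completely, with no sample written twice.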

CodePudding user response：

You can use the .loc indexer to accomplish this. This question has a nice answer that shows how to use it: df.loc[index_position, "column_name"] = some_value
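
As a small illustration of that assignment pattern, here is a sketch on a toy dataframe (the frames and column names are made up for the example). Note that .loc selects by index label, so when the dataframe does not have the default 0..n-1 index, positions such as those returned by KFold.split should first be translated to labels, for example with df.index[positions].

```python
import pandas as pd

# Default integer index: positions and labels coincide.
df_demo = pd.DataFrame({"feature": [0.1, 0.2, 0.3, 0.4]})
df_demo["Prediction"] = -1                  # placeholder value
df_demo.loc[[1, 3], "Prediction"] = [0, 1]  # fill rows 1 and 3

# Non-default index: translate positions to labels before using .loc.
df_named = pd.DataFrame({"feature": [0.1, 0.2, 0.3, 0.4]},
                        index=["a", "b", "c", "d"])
df_named["Prediction"] = -1
positions = [1, 3]
df_named.loc[df_named.index[positions], "Prediction"] = [0, 1]

print(df_demo)
print(df_named)
```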

So, here is an edited version of the code you posted (I added a dataset and a model so that it runs, and removed the AUROC computation since, per your edit, we aren't using probabilities):

from sklearn.metrics import accuracy_score
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.datasets import load_breast_cancer
from sklearn.neural_network import MLPClassifier

X,y = load_breast_cancer(return_X_y=True, as_frame=True)
model = MLPClassifier()

k = 5
kf = KFold(n_splits=k, random_state=None)

acc_score = []

# Create the prediction column with a placeholder value;
# every row is overwritten inside the loop below
X['Prediction'] = 1

# Define what values to use for the model
model_columns = [x for x in X.columns if x != 'Prediction']

for train_index , test_index in kf.split(X):
    X_train , X_test = X.iloc[train_index,:],X.iloc[test_index,:]
    y_train , y_test = y[train_index] , y[test_index]

    model.fit(X_train[model_columns], y_train)
    pred_values = model.predict(X_test[model_columns])

    acc = accuracy_score(pred_values , y_test)
    acc_score.append(acc)

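    # Add values to the dataframe (test_index holds row positions, which
    # match the .loc labels here because X has the default integer index)
    X.loc[test_index, 'Prediction'] = pred_values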

avg_acc_score = sum(acc_score)/k
print('accuracy of each fold - {}'.format(acc_score))
print('Avg accuracy : {}'.format(avg_acc_score))

# Add label back per question
X['Label'] = y

# Print first 5 rows to show that it works
print(X.head(n=5))

Yields

accuracy of each fold - [0.9210526315789473, 0.9122807017543859, 0.9736842105263158, 0.9649122807017544, 0.8672566371681416]
Avg accuracy : 0.927837292345909
   mean radius  mean texture  ...  Prediction  Label
0        17.99         10.38  ...           0      0
1        20.57         17.77  ...           0      0
2        19.69         21.25  ...           0      0
3        11.42         20.38  ...           1      0
4        20.29         14.34  ...           0      0

[5 rows x 32 columns]

(Obviously, the model, the dataset and the placeholder value are all arbitrary.)
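
The same loop can also collect the predicted probabilities that the question asks about. This is a sketch under assumptions, not the answer's original code: it swaps in a scaled LogisticRegression for the MLPClassifier, and the names oof_proba, result, Proba_0 and Proba_1 are invented for the example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
# Assumption: any classifier exposing predict_proba works here.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

kf = KFold(n_splits=5)

# One row per sample, one column per class; each row is filled
# by the single fold in which that sample is in the test set.
oof_proba = np.zeros((len(X), 2))
for train_index, test_index in kf.split(X):
    clf.fit(X.iloc[train_index], y.iloc[train_index])
    oof_proba[test_index] = clf.predict_proba(X.iloc[test_index])

# Attach the label and the out-of-fold outputs to a copy of the features.
result = X.copy()
result["Label"] = y
result["Proba_0"] = oof_proba[:, 0]
result["Proba_1"] = oof_proba[:, 1]
# Columns of predict_proba follow clf.classes_, so argmax maps back to a class.
result["Prediction"] = clf.classes_[oof_proba.argmax(axis=1)]

print(result[["Label", "Proba_0", "Proba_1", "Prediction"]].head())
```

Filling a separate array by position and attaching it afterwards avoids having to exclude the output columns from the model's inputs on each iteration.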

CodePudding user response：

You can use cross_val_predict (see its documentation page); for each sample, it returns the prediction made in the fold where that sample was in the test set:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict, KFold
from sklearn.linear_model import LogisticRegression
import pandas as pd

X,y = make_classification()
df = pd.DataFrame(X,columns = ["feature{:02d}".format(i) for i in range(X.shape[1])])
df['label'] = y

df['pred'] = cross_val_predict(LogisticRegression(), X, y, cv=KFold(5))
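
For the probability columns mentioned in the question, cross_val_predict also accepts a method argument. A minimal sketch, assuming the same synthetic data as above (random_state=0 and the proba_0/proba_1 column names are additions for the example):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_predict

X, y = make_classification(random_state=0)
df = pd.DataFrame(X, columns=["feature{:02d}".format(i) for i in range(X.shape[1])])
df["label"] = y

# method="predict_proba" returns one column per class, in sorted class order.
proba = cross_val_predict(LogisticRegression(), X, y,
                          cv=KFold(5), method="predict_proba")
df["proba_0"] = proba[:, 0]
df["proba_1"] = proba[:, 1]

print(df[["label", "proba_0", "proba_1"]].head())
```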