roc_auc score not working with cross_validate (data is numerical, not categorical)

I am trying to iterate through three different scoring methods with the cross_validate function on my iris dataset: 'neg_log_loss', 'accuracy', 'roc_auc'.

I am starting out with this and then I want to put it in a for loop that iterates over the various classifiers AND their preprocessing methods.

datasets = {
    "Unprocessed": (x_train, x_test, y_train, y_test),
    "Standardisation": (x1_train, x1_test, y_train, y_test),
    "Normalisation": (x2_train, x2_test, y_train, y_test),
    "Rescale": (x3_train, x3_test, y_train, y_test),
}

models = {
    "Logistic Regression": LogisticRegression(),
    "Support Vector Machine": SVC(probability=True),
    "Decision Tree": DecisionTreeClassifier(max_leaf_nodes=3),
    "Random Forest": RandomForestClassifier(max_depth=3),
    "LinearDiscriminant": LinearDiscriminantAnalysis(),
    "K-Nearest Neighbour": KNeighborsClassifier(n_neighbors=3),
    "Naive Bayes": GaussianNB(),
    "XGBoost": XGBClassifier(),
}

scores_ = ['accuracy','neg_log_loss', 'roc_auc']
cv = RepeatedStratifiedKFold(n_splits=10, random_state=seed)
scores = cross_validate(model, X, y, scoring='roc_auc', cv=cv, error_score="raise")

For some reason I am getting nan values for roc_auc even though the data is all numerical. I tried this:

from sklearn.model_selection import cross_validate
from sklearn.utils import shuffle
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
y = encoder.fit_transform(pd.DataFrame(iris.target)).toarray()
X_s, y_s = shuffle(X, y)
scores = cross_validate(model, X_s, y_s, cv=cv, scoring="roc_auc")

However, I get:

ValueError: Supported target types are: ('binary', 'multiclass'). Got 'multilabel-indicator' instead.

Why could this be?

CodePudding user response:

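The NaN values come from the scorer, not from the data. The built-in 'roc_auc' scorer is meant for binary targets. For a multiclass target such as iris, roc_auc_score needs the class probabilities from predict_proba rather than the labels from predict, and it must be told how to combine the classes through its multi_class argument ('ovr' or 'ovo'). One-hot encoding the target does not help: it turns y into a 'multilabel-indicator' array, which RepeatedStratifiedKFold cannot stratify on, hence the ValueError above. Keep y as the original 1-D array of class labels.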
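To see where the ValueError in the question comes from, it can help to check how scikit-learn classifies the two versions of the target. The following is a minimal sketch, assuming iris is loaded with load_iris; the variable names (y_onehot, label_kind, onehot_kind, split_error) and the dummy feature matrix are only illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import OneHotEncoder
from sklearn.utils.multiclass import type_of_target

# Original 1-D integer labels (0, 1, 2) for the three iris classes.
y = load_iris().target

# One-hot encoded version, as in the question's second attempt.
y_onehot = OneHotEncoder().fit_transform(y.reshape(-1, 1)).toarray()

label_kind = type_of_target(y)          # how sklearn sees the 1-D labels
onehot_kind = type_of_target(y_onehot)  # how sklearn sees the one-hot matrix
print(label_kind, onehot_kind)

# Stratified splitters only accept binary or multiclass targets,
# so splitting on the one-hot matrix raises a ValueError.
splitter = StratifiedKFold(n_splits=5)
dummy_X = np.zeros((len(y), 1))  # the features do not matter for this check
try:
    next(splitter.split(dummy_X, y_onehot))
    split_error = None
except ValueError as exc:
    split_error = str(exc)
print(split_error)
```

The 1-D labels are what both the stratified splitter and the multiclass ROC AUC computation expect, so the one-hot step can simply be dropped.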

Use a new scorer:

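from sklearn.metrics import roc_auc_score, make_scorer

# One-vs-rest ROC AUC computed from predict_proba output.
# Newer scikit-learn releases deprecate needs_proba=True in favour of response_method="predict_proba".
multi_roc_scorer = make_scorer(lambda y_in, y_p_in: roc_auc_score(y_in, y_p_in, multi_class='ovr'), needs_proba=True)
# Pass the unshuffled X with the original 1-D class labels (iris.target), not the one-hot encoded y_s.
scores = cross_validate(model, X, iris.target, scoring=multi_roc_scorer, cv=cv, error_score="raise")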
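As an alternative to a custom scorer, scikit-learn also ships the scoring string "roc_auc_ovr", which can be passed to cross_validate together with the other metrics. Below is a rough sketch of the loop the question describes, assuming the iris data from load_iris and only two of the listed models; max_iter=1000, n_splits=5, n_repeats=2, random_state=0 and the results dictionary are illustrative choices, not values from the question.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate
from sklearn.naive_bayes import GaussianNB

# Keep y as 1-D integer class labels; no one-hot encoding is needed.
X, y = load_iris(return_X_y=True)

# Illustrative subset of the models from the question.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
}

# "roc_auc_ovr" is the built-in one-vs-rest multiclass ROC AUC scorer.
scoring = ["accuracy", "neg_log_loss", "roc_auc_ovr"]
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=0)

results = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, scoring=scoring, cv=cv, error_score="raise")
    # cross_validate stores each metric under the key "test_<metric name>".
    results[name] = {metric: scores[f"test_{metric}"].mean() for metric in scoring}

for name, means in results.items():
    print(name, means)
```

The same loop can be nested inside a loop over the datasets dictionary to cover each preprocessing variant. Every model must implement predict_proba for neg_log_loss, which is why SVC needs probability=True.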