I am trying to evaluate three scoring metrics ('neg_log_loss', 'accuracy', 'roc_auc') with the cross_validate function on the iris dataset.
I am starting with the code below, and then I want to wrap it in a for loop that iterates over the various classifiers and their preprocessing methods:
datasets = {
"Unprocessed": (x_train, x_test, y_train, y_test),
"Standardisation": (x1_train, x1_test, y_train, y_test),
"Normalisation": (x2_train, x2_test, y_train, y_test),
"Rescale": (x3_train, x3_test, y_train, y_test),
}
models = {
"Logistic Regression": LogisticRegression(),
"Support Vector Machine": SVC(probability=True),
"Decision Tree": DecisionTreeClassifier(max_leaf_nodes=3),
"Random Forest": RandomForestClassifier(max_depth=3),
"LinearDiscriminant": LinearDiscriminantAnalysis(),
"K-Nearest Neighbour": KNeighborsClassifier(n_neighbors=3),
"Naive Bayes": GaussianNB(),
"XGBoost": XGBClassifier()
}
scores_ = ['accuracy','neg_log_loss', 'roc_auc']
cv = RepeatedStratifiedKFold(n_splits=10, random_state=seed)
scores = cross_validate(model, X, y, scoring='roc_auc', cv=cv, error_score="raise")
For some reason I am getting nan values for roc_auc even though the data is all numerical. I tried this:
from sklearn.model_selection import cross_validate
from sklearn.utils import shuffle
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
y = encoder.fit_transform(pd.DataFrame(iris.target)).toarray()
X_s, y_s = shuffle(X, y)
scores = cross_validate(model, X_s, y_s, cv=cv, scoring="roc_auc")
However I get:
ValueError: Supported target types are: ('binary', 'multiclass'). Got 'multilabel-indicator' instead.
Why could this be?
CodePudding user response:
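The problem is that, for multi-class classification, roc_auc_score expects class probabilities together with an explicit multi_class strategy ('ovr' or 'ovo'). The built-in 'roc_auc' scorer is set up for binary targets and does not pass one, so the score cannot be computed for the three iris classes and, with the default error_score, cross_validate reports nan.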
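To see this in isolation, here is a minimal sketch that calls roc_auc_score directly on a three-class problem. It assumes scikit-learn's bundled iris data and a plain LogisticRegression with a single train/test split; these are chosen only for illustration and are not the question's setup.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# One column of probabilities per class.
proba = clf.predict_proba(X_test)

# Default settings: a multi-class target is rejected because no
# multi_class strategy has been chosen.
try:
    roc_auc_score(y_test, proba)
    default_error = None
except ValueError as exc:
    default_error = str(exc)

# Same probabilities, with an explicit one-vs-rest strategy.
auc_ovr = roc_auc_score(y_test, proba, multi_class="ovr")

print("error with default settings:", default_error)
print("one-vs-rest AUC:", round(auc_ovr, 3))
```

The first call fails even though it already receives probabilities, which suggests that the missing multi_class argument, not the data, is what has to change.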
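Use a custom scorer, and pass the original integer labels (iris.target) rather than the one-hot encoded y_s. RepeatedStratifiedKFold stratifies on a 1-D label vector, so it rejects a one-hot matrix as 'multilabel-indicator', which is the ValueError above:
from sklearn.metrics import roc_auc_score, make_scorer
# needs_proba=True makes the scorer call predict_proba; scikit-learn 1.4+ replaces it with response_method="predict_proba"
multi_roc_scorer = make_scorer(lambda y_in, y_p_in: roc_auc_score(y_in, y_p_in, multi_class='ovr'), needs_proba=True)
scores = cross_validate(model, X, iris.target, scoring=multi_roc_scorer, cv=cv, error_score="raise")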
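For the loop over classifiers and preprocessing methods that the question is heading towards, something like the following sketch may help. It is only one possible layout and rests on a few assumptions: the dictionaries are shortened, hypothetical stand-ins for the ones in the question, the scalers are wrapped in pipelines instead of being applied beforehand, and the built-in 'roc_auc_ovr' scorer name (available in reasonably recent scikit-learn versions) is used in place of a custom scorer.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X, y = load_iris(return_X_y=True)  # y stays a 1-D vector of class labels

# Hypothetical, shortened versions of the question's dictionaries.
preprocessors = {
    "Unprocessed": None,
    "Standardisation": StandardScaler(),
    "Rescale": MinMaxScaler(),
}
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbour": KNeighborsClassifier(n_neighbors=3),
    "Naive Bayes": GaussianNB(),
}

# 'roc_auc_ovr' is the multi-class (one-vs-rest) counterpart of 'roc_auc'.
scoring = ["accuracy", "neg_log_loss", "roc_auc_ovr"]
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=0)

results = {}
for prep_name, prep in preprocessors.items():
    for model_name, model in models.items():
        # Inside a pipeline, the scaler is refit on each training fold only.
        estimator = model if prep is None else make_pipeline(prep, model)
        cv_scores = cross_validate(
            estimator, X, y, scoring=scoring, cv=cv, error_score="raise"
        )
        results[(prep_name, model_name)] = {
            metric: cv_scores[f"test_{metric}"].mean() for metric in scoring
        }

for key, metrics in results.items():
    print(key, {name: round(value, 3) for name, value in metrics.items()})
```

Passing a list to scoring makes cross_validate return one test_<metric> array per metric, so all three scores come from the same folds.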