I have applied an XGBoost classifier to classify my output. I already obtain the probability by using predict_proba. However, I would like to add a confidence interval to show how confident the model is in each prediction.
import xgboost as xgb
from sklearn.metrics import classification_report, confusion_matrix

xgb_model = xgb.XGBClassifier(base_score=0.2, booster='gbtree', colsample_bylevel=1,
                              colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
                              gamma=0, gpu_id=-1, importance_type=None,
                              interaction_constraints='', learning_rate=0.300000012,
                              max_delta_step=0, max_depth=40, min_child_weight=1,
                              monotone_constraints='()', n_estimators=100, n_jobs=10,
                              num_parallel_tree=1, objective='multi:softprob', predictor='auto',
                              random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=None,
                              subsample=1, tree_method='exact', validate_parameters=1,
                              verbosity=None)
xgb_model.fit(X_train_vectors_tfidf, y_train)
y_predict = xgb_model.predict(X_test_vectors_tfidf)
y_prob = xgb_model.predict_proba(X_test_vectors_tfidf)[:,1]
print(classification_report(y_test,y_predict))
print('Confusion Matrix:',confusion_matrix(y_test, y_predict))
CodePudding user response:
One way to compute confidence intervals for the predicted probabilities is bootstrapping: repeatedly resample the test set with replacement, refit the model on each resample, and record the probabilities the refitted model predicts for the original test set. The spread of those probabilities across the bootstrap iterations gives you an idea of the uncertainty in each prediction.
import numpy as np
from sklearn.utils import resample

# Number of bootstrap iterations and number of samples in the test set
n_bootstraps = 100
n_samples = X_test_vectors_tfidf.shape[0]

# Array to store the predicted probabilities from each bootstrap iteration
y_prob_bootstrapped = np.zeros((n_bootstraps, n_samples))

# Use bootstrapping to estimate the uncertainty in the predicted probabilities
for i in range(n_bootstraps):
    # Sample the test set with replacement
    X_resampled, y_resampled = resample(X_test_vectors_tfidf, y_test,
                                        n_samples=n_samples, random_state=i,
                                        replace=True)
    # Refit the model on the resampled data
    xgb_model.fit(X_resampled, y_resampled)
    # Record the predicted probabilities on the original test set
    y_prob_bootstrapped[i] = xgb_model.predict_proba(X_test_vectors_tfidf)[:, 1]

# Mean predicted probability per test sample across all bootstrap iterations
y_prob_mean = y_prob_bootstrapped.mean(axis=0)

# 95% confidence interval per test sample, taken across the bootstrap axis
ci_lower = np.percentile(y_prob_bootstrapped, 2.5, axis=0)
ci_upper = np.percentile(y_prob_bootstrapped, 97.5, axis=0)
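As a quick sanity check, you can then print the mean probability and its interval for a few test samples (a minimal sketch, using the arrays computed above):
# Show the mean probability and its 95% CI for the first few test samples
for j in range(5):
    print(f'sample {j}: p = {y_prob_mean[j]:.3f}, '
          f'95% CI = [{ci_lower[j]:.3f}, {ci_upper[j]:.3f}]')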
It looks like X_test_vectors_tfidf is a sparse matrix, i.e. a matrix where most of the elements are zero and only a small fraction have non-zero values. If len() is called on a sparse matrix (which can happen when resample from sklearn.utils has to infer the sample count), SciPy raises a TypeError, because the length of a sparse matrix is ambiguous; the error message itself suggests using getnnz() or shape[0] instead. To solve this error, you can either convert the sparse matrix to a dense array by calling the toarray() method, or use the shape[0] attribute to pass the number of samples explicitly, as done with n_samples=n_samples above.
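A minimal sketch of both workarounds, assuming X_test_vectors_tfidf is a scipy.sparse matrix (for example the output of a TfidfVectorizer):
from sklearn.utils import resample

# Workaround 1: densify first (simple, but memory-hungry for large TF-IDF matrices)
X_dense = X_test_vectors_tfidf.toarray()
X_res, y_res = resample(X_dense, y_test, replace=True, random_state=0)

# Workaround 2: keep the matrix sparse and pass the sample count explicitly
n_samples = X_test_vectors_tfidf.shape[0]
X_res, y_res = resample(X_test_vectors_tfidf, y_test,
                        n_samples=n_samples, replace=True, random_state=0)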