How can I extract feature importances, together with the feature names, from logistic regression, GBM and XGBoost classifiers in scikit-learn when each classifier sits in a pipeline with preprocessing? In other words, how do I extract feature importances from a sklearn pipeline?
From the brief research I've done, I am not sure if this is possible with scikit-learn.
I also found a package called ELI5 (https://eli5.readthedocs.io/en/latest/overview.html) that is supposed to solve this problem for scikit-learn, but I would like to compare its output with the models' own feature importance results.
Please see the code below:
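import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.base import BaseEstimator
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_validate, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
# Defining regressand (y) and regressors (X)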
X = df3.drop(['Prediction_SAP_Burst','Unnamed: 0'], axis=1)
y = df3['Prediction_SAP_Burst']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.30, random_state=1,stratify=y)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
# Keep the original column order; wrapping the names in set() would shuffle them,
# and the order matters when mapping importances back to feature names.
numerical_features = X.columns.to_list()
numerical_transformer = Pipeline(steps=[
    ('scaler', MinMaxScaler())])
# Link all the transformers together in a ColumnTransformer
from sklearn.compose import ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features)
    ])
from xgboost import XGBClassifier
# Create a pipeline for each classifier.
lr_clf = Pipeline(steps=[('preprocessor', preprocessor),
                         ('lr', LogisticRegression(class_weight={0: 0.52, 1: 16.14}, random_state=1))])
xgb_clf = Pipeline(steps=[('preprocessor', preprocessor),
                          ('classifier', XGBClassifier(scale_pos_weight=8, random_state=1))])
gbc_clf = Pipeline(steps=[('preprocessor', preprocessor),
                          ('classifier', GradientBoostingClassifier(random_state=1, learning_rate=0.2, max_features=0.5,
                                                                    n_estimators=250, subsample=0.8))])
lr_clf.fit(X_train, y_train)
xgb_clf.fit(X_train, y_train)
gbc_clf.fit(X_train, y_train)
# Compute cross-validation scores, classification report and confusion matrix
def results(name: str, model: BaseEstimator) -> None:
    # Predictions on the held-out test set
    preds = model.predict(X_test)
    # 5-fold stratified cross-validation on the training set, scored with F1
    model_cv = cross_validate(model, X_train, y_train, cv=StratifiedKFold(n_splits=5), n_jobs=-1, scoring='f1')
    print(f"Kfold F1 scores: {model_cv['test_score']}")
    print(f"Average score of Kfold: {model_cv['test_score'].mean():.3f} +/- {model_cv['test_score'].std():.3f}")
    print(name + " score: %.3f" % model.score(X_test, y_test))
    print(classification_report(y_test, preds))
    labels = ['Good', 'Bad']
    # Row-normalized confusion matrix (each row sums to 1)
    conf_matrix = confusion_matrix(y_test, preds, normalize='true')
    font = {'family' : 'normal',
            'size' : 14}
    plt.rc('font', **font)
    plt.figure(figsize=(10, 6))
    sns.heatmap(conf_matrix, xticklabels=labels, yticklabels=labels, annot=True, cmap='Blues')
    plt.title("Confusion Matrix for " + name)
    plt.ylabel('True Class')
    plt.xlabel('Predicted Class')
results("Logistic Regression" , lr_clf)
results("X Gradient Boost" , xgb_clf)
results("Gradient boost Classifier" , gbc_clf)
CodePudding user response:
It is possible to extract feature importances from a scikit-learn pipeline. The specific way to do it will depend on which classifier you are using in the pipeline.
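After fitting the pipeline, you can reach the classifier step and read its fitted attributes. Tree-based models such as XGBClassifier and GradientBoostingClassifier expose .feature_importances_, while LogisticRegression has no such attribute and exposes its coefficients through .coef_ instead. For example, if you are using an XGBoost classifier, you can access the feature importances like this: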
# Define the pipeline
xgb_clf = Pipeline(steps=[('preprocessor', preprocessor),
                          ('classifier', XGBClassifier(scale_pos_weight=8, random_state=1))])
# Fit the pipeline
xgb_clf.fit(X_train, y_train)
# Access the feature importances of the fitted classifier step by its name
# (equivalent to xgb_clf.steps[1][1].feature_importances_)
importances = xgb_clf.named_steps['classifier'].feature_importances_
Using ELI5 (permutation importance computed on the whole pipeline):
# Import the necessary modules
import eli5
from eli5.sklearn import PermutationImportance
# Define the pipeline
xgb_clf = Pipeline(steps=[('preprocessor', preprocessor),
                          ('classifier', XGBClassifier(scale_pos_weight=8, random_state=1))])
# Fit the pipeline
xgb_clf.fit(X_train, y_train)
# Use ELI5 to explain the weights of the model
explainer = PermutationImportance(xgb_clf, random_state=1).fit(X_train, y_train)
eli5.show_weights(explainer, feature_names=X_train.columns.tolist())
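If you want something to compare the ELI5 numbers against, scikit-learn ships its own permutation importance in sklearn.inspection. This is a different implementation from ELI5's, not a drop-in replacement, so treat the following as a rough sketch: the synthetic data and the feat_* column names are hypothetical, and GradientBoostingClassifier stands in for your XGBoost step. Because the whole pipeline is passed in, the raw input columns are the ones being shuffled, so the original column names apply directly.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the question's data (column names are hypothetical).
X_arr, y_all = make_classification(
    n_samples=300, n_features=5, n_informative=3, random_state=1
)
X_all = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(5)])
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.3, random_state=1, stratify=y_all
)

pipe = Pipeline(steps=[
    ("scaler", MinMaxScaler()),
    ("classifier", GradientBoostingClassifier(random_state=1)),
])
pipe.fit(X_tr, y_tr)

# Shuffle each input column n_repeats times and record the drop in F1
# on the held-out data.
result = permutation_importance(
    pipe, X_te, y_te, scoring="f1", n_repeats=10, random_state=1
)

perm_imp = pd.DataFrame(
    {"mean": result.importances_mean, "std": result.importances_std},
    index=X_te.columns,
).sort_values("mean", ascending=False)
print(perm_imp)
```

Computing it on the test split, as here, measures how much the model relies on each feature for held-out predictions; computing it on the training split, as in the ELI5 snippet, can overstate features the model has overfit to.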
CodePudding user response:
This is really hard for anyone to answer, because there is no ideal method for finding feature importances.
To help your googling, though: what you're talking about is the saliency of a model, and searching for that term has helped me find methods for this in the past.
I remember using a Captum saliency library to find feature importances.