How do I extract feature importances from a Sklearn pipeline


I'm wondering how I can extract feature importances, together with the feature names, from Logistic Regression, GBM and XGBoost in scikit-learn when the classifier is used in a pipeline with preprocessing. In short: how do I extract feature importances from a Sklearn pipeline?

From the brief research I've done, I am not sure if this is possible with scikit-learn.

I also found a package called ELI5 (https://eli5.readthedocs.io/en/latest/overview.html) that is supposed to address this for scikit-learn, but I would like to compare its output with the built-in feature importance results.

Please see the code below:

# Define the target (y) and the features (X)
X = df3.drop(['Prediction_SAP_Burst','Unnamed: 0'], axis=1)
y = df3['Prediction_SAP_Burst']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.30, random_state=1,stratify=y)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

numerical_features = list(set(X.columns.to_list()))

numerical_transformer = Pipeline(steps=[
    ('scaler', MinMaxScaler())])

#Link all the transformers together in a ColumnTransformer

from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features)
            ])

from xgboost import XGBClassifier

# create a pipeline for each classifier.

lr_clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('lr', LogisticRegression(class_weight={0:0.52,1:16.14},random_state=1))])

xgb_clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', XGBClassifier(scale_pos_weight=8, random_state=1))])

gbc_clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', GradientBoostingClassifier(random_state=1, learning_rate=0.2, max_features=0.5,
                      n_estimators=250, subsample=0.8))])

lr_clf.fit(X_train, y_train)
xgb_clf.fit(X_train, y_train)
gbc_clf.fit(X_train, y_train)

#compute classification report and confusion matrix

def results(name: str, model: BaseEstimator) -> None:
    preds = model.predict(X_test)

    model_cv = cross_validate(model, X_train, y_train, cv=StratifiedKFold(n_splits=5), n_jobs=-1, scoring='f1')
    print(f"Kfold precision score: {model_cv['test_score']}")
    print(f"Average score of Kfold: {model_cv['test_score'].mean():.3f}  /- {model_cv['test_score'].std():.3f}")

    print(name   " score: %.3f" % model.score(X_test, y_test))
    print(classification_report(y_test, preds))
    labels = ['Good', 'Bad']

    conf_matrix = confusion_matrix(y_test, preds, normalize='true')

    font = {'family' : 'normal',
            'size'   : 14}

    plt.rc('font', **font)
    plt.figure(figsize= (10,6))
    sns.heatmap(conf_matrix, xticklabels=labels, yticklabels=labels, annot=True, cmap='Blues')
    plt.title("Confusion Matrix for "   name)
    plt.ylabel('True Class')
    plt.xlabel('Predicted Class')

results("Logistic Regression" , lr_clf)
results("X Gradient Boost" , xgb_clf)
results("Gradient boost Classifier" , gbc_clf)

CodePudding user response:

Yes, it is possible to extract feature importances from a scikit-learn pipeline. The exact attribute depends on which classifier is in the pipeline.

In general, after fitting the pipeline you can reach into the classifier step and read its importance attribute: tree ensembles such as GradientBoostingClassifier and XGBClassifier expose .feature_importances_, while LogisticRegression exposes its coefficients via .coef_ instead. For example, with the XGBoost classifier you can access the feature importances like this:

# Define the pipeline
xgb_clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', XGBClassifier(scale_pos_weight=8, random_state=1))])

# Fit the pipeline
xgb_clf.fit(X_train, y_train)

# Access the feature importances
importances = xgb_clf.steps[1][1].feature_importances_
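
To attach feature names to those numbers, you can ask the fitted preprocessor for the names of the columns it produces and wrap everything in a pandas Series. The sketch below is untested against your data and assumes the three pipelines above have already been fitted and that you are on a recent scikit-learn (ColumnTransformer.get_feature_names_out() was added in 1.0; on older versions fall back to numerical_features). It uses coef_ for LogisticRegression, which has no feature_importances_ attribute:

# Sketch: map importances/coefficients back to column names
import pandas as pd

# Names of the columns coming out of the ColumnTransformer
# (prefixed with the transformer name, e.g. 'num__<column>')
feature_names = xgb_clf.named_steps['preprocessor'].get_feature_names_out()

xgb_importances = pd.Series(
    xgb_clf.named_steps['classifier'].feature_importances_,
    index=feature_names).sort_values(ascending=False)

gbc_importances = pd.Series(
    gbc_clf.named_steps['classifier'].feature_importances_,
    index=feature_names).sort_values(ascending=False)

# LogisticRegression exposes coefficients instead of feature_importances_;
# sort by absolute value to rank them (key= requires pandas >= 1.1)
lr_coefficients = pd.Series(
    lr_clf.named_steps['lr'].coef_[0],
    index=feature_names).sort_values(key=abs, ascending=False)

print(xgb_importances.head(10))
print(gbc_importances.head(10))
print(lr_coefficients.head(10))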

Using ELI5

# Import the necessary modules
import eli5
from eli5.sklearn import PermutationImportance

# Define the pipeline
xgb_clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', XGBClassifier(scale_pos_weight=8, random_state=1))])

# Fit the pipeline
xgb_clf.fit(X_train, y_train)

# Use ELI5 to explain the weights of the model
explainer = PermutationImportance(xgb_clf, random_state=1).fit(X_train, y_train)
eli5.show_weights(explainer, feature_names=X_train.columns.tolist())
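
If you mainly want a cross-check for the ELI5 numbers, scikit-learn has shipped its own permutation importance in sklearn.inspection since version 0.22, and it accepts the whole pipeline, so the importances are reported against the original input columns. A minimal sketch on the held-out test set (the scoring metric and n_repeats are just example choices):

from sklearn.inspection import permutation_importance
import pandas as pd

# Permute each raw column in X_test and measure the drop in F1
perm = permutation_importance(xgb_clf, X_test, y_test,
                              scoring='f1', n_repeats=10,
                              random_state=1, n_jobs=-1)

perm_importances = pd.Series(perm.importances_mean, index=X_test.columns)
print(perm_importances.sort_values(ascending=False).head(10))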

CodePudding user response:

This is hard for anyone to answer definitively; there is no single ideal method for finding feature importances.

But to help your googling: "saliency" is the term for what you're talking about, and it has helped me in the past to find methods for this.

I remember using the Captum saliency library to find feature importances.
