Inspection of the feature importance in scikit-learn pipelines


I have defined the following pipelines using scikit-learn:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import HistGradientBoostingClassifier

model_lg = Pipeline([("preprocessing", StandardScaler()), ("classifier", LogisticRegression())])
model_dt = Pipeline([("preprocessing", StandardScaler()), ("classifier", DecisionTreeClassifier())])
model_gb = Pipeline([("preprocessing", StandardScaler()), ("classifier", HistGradientBoostingClassifier())])

Then I used cross validation to evaluate the performance of each model:

cv_results_lg = cross_validate(model_lg, data, target, cv=5, return_train_score=True, return_estimator=True)
cv_results_dt = cross_validate(model_dt, data, target, cv=5, return_train_score=True, return_estimator=True)
cv_results_gb = cross_validate(model_gb, data, target, cv=5, return_train_score=True, return_estimator=True)

When I try to inspect the feature importance of each model via the coef_ attribute, I get an AttributeError:

model_lg.steps[1][1].coef_
AttributeError: 'LogisticRegression' object has no attribute 'coef_'

model_dt.steps[1][1].coef_
AttributeError: 'DecisionTreeClassifier' object has no attribute 'coef_'

model_gb.steps[1][1].coef_
AttributeError: 'HistGradientBoostingClassifier' object has no attribute 'coef_'

How can I fix this error? Or is there another way to inspect the feature importance of each model?

CodePudding user response:

Imo, the point here is the following. On the one hand, the pipeline instances model_lg, model_dt, etc. are never explicitly fitted (you don't call .fit() on them directly), so they carry no fitted attributes and accessing coef_ on them raises the AttributeError you're seeing.

On the other hand, by calling cross_validate() with return_estimator=True (a parameter that, among the cross-validation helpers, only cross_validate() offers), you get the fitted estimator for each CV split back. You should access these through the returned dictionaries cv_results_lg, cv_results_dt, etc., under the 'estimator' key. Here's an example:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate

X, y = load_iris(return_X_y=True)

model_lg = Pipeline([("preprocessing", StandardScaler()), ("classifier", LogisticRegression())])

cv_results_lg = cross_validate(model_lg, X, y, cv=5, return_train_score=True, return_estimator=True)

For instance, these are the coefficients fitted on the first fold:

cv_results_lg['estimator'][0].named_steps['classifier'].coef_
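To summarize across splits rather than look at a single fold, one option is to stack the per-fold coefficients and average them. A minimal sketch of that idea (the averaging step itself is my suggestion, not part of the original answer):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate

X, y = load_iris(return_X_y=True)

model_lg = Pipeline([("preprocessing", StandardScaler()),
                     ("classifier", LogisticRegression())])
cv_results_lg = cross_validate(model_lg, X, y, cv=5, return_estimator=True)

# Stack coef_ from every fitted fold: shape (n_folds, n_classes, n_features)
coefs = np.array([est.named_steps["classifier"].coef_
                  for est in cv_results_lg["estimator"]])
print(coefs.shape)         # (5, 3, 4) for iris with 5 folds
print(coefs.mean(axis=0))  # per-class coefficients averaged over folds
```

Averaging over folds also gives you a quick sense of how stable each coefficient is (e.g. via coefs.std(axis=0)).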


CodePudding user response:

Try this:

from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_iris


X, y = load_iris(return_X_y=True)


model_lg = Pipeline([("preprocessing", StandardScaler()), ("classifier", LogisticRegression())])
model_dt = Pipeline([("preprocessing", StandardScaler()), ("classifier", DecisionTreeClassifier())])
model_gb = Pipeline([("preprocessing", StandardScaler()), ("classifier", HistGradientBoostingClassifier())])

Output:

>>> cross_validate(model_lg, X, y, cv=5,return_train_score=True, return_estimator=True)

{'fit_time': array([0.01361918, 0.01245522, 0.02597785, 0.01145387, 0.009269  ]),
 'score_time': array([0.00070882, 0.00079298, 0.00081825, 0.00045204, 0.0004909 ]),
 'estimator': [Pipeline(steps=[('preprocessing', StandardScaler()),
                  ('classifier', LogisticRegression())]),
  Pipeline(steps=[('preprocessing', StandardScaler()),
                  ('classifier', LogisticRegression())]),
  Pipeline(steps=[('preprocessing', StandardScaler()),
                  ('classifier', LogisticRegression())]),
  Pipeline(steps=[('preprocessing', StandardScaler()),
                  ('classifier', LogisticRegression())]),
  Pipeline(steps=[('preprocessing', StandardScaler()),
                  ('classifier', LogisticRegression())])],
 'test_score': array([0.96666667, 1.        , 0.93333333, 0.9       , 1.        ]),
 'train_score': array([0.95      , 0.96666667, 0.98333333, 0.98333333, 0.96666667])}

CodePudding user response:

You can also loop over the pipelines, cross-validate each one, and print its accuracy.
