Inspection of the feature importance in scikit-learn pipelines


I have defined the following pipelines using scikit-learn:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import HistGradientBoostingClassifier

model_lg = Pipeline([("preprocessing", StandardScaler()), ("classifier", LogisticRegression())])
model_dt = Pipeline([("preprocessing", StandardScaler()), ("classifier", DecisionTreeClassifier())])
model_gb = Pipeline([("preprocessing", StandardScaler()), ("classifier", HistGradientBoostingClassifier())])

Then I used cross validation to evaluate the performance of each model:

cv_results_lg = cross_validate(model_lg, data, target, cv=5, return_train_score=True, return_estimator=True)
cv_results_dt = cross_validate(model_dt, data, target, cv=5, return_train_score=True, return_estimator=True)
cv_results_gb = cross_validate(model_gb, data, target, cv=5, return_train_score=True, return_estimator=True)

When I try to inspect the feature importance of each model via the coef_ attribute, I get an AttributeError:

model_lg.steps[1][1].coef_
AttributeError: 'LogisticRegression' object has no attribute 'coef_'

model_dt.steps[1][1].coef_
AttributeError: 'DecisionTreeClassifier' object has no attribute 'coef_'

model_gb.steps[1][1].coef_
AttributeError: 'HistGradientBoostingClassifier' object has no attribute 'coef_'

How can I fix this error? Or is there another way to inspect the feature importance of each model?

CodePudding user response:

Imo, the point here is the following. On the one hand, the pipeline instances model_lg, model_dt, etc. are never explicitly fitted (you don't call .fit() on them directly), so they carry no fitted attributes and accessing coef_ on them raises the AttributeError you're seeing.

On the other hand, by calling cross_validate() with return_estimator=True (a parameter that, among the cross-validation helpers, only cross_validate() offers), you get the fitted estimator for each CV split back. You should access these through the returned dictionaries cv_results_lg, cv_results_dt, etc., under the 'estimator' key. Here's an example:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate

X, y = load_iris(return_X_y=True)

model_lg = Pipeline([("preprocessing", StandardScaler()), ("classifier", LogisticRegression())])

cv_results_lg = cross_validate(model_lg, X, y, cv=5, return_train_score=True, return_estimator=True)

For instance, these are the coefficients fitted on the first fold:

cv_results_lg['estimator'][0].named_steps['classifier'].coef_
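To summarize across splits rather than look at a single fold, one option is to stack the per-fold coefficients and average them. A minimal sketch of that idea (the averaging step itself is my suggestion, not part of the original answer):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate

X, y = load_iris(return_X_y=True)

model_lg = Pipeline([("preprocessing", StandardScaler()),
                     ("classifier", LogisticRegression())])
cv_results_lg = cross_validate(model_lg, X, y, cv=5, return_estimator=True)

# Stack coef_ from every fitted fold: shape (n_folds, n_classes, n_features)
coefs = np.array([est.named_steps["classifier"].coef_
                  for est in cv_results_lg["estimator"]])
print(coefs.shape)         # (5, 3, 4) for iris with 5 folds
print(coefs.mean(axis=0))  # per-class coefficients averaged over folds
```

Averaging over folds also gives you a quick sense of how stable each coefficient is (e.g. via coefs.std(axis=0)).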


CodePudding user response:

Try this:

from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_iris


X, y = load_iris(return_X_y=True)


model_lg = Pipeline([("preprocessing", StandardScaler()), ("classifier", LogisticRegression())])
model_dt = Pipeline([("preprocessing", StandardScaler()), ("classifier", DecisionTreeClassifier())])
model_gb = Pipeline([("preprocessing", StandardScaler()), ("classifier", HistGradientBoostingClassifier())])

Output:

>>> cross_validate(model_lg, X, y, cv=5,return_train_score=True, return_estimator=True)

{'fit_time': array([0.01361918, 0.01245522, 0.02597785, 0.01145387, 0.009269  ]),
 'score_time': array([0.00070882, 0.00079298, 0.00081825, 0.00045204, 0.0004909 ]),
 'estimator': [Pipeline(steps=[('preprocessing', StandardScaler()),
                  ('classifier', LogisticRegression())]),
  Pipeline(steps=[('preprocessing', StandardScaler()),
                  ('classifier', LogisticRegression())]),
  Pipeline(steps=[('preprocessing', StandardScaler()),
                  ('classifier', LogisticRegression())]),
  Pipeline(steps=[('preprocessing', StandardScaler()),
                  ('classifier', LogisticRegression())]),
  Pipeline(steps=[('preprocessing', StandardScaler()),
                  ('classifier', LogisticRegression())])],
 'test_score': array([0.96666667, 1.        , 0.93333333, 0.9       , 1.        ]),
 'train_score': array([0.95      , 0.96666667, 0.98333333, 0.98333333, 0.96666667])}

CodePudding user response:

You can also loop over the pipelines, cross-validate each one, and print its accuracy.
