I have defined the following pipelines using scikit-learn:
model_lg = Pipeline([("preprocessing", StandardScaler()), ("classifier", LogisticRegression())])
model_dt = Pipeline([("preprocessing", StandardScaler()), ("classifier", DecisionTreeClassifier())])
model_gb = Pipeline([("preprocessing", StandardScaler()), ("classifier", HistGradientBoostingClassifier())])
Then I used cross validation to evaluate the performance of each model:
cv_results_lg = cross_validate(model_lg, data, target, cv=5, return_train_score=True, return_estimator=True)
cv_results_dt = cross_validate(model_dt, data, target, cv=5, return_train_score=True, return_estimator=True)
cv_results_gb = cross_validate(model_gb, data, target, cv=5, return_train_score=True, return_estimator=True)
When I try to inspect the feature importance for each model using the coef_
attribute, it raises an AttributeError:
model_lg.steps[1][1].coef_
AttributeError: 'LogisticRegression' object has no attribute 'coef_'
model_dt.steps[1][1].coef_
AttributeError: 'DecisionTreeClassifier' object has no attribute 'coef_'
model_gb.steps[1][1].coef_
AttributeError: 'HistGradientBoostingClassifier' object has no attribute 'coef_'
I was wondering how I can fix this error, or whether there is another approach to inspect the feature importance of each model?
CodePudding user response:
Imo, the point here is the following. On the one hand, the pipeline instances model_lg, model_dt etc. are not explicitly fitted (you're not calling .fit() on them directly), and this prevents you from accessing the coef_ attribute on the instances themselves.
On the other hand, by calling cross_validate() with parameter return_estimator=True (which, among the cross-validation functions, is possible with cross_validate() only), you can get the fitted estimators back for each cv split, but you should access them via your dictionaries cv_results_lg, cv_results_dt etc. (under the 'estimator' key). Here's an example:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate
X, y = load_iris(return_X_y=True)
model_lg = Pipeline([("preprocessing", StandardScaler()), ("classifier", LogisticRegression())])
cv_results_lg = cross_validate(model_lg, X, y, cv=5, return_train_score=True, return_estimator=True)
For instance, these are the coefficients fitted on the first fold:
cv_results_lg['estimator'][0].named_steps['classifier'].coef_
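Note that coef_ only exists for linear models. For the decision-tree pipeline from the question, the fitted classifier exposes feature_importances_ instead; a minimal sketch (using the same iris data as above):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model_dt = Pipeline([("preprocessing", StandardScaler()),
                     ("classifier", DecisionTreeClassifier(random_state=0))])
cv_results_dt = cross_validate(model_dt, X, y, cv=5, return_estimator=True)

# Trees have no coef_; feature_importances_ gives one value per feature,
# normalized to sum to 1.
importances = cv_results_dt["estimator"][0].named_steps["classifier"].feature_importances_
print(importances)
```

The same 'estimator' key gives you one fitted pipeline per fold, so you can average feature_importances_ across the five folds for a more stable ranking.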
Useful insights on related topics might be found in:
- How to get feature importances of a multi-label classification problem?
- Get support and ranking attributes for RFE using Pipeline in Python 3
CodePudding user response:
Try this:
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
model_lg = Pipeline([("preprocessing", StandardScaler()), ("classifier", LogisticRegression())])
model_dt = Pipeline([("preprocessing", StandardScaler()), ("classifier", DecisionTreeClassifier())])
model_gb = Pipeline([("preprocessing", StandardScaler()), ("classifier", HistGradientBoostingClassifier())])
Output:
>>> cross_validate(model_lg, X, y, cv=5,return_train_score=True, return_estimator=True)
{'fit_time': array([0.01361918, 0.01245522, 0.02597785, 0.01145387, 0.009269 ]),
'score_time': array([0.00070882, 0.00079298, 0.00081825, 0.00045204, 0.0004909 ]),
'estimator': [Pipeline(steps=[('preprocessing', StandardScaler()),
('classifier', LogisticRegression())]),
Pipeline(steps=[('preprocessing', StandardScaler()),
('classifier', LogisticRegression())]),
Pipeline(steps=[('preprocessing', StandardScaler()),
('classifier', LogisticRegression())]),
Pipeline(steps=[('preprocessing', StandardScaler()),
('classifier', LogisticRegression())]),
Pipeline(steps=[('preprocessing', StandardScaler()),
('classifier', LogisticRegression())])],
'test_score': array([0.96666667, 1. , 0.93333333, 0.9 , 1. ]),
'train_score': array([0.95 , 0.96666667, 0.98333333, 0.98333333, 0.96666667])}
CodePudding user response:
Make a for loop over the algorithms and print the accuracy of each.