I'm using a RandomForestClassifier model, which I'd like to explain with Shapley values. The data (150 features) was first run through a PCA transformer, which reduced it to 3 components, and this reduced data was used to fit the RandomForestClassifier. The fitted model is then passed to shap.Explainer(). The problem is that I'd like SHAP to explain the model in terms of the original 150 features, not the 3 PCA components, so I called shap.Explainer() with the original data:
import shap
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

# the selector:
fs_all_pca = PCA(n_components=3).fit(X)
X_all_pca = fs_all_pca.transform(X)
# the model:
model = RandomForestClassifier(max_depth=5, min_samples_split=4, n_estimators=200, min_samples_leaf=3, class_weight=class_weights)
model.fit(X_all_pca, y)
# explain with shap:
explainer = shap.Explainer(model.predict, X)
shap_values = explainer(X)
However, I get this error:
ValueError: X has 150 features, but RandomForestClassifier is expecting 3 features as input.
Is there a way to run SHAP on the PCA-based model using the original data, i.e. before it passes through the PCA stage?
CodePudding user response:
You're fitting your model on X_all_pca, which has 3 features:
fs_all_pca = PCA(n_components=3).fit(X)
X_all_pca = fs_all_pca.transform(X)
model.fit(X_all_pca, y)
However, when building the explainer you pass all 150 original features:
explainer = shap.Explainer(model.predict, X)
Hence the error message. The call should instead be:
explainer = shap.Explainer(model.predict, X_all_pca)
If you for some reason want the analysis in terms of all 150 original features (why do PCA then???), wrap the PCA and the model in a pipeline, so the combined model maps the original features to predictions, and feed the pipeline's prediction function through KernelExplainer.