I'm new to using shap, so I'm still trying to get my head around it. Basically, I have a simple sklearn.ensemble.RandomForestClassifier fit using model.fit(X_train, y_train), and so on. After training, I'd like to obtain the SHAP values to explain predictions on unseen data. Based on the docs and other tutorials, this seems to be the way to go:
explainer = shap.Explainer(model.predict, X_train)
shap_values = explainer.shap_values(X_test)
However, this takes a long time to run (about 18 hours for my data). If I replace model.predict with just model in the first line, i.e.:
explainer = shap.Explainer(model, X_train)
shap_values = explainer.shap_values(X_test)
it significantly reduces the runtime (down to about 40 minutes). So that leaves me wondering what I'm actually getting in the second case.
To reiterate, I just want to be able to explain new predictions, and it seems strange to me that it would be this expensive, so I'm sure I'm doing something wrong.
CodePudding user response:
I think your question already contains a hint:
explainer = shap.Explainer(model.predict, X_train)
shap_values = explainer.shap_values(X_test)
is expensive, most probably because SHAP treats model.predict as a black-box function and falls back to a model-agnostic algorithm to calculate Shapley values from that function.
explainer = shap.Explainer(model, X_train)
shap_values = explainer.shap_values(X_test)
works with the trained model itself, so SHAP can dispatch to a model-specific (and much faster) algorithm that uses information readily available inside the model.
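To make that concrete, here is a rough sketch (assuming a tree model like your RandomForestClassifier; the exact dispatch can vary between SHAP versions): passing the fitted model is roughly equivalent to using the tree explainer directly.
import shap

explainer_auto = shap.Explainer(model, X_train)   # auto-dispatches based on the model type
explainer_tree = shap.TreeExplainer(model)        # the tree-specific explainer it resolves to

# Fast, because the algorithm exploits the structure of the fitted trees
shap_values = explainer_tree.shap_values(X_test)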
To prove the first claim (the second is a matter of fact), let's study the source code of the Explainer class.
class Explainer(Serializable):
    """ Uses Shapley values to explain any machine learning model or python function.

    This is the primary explainer interface for the SHAP library. It takes any combination
    of a model and masker and returns a callable subclass object that implements
    the particular estimation algorithm that was chosen.
    """

    def __init__(self, model, masker=None, link=links.identity, algorithm="auto", output_names=None,
                 feature_names=None, linearize_link=True, seed=None, **kwargs):
        """ Build a new explainer for the passed model.

        Parameters
        ----------
        model : object or function
            User supplied function or model object that takes a dataset of samples and
            computes the output of the model for those samples.
So, now you know one can provide either a model or a function as the first argument.
If a Pandas DataFrame (or a 2-D array) is supplied as the masker:
if safe_isinstance(masker, "pandas.core.frame.DataFrame") or \
        ((safe_isinstance(masker, "numpy.ndarray") or sp.sparse.issparse(masker)) and len(masker.shape) == 2):
    if algorithm == "partition":
        self.masker = maskers.Partition(masker)
    else:
        self.masker = maskers.Independent(masker)
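As a side note, you can also construct the masker yourself and cap the number of background samples, which directly reduces how many model evaluations are needed per explained row (a sketch; the value 100 is just an illustration):
import shap

# Independent masker over a limited background sample
masker = shap.maskers.Independent(X_train, max_samples=100)
explainer = shap.Explainer(model, masker)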
Finally, if a callable is supplied as the model:
elif callable(self.model):
    if issubclass(type(self.masker), maskers.Independent):
        if self.masker.shape[1] <= 10:
            algorithm = "exact"
        else:
            algorithm = "permutation"
Hopefully, you now see why the first approach ends up with the exact algorithm (or permutation, for more than 10 features) and thus takes so long to calculate.
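If you really do need to explain a bare function such as model.predict, the algorithm parameter from the signature above lets you choose the strategy explicitly instead of relying on the auto-selection (a sketch; permutation is usually far cheaper than exact once you have more than a handful of features):
import shap

# Force the permutation algorithm for the black-box function
explainer = shap.Explainer(model.predict, X_train, algorithm="permutation")
shap_values = explainer(X_test)   # calling the explainer returns an Explanation object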
Now to your question(s):
What is the correct way to obtain explanations for predictions using SHAP?
and
So that leaves me to wonder what I'm actually getting in the second case?
If you have a model (tree, linear, whatever) that is supported by SHAP, use:
explainer = shap.Explainer(model, X_train)
shap_values = explainer.shap_values(X_test)
These are SHAP values extracted from the model itself, which is exactly the use case SHAP was created for.
If your model is not supported, use the first (function-based) approach. Both should give similar results.
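As a usage sketch for the recommended path (assuming a reasonably recent SHAP version and a binary classifier, where the last dimension of the returned Explanation indexes the class):
import shap

explainer = shap.Explainer(model, X_train)
explanation = explainer(X_test)            # Explanation object: values, base values, data

# Visualize the contribution of each feature to a single unseen prediction
# (sample 0, class 1); drop the class index if your version returns 2-D values.
shap.plots.waterfall(explanation[0, :, 1])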