Why does SageMaker SHAP require a baseline dataset?

SageMaker Clarify SHAP (https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.clarify.SHAPConfig) requires users to specify a baseline dataset. The regular, popular SHAP (https://github.com/slundberg/shap) does not require this, making its use simpler than ours.

Why do we require a baseline dataset?

User response:

Most SHAP explainers do require a background (baseline) dataset. To my knowledge, only TreeSHAP can do without one, because it uses information stored in the trees themselves to "integrate out" the features that are masked. The Clarify documentation says it uses Kernel SHAP, so a background dataset is required. Note, however, that Clarify will compute one for you if baseline=None, by clustering (K-means or K-prototypes, per the SHAPConfig documentation) the input dataset you give it.
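To make the role of the baseline concrete, here is a minimal sketch in plain Python. It is not Clarify's or the shap library's implementation: it enumerates every feature subset to get exact Shapley values, which Kernel SHAP only approximates by sampling. The names `shapley_values` and `toy_model` are made up for illustration. The point is that a "masked" feature has to take some value when the model is called, and that value comes from the baseline.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values for one instance x.

    Features outside a coalition are "masked" by substituting the
    corresponding baseline value, so a baseline is unavoidable here.
    """
    n = len(x)

    def value(subset):
        # Features in `subset` keep their real value; the rest use the baseline.
        masked = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(masked)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Standard Shapley weight for a coalition of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_model(features):
    # Hypothetical model: linear terms plus one interaction term.
    return 2.0 * features[0] + 1.0 * features[1] + features[0] * features[2]


x = [1.0, 2.0, 3.0]
phi_zero = shapley_values(toy_model, x, baseline=[0.0, 0.0, 0.0])
phi_ones = shapley_values(toy_model, x, baseline=[1.0, 1.0, 1.0])
print(phi_zero, phi_ones)
```

The same instance gets different attributions under the two baselines, because each attribution explains the difference between the prediction for x and the prediction for the baseline. That is why a model-agnostic method like Kernel SHAP cannot leave the baseline unspecified.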