Interpretability of predictions for Boston Housing using SHAP
Goal¶
This post introduces how to interpret predictions for the Boston Housing dataset using shap.
What is SHAP?¶
SHAP is a library that makes the predictions of machine learning models interpretable, showing which feature variables have an impact on a predicted value. Concretely, it computes SHAP values, i.e., how much each feature variable increases or decreases the prediction.
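More formally, the SHAP values φᵢ(x) of a sample x decompose the gap between that sample's prediction and the average prediction: f(x) = E[f(X)] + Σᵢ φᵢ(x). We verify this additivity property numerically after computing the values below.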
Libraries¶
import xgboost
import shap

# load the JS visualization library into the notebook
shap.initjs()
Load Boston Housing Dataset¶
X, y = shap.datasets.boston()
X[:5]
y[:5]
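A quick look at what was loaded, assuming, as in older shap releases, that X comes back as a pandas DataFrame:
# shape and feature names of the dataset
print(X.shape)
print(list(X.columns))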
Train a predictor with xgboost¶
d_param = {
    "learning_rate": 0.01
}

# train a gradient boosted tree ensemble for 100 rounds
model = xgboost.train(params=d_param,
                      dtrain=xgboost.DMatrix(X, label=y),
                      num_boost_round=100)
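As a quick check that is not in the original notebook, we can look at the model's raw predictions, which SHAP will decompose below:
# raw model predictions on the training data, kept for a later sanity check
y_pred = model.predict(xgboost.DMatrix(X))
y_pred[:5]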
Create an explainer¶
# TreeExplainer computes SHAP values efficiently for tree-based models
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
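A quick numerical check of the additivity property mentioned above: the base value plus a sample's SHAP values should reproduce its prediction, up to floating-point tolerance.
import numpy as np

# base value + per-sample sum of SHAP values ≈ model prediction
np.allclose(explainer.expected_value + shap_values.sum(axis=1), y_pred, atol=1e-3)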
Outcome of SHAP¶
Single prediction explainer¶
The visualization below explains a single prediction, i.e., the one for the i-th sample.
- red: positive impacts on the prediction
- blue: negative impacts on the prediction
i = 0
shap.force_plot(explainer.expected_value, shap_values[i,:], X.iloc[i,:])
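If the JavaScript widget does not render, e.g., outside a notebook, the single-prediction force plot can also be drawn with the matplotlib backend:
# same plot rendered via matplotlib instead of JS (single predictions only)
shap.force_plot(explainer.expected_value, shap_values[i,:], X.iloc[i,:], matplotlib=True)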
Explanations for all predictions¶
The per-sample explanations above can be computed for every data point and summarized in a single plot, as below.
shap.summary_plot(shap_values, X, plot_type="violin")
Variable importance¶
The variable importance shown below simply aggregates the plot above by taking the mean of the absolute SHAP values over all data points for each feature.
shap.summary_plot(shap_values, X, plot_type="bar")
Force Plot¶
Another way to visualize SHAP values is to stack the per-sample force plots above into a single interactive plot, which can be ordered by sample or by feature values.
shap.force_plot(explainer.expected_value, shap_values, X)
Dependence Plot¶
This plot shows a feature's values against its SHAP values as a scatter plot, colored by an automatically selected second feature: the one estimated to interact most strongly with the plotted feature.
# specify by the index of the features
shap.dependence_plot(ind=12, shap_values=shap_values, features=X)
# specify by the feature name
shap.dependence_plot(ind="RM", shap_values=shap_values, features=X)