
Interpretability of prediction for Boston Housing using SHAP

Goal

This post introduces how to interpret predictions for the Boston Housing dataset using SHAP.

What is SHAP?

SHAP is a library for making the predictions of machine learning models interpretable: it shows which feature variables have an impact on the predicted value. In other words, it computes SHAP values, i.e., how much each feature variable pushes the prediction up or down.

Libraries

In [11]:
import xgboost
import shap
shap.initjs()  # load the JavaScript visualization code into the notebook

Load Boston Housing Dataset

In [3]:
X, y = shap.datasets.boston()
X[:5]
Out[3]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33
In [4]:
y[:5]
Out[4]:
array([24. , 21.6, 34.7, 33.4, 36.2])

Train a predictor with xgboost

In [10]:
d_param = {
    "learning_rate": 0.01
}

# Train a gradient boosted regression model for 100 rounds
model = xgboost.train(params=d_param,
                      dtrain=xgboost.DMatrix(X, label=y),
                      num_boost_round=100)

Create an explainer

In [12]:
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
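
For tree models, the SHAP values returned by TreeExplainer satisfy an additivity property: the base value (explainer.expected_value) plus the sum of a sample's SHAP values equals the model's raw prediction for that sample. A minimal sketch to verify this, with numpy as an extra import not used elsewhere in this post:

import numpy as np

# Raw predictions of the trained booster
pred = model.predict(xgboost.DMatrix(X))

# Base value plus per-sample sum of SHAP values should reproduce the prediction
reconstructed = explainer.expected_value + shap_values.sum(axis=1)
print(np.abs(pred - reconstructed).max())  # close to zero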

Outcome of SHAP

Single prediction explainer

The visualization below shows the explanation of a single prediction, here for the i-th observation.

  • red: positive impacts on the prediction
  • blue: negative impacts on the prediction
In [19]:
i = 0
shap.force_plot(explainer.expected_value, shap_values[i,:], X.iloc[i,:])
Out[19]:
[Interactive force plot for the i-th prediction: red features push the prediction higher, blue features push it lower.]
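
If the JavaScript widget does not render, as in the exported output above, a single-sample force plot can usually be drawn as a static image instead. A minimal sketch, assuming the matplotlib option of shap.force_plot:

# Render the single-prediction force plot with matplotlib instead of JavaScript
shap.force_plot(explainer.expected_value, shap_values[i, :], X.iloc[i, :],
                matplotlib=True)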

All prediction explainers

The explanations for all predictions, like the one above, can be combined into a single plot as below.

In [14]:
shap.summary_plot(shap_values, X, plot_type="violin")
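
Omitting plot_type gives the default dot (beeswarm) summary, which shows one point per sample and feature, colored by the feature value; a minimal sketch:

# Default summary plot: one dot per sample per feature, colored by feature value
shap.summary_plot(shap_values, X)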

Variable importance

The variable importance shown below aggregates the plot above by taking the mean of the absolute SHAP values over all data points for each feature (see the sketch after the plot).

In [13]:
shap.summary_plot(shap_values, X, plot_type="bar")
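
The same ranking can be reproduced by hand from the SHAP values. A minimal sketch, again with numpy as an extra import:

import numpy as np

# Mean absolute SHAP value per feature, printed in descending order
importance = np.abs(shap_values).mean(axis=0)
for name, value in sorted(zip(X.columns, importance), key=lambda t: -t[1]):
    print(f"{name}: {value:.3f}")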

Force Plot

Another way to visualize SHAP values is to stack the force plots of all samples into a single interactive plot, which can be reordered by sample similarity or by feature values.

In [16]:
shap.force_plot(explainer.expected_value, shap_values, X)
Out[16]:
[Interactive stacked force plot across all samples.]

Dependence Plot

This plot shows a feature's values against its SHAP values as a scatter plot, colored by another feature that is selected automatically as the one with the strongest interaction with the plotted feature.

In [25]:
# specify the feature by its column index (12 is LSTAT)
shap.dependence_plot(ind=12, shap_values=shap_values, features=X)
In [17]:
# specify the feature by its name
shap.dependence_plot(ind="RM", shap_values=shap_values, features=X)
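
The coloring feature can also be chosen explicitly instead of automatically. A minimal sketch, assuming the interaction_index argument of shap.dependence_plot; LSTAT is only an illustrative choice here:

# color the RM dependence plot by LSTAT instead of the automatic choice
shap.dependence_plot(ind="RM", shap_values=shap_values, features=X,
                     interaction_index="LSTAT")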