# Introduction to Bayesian Optimization

## Goal¶

This notebook aims to introduce how Bayesian Optimization works, using the `bayesian-optimization` module.

Bayesian Optimization is a method for estimating an unknown function: we choose an arbitrary input \$x\$ and observe the response from the function. The outcome of Bayesian Optimization is the mean and confidence interval of the estimated function at each step, so you can stop early or decide to keep iterating.

This notebook covers a first toy example of Bayesian Optimization: we define a "black-box" function and show, interactively and step by step, how Bayesian Optimization estimates it.


## Libraries¶

In [41]:
```
from bayes_opt import BayesianOptimization
from bayes_opt import UtilityFunction
import numpy as np
import warnings
from IPython.display import display, clear_output
import matplotlib.pyplot as plt
from matplotlib import gridspec
%matplotlib inline
```

## Unknown Function¶

We can choose any function to estimate here. As an example, we will use the 1-D function defined by the following equation:

\$\$f(x) = 3e^{-(x-3)^{2}} - e^{-(x-2)^2} + 2 e^{-(x+3)^2}\$\$
In [2]:
```
def unknown_func(x):
    return 3 * np.exp(-(x - 3)**2) - np.exp(-(x - 2)**2) + 2 * np.exp(-(x + 3)**2)
```
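As a quick sanity check of the formula: the first Gaussian bump is centered at \$x=3\$, so `unknown_func(3)` should return roughly \$3 - e^{-1} + 2e^{-36} \approx 2.63\$.

```
print(unknown_func(3))  # ~2.632, near the global maximum
```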

If we visualize the unknown function (as a reference), it looks like the plot below. Note that we are not supposed to know this plot in practice, since the function is a "black box".

In [4]:
```
x = np.linspace(-6, 6, 10000).reshape(-1, 1)
y = unknown_func(x)

plt.plot(x, y);
plt.title('1-D Unknown Function to be estimated');
plt.xlabel('X');
plt.ylabel('Response from the function');
```

## Bayesian Optimization¶

First of all, we need to create a `BayesianOptimization` object by passing the function `f` we want to estimate together with its input boundary `pbounds`.

In [5]:
```
optimizer = BayesianOptimization(f=unknown_func, pbounds={'x': (-6, 6)}, verbose=0)
optimizer
```
Out[5]:
`<bayes_opt.bayesian_optimization.BayesianOptimization at 0x11ab55dd8>`

Then, we can start to explore this function by trying different inputs.

• `init_points` is the number of initial random points to start with.
• `n_iter` is the number of iterations. `optimizer.maximize` holds its state, so each call continues from where the previous one stopped, as sketched below.
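Because `optimizer.maximize` accumulates observations across calls, the model is refined incrementally. A minimal sketch of this behavior, reusing the `optimizer` created above:

```
# Each call to maximize() appends new observations to the same space,
# so the Gaussian Process is refined incrementally.
optimizer.maximize(init_points=2, n_iter=3)
print(len(optimizer.space))  # 5 points evaluated so far
print(optimizer.max)         # best target value and input found so far

optimizer.maximize(init_points=0, n_iter=2)  # continues from the previous state
print(len(optimizer.space))  # now 7 points
```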

### Helper functions¶

In [26]:
```
def posterior(optimizer, x_obs, y_obs, grid):
    optimizer._gp.fit(x_obs, y_obs)
    mu, sigma = optimizer._gp.predict(grid, return_std=True)
    return mu, sigma

def plot_gp(optimizer, x, y, fig=None, xlim=None):
    if fig is None:
        fig = plt.figure(figsize=(16, 10))
    steps = len(optimizer.space)
    fig.suptitle(
        'Gaussian Process and Utility Function After {} Steps'.format(steps),
        fontdict={'size': 30}
    )

    gs = gridspec.GridSpec(2, 1, height_ratios=[3, 1])
    axis = plt.subplot(gs[0])
    acq = plt.subplot(gs[1])

    x_obs = np.array([[res["params"]["x"]] for res in optimizer.res])
    y_obs = np.array([res["target"] for res in optimizer.res])

    mu, sigma = posterior(optimizer, x_obs, y_obs, x)
    axis.plot(x, y, linewidth=3, label='Target')
    axis.plot(x_obs.flatten(), y_obs, 'D', markersize=8, label=u'Observations', color='r')
    axis.plot(x, mu, '--', color='k', label='Prediction')

    axis.fill(np.concatenate([x, x[::-1]]),
              np.concatenate([mu - 1.9600 * sigma, (mu + 1.9600 * sigma)[::-1]]),
              alpha=.3, fc='C0', ec='None', label='95% confidence interval')
    if xlim is not None:
        axis.set_xlim(xlim)
    axis.set_ylim((None, None))
    axis.set_ylabel('f(x)', fontdict={'size': 20})
    axis.set_xlabel('x', fontdict={'size': 20})

    utility_function = UtilityFunction(kind="ucb", kappa=5, xi=0)
    utility = utility_function.utility(x, optimizer._gp, 0)
    acq.plot(x, utility, label='Utility Function', color='C3')
    acq.plot(x[np.argmax(utility)], np.max(utility), 'o', markersize=15,
             label=u'Next Best Guess', markerfacecolor='gold', markeredgecolor='k', markeredgewidth=1)
    if xlim is not None:
        acq.set_xlim(xlim)
    acq.set_ylim((np.min(utility), np.max(utility) + 0.5))
    acq.set_ylabel('Utility', fontdict={'size': 20})
    acq.set_xlabel('x', fontdict={'size': 20})

    return fig
```

### Visualize the iterative step¶

In [54]:
```
fig = plt.figure(figsize=(16, 10))
xlim = (-6, 6)
optimizer = BayesianOptimization(f=unknown_func, pbounds={'x': xlim}, verbose=0)

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    for i in range(15):
        optimizer.maximize(init_points=0, n_iter=1, kappa=5)
        fig = plot_gp(optimizer, x, y, fig=fig, xlim=xlim)
        display(plt.gcf())
        clear_output(wait=True)
```

# Draw Perceptron graph by graphviz

## Goal¶

This post aims to introduce how to draw a diagram of a perceptron.


## Libraries¶

In [1]:
```
from graphviz import Digraph
```

## Create a node list and dictionary for the edges¶

In [12]:
```
# List of nodes
l_nodes = ['1', 'x0', 'x1', 'y']

# Dictionary mapping from label name to the edge between two nodes
d_edges = {'b': ('1', 'y'),
           'w0': ('x0', 'y'),
           'w1': ('x1', 'y')}
```

## Visualize a graph for perceptron¶

In [13]:
```
# Create Digraph object
dot = Digraph()
dot.attr(rankdir='LR')

# Add the nodes
for n in l_nodes:
    dot.node(n)

# Add the labeled edges
for label, edges in d_edges.items():
    dot.edge(edges[0], edges[1], label=label)

# Fill node 1 with gray
dot.node('1', style='filled')

# Visualize the graph
dot
```
Out[13]:
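
The `Digraph` object renders inline in a Jupyter notebook. To save the diagram to disk instead, `graphviz` provides `render`; a small sketch (the output filename `perceptron` is arbitrary):

```
# Writes the DOT source to 'perceptron' and the image to 'perceptron.png'
dot.render('perceptron', format='png', view=False)
```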

# Parallel Plot for Categorical and Continuous Variables by Plotly Express

## Goal¶

This post aims to introduce how to draw a parallel plot for categorical and continuous variables with `Plotly Express`.


## Libraries¶

In [19]:
```
import pandas as pd
import plotly.express as px
import plotly.io as pio
pio.renderers.default = "png"
```

## Create continuous data¶

In [4]:
```
df = px.data.election()
df.head()
```
Out[4]:
district Coderre Bergeron Joly total winner result
0 101-Bois-de-Liesse 2481 1829 3024 7334 Joly plurality
1 102-Cap-Saint-Jacques 2525 1163 2675 6363 Joly plurality
2 11-Sault-au-Récollet 3348 2770 2532 8650 Coderre plurality
3 111-Mile-End 1734 4782 2514 9030 Bergeron majority
4 112-DeLorimier 1770 5933 3044 10747 Bergeron majority

## Visualize Parallel Plot for continuous data¶

In [20]:
```
fig = px.parallel_coordinates(df, color='total', color_continuous_scale=px.colors.sequential.Inferno)
fig
```
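By default, every numeric column becomes an axis. As a variant (a sketch assuming the same election dataframe), the `dimensions` argument restricts the plot to chosen columns:

```
fig = px.parallel_coordinates(df, dimensions=['Coderre', 'Bergeron', 'Joly', 'total'],
                              color='total',
                              color_continuous_scale=px.colors.sequential.Inferno)
fig
```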

## Create categorical data¶

In [9]:
```
df = px.data.election()
df.head()
```
Out[9]:
district Coderre Bergeron Joly total winner result
0 101-Bois-de-Liesse 2481 1829 3024 7334 Joly plurality
1 102-Cap-Saint-Jacques 2525 1163 2675 6363 Joly plurality
2 11-Sault-au-Récollet 3348 2770 2532 8650 Coderre plurality
3 111-Mile-End 1734 4782 2514 9030 Bergeron majority
4 112-DeLorimier 1770 5933 3044 10747 Bergeron majority

## Visualize Parallel Plot for categorical data¶

In [21]:
```
fig = px.parallel_categories(df, color="total", color_continuous_scale=px.colors.sequential.Inferno)
fig
```
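`px.parallel_categories` also accepts a `dimensions` argument; a sketch restricting the plot to the two categorical columns while keeping the continuous coloring:

```
fig = px.parallel_categories(df, dimensions=['winner', 'result'],
                             color='total',
                             color_continuous_scale=px.colors.sequential.Inferno)
fig
```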

# Split Up: dtreeviz (Part 5)

## Goal¶

This post aims to break down the `dtreeviz` module step by step to fully understand what is implemented. After fully understanding it, I would like to contribute to the module and submit a pull request.

I really like this module and would like to see it work for other tree-based libraries like XGBoost or LightGBM. I found the exact same request (issue 15) on GitHub, so I hope I can contribute to it.

This post is the 5th part:

• `ctreeviz_univar`


## `trees.ctreeviz_univar`¶

• L267: the beginning of the definition of `ctreeviz_univar`
• L272-275: treatment for pandas input
• L280-288: load the decision tree classifier object as `shadow_tree` together with other relevant attributes, e.g., the number of classes and the target values
• L290-302: setting labels and spine visibility
• L304-319: plotting a stacked histogram when `gtype=='barstacked'`
• L320-330: plotting a scatter plot with jitter
• L332: setting tick parameters
• L352-353: setting the legend
• L355-358: setting the title
• L360-362: plotting vertical lines at the `splits` between categories
In [53]:
```
from pathlib import Path
from graphviz.backend import run, view
import matplotlib.pyplot as plt
from numbers import Number
import matplotlib.patches as patches
import tempfile
import os
from sys import platform as PLATFORM
from colour import Color, rgb2hex
from typing import Mapping, List
from dtreeviz.utils import inline_svg_images, myround
from sklearn import tree
import graphviz

from dtreeviz.trees import *  # brings in ShadowDecTree, adjust_colors, add_classifier_legend, etc.

# How many bins should we have based upon number of classes
NUM_BINS = [0, 0, 10, 9, 8, 6, 6, 6, 5, 5, 5]
#           0, 1,  2, 3, 4, 5, 6, 7, 8, 9, 10

def ctreeviz_univar(ax, x_train, y_train, max_depth, feature_name, class_names,
                    target_name,
                    fontsize=14, fontname="Arial", nbins=25, gtype='strip',
                    show={'title', 'legend', 'splits'},
                    colors=None):
    if isinstance(x_train, pd.Series):
        x_train = x_train.values
    if isinstance(y_train, pd.Series):
        y_train = y_train.values

    colors = adjust_colors(colors)  # fill in the default color palette

    # ax.set_facecolor('#F9F9F9')
    ct = tree.DecisionTreeClassifier(max_depth=max_depth)
    ct.fit(x_train.reshape(-1, 1), y_train)

    shadow_tree = ShadowDecTree(ct, x_train.reshape(-1, 1), y_train,
                                feature_names=[feature_name], class_names=class_names)

    n_classes = shadow_tree.nclasses()
    class_values = shadow_tree.unique_target_values
    overall_feature_range = (np.min(x_train), np.max(x_train))
    color_values = colors['classes'][n_classes]
    color_map = {v: color_values[i] for i, v in enumerate(class_values)}
    X_colors = [color_map[cl] for cl in class_values]

    ax.set_xlabel(f"{feature_name}", fontsize=fontsize, fontname=fontname,
                  color=colors['axis_label'])
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.yaxis.set_visible(False)
    ax.spines['left'].set_visible(False)
    ax.spines['bottom'].set_linewidth(.3)

    X_hist = [x_train[y_train == cl] for cl in class_values]

    if gtype == 'barstacked':
        bins = np.linspace(start=overall_feature_range[0], stop=overall_feature_range[1],
                           num=nbins, endpoint=True)
        hist, bins, barcontainers = ax.hist(X_hist,
                                            color=X_colors,
                                            align='mid',
                                            histtype='barstacked',
                                            bins=bins,
                                            label=class_names)

        for patch in barcontainers:
            for rect in patch.patches:
                rect.set_linewidth(.5)
                rect.set_edgecolor(colors['edge'])
        ax.set_xlim(*overall_feature_range)
        ax.set_xticks(overall_feature_range)
        ax.set_yticks([0, max([max(h) for h in hist])])
    elif gtype == 'strip':
        # user should pass in short and wide fig
        sigma = .013
        mu = .08
        class_step = .08
        dot_w = 20
        ax.set_ylim(0, mu + n_classes * class_step)
        for i, bucket in enumerate(X_hist):
            y_noise = np.random.normal(mu + i * class_step, sigma, size=len(bucket))
            ax.scatter(bucket, y_noise, alpha=.7, marker='o', s=dot_w, c=color_map[i],
                       edgecolors=colors['scatter_edge'], lw=.3)

    ax.tick_params(axis='both', which='major', width=.3, labelcolor=colors['tick_label'],
                   labelsize=fontsize)

    splits = [node.split() for node in shadow_tree.internal]
    splits = sorted(splits)
    bins = [ax.get_xlim()[0]] + splits + [ax.get_xlim()[1]]

    pred_box_height = .07 * ax.get_ylim()[1]
    preds = []
    for i in range(len(bins) - 1):
        left = bins[i]
        right = bins[i + 1]
        inrange = y_train[(x_train >= left) & (x_train <= right)]
        values, counts = np.unique(inrange, return_counts=True)
        pred = values[np.argmax(counts)]
        rect = patches.Rectangle((left, 0), (right - left), pred_box_height, linewidth=.3,
                                 edgecolor=colors['edge'], facecolor=color_map[pred])
        ax.add_patch(rect)
        preds.append(pred)

    if 'legend' in show:
        add_classifier_legend(ax, class_names, class_values, color_map, target_name, colors)

    if 'title' in show:
        accur = ct.score(x_train.reshape(-1, 1), y_train)
        title = f"Classifier tree depth {max_depth}, training accuracy={accur*100:.2f}%"
        plt.title(title, fontsize=fontsize, color=colors['title'])

    if 'splits' in show:
        for split in splits:
            plt.plot([split, split], [*ax.get_ylim()], '--', color=colors['split_line'], linewidth=1)
```

## Create a toy classification example¶

In [48]:
```
import numpy as np
import pandas as pd
import graphviz
import matplotlib.pyplot as plt
from sklearn import tree

X = np.array([0, 1, 0.5, 10, 11, 12, 20, 21, 22, 30, 30, 32]).reshape(-1, 1)
Y = np.array(['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'd']).reshape(-1, 1)
clf = tree.DecisionTreeClassifier(max_depth=3)
clf = clf.fit(X, Y)

df = pd.DataFrame(data={'X': X.ravel(), 'Y': Y.ravel()}, index=range(len(X)))
df.plot(kind='bar');
plt.title('Sample Data for Univariate Classification');
```

## Visualize classification tree for univariate case¶

In [54]:
```
fig, ax = plt.subplots(1)
ctreeviz_univar(ax, pd.Series(X.ravel()), pd.Series(Y.ravel()),
                feature_name='X',
                target_name='Y',
                max_depth=4,
                class_names=['a', 'b', 'c', 'd'],
                gtype='barstacked',
                show={'title', 'splits'})
```

Note: when I apply `show={'legend'}`, I obtain the error below and have not yet figured out what is wrong.

```
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-42-c31e8b14db34> in <module>
4                 target_name='Y',
5                 max_depth=4,
----> 6                 class_names=['a', 'b', 'c', 'd']
7                )

<ipython-input-41-b466a69d927c> in ctreeviz_univar(ax, x_train, y_train, max_depth, feature_name, class_names, target_name, fontsize, fontname, nbins, gtype, show, colors)
85         for i, bucket in enumerate(X_hist):
86             y_noise = np.random.normal(mu+i*class_step, sigma, size=len(bucket))
---> 87             ax.scatter(bucket, y_noise, alpha=.7, marker='o', s=dot_w, c=color_map[i],
88                        edgecolors=colors['scatter_edge'], lw=.3)
89

KeyError: 0
```

# Split Up: dtreeviz (Part 4)

## Goal¶

This post aims to break down the `dtreeviz` module step by step to fully understand what is implemented. After fully understanding it, I would like to contribute to the module and submit a pull request.

I really like this module and would like to see it work for other tree-based libraries like XGBoost or LightGBM. I found the exact same request (issue 15) on GitHub, so I hope I can contribute to it.

This post is the 4th part: breaking down the `DTreeViz` class and the `rtreeviz_univar` method.


## `DTreeViz` class¶

• L23: the beginning of the `DTreeViz` class
• L24-25: the `__init__` method, taking a `dot` object as input
• L26-78: saving and viewing the visualization as an SVG file (see the usage sketch after this list)

### rtreeviz_univar¶

• L81: the beginning of the `rtreeviz_univar` method
• L94-102: initial settings for the ranges of the X and y data, converting them into numpy arrays
• L104-105: create a scikit-learn decision tree
• L121-122: plot the original X and y data points
• L125-126: plot the vertical lines for the decision boundaries (gray lines)
• L128-134: plot the horizontal mean lines (red by default)
• L136: change the appearance of the ticks
• L138-140: setting the title
• L142-143: setting the x and y labels based on `feature_name` and `target_name`
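Since L26-78 are about saving and viewing, here is a hedged usage sketch of that part of the API (assuming the 0.x interface, where the top-level `dtreeviz` function returns a `DTreeViz` object wrapping the `dot` source, and assuming a fitted model `clf` with matching `X`, `Y`):

```
# viz is a DTreeViz instance wrapping the generated graphviz source
viz = dtreeviz(clf, X, Y, feature_names=['X'], target_name='Y')
viz.save('tree.svg')  # save the rendered SVG (the filename is arbitrary)
viz.view()            # open the rendering in the default viewer
```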
In [4]:
```
from pathlib import Path
from graphviz.backend import run, view
import matplotlib.pyplot as plt
from numbers import Number
import matplotlib.patches as patches
import tempfile
import os
from sys import platform as PLATFORM
from colour import Color, rgb2hex
from typing import Mapping, List
from dtreeviz.utils import inline_svg_images, myround
from sklearn import tree
import graphviz
import numpy as np
import pandas as pd

from dtreeviz.trees import *  # brings in ShadowDecTree, adjust_colors, etc.

# How many bins should we have based upon number of classes
NUM_BINS = [0, 0, 10, 9, 8, 6, 6, 6, 5, 5, 5]
#           0, 1,  2, 3, 4, 5, 6, 7, 8, 9, 10

def rtreeviz_univar(ax,
                    x_train: (pd.Series, np.ndarray),  # 1 vector of X data
                    y_train: (pd.Series, np.ndarray),
                    max_depth=10,
                    feature_name: str = None,
                    target_name: str = None,
                    min_samples_leaf=1,
                    fontsize: int = 14,
                    show={'title', 'splits'},
                    split_linewidth=.5,
                    mean_linewidth=2,
                    markersize=None,
                    colors=None):
    if isinstance(x_train, pd.Series):
        x_train = x_train.values
    if isinstance(y_train, pd.Series):
        y_train = y_train.values

    colors = adjust_colors(colors)  # fill in the default color palette

    y_range = (min(y_train), max(y_train))  # same y axis for all
    overall_feature_range = (np.min(x_train), np.max(x_train))

    t = tree.DecisionTreeRegressor(max_depth=max_depth, min_samples_leaf=min_samples_leaf)
    t.fit(x_train.reshape(-1, 1), y_train)

    shadow_tree = ShadowDecTree(t, x_train.reshape(-1, 1), y_train, feature_names=[feature_name])

    splits = []
    for node in shadow_tree.internal:
        splits.append(node.split())
    splits = sorted(splits)
    bins = [overall_feature_range[0]] + splits + [overall_feature_range[1]]

    means = []
    for i in range(len(bins) - 1):
        left = bins[i]
        right = bins[i + 1]
        inrange = y_train[(x_train >= left) & (x_train <= right)]
        means.append(np.mean(inrange))

    ax.scatter(x_train, y_train, marker='o', alpha=.4, c=colors['scatter_marker'], s=markersize,
               edgecolor=colors['scatter_edge'], lw=.3)

    if 'splits' in show:
        for split in splits:
            ax.plot([split, split], [*y_range], '--', color=colors['split_line'], linewidth=split_linewidth)

        prevX = overall_feature_range[0]
        for i, m in enumerate(means):
            split = overall_feature_range[1]
            if i < len(splits):
                split = splits[i]
            ax.plot([prevX, split], [m, m], '-', color=colors['mean_line'], linewidth=mean_linewidth)
            prevX = split

    ax.tick_params(axis='both', which='major', width=.3, labelcolor=colors['tick_label'], labelsize=fontsize)

    if 'title' in show:
        title = f"Regression tree depth {max_depth}, samples per leaf {min_samples_leaf},\nTraining $R^2$={t.score(x_train.reshape(-1, 1), y_train):.3f}"
        plt.title(title, fontsize=fontsize, color=colors['title'])

    plt.xlabel(feature_name, fontsize=fontsize, color=colors['axis_label'])
    plt.ylabel(target_name, fontsize=fontsize, color=colors['axis_label'])
```

### Create a toy sample¶

In [42]:
```
import numpy as np
import graphviz
import matplotlib.pyplot as plt
from sklearn import tree

X = np.array([0, 1, 0.5, 10, 11, 12, 20, 21, 22, 30, 30, 32]).reshape(-1, 1)
Y = np.array([0., 0, 0, 50, 49, 50, 20, 21, 19, 90, 89, 91]).reshape(-1, 1)
clf = tree.DecisionTreeRegressor(max_depth=3)
clf = clf.fit(X, Y)

plt.scatter(x=X, y=Y, s=5);
plt.title('Sample Data for Univariate Regression');
```

### Visualize a tree using `rtreeviz_univar`¶

In [51]:
```
fig, ax = plt.subplots(1)
rtreeviz_univar(ax, pd.Series(X.ravel()), pd.Series(Y.ravel()),
                feature_name='X',
                target_name='Y',
                markersize=15)
```

# Split Up: dtreeviz (Part 3)

## Goal¶

This post aims to break down the `dtreeviz` module step by step to fully understand what is implemented. After fully understanding it, I would like to contribute to the module and submit a pull request.

I really like this module and would like to see it work for other tree-based libraries like XGBoost or LightGBM. I found the exact same request (issue 15) on GitHub, so I hope I can contribute to it.

This post is the 3rd part: breaking down the `ShadowDecTreeNode` class.


## `ShadowDecTreeNode` class¶

### Source github¶

In [2]:
```
import numpy as np
import pandas as pd
from collections import defaultdict, Sequence
from typing import Mapping, List, Tuple
from numbers import Number
from sklearn.utils import compute_class_weight


class ShadowDecTreeNode:
    """
    A node in a shadow tree.  Each node has left and right
    pointers to child nodes, if any.  As part of tree construction process, the
    samples examined at each decision node or at each leaf node are
    saved into field node_samples.
    """
    def __init__(self, shadow_tree, id, left=None, right=None):
        self.shadow_tree = shadow_tree
        self.id = id
        self.left = left
        self.right = right

    def split(self) -> (int, float):
        return self.shadow_tree.tree_model.tree_.threshold[self.id]

    def feature(self) -> int:
        return self.shadow_tree.tree_model.tree_.feature[self.id]

    def feature_name(self) -> (str, None):
        if self.shadow_tree.feature_names is not None:
            return self.shadow_tree.feature_names[self.feature()]
        return None

    def samples(self) -> List[int]:
        """
        Return a list of sample indexes associated with this node. If this is a
        leaf node, it indicates the samples used to compute the predicted value
        or class.  If this is an internal node, it is the number of samples used
        to compute the split point.
        """
        return self.shadow_tree.node_to_samples[self.id]

    def nsamples(self) -> int:
        """
        Return the number of samples associated with this node. If this is a
        leaf node, it indicates the samples used to compute the predicted value
        or class. If this is an internal node, it is the number of samples used
        to compute the split point.
        """
        return self.shadow_tree.tree_model.tree_.n_node_samples[self.id]  # same as len(self.node_samples)

    def split_samples(self) -> Tuple[np.ndarray, np.ndarray]:
        """
        Return the list of indexes to the left and the right of the split value.
        """
        samples = np.array(self.samples())
        node_X_data = self.shadow_tree.X_train[samples, self.feature()]
        split = self.split()
        left = np.nonzero(node_X_data < split)[0]
        right = np.nonzero(node_X_data >= split)[0]
        return left, right

    def isleaf(self) -> bool:
        return self.left is None and self.right is None

    def isclassifier(self):
        return self.shadow_tree.tree_model.tree_.n_classes > 1

    def prediction(self) -> (Number, None):
        """
        If this is a leaf node, return the predicted continuous value, if this is a
        regressor, or the class number, if this is a classifier.
        """
        if not self.isleaf():
            return None
        if self.isclassifier():
            counts = np.array(self.shadow_tree.tree_model.tree_.value[self.id][0])
            predicted_class = np.argmax(counts)
            return predicted_class
        else:
            return self.shadow_tree.tree_model.tree_.value[self.id][0][0]

    def prediction_name(self) -> (str, None):
        """
        If the tree model is a classifier and we know the class names,
        return the class name associated with the prediction for this leaf node.
        Return prediction class or value otherwise.
        """
        if self.isclassifier():
            if self.shadow_tree.class_names is not None:
                return self.shadow_tree.class_names[self.prediction()]
        return self.prediction()

    def class_counts(self) -> (List[int], None):
        """
        If this tree model is a classifier, return a list with the count
        associated with each class.
        """
        if self.isclassifier():
            if self.shadow_tree.class_weight is None:
                return np.array(np.round(self.shadow_tree.tree_model.tree_.value[self.id][0]), dtype=int)
            else:
                return np.round(self.shadow_tree.tree_model.tree_.value[self.id][0] / self.shadow_tree.class_weights, 2)
        return None

    def __str__(self):
        if self.left is None and self.right is None:
            return "<pred={value},n={n}>".format(value=round(self.prediction(), 1), n=self.nsamples())
        else:
            return "({f}@{s} {left} {right})".format(f=self.feature_name(),
                                                     s=round(self.split(), 1),
                                                     left=self.left if self.left is not None else '',
                                                     right=self.right if self.right is not None else '')
```

### Instantiate class objects¶

#### Create a tree model by scikit learn¶

In [3]:
```
import numpy as np
import graphviz
from sklearn import tree

X = np.array([[0, 0], [1, 1]])
Y = np.array([0, 1])
# Y = [0, 1]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)
dot_data = tree.export_graphviz(clf, out_file=None,
feature_names=[0, 1],
class_names=['0', '1'],
filled=True, rounded=True,
special_characters=True)
graph = graphviz.Source(dot_data)
graph
```
Out[3]:

### Create a `ShadowDecTreeNode`¶

ShadowDecTreeNode `__init__`

• L222-226: store input arguments as class members
• L228-308: define the same functions in tree objects like `split`, `feature` etc. or utility functions
In [4]:
```
# instantiate ShadowDecTree (assumed reconstruction; the class is defined in Part 2 of this series)
shadow_tree = ShadowDecTree(clf, X, Y, feature_names=[0, 1], class_names=[0, 1])
```
In [5]:
```
# instantiate ShadowDecTreeNode (assumed reconstruction: the root node, with no children attached)
node = ShadowDecTreeNode(shadow_tree, 0)
node
```
Out[5]:
`<__main__.ShadowDecTreeNode at 0x120eda908>`

In [6]:
```
# L228 split
node.split()
```
Out[6]:
`0.5`
In [7]:
```
# L231 feature
node.feature()
```
Out[7]:
`1`
In [8]:
```
# L239 samples
node.samples()
```
Out[8]:
`[0, 1]`
In [9]:
```
# L248 nsamples
node.nsamples()
```
Out[9]:
`2`
In [10]:
```
# L257 split_samples
node.split_samples()
```
Out[10]:
`(array([0]), array([1]))`
In [11]:
```
# L268 isleaf
node.isleaf()
```
Out[11]:
`True`
In [12]:
```
# L271 isclassifier
node.isclassifier()
```
Out[12]:
`array([ True])`
In [13]:
```
# L287 prediction_name
node.prediction_name()
```
Out[13]:
`0`
In [14]:
```
# L298 class_counts
node.class_counts()
```
Out[14]:
`array([1, 1])`

# Visualization Samples by Plotly Express

## Goal¶

This post aims to introduce examples of visualization by Plotly Express.

The following are introduced:

• Prepared example data
• Scatter plot
• basic
• basic + size
• basic + size + color
• basic + size + color + time
• heatmap + histogram


## Libraries¶

In [8]:
```
import pandas as pd
import numpy as np
import plotly_express as px
import plotly.io as pio
pio.renderers.default = "png"
```

### Car Share¶

In [2]:
```
df = px.data.carshare()
df.head()
```
Out[2]:
centroid_lat centroid_lon car_hours peak_hour
0 45.471549 -73.588684 1772.750000 2
1 45.543865 -73.562456 986.333333 23
2 45.487640 -73.642767 354.750000 20
3 45.522870 -73.595677 560.166667 23
4 45.453971 -73.738946 2836.666667 19

### Tips¶

In [3]:
```
df = px.data.tips()
df.head()
```
Out[3]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4

### Election¶

In [4]:
```
df = px.data.election()
df.head()
```
Out[4]:
district Coderre Bergeron Joly total winner result
0 101-Bois-de-Liesse 2481 1829 3024 7334 Joly plurality
1 102-Cap-Saint-Jacques 2525 1163 2675 6363 Joly plurality
2 11-Sault-au-Récollet 3348 2770 2532 8650 Coderre plurality
3 111-Mile-End 1734 4782 2514 9030 Bergeron majority
4 112-DeLorimier 1770 5933 3044 10747 Bergeron majority

### Wind¶

In [5]:
```
df = px.data.wind()
df.head()
```
Out[5]:
direction strength frequency
0 N 0-1 0.5
1 NNE 0-1 0.6
2 NE 0-1 0.5
3 ENE 0-1 0.4
4 E 0-1 0.4

### Gap Minder¶

In [6]:
```
df = px.data.gapminder()
df.head()
```
Out[6]:
country continent year lifeExp pop gdpPercap iso_alpha iso_num
0 Afghanistan Asia 1952 28.801 8425333 779.445314 AFG 4
1 Afghanistan Asia 1957 30.332 9240934 820.853030 AFG 4
2 Afghanistan Asia 1962 31.997 10267083 853.100710 AFG 4
3 Afghanistan Asia 1967 34.020 11537966 836.197138 AFG 4
4 Afghanistan Asia 1972 36.088 13079460 739.981106 AFG 4

## Scatter Plot¶

### Basic Scatter plot¶

In [10]:
```
px.scatter(df, x='gdpPercap', y='lifeExp', width=900, height=400)
```

### Scatter plot + size¶

In [11]:
```
px.scatter(df, x='gdpPercap', y='lifeExp', size='pop', width=900, height=400)
```

### Scatter plot + size + color¶

In [12]:
```
px.scatter(df, x='gdpPercap', y='lifeExp', size='pop', color='country', width=900, height=400)
```

### Scatter plot + size + color + time¶

In [13]:
```
px.scatter(df, x='gdpPercap', y='lifeExp', size='pop', color='country', animation_frame='year', width=900, height=400)
```
In [14]:
```
px.density_heatmap(df, x="gdpPercap", y="lifeExp", marginal_y="histogram", marginal_x="histogram")
```

# Split Up: dtreeviz (Part 2)

## Goal¶

This post aims to break down the `dtreeviz` module step by step to fully understand what is implemented. After fully understanding it, I would like to contribute to the module and submit a pull request.

I really like this module and would like to see it work for other tree-based libraries like XGBoost or LightGBM. I found the exact same request (issue 15) on GitHub, so I hope I can contribute to it.

This post is the 2nd part: breaking down the `ShadowDecTree` class.


## `ShadowDecTree` class¶

### Source github¶

In [109]:
```
import numpy as np
import pandas as pd
from collections import defaultdict, Sequence
from typing import Mapping, List, Tuple
from numbers import Number
from sklearn.utils import compute_class_weight


class ShadowDecTree:
    """
    The decision trees for classifiers and regressors from scikit-learn
    are built for efficiency, not ease of tree walking. This class
    is intended as a way to wrap all of that information in an easy to use
    package.
    This tree shadows a decision tree as constructed by scikit-learn's
    DecisionTree(Regressor|Classifier).  As part of build process, the
    samples considered at each decision node or at each leaf node are
    saved as a big dictionary for use by the nodes.
    Field leaves is list of shadow leaf nodes. Field internal is list of
    shadow non-leaf nodes. Field root is the shadow tree root.
    Parameters
    ----------
    class_names : (List[str],Mapping[int,str]). A mapping from target value
                  to target class name. If you pass in a list of strings,
                  target value i must be associated with class name[i]. You
                  can also pass in a dict that maps value to name.
    """
    def __init__(self, tree_model,
                 X_train,
                 y_train,
                 feature_names: List[str],
                 class_names: (List[str], Mapping[int, str]) = None):
        self.tree_model = tree_model
        self.feature_names = feature_names
        self.class_names = class_names
        self.class_weight = tree_model.class_weight

        if getattr(tree_model, 'tree_') is None:  # make sure model is fit
            tree_model.fit(X_train, y_train)

        if tree_model.tree_.n_classes > 1:
            if isinstance(self.class_names, dict):
                self.class_names = self.class_names
            elif isinstance(self.class_names, Sequence):
                self.class_names = {i: n for i, n in enumerate(self.class_names)}
            else:
                raise Exception(f"class_names must be dict or sequence, not {self.class_names.__class__.__name__}")

        if isinstance(X_train, pd.DataFrame):
            X_train = X_train.values
        self.X_train = X_train
        if isinstance(y_train, pd.Series):
            y_train = y_train.values
        self.y_train = y_train
        self.node_to_samples = ShadowDecTree.node_samples(tree_model, X_train)
        if self.isclassifier():
            self.unique_target_values = np.unique(y_train)
            self.class_weights = compute_class_weight(tree_model.class_weight, self.unique_target_values, self.y_train)

        tree = tree_model.tree_
        children_left = tree.children_left
        children_right = tree.children_right

        # use locals not args to walk() for recursion speed in python
        leaves = []
        internal = []  # non-leaf nodes

        def walk(node_id):
            if children_left[node_id] == -1 and children_right[node_id] == -1:  # leaf
                t = ShadowDecTreeNode(self, node_id)
                leaves.append(t)
                return t
            else:  # decision node
                left = walk(children_left[node_id])
                right = walk(children_right[node_id])
                t = ShadowDecTreeNode(self, node_id, left, right)
                internal.append(t)
                return t

        root_node_id = 0
        # record root to actual shadow nodes
        self.root = walk(root_node_id)
        self.leaves = leaves
        self.internal = internal

    def nclasses(self):
        return self.tree_model.tree_.n_classes[0]

    def nnodes(self) -> int:
        return self.tree_model.tree_.node_count

    def leaf_sample_counts(self) -> List[int]:
        return [self.tree_model.tree_.n_node_samples[leaf.id] for leaf in self.leaves]

    def isclassifier(self):
        return self.tree_model.tree_.n_classes > 1

    def get_split_node_heights(self, X_train, y_train, nbins) -> Mapping[int, int]:
        class_values = self.unique_target_values
        node_heights = {}
        # print(f"Goal {nbins} bins")
        for node in self.internal:
            # print(node.feature_name(), node.id)
            X_feature = X_train[:, node.feature()]
            overall_feature_range = (np.min(X_feature), np.max(X_feature))
            # print(f"range {overall_feature_range}")
            r = overall_feature_range[1] - overall_feature_range[0]

            bins = np.linspace(overall_feature_range[0],
                               overall_feature_range[1], nbins + 1)
            # bins = np.arange(overall_feature_range[0],
            #                  overall_feature_range[1] + binwidth, binwidth)
            # print(f"\tlen(bins)={len(bins):2d} bins={bins}")
            X, y = X_feature[node.samples()], y_train[node.samples()]
            X_hist = [X[y == cl] for cl in class_values]
            height_of_bins = np.zeros(nbins)
            for cl in class_values:
                hist, foo = np.histogram(X_hist[cl], bins=bins, range=overall_feature_range)
                # print(f"class {cl}: goal_n={len(bins):2d} n={len(hist):2d} {hist}")
                height_of_bins += hist
            node_heights[node.id] = np.max(height_of_bins)

            # print(f"\tmax={np.max(height_of_bins):2.0f}, heights={list(height_of_bins)}, {len(height_of_bins)} bins")
        return node_heights

    def predict(self, x: np.ndarray) -> Tuple[Number, List]:
        """
        Given an x-vector of features, return predicted class or value based upon
        this tree. Also return path from root to leaf as 2nd value in return tuple.
        Recursively walk down tree from root to appropriate leaf by
        comparing feature in x to node's split value.
        :param x: Feature vector to run down the tree to a leaf.
        :type x: np.ndarray
        :return: Predicted class or value based
        :rtype: Number
        """
        def walk(t, x, path):
            if t is None:
                return None
            path.append(t)
            if t.isleaf():
                return t
            if x[t.feature()] < t.split():
                return walk(t.left, x, path)
            return walk(t.right, x, path)

        path = []
        leaf = walk(self.root, x, path)
        return leaf.prediction(), path

    def tesselation(self):
        """
        Walk tree and return list of tuples containing a leaf node and bounding box
        list of (x1,y1,x2,y2) coordinates
        :return:
        :rtype:
        """
        bboxes = []

        def walk(t, bbox):
            if t is None:
                return None
            # print(f"Node {t.id} bbox {bbox} {'   LEAF' if t.isleaf() else ''}")
            if t.isleaf():
                bboxes.append((t, bbox))
                return t
            # shrink bbox for left, right and recurse
            s = t.split()
            if t.feature() == 0:
                walk(t.left,  (bbox[0], bbox[1], s, bbox[3]))
                walk(t.right, (s, bbox[1], bbox[2], bbox[3]))
            else:
                walk(t.left,  (bbox[0], bbox[1], bbox[2], s))
                walk(t.right, (bbox[0], s, bbox[2], bbox[3]))

        # create bounding box in feature space (not zeroed)
        f1_values = self.X_train[:, 0]
        f2_values = self.X_train[:, 1]
        overall_bbox = (np.min(f1_values), np.min(f2_values),  # x,y of lower left edge
                        np.max(f1_values), np.max(f2_values))  # x,y of upper right edge
        walk(self.root, overall_bbox)

        return bboxes

    @staticmethod
    def node_samples(tree_model, data) -> Mapping[int, list]:
        """
        Return dictionary mapping node id to list of sample indexes considered by
        the feature/split decision.
        """
        # Doc say: "Return a node indicator matrix where non zero elements
        #           indicates that the samples goes through the nodes."
        dec_paths = tree_model.decision_path(data)

        # each sample has path taken down tree
        node_to_samples = defaultdict(list)
        for sample_i, dec in enumerate(dec_paths):
            _, nz_nodes = dec.nonzero()
            for node_id in nz_nodes:
                node_to_samples[node_id].append(sample_i)

        return node_to_samples

    def __str__(self):
        return str(self.root)
```

### Instantiate class objects¶

#### Create a tree model by scikit learn¶

In [93]:
```
import numpy as np
import graphviz
from sklearn import tree

X = np.array([[0, 0], [1, 1]])
Y = np.array([0, 1])
# Y = [0, 1]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)
dot_data = tree.export_graphviz(clf, out_file=None,
feature_names=[0, 1],
class_names=['0', '1'],
filled=True, rounded=True,
special_characters=True)
graph = graphviz.Source(dot_data)
graph
```
Out[93]:
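
With the class defined, we can wrap the fitted scikit-learn tree. A hedged sketch (assuming `ShadowDecTreeNode` from Part 3 is also defined, since `walk` constructs shadow nodes):

```
# Wrap the fitted classifier; class_names is required for classifiers
shadow = ShadowDecTree(clf, X, Y, feature_names=['f0', 'f1'], class_names=['0', '1'])
print(shadow.nnodes())              # 3: one decision node and two leaves
print(shadow.nclasses())            # 2
print(shadow.leaf_sample_counts())  # [1, 1]
```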