Explain Image Classification by SHAP Deep Explainer

Goal

This post aims to introduce how to explain an image classification model (trained with PyTorch) using SHAP's Deep Explainer.

SHAP is a module for making black-box models interpretable. For example, an image classification can be explained by a score for each pixel of the input image, indicating how much that pixel contributes, positively or negatively, to the predicted class probability.

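As a minimal sketch of the workflow (assuming a trained PyTorch classifier model and an image tensor images of shape (N, C, H, W), both hypothetical, with shap installed):

import numpy as np
import shap

# Hypothetical inputs: `model` is a trained PyTorch classifier and
# `images` is a tensor of shape (N, C, H, W).
background = images[:100]        # background samples for the expected value
test_images = images[100:103]    # images to explain

explainer = shap.DeepExplainer(model, background)
shap_values = explainer.shap_values(test_images)  # one array per class

# Move channels last, (N, C, H, W) -> (N, H, W, C), for plotting
shap_numpy = [np.transpose(s, (0, 2, 3, 1)) for s in shap_values]
test_numpy = np.transpose(test_images.numpy(), (0, 2, 3, 1))
shap.image_plot(shap_numpy, test_numpy)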


Time performance with timeit and cell magic

Goal

This post aims to introduce how to measure the running time of your code or a function using the timeit module and the magic commands %%time and %timeit.

Reference

  • Time a Python Function

Libraries

In [1]:
import pandas as pd
import timeit

Measure a code snippet given as a string

In [2]:
timeit.timeit('2**10')
Out[2]:
0.01853936500265263

Measure the execution time for a function

In [4]:
def execute_abc(x=10):
    return 2 ** x
In [7]:
timeit.timeit(execute_abc)
Out[7]:
0.0984989230055362
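timeit.timeit runs the statement 1,000,000 times by default and returns the total elapsed seconds; the number and setup arguments control this. A small sketch:

import timeit

# Run 10,000 times; the return value is the total time in seconds
total = timeit.timeit('2 ** 10', number=10_000)
print(total / 10_000)  # average time per execution

# setup code (e.g., imports) is excluded from the measured time
timeit.timeit('sqrt(2.0)', setup='from math import sqrt', number=10_000)

# repeat() returns one total per run; the minimum is the most stable estimate
print(min(timeit.repeat('2 ** 10', repeat=5, number=10_000)))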

Measure the time by using magic commands in Jupyter Notebook

In [3]:
%%time
2**10
CPU times: user 4 µs, sys: 1e+03 ns, total: 5 µs
Wall time: 11 µs
Out[3]:
1024
In [9]:
%timeit 2**10
17.9 ns ± 0.597 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)

Performance difference between append and insert in Python

Goal

This post aims to compare the performance of append and insert in Python. The comparison uses a simple piece of code that counts up to a number, appends each value to a list, and then reverses the list.

We will see a significant difference between the two versions: run time grows linearly with append but quadratically with insert, as shown below. This is because append is amortized O(1), while insert(0, i) must shift every existing element, costing O(n) per call.

Reference

Libraries

In [26]:
from timeit import Timer
import pandas as pd
%matplotlib inline

Append

In [6]:
count = 10**5
In [9]:
def count_by_append(count):
    nums = []
    for i in range(count):
        nums.append(i)
    nums.reverse()
count_by_append(count)

The execution time is about 22 ms.

Insert

In [10]:
def count_by_insert(count):
    nums = []
    for i in range(count):
        nums.insert(0, i)
count_by_insert(count)      

The execution time is about 3.53 s.
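As an aside (not part of the original comparison), if prepending is really needed, collections.deque offers an O(1) appendleft that avoids the quadratic behavior:

from collections import deque

def count_by_appendleft(count):
    # deque.appendleft runs in O(1), unlike list.insert(0, ...)
    nums = deque()
    for i in range(count):
        nums.appendleft(i)
    return nums

count_by_appendleft(10 ** 5)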

Comparison

In [24]:
counts = [10 ** i for i in range(5)]
time_by_append = []
time_by_insert = []

for count in counts:
    print(f'Processing {count}')
    t = Timer(lambda: count_by_append(count))
    time_by_append.append(t.timeit(number=10))
    t = Timer(lambda: count_by_insert(count))
    time_by_insert.append(t.timeit(number=10))

df_performance = pd.DataFrame({'count': counts,
                               'count_by_append': time_by_append,
                               'count_by_insert': time_by_insert})
df_performance
Processing 1
Processing 10
Processing 100
Processing 1000
Processing 10000
Out[24]:
count count_by_append count_by_insert
0 1 0.000013 0.000011
1 10 0.000019 0.000027
2 100 0.000107 0.000227
3 1000 0.000927 0.005318
4 10000 0.009438 0.379457
In [34]:
# Plot the performance difference
df_performance.set_index('count').plot(title='Performance Comparison between Append and Insert');

Draw Perceptron graph by graphviz

Goal

This post aims to introduce how to draw a diagram of a perceptron using graphviz.


Libraries

In [1]:
from graphviz import Digraph

Create a node list and dictionary for the edges

In [12]:
# List of nodes
l_nodes = ['1', 'x0', 'x1', 'y']

# Dictionary mapping from label name to the edge between two nodes
d_edges = {'b': ('1', 'y'), 
           'w0': ('x0', 'y'), 
           'w1': ('x1', 'y')}

Visualize a graph for perceptron

In [13]:
# Create Digraph object
dot = Digraph()
dot.attr(rankdir='LR')

# Add nodes
for n in l_nodes:
    dot.node(n)        

# Add edges
for label, edges in d_edges.items(): 
    dot.edge(edges[0], edges[1], label=label)

# Fill node 1 by gray
dot.node('1', style='filled')
    
# Visualize the graph
dot
Out[13]:
[Rendered graph: nodes 1, x0, and x1 each point to y, with edges labeled b, w0, and w1 respectively; node 1 is filled gray.]
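To save the diagram to a file instead of rendering it inline, graphviz's render method can be used (a small sketch; the file name is arbitrary):

# Writes perceptron.png; cleanup=True removes the intermediate DOT file
dot.render('perceptron', format='png', cleanup=True)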

Implement Perceptron

Goal

This post aims to introduce how to implement a perceptron, the foundation of neural networks: a simple gate function returning 0 (no signal) or 1 (signal) for a given input.

In this post, the following gate functions are implemented:

  • AND
  • NAND
  • OR
  • XOR
$$ y = f(\mathbf{x}) = \begin{cases} 0 & (b + \mathbf{w}\cdot\mathbf{x} \le 0)\\ 1 & (b + \mathbf{w}\cdot\mathbf{x} > 0) \end{cases}$$
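Note that the gate implementations below fold the bias into a threshold on the right-hand side, i.e. they test wx > b, which is equivalent to -b + wx > 0. As a minimal sketch (the generic perceptron function and its arguments are my own, not from the original post):

import numpy as np

def perceptron(x, w, b):
    # Return 1.0 if b + w.x > 0, else 0.0 (the formula above)
    return float(b + np.dot(w, x) > 0)

# AND as a special case: w = (0.5, 0.5) and bias b = -0.6,
# matching AND(x0, x1, w0=0.5, w1=0.5, b=0.6) below
perceptron(np.array([1, 1]), np.array([0.5, 0.5]), -0.6)  # -> 1.0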

Implement AND gate

In [10]:
def AND(x0, x1, w0=0.5, w1=0.5, b=0.6):
    return ((x0 * w0 + x1 * w1) > b) * 1.0
In [11]:
for x0, x1 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(f"AND(x0={x0}, x1={x1}) = {AND(x0, x1)}")
AND(x0=0, x1=0) = 0.0
AND(x0=0, x1=1) = 0.0
AND(x0=1, x1=0) = 0.0
AND(x0=1, x1=1) = 1.0

Implement NAND gate

In [24]:
def NAND(x0, x1, w0=-0.5, w1=-0.5, b=-0.6):
    return ((x0 * w0 + x1 * w1) > b) * 1.0
In [25]:
for x0, x1 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(f"NAND(x0={x0}, x1={x1}) = {NAND(x0, x1)}")
NAND(x0=0, x1=0) = 1.0
NAND(x0=0, x1=1) = 1.0
NAND(x0=1, x1=0) = 1.0
NAND(x0=1, x1=1) = 0.0

Implement OR gate

In [34]:
def OR(x0, x1, w0=0.5, w1=0.5, b=0.2):
    return ((x0 * w0 + x1 * w1) > b) * 1.0
In [35]:
for x0, x1 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(f"OR(x0={x0}, x1={x1}) = {OR(x0, x1)}")
OR(x0=0, x1=0) = 0.0
OR(x0=0, x1=1) = 1.0
OR(x0=1, x1=0) = 1.0
OR(x0=1, x1=1) = 1.0

Implement XOR gate

XOR is not linearly separable, so it cannot be implemented by a single perceptron; instead, it is built by combining the NAND, OR, and AND gates above in two layers.

In [36]:
def XOR(x0, x1):
    n0 = NAND(x0, x1)
    n1 = OR(x0, x1)
    return AND(n0, n1)
In [37]:
for x0, x1 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(f"XOR(x0={x0}, x1={x1}) = {XOR(x0, x1)}")
    
XOR(x0=0, x1=0) = 0.0
XOR(x0=0, x1=1) = 1.0
XOR(x0=1, x1=0) = 1.0
XOR(x0=1, x1=1) = 0.0

Parallel Plot for Categorical and Continuous Variables by Plotly Express

Goal

This post aims to introduce how to draw parallel plots for categorical and continuous variables using Plotly Express.


Libraries

In [19]:
import pandas as pd
import plotly.express as px
import plotly.io as pio
pio.renderers.default = "png"

Load data with continuous variables

In [4]:
df = px.data.election()
df.head()
Out[4]:
district Coderre Bergeron Joly total winner result
0 101-Bois-de-Liesse 2481 1829 3024 7334 Joly plurality
1 102-Cap-Saint-Jacques 2525 1163 2675 6363 Joly plurality
2 11-Sault-au-Récollet 3348 2770 2532 8650 Coderre plurality
3 111-Mile-End 1734 4782 2514 9030 Bergeron majority
4 112-DeLorimier 1770 5933 3044 10747 Bergeron majority

Visualize Parallel Plot for continuous data

In [20]:
fig = px.parallel_coordinates(df,color='total', color_continuous_scale=px.colors.sequential.Inferno)
fig
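By default every numeric column becomes an axis; the dimensions argument restricts which columns are used (a sketch with the election columns):

fig = px.parallel_coordinates(df, dimensions=['Coderre', 'Bergeron', 'Joly', 'total'],
                              color='total',
                              color_continuous_scale=px.colors.sequential.Inferno)
fig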

Load data with categorical variables

In [9]:
df = px.data.election()
df.head()
Out[9]:
district Coderre Bergeron Joly total winner result
0 101-Bois-de-Liesse 2481 1829 3024 7334 Joly plurality
1 102-Cap-Saint-Jacques 2525 1163 2675 6363 Joly plurality
2 11-Sault-au-Récollet 3348 2770 2532 8650 Coderre plurality
3 111-Mile-End 1734 4782 2514 9030 Bergeron majority
4 112-DeLorimier 1770 5933 3044 10747 Bergeron majority

Visualize Parallel Plot for categorical data

In [21]:
fig = px.parallel_categories(df, color="total", color_continuous_scale=px.colors.sequential.Inferno)
fig
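Similarly, the dimensions argument selects which categorical columns become axes (a sketch using the two categorical columns of this dataset):

fig = px.parallel_categories(df, dimensions=['winner', 'result'], color='total',
                             color_continuous_scale=px.colors.sequential.Inferno)
fig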

Split Up: dtreeviz (Part 5)

Goal

This post aims to break down the dtreeviz module step by step to fully understand what is implemented. After fully understanding it, I would like to contribute to the module and submit a pull request.

I really like this module and would like to see it work with other tree-based libraries like XGBoost or LightGBM. I found the exact same request (issue 15) on GitHub, so I hope I can contribute to it.

This post is the 5th part:

  • ctreeviz_univar


trees.ctreeviz_univar

  • L267: the beginning of the definition of ctreeviz_univar
  • L272-275: treatment for pandas input
  • L277: load color properties
  • L280-288: load the decision tree classifier object as shadow_tree and other relevant attributes, e.g., the number of classes and target values
  • L290-302: set labels and spine visibility
  • L304-319: plot a stacked bar chart with histograms when gtype=='barstacked'
  • L320-330: plot a scatter plot with jitter
  • L332: set tick parameters
  • L352-353: set the legend
  • L355-358: set a title
  • L360-362: draw vertical split lines between categories
In [53]:
from pathlib import Path
from graphviz.backend import run, view
import matplotlib.pyplot as plt
from dtreeviz.shadow import *
from numbers import Number
import matplotlib.patches as patches
import tempfile
import os
from sys import platform as PLATFORM
from colour import Color, rgb2hex
from typing import Mapping, List
from dtreeviz.utils import inline_svg_images, myround
from dtreeviz.shadow import ShadowDecTree, ShadowDecTreeNode
from dtreeviz.colors import adjust_colors
from sklearn import tree
import graphviz

from dtreeviz.trees import *

# How many bins should we have based upon number of classes
NUM_BINS = [0, 0, 10, 9, 8, 6, 6, 6, 5, 5, 5]
          # 0, 1, 2,  3, 4, 5, 6, 7, 8, 9, 10

def ctreeviz_univar(ax, x_train, y_train, max_depth, feature_name, class_names,
                    target_name,
                    fontsize=14, fontname="Arial", nbins=25, gtype='strip',
                    show={'title','legend','splits'},
                    colors=None):
    if isinstance(x_train, pd.Series):
        x_train = x_train.values
    if isinstance(y_train, pd.Series):
        y_train = y_train.values

    colors = adjust_colors(colors)

    #    ax.set_facecolor('#F9F9F9')
    ct = tree.DecisionTreeClassifier(max_depth=max_depth)
    ct.fit(x_train.reshape(-1, 1), y_train)

    shadow_tree = ShadowDecTree(ct, x_train.reshape(-1, 1), y_train,
                                feature_names=[feature_name], class_names=class_names)

    n_classes = shadow_tree.nclasses()
    overall_feature_range = (np.min(x_train), np.max(x_train))
    class_values = shadow_tree.unique_target_values
    color_values = colors['classes'][n_classes]
    color_map = {v: color_values[i] for i, v in enumerate(class_values)}
    X_colors = [color_map[cl] for cl in class_values]

    ax.set_xlabel(f"{feature_name}", fontsize=fontsize, fontname=fontname,
                  color=colors['axis_label'])
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.yaxis.set_visible(False)
    ax.spines['left'].set_visible(False)
    ax.spines['bottom'].set_linewidth(.3)

    X_hist = [x_train[y_train == cl] for cl in class_values]

    if gtype == 'barstacked':
        bins = np.linspace(start=overall_feature_range[0], stop=overall_feature_range[1], num=nbins, endpoint=True)
        hist, bins, barcontainers = ax.hist(X_hist,
                                            color=X_colors,
                                            align='mid',
                                            histtype='barstacked',
                                            bins=bins,
                                            label=class_names)

        for patch in barcontainers:
            for rect in patch.patches:
                rect.set_linewidth(.5)
                rect.set_edgecolor(colors['edge'])
        ax.set_xlim(*overall_feature_range)
        ax.set_xticks(overall_feature_range)
        ax.set_yticks([0, max([max(h) for h in hist])])
    elif gtype == 'strip':
        # user should pass in short and wide fig
        sigma = .013
        mu = .08
        class_step = .08
        dot_w = 20
        ax.set_ylim(0, mu + n_classes*class_step)
        print('X_hist', X_hist)
        for i, bucket in enumerate(X_hist):
            y_noise = np.random.normal(mu+i*class_step, sigma, size=len(bucket))
            ax.scatter(bucket, y_noise, alpha=.7, marker='o', s=dot_w, c=color_map[i],
                       edgecolors=colors['scatter_edge'], lw=.3)

    ax.tick_params(axis='both', which='major', width=.3, labelcolor=colors['tick_label'],
                   labelsize=fontsize)

    splits = [node.split() for node in shadow_tree.internal]
    splits = sorted(splits)
    bins = [ax.get_xlim()[0]] + splits + [ax.get_xlim()[1]]

    pred_box_height = .07 * ax.get_ylim()[1]
    preds = []
    for i in range(len(bins) - 1):
        left = bins[i]
        right = bins[i + 1]
        inrange = y_train[(x_train >= left) & (x_train <= right)]
        values, counts = np.unique(inrange, return_counts=True)
        pred = values[np.argmax(counts)]
        rect = patches.Rectangle((left, 0), (right - left), pred_box_height, linewidth=.3,
                                 edgecolor=colors['edge'], facecolor=color_map[pred])
        ax.add_patch(rect)
        preds.append(pred)

    if 'legend' in show:
        add_classifier_legend(ax, class_names, class_values, color_map, target_name, colors)

    if 'title' in show:
        accur = ct.score(x_train.reshape(-1, 1), y_train)
        title = f"Classifier tree depth {max_depth}, training accuracy={accur*100:.2f}%"
        plt.title(title, fontsize=fontsize, color=colors['title'])

    if 'splits' in show:
        for split in splits:
            plt.plot([split, split], [*ax.get_ylim()], '--', color=colors['split_line'], linewidth=1)

Create a toy classification example

In [48]:
import numpy as np
import graphviz 
from sklearn import tree

X = np.array([0, 1, 0.5, 10, 11, 12, 20, 21, 22, 30, 30, 32]).reshape(-1, 1)
Y = np.array(['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'd']).reshape(-1, 1)
clf = tree.DecisionTreeClassifier(max_depth=3)
clf = clf.fit(X, Y)

df = pd.DataFrame(data={'X':X.ravel(), 'Y': Y.ravel()}, index=range(len(X)))
df.plot(kind='bar');
plt.title('Sample Data for Univariate Classification');

Visualize classification tree for univariate case

In [54]:
fig, ax = plt.subplots(1)
ctreeviz_univar(ax, pd.Series(X.ravel()), pd.Series(Y.ravel()), 
                feature_name='X', 
                target_name='Y',
                max_depth=4, 
                class_names=['a', 'b', 'c', 'd'], 
                gtype = 'barstacked',
                show={'title', 'splits'}
               )

Note: when I apply show={'legend'}, I obtain the error below. The traceback points at the gtype='strip' branch (the default when gtype is not passed): color_map is keyed by the class values (here the strings 'a' to 'd'), but the loop indexes it with the integer position i, so color_map[0] raises a KeyError.

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-42-c31e8b14db34> in <module>
      4                 target_name='Y',
      5                 max_depth=4,
----> 6                 class_names=['a', 'b', 'c', 'd']
      7                )

<ipython-input-41-b466a69d927c> in ctreeviz_univar(ax, x_train, y_train, max_depth, feature_name, class_names, target_name, fontsize, fontname, nbins, gtype, show, colors)
     85         for i, bucket in enumerate(X_hist):
     86             y_noise = np.random.normal(mu+i*class_step, sigma, size=len(bucket))
---> 87             ax.scatter(bucket, y_noise, alpha=.7, marker='o', s=dot_w, c=color_map[i],
     88                        edgecolors=colors['scatter_edge'], lw=.3)
     89 

KeyError: 0
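A possible fix (a sketch of mine, not the upstream code) is to look up the color by class value instead of by loop index, since X_hist is built in class_values order:

# Sketch: inside the gtype == 'strip' branch of ctreeviz_univar,
# index color_map by the class value rather than the loop counter.
for i, bucket in enumerate(X_hist):
    y_noise = np.random.normal(mu + i * class_step, sigma, size=len(bucket))
    ax.scatter(bucket, y_noise, alpha=.7, marker='o', s=dot_w,
               c=color_map[class_values[i]],
               edgecolors=colors['scatter_edge'], lw=.3)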

Split Up: dtreeviz (Part 4)

Goal

This post aims to break down the dtreeviz module step by step to fully understand what is implemented. After fully understanding it, I would like to contribute to the module and submit a pull request.

I really like this module and would like to see it work with other tree-based libraries like XGBoost or LightGBM. I found the exact same request (issue 15) on GitHub, so I hope I can contribute to it.

This post is the 4th part: breaking down the DTreeViz class and the rtreeviz_univar method.


DTreeViz class

  • L23: the beginning of the DTreeViz class
  • L24-25: the __init__ method, taking a dot object as input
  • L26-78: methods to save and view the visualization as an SVG file

rtreeviz_univar

  • L81: the beginning of the rtreeviz_univar method
  • L94-102: initial settings for the range of the X, y data and conversion into numpy arrays
  • L104-105: create a scikit-learn decision tree regressor
  • L121-122: plot the original X and y data points
  • L125-126: plot vertical lines for the decision boundaries (gray lines)
  • L128-134: plot horizontal mean lines (red by default)
  • L136: change the appearance of the ticks
  • L138-140: set the title
  • L142-143: set the x and y labels based on feature_name and target_name
In [4]:
from pathlib import Path
from graphviz.backend import run, view
import matplotlib.pyplot as plt
from dtreeviz.shadow import *
from numbers import Number
import matplotlib.patches as patches
import tempfile
import os
from sys import platform as PLATFORM
from colour import Color, rgb2hex
from typing import Mapping, List
from dtreeviz.utils import inline_svg_images, myround
from dtreeviz.shadow import ShadowDecTree, ShadowDecTreeNode
from dtreeviz.colors import adjust_colors
from sklearn import tree
import graphviz

# How many bins should we have based upon number of classes
NUM_BINS = [0, 0, 10, 9, 8, 6, 6, 6, 5, 5, 5]
          # 0, 1, 2,  3, 4, 5, 6, 7, 8, 9, 10

def rtreeviz_univar(ax,
                    x_train: (pd.Series, np.ndarray),  # 1 vector of X data
                    y_train: (pd.Series, np.ndarray),
                    max_depth = 10,
                    feature_name: str = None,
                    target_name: str = None,
                    min_samples_leaf = 1,
                    fontsize: int = 14,
                    show={'title','splits'},
                    split_linewidth=.5,
                    mean_linewidth = 2,
                    markersize=None,
                    colors=None):
    if isinstance(x_train, pd.Series):
        x_train = x_train.values
    if isinstance(y_train, pd.Series):
        y_train = y_train.values

    colors = adjust_colors(colors)

    y_range = (min(y_train), max(y_train))  # same y axis for all
    overall_feature_range = (np.min(x_train), np.max(x_train))

    t = tree.DecisionTreeRegressor(max_depth=max_depth, min_samples_leaf=min_samples_leaf)
    t.fit(x_train.reshape(-1,1), y_train)

    shadow_tree = ShadowDecTree(t, x_train.reshape(-1,1), y_train, feature_names=[feature_name])
    splits = []
    for node in shadow_tree.internal:
        splits.append(node.split())
    splits = sorted(splits)
    bins = [overall_feature_range[0]] + splits + [overall_feature_range[1]]

    means = []
    for i in range(len(bins) - 1):
        left = bins[i]
        right = bins[i + 1]
        inrange = y_train[(x_train >= left) & (x_train <= right)]
        means.append(np.mean(inrange))

    ax.scatter(x_train, y_train, marker='o', alpha=.4, c=colors['scatter_marker'], s=markersize,
               edgecolor=colors['scatter_edge'], lw=.3)

    if 'splits' in show:
        for split in splits:
            ax.plot([split, split], [*y_range], '--', color=colors['split_line'], linewidth=split_linewidth)

        prevX = overall_feature_range[0]
        for i, m in enumerate(means):
            split = overall_feature_range[1]
            if i < len(splits):
                split = splits[i]
            ax.plot([prevX, split], [m, m], '-', color=colors['mean_line'], linewidth=mean_linewidth)
            prevX = split

    ax.tick_params(axis='both', which='major', width=.3, labelcolor=colors['tick_label'], labelsize=fontsize)

    if 'title' in show:
        title = f"Regression tree depth {max_depth}, samples per leaf {min_samples_leaf},\nTraining $R^2$={t.score(x_train.reshape(-1,1),y_train):.3f}"
        plt.title(title, fontsize=fontsize, color=colors['title'])

    plt.xlabel(feature_name, fontsize=fontsize, color=colors['axis_label'])
    plt.ylabel(target_name, fontsize=fontsize, color=colors['axis_label'])

Create a toy sample

In [42]:
import numpy as np
import graphviz 
from sklearn import tree

X = np.array([0, 1, 0.5, 10, 11, 12, 20, 21, 22, 30, 30, 32]).reshape(-1, 1)
Y = np.array([0., 0, 0, 50, 49, 50, 20, 21, 19, 90, 89, 91]).reshape(-1, 1)
clf = tree.DecisionTreeRegressor(max_depth=3)
clf = clf.fit(X, Y)

plt.scatter(x=X, y=Y, s=5);
plt.title('Sample Data for Univariate Regression');

Visualize a tree using rtreeviz_univar

In [51]:
fig, ax = plt.subplots(1)
rtreeviz_univar(ax, pd.Series(X.ravel()), pd.Series(Y.ravel()), 
                feature_name='X', 
                target_name='Y',
                markersize=15)

Split Up: dtreeviz (Part 3)

Goal

This post aims to break down the dtreeviz module step by step to fully understand what is implemented. After fully understanding it, I would like to contribute to the module and submit a pull request.

I really like this module and would like to see it work with other tree-based libraries like XGBoost or LightGBM. I found the exact same request (issue 15) on GitHub, so I hope I can contribute to it.

This post is the 3rd part: breaking down ShadowDecTreeNode.


ShadowDecTreeNode class

Source: GitHub

In [2]:
import numpy as np
import pandas as pd
from collections import defaultdict
from collections.abc import Sequence  # Sequence moved to collections.abc in newer Python
from typing import Mapping, List, Tuple
from numbers import Number
from sklearn.utils import compute_class_weight

from dtreeviz.shadow import ShadowDecTree 
# skip ShadowDecTree Class
#

class ShadowDecTreeNode:
    """
    A node in a shadow tree.  Each node has left and right
    pointers to child nodes, if any.  As part of tree construction process, the
    samples examined at each decision node or at each leaf node are
    saved into field node_samples.
    """
    def __init__(self, shadow_tree, id, left=None, right=None):
        self.shadow_tree = shadow_tree
        self.id = id
        self.left = left
        self.right = right

    def split(self) -> (int,float):
        return self.shadow_tree.tree_model.tree_.threshold[self.id]

    def feature(self) -> int:
        return self.shadow_tree.tree_model.tree_.feature[self.id]

    def feature_name(self) -> (str,None):
        if self.shadow_tree.feature_names is not None:
            return self.shadow_tree.feature_names[ self.feature()]
        return None

    def samples(self) -> List[int]:
        """
        Return a list of sample indexes associated with this node. If this is a
        leaf node, it indicates the samples used to compute the predicted value
        or class.  If this is an internal node, it is the number of samples used
        to compute the split point.
        """
        return self.shadow_tree.node_to_samples[self.id]

    def nsamples(self) -> int:
        """
        Return the number of samples associated with this node. If this is a
        leaf node, it indicates the samples used to compute the predicted value
        or class. If this is an internal node, it is the number of samples used
        to compute the split point.
        """
        return self.shadow_tree.tree_model.tree_.n_node_samples[self.id] # same as len(self.node_samples)

    def split_samples(self) -> Tuple[np.ndarray, np.ndarray]:
        """
        Return the list of indexes to the left and the right of the split value.
        """
        samples = np.array(self.samples())
        node_X_data = self.shadow_tree.X_train[samples, self.feature()]
        split = self.split()
        left = np.nonzero(node_X_data < split)[0]
        right = np.nonzero(node_X_data >= split)[0]
        return left, right

    def isleaf(self) -> bool:
        return self.left is None and self.right is None

    def isclassifier(self):
        return self.shadow_tree.tree_model.tree_.n_classes > 1

    def prediction(self) -> (Number,None):
        """
        If this is a leaf node, return the predicted continuous value, if this is a
        regressor, or the class number, if this is a classifier.
        """
        if not self.isleaf(): return None
        if self.isclassifier():
            counts = np.array(self.shadow_tree.tree_model.tree_.value[self.id][0])
            predicted_class = np.argmax(counts)
            return predicted_class
        else:
            return self.shadow_tree.tree_model.tree_.value[self.id][0][0]

    def prediction_name(self) -> (str,None):
        """
        If the tree model is a classifier and we know the class names,
        return the class name associated with the prediction for this leaf node.
        Return prediction class or value otherwise.
        """
        if self.isclassifier():
            if self.shadow_tree.class_names is not None:
                return self.shadow_tree.class_names[self.prediction()]
        return self.prediction()

    def class_counts(self) -> (List[int],None):
        """
        If this tree model is a classifier, return a list with the count
        associated with each class.
        """
        if self.isclassifier():
            if self.shadow_tree.class_weight is None:
                return np.array(np.round(self.shadow_tree.tree_model.tree_.value[self.id][0]), dtype=int)
            else:
                return np.round(self.shadow_tree.tree_model.tree_.value[self.id][0]/self.shadow_tree.class_weights).astype(int)
        return None

    def __str__(self):
        if self.left is None and self.right is None:
            return "<pred={value},n={n}>".format(value=round(self.prediction(),1), n=self.nsamples())
        else:
            return "({f}@{s} {left} {right})".format(f=self.feature_name(),
                                                     s=round(self.split(),1),
                                                     left=self.left if self.left is not None else '',
                                                     right=self.right if self.right is not None else '')

Instantiate class objects

Create a tree model by scikit learn

In [3]:
import numpy as np
import graphviz 
from sklearn import tree

X = np.array([[0, 0], [1, 1]])
Y = np.array([0, 1])
# Y = [0, 1]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)
dot_data = tree.export_graphviz(clf, out_file=None, 
                     feature_names=[0, 1],  
                     class_names=['0', '1'],  
                     filled=True, rounded=True,  
                     special_characters=True)  
graph = graphviz.Source(dot_data)  
graph 
Out[3]:
[Rendered tree: the root splits on feature 1 at ≤ 0.5 (gini = 0.5, samples = 2, value = [1, 1], class = 0); the True branch is a leaf with gini = 0.0, samples = 1, value = [1, 0], class = 0, and the False branch a leaf with gini = 0.0, samples = 1, value = [0, 1], class = 1.]

Create a ShadowDecTreeNode

ShadowDecTreeNode __init__

  • L222-226: store input arguments as class members
  • L228-308: define counterparts of the tree object's functions, like split and feature, plus utility functions
In [4]:
# instantiate ShadowDecTree
shadow_tree = ShadowDecTree(tree_model=clf, X_train=X, y_train=Y, feature_names=[0, 1], class_names=[0, 1])
In [5]:
# instantiate ShadowDecTreeNode
shadow_tree_node0 = ShadowDecTreeNode(shadow_tree=shadow_tree, id=0)
shadow_tree_node0
Out[5]:
<__main__.ShadowDecTreeNode at 0x120eda908>

Methods under ShadowDecTreeNode

In [6]:
# L228 split
shadow_tree_node0.split()
Out[6]:
0.5
In [7]:
# L231 feature
shadow_tree_node0.feature()
Out[7]:
1
In [8]:
# L239 samples
shadow_tree_node0.samples()
Out[8]:
[0, 1]
In [9]:
# L248 nsamples
shadow_tree_node0.nsamples()
Out[9]:
2
In [10]:
# L257 split_samples
shadow_tree_node0.split_samples()
Out[10]:
(array([0]), array([1]))
In [11]:
# L268 isleaf
shadow_tree_node0.isleaf()
Out[11]:
True
In [12]:
# L271 isclassifier
shadow_tree_node0.isclassifier()
Out[12]:
array([ True])
In [13]:
# L287 prediction_name
shadow_tree_node0.prediction_name()
Out[13]:
0
In [14]:
# L298 class_counts
shadow_tree_node0.class_counts()
Out[14]:
array([1, 1])

Visualization Samples by Plotly Express

Goal

This post aims to introduce examples of visualizations created with Plotly Express.

The following are introduced:

  • Prepared example data
  • Scatter plot
    • basic
    • basic + size
    • basic + size + color
    • basic + size + color + time
    • heatmap + histogram



Libraries

In [8]:
import pandas as pd
import numpy as np
import plotly_express as px
import plotly.io as pio
pio.renderers.default = "png"

Load a prepared data

Car Share

In [2]:
df = px.data.carshare()
df.head()
Out[2]:
centroid_lat centroid_lon car_hours peak_hour
0 45.471549 -73.588684 1772.750000 2
1 45.543865 -73.562456 986.333333 23
2 45.487640 -73.642767 354.750000 20
3 45.522870 -73.595677 560.166667 23
4 45.453971 -73.738946 2836.666667 19

Tips

In [3]:
df = px.data.tips()
df.head()
Out[3]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4

Election

In [4]:
df = px.data.election()
df.head()
Out[4]:
district Coderre Bergeron Joly total winner result
0 101-Bois-de-Liesse 2481 1829 3024 7334 Joly plurality
1 102-Cap-Saint-Jacques 2525 1163 2675 6363 Joly plurality
2 11-Sault-au-Récollet 3348 2770 2532 8650 Coderre plurality
3 111-Mile-End 1734 4782 2514 9030 Bergeron majority
4 112-DeLorimier 1770 5933 3044 10747 Bergeron majority

Wind

In [5]:
df = px.data.wind()
df.head()
Out[5]:
direction strength frequency
0 N 0-1 0.5
1 NNE 0-1 0.6
2 NE 0-1 0.5
3 ENE 0-1 0.4
4 E 0-1 0.4

Gap Minder

In [6]:
df = px.data.gapminder()
df.head()
Out[6]:
country continent year lifeExp pop gdpPercap iso_alpha iso_num
0 Afghanistan Asia 1952 28.801 8425333 779.445314 AFG 4
1 Afghanistan Asia 1957 30.332 9240934 820.853030 AFG 4
2 Afghanistan Asia 1962 31.997 10267083 853.100710 AFG 4
3 Afghanistan Asia 1967 34.020 11537966 836.197138 AFG 4
4 Afghanistan Asia 1972 36.088 13079460 739.981106 AFG 4

Scatter Plot

Basic Scatter plot

In [10]:
px.scatter(df, x='gdpPercap', y='lifeExp', width=900, height=400)

Scatter plot + size

In [11]:
px.scatter(df, x='gdpPercap', y='lifeExp', size='pop', width=900, height=400)

Scatter plot + size + color

In [12]:
px.scatter(df, x='gdpPercap', y='lifeExp', size='pop', color='country', width=900, height=400)

Scatter plot + size + color + time

In [13]:
px.scatter(df, x='gdpPercap', y='lifeExp', size='pop', color='country', animation_frame='year', width=900, height=400)
Heatmap + histogram

In [14]:
px.density_heatmap(df, x="gdpPercap", y="lifeExp", marginal_y="histogram", marginal_x="histogram")