t-Test Basic

T-test Basic

This post covers one of the simplest hypothesis test, so-called T-test. Let's say you subscribe coffee beans weekly basis and measure the weight of each coffee beans bags and obtained the following observation.

Now the package says this coffee beans is 16 oz i.e., 453.592 g. Is this actually true from statistical point of view?

In [8]:
import pandas as pd
import numpy as np
import scipy as sp
from scipy import stats
from icecream import ic
In [34]:
beans = pd.Series([448.2, 451.2, 448.0, 452.5, 447.8, 446.7])

Manual Calculation for P-value and t-value

In [43]:
# mu_0 for Null Hypothesis
mu0 = 453.592

# Sample size
n = len(beans)
ic(n)

# Sample Mean
mu = beans.mean()
ic(mu)
# Degree of Freedam
dof = n - 1
ic(dof)

# Sample Stnadard Deviation
sigma = beans.std(ddof=1)
ic(sigma)

# Standard Error
se = sigma / np.sqrt(n)
ic(se)

# t-value
t = (mu - mu0) / se
ic(t)

# p-value - 2 * (1 - cdf(|ts|))
p = (1 - stats.t.cdf(abs(t), df=dof)) * 2
ic(p);
ic| n: 6
ic| mu: 449.06666666666666
ic| dof: 5
ic| sigma: 2.255363976538303
ic| se: 0.9207484877955424
ic| t: -4.9148419935696905
ic| p: 0.004417201842481733

Using scipy module

In [37]:
# Use scipy stats 
stats.ttest_1samp(beans, mu0)
Out[37]:
Ttest_1sampResult(statistic=-4.914841993569691, pvalue=0.004417201842481753)

If p-value is typically lower than 0.05 (alpha), we can rejust null-hypothesis (its population mean is mu0) and conclude this sample mean is significantly different from mu0.

Python Basic: List and Index

Goal

This post aims to introduce the operation around list and index often used for data structure and algorithm. This post covers the following operations:

  1. Create a list
  2. Access to an element
  3. Remove an element
  4. Divide a list
  5. Scan a list
In [44]:
from icecream import ic

Create a list

In [45]:
# Empty List
a0 = list()

# Create a list by elemnts
a1 = [0, 1, 2, 3, 4, 5]

ic(a0, a1);
ic| a0: [], a1: [0, 1, 2, 3, 4, 5]

When to create a list from p to q as an incremental list of int, we can use range. Note that p is included but q is not included. If we don't specify p, it becomes 0 by default.

In [46]:
p = 0
q = 10
ic(list(range(p, q)), list(range(10)));
ic| list(range(p, q)): [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    list(range(10)): [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Access to an element by index

Index is starting from 0. If you would like to access to N-th element, you need to specify by (N-1)

In [52]:
a2 = list(range(10))

# 1st element
ic(a2[0])

# 5th element
ic(a2[4]);

# The last element
ic(a2[-1]);

# The 2nd last element
ic(a2[-2]);
ic| a2[0]: 0
ic| a2[4]: 4
ic| a2[-1]: 9
ic| a2[-2]: 8

Slice a list by a range of index

When we want to extract a set of elements from pth element to qth element, we can use :. If we want to specify the range counting from the last element, we could use negative index like -p.

In [48]:
p = 2 # From 3rd element (included) 
q = 5 # To 6th element (excluded)
a2 = list(range(10))
ic(p, q, a2, a2[p:q]);
ic| p: 2, q: 5, a2: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], a2[p:q]: [2, 3, 4]
In [49]:
ic(q, a2[0:q], a2[:q]);
ic| q: 5, a2[0:q]: [0, 1, 2, 3, 4], a2[:q]: [0, 1, 2, 3, 4]
In [50]:
ic(q, a2[-3:], a2[-3:-1]);
ic| q: 5, a2[-3:]: [7, 8, 9], a2[-3:-1]: [7, 8]

Remove an element

Remove an element by pop

In [54]:
a2 = list(range(10))

# Get the 1st element and remove it from a list
e0 = a2.pop(0)
ic(e0, a2);

# Get the last element and remove it from a list
e1 = a2.pop(-1)
ic(e1, a2);
ic| e0: 0, a2: [1, 2, 3, 4, 5, 6, 7, 8, 9]
ic| e1: 9, a2: [1, 2, 3, 4, 5, 6, 7, 8]

Divide a list into before and after by a pivot element

When we want to recursively apply a process to before- and after- sublist divided by a pivot element, we need an operation like below:

In [56]:
a2 = list(range(10))

i = 3 # index for the pivot value

ic(a2[:i], a2[i], a2[i+1:]);
ic| a2[:i]: [0, 1, 2], a2[i]: 3, a2[i+1:]: [4, 5, 6, 7, 8, 9]

Scan a list by for loop

  1. Use enumerate to get each index and element itself
  2. Use range to generate an index and access each element by index
In [61]:
a3 = [3,4,1,7,0]

for i, e in enumerate(a3):
    ic(i, e)
ic| i: 0, e: 3
ic| i: 1, e: 4
ic| i: 2, e: 1
ic| i: 3, e: 7
ic| i: 4, e: 0
In [62]:
a3 = [3,4,1,7,0]

for i in range(len(a3)):
    ic(i, a3[i])
ic| i: 0, a3[i]: 3
ic| i: 1, a3[i]: 4
ic| i: 2, a3[i]: 1
ic| i: 3, a3[i]: 7
ic| i: 4, a3[i]: 0

Introduction to Bayesian Optimization

Goal

This notebook aims to introduce how Bayesian Optimization works using bayesian-optimization module.

Bayesian Optimization is the way of estimating the unknown function where we can choose the arbitrary input $x$ and obtain the response from that function. The outcome of Bayesian Optimization is to obtain the mean and confidence interval of the function we look for by step. You could also stop earlier or decide go further iteratively.

This will cover the very first toy example of Bayesian Optimization by defining "black-box" function and show how interactively or step-by-step Bayesian Optimization will figure and estimate this "black-box" function.

Reference

Libraries

In [41]:
from bayes_opt import BayesianOptimization
from bayes_opt import UtilityFunction
import numpy as np
import warnings
from IPython.display import display, clear_output
import matplotlib.pyplot as plt
from matplotlib import gridspec
%matplotlib inline

Unknown Function

We can have any function to estimate here. As an example, we will have 1-D function defined by the following equation:

$$f(x) = 3e^{-(x-3)^{2}} - e^{-(x-2)^2} + 2 e^{-(x+3)^2}$$
In [2]:
def unknown_func(x):
    return 3 * np.exp(-(x-3) **2) - np.exp(-(x-2) **2) + 2 * np.exp(-(x + 3) **2)

If we visualize the unknown function (as a reference), we can plot like below. Note that we are not supposed to know this plot since this function is "black-box"

In [4]:
x = np.linspace(-6, 6, 10000).reshape(-1, 1)
y = unknown_func(x)

plt.plot(x, y);
plt.title('1-D Unknown Function to be estimated');
plt.xlabel('X');
plt.ylabel('Response from the function');

Bayesian Optimization

First of all, we need to create BayesianOptimization object by passing the function f you want to estimate with its input boundary as pbounds.

In [5]:
optimizer = BayesianOptimization(f=unknown_func, pbounds={'x': (-6, 6)}, verbose=0)
optimizer
Out[5]:
<bayes_opt.bayesian_optimization.BayesianOptimization at 0x11ab55dd8>

Then, we can start to explore this function by trying different inputs.

  • init_points is the number of initial points to start with.
  • n_iter is the number of iteration. This optimizer.maximize hold the state so whenever you execute it, it will continue from the last iteration.

Helper functions

In [26]:
def posterior(optimizer, x_obs, y_obs, grid):
    optimizer._gp.fit(x_obs, y_obs)

    mu, sigma = optimizer._gp.predict(grid, return_std=True)
    return mu, sigma

def plot_gp(optimizer, x, y, fig=None, xlim=None):
    if fig is None:
        fig = plt.figure(figsize=(16, 10))
    steps = len(optimizer.space)
    fig.suptitle(
        'Gaussian Process and Utility Function After {} Steps'.format(steps),
        fontdict={'size':30}
    )
    
    gs = gridspec.GridSpec(2, 1, height_ratios=[3, 1]) 
    axis = plt.subplot(gs[0])
    acq = plt.subplot(gs[1])
    
    x_obs = np.array([[res["params"]["x"]] for res in optimizer.res])
    y_obs = np.array([res["target"] for res in optimizer.res])
    
    mu, sigma = posterior(optimizer, x_obs, y_obs, x)
    axis.plot(x, y, linewidth=3, label='Target')
    axis.plot(x_obs.flatten(), y_obs, 'D', markersize=8, label=u'Observations', color='r')
    axis.plot(x, mu, '--', color='k', label='Prediction')

    axis.fill(np.concatenate([x, x[::-1]]), 
              np.concatenate([mu - 1.9600 * sigma, (mu + 1.9600 * sigma)[::-1]]),
        alpha=.3, fc='C0', ec='None', label='95% confidence interval')
    if xlim is not None:
        axis.set_xlim(xlim)
    axis.set_ylim((None, None))
    axis.set_ylabel('f(x)', fontdict={'size':20})
    axis.set_xlabel('x', fontdict={'size':20})
    
    utility_function = UtilityFunction(kind="ucb", kappa=5, xi=0)
    utility = utility_function.utility(x, optimizer._gp, 0)
    acq.plot(x, utility, label='Utility Function', color='C3')
    acq.plot(x[np.argmax(utility)], np.max(utility), 'o', markersize=15, 
             label=u'Next Best Guess', markerfacecolor='gold', markeredgecolor='k', markeredgewidth=1)
    if xlim is not None:
        acq.set_xlim(xlim)
    acq.set_ylim((np.min(utility) , np.max(utility) + 0.5))
    acq.set_ylabel('Utility', fontdict={'size':20})
    acq.set_xlabel('x', fontdict={'size':20})
    
    axis.legend(loc=2, bbox_to_anchor=(1.01, 1), borderaxespad=0.)
    acq.legend(loc=2, bbox_to_anchor=(1.01, 1), borderaxespad=0.)
    return fig

Visualize the iterative step

In [54]:
# fig = plt.figure(figsize=(16, 10))
xlim = (-6, 6)
optimizer = BayesianOptimization(f=unknown_func, pbounds={'x': xlim}, verbose=0)

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    for i in range(15):
        break
#         optimizer.maximize(init_points=0, n_iter=1, kappa=5)
#         fig = plot_gp(optimizer, x, y, fig=fig, xlim=xlim)
#         display(plt.gcf())
#         clear_output(wait=True)

2019-10-31 16-11-38 2019-10-31 16_15_50_bayesian_optimization

cProfile Examples in Python

Goal

This post aims to introduce how to use cProfile to measure the running time for each statement and find the bottleneck of your program.

Libraries

In [1]:
import cProfile

Define your function

define your sub functions

In [37]:
def linear_func(n):
    for i in range(n):
        pass
    return

def quad_func(n):
    for i in range(n):
        for i in range(n):
            pass

    return

def exp_func(n):
    if n <= 1:
        return n
    return exp_func(n-1) + exp_func(n-2)
In [38]:
def main_func(n):
    linear_func(n)
    quad_func(n)
    exp_func(n)

Profile the main function

In [40]:
cProfile.run('main_func(n=20)')
         21897 function calls (7 primitive calls) in 0.006 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 <ipython-input-37-393400eb079b>:1(linear_func)
  21891/1    0.006    0.000    0.006    0.006 <ipython-input-37-393400eb079b>:13(exp_func)
        1    0.000    0.000    0.000    0.000 <ipython-input-37-393400eb079b>:6(quad_func)
        1    0.000    0.000    0.006    0.006 <ipython-input-38-1333493d3326>:1(main_func)
        1    0.000    0.000    0.006    0.006 <string>:1(<module>)
        1    0.000    0.000    0.006    0.006 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}


How to read the result

  • ncalls is the number of calls.
  • tottime is a total of the time spent.
  • percall is the average time for each call, i.e., tottime divided by ncalls
  • cumtime is the cumulative time spent.
  • (2nd) percall is the quotient of cumtime divided by primitive calls
  • filename:lineno(function) indicates the information about the function with the format "{file name}:{line number}{function name}"

Time Series Forecasting for Daily Births Dataset by Prophet

Goal

This post aims to introduce how to forecast the time series data for daily births dataset using Prophet.

Prohpeht Forecasting Model

$$ y (y) = g(t) + s(t) + h(t) + \epsilon_t $$
  • $g(t)$: a growth term, which is a trend curve
  • $s(t)$: a seasonality term, which periodically changes
  • $h(t)$: a holiday term, which indicates irregular events (given by users)
  • $\epsilon_t$: an error term

Reference

Libraries

In [55]:
import pandas as pd
import numpy as np
import fbprophet
from fbprophet.plot import add_changepoints_to_plot
import warnings
import matplotlib.pyplot as plt
%matplotlib inline

Load a female daily births dataset

In [30]:
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/daily-total-female-births.csv"
df = pd.read_csv(url, parse_dates=['Date'], date_parser=pd.to_datetime)
df.columns = ['ds', 'y']
df.head()
Out[30]:
ds y
0 1959-01-01 35
1 1959-01-02 32
2 1959-01-03 30
3 1959-01-04 31
4 1959-01-05 44

Visualize the raw data

In [40]:
plt.plot(df['ds'], df['y']);
plt.title('Daily Female Births in 1959');

Create a Prophet instance

In [76]:
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    m = fbprophet.Prophet(yearly_seasonality=True, daily_seasonality=False, 
                          changepoint_range=0.9, 
                          changepoint_prior_scale=0.5,
                          seasonality_mode='multiplicative')
    m.fit(df);
In [77]:
future = m.make_future_dataframe(periods=50, freq='d')
forecast = m.predict(future)

Visualize

By Component

In [78]:
m.plot_components(forecast);

Plot raw data with forecast

In [79]:
m.plot(forecast);

Change point detection

No change is detected.

In [80]:
fig = m.plot(forecast)
a = add_changepoints_to_plot(fig.gca(), m, forecast);

Gradient descent implementation from scratch

Goal

This post aims to introduce how to implement Gradient Descent from scratch.

image

Reference

Libraries

In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Define the function to be explored

In [2]:
def surface_function(x):
    return np.sum(x**2)
In [63]:
delta = 0.025
x = np.arange(-3.0, 3.0, delta)
y = np.arange(-3.0, 3.0, delta)
X, Y = np.meshgrid(x, y)
Z = np.add(X**2, Y**2)
fig, ax = plt.subplots()
CS = ax.contour(X, Y, Z)
ax.clabel(CS, inline=1, fontsize=10);

Compute gradient given a function

In [36]:
def numerical_gradient(f, x0):
    h = 1e-5
    grad = np.zeros_like(x0)

    
    for i in range(x0.size):
        tmp = x0[i]
        
        # compute f(x0 + h)
        x0[i] = tmp + h
        fx0_p = f(x0)

        
        # compute f(x0 - h)
        x0[i] = tmp - h
        fx0_m = f(x0)
        
        grad[i] = (fx0_p - fx0_m) / (2*h)
        x0[i] = tmp
        
    return grad
        
        
In [37]:
numerical_gradient(surface_function, np.array([0., 1.]))
Out[37]:
array([0., 2.])

Define gradient descent function

In [84]:
class GradientDescent:
    def __init__(self):
        self.trace = []
    
    def gradient_descent(self, f, init_x, lr=0.01, steps=200):
        x = init_x
        for i in range(steps):
            self.trace.append(list(x))
            gradient = numerical_gradient(f, x)
            x -= lr * gradient
        

        return x
In [85]:
init = np.array([3., 3.])
gd = GradientDescent()
gd.gradient_descent(surface_function, init)
Out[85]:
array([0.05276384, 0.05276384])

Plot the trace on the surface function

In [89]:
df_trace = pd.DataFrame(gd.trace, columns=['x', 'y']).reset_index()
df_trace.head()
Out[89]:
index x y
0 0 3.000000 3.000000
1 1 2.940000 2.940000
2 2 2.881200 2.881200
3 3 2.823576 2.823576
4 4 2.767104 2.767104
In [106]:
fig, ax = plt.subplots(figsize=(8, 5))
CS = ax.contour(X, Y, Z)
ax.clabel(CS, inline=1, fontsize=10);
df_trace.iloc[::15, :].plot(kind='scatter', x='x', y='y', c='index', cmap='Reds', marker='x', s=200, 
              ax=ax, title='Trace of Gradient Descent');

Saving Machine Learning Models by joblib

Goal

This post aims to introduce how to save the machine learning model using joblib.

Libraries

In [4]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_boston
import joblib

Create a data

In [5]:
boston = load_boston()
X, y = boston.data, boston.target

Train a model

In [6]:
reg = LinearRegression()
reg.fit(X, y)
Out[6]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

Save the model as pickle

In [7]:
joblib.dump(reg, 'linear_regression.pkl')
Out[7]:
['linear_regression.pkl']

check the saved model

In [11]:
!ls | grep linear*pkl
linear_regression.pkl

Load the saved model

In [13]:
loaded_reg = joblib.load('linear_regression.pkl')
loaded_reg
Out[13]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

Explain the prediction for ImageNet using SHAP

Goal

This post aims to introduce how to explain the prediction for ImageNet using SHAP.

image

Reference

Libraries

In [82]:
import keras
from keras.applications.vgg16 import VGG16, preprocess_input, decode_predictions
from keras.preprocessing import image
import requests
from skimage.segmentation import slic
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import shap
import warnings

%matplotlib inline

Configuration

In [77]:
# make a color map
from matplotlib.colors import LinearSegmentedColormap
colors = []
for l in np.linspace(1, 0, 100):
    colors.append((245 / 255, 39 / 255, 87 / 255, l))
for l in np.linspace(0, 1, 100):
    colors.append((24 / 255, 196 / 255, 93 / 255, l))
cm = LinearSegmentedColormap.from_list("shap", colors)

Load pre-trained VGG16 model

In [2]:
# load model data
r = requests.get('https://s3.amazonaws.com/deep-learning-models/image-models/imagenet_class_index.json')
feature_names = r.json()
model = VGG16()
WARNING:tensorflow:From /Users/hiro/anaconda3/envs/py367/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
Downloading data from https://github.com/fchollet/deep-learning-models/releases/download/v0.1/vgg16_weights_tf_dim_ordering_tf_kernels.h5
553467904/553467096 [==============================] - 62s 0us/step

Load an image data

In [16]:
# load an image
file = "../images/apple-banana.jpg"
img = image.load_img(file, target_size=(224, 224))
img_orig = image.img_to_array(img)
plt.imshow(img);
plt.axis('off');

Segmentation

In [18]:
# Create segmentation to explain by segment, not every pixel
segments_slic = slic(img, n_segments=30, compactness=30, sigma=3)

plt.imshow(segments_slic);
plt.axis('off');
In [37]:
segments_slic
Out[37]:
array([[ 0,  0,  0, ...,  4,  4,  4],
       [ 0,  0,  0, ...,  4,  4,  4],
       [ 0,  0,  0, ...,  4,  4,  4],
       ...,
       [22, 22, 22, ..., 21, 21, 21],
       [22, 22, 22, ..., 21, 21, 21],
       [22, 22, 22, ..., 21, 21, 21]])

Utility Functions for masking and preprocessing

In [19]:
# define a function that depends on a binary mask representing if an image region is hidden
def mask_image(zs, segmentation, image, background=None):
    
    if background is None:
        background = image.mean((0, 1))
        
    # Create an empty 4D array
    out = np.zeros((zs.shape[0], 
                    image.shape[0], 
                    image.shape[1], 
                    image.shape[2]))
    
    for i in range(zs.shape[0]):
        out[i, :, :, :] = image
        for j in range(zs.shape[1]):
            if zs[i, j] == 0:
                out[i][segmentation == j, :] = background
    return out


def f(z):
    return model.predict(
        preprocess_input(mask_image(z, segments_slic, img_orig, 255)))

def fill_segmentation(values, segmentation):
    out = np.zeros(segmentation.shape)
    for i in range(len(values)):
        out[segmentation == i] = values[i]
    return out
In [39]:
masked_images = mask_image(np.zeros((1,50)), segments_slic, img_orig, 255)

plt.imshow(masked_images[0][:,:, 0]);
plt.axis('off');

Create an explainer and shap values

In [83]:
# use Kernel SHAP to explain the network's predictions
explainer = shap.KernelExplainer(f, np.zeros((1,50)))

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    shap_values = explainer.shap_values(np.ones((1,50)), nsamples=100) # runs VGG16 1000 times

Obtain the prediction with the highest probability

In [85]:
predictions = model.predict(preprocess_input(np.expand_dims(img_orig.copy(), axis=0)))
top_preds = np.argsort(-predictions)
In [86]:
pd.Series(data={feature_names[str(inds[i])][1]:predictions[0, inds[i]]  for i in range(10)}).plot(kind='bar', title='Top 10 Predictions');

Explain the prediction by visualization

The following image is explained well for banana.

In [87]:
# Visualize the explanations
fig, axes = plt.subplots(nrows=1, ncols=4, figsize=(12,4))
inds = top_preds[0]
axes[0].imshow(img)
axes[0].axis('off')

max_val = np.max([np.max(np.abs(shap_values[i][:,:-1])) for i in range(len(shap_values))])
for i in range(3):
    m = fill_segmentation(shap_values[inds[i]][0], segments_slic)
    axes[i+1].set_title(feature_names[str(inds[i])][1])
    axes[i+1].imshow(np.array(img.convert('LA'))[:, :, 0], alpha=0.15)
    im = axes[i+1].imshow(m, cmap=cm, vmin=-max_val, vmax=max_val)
    axes[i+1].axis('off')
cb = fig.colorbar(im, ax=axes.ravel().tolist(), label="SHAP value", orientation="horizontal", aspect=60)
cb.outline.set_visible(False)
plt.show()