PyTorch Basic Operations

Goal

This post aims to introduce basic PyTorch operations, e.g., addition and multiplication.

Libraries

In [2]:
import numpy as np
import pandas as pd
import torch

Create a Tensor

In [5]:
t_x1 = torch.Tensor([[1, 2, 3],
                     [4, 5, 6],
                     [7, 8, 9]])

t_x2 = torch.Tensor([[9, 8, 7],
                     [6, 5, 4],
                     [3, 2, 1]])
print(t_x1)
print(t_x2)
tensor([[1., 2., 3.],
        [4., 5., 6.],
        [7., 8., 9.]])
tensor([[9., 8., 7.],
        [6., 5., 4.],
        [3., 2., 1.]])

Addition

+ operator

In [6]:
t_x1 + t_x2
Out[6]:
tensor([[10., 10., 10.],
        [10., 10., 10.],
        [10., 10., 10.]])
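
The goal also mentions multiplication, which the cells above do not cover. As a minimal sketch (not part of the original notebook), the same two tensors can be combined with `torch.add`, the element-wise `*` operator, and the matrix-product `@` operator. The names `t_sum`, `t_elementwise`, and `t_matmul` are illustrative.

```python
import torch

t_x1 = torch.Tensor([[1, 2, 3],
                     [4, 5, 6],
                     [7, 8, 9]])
t_x2 = torch.Tensor([[9, 8, 7],
                     [6, 5, 4],
                     [3, 2, 1]])

# torch.add is the functional form of the + operator
t_sum = torch.add(t_x1, t_x2)

# * multiplies the tensors element by element
t_elementwise = t_x1 * t_x2

# @ (equivalent to torch.matmul) computes the matrix product
t_matmul = t_x1 @ t_x2

print(t_sum)
print(t_elementwise)
print(t_matmul)
```

Note that `*` and `@` give different results: the first multiplies matching entries, the second performs row-by-column multiplication.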

Neural Network for Classification

Goal

This post aims to introduce a (shallow) neural network for classification using scikit-learn.

Libraries

In [2]:
import pandas as pd
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
%matplotlib inline

Load Breast Cancer dataset

In [5]:
breast_cancer = load_breast_cancer()
df_breast_cancer = pd.DataFrame(breast_cancer['data'], columns=breast_cancer['feature_names'])
df_breast_cancer['target'] = breast_cancer['target']

df_breast_cancer.head()
Out[5]:
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry mean fractal dimension ... worst texture worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension target
0 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 0.2419 0.07871 ... 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890 0
1 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812 0.05667 ... 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902 0
2 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 0.2069 0.05999 ... 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758 0
3 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 0.2597 0.09744 ... 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300 0
4 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 0.1809 0.05883 ... 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678 0

5 rows × 31 columns

Create Neural Network

In [18]:
clf = MLPClassifier(solver='lbfgs', alpha=1e-5,
                    hidden_layer_sizes=(10,3,3), random_state=1)
 
In [19]:
cv_score = cross_val_score(clf,
                           X=df_breast_cancer.iloc[:, :-1],
                           y=df_breast_cancer['target'],
                           cv=5)
plt.plot(cv_score);
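
The plot shows one score per fold. A numeric summary is often easier to compare across models, so here is a small sketch that repeats the same cross-validation and prints the mean and standard deviation of the fold scores. It assumes the default scoring of `cross_val_score` for a classifier (accuracy), and the variable name `scores` is illustrative. The solver may emit convergence warnings with these settings.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

# Load features and labels directly as arrays
X, y = load_breast_cancer(return_X_y=True)

# Same classifier settings as in the cell above
clf = MLPClassifier(solver='lbfgs', alpha=1e-5,
                    hidden_layer_sizes=(10, 3, 3), random_state=1)

# One score per fold (5 folds)
scores = cross_val_score(clf, X=X, y=y, cv=5)

print(f'score per fold: {np.round(scores, 3)}')
print(f'mean: {scores.mean():.3f}, std: {scores.std():.3f}')
```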

Bayesian Regression using pymc3

Goal

This post aims to introduce how to use pymc3 for Bayesian regression by showing the simplest single variable example.

Libraries

In [63]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import pymc3 as pm
%matplotlib inline

Create data for Bayesian regression

To allow a comparison with non-Bayesian linear regression, the data is generated in the same way as in the post Linear Regression.

\begin{equation*} \mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{b} + \mathbf{e} \end{equation*}

Here $\mathbf{x}$ is a one-dimensional vector, $\mathbf{b}$ is a constant, and $\mathbf{e}$ is white noise.

In [58]:
a = 10
b = 4
n = 100
sigma = 3
e = sigma * np.random.randn(n) 
x = np.linspace(-1, 1, num=n)
y = a * x + b + e

plt.plot(x, y, '.', label='observed y');
plt.plot(x, a * x + b, 'r', label='true y');
plt.legend();

Modeling using pymc3

In the Bayesian world, the above formula is reformulated as follows:

\begin{equation*} \mathbf{y} \sim \mathcal{N} (\mathbf{A}\mathbf{x} + \mathbf{b}, \sigma^2) \end{equation*}

In this case, we regard $\mathbf{y}$ as a random variable following the Normal distribution defined by the mean $\mathbf{A}\mathbf{x} + \mathbf{b}$ and the variance $\sigma^2$.
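
To make this likelihood concrete, the sketch below evaluates the Normal log-likelihood of the data with plain NumPy, once at the true parameters and once at a deliberately wrong guess. It is only an illustration of the formula, not how pymc3 computes it internally. The helper `normal_log_likelihood` and the fixed seed are assumptions added here.

```python
import numpy as np

def normal_log_likelihood(y, x, a, b, sigma):
    """Sum of log N(y_i | a * x_i + b, sigma^2) over all observations."""
    mu = a * x + b
    return np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2)
                  - (y - mu) ** 2 / (2 * sigma ** 2))

# Same data-generating process as above; the seed is an assumption for reproducibility
rng = np.random.default_rng(0)
n = 100
x = np.linspace(-1, 1, num=n)
y = 10 * x + 4 + 3 * rng.standard_normal(n)

# The true parameters should explain the data better than a poor guess
ll_true = normal_log_likelihood(y, x, a=10, b=4, sigma=3)
ll_wrong = normal_log_likelihood(y, x, a=0, b=0, sigma=3)
print(f'log-likelihood at true parameters: {ll_true:.1f}')
print(f'log-likelihood at a=0, b=0:        {ll_wrong:.1f}')
```

Bayesian inference combines this likelihood with the priors on the parameters to obtain the posterior that the sampler explores.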

In [59]:
model = pm.Model()
with model:
    a_0 = pm.Normal('a_0', mu=1, sigma=10)
    b_0 = pm.Normal('b_0', mu=1, sigma=10)
    x_0 = pm.Normal('x_0', mu=0, sigma=1, observed=x)
    mu_0 = a_0 * x_0 + b_0 
    sigma_0 = pm.HalfCauchy('sigma_0', beta=10)

    y_0 = pm.Normal('y_0', mu=mu_0, sigma=sigma_0, observed=y)

    trace = pm.sample(500)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [sigma_0, b_0, a_0]
Sampling 4 chains: 100%|██████████| 4000/4000 [00:01<00:00, 2748.91draws/s]
The acceptance probability does not match the target. It is 0.880262225775949, but should be close to 0.8. Try to increase the number of tuning steps.
In [60]:
pm.traceplot(trace);
In [61]:
pm.summary(trace)
Out[61]:
              mean        sd  mc_error   hpd_2.5   hpd_97.5        n_eff      Rhat
a_0      10.033580  0.529140  0.011378  8.924638  11.006548  2367.657604  1.001081
b_0       4.304798  0.312978  0.006715  3.694924   4.881839  2080.493833  1.000428
sigma_0   3.046738  0.226399  0.004346  2.648484   3.520783  2335.392735  0.999326

Linear Regression

In [65]:
reg = LinearRegression()
reg.fit(x.reshape(-1, 1), y);
print(f'Coefficients A: {reg.coef_[0]:.3}, Intercept b: {reg.intercept_:.2}')
Coefficients A: 10.1, Intercept b: 4.3
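
For reference, the same kind of fit can be obtained without scikit-learn by solving the ordinary least-squares problem directly. This is a sketch under the assumption that the data is regenerated with a fixed seed, so the numbers will differ slightly from the output above. The names `X`, `slope`, and `intercept` are illustrative.

```python
import numpy as np

# Regenerate data with the same process as above (seed is an assumption)
rng = np.random.default_rng(0)
n = 100
x = np.linspace(-1, 1, num=n)
y = 10 * x + 4 + 3 * rng.standard_normal(n)

# Design matrix with a column of ones so the intercept is estimated too
X = np.column_stack([x, np.ones(n)])

# Solve min ||X @ [slope, intercept] - y||^2
(slope, intercept), *_ = np.linalg.lstsq(X, y, rcond=None)
print(f'Coefficient A: {slope:.3}, Intercept b: {intercept:.2}')
```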

Plot Deterministic and Bayesian Regression Lines

In [83]:
plt.plot(x, y, '.', label='observed y', c='C0')
plt.plot(x, a * x + b, label='true y', lw=3., c='C3')
pm.plot_posterior_predictive_glm(trace, samples=30, 
                                 eval=x,
                                 lm=lambda x, sample: sample['b_0'] + sample['a_0'] * x, 
                                 label='posterior predictive regression', c='C2')
plt.plot(x, reg.coef_[0] * x + reg.intercept_ , label='deterministic linear regression', ls='dotted',c='C1', lw=3)
plt.legend(loc=0);
plt.title('Posterior Predictive Regression Lines');

XKCD-style Plot using matplotlib

Goal

This post aims to introduce how to plot the data using matplotlib in an XKCD style.

Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

Create data to plot

In [2]:
x = np.linspace(-np.pi, np.pi, 100)
df_data = pd.DataFrame(data={'sin x': np.sin(x), 'cos x': np.cos(x)}, index=x)
df_data.head()
Out[2]:
                  sin x     cos x
-3.141593 -1.224647e-16 -1.000000
-3.078126 -6.342392e-02 -0.997987
-3.014660 -1.265925e-01 -0.991955
-2.951193 -1.892512e-01 -0.981929
-2.887727 -2.511480e-01 -0.967949
In [3]:
df_data.plot(title='Normal Matplotlib Style');
In [4]:
with plt.xkcd():
    df_data.plot(title='XKCD Style');

Converting A Dictionary Into A Matrix using DictVectorizer

Goal

This post aims to introduce how to convert a dictionary into a matrix using DictVectorizer from scikit-learn. This is useful when your data is stored as a list of sparse dictionaries and you would like to convert it into a feature matrix that scikit-learn can consume.

Libraries

In [6]:
from sklearn.feature_extraction import DictVectorizer
import pandas as pd

Create a list of dictionaries as an input

In [20]:
d_house= [{'area': 300.0, 'price': 1000, 'location': 'NY'},
          {'area': 600.0, 'price': 2000, 'location': 'CA'},
          {'price': 1500, 'location': 'CH'}
         ]
d_house
Out[20]:
[{'area': 300.0, 'price': 1000, 'location': 'NY'},
 {'area': 600.0, 'price': 2000, 'location': 'CA'},
 {'price': 1500, 'location': 'CH'}]

Convert a list of dictionaries into a feature matrix

In [18]:
dv = DictVectorizer()
dv.fit(d_house)
Out[18]:
DictVectorizer(dtype=<class 'numpy.float64'>, separator='=', sort=True,
        sparse=True)
In [19]:
pd.DataFrame(dv.fit_transform(d_house).todense(), columns=dv.feature_names_)
Out[19]:
    area  location=CA  location=CH  location=NY   price
0  300.0          0.0          0.0          1.0  1000.0
1  600.0          1.0          0.0          0.0  2000.0
2    0.0          0.0          1.0          0.0  1500.0
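
To see where these columns come from, the following pure-Python sketch mimics the behavior observed in the output above: string values become one-hot columns named `key=value`, numeric values keep their key as the column name, columns are sorted by name, and missing keys are filled with 0.0. The function `vectorize` is a hypothetical helper written for illustration, not scikit-learn's implementation.

```python
def vectorize(records):
    """Mimic the DictVectorizer output shown above for flat dictionaries."""
    def feature_name(key, value):
        # String values become one-hot features named "key=value"
        return f'{key}={value}' if isinstance(value, str) else key

    # Collect all feature names and sort them
    names = sorted({feature_name(k, v) for rec in records for k, v in rec.items()})
    index = {name: i for i, name in enumerate(names)}

    matrix = []
    for rec in records:
        row = [0.0] * len(names)  # features missing from a record stay at 0.0
        for k, v in rec.items():
            row[index[feature_name(k, v)]] = 1.0 if isinstance(v, str) else float(v)
        matrix.append(row)
    return names, matrix

d_house = [{'area': 300.0, 'price': 1000, 'location': 'NY'},
           {'area': 600.0, 'price': 2000, 'location': 'CA'},
           {'price': 1500, 'location': 'CH'}]

names, matrix = vectorize(d_house)
print(names)
for row in matrix:
    print(row)
```

Note that the missing `area` in the third record is encoded as 0.0, which is indistinguishable from a real area of zero.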

Precision

Goal

This post aims to introduce one of the model evaluation metrics, called the precision score. The precision score measures how many of the positive predictions were correct out of the total number of positive predictions. The higher the precision score, the more likely a positive prediction is to be correct whenever one is made.

The precision score is defined by the following equation:

$$ \text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} = \frac{\text{True Positive}}{\text{total number of samples predicted as positive}} $$
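
As a small worked example of this formula, the sketch below counts true positives and false positives by hand for a toy set of labels. The function `precision` and the toy lists are illustrative and not taken from the notebook.

```python
def precision(y_true, y_pred):
    """Precision = TP / (TP + FP) for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 0)
    # Undefined when there are no positive predictions (tp + fp == 0)
    return tp / (tp + fp)

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 1, 1, 0, 0]

# Three positive predictions, two of them correct: precision = 2 / 3
print(precision(y_true, y_pred))
```

Samples predicted as negative do not enter the formula, so the missed positive at index 3 has no effect on precision.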

Libraries

In [2]:
from sklearn.metrics import precision_score
import pandas as pd

Create a prediction and ground truth

In [21]:
df_prediction = pd.DataFrame([0, 1, 0, 1, 1 ,1, 1, 1], 
                             columns=['prediction'])
df_prediction
Out[21]:
prediction
0 0
1 1
2 0
3 1
4 1
5 1
6 1
7 1
In [22]:
df_groundtruth = pd.DataFrame([0, 0, 0 , 0, 1, 1, 1, 1], 
                              columns=['gt'])
df_groundtruth
Out[22]:
gt
0 0
1 0
2 0
3 0
4 1
5 1
6 1
7 1

Compute Precision Score

In [24]:
precision_score(y_true=df_groundtruth, 
        y_pred=df_prediction, average='binary')
Out[24]:
0.6666666666666666

Double check using true positives and false positives

In [19]:
TP = (df_prediction.loc[df_prediction['prediction']==1,'prediction'] == df_groundtruth.loc[df_prediction['prediction']==1,'gt']).sum()
TP
Out[19]:
4
In [25]:
FP = (df_prediction.loc[df_prediction['prediction']==1,'prediction'] != df_groundtruth.loc[df_prediction['prediction']==1,'gt']).sum()
FP
Out[25]:
2
In [26]:
TP / (TP + FP)
Out[26]:
0.6666666666666666

Variance Thresholding For Feature Selection

Goal

This post aims to introduce how to conduct feature selection by variance thresholding.

Libraries

In [26]:
from sklearn.feature_selection import VarianceThreshold
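# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this import requires an older scikit-learn version
from sklearn.datasets import load_boston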
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Load Boston housing data

In [4]:
boston = load_boston()
df_boston = pd.DataFrame(boston.data, columns=boston.feature_names)
df_boston.head()
Out[4]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33

Add low variance columns

In [14]:
np.var(np.random.random(size=df_boston.shape[0]))
Out[14]:
0.08309365785384086
In [16]:
df_boston['low_variance'] = 1
df_boston['low_variance2'] = (np.random.random(size=df_boston.shape[0]) > 0.5) * 1
df_boston.head()
Out[16]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT low_variance low_variance2
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 1 0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 1 1
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 1 0
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 1 1
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 1 1

Variance Thresholding

In [50]:
variance_threshold = 0.1
selection = VarianceThreshold(threshold=variance_threshold)
selection.fit(df_boston)
Out[50]:
VarianceThreshold(threshold=0.1)
In [51]:
ax = pd.Series(selection.variances_, index=df_boston.columns).plot(kind='bar', logy=True);
ax.axhline(variance_threshold, ls='dotted', c='r');
In [52]:
df_boston_selected = pd.DataFrame(selection.transform(df_boston), columns=df_boston.columns[selection.get_support()])
df_boston_selected.head()
Out[52]:
CRIM ZN INDUS RM AGE DIS RAD TAX PTRATIO B LSTAT low_variance2
0 0.00632 18.0 2.31 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 0.0
1 0.02731 0.0 7.07 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 1.0
2 0.02729 0.0 7.07 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 0.0
3 0.03237 0.0 2.18 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 1.0
4 0.06905 0.0 2.18 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 1.0
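
The selection rule itself is simple enough to sketch with NumPy: compute the population variance of each column and keep the columns whose variance exceeds the threshold. The toy matrix and the names `variances` and `mask` below are illustrative, and the strict `>` comparison is an assumption about how the threshold is applied.

```python
import numpy as np

# Toy data: columns are (varied, constant, balanced 0/1, nearly constant)
X = np.array([
    [1.0, 1.0, 0.0, 0.50],
    [2.0, 1.0, 1.0, 0.51],
    [3.0, 1.0, 0.0, 0.49],
    [4.0, 1.0, 1.0, 0.50],
])

threshold = 0.1

# Population variance (ddof=0) of every column
variances = np.var(X, axis=0)

# Keep only the columns whose variance is above the threshold
mask = variances > threshold
X_selected = X[:, mask]

print(variances)
print(mask)
print(X_selected)
```

A balanced 0/1 column has variance p(1 - p) = 0.25, which is above 0.1. This is why `low_variance2` survives the selection above while the constant `low_variance` column is dropped.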

F1 Score

Goal

This post aims to introduce one of the model evaluation metrics, called the F1 score. The F1 score combines precision and recall into a single number to measure overall model performance. In general, the higher the F1 score, the better the model performs.

The F1 score is defined by the following equation:

$$ F_1 = 2 \cdot \frac{precision \cdot recall}{precision + recall} $$
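
The formula is the harmonic mean of precision and recall, so it is pulled toward the smaller of the two values. The sketch below illustrates this with a few hand-picked inputs. The function `f1` and the example numbers are illustrative, and returning 0.0 when both inputs are zero is a convention chosen here.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # convention chosen here to avoid division by zero
    return 2 * precision * recall / (precision + recall)

# Equal precision and recall: F1 equals both
print(f1(0.5, 0.5))

# Unequal values: F1 falls below the arithmetic mean
print(f1(1.0, 0.5))
print(f1(1.0, 0.1))
```

A model with perfect precision but very low recall therefore still gets a low F1 score.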

Libraries

In [10]:
from sklearn.metrics import f1_score, precision_score, recall_score
import pandas as pd

Create a prediction and ground truth

In [7]:
df_prediction = pd.DataFrame([0, 1, 0, 1, 0 ,1, 0, 1], 
                             columns=['prediction'])
df_prediction
Out[7]:
prediction
0 0
1 1
2 0
3 1
4 0
5 1
6 0
7 1
In [8]:
df_groundtruth = pd.DataFrame([0, 0, 0 , 0, 1, 1, 1, 1], 
                              columns=['gt'])
df_groundtruth
Out[8]:
gt
0 0
1 0
2 0
3 0
4 1
5 1
6 1
7 1

Compute F1 Score

In [9]:
f1_score(y_true=df_groundtruth, 
        y_pred=df_prediction)
Out[9]:
0.5

Double check by precision and recall

In [11]:
precision_score(y_true=df_groundtruth, 
        y_pred=df_prediction)
Out[11]:
0.5
In [12]:
recall_score(y_true=df_groundtruth, 
        y_pred=df_prediction)
Out[12]:
0.5
In [13]:
2 * (0.5 * 0.5) / (0.5 + 0.5)
Out[13]:
0.5