Lasso Regression

h1ros

Jun 13, 2019, 12:09:15 AM

Goal¶

This post aims to introduce lasso regression using dummy data. Lasso is particularly useful when the explanatory variables are correlated with one another, i.e., when there is multicollinearity.

Example
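
As a rough illustration of the idea, the sketch below fits a lasso model with a hand-rolled cyclic coordinate descent in plain Python. It is not the code from the original post: the data, the helper names `soft_threshold` and `lasso_coordinate_descent`, and the choice `alpha=0.5` are all assumptions made for this example. The objective minimized is (1 / (2n)) * ||y - Xw||^2 + alpha * ||w||_1, without an intercept.

```python
# Hand-rolled lasso via cyclic coordinate descent (illustrative sketch only).
# Objective: (1 / (2 * n)) * ||y - X w||^2 + alpha * ||w||_1, no intercept.


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iter=100):
    """Hypothetical helper: update one coefficient at a time, n_iter sweeps."""
    n_samples, n_features = len(X), len(X[0])
    w = [0.0] * n_features
    for _ in range(n_iter):
        for j in range(n_features):
            # Correlation of feature j with the residual that ignores feature j.
            rho = 0.0
            for i in range(n_samples):
                partial = sum(X[i][k] * w[k] for k in range(n_features) if k != j)
                rho += X[i][j] * (y[i] - partial)
            rho /= n_samples
            z = sum(X[i][j] ** 2 for i in range(n_samples)) / n_samples
            w[j] = soft_threshold(rho, alpha) / z
    return w


# Dummy data (made up for this sketch): x2 is strongly correlated with x1,
# x3 is unrelated to the target, and y depends on x1 only.
x1 = [-2.0, -1.0, 0.0, 1.0, 2.0]
x2 = [-1.7, -0.8, -0.4, 1.0, 1.9]
x3 = [1.0, -1.0, 0.0, -1.0, 1.0]
X = [list(row) for row in zip(x1, x2, x3)]
y = [3.0 * v for v in x1]

weights = lasso_coordinate_descent(X, y, alpha=0.5)
print(weights)
```

On this toy data the coefficient of `x1` is shrunk from its true value of 3 to 2.75, while the correlated `x2` and the irrelevant `x3` are set to exactly zero, which is the feature-selection behaviour lasso is known for.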

PyTorch Basic Operations

h1ros

Jun 11, 2019, 8:26:00 AM

Goal¶

This post aims to introduce basic PyTorch operations, e.g., addition and multiplication.

Libraries¶

In [2]:

import numpy as np
import pandas as pd
import torch

Create a Tensor¶

In [5]:

t_x1 = torch.Tensor([[1, 2, 3],
                     [4, 5, 6],
                     [7, 8, 9]])

t_x2 = torch.Tensor([[9, 8, 7],
                     [6, 5, 4],
                     [3, 2, 1]])
print(t_x1)
print(t_x2)

tensor([[1., 2., 3.],
        [4., 5., 6.],
        [7., 8., 9.]])
tensor([[9., 8., 7.],
        [6., 5., 4.],
        [3., 2., 1.]])

Addition¶

+ operator¶

In [6]:

t_x1 + t_x2

Out[6]:

tensor([[10., 10., 10.],
        [10., 10., 10.],
        [10., 10., 10.]])

Neural Network for Classification

h1ros

Jun 10, 2019, 6:42:28 AM

Goal¶

This post aims to introduce a (shallow) neural network for classification using scikit-learn.

Libraries¶

In [2]:

import pandas as pd
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
%matplotlib inline

Load Breast Cancer dataset¶

In [5]:

breast_cancer = load_breast_cancer()
df_breast_cancer = pd.DataFrame(breast_cancer['data'], columns=breast_cancer['feature_names'])
df_breast_cancer['target'] = breast_cancer['target']

df_breast_cancer.head()

Out[5]:

	mean radius	mean texture	mean perimeter	mean area	mean smoothness	mean compactness	mean concavity	mean concave points	mean symmetry	mean fractal dimension	...	worst texture	worst perimeter	worst area	worst smoothness	worst compactness	worst concavity	worst concave points	worst symmetry	worst fractal dimension
0	17.99	10.38	122.80	1001.0	0.11840	0.27760	0.3001	0.14710	0.2419	0.07871	...	17.33	184.60	2019.0	0.1622	0.6656	0.7119	0.2654	0.4601	0.11890
1	20.57	17.77	132.90	1326.0	0.08474	0.07864	0.0869	0.07017	0.1812	0.05667	...	23.41	158.80	1956.0	0.1238	0.1866	0.2416	0.1860	0.2750	0.08902
2	19.69	21.25	130.00	1203.0	0.10960	0.15990	0.1974	0.12790	0.2069	0.05999	...	25.53	152.50	1709.0	0.1444	0.4245	0.4504	0.2430	0.3613	0.08758
3	11.42	20.38	77.58	386.1	0.14250	0.28390	0.2414	0.10520	0.2597	0.09744	...	26.50	98.87	567.7	0.2098	0.8663	0.6869	0.2575	0.6638	0.17300
4	20.29	14.34	135.10	1297.0	0.10030	0.13280	0.1980	0.10430	0.1809	0.05883	...	16.67	152.20	1575.0	0.1374	0.2050	0.4000	0.1625	0.2364	0.07678

5 rows × 31 columns

Create Neural Network¶

In [18]:

clf = MLPClassifier(solver='lbfgs', alpha=1e-5,
                    hidden_layer_sizes=(10,3,3), random_state=1)

In [19]:

cv_score = cross_val_score(clf,
                           X=df_breast_cancer.iloc[:, :-1],
                           y=df_breast_cancer['target'],
                           cv=5)
plt.plot(cv_score);

Bayesian Regression using pymc3

h1ros

Jun 9, 2019, 5:32:16 AM

Goal¶

This post aims to introduce how to use pymc3 for Bayesian regression by showing the simplest single variable example.

Libraries¶

In [63]:

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import pymc3 as pm
%matplotlib inline

Create data for Bayesian regression¶

To allow comparison with non-Bayesian linear regression, the data is generated in the same way as in the post Linear Regression:

\begin{equation*} \mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{b} + \mathbf{e} \end{equation*}

Here $\mathbf{x}$ is a one-dimensional vector, $\mathbf{A}$ is the slope, $\mathbf{b}$ is a constant intercept, and $\mathbf{e}$ is white noise.

In [58]:

a = 10
b = 4
n = 100
sigma = 3
e = sigma * np.random.randn(n) 
x = np.linspace(-1, 1, num=n)
y = a * x + b + e

plt.plot(x, y, '.', label='observed y');
plt.plot(x, a * x + b, 'r', label='true y');
plt.legend();

Modeling using `pymc`¶

In the Bayesian world, the above formula is reformulated as follows:

\begin{equation*} \mathbf{y} \sim \mathcal{N} (\mathbf{A}\mathbf{x} + \mathbf{b}, \sigma^2) \end{equation*}

In this case, we regard $\mathbf{y}$ as a random variable following the Normal distribution defined by the mean $\mathbf{A}\mathbf{x} + \mathbf{b}$ and the variance $\sigma^2$.
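
For reference, the full generative model implied by the model cell of this post can be written out as below. This is only a restatement of that code, under the assumption that the `sigma` argument of `pm.Normal` is a standard deviation and the `beta` argument of `pm.HalfCauchy` is its scale. The code also declares $\mathbf{x}$ as an observed Normal variable; since it is fully observed, it adds no free parameters, which matches the sampler log listing only `sigma_0`, `b_0` and `a_0`.

\begin{align*}
a_0 &\sim \mathcal{N}(1, 10^2) \\
b_0 &\sim \mathcal{N}(1, 10^2) \\
\sigma_0 &\sim \text{HalfCauchy}(10) \\
y_i &\sim \mathcal{N}(a_0 x_i + b_0, \sigma_0^2), \quad i = 1, \dots, n
\end{align*}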

In [59]:

model = pm.Model()
with model:
    a_0 = pm.Normal('a_0', mu=1, sigma=10)
    b_0 = pm.Normal('b_0', mu=1, sigma=10)
    x_0 = pm.Normal('x_0', mu=0, sigma=1, observed=x)
    mu_0 = a_0 * x_0 + b_0 
    sigma_0 = pm.HalfCauchy('sigma_0', beta=10)

    y_0 = pm.Normal('y_0', mu=mu_0, sigma=sigma_0, observed=y)

    trace = pm.sample(500)

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [sigma_0, b_0, a_0]
Sampling 4 chains: 100%|██████████| 4000/4000 [00:01<00:00, 2748.91draws/s]
The acceptance probability does not match the target. It is 0.880262225775949, but should be close to 0.8. Try to increase the number of tuning steps.

In [60]:

pm.traceplot(trace);

In [61]:

pm.summary(trace)

Out[61]:

	mean	sd	mc_error	hpd_2.5	hpd_97.5	n_eff	Rhat
a_0	10.033580	0.529140	0.011378	8.924638	11.006548	2367.657604	1.001081
b_0	4.304798	0.312978	0.006715	3.694924	4.881839	2080.493833	1.000428
sigma_0	3.046738	0.226399	0.004346	2.648484	3.520783	2335.392735	0.999326

Linear Regression¶

In [65]:

reg = LinearRegression()
reg.fit(x.reshape(-1, 1), y);
print(f'Coefficients A: {reg.coef_[0]:.3}, Intercept b: {reg.intercept_:.2}')

Coefficients A: 10.1, Intercept b: 4.3
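
For a single feature, the least-squares fit has a closed form: the slope is the covariance of x and y divided by the variance of x. The sketch below is a plain-Python illustration of that formula, not the scikit-learn implementation. The helper name `fit_line` is hypothetical, and the data is a noise-free version of the data in this post (a = 10, b = 4), so the true parameters are recovered up to rounding.

```python
def fit_line(xs, ys):
    """Hypothetical helper: closed-form least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x); the 1/n factors cancel.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Noise-free data on the same grid as the post: 100 points between -1 and 1.
n = 100
xs = [-1 + 2 * i / (n - 1) for i in range(n)]
ys = [10 * x + 4 for x in xs]

slope, intercept = fit_line(xs, ys)
print(round(slope, 6), round(intercept, 6))
```

With the noisy data used in the post, the same formula gives estimates close to, but not exactly equal to, the true values, as the fitted coefficients above show.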

Plot Deterministic and Bayesian Regression Lines¶

In [83]:

plt.plot(x, y, '.', label='observed y', c='C0')
plt.plot(x, a * x + b, label='true y', lw=3., c='C3')
pm.plot_posterior_predictive_glm(trace, samples=30, 
                                 eval=x,
                                 lm=lambda x, sample: sample['b_0'] + sample['a_0'] * x, 
                                 label='posterior predictive regression', c='C2')
plt.plot(x, reg.coef_[0] * x + reg.intercept_ , label='deterministic linear regression', ls='dotted',c='C1', lw=3)
plt.legend(loc=0);
plt.title('Posterior Predictive Regression Lines');

XKCD-style Plot using matplotlib

h1ros

Jun 8, 2019, 12:52:53 AM

Goal¶

This post aims to introduce how to plot the data using matplotlib in an XKCD style.

Libraries¶

In [1]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

Create data to plot¶

In [2]:

x = np.linspace(-np.pi, np.pi, 100)
df_data = pd.DataFrame(data={'sin x': np.sin(x), 'cos x': np.cos(x)}, index=x)
df_data.head()

Out[2]:

	sin x	cos x
-3.141593	-1.224647e-16	-1.000000
-3.078126	-6.342392e-02	-0.997987
-3.014660	-1.265925e-01	-0.991955
-2.951193	-1.892512e-01	-0.981929
-2.887727	-2.511480e-01	-0.967949

In [3]:

df_data.plot(title='Normal Matplotlib Style');

In [4]:

with plt.xkcd():
    df_data.plot(title='XKCD Style');

Converting A Dictionary Into A Matrix using DictVectorizer

h1ros

Jun 7, 2019, 6:08:08 AM

Goal¶

This post aims to introduce how to convert a list of dictionaries into a matrix using DictVectorizer from scikit-learn. This is useful when your data is stored as a list of sparse dictionaries and you would like to convert it into a feature matrix that scikit-learn can consume.

Reference

Scikit-learn DictVectorizer

Libraries¶

In [6]:

from sklearn.feature_extraction import DictVectorizer
import pandas as pd

Create a list of dictionaries as input¶

In [20]:

d_house= [{'area': 300.0, 'price': 1000, 'location': 'NY'},
          {'area': 600.0, 'price': 2000, 'location': 'CA'},
          {'price': 1500, 'location': 'CH'}
         ]
d_house

Out[20]:

[{'area': 300.0, 'price': 1000, 'location': 'NY'},
 {'area': 600.0, 'price': 2000, 'location': 'CA'},
 {'price': 1500, 'location': 'CH'}]

Convert a list of dictionaries into a feature matrix¶

In [18]:

dv = DictVectorizer()
dv.fit(d_house)

Out[18]:

DictVectorizer(dtype=<class 'numpy.float64'>, separator='=', sort=True,
        sparse=True)

In [19]:

pd.DataFrame(dv.fit_transform(d_house).todense(), columns=dv.feature_names_)

Out[19]:

	area	location=CA	location=CH	location=NY	price
0	300.0	0.0	0.0	1.0	1000.0
1	600.0	1.0	0.0	0.0	2000.0
2	0.0	0.0	1.0	0.0	1500.0
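
To make the transformation above less of a black box, here is a pure-Python sketch that imitates the behaviour visible in the output: string values become one indicator column per key/value pair, numeric values are passed through, and keys missing from a record are filled with 0.0. The function name `vectorize` is hypothetical, and this is a simplified imitation rather than scikit-learn's actual implementation.

```python
def vectorize(records, separator="="):
    """Hypothetical helper imitating the one-hot behaviour shown in the post."""

    def feature_items(record):
        for key, value in record.items():
            if isinstance(value, str):
                # A string value becomes an indicator feature named key=value.
                yield f"{key}{separator}{value}", 1.0
            else:
                yield key, float(value)

    # Collect every feature name seen in any record, in sorted order.
    names = sorted({name for record in records for name, _ in feature_items(record)})
    index = {name: i for i, name in enumerate(names)}

    matrix = []
    for record in records:
        row = [0.0] * len(names)  # features absent from a record stay at 0.0
        for name, value in feature_items(record):
            row[index[name]] = value
        matrix.append(row)
    return names, matrix


d_house = [{'area': 300.0, 'price': 1000, 'location': 'NY'},
           {'area': 600.0, 'price': 2000, 'location': 'CA'},
           {'price': 1500, 'location': 'CH'}]

names, matrix = vectorize(d_house)
print(names)
for row in matrix:
    print(row)
```

Note that the missing `area` in the third record is encoded as 0.0, so it cannot be distinguished from a genuine value of zero; if that matters, the missing value would need separate handling before vectorizing.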

Logistic Regression

h1ros

Jun 5, 2019, 10:25:33 PM

Goal¶

This post aims to introduce logistic regression using dummy data.

Precision

h1ros

Jun 4, 2019, 8:05:41 PM

Goal¶

This post aims to introduce one of the model evaluation metrics, the precision score. Precision measures what fraction of the positive predictions were actually correct. The higher the precision, the more likely a positive prediction is to be true.

The precision score is defined by the following equation:

$$ \text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} = \frac{\text{True Positive}}{\text{total number of samples predicted as positive}} $$

Libraries¶

In [2]:

from sklearn.metrics import precision_score
import pandas as pd

Create a prediction and ground truth¶

In [21]:

df_prediction = pd.DataFrame([0, 1, 0, 1, 1 ,1, 1, 1], 
                             columns=['prediction'])
df_prediction

Out[21]:

	prediction
0	0
1	1
2	0
3	1
4	1
5	1
6	1
7	1

In [22]:

df_groundtruth = pd.DataFrame([0, 0, 0 , 0, 1, 1, 1, 1], 
                              columns=['gt'])
df_groundtruth

Out[22]:

	gt
0	0
1	0
2	0
3	0
4	1
5	1
6	1
7	1

Compute Precision Score¶

In [24]:

precision_score(y_true=df_groundtruth, 
        y_pred=df_prediction, average='binary')

Out[24]:

0.6666666666666666

Double-check using true positives and false positives¶

In [19]:

TP = (df_prediction.loc[df_prediction['prediction']==1,'prediction'] == df_groundtruth.loc[df_prediction['prediction']==1,'gt']).sum()
TP

Out[19]:

4

In [25]:

FP = (df_prediction.loc[df_prediction['prediction']==1,'prediction'] != df_groundtruth.loc[df_prediction['prediction']==1,'gt']).sum()
FP

Out[25]:

2

In [26]:

TP / (TP + FP)

Out[26]:

0.6666666666666666
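
The same check can be written without pandas, which makes the counting explicit. The sketch below uses the prediction and ground-truth values from this post; the variable names are chosen for this example only.

```python
prediction = [0, 1, 0, 1, 1, 1, 1, 1]
ground_truth = [0, 0, 0, 0, 1, 1, 1, 1]

pairs = list(zip(ground_truth, prediction))

# True positives: predicted 1 and actually 1.
tp = sum(1 for gt, pred in pairs if pred == 1 and gt == 1)
# False positives: predicted 1 but actually 0.
fp = sum(1 for gt, pred in pairs if pred == 1 and gt == 0)

precision = tp / (tp + fp)
print(tp, fp, precision)
```

Six samples are predicted as positive, four of them correctly, so the precision is 4 / 6.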

Variance Thresholding For Feature Selection

h1ros

Jun 3, 2019, 3:16:12 PM

Goal¶

This post aims to introduce how to conduct feature selection by variance thresholding.

Reference

scikit-learn - Feature Selection

Libraries¶

In [26]:

from sklearn.feature_selection import VarianceThreshold
from sklearn.datasets import load_boston
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Load Boston housing data¶

In [4]:

boston = load_boston()
df_boston = pd.DataFrame(boston.data, columns=boston.feature_names)
df_boston.head()

Out[4]:

	CRIM	ZN	INDUS	NOX	RM	AGE	DIS	RAD	TAX	PTRATIO	B	LSTAT
0	0.00632	18.0	2.31	0.538	6.575	65.2	4.0900	1.0	296.0	15.3	396.90	4.98
1	0.02731	0.0	7.07	0.469	6.421	78.9	4.9671	2.0	242.0	17.8	396.90	9.14
2	0.02729	0.0	7.07	0.469	7.185	61.1	4.9671	2.0	242.0	17.8	392.83	4.03
3	0.03237	0.0	2.18	0.458	6.998	45.8	6.0622	3.0	222.0	18.7	394.63	2.94
4	0.06905	0.0	2.18	0.458	7.147	54.2	6.0622	3.0	222.0	18.7	396.90	5.33

Add low variance columns¶

In [14]:

np.var(np.random.random(size=df_boston.shape[0]))

Out[14]:

0.08309365785384086

In [16]:

df_boston['low_variance'] = 1
df_boston['low_variance2'] = (np.random.random(size=df_boston.shape[0]) > 0.5) * 1
df_boston.head()

Out[16]:

	CRIM	ZN	INDUS	NOX	RM	AGE	DIS	RAD	TAX	PTRATIO	B	LSTAT	low_variance	low_variance2
0	0.00632	18.0	2.31	0.538	6.575	65.2	4.0900	1.0	296.0	15.3	396.90	4.98	1	0
1	0.02731	0.0	7.07	0.469	6.421	78.9	4.9671	2.0	242.0	17.8	396.90	9.14	1	1
2	0.02729	0.0	7.07	0.469	7.185	61.1	4.9671	2.0	242.0	17.8	392.83	4.03	1	0
3	0.03237	0.0	2.18	0.458	6.998	45.8	6.0622	3.0	222.0	18.7	394.63	2.94	1	1
4	0.06905	0.0	2.18	0.458	7.147	54.2	6.0622	3.0	222.0	18.7	396.90	5.33	1	1

Variance Thresholding¶

In [50]:

variance_threshold = 0.1
selection = VarianceThreshold(threshold=variance_threshold)
selection.fit(df_boston)

Out[50]:

VarianceThreshold(threshold=0.1)

In [51]:

ax = pd.Series(selection.variances_, index=df_boston.columns).plot(kind='bar', logy=True);
ax.axhline(variance_threshold, ls='dotted', c='r');

In [52]:

df_boston_selected = pd.DataFrame(selection.transform(df_boston), columns=df_boston.columns[selection.get_support()])
df_boston_selected.head()

Out[52]:

	CRIM	ZN	INDUS	RM	AGE	DIS	RAD	TAX	PTRATIO	B	LSTAT	low_variance2
0	0.00632	18.0	2.31	6.575	65.2	4.0900	1.0	296.0	15.3	396.90	4.98	0.0
1	0.02731	0.0	7.07	6.421	78.9	4.9671	2.0	242.0	17.8	396.90	9.14	1.0
2	0.02729	0.0	7.07	7.185	61.1	4.9671	2.0	242.0	17.8	392.83	4.03	0.0
3	0.03237	0.0	2.18	6.998	45.8	6.0622	3.0	222.0	18.7	394.63	2.94	1.0
4	0.06905	0.0	2.18	7.147	54.2	6.0622	3.0	222.0	18.7	396.90	5.33	1.0
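
The selection logic itself is simple enough to sketch in plain Python: compute the variance of each column and keep the columns whose variance is above the threshold. The sketch below assumes the population variance (the default of `np.var`, which this post uses above) and treats "above the threshold" as a strict comparison; the helper name `select_by_variance` and the toy columns are made up for this example.

```python
from statistics import pvariance


def select_by_variance(columns, threshold):
    """Hypothetical helper: keep columns whose population variance exceeds threshold."""
    variances = {name: pvariance(values) for name, values in columns.items()}
    kept = [name for name, var in variances.items() if var > threshold]
    return variances, kept


# Toy columns: a constant, a balanced binary flag, and a continuous feature.
columns = {
    "constant": [1.0, 1.0, 1.0, 1.0],
    "binary": [0.0, 1.0, 0.0, 1.0],
    "continuous": [1.0, 2.0, 3.0, 4.0],
}

variances, kept = select_by_variance(columns, threshold=0.1)
print(variances)
print(kept)
```

This also explains why `low_variance2` survives the threshold of 0.1 in the post: a binary column that is 1 about half of the time has a variance of roughly 0.5 * 0.5 = 0.25, whereas the constant `low_variance` column has a variance of 0 and is dropped.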

F1 Score

h1ros

Jun 2, 2019, 12:30:51 AM

Goal¶

This post aims to introduce one of the model evaluation metrics, the F1 score. The F1 score is the harmonic mean of precision and recall and is used as a single summary of model performance; in general, the higher the F1 score, the better the model.

The F1 score is defined by the following equation:

$$ F_1 = 2 \cdot \frac{precision \cdot recall}{precision + recall} $$

Libraries¶

In [10]:

from sklearn.metrics import f1_score, precision_score, recall_score
import pandas as pd

Create a prediction and ground truth¶

In [7]:

df_prediction = pd.DataFrame([0, 1, 0, 1, 0 ,1, 0, 1], 
                             columns=['prediction'])
df_prediction

Out[7]:

	prediction
0	0
1	1
2	0
3	1
4	0
5	1
6	0
7	1

In [8]:

df_groundtruth = pd.DataFrame([0, 0, 0 , 0, 1, 1, 1, 1], 
                              columns=['gt'])
df_groundtruth

Out[8]:

	gt
0	0
1	0
2	0
3	0
4	1
5	1
6	1
7	1

Compute F1 Score¶

In [9]:

f1_score(y_true=df_groundtruth, 
        y_pred=df_prediction)

Out[9]:

0.5

Double-check using precision and recall¶

In [11]:

precision_score(y_true=df_groundtruth, 
        y_pred=df_prediction)

Out[11]:

0.5

In [12]:

recall_score(y_true=df_groundtruth, 
        y_pred=df_prediction)

Out[12]:

0.5

In [13]:

2 * (0.5 * 0.5) / (0.5 + 0.5)

Out[13]:

0.5
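
The same result can be derived from the raw counts of true positives, false positives and false negatives. The plain-Python sketch below uses the prediction and ground-truth values from this post and also checks the equivalent count-based form F1 = 2TP / (2TP + FP + FN); the variable names are chosen for this example only.

```python
prediction = [0, 1, 0, 1, 0, 1, 0, 1]
ground_truth = [0, 0, 0, 0, 1, 1, 1, 1]

tp = fp = fn = 0
for gt, pred in zip(ground_truth, prediction):
    if pred == 1 and gt == 1:
        tp += 1  # predicted positive, actually positive
    elif pred == 1 and gt == 0:
        fp += 1  # predicted positive, actually negative
    elif pred == 0 and gt == 1:
        fn += 1  # predicted negative, actually positive

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Equivalent form expressed directly in terms of the counts.
f1_from_counts = 2 * tp / (2 * tp + fp + fn)
print(tp, fp, fn, precision, recall, f1, f1_from_counts)
```

Here each of the three counts is 2, so precision, recall and F1 all equal 0.5, in agreement with the scikit-learn results above.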
