Posts about Machine Learning (old posts, page 3)

Bayesian Regression using pymc3

Goal

This post aims to introduce how to use pymc3 for Bayesian regression by showing the simplest single-variable example.



Libraries

In [63]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import pymc3 as pm
%matplotlib inline

Create data for Bayesian regression

To compare with non-Bayesian linear regression, the data is generated the same way as in the earlier post Linear Regression:

\begin{equation*} \mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{b} + \mathbf{e} \end{equation*}

Here $\mathbf{x}$ is a one-dimensional vector, $\mathbf{b}$ is a constant intercept, and $\mathbf{e}$ is white noise.

In [58]:
a = 10
b = 4
n = 100
sigma = 3
e = sigma * np.random.randn(n) 
x = np.linspace(-1, 1, num=n)
y = a * x + b + e

plt.plot(x, y, '.', label='observed y');
plt.plot(x, a * x + b, 'r', label='true y');
plt.legend();

Modeling using pymc3

In the Bayesian setting, the above formula is reformulated as follows:

\begin{equation*} \mathbf{y} \sim \mathcal{N} (\mathbf{A}\mathbf{x} + \mathbf{b}, \sigma^2) \end{equation*}

In this case, we regard $\mathbf{y}$ as a random variable following a Normal distribution with mean $\mathbf{A}\mathbf{x} + \mathbf{b}$ and variance $\sigma^2$. In the model below, weakly informative Normal priors are placed on the slope and intercept, and a HalfCauchy prior on the noise scale $\sigma$.

In [59]:
model = pm.Model()
with model:
    # priors for the slope and intercept
    a_0 = pm.Normal('a_0', mu=1, sigma=10)
    b_0 = pm.Normal('b_0', mu=1, sigma=10)
    # observed x, wrapped as an observed random variable
    x_0 = pm.Normal('x_0', mu=0, sigma=1, observed=x)
    mu_0 = a_0 * x_0 + b_0
    # prior for the noise scale
    sigma_0 = pm.HalfCauchy('sigma_0', beta=10)

    # likelihood of the observed y
    y_0 = pm.Normal('y_0', mu=mu_0, sigma=sigma_0, observed=y)

    trace = pm.sample(500)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [sigma_0, b_0, a_0]
Sampling 4 chains: 100%|██████████| 4000/4000 [00:01<00:00, 2748.91draws/s]
The acceptance probability does not match the target. It is 0.880262225775949, but should be close to 0.8. Try to increase the number of tuning steps.
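Note: this acceptance-probability warning is usually harmless. As the message suggests, one possible tweak (not applied in the results below) is to increase the number of tuning steps, e.g.:

# draw more tuning steps before the 500 kept samples (illustrative only)
with model:
    trace = pm.sample(500, tune=2000)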
In [60]:
pm.traceplot(trace);
In [61]:
pm.summary(trace)
Out[61]:
mean sd mc_error hpd_2.5 hpd_97.5 n_eff Rhat
a_0 10.033580 0.529140 0.011378 8.924638 11.006548 2367.657604 1.001081
b_0 4.304798 0.312978 0.006715 3.694924 4.881839 2080.493833 1.000428
sigma_0 3.046738 0.226399 0.004346 2.648484 3.520783 2335.392735 0.999326
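The posterior means reported above can also be read directly off the trace; a quick sanity check using the sampled trace:

# posterior means of the sampled parameters (should roughly match the summary)
print(trace['a_0'].mean(), trace['b_0'].mean(), trace['sigma_0'].mean())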

Linear Regression

In [65]:
reg = LinearRegression()
reg.fit(x.reshape(-1, 1), y);
print(f'Coefficients A: {reg.coef_[0]:.3}, Intercept b: {reg.intercept_:.2}')
Coefficients A: 10.1, Intercept b: 4.3

Plot Deterministic and Bayesian Regression Lines

In [83]:
plt.plot(x, y, '.', label='observed y', c='C0')
plt.plot(x, a * x + b, label='true y', lw=3., c='C3')
pm.plot_posterior_predictive_glm(trace, samples=30, 
                                 eval=x,
                                 lm=lambda x, sample: sample['b_0'] + sample['a_0'] * x, 
                                 label='posterior predictive regression', c='C2')
plt.plot(x, reg.coef_[0] * x + reg.intercept_ , label='deterministic linear regression', ls='dotted',c='C1', lw=3)
plt.legend(loc=0);
plt.title('Posterior Predictive Regression Lines');

XKCD-style Plot using matplotlib

Goal

This post aims to introduce how to plot data in an XKCD style using matplotlib.


Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

Create data to plot

In [2]:
x = np.linspace(-np.pi, np.pi, 100)
df_data = pd.DataFrame(data={'sin x': np.sin(x), 'cos x': np.cos(x)}, index=x)
df_data.head()
Out[2]:
sin x cos x
-3.141593 -1.224647e-16 -1.000000
-3.078126 -6.342392e-02 -0.997987
-3.014660 -1.265925e-01 -0.991955
-2.951193 -1.892512e-01 -0.981929
-2.887727 -2.511480e-01 -0.967949
In [3]:
df_data.plot(title='Normal Matplotlib Style');
In [4]:
with plt.xkcd():
    df_data.plot(title='XKCD Style');
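Note that plt.xkcd() is a context manager, so the style only applies inside the with block. Calling it outside a with statement turns the style on globally; a minimal sketch:

# apply the XKCD style globally (plt.rcdefaults() restores the defaults)
plt.xkcd()
df_data.plot(title='XKCD Style');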

Converting A Dictionary Into A Matrix using DictVectorizer

Goal

This post aims to introduce how to convert a dictionary into a matrix using DictVectorizer from scikit-learn. This is useful when you have data stored as a list of sparse dictionaries and would like to convert it into a feature matrix digestible by scikit-learn.



Libraries

In [6]:
from sklearn.feature_extraction import DictVectorizer
import pandas as pd

Create a list of dictionaries as input

In [20]:
d_house = [{'area': 300.0, 'price': 1000, 'location': 'NY'},
           {'area': 600.0, 'price': 2000, 'location': 'CA'},
           {'price': 1500, 'location': 'CH'}]
d_house
Out[20]:
[{'area': 300.0, 'price': 1000, 'location': 'NY'},
 {'area': 600.0, 'price': 2000, 'location': 'CA'},
 {'price': 1500, 'location': 'CH'}]

Convert a list of dictionaries into a feature matrix

In [18]:
dv = DictVectorizer()
dv.fit(d_house)
Out[18]:
DictVectorizer(dtype=<class 'numpy.float64'>, separator='=', sort=True,
        sparse=True)
In [19]:
pd.DataFrame(dv.fit_transform(d_house).todense(), columns=dv.feature_names_)
Out[19]:
area location=CA location=CH location=NY price
0 300.0 0.0 0.0 1.0 1000.0
1 600.0 1.0 0.0 0.0 2000.0
2 0.0 0.0 1.0 0.0 1500.0
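Once fitted, the same vectorizer can map new dictionaries into the same feature space: missing keys become 0 and unseen keys are ignored. A minimal sketch with a made-up record:

# transform an unseen record using the already-fitted vectorizer
d_new = [{'area': 450.0, 'price': 1200, 'location': 'CA'}]
pd.DataFrame(dv.transform(d_new).todense(), columns=dv.feature_names_)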

Precision

Goal

This post aims to introduce one of the model evaluation metrics, called the precision score. Precision measures how many of the samples predicted as positive are actually positive. The higher the precision, the more likely a positive prediction is to be correct.

The precision score is defined by the following equation:

$$ \text{Precision} = \frac{True\;Positive}{True\;Positive + False\;Positive} = \frac{True\;Positive}{total\;\#\;of\;samples\;predicted\;as\;True} $$
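For example, with 4 true positives and 2 false positives (the case in the data below), precision = 4 / (4 + 2) ≈ 0.67.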


Libraries

In [2]:
from sklearn.metrics import precision_score
import pandas as pd

Create a prediction and ground truth

In [21]:
df_prediction = pd.DataFrame([0, 1, 0, 1, 1 ,1, 1, 1], 
                             columns=['prediction'])
df_prediction
Out[21]:
prediction
0 0
1 1
2 0
3 1
4 1
5 1
6 1
7 1
In [22]:
df_groundtruth = pd.DataFrame([0, 0, 0 , 0, 1, 1, 1, 1], 
                              columns=['gt'])
df_groundtruth
Out[22]:
gt
0 0
1 0
2 0
3 0
4 1
5 1
6 1
7 1

Compute Precision Score

In [24]:
precision_score(y_true=df_groundtruth, 
        y_pred=df_prediction, average='binary')
Out[24]:
0.6666666666666666

Double check using TP and FP

In [19]:
TP = (df_prediction.loc[df_prediction['prediction']==1,'prediction'] == df_groundtruth.loc[df_prediction['prediction']==1,'gt']).sum()
TP
Out[19]:
4
In [25]:
FP = (df_prediction.loc[df_prediction['prediction']==1,'prediction'] != df_groundtruth.loc[df_prediction['prediction']==1,'gt']).sum()
FP
Out[25]:
2
In [26]:
TP / (TP + FP)
Out[26]:
0.6666666666666666

Variance Thresholding For Feature Selection

Goal

This post aims to introduce how to conduct feature selection by variance thresholding.


Libraries

In [26]:
from sklearn.feature_selection import VarianceThreshold
from sklearn.datasets import load_boston
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Load Boston housing data

In [4]:
boston = load_boston()
df_boston = pd.DataFrame(boston.data, columns=boston.feature_names)
df_boston.head()
Out[4]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33

Add low variance columns

In [14]:
# for reference: the variance of a Uniform(0, 1) sample is about 1/12 ≈ 0.083
np.var(np.random.random(size=df_boston.shape[0]))
Out[14]:
0.08309365785384086
In [16]:
df_boston['low_variance'] = 1  # constant column: zero variance
df_boston['low_variance2'] = (np.random.random(size=df_boston.shape[0]) > 0.5) * 1  # binary column: variance ~0.25
df_boston.head()
Out[16]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT low_variance low_variance2
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 1 0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 1 1
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 1 0
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 1 1
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 1 1

Variance Thresholding

In [50]:
variance_threshold = 0.1
selection = VarianceThreshold(threshold=variance_threshold)
selection.fit(df_boston)
Out[50]:
VarianceThreshold(threshold=0.1)
In [51]:
ax = pd.Series(selection.variances_, index=df_boston.columns).plot(kind='bar', logy=True);
ax.axhline(variance_threshold, ls='dotted', c='r');
In [52]:
df_boston_selected = pd.DataFrame(selection.transform(df_boston), columns=df_boston.columns[selection.get_support()])
df_boston_selected.head()
Out[52]:
CRIM ZN INDUS RM AGE DIS RAD TAX PTRATIO B LSTAT low_variance2
0 0.00632 18.0 2.31 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 0.0
1 0.02731 0.0 7.07 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 1.0
2 0.02729 0.0 7.07 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 0.0
3 0.03237 0.0 2.18 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 1.0
4 0.06905 0.0 2.18 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 1.0
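The same selection can be reproduced by hand with pandas as a sanity check. Note that VarianceThreshold uses the population variance (ddof=0), while pandas' .var() defaults to ddof=1, so pass ddof=0 to match:

# keep only the columns whose population variance exceeds the threshold
variances = df_boston.var(ddof=0)
df_manual = df_boston.loc[:, variances > variance_threshold]
df_manual.head()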

F1 Score

Goal

This post aims to introduce one of the model evaluation metrics, called the F1 score. The F1 score measures overall model performance as the harmonic mean of precision and recall: the higher the F1 score, the better the model performs in general.

The F1 score is defined by the following equation:

$$ F_1 = 2 \cdot \frac{precision \cdot recall}{precision + recall} $$
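For example, if precision and recall are both 0.5, then F1 = 2 · (0.5 · 0.5) / (0.5 + 0.5) = 0.5, which is exactly the case verified at the end of this post.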


Libraries

In [10]:
from sklearn.metrics import f1_score, precision_score, recall_score
import pandas as pd

Create a prediction and ground truth

In [7]:
df_prediction = pd.DataFrame([0, 1, 0, 1, 0 ,1, 0, 1], 
                             columns=['prediction'])
df_prediction
Out[7]:
prediction
0 0
1 1
2 0
3 1
4 0
5 1
6 0
7 1
In [8]:
df_groundtruth = pd.DataFrame([0, 0, 0 , 0, 1, 1, 1, 1], 
                              columns=['gt'])
df_groundtruth
Out[8]:
gt
0 0
1 0
2 0
3 0
4 1
5 1
6 1
7 1

Compute F1 Score

In [9]:
f1_score(y_true=df_groundtruth, 
        y_pred=df_prediction)
Out[9]:
0.5

Double check using precision and recall

In [11]:
precision_score(y_true=df_groundtruth, 
        y_pred=df_prediction)
Out[11]:
0.5
In [12]:
recall_score(y_true=df_groundtruth, 
        y_pred=df_prediction)
Out[12]:
0.5
In [13]:
2 * (0.5 * 0.5) / (0.5 + 0.5)
Out[13]:
0.5

Select Date And Time Ranges

Goal

This post aims to introduce how to select a subset of the pandas dataframe by selecting a date range.

Libraries

In [2]:
import pandas as pd
import numpy as np

Create a dataframe

In [18]:
date_ranges = pd.date_range('20190101', '20191231', freq='d')
df_rand = pd.DataFrame({'date': date_ranges, 
                       'value': np.random.random(date_ranges.shape[0])})
df_rand.head()
Out[18]:
date value
0 2019-01-01 0.332090
1 2019-01-02 0.690167
2 2019-01-03 0.237744
3 2019-01-04 0.060678
4 2019-01-05 0.572691

Select a range using .between

In [19]:
df_rand.loc[df_rand.date.between('20190201', '20190211'),:]
Out[19]:
date value
31 2019-02-01 0.449901
32 2019-02-02 0.803429
33 2019-02-03 0.299074
34 2019-02-04 0.630970
35 2019-02-05 0.294973
36 2019-02-06 0.510857
37 2019-02-07 0.345567
38 2019-02-08 0.877957
39 2019-02-09 0.990186
40 2019-02-10 0.000186
41 2019-02-11 0.378379
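As an alternative (not used above), the date column can be set as the index, after which .loc accepts date-string slices directly; a minimal sketch:

# equivalent selection via a DatetimeIndex slice (both endpoints inclusive)
df_rand.set_index('date').loc['2019-02-01':'2019-02-11']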

Convert Strings To Dates

Goal

This post aims to introduce how to convert strings to dates using pandas.

Libraries

In [1]:
import pandas as pd

Dates as strings

In [3]:
df_date = pd.DataFrame({'date':['20190101', '20190102', '20190105'], 
                       'temperature': [23.5, 32, 25]})
df_date
Out[3]:
date temperature
0 20190101 23.5
1 20190102 32.0
2 20190105 25.0

Convert strings to date format

In [4]:
pd.to_datetime(df_date['date'])
Out[4]:
0   2019-01-01
1   2019-01-02
2   2019-01-05
Name: date, dtype: datetime64[ns]
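If the string format is known up front, it can be passed explicitly, which avoids ambiguity and is typically faster; a small sketch on the same dataframe:

# parse with an explicit format instead of letting pandas infer it
pd.to_datetime(df_date['date'], format='%Y%m%d')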

Tokenize Text

Goal

This post aims to introduce how to tokenize text using nltk.


Libraries

In [5]:
from nltk.tokenize import sent_tokenize, word_tokenize
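The tokenizers rely on the pretrained punkt models; if they are not installed yet, download them once beforehand:

import nltk
nltk.download('punkt')  # one-time download required by sent_tokenize / word_tokenize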

Create a paragraph

In [8]:
paragraph = "Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aims to help programmers write clear, logical code for small and large-scale projects"
paragraph
Out[8]:
"Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aims to help programmers write clear, logical code for small and large-scale projects"

Tokenize a paragraph into sentences

In [9]:
sent_tokenize(paragraph)
Out[9]:
['Python is an interpreted, high-level, general-purpose programming language.',
 "Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace.",
 'Its language constructs and object-oriented approach aims to help programmers write clear, logical code for small and large-scale projects']

Tokenize a paragraph into words

In [10]:
word_tokenize(paragraph)
Out[10]:
['Python',
 'is',
 'an',
 'interpreted',
 ',',
 'high-level',
 ',',
 'general-purpose',
 'programming',
 'language',
 '.',
 'Created',
 'by',
 'Guido',
 'van',
 'Rossum',
 'and',
 'first',
 'released',
 'in',
 '1991',
 ',',
 'Python',
 "'s",
 'design',
 'philosophy',
 'emphasizes',
 'code',
 'readability',
 'with',
 'its',
 'notable',
 'use',
 'of',
 'significant',
 'whitespace',
 '.',
 'Its',
 'language',
 'constructs',
 'and',
 'object-oriented',
 'approach',
 'aims',
 'to',
 'help',
 'programmers',
 'write',
 'clear',
 ',',
 'logical',
 'code',
 'for',
 'small',
 'and',
 'large-scale',
 'projects']