Posts about Machine Learning (old posts, page 3)

Bayesian Regression using pymc3

Goal

This post aims to introduce how to use pymc3 for Bayesian regression by showing the simplest single-variable example.



Libraries

In [63]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import pymc3 as pm
%matplotlib inline

Create data for Bayesian regression

To compare with non-Bayesian linear regression, the data is generated the same way as in the earlier post Linear Regression:

\begin{equation*} \mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{b} + \mathbf{e} \end{equation*}

Here $\mathbf{x}$ is a one-dimensional vector, $\mathbf{b}$ is a constant intercept, and $\mathbf{e}$ is white noise.

In [58]:
a = 10
b = 4
n = 100
sigma = 3
e = sigma * np.random.randn(n) 
x = np.linspace(-1, 1, num=n)
y = a * x + b + e

plt.plot(x, y, '.', label='observed y');
plt.plot(x, a * x + b, 'r', label='true y');
plt.legend();

Modeling using pymc3

In the Bayesian setting, the above formula is reformulated as follows:

\begin{equation*} \mathbf{y} \sim \mathcal{N} (\mathbf{A}\mathbf{x} + \mathbf{b}, \sigma^2) \end{equation*}

In this case, we regard $\mathbf{y}$ as a random variable following a Normal distribution with mean $\mathbf{A}\mathbf{x} + \mathbf{b}$ and variance $\sigma^2$. In the model below, weakly informative Normal priors are placed on the slope and intercept, and a HalfCauchy prior on the noise scale $\sigma$.

In [59]:
model = pm.Model()
with model:
    # priors for the slope and intercept
    a_0 = pm.Normal('a_0', mu=1, sigma=10)
    b_0 = pm.Normal('b_0', mu=1, sigma=10)
    # observed x, wrapped as an observed random variable
    x_0 = pm.Normal('x_0', mu=0, sigma=1, observed=x)
    mu_0 = a_0 * x_0 + b_0
    # prior for the noise scale
    sigma_0 = pm.HalfCauchy('sigma_0', beta=10)

    # likelihood of the observed y
    y_0 = pm.Normal('y_0', mu=mu_0, sigma=sigma_0, observed=y)

    trace = pm.sample(500)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [sigma_0, b_0, a_0]
Sampling 4 chains: 100%|██████████| 4000/4000 [00:01<00:00, 2748.91draws/s]
The acceptance probability does not match the target. It is 0.880262225775949, but should be close to 0.8. Try to increase the number of tuning steps.
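Note: this acceptance-probability warning is usually harmless. As the message suggests, one possible tweak (not applied in the results below) is to increase the number of tuning steps, e.g.:

# draw more tuning steps before the 500 kept samples (illustrative only)
with model:
    trace = pm.sample(500, tune=2000)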
In [60]:
pm.traceplot(trace);
In [61]:
pm.summary(trace)
Out[61]:
mean sd mc_error hpd_2.5 hpd_97.5 n_eff Rhat
a_0 10.033580 0.529140 0.011378 8.924638 11.006548 2367.657604 1.001081
b_0 4.304798 0.312978 0.006715 3.694924 4.881839 2080.493833 1.000428
sigma_0 3.046738 0.226399 0.004346 2.648484 3.520783 2335.392735 0.999326
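The posterior means reported above can also be read directly off the trace; a quick sanity check using the sampled trace:

# posterior means of the sampled parameters (should roughly match the summary)
print(trace['a_0'].mean(), trace['b_0'].mean(), trace['sigma_0'].mean())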

Linear Regression

In [65]:
reg = LinearRegression()
reg.fit(x.reshape(-1, 1), y);
print(f'Coefficients A: {reg.coef_[0]:.3}, Intercept b: {reg.intercept_:.2}')
Coefficients A: 10.1, Intercept b: 4.3

Plot Deterministic and Bayesian Regression Lines

In [83]:
plt.plot(x, y, '.', label='observed y', c='C0')
plt.plot(x, a * x + b, label='true y', lw=3., c='C3')
pm.plot_posterior_predictive_glm(trace, samples=30, 
                                 eval=x,
                                 lm=lambda x, sample: sample['b_0'] + sample['a_0'] * x, 
                                 label='posterior predictive regression', c='C2')
plt.plot(x, reg.coef_[0] * x + reg.intercept_ , label='deterministic linear regression', ls='dotted',c='C1', lw=3)
plt.legend(loc=0);
plt.title('Posterior Predictive Regression Lines');

XKCD-style Plot using matplotlib

Goal

This post aims to introduce how to plot data in an XKCD style using matplotlib.


Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

Create data to plot

In [2]:
x = np.linspace(-np.pi, np.pi, 100)
df_data = pd.DataFrame(data={'sin x': np.sin(x), 'cos x': np.cos(x)}, index=x)
df_data.head()
Out[2]:
sin x cos x
-3.141593 -1.224647e-16 -1.000000
-3.078126 -6.342392e-02 -0.997987
-3.014660 -1.265925e-01 -0.991955
-2.951193 -1.892512e-01 -0.981929
-2.887727 -2.511480e-01 -0.967949
In [3]:
df_data.plot(title='Normal Matplotlib Style');
In [4]:
with plt.xkcd():
    df_data.plot(title='XKCD Style');
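Note that plt.xkcd() is a context manager, so the style only applies inside the with block. Calling it outside a with statement turns the style on globally; a minimal sketch:

# apply the XKCD style globally (plt.rcdefaults() restores the defaults)
plt.xkcd()
df_data.plot(title='XKCD Style');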

Converting A Dictionary Into A Matrix using DictVectorizer

Goal

This post aims to introduce how to convert a dictionary into a matrix using DictVectorizer from scikit-learn. This is useful when you have data stored as a list of sparse dictionaries and would like to convert it into a feature matrix digestible by scikit-learn.



Libraries

In [6]:
from sklearn.feature_extraction import DictVectorizer
import pandas as pd

Create a list of dictionaries as input

In [20]:
d_house = [{'area': 300.0, 'price': 1000, 'location': 'NY'},
           {'area': 600.0, 'price': 2000, 'location': 'CA'},
           {'price': 1500, 'location': 'CH'}]
d_house
Out[20]:
[{'area': 300.0, 'price': 1000, 'location': 'NY'},
 {'area': 600.0, 'price': 2000, 'location': 'CA'},
 {'price': 1500, 'location': 'CH'}]

Convert a list of dictionaries into a feature matrix

In [18]:
dv = DictVectorizer()
dv.fit(d_house)
Out[18]:
DictVectorizer(dtype=<class 'numpy.float64'>, separator='=', sort=True,
        sparse=True)
In [19]:
pd.DataFrame(dv.fit_transform(d_house).todense(), columns=dv.feature_names_)
Out[19]:
area location=CA location=CH location=NY price
0 300.0 0.0 0.0 1.0 1000.0
1 600.0 1.0 0.0 0.0 2000.0
2 0.0 0.0 1.0 0.0 1500.0
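Once fitted, the same vectorizer can map new dictionaries into the same feature space: missing keys become 0 and unseen keys are ignored. A minimal sketch with a made-up record:

# transform an unseen record using the already-fitted vectorizer
d_new = [{'area': 450.0, 'price': 1200, 'location': 'CA'}]
pd.DataFrame(dv.transform(d_new).todense(), columns=dv.feature_names_)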

Precision

Goal

This post aims to introduce one of the model evaluation metrics, called the precision score. Precision measures how many of the samples predicted as positive are actually positive. The higher the precision, the more likely a positive prediction is to be correct.

The precision score is defined by the following equation:

$$ \text{Precision} = \frac{True\;Positive}{True\;Positive + False\;Positive} = \frac{True\;Positive}{total\;\#\;of\;samples\;predicted\;as\;True} $$
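For example, with 4 true positives and 2 false positives (the case in the data below), precision = 4 / (4 + 2) ≈ 0.67.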


Libraries

In [2]:
from sklearn.metrics import precision_score
import pandas as pd

Create a prediction and ground truth

In [21]:
df_prediction = pd.DataFrame([0, 1, 0, 1, 1 ,1, 1, 1], 
                             columns=['prediction'])
df_prediction
Out[21]:
prediction
0 0
1 1
2 0
3 1
4 1
5 1
6 1
7 1
In [22]:
df_groundtruth = pd.DataFrame([0, 0, 0 , 0, 1, 1, 1, 1], 
                              columns=['gt'])
df_groundtruth
Out[22]:
gt
0 0
1 0
2 0
3 0
4 1
5 1
6 1
7 1

Compute Precision Score

In [24]:
precision_score(y_true=df_groundtruth, 
        y_pred=df_prediction, average='binary')
Out[24]:
0.6666666666666666

Double check using TP and FP

In [19]:
TP = (df_prediction.loc[df_prediction['prediction']==1,'prediction'] == df_groundtruth.loc[df_prediction['prediction']==1,'gt']).sum()
TP
Out[19]:
4
In [25]:
FP = (df_prediction.loc[df_prediction['prediction']==1,'prediction'] != df_groundtruth.loc[df_prediction['prediction']==1,'gt']).sum()
FP
Out[25]:
2
In [26]:
TP / (TP + FP)
Out[26]:
0.6666666666666666

Variance Thresholding For Feature Selection

Goal

This post aims to introduce how to conduct feature selection by variance thresholding.


Libraries

In [26]:
from sklearn.feature_selection import VarianceThreshold
from sklearn.datasets import load_boston
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Load Boston housing data

In [4]:
boston = load_boston()
df_boston = pd.DataFrame(boston.data, columns=boston.feature_names)
df_boston.head()
Out[4]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33

Add low variance columns

In [14]:
# for reference: the variance of a Uniform(0, 1) sample is about 1/12 ≈ 0.083
np.var(np.random.random(size=df_boston.shape[0]))
Out[14]:
0.08309365785384086
In [16]:
df_boston['low_variance'] = 1  # constant column: zero variance
df_boston['low_variance2'] = (np.random.random(size=df_boston.shape[0]) > 0.5) * 1  # binary column: variance ~0.25
df_boston.head()
Out[16]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT low_variance low_variance2
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 1 0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 1 1
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 1 0
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 1 1
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 1 1

Variance Thresholding

In [50]:
variance_threshold = 0.1
selection = VarianceThreshold(threshold=variance_threshold)
selection.fit(df_boston)
Out[50]:
VarianceThreshold(threshold=0.1)
In [51]:
ax = pd.Series(selection.variances_, index=df_boston.columns).plot(kind='bar', logy=True);
ax.axhline(variance_threshold, ls='dotted', c='r');
In [52]:
df_boston_selected = pd.DataFrame(selection.transform(df_boston), columns=df_boston.columns[selection.get_support()])
df_boston_selected.head()
Out[52]:
CRIM ZN INDUS RM AGE DIS RAD TAX PTRATIO B LSTAT low_variance2
0 0.00632 18.0 2.31 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 0.0
1 0.02731 0.0 7.07 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 1.0
2 0.02729 0.0 7.07 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 0.0
3 0.03237 0.0 2.18 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 1.0
4 0.06905 0.0 2.18 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 1.0
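The same selection can be reproduced by hand with pandas as a sanity check. Note that VarianceThreshold uses the population variance (ddof=0), while pandas' .var() defaults to ddof=1, so pass ddof=0 to match:

# keep only the columns whose population variance exceeds the threshold
variances = df_boston.var(ddof=0)
df_manual = df_boston.loc[:, variances > variance_threshold]
df_manual.head()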

F1 Score

Goal

This post aims to introduce one of the model evaluation metrics, called the F1 score. The F1 score measures overall model performance as the harmonic mean of precision and recall: the higher the F1 score, the better the model performs in general.

The F1 score is defined by the following equation:

$$ F_1 = 2 \cdot \frac{precision \cdot recall}{precision + recall} $$
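For example, if precision and recall are both 0.5, then F1 = 2 · (0.5 · 0.5) / (0.5 + 0.5) = 0.5, which is exactly the case verified at the end of this post.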


Libraries

In [10]:
from sklearn.metrics import f1_score, precision_score, recall_score
import pandas as pd

Create a prediction and ground truth

In [7]:
df_prediction = pd.DataFrame([0, 1, 0, 1, 0 ,1, 0, 1], 
                             columns=['prediction'])
df_prediction
Out[7]:
prediction
0 0
1 1
2 0
3 1
4 0
5 1
6 0
7 1
In [8]:
df_groundtruth = pd.DataFrame([0, 0, 0 , 0, 1, 1, 1, 1], 
                              columns=['gt'])
df_groundtruth
Out[8]:
gt
0 0
1 0
2 0
3 0
4 1
5 1
6 1
7 1

Compute F1 Score

In [9]:
f1_score(y_true=df_groundtruth, 
        y_pred=df_prediction)
Out[9]:
0.5

Double check using precision and recall

In [11]:
precision_score(y_true=df_groundtruth, 
        y_pred=df_prediction)
Out[11]:
0.5
In [12]:
recall_score(y_true=df_groundtruth, 
        y_pred=df_prediction)
Out[12]:
0.5
In [13]:
2 * (0.5 * 0.5) / (0.5 + 0.5)
Out[13]:
0.5

Select Date And Time Ranges

Goal

This post aims to introduce how to select a subset of the pandas dataframe by selecting a date range.

Libraries

In [2]:
import pandas as pd
import numpy as np

Create a dataframe

In [18]:
date_ranges = pd.date_range('20190101', '20191231', freq='d')
df_rand = pd.DataFrame({'date': date_ranges, 
                       'value': np.random.random(date_ranges.shape[0])})
df_rand.head()
Out[18]:
date value
0 2019-01-01 0.332090
1 2019-01-02 0.690167
2 2019-01-03 0.237744
3 2019-01-04 0.060678
4 2019-01-05 0.572691

Select a range using .between

In [19]:
df_rand.loc[df_rand.date.between('20190201', '20190211'),:]
Out[19]:
date value
31 2019-02-01 0.449901
32 2019-02-02 0.803429
33 2019-02-03 0.299074
34 2019-02-04 0.630970
35 2019-02-05 0.294973
36 2019-02-06 0.510857
37 2019-02-07 0.345567
38 2019-02-08 0.877957
39 2019-02-09 0.990186
40 2019-02-10 0.000186
41 2019-02-11 0.378379
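As an alternative (not used above), the date column can be set as the index, after which .loc accepts date-string slices directly; a minimal sketch:

# equivalent selection via a DatetimeIndex slice (both endpoints inclusive)
df_rand.set_index('date').loc['2019-02-01':'2019-02-11']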

Convert Strings To Dates

Goal

This post aims to introduce how to convert strings to dates using pandas.

Libraries

In [1]:
import pandas as pd

Dates as strings

In [3]:
df_date = pd.DataFrame({'date':['20190101', '20190102', '20190105'], 
                       'temperature': [23.5, 32, 25]})
df_date
Out[3]:
date temperature
0 20190101 23.5
1 20190102 32.0
2 20190105 25.0

Convert strings to date format

In [4]:
pd.to_datetime(df_date['date'])
Out[4]:
0   2019-01-01
1   2019-01-02
2   2019-01-05
Name: date, dtype: datetime64[ns]
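If the string format is known up front, it can be passed explicitly, which avoids ambiguity and is typically faster; a small sketch on the same dataframe:

# parse with an explicit format instead of letting pandas infer it
pd.to_datetime(df_date['date'], format='%Y%m%d')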

Tokenize Text

Goal

This post aims to introduce how to tokenize text using nltk.


Libraries

In [5]:
from nltk.tokenize import sent_tokenize, word_tokenize
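The tokenizers rely on the pretrained punkt models; if they are not installed yet, download them once beforehand:

import nltk
nltk.download('punkt')  # one-time download required by sent_tokenize / word_tokenize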

Create a paragraph

In [8]:
paragraph = "Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aims to help programmers write clear, logical code for small and large-scale projects"
paragraph
Out[8]:
"Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aims to help programmers write clear, logical code for small and large-scale projects"

Tokenize a paragraph into sentences

In [9]:
sent_tokenize(paragraph)
Out[9]:
['Python is an interpreted, high-level, general-purpose programming language.',
 "Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace.",
 'Its language constructs and object-oriented approach aims to help programmers write clear, logical code for small and large-scale projects']

Tokenize a paragraph into words

In [10]:
word_tokenize(paragraph)
Out[10]:
['Python',
 'is',
 'an',
 'interpreted',
 ',',
 'high-level',
 ',',
 'general-purpose',
 'programming',
 'language',
 '.',
 'Created',
 'by',
 'Guido',
 'van',
 'Rossum',
 'and',
 'first',
 'released',
 'in',
 '1991',
 ',',
 'Python',
 "'s",
 'design',
 'philosophy',
 'emphasizes',
 'code',
 'readability',
 'with',
 'its',
 'notable',
 'use',
 'of',
 'significant',
 'whitespace',
 '.',
 'Its',
 'language',
 'constructs',
 'and',
 'object-oriented',
 'approach',
 'aims',
 'to',
 'help',
 'programmers',
 'write',
 'clear',
 ',',
 'logical',
 'code',
 'for',
 'small',
 'and',
 'large-scale',
 'projects']