Lasso Regression
Goal¶
This post aims to introduce lasso regression using dummy data. This method would be more powerful when the dependency variables has correlation or multi co-linearity between them.
Reference
This post aims to introduce lasso regression using dummy data. This method would be more powerful when the dependency variables has correlation or multi co-linearity between them.
Reference
This post aims to introduce basic PyTorch operations e.g., addition, multiplication,
import numpy as np
import pandas as pd
import torch
t_x1 = torch.Tensor([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
t_x2 = torch.Tensor([[9, 8, 7],
[6, 5, 4],
[3, 2, 1]])
print(t_x1)
print(t_x2)
t_x1 + t_x2
This post aims to introduce (shallow) neural network for classification using scikit-learn
.
Reference
import pandas as pd
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
%matplotlib inline
breast_cancer = load_breast_cancer()
df_breast_cancer = pd.DataFrame(breast_cancer['data'], columns=breast_cancer['feature_names'])
df_breast_cancer['target'] = breast_cancer['target']
df_breast_cancer.head()
clf = MLPClassifier(solver='lbfgs', alpha=1e-5,
hidden_layer_sizes=(10,3,3), random_state=1)
cv_score = cross_val_score(clf,
X=df_breast_cancer.iloc[:, :-1],
y=df_breast_cancer['target'],
cv=5)
plt.plot(cv_score);
This post aims to introduce how to use pymc3
for Bayesian regression by showing the simplest single variable example.
Reference
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import pymc3 as pm
%matplotlib inline
To compare non-Bayesian linear regression, the way to generate data follows the one used in this post Linear Regression
\begin{equation*} \mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{b} + \mathbf{e} \end{equation*}Here $\mathbf{x}$ is a 1 dimension vector, $\mathbf{b}$ is a constant variable, $\mathbf{e}$ is white noise.
a = 10
b = 4
n = 100
sigma = 3
e = sigma * np.random.randn(n)
x = np.linspace(-1, 1, num=n)
y = a * x + b + e
plt.plot(x, y, '.', label='observed y');
plt.plot(x, a * x + b, 'r', label='true y');
plt.legend();
pymc
¶
In Bayesian world, the above formula is reformulated as below:
\begin{equation*} \mathbf{y} \sim \mathcal{N} (\mathbf{A}\mathbf{x} + \mathbf{b}, \sigma^2) \end{equation*}In this case, we regard $\mathbf{y}$ as a random variable following the Normal distribution defined by the mean $\mathbf{A}\mathbf{x} + \mathbf{b}$ and the variance $\sigma^2$.
model = pm.Model()
with model:
a_0 = pm.Normal('a_0', mu=1, sigma=10)
b_0 = pm.Normal('b_0', mu=1, sigma=10)
x_0 = pm.Normal('x_0', mu=0, sigma=1, observed=x)
mu_0 = a_0 * x_0 + b_0
sigma_0 = pm.HalfCauchy('sigma_0', beta=10)
y_0 = pm.Normal('y_0', mu=mu_0, sigma=sigma_0, observed=y)
trace = pm.sample(500)
pm.traceplot(trace);
pm.summary(trace)
reg = LinearRegression()
reg.fit(x.reshape(-1, 1), y);
print(f'Coefficients A: {reg.coef_[0]:.3}, Intercept b: {reg.intercept_:.2}')
plt.plot(x, y, '.', label='observed y', c='C0')
plt.plot(x, a * x + b, label='true y', lw=3., c='C3')
pm.plot_posterior_predictive_glm(trace, samples=30,
eval=x,
lm=lambda x, sample: sample['b_0'] + sample['a_0'] * x,
label='posterior predictive regression', c='C2')
plt.plot(x, reg.coef_[0] * x + reg.intercept_ , label='deterministic linear regression', ls='dotted',c='C1', lw=3)
plt.legend(loc=0);
plt.title('Posterior Predictive Regression Lines');
This post aims to introduce how to plot the data using matplotlib
in an XKCD style.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
x = np.linspace(-np.pi, np.pi, 100)
df_data = pd.DataFrame(data={'sin x': np.sin(x), 'cos x': np.cos(x)}, index=x)
df_data.head()
df_data.plot(title='Normal Matplotlib Style');
with plt.xkcd():
df_data.plot(title='XKCD Style');
This post aims to introduce how to convert a dictionary into a matrix using DictVectorizer from scikit-learn. This is useful when you have data stored in a list of a sparse dictionary format and would like to convert it into a feature vector digestable in a scikit-learn format.
Reference
from sklearn.feature_extraction import DictVectorizer
import pandas as pd
d_house= [{'area': 300.0, 'price': 1000, 'location': 'NY'},
{'area': 600.0, 'price': 2000, 'location': 'CA'},
{'price': 1500, 'location': 'CH'}
]
d_house
dv = DictVectorizer()
dv.fit(d_house)
pd.DataFrame(dv.fit_transform(d_house).todense(), columns=dv.feature_names_)
This post aims to introduce one of the model evaluation metrics, called Precision score. Precision score is used to measure the prediction ratio of how many of predictions were correct out of the total number of the predictions. As the precision score is higher, the prediction would be high likely true whenever such prediction is made.
Precision score is defined as the following equations:
$$ {\displaystyle {\text{Precision}}={\frac {True\;Positive}{True\;Positive + False\;Positive}}\,} = \frac {True \;Positive}{total\;\#\,of\;samples\;predicated\;as\;True} $$Reference
from sklearn.metrics import precision_score
import pandas as pd
df_prediction = pd.DataFrame([0, 1, 0, 1, 1 ,1, 1, 1],
columns=['prediction'])
df_prediction
df_groundtruth = pd.DataFrame([0, 0, 0 , 0, 1, 1, 1, 1],
columns=['gt'])
df_groundtruth
precision_score(y_true=df_groundtruth,
y_pred=df_prediction, average='binary')
TP = (df_prediction.loc[df_prediction['prediction']==1,'prediction'] == df_groundtruth.loc[df_prediction['prediction']==1,'gt']).sum()
TP
FP = (df_prediction.loc[df_prediction['prediction']==1,'prediction'] != df_groundtruth.loc[df_prediction['prediction']==1,'gt']).sum()
FP
TP / (TP + FP)
This post aims to introduce how to conduct feature selection by variance thresholding
Reference
from sklearn.feature_selection import VarianceThreshold
from sklearn.datasets import load_boston
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
boston = load_boston()
df_boston = pd.DataFrame(boston.data, columns=boston.feature_names)
df_boston.head()
np.var(np.random.random(size=df_boston.shape[0]))
df_boston['low_variance'] = 1
df_boston['low_variance2'] = (np.random.random(size=df_boston.shape[0]) > 0.5) * 1
df_boston.head()
variance_threshold = 0.1
selection = VarianceThreshold(threshold=variance_threshold)
selection.fit(df_boston)
ax = pd.Series(selection.variances_, index=df_boston.columns).plot(kind='bar', logy=True);
ax.axhline(variance_threshold, ls='dotted', c='r');
df_boston_selected = pd.DataFrame(selection.transform(df_boston), columns=df_boston.columns[selection.get_support()])
df_boston_selected.head()
This post aims to introduce one of the model evaluation metrics, called F1 score. F1 score is used to measure the overall model performance. As F1 score is higher, the model performance would be better in general.
F1 score is defined as the following equations:
$$ F_1 = 2 \cdot \frac{precision \cdot recall}{precision + recall} $$Reference
from sklearn.metrics import f1_score, precision_score, recall_score
import pandas as pd
df_prediction = pd.DataFrame([0, 1, 0, 1, 0 ,1, 0, 1],
columns=['prediction'])
df_prediction
df_groundtruth = pd.DataFrame([0, 0, 0 , 0, 1, 1, 1, 1],
columns=['gt'])
df_groundtruth
f1_score(y_true=df_groundtruth,
y_pred=df_prediction)
precision_score(y_true=df_groundtruth,
y_pred=df_prediction)
recall_score(y_true=df_groundtruth,
y_pred=df_prediction)
2 * (0.5 * 0.5) / (0.5 + 0.5)