Posts about Machine Learning (old posts, page 5)

Loading Audio Data

Goal

This post aims to introduce how to load WAV audio data as a NumPy array.

Reference

Libraries

In [7]:
import pandas as pd
import numpy as np
from scipy.io import wavfile
import matplotlib.pyplot as plt
%matplotlib inline

Load a file as a NumPy array

In [18]:
filename = '../data/sound00.wav'
fs, data = wavfile.read(filename)
data
Out[18]:
array([0, 0, 0, ..., 0, 0, 0], dtype=int16)

Visualize audio data

In [17]:
plt.figure(figsize=(16, 4))
plt.plot(data, lw=0.5, alpha=0.8);
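
As a small extension that is not in the original cells, the sampling rate fs returned by wavfile.read can be used to plot the waveform against time in seconds; a minimal sketch, assuming a mono signal:

# Build a time axis in seconds from the sampling rate (assumes mono data)
t = np.arange(len(data)) / fs
plt.figure(figsize=(16, 4))
plt.plot(t, data, lw=0.5, alpha=0.8)
plt.xlabel('Time [s]')
plt.ylabel('Amplitude (int16)');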

Anomaly Detection by PCA in PyOD

Goal

This post aims to introduce how to detect anomalies using PCA in PyOD.

Reference

Libraries

In [31]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# PyOD
from pyod.utils.data import generate_data, get_outliers_inliers
from pyod.models.pca import PCA
from pyod.utils.data import evaluate_print
from pyod.utils.example import visualize

Create the data

In [66]:
X_train, y_train = generate_data(behaviour='new', n_features=5, train_only=True)
df_train = pd.DataFrame(X_train)
df_train['y'] = y_train
In [50]:
df_train.head()
Out[50]:
0 1 2 3 4 y
0 5.475324 4.882372 5.337351 5.376340 4.104947 0.0
1 5.244566 5.626358 5.356578 4.341500 4.856838 0.0
2 4.597031 5.787669 5.959738 5.823086 6.012408 0.0
3 4.637728 4.639901 5.400144 6.074926 4.627883 0.0
4 4.639908 4.667926 6.077212 5.012901 3.718718 0.0
In [57]:
sns.scatterplot(x=0, y=1, hue='y', data=df_train);
plt.title('Ground Truth');

Train an unsupervised PCA

In [52]:
clf = PCA()
clf.fit(X_train)
Out[52]:
PCA(contamination=0.1, copy=True, iterated_power='auto', n_components=None,
  n_selected_components=None, random_state=None, standardization=True,
  svd_solver='auto', tol=0.0, weighted=True, whiten=False)

Evaluate training score

In [65]:
y_train_pred = clf.labels_
y_train_scores = clf.decision_scores_
sns.scatterplot(x=0, y=1, hue=y_train_scores, data=df_train, palette='RdBu_r');
plt.title('Anomaly Scores by PCA');
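
The evaluate_print helper imported above can also summarize the training performance (ROC and precision @ rank n) against the ground-truth labels; a minimal sketch using the variables defined in the cells above:

# Assumes y_train and y_train_scores from the cells above
evaluate_print('PCA', y_train, y_train_scores)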

Make Simulated Data For Anomaly Detection

Goal

This post aims to introduce how to make simulated data for anomaly detection using PyOD, which is an outlier detection package.

Reference

Libraries

In [58]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# PyOD
from pyod.utils.data import generate_data, get_outliers_inliers

Create an anomaly dataset

Create random data with 5 features

In [21]:
X_train, X_test, y_train, y_test = generate_data(behaviour='new', n_features=5)
df_tr = pd.DataFrame(X_train)
df_tr['y'] = y_train
df_te = pd.DataFrame(X_test)
df_te['y'] = y_test
In [22]:
df_tr.head()
Out[22]:
0 1 2 3 4 y
0 2.392715 3.084379 2.972580 2.907177 3.155727 0.0
1 3.185049 2.789920 2.648234 3.062398 2.673828 0.0
2 3.683184 3.169288 2.973224 2.725969 2.213359 0.0
3 2.928545 2.823802 2.888037 3.109228 2.813928 0.0
4 3.112898 3.365741 2.599102 3.090721 3.391458 0.0

Visualize the created anomaly data

In [57]:
axes = df_tr.plot(subplots=True, figsize=(16, 8), title='Simulated Anomaly Data for Training');
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
In [56]:
axes = df_te.plot(subplots=True, figsize=(16, 8), title='Simulated Anomaly Data for Test');
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
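
The get_outliers_inliers helper imported above splits a labeled set into its outlier and inlier rows; a minimal sketch using the training data created above:

# Assumes X_train and y_train from generate_data above
X_outliers, X_inliers = get_outliers_inliers(X_train, y_train)
print(X_outliers.shape, X_inliers.shape)  # roughly 10% outliers by default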

Drop Highly Correlated Features

Goal

This post aims to introduce how to drop highly correlated features.

Reference

Libraries

In [8]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_boston
import seaborn as sns

Create data with highly correlated variables

Load the Boston housing data

In [4]:
boston = load_boston()
df_boston = pd.DataFrame(boston.data, columns=boston.feature_names)
df_boston.head()
Out[4]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33

Add another correlated feature

In [6]:
df_boston['CRIM_correlated'] = df_boston['CRIM'] * 3 + 10 + np.random.random(df_boston.shape[0])
df_boston.head()
Out[6]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT CRIM_correlated
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 10.284178
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 10.102942
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 10.387687
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 10.607908
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 10.824663

Calculate correlation

In [7]:
df_corr = df_boston.corr()
df_corr.head()
Out[7]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT CRIM_correlated
CRIM 1.000000 -0.200469 0.406583 -0.055892 0.420972 -0.219247 0.352734 -0.379670 0.625505 0.582764 0.289946 -0.385064 0.455621 0.999937
ZN -0.200469 1.000000 -0.533828 -0.042697 -0.516604 0.311991 -0.569537 0.664408 -0.311948 -0.314563 -0.391679 0.175520 -0.412995 -0.200756
INDUS 0.406583 -0.533828 1.000000 0.062938 0.763651 -0.391676 0.644779 -0.708027 0.595129 0.720760 0.383248 -0.356977 0.603800 0.406720
CHAS -0.055892 -0.042697 0.062938 1.000000 0.091203 0.091251 0.086518 -0.099176 -0.007368 -0.035587 -0.121515 0.048788 -0.053929 -0.055514
NOX 0.420972 -0.516604 0.763651 0.091203 1.000000 -0.302188 0.731470 -0.769230 0.611441 0.668023 0.188933 -0.380051 0.590879 0.421744
In [10]:
sns.heatmap(df_corr);

Drop highly correlated features

In [35]:
threshold = 0.9


# Start with all columns selected, then de-select each column that is
# highly correlated (>= threshold) with an earlier, kept column
columns = np.full((df_corr.shape[0],), True, dtype=bool)
for i in range(df_corr.shape[0]):
    for j in range(i + 1, df_corr.shape[0]):
        if df_corr.iloc[i, j] >= threshold:
            columns[j] = False
selected_columns = df_boston.columns[columns]
selected_columns
df_boston = df_boston[selected_columns]
In [36]:
df_boston.head()
Out[36]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD PTRATIO B LSTAT
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 18.7 396.90 5.33
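
As an alternative to the loop above (not in the original post), the same selection can be written with an upper-triangular mask on the absolute correlation matrix, which also catches strong negative correlations; a sketch assuming df_boston still contains the extra CRIM_correlated column:

# Keep only the upper triangle of the absolute correlation matrix
corr = df_boston.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] >= threshold).any()]
df_reduced = df_boston.drop(columns=to_drop)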

Feature Importance

Goal

This post aims to introduce how to obtain feature importances using a random forest and visualize them in different formats.

Reference

Libraries

In [29]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_boston
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Configuration

In [69]:
import matplotlib as mpl
mpl.rcParams['figure.figsize'] = (16, 6)

Load data

In [3]:
boston = load_boston()

df_boston = pd.DataFrame(data=boston.data, columns=boston.feature_names)
df_boston.head()
Out[3]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33

Train a Random Forest Regressor

In [56]:
reg = RandomForestRegressor(n_estimators=50)
reg.fit(df_boston, boston.target)
Out[56]:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

Obtain feature importance

Average feature importance

In [70]:
df_feature_importance = pd.DataFrame(
    reg.feature_importances_,
    index=boston.feature_names,
    columns=['feature importance']).sort_values('feature importance', ascending=False)
df_feature_importance
Out[70]:
feature importance
RM 0.434691
LSTAT 0.362675
DIS 0.065282
CRIM 0.048311
NOX 0.024685
PTRATIO 0.018163
TAX 0.012388
AGE 0.011825
B 0.010220
INDUS 0.006348
RAD 0.002961
ZN 0.001503
CHAS 0.000950

All feature importances for each tree

In [58]:
df_feature_all = pd.DataFrame([tree.feature_importances_ for tree in reg.estimators_], columns=boston.feature_names)
df_feature_all.head()
Out[58]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 0.014397 0.000270 0.000067 0.001098 0.030470 0.160704 0.005805 0.040896 0.000915 0.009357 0.006712 0.008223 0.721085
1 0.027748 0.000151 0.004632 0.000844 0.079595 0.290730 0.020392 0.055907 0.012544 0.011589 0.018765 0.006700 0.470404
2 0.082172 0.000353 0.003930 0.002729 0.009873 0.182772 0.009487 0.053868 0.002023 0.014475 0.025605 0.004799 0.607914
3 0.020085 0.000592 0.006886 0.001462 0.016882 0.290993 0.007097 0.074538 0.001960 0.003679 0.012879 0.011265 0.551682
4 0.012873 0.001554 0.003002 0.000521 0.013372 0.251145 0.010757 0.110498 0.002889 0.007838 0.009357 0.027501 0.548694
In [97]:
# Melt the data into long format
df_feature_long = pd.melt(df_feature_all, var_name='feature name', value_name='values')

Visualize feature importance

The feature importance is visualized in the following formats:

  • Bar chart
  • Box plot
  • Strip plot
  • Swarm plot
  • Factor plot

Bar chart

In [71]:
df_feature_importance.plot(kind='bar');
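
As a variation that is not in the original post, the spread of the per-tree importances (df_feature_all, computed above) can be shown as error bars on the same bar chart:

# Error bars from the standard deviation of per-tree importances
yerr = df_feature_all.std().reindex(df_feature_importance.index)
df_feature_importance['feature importance'].plot(kind='bar', yerr=yerr);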

Box plot

In [98]:
sns.boxplot(x="feature name", y="values", data=df_feature_long, order=df_feature_importance.index);

Strip plot

In [99]:
sns.stripplot(x="feature name", y="values", data=df_feature_long, order=df_feature_importance.index);

Swarm plot

In [78]:
sns.swarmplot(x="feature name", y="values", data=df_feature_long, order=df_feature_importance.index);

All

In [108]:
fig, axes = plt.subplots(4, 1, figsize=(16, 8))
df_feature_importance.plot(kind='bar', ax=axes[0], title='Plots Comparison for Feature Importance');
sns.boxplot(ax=axes[1], x="feature name", y="values", data=df_feature_long, order=df_feature_importance.index);
sns.stripplot(ax=axes[2], x="feature name", y="values", data=df_feature_long, order=df_feature_importance.index);
sns.swarmplot(ax=axes[3], x="feature name", y="values", data=df_feature_long, order=df_feature_importance.index);
plt.tight_layout()

Save Images

Goal

This post aims to introduce how to save images using matplotlib.

Reference

Libraries

In [2]:
import numpy as np
import matplotlib.pyplot as plt

Create an image to save

In [6]:
img = np.random.randint(0, 255, (64, 64))
img
Out[6]:
array([[216, 200, 219, ...,  82, 176, 244],
       [ 17,  90,  86, ...,  91, 234, 195],
       [ 34, 226, 103, ..., 230,  86, 127],
       ...,
       [191, 110,  33, ...,  62, 109,  26],
       [ 43, 238, 208, ...,  51,   0, 123],
       [163, 156, 235, ..., 212, 188,  25]])

Show an image

In [9]:
plt.imshow(img);
plt.axis('off');

Save the image as a PNG file

In [33]:
plt.savefig('../images/savefig_sample_image.png')
<Figure size 432x288 with 0 Axes>
In [36]:
!ls ../images | grep image.png
savefig_sample_image.png
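
Note that plt.savefig writes the currently active figure, so calling it in its own cell (as above) creates a new, empty figure; that is what the "<Figure size 432x288 with 0 Axes>" output indicates, and the saved file ends up blank. A minimal sketch that draws and saves in the same cell (the file name is just an example):

# Draw and save in one step so the saved figure actually contains the image
plt.imshow(img)
plt.axis('off')
plt.savefig('../images/savefig_sample_image.png', bbox_inches='tight')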

Train an image classifier using PyTorch

Goal

This post aims to introduce how to train an image classifier for the MNIST dataset using PyTorch.

Reference

Libraries

In [8]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Dataset
from sklearn.datasets import load_digits

# PyTorch
import torch 
import torchvision
import torchvision.transforms as transforms

import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

Functions

In [32]:
def imshow(img):
    img = img / 2 + 0.5 # unnormalize
    npimg = img.numpy()
    plt.imshow(np.transpose(npimg, (1, 2, 0)))
    plt.axis('off')
    plt.show()

Load MNIST dataset

When downloading the image dataset, we also need to define a transform that normalizes the pixel values from [0, 1] to [-1, +1].

In [50]:
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, ), (0.5, ))])

trainset = torchvision.datasets.MNIST(root='~/data', 
                                        train=True,
                                        download=True,
                                        transform=transform)

testset = torchvision.datasets.MNIST(root='~/data', 
                                        train=False, 
                                        download=True, 
                                        transform=transform)
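
To sanity-check the normalization described above (this check is not in the original post), one sample from trainset can be inspected; its pixel values should now lie in [-1, +1]:

# Assumes trainset from the cell above
img, label = trainset[0]
print(img.shape, img.min().item(), img.max().item())  # expected range: -1.0 to 1.0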

Create a dataloader

In [13]:
trainloader = torch.utils.data.DataLoader(trainset,
                                          batch_size=100,
                                          shuffle=True,
                                          num_workers=2)
In [14]:
testloader = torch.utils.data.DataLoader(testset,
                                         batch_size=100,
                                         shuffle=False,
                                         num_workers=2)

Define a model

In [6]:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3)  # 28x28x1 -> 26x26x32
        self.conv2 = nn.Conv2d(32, 64, 3)  # 26x26x32 -> 24x24x64
        self.pool = nn.MaxPool2d(2, 2)  # 24x24x64 -> 12x12x64
        self.dropout1 = nn.Dropout2d()
        self.fc1 = nn.Linear(12 * 12 * 64, 128)
        self.dropout2 = nn.Dropout2d()
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = self.pool(F.relu(self.conv2(x)))
        x = self.dropout1(x)
        x = x.view(-1, 12 * 12 * 64)
        x = F.relu(self.fc1(x))
        x = self.dropout2(x)
        x = self.fc2(x)
        return x
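
As a quick sanity check that is not part of the original post, a dummy batch can be passed through the network to confirm the tensor shapes implied by the comments above:

# Assumes the Net class defined above
net = Net()
with torch.no_grad():
    out = net(torch.zeros(1, 1, 28, 28))  # one dummy 28x28 grayscale image
print(out.shape)  # expected: torch.Size([1, 10])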

Create a loss function and optimizer

In [10]:
model = Net()
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

Training a model

In [22]:
epochs = 5
for epoch in range(epochs):
    running_loss = 0.0
    for i, (inputs, labels) in enumerate(trainloader, 0):
        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        if i % 100 == 99:
            print(f'[{epoch + 1}, {i+1}] loss: {running_loss / 100:.2}')
            running_loss = 0.0

print('Finished Training')
[1, 100] loss: 1.5
[1, 200] loss: 0.82
[1, 300] loss: 0.63
[1, 400] loss: 0.55
[1, 500] loss: 0.5
[1, 600] loss: 0.46
[2, 100] loss: 0.42
[2, 200] loss: 0.4
[2, 300] loss: 0.38
[2, 400] loss: 0.38
[2, 500] loss: 0.34
[2, 600] loss: 0.34
[3, 100] loss: 0.31
[3, 200] loss: 0.31
[3, 300] loss: 0.29
[3, 400] loss: 0.28
[3, 500] loss: 0.28
[3, 600] loss: 0.26
[4, 100] loss: 0.24
[4, 200] loss: 0.24
[4, 300] loss: 0.24
[4, 400] loss: 0.23
[4, 500] loss: 0.23
[4, 600] loss: 0.22
[5, 100] loss: 0.2
[5, 200] loss: 0.21
[5, 300] loss: 0.2
[5, 400] loss: 0.19
[5, 500] loss: 0.19
[5, 600] loss: 0.19
Finished Training

Test

In [37]:
dataiter = iter(testloader)
images, labels = next(dataiter)
outputs = model(images)
_, predicted = torch.max(outputs, 1)
In [48]:
n_test = 10
df_result = pd.DataFrame({
    'Ground Truth': labels[:n_test],
    'Predicted label': predicted[:n_test]})
display(df_result.T)
imshow(torchvision.utils.make_grid(images[:n_test, :, :, :], nrow=n_test))
0 1 2 3 4 5 6 7 8 9
Ground Truth 7 2 1 0 4 1 4 9 5 9
Predicted label 7 2 1 0 4 1 4 9 6 9
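
As a follow-up that is not in the original post, the overall accuracy on the full test set can be computed by iterating over testloader; a minimal sketch using the trained model from above:

# Count correct predictions over the whole test set
correct, total = 0, 0
model.eval()
with torch.no_grad():
    for images, labels in testloader:
        outputs = model(images)
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
print(f'Test accuracy: {correct / total:.3f}')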