Posts about Machine Learning (old posts, page 5)

Loading Audio Data

Goal

This post aims to introduce how to load WAV audio data as a NumPy array.

Reference

Libraries

In [7]:
import pandas as pd
import numpy as np
from scipy.io import wavfile
import matplotlib.pyplot as plt
%matplotlib inline

Load a file as a NumPy array

In [18]:
filename = '../data/sound00.wav'
fs, data = wavfile.read(filename)
data
Out[18]:
array([0, 0, 0, ..., 0, 0, 0], dtype=int16)

Visualize audio data

In [17]:
plt.figure(figsize=(16, 4))
plt.plot(data, lw=0.5, alpha=0.8);
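
As a small extension that is not in the original cells, the sampling rate fs returned by wavfile.read can be used to plot the waveform against time in seconds; a minimal sketch, assuming a mono signal:

# Build a time axis in seconds from the sampling rate (assumes mono data)
t = np.arange(len(data)) / fs
plt.figure(figsize=(16, 4))
plt.plot(t, data, lw=0.5, alpha=0.8)
plt.xlabel('Time [s]')
plt.ylabel('Amplitude (int16)');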

Anomaly Detection by PCA in PyOD

Goal

This post aims to introduce how to detect anomalies using PCA in PyOD.

Reference

Libraries

In [31]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# PyOD
from pyod.utils.data import generate_data, get_outliers_inliers
from pyod.models.pca import PCA
from pyod.utils.data import evaluate_print
from pyod.utils.example import visualize

Create the data

In [66]:
X_train, y_train = generate_data(behaviour='new', n_features=5, train_only=True)
df_train = pd.DataFrame(X_train)
df_train['y'] = y_train
In [50]:
df_train.head()
Out[50]:
0 1 2 3 4 y
0 5.475324 4.882372 5.337351 5.376340 4.104947 0.0
1 5.244566 5.626358 5.356578 4.341500 4.856838 0.0
2 4.597031 5.787669 5.959738 5.823086 6.012408 0.0
3 4.637728 4.639901 5.400144 6.074926 4.627883 0.0
4 4.639908 4.667926 6.077212 5.012901 3.718718 0.0
In [57]:
sns.scatterplot(x=0, y=1, hue='y', data=df_train);
plt.title('Ground Truth');

Train an unsupervised PCA

In [52]:
clf = PCA()
clf.fit(X_train)
Out[52]:
PCA(contamination=0.1, copy=True, iterated_power='auto', n_components=None,
  n_selected_components=None, random_state=None, standardization=True,
  svd_solver='auto', tol=0.0, weighted=True, whiten=False)

Evaluate training score

In [65]:
y_train_pred = clf.labels_
y_train_scores = clf.decision_scores_
sns.scatterplot(x=0, y=1, hue=y_train_scores, data=df_train, palette='RdBu_r');
plt.title('Anomaly Scores by PCA');
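
The evaluate_print helper imported above can also summarize the training performance (ROC and precision @ rank n) against the ground-truth labels; a minimal sketch using the variables defined in the cells above:

# Assumes y_train and y_train_scores from the cells above
evaluate_print('PCA', y_train, y_train_scores)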

Make Simulated Data For Anomaly Detection

Goal

This post aims to introduce how to make simulated data for anomaly detection using PyOD, which is an outlier detection package.

Reference

Libraries

In [58]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# PyOD
from pyod.utils.data import generate_data, get_outliers_inliers

Create an anomaly dataset

Create random data with 5 features

In [21]:
X_train, X_test, y_train, y_test = generate_data(behaviour='new', n_features=5)
df_tr = pd.DataFrame(X_train)
df_tr['y'] = y_train
df_te = pd.DataFrame(X_test)
df_te['y'] = y_test
In [22]:
df_tr.head()
Out[22]:
0 1 2 3 4 y
0 2.392715 3.084379 2.972580 2.907177 3.155727 0.0
1 3.185049 2.789920 2.648234 3.062398 2.673828 0.0
2 3.683184 3.169288 2.973224 2.725969 2.213359 0.0
3 2.928545 2.823802 2.888037 3.109228 2.813928 0.0
4 3.112898 3.365741 2.599102 3.090721 3.391458 0.0

Visualize the created anomaly data

In [57]:
axes = df_tr.plot(subplots=True, figsize=(16, 8), title='Simulated Anomaly Data for Training');
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
In [56]:
axes = df_te.plot(subplots=True, figsize=(16, 8), title='Simulated Anomaly Data for Test');
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
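
The get_outliers_inliers helper imported above splits a labeled set into its outlier and inlier rows; a minimal sketch using the training data created above:

# Assumes X_train and y_train from generate_data above
X_outliers, X_inliers = get_outliers_inliers(X_train, y_train)
print(X_outliers.shape, X_inliers.shape)  # roughly 10% outliers by default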

Drop Highly Correlated Features

Goal

This post aims to introduce how to drop highly correlated features.

Reference

Libraries

In [8]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_boston
import seaborn as sns

Create data with highly correlated variables

Load the Boston housing data

In [4]:
boston = load_boston()
df_boston = pd.DataFrame(boston.data, columns=boston.feature_names)
df_boston.head()
Out[4]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33

Add another correlated feature

In [6]:
df_boston['CRIM_correlated'] = df_boston['CRIM'] * 3 + 10 + np.random.random(df_boston.shape[0])
df_boston.head()
Out[6]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT CRIM_correlated
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 10.284178
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 10.102942
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 10.387687
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 10.607908
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 10.824663

Calculate correlation

In [7]:
df_corr = df_boston.corr()
df_corr.head()
Out[7]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT CRIM_correlated
CRIM 1.000000 -0.200469 0.406583 -0.055892 0.420972 -0.219247 0.352734 -0.379670 0.625505 0.582764 0.289946 -0.385064 0.455621 0.999937
ZN -0.200469 1.000000 -0.533828 -0.042697 -0.516604 0.311991 -0.569537 0.664408 -0.311948 -0.314563 -0.391679 0.175520 -0.412995 -0.200756
INDUS 0.406583 -0.533828 1.000000 0.062938 0.763651 -0.391676 0.644779 -0.708027 0.595129 0.720760 0.383248 -0.356977 0.603800 0.406720
CHAS -0.055892 -0.042697 0.062938 1.000000 0.091203 0.091251 0.086518 -0.099176 -0.007368 -0.035587 -0.121515 0.048788 -0.053929 -0.055514
NOX 0.420972 -0.516604 0.763651 0.091203 1.000000 -0.302188 0.731470 -0.769230 0.611441 0.668023 0.188933 -0.380051 0.590879 0.421744
In [10]:
sns.heatmap(df_corr);

Drop highly correlated features

In [35]:
threshold = 0.9


# Start with all columns selected, then de-select each column that is
# highly correlated (>= threshold) with an earlier, kept column
columns = np.full((df_corr.shape[0],), True, dtype=bool)
for i in range(df_corr.shape[0]):
    for j in range(i + 1, df_corr.shape[0]):
        if df_corr.iloc[i, j] >= threshold:
            columns[j] = False
selected_columns = df_boston.columns[columns]
selected_columns
df_boston = df_boston[selected_columns]
In [36]:
df_boston.head()
Out[36]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD PTRATIO B LSTAT
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 18.7 396.90 5.33
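
As an alternative to the loop above (not in the original post), the same selection can be written with an upper-triangular mask on the absolute correlation matrix, which also catches strong negative correlations; a sketch assuming df_boston still contains the extra CRIM_correlated column:

# Keep only the upper triangle of the absolute correlation matrix
corr = df_boston.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] >= threshold).any()]
df_reduced = df_boston.drop(columns=to_drop)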

Feature Importance

Goal

This post aims to introduce how to obtain feature importances using a random forest and visualize them in different formats.

Reference

Libraries

In [29]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_boston
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Configuration

In [69]:
import matplotlib as mpl
mpl.rcParams['figure.figsize'] = (16, 6)

Load data

In [3]:
boston = load_boston()

df_boston = pd.DataFrame(data=boston.data, columns=boston.feature_names)
df_boston.head()
Out[3]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33

Train a Random Forest Regressor

In [56]:
reg = RandomForestRegressor(n_estimators=50)
reg.fit(df_boston, boston.target)
Out[56]:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

Obtain feature importance

Average feature importance

In [70]:
df_feature_importance = pd.DataFrame(
    reg.feature_importances_,
    index=boston.feature_names,
    columns=['feature importance']).sort_values('feature importance', ascending=False)
df_feature_importance
Out[70]:
feature importance
RM 0.434691
LSTAT 0.362675
DIS 0.065282
CRIM 0.048311
NOX 0.024685
PTRATIO 0.018163
TAX 0.012388
AGE 0.011825
B 0.010220
INDUS 0.006348
RAD 0.002961
ZN 0.001503
CHAS 0.000950

All feature importances for each tree

In [58]:
df_feature_all = pd.DataFrame([tree.feature_importances_ for tree in reg.estimators_], columns=boston.feature_names)
df_feature_all.head()
Out[58]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 0.014397 0.000270 0.000067 0.001098 0.030470 0.160704 0.005805 0.040896 0.000915 0.009357 0.006712 0.008223 0.721085
1 0.027748 0.000151 0.004632 0.000844 0.079595 0.290730 0.020392 0.055907 0.012544 0.011589 0.018765 0.006700 0.470404
2 0.082172 0.000353 0.003930 0.002729 0.009873 0.182772 0.009487 0.053868 0.002023 0.014475 0.025605 0.004799 0.607914
3 0.020085 0.000592 0.006886 0.001462 0.016882 0.290993 0.007097 0.074538 0.001960 0.003679 0.012879 0.011265 0.551682
4 0.012873 0.001554 0.003002 0.000521 0.013372 0.251145 0.010757 0.110498 0.002889 0.007838 0.009357 0.027501 0.548694
In [97]:
# Melt the data into long format
df_feature_long = pd.melt(df_feature_all, var_name='feature name', value_name='values')

Visualize feature importance

The feature importance is visualized in the following formats:

  • Bar chart
  • Box plot
  • Strip plot
  • Swarm plot
  • Factor plot

Bar chart

In [71]:
df_feature_importance.plot(kind='bar');
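
As a variation that is not in the original post, the spread of the per-tree importances (df_feature_all, computed above) can be shown as error bars on the same bar chart:

# Error bars from the standard deviation of per-tree importances
yerr = df_feature_all.std().reindex(df_feature_importance.index)
df_feature_importance['feature importance'].plot(kind='bar', yerr=yerr);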

Box plot

In [98]:
sns.boxplot(x="feature name", y="values", data=df_feature_long, order=df_feature_importance.index);

Strip plot

In [99]:
sns.stripplot(x="feature name", y="values", data=df_feature_long, order=df_feature_importance.index);

Swarm plot

In [78]:
sns.swarmplot(x="feature name", y="values", data=df_feature_long, order=df_feature_importance.index);

All

In [108]:
fig, axes = plt.subplots(4, 1, figsize=(16, 8))
df_feature_importance.plot(kind='bar', ax=axes[0], title='Plots Comparison for Feature Importance');
sns.boxplot(ax=axes[1], x="feature name", y="values", data=df_feature_long, order=df_feature_importance.index);
sns.stripplot(ax=axes[2], x="feature name", y="values", data=df_feature_long, order=df_feature_importance.index);
sns.swarmplot(ax=axes[3], x="feature name", y="values", data=df_feature_long, order=df_feature_importance.index);
plt.tight_layout()

Save Images

Goal

This post aims to introduce how to save images using matplotlib.

Reference

Libraries

In [2]:
import numpy as np
import matplotlib.pyplot as plt

Create an image to save

In [6]:
img = np.random.randint(0, 255, (64, 64))
img
Out[6]:
array([[216, 200, 219, ...,  82, 176, 244],
       [ 17,  90,  86, ...,  91, 234, 195],
       [ 34, 226, 103, ..., 230,  86, 127],
       ...,
       [191, 110,  33, ...,  62, 109,  26],
       [ 43, 238, 208, ...,  51,   0, 123],
       [163, 156, 235, ..., 212, 188,  25]])

Show an image

In [9]:
plt.imshow(img);
plt.axis('off');

Save the image as a PNG file

In [33]:
plt.savefig('../images/savefig_sample_image.png')
<Figure size 432x288 with 0 Axes>
In [36]:
!ls ../images | grep image.png
savefig_sample_image.png
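
Note that plt.savefig writes the currently active figure, so calling it in its own cell (as above) creates a new, empty figure; that is what the "<Figure size 432x288 with 0 Axes>" output indicates, and the saved file ends up blank. A minimal sketch that draws and saves in the same cell (the file name is just an example):

# Draw and save in one step so the saved figure actually contains the image
plt.imshow(img)
plt.axis('off')
plt.savefig('../images/savefig_sample_image.png', bbox_inches='tight')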

Train an image classifier using PyTorch

Goal

This post aims to introduce how to train an image classifier for the MNIST dataset using PyTorch.

Reference

Libraries

In [8]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Dataset
from sklearn.datasets import load_digits

# PyTorch
import torch 
import torchvision
import torchvision.transforms as transforms

import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

Functions

In [32]:
def imshow(img):
    img = img / 2 + 0.5 # unnormalize
    npimg = img.numpy()
    plt.imshow(np.transpose(npimg, (1, 2, 0)))
    plt.axis('off')
    plt.show()

Load MNIST dataset

When downloading the image dataset, we also need to define a transform that normalizes the pixel values from [0, 1] to [-1, +1].

In [50]:
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, ), (0.5, ))])

trainset = torchvision.datasets.MNIST(root='~/data', 
                                        train=True,
                                        download=True,
                                        transform=transform)

testset = torchvision.datasets.MNIST(root='~/data', 
                                        train=False, 
                                        download=True, 
                                        transform=transform)
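
To sanity-check the normalization described above (this check is not in the original post), one sample from trainset can be inspected; its pixel values should now lie in [-1, +1]:

# Assumes trainset from the cell above
img, label = trainset[0]
print(img.shape, img.min().item(), img.max().item())  # expected range: -1.0 to 1.0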

Create a dataloader

In [13]:
trainloader = torch.utils.data.DataLoader(trainset,
                                          batch_size=100,
                                          shuffle=True,
                                          num_workers=2)
In [14]:
testloader = torch.utils.data.DataLoader(testset,
                                         batch_size=100,
                                         shuffle=False,
                                         num_workers=2)

Define a model

In [6]:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3)  # 28x28x1 -> 26x26x32
        self.conv2 = nn.Conv2d(32, 64, 3)  # 26x26x32 -> 24x24x64
        self.pool = nn.MaxPool2d(2, 2)  # 24x24x64 -> 12x12x64
        self.dropout1 = nn.Dropout2d()
        self.fc1 = nn.Linear(12 * 12 * 64, 128)
        self.dropout2 = nn.Dropout2d()
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = self.pool(F.relu(self.conv2(x)))
        x = self.dropout1(x)
        x = x.view(-1, 12 * 12 * 64)
        x = F.relu(self.fc1(x))
        x = self.dropout2(x)
        x = self.fc2(x)
        return x
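
As a quick sanity check that is not part of the original post, a dummy batch can be passed through the network to confirm the tensor shapes implied by the comments above:

# Assumes the Net class defined above
net = Net()
with torch.no_grad():
    out = net(torch.zeros(1, 1, 28, 28))  # one dummy 28x28 grayscale image
print(out.shape)  # expected: torch.Size([1, 10])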

Create a loss function and optimizer

In [10]:
model = Net()
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

Training a model

In [22]:
epochs = 5
for epoch in range(epochs):
    running_loss = 0.0
    for i, (inputs, labels) in enumerate(trainloader, 0):
        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        if i % 100 == 99:
            print(f'[{epoch + 1}, {i+1}] loss: {running_loss / 100:.2}')
            running_loss = 0.0

print('Finished Training')
[1, 100] loss: 1.5
[1, 200] loss: 0.82
[1, 300] loss: 0.63
[1, 400] loss: 0.55
[1, 500] loss: 0.5
[1, 600] loss: 0.46
[2, 100] loss: 0.42
[2, 200] loss: 0.4
[2, 300] loss: 0.38
[2, 400] loss: 0.38
[2, 500] loss: 0.34
[2, 600] loss: 0.34
[3, 100] loss: 0.31
[3, 200] loss: 0.31
[3, 300] loss: 0.29
[3, 400] loss: 0.28
[3, 500] loss: 0.28
[3, 600] loss: 0.26
[4, 100] loss: 0.24
[4, 200] loss: 0.24
[4, 300] loss: 0.24
[4, 400] loss: 0.23
[4, 500] loss: 0.23
[4, 600] loss: 0.22
[5, 100] loss: 0.2
[5, 200] loss: 0.21
[5, 300] loss: 0.2
[5, 400] loss: 0.19
[5, 500] loss: 0.19
[5, 600] loss: 0.19
Finished Training

Test

In [37]:
dataiter = iter(testloader)
images, labels = next(dataiter)
outputs = model(images)
_, predicted = torch.max(outputs, 1)
In [48]:
n_test = 10
df_result = pd.DataFrame({
    'Ground Truth': labels[:n_test],
    'Predicted label': predicted[:n_test]})
display(df_result.T)
imshow(torchvision.utils.make_grid(images[:n_test, :, :, :], nrow=n_test))
0 1 2 3 4 5 6 7 8 9
Ground Truth 7 2 1 0 4 1 4 9 5 9
Predicted label 7 2 1 0 4 1 4 9 6 9
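
As a follow-up that is not in the original post, the overall accuracy on the full test set can be computed by iterating over testloader; a minimal sketch using the trained model from above:

# Count correct predictions over the whole test set
correct, total = 0, 0
model.eval()
with torch.no_grad():
    for images, labels in testloader:
        outputs = model(images)
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
print(f'Test accuracy: {correct / total:.3f}')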