How to Develop a 1D Generative Adversarial Network From Scratch in PyTorch (Part 1)

Goal

This post is inspired by the blog post "Machine Learning Mastery - How to Develop a 1D Generative Adversarial Network From Scratch in Keras" by Jason Brownlee, PhD. To work through the concept step by step, I will implement the same idea in PyTorch.

This post will cover the following:

Part 1:

  • Select a One-Dimensional Function
  • Define a Discriminator Model

Reference

Libraries

In [33]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# PyTorch
import torch
from torch import nn
from torch import optim
from torchviz import make_dot

Create a target 1-D function

In [3]:
def f(x):
    return x ** 2
In [17]:
n = 100
sigma = 10
x = sigma * (np.random.random(size=n) - 0.5)
plt.plot(x, f(x), '.');
plt.title('Target function $f(x)$');
plt.xlabel('randomly sampled x');
plt.ylabel('$f(x)$');

Define a Discriminator Model

A discriminator model classifies its input data as real or fake.

In [35]:
# Build a feed-forward discriminator network: 2 inputs -> 25 hidden units -> 1 probability
model = nn.Sequential(nn.Linear(2, 25),
                      nn.ReLU(),
                      nn.Linear(25, 1),
                      nn.Sigmoid()
                     )

# Loss: binary cross-entropy for the real/fake probability
criterion = nn.BCELoss()

# Optimizer
optimizer = optim.Adam(model.parameters(), lr=0.001)
In [36]:
# Visualize this neural network
x = torch.zeros(1, 2, dtype=torch.float, requires_grad=False)  # a dummy batch with one 2-dimensional sample
out = model(x)
make_dot(out)
Out[36]:
[torchviz computation graph showing the Linear, ReLU, and Sigmoid operations with the layer weights and biases as leaf nodes]
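As a lighter-weight alternative to torchviz (an aside, not in the original post), printing the model lists the layers of the nn.Sequential in order:

print(model)  # textual summary of the layers defined above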

Create real and fake samples

In [79]:
def generate_samples(size=100, label='real'):
    """Generate (x, f(x)) samples drawn from the target curve,
    labeled 1 if `label` is 'real' and 0 otherwise.
    """
    x = np.random.randn(size, 1)
    x2 = f(x)

    y = np.ones((size, 1)) * (label == 'real')
    return np.hstack([x, x2]), y
    
In [80]:
X, y = generate_samples()
In [82]:
X[:5]
Out[82]:
array([[ 1.2483621 ,  1.55840793],
       [ 0.57980381,  0.33617245],
       [-0.06718955,  0.00451444],
       [-1.95352245,  3.81624995],
       [-1.14922801,  1.32072501]])
In [83]:
y[:5]
Out[83]:
array([[1.],
       [1.],
       [1.],
       [1.],
       [1.]])
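The function above always draws points that lie on the curve, so a discriminator trained on it alone would have nothing to separate. Below is a minimal sketch of a fake-sample generator in the spirit of the original Keras tutorial (the uniform range is my assumption), drawing points that generally do not lie on f(x):

def generate_fake_samples(size=100):
    """Random points in the plane, labeled 0 (fake)."""
    x1 = np.random.rand(size, 1) * 2 - 1  # uniform in [-1, 1]
    x2 = np.random.rand(size, 1) * 2 - 1  # uniform in [-1, 1], independent of x1
    y = np.zeros((size, 1))
    return np.hstack([x1, x2]), y

X_fake, y_fake = generate_fake_samples()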

Loading Audio Data

Goal

This post aims to introduce how to load wave audio data as an array.

Reference

Libraries

In [7]:
import pandas as pd
import numpy as np
from scipy.io import wavfile
import matplotlib.pyplot as plt
%matplotlib inline

Load a file as a numpy array

In [18]:
filename = '../data/sound00.wav'
fs, data = wavfile.read(filename)  # fs: sampling rate in Hz, data: audio samples as a numpy array
data
Out[18]:
array([0, 0, 0, ..., 0, 0, 0], dtype=int16)

Visualize audio data

In [17]:
plt.figure(figsize=(16, 4))
plt.plot(data, lw=0.5, alpha=0.8);
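Since wavfile.read also returns the sampling rate fs, the x-axis can be expressed in seconds rather than sample indices; a small sketch:

t = np.arange(len(data)) / fs  # convert sample index to seconds

plt.figure(figsize=(16, 4))
plt.plot(t, data, lw=0.5, alpha=0.8)
plt.xlabel('time [s]')
plt.ylabel('amplitude');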

1105. Filling Bookcase Shelves [Dynamic Programming]

Goal

This post aims to describe a solution for 1105. Filling Bookcase Shelves, based on the solution by the respected coder Hexadecimal. This problem can be solved with dynamic programming.

Problem

We have a sequence of books: the i-th book has thickness books[i][0] and height books[i][1].

We want to place these books in order onto bookcase shelves that have total width shelf_width.

We choose some of the books to place on this shelf (such that the sum of their thickness is <= shelf_width), then build another level of shelf of the bookcase so that the total height of the bookcase has increased by the maximum height of the books we just put down. We repeat this process until there are no more books to place.

Note again that at each step of the above process, the order of the books we place is the same order as the given sequence of books. For example, if we have an ordered list of 5 books, we might place the first and second book onto the first shelf, the third book on the second shelf, and the fourth and fifth book on the last shelf.

Return the minimum possible height that the total bookshelf can be after placing shelves in this manner.
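Since no code appears above, here is a minimal dynamic-programming sketch along the lines of the referenced solution: dp[i] is the minimum bookcase height for the first i books, and for each i we try every suffix of those books that still fits on the last shelf.

from typing import List

def minHeightShelves(books: List[List[int]], shelf_width: int) -> int:
    n = len(books)
    dp = [0] + [float('inf')] * n  # dp[i]: min height using the first i books
    for i in range(1, n + 1):
        width, height = 0, 0
        # try placing books j..i (1-indexed) together on the last shelf
        for j in range(i, 0, -1):
            width += books[j - 1][0]
            if width > shelf_width:
                break
            height = max(height, books[j - 1][1])
            dp[i] = min(dp[i], dp[j - 1] + height)
    return dp[n]

# quick check: this arrangement has a minimum height of 6
print(minHeightShelves([[1, 1], [2, 3], [2, 3], [1, 1], [1, 1], [1, 1], [1, 2]], 4))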

Anomaly Detection by PCA in PyOD

Goal

This post aims to introduce how to detect anomalies using PCA in PyOD.


Reference

Libraries

In [31]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# PyOD
from pyod.utils.data import generate_data, get_outliers_inliers
from pyod.models.pca import PCA
from pyod.utils.data import evaluate_print
from pyod.utils.example import visualize

Create a data

In [66]:
X_train, y_train = generate_data(behaviour='new', n_features=5, train_only=True)
df_train = pd.DataFrame(X_train)
df_train['y'] = y_train
In [50]:
df_train.head()
Out[50]:
0 1 2 3 4 y
0 5.475324 4.882372 5.337351 5.376340 4.104947 0.0
1 5.244566 5.626358 5.356578 4.341500 4.856838 0.0
2 4.597031 5.787669 5.959738 5.823086 6.012408 0.0
3 4.637728 4.639901 5.400144 6.074926 4.627883 0.0
4 4.639908 4.667926 6.077212 5.012901 3.718718 0.0
In [57]:
sns.scatterplot(x=0, y=1, hue='y', data=df_train);
plt.title('Ground Truth');

Train an unsupervised PCA

In [52]:
clf = PCA()
clf.fit(X_train)
Out[52]:
PCA(contamination=0.1, copy=True, iterated_power='auto', n_components=None,
  n_selected_components=None, random_state=None, standardization=True,
  svd_solver='auto', tol=0.0, weighted=True, whiten=False)

Evaluate training score

In [65]:
y_train_pred = clf.labels_
y_train_scores = clf.decision_scores_
sns.scatterplot(x=0, y=1, hue=y_train_scores, data=df_train, palette='RdBu_r');
plt.title('Anomaly Scores by PCA');
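evaluate_print was imported above but not used; it prints ROC and precision @ rank n for the training scores, a quick numeric complement to the scatter plot. A minimal sketch:

# compare the ground-truth labels with the anomaly scores produced by the fitted PCA detector
evaluate_print('PCA', y_train, y_train_scores)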

Make Simulated Data For Anomaly Detection

Goal

This post aims to introduce how to make simulated data for anomaly detection using PyOD, which is an outlier detection package.

Reference

Libraries

In [58]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# PyOD
from pyod.utils.data import generate_data, get_outliers_inliers

Create an anomaly dataset

Create random data with 5 features

In [21]:
X_train, X_test, y_train, y_test = generate_data(behaviour='new', n_features=5)
df_tr = pd.DataFrame(X_train)
df_tr['y'] = y_train
df_te = pd.DataFrame(X_test)
df_te['y'] = y_test
In [22]:
df_tr.head()
Out[22]:
0 1 2 3 4 y
0 2.392715 3.084379 2.972580 2.907177 3.155727 0.0
1 3.185049 2.789920 2.648234 3.062398 2.673828 0.0
2 3.683184 3.169288 2.973224 2.725969 2.213359 0.0
3 2.928545 2.823802 2.888037 3.109228 2.813928 0.0
4 3.112898 3.365741 2.599102 3.090721 3.391458 0.0

Visualize created anomaly data

In [57]:
axes = df_tr.plot(subplots=True, figsize=(16, 8), title='Simulated Anomaly Data for Training');
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
In [56]:
axes = df_te.plot(subplots=True, figsize=(16, 8), title='Simulated Anomaly Data for Test');
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
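get_outliers_inliers was imported above but not used; it splits the generated samples by label, which is convenient for plotting the two groups separately. A small sketch:

# split the training data into outliers and inliers using the ground-truth labels
X_outliers, X_inliers = get_outliers_inliers(X_train, y_train)
print(X_outliers.shape, X_inliers.shape)  # with the defaults (n_train=1000, contamination=0.1): (100, 5) and (900, 5)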

Drop Highly Correlated Features

Goal

This post aims to introduce how to drop highly correlated features.

Reference

Libraries

In [8]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_boston
import seaborn as sns

Create a data with highly correlated variables

Load boston housing data

In [4]:
boston = load_boston()
df_boston = pd.DataFrame(boston.data, columns=boston.feature_names)
df_boston.head()
Out[4]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33

Add another correlated feature

In [6]:
df_boston['CRIM_correlated'] = df_boston['CRIM'] * 3 + 10 + np.random.random(df_boston.shape[0])
df_boston.head()
Out[6]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT CRIM_correlated
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 10.284178
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 10.102942
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 10.387687
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 10.607908
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 10.824663

Calculate Correlation

In [7]:
df_corr = df_boston.corr()
df_corr.head()
Out[7]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT CRIM_correlated
CRIM 1.000000 -0.200469 0.406583 -0.055892 0.420972 -0.219247 0.352734 -0.379670 0.625505 0.582764 0.289946 -0.385064 0.455621 0.999937
ZN -0.200469 1.000000 -0.533828 -0.042697 -0.516604 0.311991 -0.569537 0.664408 -0.311948 -0.314563 -0.391679 0.175520 -0.412995 -0.200756
INDUS 0.406583 -0.533828 1.000000 0.062938 0.763651 -0.391676 0.644779 -0.708027 0.595129 0.720760 0.383248 -0.356977 0.603800 0.406720
CHAS -0.055892 -0.042697 0.062938 1.000000 0.091203 0.091251 0.086518 -0.099176 -0.007368 -0.035587 -0.121515 0.048788 -0.053929 -0.055514
NOX 0.420972 -0.516604 0.763651 0.091203 1.000000 -0.302188 0.731470 -0.769230 0.611441 0.668023 0.188933 -0.380051 0.590879 0.421744
In [10]:
sns.heatmap(df_corr);

Drop highly correlated feature

In [35]:
threshold = 0.9

# For each pair of features with correlation at or above the threshold,
# keep the first feature and mark the second one for removal
columns = np.full((df_corr.shape[0],), True, dtype=bool)
for i in range(df_corr.shape[0]):
    for j in range(i + 1, df_corr.shape[0]):
        if df_corr.iloc[i, j] >= threshold:
            columns[j] = False
selected_columns = df_boston.columns[columns]
selected_columns
df_boston = df_boston[selected_columns]
In [36]:
df_boston.head()
Out[36]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD PTRATIO B LSTAT
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 18.7 396.90 5.33
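As an aside, the same selection can be written more compactly with pandas; this sketch keys on the absolute correlation of the upper triangle, so strongly negative pairs would be dropped as well (a slight difference from the loop above):

# upper triangle of the absolute correlation matrix, excluding the diagonal
upper = df_corr.abs().where(np.triu(np.ones(df_corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] >= threshold).any()]
# errors='ignore' because df_boston was already reduced by the loop above
df_reduced = df_boston.drop(columns=to_drop, errors='ignore')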

Feature Importance

Goal

This post aims to introduce how to obtain feature importance from a random forest and visualize it in different formats.


Reference

Libraries

In [29]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_boston
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Configuration

In [69]:
import matplotlib as mpl
mpl.rcParams['figure.figsize'] = (16, 6)

Load data

In [3]:
boston = load_boston()

df_boston = pd.DataFrame(data=boston.data, columns=boston.feature_names)
df_boston.head()
Out[3]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33

Train a Random Forest Regressor

In [56]:
reg = RandomForestRegressor(n_estimators=50)
reg.fit(df_boston, boston.target)
Out[56]:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

Obtain feature importance

average feature importance

In [70]:
df_feature_importance = pd.DataFrame(reg.feature_importances_, index=boston.feature_names, columns=['feature importance']).sort_values('feature importance', ascending=False)
df_feature_importance
Out[70]:
feature importance
RM 0.434691
LSTAT 0.362675
DIS 0.065282
CRIM 0.048311
NOX 0.024685
PTRATIO 0.018163
TAX 0.012388
AGE 0.011825
B 0.010220
INDUS 0.006348
RAD 0.002961
ZN 0.001503
CHAS 0.000950

all feature importance for each tree

In [58]:
df_feature_all = pd.DataFrame([tree.feature_importances_ for tree in reg.estimators_], columns=boston.feature_names)
df_feature_all.head()
Out[58]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 0.014397 0.000270 0.000067 0.001098 0.030470 0.160704 0.005805 0.040896 0.000915 0.009357 0.006712 0.008223 0.721085
1 0.027748 0.000151 0.004632 0.000844 0.079595 0.290730 0.020392 0.055907 0.012544 0.011589 0.018765 0.006700 0.470404
2 0.082172 0.000353 0.003930 0.002729 0.009873 0.182772 0.009487 0.053868 0.002023 0.014475 0.025605 0.004799 0.607914
3 0.020085 0.000592 0.006886 0.001462 0.016882 0.290993 0.007097 0.074538 0.001960 0.003679 0.012879 0.011265 0.551682
4 0.012873 0.001554 0.003002 0.000521 0.013372 0.251145 0.010757 0.110498 0.002889 0.007838 0.009357 0.027501 0.548694
In [97]:
# Melt into long format: one row per (tree, feature) importance value
df_feature_long = pd.melt(df_feature_all, var_name='feature name', value_name='values')

Visualize feature importance

The feature importance is visualized in the following formats:

  • Bar chart
  • Box Plot
  • Strip Plot
  • Swarm Plot
  • Factor plot

Bar chart

In [71]:
df_feature_importance.plot(kind='bar');
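The bar chart can also show the tree-to-tree spread as error bars; a small sketch reusing the per-tree importances computed above:

# standard deviation of each feature's importance across the 50 trees, in the same order as the bars
std = df_feature_all.std()[df_feature_importance.index]
df_feature_importance['feature importance'].plot(kind='bar', yerr=std, capsize=3);
plt.title('Feature importance with tree-to-tree standard deviation');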

Box plot

In [98]:
sns.boxplot(x="feature name", y="values", data=df_feature_long, order=df_feature_importance.index);

Strip Plot

In [99]:
sns.stripplot(x="feature name", y="values", data=df_feature_long, order=df_feature_importance.index);

Swarm plot

In [78]:
sns.swarmplot(x="feature name", y="values", data=df_feature_long, order=df_feature_importance.index);

All

In [108]:
fig, axes = plt.subplots(4, 1, figsize=(16, 8))
df_feature_importance.plot(kind='bar', ax=axes[0], title='Plots Comparison for Feature Importance');
sns.boxplot(ax=axes[1], x="feature name", y="values", data=df_feature_long, order=df_feature_importance.index);
sns.stripplot(ax=axes[2], x="feature name", y="values", data=df_feature_long, order=df_feature_importance.index);
sns.swarmplot(ax=axes[3], x="feature name", y="values", data=df_feature_long, order=df_feature_importance.index);
plt.tight_layout()

Save Images

Goal

This post aims to introduce how to save images using matplotlib.

Reference

Libraries

In [2]:
import numpy as np
import matplotlib.pyplot as plt

Create an image to save

In [6]:
img = np.random.randint(0, 255, (64, 64))
img
Out[6]:
array([[216, 200, 219, ...,  82, 176, 244],
       [ 17,  90,  86, ...,  91, 234, 195],
       [ 34, 226, 103, ..., 230,  86, 127],
       ...,
       [191, 110,  33, ...,  62, 109,  26],
       [ 43, 238, 208, ...,  51,   0, 123],
       [163, 156, 235, ..., 212, 188,  25]])

Show an image

In [9]:
plt.imshow(img);
plt.axis('off');

Save an image as png

In [33]:
# Re-draw the image in the same cell before saving; otherwise savefig writes out an empty figure
plt.imshow(img);
plt.axis('off');
plt.savefig('../images/savefig_sample_image.png')
In [36]:
!ls ../images | grep image.png
savefig_sample_image.png
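Alternatively, matplotlib can write the array directly without going through a figure; a minimal sketch using plt.imsave (the output path here is just illustrative):

# save the raw array as an image, with no axes or figure margins
plt.imsave('../images/imsave_sample_image.png', img, cmap='gray')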

Train the image classifier using PyTorch

Goal

This post aims to introduce how to train an image classifier for the MNIST dataset using PyTorch.


Reference

Libraries

In [8]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Dataset
from sklearn.datasets import load_digits

# PyTorch
import torch 
import torchvision
import torchvision.transforms as transforms

import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

Functions

In [32]:
def imshow(img):
    img = img / 2 + 0.5 # unnormalize
    npimg = img.numpy()
    plt.imshow(np.transpose(npimg, (1, 2, 0)))
    plt.axis('off')
    plt.show()

Load MNIST dataset

When downloading the image dataset, we also need to define a transform that normalizes pixel values from [0, 1] to [-1, +1].

In [50]:
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, ), (0.5, ))])

trainset = torchvision.datasets.MNIST(root='~/data', 
                                        train=True,
                                        download=True,
                                        transform=transform)

testset = torchvision.datasets.MNIST(root='~/data', 
                                        train=False, 
                                        download=True, 
                                        transform=transform)
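A quick check (not in the original post) that Normalize((0.5,), (0.5,)) indeed maps pixels from [0, 1] to [-1, +1] via x -> (x - 0.5) / 0.5:

sample_img, sample_label = trainset[0]
print(sample_img.shape)                                   # torch.Size([1, 28, 28])
print(sample_img.min().item(), sample_img.max().item())   # roughly -1.0 and 1.0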

Create a dataloader

In [13]:
trainloader = torch.utils.data.DataLoader(trainset,
                                          batch_size=100,
                                          shuffle=True,
                                          num_workers=2)
In [14]:
testloader = torch.utils.data.DataLoader(testset,
                                         batch_size=100,
                                         shuffle=False,
                                         num_workers=2)

Define a model

In [6]:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3)   # 28x28x1 -> 26x26x32
        self.conv2 = nn.Conv2d(32, 64, 3)  # 26x26x32 -> 24x24x64
        self.pool = nn.MaxPool2d(2, 2)  # 24x24x64 -> 12x12x64
        self.dropout1 = nn.Dropout2d()
        self.fc1 = nn.Linear(12 * 12 * 64, 128)
        self.dropout2 = nn.Dropout2d()
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = self.pool(F.relu(self.conv2(x)))
        x = self.dropout1(x)
        x = x.view(-1, 12 * 12 * 64)
        x = F.relu(self.fc1(x))
        x = self.dropout2(x)
        x = self.fc2(x)
        return x
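A shape sanity check for the network with a dummy batch (an aside, not part of the original post):

net = Net()
dummy = torch.zeros(4, 1, 28, 28)  # batch of 4 single-channel 28x28 images
print(net(dummy).shape)            # torch.Size([4, 10]): one logit per digit class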

Create a loss function and optimizer

In [10]:
model = Net()
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

Training a model

In [22]:
epochs = 5
for epoch in range(epochs):
    running_loss = 0.0
    for i, (inputs, labels) in enumerate(trainloader, 0):
        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        if i % 100 == 99:
            print(f'[{epoch + 1}, {i+1}] loss: {running_loss / 100:.2}')
            running_loss = 0.0

print('Finished Training')
[1, 100] loss: 1.5
[1, 200] loss: 0.82
[1, 300] loss: 0.63
[1, 400] loss: 0.55
[1, 500] loss: 0.5
[1, 600] loss: 0.46
[2, 100] loss: 0.42
[2, 200] loss: 0.4
[2, 300] loss: 0.38
[2, 400] loss: 0.38
[2, 500] loss: 0.34
[2, 600] loss: 0.34
[3, 100] loss: 0.31
[3, 200] loss: 0.31
[3, 300] loss: 0.29
[3, 400] loss: 0.28
[3, 500] loss: 0.28
[3, 600] loss: 0.26
[4, 100] loss: 0.24
[4, 200] loss: 0.24
[4, 300] loss: 0.24
[4, 400] loss: 0.23
[4, 500] loss: 0.23
[4, 600] loss: 0.22
[5, 100] loss: 0.2
[5, 200] loss: 0.21
[5, 300] loss: 0.2
[5, 400] loss: 0.19
[5, 500] loss: 0.19
[5, 600] loss: 0.19
Finished Training

Test

In [37]:
dataiter = iter(testloader)
images, labels = next(dataiter)
outputs = model(images)
_, predicted = torch.max(outputs, 1)
In [48]:
n_test = 10
df_result = pd.DataFrame({
    'Ground Truth': labels[:n_test],
    'Predicted label': predicted[:n_test]})
display(df_result.T)
imshow(torchvision.utils.make_grid(images[:n_test, :, :, :], nrow=n_test))
0 1 2 3 4 5 6 7 8 9
Ground Truth 7 2 1 0 4 1 4 9 5 9
Predicted label 7 2 1 0 4 1 4 9 6 9
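The post stops at visually inspecting one batch; a minimal sketch of overall test-set accuracy would look like this:

correct, total = 0, 0
model.eval()  # disable dropout for evaluation
with torch.no_grad():
    for images, labels in testloader:
        outputs = model(images)
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
print(f'Test accuracy: {100 * correct / total:.2f}%')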