Posts about Machine Learning (old posts, page 1)

Random Forest Classifier

Goal

This post aims to introduce how to train a random forest classifier, which is one of the most popular machine learning models.

Libraries

In [12]:
import pandas as pd
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
%matplotlib inline

Load Data

In [6]:
# Generate a synthetic classification dataset:
# 10,000 samples, 10 features, and 100 cluster centers (100 classes)
X, y = make_blobs(n_samples=10000, n_features=10, centers=100, random_state=0)
df_X = pd.DataFrame(X)
df_X.head()
Out[6]:
0 1 2 3 4 5 6 7 8 9
0 6.469076 4.250703 -8.636944 4.044785 9.017254 4.535872 -4.670276 -0.481728 -6.449961 -2.659850
1 6.488564 9.379570 10.327917 -1.765055 -2.068842 -9.537790 3.936380 3.375421 7.412737 -9.722844
2 8.373928 -10.143423 -3.527536 -7.338834 1.385557 6.961417 -4.504456 -7.315360 -2.330709 6.440872
3 -3.414101 -2.019790 -2.748108 4.168691 -5.788652 -7.468685 -1.719800 -5.302655 4.534099 -4.613695
4 -1.330023 -3.725465 9.559999 -6.751356 -7.407864 -2.131515 1.766013 2.381506 -1.886568 8.667311
In [8]:
df_y = pd.DataFrame(y, columns=['y'])
df_y.head()
Out[8]:
y
0 85
1 64
2 93
3 46
4 61

Train a Model Using Cross-Validation

In [19]:
# A forest of 10 trees with unlimited depth
clf = RandomForestClassifier(n_estimators=10, max_depth=None, min_samples_split=2, random_state=0)

# 5-fold cross-validation accuracy
scores = cross_val_score(clf, X, y, cv=5, verbose=1)
scores.mean()
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    1.8s finished
Out[19]:
0.9997
In [15]:
pd.DataFrame(scores, columns=['CV Scores']).plot();
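Cross-validation reports how well the model generalizes; to see which features drive the predictions, we can also fit the classifier once and inspect its feature_importances_ attribute. A minimal sketch:

In [ ]:
# Fit on the full data and plot the per-feature importances
# (a quick inspection step, separate from the cross-validation above)
clf.fit(X, y)
pd.Series(clf.feature_importances_).plot(kind='bar');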

Loading scikit-learn's Boston Housing Dataset

Goal

This post aims to introduce how to load the Boston housing dataset using scikit-learn.

Library

In [8]:
from sklearn.datasets import load_boston
import pandas as pd

Load Dataset

In [3]:
boston = load_boston()
In [4]:
type(boston)
Out[4]:
sklearn.utils.Bunch
In [6]:
boston.keys()
Out[6]:
dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])

Data

In [9]:
pd.DataFrame(boston.data).head()
Out[9]:
0 1 2 3 4 5 6 7 8 9 10 11 12
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33

Target

In [12]:
pd.DataFrame(boston.target).head()
Out[12]:
0
0 24.0
1 21.6
2 34.7
3 33.4
4 36.2

Feature Names

In [17]:
print(boston.feature_names)
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']
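Putting the pieces together, a common next step is to assemble the data, feature names, and target into a single labeled DataFrame. A minimal sketch (MEDV is the target's name, taken from the description below):

In [ ]:
# Combine the Bunch fields into one DataFrame
df_boston = pd.DataFrame(boston.data, columns=boston.feature_names)
df_boston['MEDV'] = boston.target  # median home value, the regression target
df_boston.head()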

Description

In [19]:
print(boston.DESCR)
.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
.. topic:: References

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.

Adding or Subtracting Time

Goal

This post aims to add or subtract time from a date column using pandas:

  • Pandas

Reference:

  • Chris Albon's blog (I looked at his post's title and wrote my own content to deepen my understanding of the topic.)

Library

In [1]:
import pandas as pd

Create date columns using date_range

In [2]:
# Month-end dates from January 2016 through December 2018;
# freq='m' is month-end frequency, and closed='left' excludes the end date
date_rng = pd.date_range(start='20160101', end='20190101', freq='m', closed='left')
date_rng
Out[2]:
DatetimeIndex(['2016-01-31', '2016-02-29', '2016-03-31', '2016-04-30',
               '2016-05-31', '2016-06-30', '2016-07-31', '2016-08-31',
               '2016-09-30', '2016-10-31', '2016-11-30', '2016-12-31',
               '2017-01-31', '2017-02-28', '2017-03-31', '2017-04-30',
               '2017-05-31', '2017-06-30', '2017-07-31', '2017-08-31',
               '2017-09-30', '2017-10-31', '2017-11-30', '2017-12-31',
               '2018-01-31', '2018-02-28', '2018-03-31', '2018-04-30',
               '2018-05-31', '2018-06-30', '2018-07-31', '2018-08-31',
               '2018-09-30', '2018-10-31', '2018-11-30', '2018-12-31'],
              dtype='datetime64[ns]', freq='M')
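With the dates in hand, adding or subtracting time is a matter of adding a pd.Timedelta (a fixed-length offset) or a pd.DateOffset (a calendar-aware offset). A minimal sketch; the column names here are only illustrative:

In [ ]:
df = pd.DataFrame({'date': date_rng})

# Fixed-length shifts with Timedelta
df['one_week_later'] = df['date'] + pd.Timedelta(days=7)
df['three_hours_earlier'] = df['date'] - pd.Timedelta(hours=3)

# Calendar-aware shift with DateOffset (respects month lengths)
df['one_month_later'] = df['date'] + pd.DateOffset(months=1)
df.head()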

Ordinal Encoding using Scikit-learn

Goal

This post aims to convert a categorical column into ordinal numbers for further processing using scikit-learn.

Library

In [1]:
import pandas as pd
import sklearn.preprocessing

Create categorical data

In [2]:
df = pd.DataFrame(data={'type': ['cat', 'dog', 'sheep'], 
                       'weight': [10, 15, 50]})
df
Out[2]:
type weight
0 cat 10
1 dog 15
2 sheep 50

Ordinal Encoding

Ordinal encoding replaces categories with numbers.

In [3]:
# Instantiate the ordinal encoder class
oe = sklearn.preprocessing.OrdinalEncoder()

# Learn the mapping from categories to numbers
oe.fit(df.loc[:, ['type']])
Out[3]:
OrdinalEncoder(categories='auto', dtype=<class 'numpy.float64'>)
In [4]:
# Apply this ordinal encoder to new data 
oe.transform(pd.DataFrame(['cat'] * 3 + 
                          ['dog'] * 2 + 
                          ['sheep'] * 5))
Out[4]:
array([[0.],
       [0.],
       [0.],
       [1.],
       [1.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.]])
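The fitted encoder also exposes the learned mapping: categories_ lists the categories in the order of their assigned codes, and inverse_transform maps codes back to the original labels.

In [ ]:
# Category i in this list is encoded as the number i
oe.categories_
# -> [array(['cat', 'dog', 'sheep'], dtype=object)]

# Map encoded values back to the original categories
oe.inverse_transform([[0.], [2.]])
# -> array([['cat'], ['sheep']], dtype=object)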

Create A Sparse Matrix

Goal

This post aims to create a sparse matrix in Python using the following modules:

  • Numpy
  • Scipy

Reference:

  • Scipy Document
  • Chris Albon's blog (I looked at his post's title and wrote my own content to deepen my understanding of the topic.)

Library

In [8]:
import numpy as np
import scipy.sparse

Create a sparse matrix using csr_matrix

CSR stands for "Compressed Sparse Row".

In [9]:
nrow = 10000
ncol = 10000

# Allocate an empty nrow x ncol CSR matrix -- no elements are stored,
# so it takes almost no memory despite the 10000 x 10000 shape
arr_sparse = scipy.sparse.csr_matrix((nrow, ncol))
arr_sparse
Out[9]:
<10000x10000 sparse matrix of type '<class 'numpy.float64'>'
	with 0 stored elements in Compressed Sparse Row format>
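The call above only allocates an empty matrix. To build a sparse matrix from actual data, csr_matrix also accepts a dense array (or a (data, (row, col)) triplet) and stores only the non-zero entries. A small sketch:

In [ ]:
# Build a mostly-zero dense array, then compress it
dense = np.zeros((5, 5))
dense[0, 1] = 3.0
dense[3, 4] = 7.0

small_sparse = scipy.sparse.csr_matrix(dense)
print(small_sparse)            # prints only the two stored elements
print(small_sparse.toarray())  # converts back to a dense array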

Split-Up: dtreeviz (Part 1)

Goal

This post aims to go through each function in the dtreeviz module to fully understand what is implemented. After fully understanding this, I would like to contribute to the module and submit a pull request.

I really like this module and would like to see it work for other tree-based libraries like XGBoost or LightGBM. I found the exact same request (issue 15) on GitHub, so I hope I can contribute there.

You would just have to get ShadowDecisionTree wrappers for those trees.

Based on this comment, I first need to understand the ShadowDecisionTree class.


Understand folder structure

In this post, we will take a deep dive into the core module dtreeviz.

This module consists of four Python files.

__init__.py is empty, so we can skip it.

Let's go through the rest one by one.

Data Exploration Tool - Lantern Part 1

Overview

Lantern is a Python module that provides a toolkit for data exploration, from datasets to visualization.

In this post, I will walk through the following:

  • How to set up lantern (see the setup sketch after this list)
  • What lantern can do
    • dataset
    • plot (visualization)
    • grid (interactive table view)
    • widget
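Before diving in, a minimal setup sketch. The install command is an assumption here (the project's README is the authoritative source for the package name); the import name is lantern.

In [ ]:
# Assumed install command (run in a shell, not in the notebook):
#   pip install pylantern
import lantern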

Introduction to Graphviz in Jupyter Notebook

Goal

The goal of this post is to introduce graphviz to draw graphs when we explain graph-related algorithms, e.g., trees, binary search, etc. It is nice to have such a visualization to quickly digest problems and solutions.

Since we work with TreeNode and trees in a list expression, e.g., [1, 2, null, 3], in LeetCode, the goal of this post is to easily convert a tree given as a list expression into a visualization like the one below.

In [1]:
from IPython.display import Image
Image('digraph.png')
Out[1]:
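As a minimal sketch of the target visualization, the graphviz Python package can draw a small binary tree directly in the notebook (this assumes both the graphviz Python package and the Graphviz system binaries are installed). For the list expression [1, 2, null, 3], the null child is simply omitted:

In [ ]:
from graphviz import Digraph

# Edges for the tree [1, 2, null, 3]:
# 1 is the root, 2 is its left child (the right child is null),
# and 3 is the left child of 2; edge() creates nodes implicitly
dot = Digraph()
dot.edge('1', '2')
dot.edge('2', '3')
dot  # displaying the Digraph object renders the graph inline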