Random Forest Classifier


This post introduces how to train a random forest classifier, one of the most popular machine learning models.



import pandas as pd
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
%matplotlib inline

Load Data

X, y = make_blobs(n_samples=10000, n_features=10, centers=100, random_state=0)
df_X = pd.DataFrame(X)
0 1 2 3 4 5 6 7 8 9
0 6.469076 4.250703 -8.636944 4.044785 9.017254 4.535872 -4.670276 -0.481728 -6.449961 -2.659850
1 6.488564 9.379570 10.327917 -1.765055 -2.068842 -9.537790 3.936380 3.375421 7.412737 -9.722844
2 8.373928 -10.143423 -3.527536 -7.338834 1.385557 6.961417 -4.504456 -7.315360 -2.330709 6.440872
3 -3.414101 -2.019790 -2.748108 4.168691 -5.788652 -7.468685 -1.719800 -5.302655 4.534099 -4.613695
4 -1.330023 -3.725465 9.559999 -6.751356 -7.407864 -2.131515 1.766013 2.381506 -1.886568 8.667311
df_y = pd.DataFrame(y, columns=['y'])
0 85
1 64
2 93
3 46
4 61

Train a model using Cross Validation

clf = RandomForestClassifier(n_estimators=10, max_depth=None, min_samples_split=2, random_state=0)
scores = cross_val_score(clf, X, y, cv=5, verbose=1)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    1.8s finished
pd.DataFrame(scores, columns=['CV Scores']).plot();
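Besides plotting the per-fold scores, it is often handy to summarize them with a mean and standard deviation. The following is a minimal sketch, not the notebook's original cell: it assumes a smaller synthetic dataset (1,000 samples and 10 centers, chosen only to keep it fast) and otherwise uses the same classifier settings as above.

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Smaller synthetic dataset than in the post (assumption, for speed only)
X_small, y_small = make_blobs(n_samples=1000, n_features=10, centers=10, random_state=0)

# Same hyperparameters as the classifier defined above
clf_small = RandomForestClassifier(n_estimators=10, max_depth=None,
                                   min_samples_split=2, random_state=0)

# One accuracy score per fold (default scoring for classifiers)
scores_small = cross_val_score(clf_small, X_small, y_small, cv=5)

# Summarize the folds with mean and standard deviation
print(f"accuracy: {scores_small.mean():.3f} +/- {scores_small.std():.3f}")
```

The mean gives a single number to compare models with, while the standard deviation hints at how much the score depends on the particular split.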

Loading scikit-learn's Boston Housing Dataset


This post introduces how to load the Boston housing dataset using scikit-learn.


from sklearn.datasets import load_boston
import pandas as pd

Load Dataset

boston = load_boston()
dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])


0 1 2 3 4 5 6 7 8 9 10 11 12
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33


0 24.0
1 21.6
2 34.7
3 33.4
4 36.2

Feature Name

['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']
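The data, target, and feature names shown above can be combined into a single labeled DataFrame. The sketch below is an illustration under stated assumptions: `load_boston` was removed in scikit-learn 1.2, so instead of calling it, the code uses a stand-in built from the first two rows printed above. The variable names `data`, `target`, and `feature_names` mirror the keys of the loaded object, and `MEDV` is the target name taken from the dataset description.

```python
import numpy as np
import pandas as pd

# Stand-in for boston.feature_names (13 attributes listed in the description)
feature_names = np.array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
                          'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT'])

# Stand-in for boston.data: only the first two rows printed above,
# NOT the full 506-row dataset
data = np.array([
    [0.00632, 18.0, 2.31, 0.0, 0.538, 6.575, 65.2, 4.0900, 1.0, 296.0, 15.3, 396.90, 4.98],
    [0.02731, 0.0, 7.07, 0.0, 0.469, 6.421, 78.9, 4.9671, 2.0, 242.0, 17.8, 396.90, 9.14],
])

# Stand-in for boston.target (median home value in $1000's)
target = np.array([24.0, 21.6])

# Label the columns with the feature names and append the target
df_boston = pd.DataFrame(data, columns=feature_names)
df_boston['MEDV'] = target
print(df_boston[['CRIM', 'RM', 'LSTAT', 'MEDV']])
```

With an older scikit-learn where `load_boston` is still available, the same two DataFrame lines should work with `boston.data`, `boston.feature_names`, and `boston.target` in place of the stand-ins.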


.. _boston_dataset:

Boston house prices dataset

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.

This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression problems.

.. topic:: References

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.

Adding Or Subtracting Time


This post aims to add or subtract time from a date column using pandas:

  • Pandas


  • Chris Albon's blog (I looked at his post's title and wrote my own content to deepen my understanding of the topic.)


import pandas as pd

Create date columns using date_range

date_rng = pd.date_range(start='20160101', end='20190101', freq='m', closed='left')
DatetimeIndex(['2016-01-31', '2016-02-29', '2016-03-31', '2016-04-30',
               '2016-05-31', '2016-06-30', '2016-07-31', '2016-08-31',
               '2016-09-30', '2016-10-31', '2016-11-30', '2016-12-31',
               '2017-01-31', '2017-02-28', '2017-03-31', '2017-04-30',
               '2017-05-31', '2017-06-30', '2017-07-31', '2017-08-31',
               '2017-09-30', '2017-10-31', '2017-11-30', '2017-12-31',
               '2018-01-31', '2018-02-28', '2018-03-31', '2018-04-30',
               '2018-05-31', '2018-06-30', '2018-07-31', '2018-08-31',
               '2018-09-30', '2018-10-31', '2018-11-30', '2018-12-31'],
              dtype='datetime64[ns]', freq='M')
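Once a date column exists, time can be added or subtracted with ordinary arithmetic. The snippet below is a minimal sketch rather than the post's original code: it builds a small column from three of the dates printed above and assumes `pd.Timedelta` for fixed durations and `pd.DateOffset` for calendar-aware shifts. The column names are made up for illustration.

```python
import pandas as pd

# A small date column built from three of the month-end dates shown above
df_dates = pd.DataFrame({
    'date': pd.to_datetime(['2016-01-31', '2016-02-29', '2016-03-31'])
})

# Add a fixed duration: exactly 7 days later
df_dates['plus_7_days'] = df_dates['date'] + pd.Timedelta(days=7)

# Subtract a calendar month: the day is clipped to the end of the
# target month when it does not exist there
df_dates['minus_1_month'] = df_dates['date'] - pd.DateOffset(months=1)

print(df_dates)
```

`Timedelta` is a fixed length of time, whereas `DateOffset` follows the calendar, so subtracting one month from 2016-03-31 lands on 2016-02-29 instead of an invalid date.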

Ordinal Encoding using Scikit-learn


This post aims to convert a categorical column into numbers for further processing using scikit-learn.


import pandas as pd
import sklearn.preprocessing

Create categorical data

df = pd.DataFrame(data={'type': ['cat', 'dog', 'sheep'], 
                       'weight': [10, 15, 50]})
type weight
0 cat 10
1 dog 15
2 sheep 50

Ordinal Encoding

Ordinal encoding replaces each category with a number.

# Instantiate the ordinal encoder class
oe = sklearn.preprocessing.OrdinalEncoder()

# Learn the mapping from categories to numbers
oe.fit(df.loc[:, ['type']])
OrdinalEncoder(categories='auto', dtype=<class 'numpy.float64'>)
# Apply this ordinal encoder to new data 
oe.transform(pd.DataFrame(['cat'] * 3 + 
                          ['dog'] * 2 + 
                          ['sheep'] * 5))
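To make the learned mapping visible, the fitted encoder can be inspected and reversed. This is a small self-contained sketch under the same setup as above; the new-data column is named `type` here (an assumption, chosen so the names match what the encoder saw during fitting), and the variable names are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Same toy data as above
df_pets = pd.DataFrame({'type': ['cat', 'dog', 'sheep'],
                        'weight': [10, 15, 50]})

# Fit on the categorical column only (2D input is required)
encoder = OrdinalEncoder()
encoder.fit(df_pets[['type']])

# Categories are sorted, and each one is mapped to its position
print(encoder.categories_)

# Encode new data, then map the numbers back to the original labels
new_data = pd.DataFrame({'type': ['sheep', 'cat', 'dog']})
encoded = encoder.transform(new_data)
decoded = encoder.inverse_transform(encoded)
print(encoded.ravel())
print(decoded.ravel())
```

With the default `categories='auto'`, the numbers follow the sorted order of the labels, so they carry no meaning beyond identity unless an explicit category order is supplied.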

Create A Sparse Matrix


This post aims to create a sparse matrix in Python using the following modules:

  • Numpy
  • Scipy


  • Scipy Document
  • Chris Albon's blog (I looked at his post's title and wrote my own content to deepen my understanding of the topic.)


import numpy as np
import scipy.sparse

Create a sparse matrix using csr_matrix

CSR stands for "Compressed Sparse Row" matrix

nrow = 10000
ncol = 10000

# CSR stands for "Compressed Sparse Row" matrix
arr_sparse = scipy.sparse.csr_matrix((nrow, ncol))
<10000x10000 sparse matrix of type '<class 'numpy.float64'>'
	with 0 stored elements in Compressed Sparse Row format>
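The matrix above is empty, so it does not show what CSR actually stores. As a small illustrative sketch (the 4x5 array and its two values are made up for this example), a mostly-zero dense array can be converted and its three internal arrays inspected:

```python
import numpy as np
import scipy.sparse

# A small, mostly-zero dense array (values chosen only for illustration)
dense = np.zeros((4, 5))
dense[0, 1] = 3.0
dense[2, 4] = 7.0

# Convert to Compressed Sparse Row format; only non-zeros are stored
small_sparse = scipy.sparse.csr_matrix(dense)

print(small_sparse.nnz)      # number of stored elements
print(small_sparse.data)     # the non-zero values, row by row
print(small_sparse.indices)  # column index of each stored value
print(small_sparse.indptr)   # where each row starts within data/indices
```

Row `i` owns the slice `data[indptr[i]:indptr[i + 1]]`, which is why empty rows cost almost nothing and row slicing is fast in this format.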

Split-Up: dtreeviz (Part 1)


This post aims to go through each function in the dtreeviz module to fully understand what is implemented. After that, I would like to contribute to this module and submit a pull request.

I really like this module and would like to see it work for other tree-based libraries like XGBoost or LightGBM. I found the exact same request (issue 15) on GitHub, so I hope I can contribute to it.

You would just have to get ShadowDecisionTree wrappers for those trees.

Based on this comment, I first need to understand the ShadowDecisionTree class.


Understand folder structure

In this post, we will take a deep dive into the core module, dtreeviz.

This module comprises 4 Python files. image is empty so we can skip it.

Let's go through them one by one.

Data Exploration Tool - Lantern Part 1


Lantern is a Python module that provides a collection of tools for data exploration, from loading a variety of datasets to visualization.

In this post, I will walk through the following:

  • How to set up lantern
  • What lantern can do
    • dataset
    • plot (visualization)
    • grid (interactive table view)
    • widget

Introduction to Graphviz in Jupyter Notebook


The goal of this post is to introduce Graphviz for drawing graphs when we explain graph-related algorithms, e.g., trees, binary search, etc. Such a visualization makes it easier to quickly digest problems and solutions.

Since we work with TreeNode and trees in a list expression, e.g., [1, 2, null, 3], in LeetCode, the goal of this post is to easily convert a tree given as a list expression into a visualization like the one below.

from IPython.display import Image
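One possible way to approach the conversion is to turn the level-order list into Graphviz DOT source first. The sketch below is an assumption about how this could be done, not the post's implementation: `list_to_dot` is a hypothetical helper, `None` stands in for LeetCode's `null`, and the code only builds a string, so it runs without Graphviz installed.

```python
from collections import deque


def list_to_dot(values):
    """Convert a LeetCode-style level-order list into DOT source (hypothetical helper)."""
    lines = ["digraph Tree {"]
    if values and values[0] is not None:
        lines.append(f'    n0 [label="{values[0]}"];')
        queue = deque([0])  # list indices of nodes still waiting for children
        i = 1
        while queue and i < len(values):
            parent = queue.popleft()
            for _ in range(2):  # left child first, then right child
                if i < len(values):
                    if values[i] is not None:
                        lines.append(f'    n{i} [label="{values[i]}"];')
                        lines.append(f"    n{parent} -> n{i};")
                        queue.append(i)
                    i += 1  # a None entry still consumes a child slot
    lines.append("}")
    return "\n".join(lines)


dot_source = list_to_dot([1, 2, None, 3])
print(dot_source)
```

If the `graphviz` Python package is installed, the resulting string can be handed to `graphviz.Source(dot_source)` to render it in a notebook. Note that plain DOT does not distinguish left from right children, so a lone child is drawn directly below its parent.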