Posts about Machine Learning (old posts, page 1)

Random Forest Classifier

Goal

This post aims to introduce how to train a random forest classifier, which is one of the most popular machine learning models.

Libraries

In [12]:
import pandas as pd
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
%matplotlib inline

Load Data

In [6]:
# Generate a synthetic classification dataset:
# 10,000 samples, 10 features, and 100 cluster centers (100 classes)
X, y = make_blobs(n_samples=10000, n_features=10, centers=100, random_state=0)
df_X = pd.DataFrame(X)
df_X.head()
Out[6]:
0 1 2 3 4 5 6 7 8 9
0 6.469076 4.250703 -8.636944 4.044785 9.017254 4.535872 -4.670276 -0.481728 -6.449961 -2.659850
1 6.488564 9.379570 10.327917 -1.765055 -2.068842 -9.537790 3.936380 3.375421 7.412737 -9.722844
2 8.373928 -10.143423 -3.527536 -7.338834 1.385557 6.961417 -4.504456 -7.315360 -2.330709 6.440872
3 -3.414101 -2.019790 -2.748108 4.168691 -5.788652 -7.468685 -1.719800 -5.302655 4.534099 -4.613695
4 -1.330023 -3.725465 9.559999 -6.751356 -7.407864 -2.131515 1.766013 2.381506 -1.886568 8.667311
In [8]:
df_y = pd.DataFrame(y, columns=['y'])
df_y.head()
Out[8]:
y
0 85
1 64
2 93
3 46
4 61

Train a Model Using Cross-Validation

In [19]:
# A forest of 10 trees with unlimited depth
clf = RandomForestClassifier(n_estimators=10, max_depth=None, min_samples_split=2, random_state=0)

# 5-fold cross-validation accuracy
scores = cross_val_score(clf, X, y, cv=5, verbose=1)
scores.mean()
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    1.8s finished
Out[19]:
0.9997
In [15]:
pd.DataFrame(scores, columns=['CV Scores']).plot();
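Cross-validation reports how well the model generalizes; to see which features drive the predictions, we can also fit the classifier once and inspect its feature_importances_ attribute. A minimal sketch:

In [ ]:
# Fit on the full data and plot the per-feature importances
# (a quick inspection step, separate from the cross-validation above)
clf.fit(X, y)
pd.Series(clf.feature_importances_).plot(kind='bar');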

Loading scikit-learn's Boston Housing Dataset

Goal

This post aims to introduce how to load the Boston housing dataset using scikit-learn.

Library

In [8]:
from sklearn.datasets import load_boston
import pandas as pd

Load Dataset

In [3]:
boston = load_boston()
In [4]:
type(boston)
Out[4]:
sklearn.utils.Bunch
In [6]:
boston.keys()
Out[6]:
dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])

Data

In [9]:
pd.DataFrame(boston.data).head()
Out[9]:
0 1 2 3 4 5 6 7 8 9 10 11 12
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33

Target

In [12]:
pd.DataFrame(boston.target).head()
Out[12]:
0
0 24.0
1 21.6
2 34.7
3 33.4
4 36.2

Feature Names

In [17]:
print(boston.feature_names)
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']
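Putting the pieces together, a common next step is to assemble the data, feature names, and target into a single labeled DataFrame. A minimal sketch (MEDV is the target's name, taken from the description below):

In [ ]:
# Combine the Bunch fields into one DataFrame
df_boston = pd.DataFrame(boston.data, columns=boston.feature_names)
df_boston['MEDV'] = boston.target  # median home value, the regression target
df_boston.head()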

Description

In [19]:
print(boston.DESCR)
.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
.. topic:: References

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.

Adding or Subtracting Time

Goal

This post aims to add or subtract time from a date column using pandas:

  • Pandas

Reference:

  • Chris Albon's blog (I looked at his post's title and wrote my own content to deepen my understanding of the topic.)

Library

In [1]:
import pandas as pd

Create date columns using date_range

In [2]:
# Month-end dates from January 2016 through December 2018;
# freq='m' is month-end frequency, and closed='left' excludes the end date
date_rng = pd.date_range(start='20160101', end='20190101', freq='m', closed='left')
date_rng
Out[2]:
DatetimeIndex(['2016-01-31', '2016-02-29', '2016-03-31', '2016-04-30',
               '2016-05-31', '2016-06-30', '2016-07-31', '2016-08-31',
               '2016-09-30', '2016-10-31', '2016-11-30', '2016-12-31',
               '2017-01-31', '2017-02-28', '2017-03-31', '2017-04-30',
               '2017-05-31', '2017-06-30', '2017-07-31', '2017-08-31',
               '2017-09-30', '2017-10-31', '2017-11-30', '2017-12-31',
               '2018-01-31', '2018-02-28', '2018-03-31', '2018-04-30',
               '2018-05-31', '2018-06-30', '2018-07-31', '2018-08-31',
               '2018-09-30', '2018-10-31', '2018-11-30', '2018-12-31'],
              dtype='datetime64[ns]', freq='M')
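With the dates in hand, adding or subtracting time is a matter of adding a pd.Timedelta (a fixed-length offset) or a pd.DateOffset (a calendar-aware offset). A minimal sketch; the column names here are only illustrative:

In [ ]:
df = pd.DataFrame({'date': date_rng})

# Fixed-length shifts with Timedelta
df['one_week_later'] = df['date'] + pd.Timedelta(days=7)
df['three_hours_earlier'] = df['date'] - pd.Timedelta(hours=3)

# Calendar-aware shift with DateOffset (respects month lengths)
df['one_month_later'] = df['date'] + pd.DateOffset(months=1)
df.head()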

Ordinal Encoding using Scikit-learn

Goal

This post aims to convert a categorical column into ordinal numbers for further processing using scikit-learn.

Library

In [1]:
import pandas as pd
import sklearn.preprocessing

Create categorical data

In [2]:
df = pd.DataFrame(data={'type': ['cat', 'dog', 'sheep'], 
                       'weight': [10, 15, 50]})
df
Out[2]:
type weight
0 cat 10
1 dog 15
2 sheep 50

Ordinal Encoding

Ordinal encoding replaces categories with numbers.

In [3]:
# Instantiate the ordinal encoder class
oe = sklearn.preprocessing.OrdinalEncoder()

# Learn the mapping from categories to numbers
oe.fit(df.loc[:, ['type']])
Out[3]:
OrdinalEncoder(categories='auto', dtype=<class 'numpy.float64'>)
In [4]:
# Apply this ordinal encoder to new data 
oe.transform(pd.DataFrame(['cat'] * 3 + 
                          ['dog'] * 2 + 
                          ['sheep'] * 5))
Out[4]:
array([[0.],
       [0.],
       [0.],
       [1.],
       [1.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.]])
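The fitted encoder also exposes the learned mapping: categories_ lists the categories in the order of their assigned codes, and inverse_transform maps codes back to the original labels.

In [ ]:
# Category i in this list is encoded as the number i
oe.categories_
# -> [array(['cat', 'dog', 'sheep'], dtype=object)]

# Map encoded values back to the original categories
oe.inverse_transform([[0.], [2.]])
# -> array([['cat'], ['sheep']], dtype=object)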

Create A Sparse Matrix

Goal

This post aims to create a sparse matrix in Python using the following modules:

  • Numpy
  • Scipy

Reference:

  • Scipy Document
  • Chris Albon's blog (I looked at his post's title and wrote my own content to deepen my understanding of the topic.)

Library

In [8]:
import numpy as np
import scipy.sparse

Create a sparse matrix using csr_matrix

CSR stands for "Compressed Sparse Row".

In [9]:
nrow = 10000
ncol = 10000

# Allocate an empty nrow x ncol CSR matrix -- no elements are stored,
# so it takes almost no memory despite the 10000 x 10000 shape
arr_sparse = scipy.sparse.csr_matrix((nrow, ncol))
arr_sparse
Out[9]:
<10000x10000 sparse matrix of type '<class 'numpy.float64'>'
	with 0 stored elements in Compressed Sparse Row format>
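The call above only allocates an empty matrix. To build a sparse matrix from actual data, csr_matrix also accepts a dense array (or a (data, (row, col)) triplet) and stores only the non-zero entries. A small sketch:

In [ ]:
# Build a mostly-zero dense array, then compress it
dense = np.zeros((5, 5))
dense[0, 1] = 3.0
dense[3, 4] = 7.0

small_sparse = scipy.sparse.csr_matrix(dense)
print(small_sparse)            # prints only the two stored elements
print(small_sparse.toarray())  # converts back to a dense array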

Split-Up: dtreeviz (Part 1)

Goal

This post aims to go through each function in the dtreeviz module to fully understand what is implemented. After fully understanding this, I would like to contribute to the module and submit a pull request.

I really like this module and would like to see it work for other tree-based libraries like XGBoost or LightGBM. I found the exact same request (issue 15) on GitHub, so I hope I can contribute there.

You would just have to get ShadowDecisionTree wrappers for those trees.

Based on this comment, I first need to understand the ShadowDecisionTree class.


Understand folder structure

In this post, we will take a deep dive into the core module dtreeviz.

This module consists of four Python files.

__init__.py is empty, so we can skip it.

Let's go through the rest one by one.

Data Exploration Tool - Lantern Part 1

Overview

Lantern is a Python module that provides a toolkit for data exploration, from datasets to visualization.

In this post, I will walk through the following:

  • How to set up lantern (see the setup sketch after this list)
  • What lantern can do
    • dataset
    • plot (visualization)
    • grid (interactive table view)
    • widget
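Before diving in, a minimal setup sketch. The install command is an assumption here (the project's README is the authoritative source for the package name); the import name is lantern.

In [ ]:
# Assumed install command (run in a shell, not in the notebook):
#   pip install pylantern
import lantern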

Introduction to Graphviz in Jupyter Notebook

Goal

The goal of this post is to introduce graphviz to draw graphs when we explain graph-related algorithms, e.g., trees, binary search, etc. It is nice to have such a visualization to quickly digest problems and solutions.

Since we work with TreeNode and trees in a list expression, e.g., [1, 2, null, 3], in LeetCode, the goal of this post is to easily convert a tree given as a list expression into a visualization like the one below.

In [1]:
from IPython.display import Image
Image('digraph.png')
Out[1]:
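As a minimal sketch of the target visualization, the graphviz Python package can draw a small binary tree directly in the notebook (this assumes both the graphviz Python package and the Graphviz system binaries are installed). For the list expression [1, 2, null, 3], the null child is simply omitted:

In [ ]:
from graphviz import Digraph

# Edges for the tree [1, 2, null, 3]:
# 1 is the root, 2 is its left child (the right child is null),
# and 3 is the left child of 2; edge() creates nodes implicitly
dot = Digraph()
dot.edge('1', '2')
dot.edge('2', '3')
dot  # displaying the Digraph object renders the graph inline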