Random Forest Classifier

Goal

This post shows how to train a random forest classifier, one of the most popular machine learning models, and how to evaluate it with cross-validation.

Reference

Libraries

In [12]:
import pandas as pd
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
%matplotlib inline

Load Data

In [6]:
X, y = make_blobs(n_samples=10000, n_features=10, centers=100, random_state=0)
df_X = pd.DataFrame(X)
df_X.head()
Out[6]:
           0           1          2          3          4          5          6          7          8          9
0   6.469076    4.250703  -8.636944   4.044785   9.017254   4.535872  -4.670276  -0.481728  -6.449961  -2.659850
1   6.488564    9.379570  10.327917  -1.765055  -2.068842  -9.537790   3.936380   3.375421   7.412737  -9.722844
2   8.373928  -10.143423  -3.527536  -7.338834   1.385557   6.961417  -4.504456  -7.315360  -2.330709   6.440872
3  -3.414101   -2.019790  -2.748108   4.168691  -5.788652  -7.468685  -1.719800  -5.302655   4.534099  -4.613695
4  -1.330023   -3.725465   9.559999  -6.751356  -7.407864  -2.131515   1.766013   2.381506  -1.886568   8.667311
In [8]:
df_y = pd.DataFrame(y, columns=['y'])
df_y.head()
Out[8]:
    y
0  85
1  64
2  93
3  46
4  61
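
Before training, it can help to confirm how the labels are distributed. The following is a minimal sketch, not part of the original notebook: it regenerates the same data and counts the samples per class. The variable name `class_counts` is introduced here for illustration only. With `n_samples=10000` and `centers=100`, `make_blobs` splits the samples evenly, so every class should hold the same number of rows.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Same call as in the post, repeated so this snippet runs on its own
X, y = make_blobs(n_samples=10000, n_features=10, centers=100, random_state=0)

# class_counts[k] is the number of samples labelled k (illustrative name)
class_counts = np.bincount(y)

print("number of classes:", class_counts.size)
print("smallest class size:", class_counts.min())
print("largest class size:", class_counts.max())
```

A balanced label distribution like this one is what makes plain accuracy a reasonable metric in the cross-validation step.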

Train a Model Using Cross-Validation

In [19]:
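# A small forest of 10 fully grown trees; random_state makes the run reproducible
clf = RandomForestClassifier(n_estimators=10, max_depth=None, min_samples_split=2, random_state=0)
# 5-fold cross-validation; with no `scoring` argument the classifier's default metric (accuracy) is used
scores = cross_val_score(clf, X, y, cv=5, verbose=1)
scores.mean()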
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    1.8s finished
Out[19]:
0.9997
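
The mean alone hides how much the folds differ. The sketch below is a hedged variation on the cell above, not the author's code: it passes an explicit `StratifiedKFold` splitter and reports the per-fold accuracy with its standard deviation. Shuffling the folds (`shuffle=True`) is an assumption added here, so the numbers may differ slightly from the output shown in the post.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Same data and model settings as in the post
X, y = make_blobs(n_samples=10000, n_features=10, centers=100, random_state=0)
clf = RandomForestClassifier(n_estimators=10, random_state=0)

# Explicit splitter: each fold keeps the class proportions.
# shuffle=True is an assumption of this sketch, not taken from the post.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv)

print("per-fold accuracy:", np.round(scores, 4))
print("mean: %.4f, std: %.4f" % (scores.mean(), scores.std()))
```

When `cv` is given as an integer and the estimator is a classifier, scikit-learn already uses stratified folds without shuffling, so the explicit splitter mainly makes that choice visible and controllable.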
In [15]:
pd.DataFrame(scores, columns=['CV Scores']).plot();
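
Note that `cross_val_score` fits clones of the estimator, so `clf` itself is still unfitted after the cell above. To obtain a model that can make predictions, it has to be fitted explicitly. The following is a minimal sketch under assumptions that are not in the original post: a 20% hold-out split via `train_test_split`, and the names `test_accuracy` and `predictions`, which are illustrative.

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Same data as in the post
X, y = make_blobs(n_samples=10000, n_features=10, centers=100, random_state=0)

# Hold out 20% of the rows (an assumption of this sketch);
# stratify=y keeps all 100 classes represented in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Fit the forest on the training part only
clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X_train, y_train)

# Accuracy on unseen rows, plus a few individual predictions
test_accuracy = clf.score(X_test, y_test)
predictions = clf.predict(X_test[:5])

print("hold-out accuracy:", test_accuracy)
print("predicted:", predictions)
print("actual:   ", y_test[:5])
```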