Posts about Basic

Calculate The Average, Variance, And Standard Deviation

Goal

This post aims to introduce how to calculate the average, variance, and standard deviation of a matrix using pandas.

Libraries

In [2]:
import pandas as pd
import numpy as np

Create a matrix

In [13]:
n = 1000
df = pd.DataFrame({'rand': np.random.rand(n),
                   'randint': np.random.randint(low=0, high=100, size=n),
                   'randn': np.random.randn(n),
                   'random_sample': np.random.random_sample(size=n),
                   'binomial': np.random.binomial(n=1, p=.5, size=n),
                   'beta': np.random.beta(a=1, b=1, size=n),
                   })
df.head()
Out[13]:
rand randint randn random_sample binomial beta
0 0.689690 59 0.416245 0.607567 1 0.532052
1 0.288356 2 0.092351 0.311634 0 0.192651
2 0.173002 50 -0.626691 0.920702 0 0.342812
3 0.953088 17 -0.149677 0.316060 1 0.792191
4 0.693120 94 0.264678 0.060313 1 0.059370

Calculate average, variance, and standard deviation

Calculate by each function

In [16]:
df.mean()
Out[16]:
rand              0.497015
randint          49.224000
randn            -0.054651
random_sample     0.504412
binomial          0.490000
beta              0.508469
dtype: float64
In [15]:
df.var()
Out[15]:
rand               0.083301
randint          791.485309
randn              1.033378
random_sample      0.081552
binomial           0.250150
beta               0.083489
dtype: float64
In [17]:
df.std()
Out[17]:
rand              0.288619
randint          28.133349
randn             1.016552
random_sample     0.285573
binomial          0.500150
beta              0.288944
dtype: float64

Calculate using describe

In [18]:
df.describe()
Out[18]:
rand randint randn random_sample binomial beta
count 1000.000000 1000.000000 1000.000000 1000.000000 1000.00000 1000.000000
mean 0.497015 49.224000 -0.054651 0.504412 0.49000 0.508469
std 0.288619 28.133349 1.016552 0.285573 0.50015 0.288944
min 0.000525 0.000000 -3.405606 0.001359 0.00000 0.000373
25% 0.241000 25.000000 -0.741640 0.264121 0.00000 0.256070
50% 0.497571 48.000000 -0.074852 0.505738 0.00000 0.523674
75% 0.742702 73.000000 0.602928 0.743445 1.00000 0.758901
max 0.999275 99.000000 3.861652 0.995010 1.00000 0.999007
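The three calls above are related: the standard deviation is just the square root of the variance. A minimal check on a toy Series (note that pandas defaults to the sample statistics with ddof=1, while NumPy's np.var/np.std default to ddof=0):

```python
import numpy as np
import pandas as pd

# Toy column to check the relationship between var and std.
s = pd.Series([1.0, 2.0, 3.0, 4.0])

mean = s.mean()   # 2.5
var = s.var()     # sample variance (ddof=1): 5/3
std = s.std()     # square root of the sample variance

assert np.isclose(std, np.sqrt(var))
```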

Describe An Array

Goal

This post aims to describe an array using pandas. As an example, the Boston Housing data is used.

Libraries

In [13]:
import pandas as pd
from sklearn.datasets import load_boston
%matplotlib inline

Create an array

In [4]:
boston = load_boston()
df_boston = pd.DataFrame(boston['data'], columns=boston['feature_names'])
df_boston.head()
Out[4]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33

Describe numerical values

pandas DataFrame has a method called describe, which shows basic statistics based on the data type of each column.

In [5]:
df_boston.describe()
Out[5]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
count 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000
mean 3.613524 11.363636 11.136779 0.069170 0.554695 6.284634 68.574901 3.795043 9.549407 408.237154 18.455534 356.674032 12.653063
std 8.601545 23.322453 6.860353 0.253994 0.115878 0.702617 28.148861 2.105710 8.707259 168.537116 2.164946 91.294864 7.141062
min 0.006320 0.000000 0.460000 0.000000 0.385000 3.561000 2.900000 1.129600 1.000000 187.000000 12.600000 0.320000 1.730000
25% 0.082045 0.000000 5.190000 0.000000 0.449000 5.885500 45.025000 2.100175 4.000000 279.000000 17.400000 375.377500 6.950000
50% 0.256510 0.000000 9.690000 0.000000 0.538000 6.208500 77.500000 3.207450 5.000000 330.000000 19.050000 391.440000 11.360000
75% 3.677083 12.500000 18.100000 0.000000 0.624000 6.623500 94.075000 5.188425 24.000000 666.000000 20.200000 396.225000 16.955000
max 88.976200 100.000000 27.740000 1.000000 0.871000 8.780000 100.000000 12.126500 24.000000 711.000000 22.000000 396.900000 37.970000
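When a frame has this many columns, the describe output gets wide; transposing it puts one feature per row, and custom percentiles can be requested. A sketch on a small hypothetical frame (the column names only mimic the Boston data):

```python
import pandas as pd

# Hypothetical values standing in for the Boston data.
df = pd.DataFrame({'CRIM': [0.006, 0.027, 0.032, 0.069],
                   'RM': [6.575, 6.421, 6.998, 7.147]})

# One feature per row, with 10% and 90% percentiles added
# (the median, 50%, is always included).
summary = df.describe(percentiles=[.1, .9]).T
print(summary)
```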

Random Forest Classifier

Goal

This post aims to introduce how to train a random forest classifier, one of the most popular machine learning models.

Libraries

In [12]:
import pandas as pd
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
%matplotlib inline

Load Data

In [6]:
X, y = make_blobs(n_samples=10000, n_features=10, centers=100, random_state=0)
df_X = pd.DataFrame(X)
df_X.head()
Out[6]:
0 1 2 3 4 5 6 7 8 9
0 6.469076 4.250703 -8.636944 4.044785 9.017254 4.535872 -4.670276 -0.481728 -6.449961 -2.659850
1 6.488564 9.379570 10.327917 -1.765055 -2.068842 -9.537790 3.936380 3.375421 7.412737 -9.722844
2 8.373928 -10.143423 -3.527536 -7.338834 1.385557 6.961417 -4.504456 -7.315360 -2.330709 6.440872
3 -3.414101 -2.019790 -2.748108 4.168691 -5.788652 -7.468685 -1.719800 -5.302655 4.534099 -4.613695
4 -1.330023 -3.725465 9.559999 -6.751356 -7.407864 -2.131515 1.766013 2.381506 -1.886568 8.667311
In [8]:
df_y = pd.DataFrame(y, columns=['y'])
df_y.head()
Out[8]:
y
0 85
1 64
2 93
3 46
4 61

Train a model using Cross Validation

In [19]:
clf = RandomForestClassifier(n_estimators=10, max_depth=None, min_samples_split=2, random_state=0)
scores = cross_val_score(clf, X, y, cv=5, verbose=1)
scores.mean()                               
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    1.8s finished
Out[19]:
0.9997
In [15]:
pd.DataFrame(scores, columns=['CV Scores']).plot();
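Cross validation only scores the model; to inspect which features matter, the classifier can also be fit on the full data and queried for feature_importances_. A minimal sketch on a smaller synthetic dataset:

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier

# Smaller synthetic data, same shape of problem as above.
X, y = make_blobs(n_samples=1000, n_features=10, centers=10, random_state=0)

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X, y)

# One importance score per feature; the scores sum to 1.
importances = clf.feature_importances_
print(importances)
```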

Adding Or Subtracting Time

Goal

This post aims to add or subtract time from a date column using pandas:

  • Pandas

Reference:

  • Chris Albon's blog (I looked at his post's title and wrote my own content to deepen my understanding of the topic.)

Library

In [1]:
import pandas as pd

Create date columns using date_range

In [2]:
date_rng = pd.date_range(start='20160101', end='20190101', freq='m', closed='left')
date_rng
Out[2]:
DatetimeIndex(['2016-01-31', '2016-02-29', '2016-03-31', '2016-04-30',
               '2016-05-31', '2016-06-30', '2016-07-31', '2016-08-31',
               '2016-09-30', '2016-10-31', '2016-11-30', '2016-12-31',
               '2017-01-31', '2017-02-28', '2017-03-31', '2017-04-30',
               '2017-05-31', '2017-06-30', '2017-07-31', '2017-08-31',
               '2017-09-30', '2017-10-31', '2017-11-30', '2017-12-31',
               '2018-01-31', '2018-02-28', '2018-03-31', '2018-04-30',
               '2018-05-31', '2018-06-30', '2018-07-31', '2018-08-31',
               '2018-09-30', '2018-10-31', '2018-11-30', '2018-12-31'],
              dtype='datetime64[ns]', freq='M')
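With the dates in hand, time can actually be added or subtracted: pd.Timedelta shifts by a fixed duration, while pd.DateOffset shifts by calendar units such as months. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'date': pd.date_range(start='20160101', periods=3, freq='M')})

# Add a fixed duration with Timedelta ...
df['plus_one_day'] = df['date'] + pd.Timedelta(days=1)

# ... or shift by calendar units with DateOffset.
df['minus_one_month'] = df['date'] - pd.DateOffset(months=1)
print(df)
```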

Ordinal Encoding using Scikit-learn

Goal

This post aims to convert one of the categorical columns into numbers for further processing using scikit-learn.

Library

In [1]:
import pandas as pd
import sklearn.preprocessing

Create categorical data

In [2]:
df = pd.DataFrame(data={'type': ['cat', 'dog', 'sheep'], 
                       'weight': [10, 15, 50]})
df
Out[2]:
type weight
0 cat 10
1 dog 15
2 sheep 50

Ordinal Encoding

Ordinal encoding replaces each category with a number.

In [3]:
# Instantiate the ordinal encoder class
oe = sklearn.preprocessing.OrdinalEncoder()

# Learn the mapping from categories to the numbers
oe.fit(df.loc[:, ['type']])
Out[3]:
OrdinalEncoder(categories='auto', dtype=<class 'numpy.float64'>)
In [4]:
# Apply this ordinal encoder to new data 
oe.transform(pd.DataFrame(['cat'] * 3 + 
                          ['dog'] * 2 + 
                          ['sheep'] * 5))
Out[4]:
array([[0.],
       [0.],
       [0.],
       [1.],
       [1.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.]])
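The fitted encoder keeps the learned mapping in categories_ (the labels sorted alphabetically, so the index in that array is the encoded value), and inverse_transform goes back from numbers to labels. A short sketch:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

oe = OrdinalEncoder()
oe.fit(pd.DataFrame({'type': ['cat', 'dog', 'sheep']}))

# The learned mapping: position in this array == encoded value.
print(oe.categories_)

# Map encoded values back to the original labels.
labels = oe.inverse_transform([[0.], [2.], [1.]])
print(labels)
```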

Create A Sparse Matrix

Goal

This post aims to create a sparse matrix in Python using the following modules:

  • Numpy
  • Scipy

Reference:

  • Scipy Document
  • Chris Albon's blog (I looked at his post's title and wrote my own content to deepen my understanding of the topic.)

Library

In [8]:
import numpy as np
import scipy.sparse

Create a sparse matrix using csr_matrix

CSR stands for "Compressed Sparse Row", a format that stores only the nonzero entries row by row.

In [9]:
nrow = 10000
ncol = 10000

# CSR stands for "Compressed Sparse Row" matrix
arr_sparse = scipy.sparse.csr_matrix((nrow, ncol))
arr_sparse
Out[9]:
<10000x10000 sparse matrix of type '<class 'numpy.float64'>'
	with 0 stored elements in Compressed Sparse Row format>
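The matrix above is empty (0 stored elements); to store actual values, csr_matrix also accepts a (data, (row, col)) triple. A minimal sketch:

```python
import numpy as np
import scipy.sparse

# Three nonzero entries at positions (0, 2), (1, 0), and (2, 1).
data = np.array([1.0, 2.0, 3.0])
row = np.array([0, 1, 2])
col = np.array([2, 0, 1])
m = scipy.sparse.csr_matrix((data, (row, col)), shape=(3, 3))

# Only the 3 nonzeros are stored; toarray() expands to a dense array.
print(m.toarray())
```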