Variance Thresholding For Feature Selection
Goal¶
This post aims to introduce how to conduct feature selection by variance thresholding
Reference
Libraries¶
In [26]:
from sklearn.feature_selection import VarianceThreshold
from sklearn.datasets import load_boston
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
Load Boston housing data¶
In [4]:
boston = load_boston()
df_boston = pd.DataFrame(boston.data, columns=boston.feature_names)
df_boston.head()
Out[4]:
Add low variance columns¶
In [14]:
np.var(np.random.random(size=df_boston.shape[0]))
Out[14]:
In [16]:
df_boston['low_variance'] = 1
df_boston['low_variance2'] = (np.random.random(size=df_boston.shape[0]) > 0.5) * 1
df_boston.head()
Out[16]:
Variance Thresholding¶
In [50]:
variance_threshold = 0.1
selection = VarianceThreshold(threshold=variance_threshold)
selection.fit(df_boston)
Out[50]:
In [51]:
ax = pd.Series(selection.variances_, index=df_boston.columns).plot(kind='bar', logy=True);
ax.axhline(variance_threshold, ls='dotted', c='r');
In [52]:
df_boston_selected = pd.DataFrame(selection.transform(df_boston), columns=df_boston.columns[selection.get_support()])
df_boston_selected.head()
Out[52]: