One-Hot Encode Nominal Categorical Features

h1ros

Jun 20, 2019, 12:38:23 PM

Goal¶

This post introduces how to create one-hot encoded features for categorical variables. It covers two ways of doing so: OneHotEncoder in scikit-learn and get_dummies in pandas.

Personally, I prefer get_dummies in pandas because it takes care of column names and data types, so the result looks cleaner and needs less code.

Libraries¶

In [1]:

import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder

Create data for one-hot encoding¶

In [4]:

df = pd.DataFrame(data={'fruit': ['apple', 'apple', 'banana', 'orange', 'banana', 'apple'], 
                       'size': ['large', 'medium', 'small','large', 'medium', 'small']})
df

Out[4]:

	fruit	size
0	apple	large
1	apple	medium
2	banana	small
3	orange	large
4	banana	medium
5	apple	small
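
Before encoding, it can help to check how many distinct categories each column holds, since that is the number of one-hot columns to expect. The following is a minimal sketch on the same data; the variable names `n_categories` and `fruit_levels` are introduced here only for illustration.

```python
import pandas as pd

df = pd.DataFrame(data={'fruit': ['apple', 'apple', 'banana', 'orange', 'banana', 'apple'],
                        'size': ['large', 'medium', 'small', 'large', 'medium', 'small']})

# Distinct categories per column = number of one-hot columns each will produce
n_categories = df.nunique()
print(n_categories)

# Sorted labels; the encoded columns shown in this post appear in this alphabetical order
fruit_levels = sorted(df['fruit'].unique())
print(fruit_levels)
```

Both columns here have three categories, so each one expands into three indicator columns.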

Create one-hot encoded columns¶

Using `OneHotEncoder` in `sklearn`¶

In [17]:

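# fit_transform returns a sparse matrix, hence the call to .todense()
encoder = OneHotEncoder()
df_fruit_encoded = pd.DataFrame(encoder.fit_transform(df[['fruit']]).todense(),
                                columns=encoder.get_feature_names())
# Note: get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out()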
df_fruit_encoded

Out[17]:

	x0_apple	x0_banana	x0_orange
0	1.0	0.0	0.0
1	1.0	0.0	0.0
2	0.0	1.0	0.0
3	0.0	0.0	1.0
4	0.0	1.0	0.0
5	1.0	0.0	0.0
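
The cell above depends on the scikit-learn version in use. As a hedged sketch, the same encoding can be written so that it runs on both older and newer releases by checking which feature-name method the encoder provides. The `hasattr` fallback is an assumption added here, not part of the original notebook, and on newer versions the column names are likely to be prefixed with `fruit_` rather than `x0_`.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame(data={'fruit': ['apple', 'apple', 'banana', 'orange', 'banana', 'apple'],
                        'size': ['large', 'medium', 'small', 'large', 'medium', 'small']})

encoder = OneHotEncoder()
encoded = encoder.fit_transform(df[['fruit']])  # sparse result, one column per category

# Newer scikit-learn provides get_feature_names_out(); older releases only have get_feature_names()
if hasattr(encoder, 'get_feature_names_out'):
    columns = encoder.get_feature_names_out()
else:
    columns = encoder.get_feature_names()

# Convert to a dense array and keep the original row index
df_fruit_encoded = pd.DataFrame(encoded.toarray(), columns=columns, index=df.index)
print(df_fruit_encoded)
```

Each row contains exactly one 1.0, marking the fruit category of that row.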

Using `get_dummies` method in `pandas`¶

In [18]:

pd.get_dummies(df['size'])

Out[18]:

	large	medium	small
0	1	0	0
1	0	1	0
2	0	0	1
3	1	0	0
4	0	1	0
5	0	0	1
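
get_dummies can also encode several columns of a DataFrame in one call, prefixing each new column with its source column name. The sketch below assumes the same data as above. Passing `dtype=int` is a choice made here so that the result shows 1/0 values, since pandas 2.0 and later return boolean columns by default.

```python
import pandas as pd

df = pd.DataFrame(data={'fruit': ['apple', 'apple', 'banana', 'orange', 'banana', 'apple'],
                        'size': ['large', 'medium', 'small', 'large', 'medium', 'small']})

# Encode both columns at once; new columns are named <source column>_<category>
df_encoded = pd.get_dummies(df, columns=['fruit', 'size'], dtype=int)
print(df_encoded.columns.tolist())
# ['fruit_apple', 'fruit_banana', 'fruit_orange', 'size_large', 'size_medium', 'size_small']
```

This is the convenience mentioned in the Goal section: column names and types are handled without any extra code.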
