One-Hot Encode Nominal Categorical Features

Goal

This post shows how to create one-hot-encoded features for categorical variables. It covers two ways of doing so: OneHotEncoder in scikit-learn and get_dummies in pandas.

Personally, I prefer get_dummies in pandas: it takes care of column names and data types, so the result looks cleaner and needs less code.

Libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder

Create data for one-hot encoding

In [4]:
df = pd.DataFrame(data={'fruit': ['apple', 'apple', 'banana', 'orange', 'banana', 'apple'], 
                       'size': ['large', 'medium', 'small','large', 'medium', 'small']})
df
Out[4]:
    fruit    size
0   apple   large
1   apple  medium
2  banana   small
3  orange   large
4  banana  medium
5   apple   small

Create one-hot encoded columns

Using OneHotEncoder in sklearn

In [17]:
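encoder = OneHotEncoder()
# fit_transform returns a sparse matrix, so convert it to dense before building the DataFrame.
# Note: get_feature_names() was removed in scikit-learn 1.2; newer versions provide
# get_feature_names_out(), which names the columns fruit_apple, fruit_banana, fruit_orange.
df_fruit_encoded = pd.DataFrame(encoder.fit_transform(df[['fruit']]).todense(), 
                                columns=encoder.get_feature_names())
df_fruit_encoded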
Out[17]:
   x0_apple  x0_banana  x0_orange
0       1.0        0.0        0.0
1       1.0        0.0        0.0
2       0.0        1.0        0.0
3       0.0        0.0        1.0
4       0.0        1.0        0.0
5       1.0        0.0        0.0
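
The same encoder can handle several columns at once. Below is a minimal sketch, not the only way to do it: it encodes both fruit and size, and builds the column names from the fitted categories_ attribute so that it does not depend on which feature-name helper your scikit-learn version offers. The handle_unknown='ignore' setting and the 'kiwi' row are my own additions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame(data={'fruit': ['apple', 'apple', 'banana', 'orange', 'banana', 'apple'],
                        'size': ['large', 'medium', 'small', 'large', 'medium', 'small']})

cols = ['fruit', 'size']

# handle_unknown='ignore' encodes categories unseen during fit as all zeros
encoder = OneHotEncoder(handle_unknown='ignore')
encoded = encoder.fit_transform(df[cols]).toarray()

# categories_ holds one sorted array of categories per input column
column_names = [f'{col}_{cat}'
                for col, cats in zip(cols, encoder.categories_)
                for cat in cats]
df_encoded = pd.DataFrame(encoded, columns=column_names, index=df.index)

# Illustrative new row: 'kiwi' was not in the training data
new_rows = pd.DataFrame(data={'fruit': ['kiwi'], 'size': ['small']})
new_encoded = encoder.transform(new_rows).toarray()
```

The fitted encoder remembers the categories it saw, so new data always gets the same six columns. Here the unseen fruit simply gets zeros in all three fruit columns.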

Using get_dummies method in pandas

In [18]:
# dtype=int keeps the 0/1 output; pandas 2.0+ returns True/False by default
pd.get_dummies(df['size'], dtype=int)
Out[18]:
   large  medium  small
0      1       0      0
1      0       1      0
2      0       0      1
3      1       0      0
4      0       1      0
5      0       0      1
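
get_dummies also accepts the whole DataFrame, which is where the automatic column naming pays off. The sketch below assumes the same df as above. The columns, drop_first and dtype arguments are standard get_dummies parameters, but the choice to use them here is mine, not something the single-column example requires.

```python
import pandas as pd

df = pd.DataFrame(data={'fruit': ['apple', 'apple', 'banana', 'orange', 'banana', 'apple'],
                        'size': ['large', 'medium', 'small', 'large', 'medium', 'small']})

# Encode both columns in one call; each new column is prefixed with its source column name
df_dummies = pd.get_dummies(df, columns=['fruit', 'size'], dtype=int)

# drop_first=True drops the first level of each feature, which avoids
# perfectly collinear columns in models such as linear regression
df_dummies_reduced = pd.get_dummies(df, columns=['fruit', 'size'],
                                    drop_first=True, dtype=int)
```

One thing to keep in mind: unlike a fitted OneHotEncoder, get_dummies does not remember categories, so calling it on new data with different values can produce a different set of columns.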
