Bag Of Words

Goal

This post introduces the bag-of-words model, which can be used to build features for documents or images.

Put simply, a bag of words records how often each word occurs in a document, with every distinct word assigned its own index in the resulting vector.
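To make the definition concrete, here is a minimal sketch in plain Python, without scikit-learn. The two toy documents, the whitespace tokenizer, and the helper name `bag_of_words` are assumptions made for illustration only; real tokenizers handle punctuation and other details.

```python
from collections import Counter

# Hypothetical toy documents (not the ones used later in this post)
docs = ["the cat sat", "the cat saw the dog"]

# Naive tokenization: lowercase and split on whitespace (an assumption)
tokenized = [doc.lower().split() for doc in docs]

# Give every distinct word a column index, in alphabetical order
all_words = sorted({word for tokens in tokenized for word in tokens})
vocabulary = {word: index for index, word in enumerate(all_words)}

def bag_of_words(tokens):
    """Return the count vector for one tokenized document."""
    counts = Counter(tokens)
    vector = [0] * len(vocabulary)
    for word, n in counts.items():
        vector[vocabulary[word]] = n
    return vector

vectors = [bag_of_words(tokens) for tokens in tokenized]
print(vocabulary)  # {'cat': 0, 'dog': 1, 'sat': 2, 'saw': 3, 'the': 4}
print(vectors)     # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Word order is discarded: only the index and the count of each word survive, which is why the representation is called a "bag".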

Libraries

In [17]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

Create a document

In [18]:
documents = ['this is a sample.', 
            'this is another example. "this" also appears in the second example.']

documents
Out[18]:
['this is a sample.',
 'this is another example. "this" also appears in the second example.']

Create a count vector

In [19]:
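# Learn the vocabulary and count each word per document (sparse matrix)
count_vect = CountVectorizer()
word_counts = count_vect.fit_transform(documents)

# get_feature_names() was removed in newer scikit-learn; use get_feature_names_out() (>= 1.0)
df_word_counts = pd.DataFrame(word_counts.toarray(), columns=count_vect.get_feature_names_out())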
df_word_counts
Out[19]:
   also  another  appears  example  in  is  sample  second  the  this
0     0        0        0        0   0   1       1       0    0     1
1     1        1        1        2   1   1       0       1    1     2
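Note that the word "a" from the first document has no column, and that the quoted "this" is counted together with the unquoted one. With default settings, CountVectorizer lowercases the text and keeps only tokens of two or more word characters. The sketch below mimics that behavior with the standard `re` module; the helper name `tokenize` is hypothetical, and this approximates the default tokenization rather than reproducing the library's internals.

```python
import re
from collections import Counter

# Token pattern documented as the CountVectorizer default:
# runs of two or more word characters, so single letters like "a" are dropped
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and extract tokens of length >= 2."""
    return TOKEN_PATTERN.findall(text.lower())

first = tokenize('this is a sample.')
second = tokenize('this is another example. "this" also appears in the second example.')

print(first)                       # ['this', 'is', 'sample']
print(Counter(second)['this'])     # 2
print(Counter(second)['example'])  # 2
```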

Create a frequency vector

In [20]:
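# use_idf=False skips IDF weighting; each row is still L2-normalized (the default norm)
tf_transformer = TfidfTransformer(use_idf=False).fit(word_counts)
word_freq = tf_transformer.transform(word_counts)
# get_feature_names() was removed in newer scikit-learn; use get_feature_names_out() (>= 1.0)
df_word_freq = pd.DataFrame(word_freq.toarray(), columns=count_vect.get_feature_names_out())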
df_word_freq
Out[20]:
       also   another   appears   example        in        is   sample    second       the      this
0  0.000000  0.000000  0.000000  0.000000  0.000000  0.577350  0.57735  0.000000  0.000000  0.577350
1  0.258199  0.258199  0.258199  0.516398  0.258199  0.258199  0.00000  0.258199  0.258199  0.516398
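These values are not raw proportions: the rows do not sum to 1. With `use_idf=False` and the default `norm='l2'`, TfidfTransformer divides each row of counts by its Euclidean length. The sketch below reproduces the numbers by hand from the count table above; the helper name `l2_normalize` is hypothetical.

```python
import math

# Count vectors copied from the count table above
counts = [
    [0, 0, 0, 0, 0, 1, 1, 0, 0, 1],
    [1, 1, 1, 2, 1, 1, 0, 1, 1, 2],
]

def l2_normalize(row):
    """Divide each entry by the row's Euclidean (L2) norm."""
    norm = math.sqrt(sum(x * x for x in row))
    return [x / norm if norm else 0.0 for x in row]

freqs = [l2_normalize(row) for row in counts]

print(round(freqs[0][5], 6))  # 0.57735   (1 / sqrt(3))
print(round(freqs[1][0], 6))  # 0.258199  (1 / sqrt(15))
print(round(freqs[1][3], 6))  # 0.516398  (2 / sqrt(15))
```

If proportions that sum to 1 per document are wanted instead, the transformer also accepts `norm='l1'`.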
