Goal¶

This post introduces the bag-of-words representation, which turns each document (or, in computer vision, each image) into a feature vector.

Put simply, a bag of words assigns an index to each distinct word and records how often that word occurs in a document; word order is discarded.
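As a rough illustration of the idea, independent of any library, the sketch below builds a word-to-index mapping and per-document count vectors by hand. The helper name `bag_of_words`, the two toy sentences, and the tokenization rule (lowercase, tokens of two or more word characters) are assumptions made for this example, not part of the original post.

```python
import re
from collections import Counter


def bag_of_words(texts):
    """Return (word -> column index, list of count vectors) for the given texts."""
    # Assumed tokenization: lowercase, keep runs of two or more word characters.
    tokenized = [re.findall(r"\b\w\w+\b", text.lower()) for text in texts]

    # The vocabulary is every distinct token, sorted so that indices are stable.
    vocab = sorted({token for tokens in tokenized for token in tokens})
    index = {word: i for i, word in enumerate(vocab)}

    # One count vector per document, with one entry per vocabulary word.
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocab])
    return index, vectors


# Hypothetical toy input, chosen only to keep the vectors short.
index, vectors = bag_of_words(["the cat sat", "the cat saw the dog"])
print(index)
print(vectors)
```

Each position in a vector corresponds to one word of the shared vocabulary, so documents of different lengths end up with vectors of the same size.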

Libraries¶

In [17]:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

Create the documents¶

In [18]:

documents = ['this is a sample.', 
            'this is another example. "this" also appears in the second example.']

documents

Out[18]:

['this is a sample.',
 'this is another example. "this" also appears in the second example.']

Create a count vector¶

In [19]:

count_vect = CountVectorizer()
word_counts = count_vect.fit_transform(documents)

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
df_word_counts = pd.DataFrame(word_counts.toarray(), columns=count_vect.get_feature_names_out())
df_word_counts

Out[19]:

	also	another	appears	example	in	is	sample	second	the	this
0	0	0	0	0	0	1	1	0	0	1
1	1	1	1	2	1	1	0	1	1	2
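Note that the word "a" from the first document has no column. With its default settings, `CountVectorizer` lowercases the text and only keeps tokens of two or more characters. The sketch below inspects the fitted `vocabulary_` mapping and, as one possible way to keep one-character tokens, passes a looser `token_pattern`. The variable names are chosen for this example only.

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = ['this is a sample.',
             'this is another example. "this" also appears in the second example.']

# Default settings: tokens shorter than two characters are ignored.
default_vect = CountVectorizer().fit(documents)
# vocabulary_ maps each token to its column index; list the tokens in column order.
default_vocab = sorted(default_vect.vocabulary_, key=default_vect.vocabulary_.get)
print(default_vocab)

# A looser pattern (an assumption for this sketch) that also keeps one-character tokens.
single_char_vect = CountVectorizer(token_pattern=r"(?u)\b\w+\b").fit(documents)
single_char_vocab = sorted(single_char_vect.vocabulary_, key=single_char_vect.vocabulary_.get)
print(single_char_vocab)
```

Whether to keep such short tokens depends on the task; for many text-classification problems they carry little information.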

Create a normalized frequency vector¶

In [20]:

# use_idf=False skips IDF weighting; the default norm='l2' scales each row to unit Euclidean length
tf_transformer = TfidfTransformer(use_idf=False).fit(word_counts)
word_freq = tf_transformer.transform(word_counts)
df_word_freq = pd.DataFrame(word_freq.toarray(), columns=count_vect.get_feature_names_out())
df_word_freq

Out[20]:

	also	another	appears	example	in	is	sample	second	the	this
0	0.000000	0.000000	0.000000	0.000000	0.000000	0.577350	0.57735	0.000000	0.000000	0.577350
1	0.258199	0.258199	0.258199	0.516398	0.258199	0.258199	0.00000	0.258199	0.258199	0.516398
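The values above are not plain relative frequencies: the rows do not sum to 1. With `use_idf=False`, `TfidfTransformer` still applies its default `norm='l2'`, so each row of counts is divided by its Euclidean length. The sketch below reproduces the numbers by hand from the count table and, for comparison, shows the L1 version (counts divided by the row total), which is what `norm='l1'` would correspond to. The helper names are made up for this example.

```python
import math

# Count vectors copied from the count table above (columns: also ... this).
counts = [
    [0, 0, 0, 0, 0, 1, 1, 0, 0, 1],
    [1, 1, 1, 2, 1, 1, 0, 1, 1, 2],
]


def l2_normalize(row):
    """Divide each count by the Euclidean length of the row."""
    norm = math.sqrt(sum(c * c for c in row))
    return [c / norm for c in row]


def l1_normalize(row):
    """Divide each count by the row total, giving relative frequencies."""
    total = sum(row)
    return [c / total for c in row]


l2_rows = [l2_normalize(row) for row in counts]
l1_rows = [l1_normalize(row) for row in counts]

print([round(v, 6) for v in l2_rows[0]])
print([round(v, 6) for v in l2_rows[1]])
print([round(v, 6) for v in l1_rows[1]])
```

For the first document the three non-zero entries are 1/sqrt(3), and for the second document "example" and "this" are 2/sqrt(15), matching the table.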
