Posts about Feature Engineering

Bag of Words

Goal

This post aims to introduce Bag of Words, which can be used as a feature representation for documents or images.

Simply put, a bag of words records the frequency of each word, with an index assigned to every word in the vocabulary.
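As a minimal sketch of the idea (not scikit-learn's implementation, which comes next), a bag of words can be built with a plain dictionary for the word-to-index mapping and a Counter for the frequencies:

from collections import Counter

# Toy corpus for illustration only
docs = ['this is a sample', 'this is another example this also appears']

# Build the vocabulary: each new word gets the next column index
vocabulary = {}
for doc in docs:
    for word in doc.split():
        vocabulary.setdefault(word, len(vocabulary))

# One count vector per document, indexed by the shared vocabulary
vectors = []
for doc in docs:
    counts = Counter(doc.split())
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
print(vectors)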

Libraries

In [17]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

Create a document

In [18]:
documents = ['this is a sample.', 
            'this is another example. "this" also appears in the second example.']

documents
Out[18]:
['this is a sample.',
 'this is another example. "this" also appears in the second example.']

Create a count vector

In [19]:
count_vect = CountVectorizer()
word_counts = count_vect.fit_transform(documents)  # sparse matrix of raw word counts

df_word_counts = pd.DataFrame(word_counts.todense(), columns=count_vect.get_feature_names_out())
df_word_counts
Out[19]:
   also  another  appears  example  in  is  sample  second  the  this
0     0        0        0        0   0   1       1       0    0     1
1     1        1        1        2   1   1       0       1    1     2
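As a side note, the fitted vectorizer exposes the learned word-to-index mapping through its vocabulary_ attribute. Notice also that the single-character token "a" from the first sentence does not appear as a column: CountVectorizer's default token pattern only keeps tokens of two or more characters.

# Learned mapping from each word to its column index
count_vect.vocabulary_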

Create a frequency vector

In [20]:
# use_idf=False keeps only term frequencies (L2-normalized counts)
tf_transformer = TfidfTransformer(use_idf=False).fit(df_word_counts)
word_freq = tf_transformer.transform(df_word_counts)
df_word_freq = pd.DataFrame(word_freq.todense(), columns=count_vect.get_feature_names_out())
df_word_freq
Out[20]:
       also   another   appears   example        in        is   sample    second       the      this
0  0.000000  0.000000  0.000000  0.000000  0.000000  0.577350  0.57735  0.000000  0.000000  0.577350
1  0.258199  0.258199  0.258199  0.516398  0.258199  0.258199  0.00000  0.258199  0.258199  0.516398
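These values are just the raw counts scaled so that each row has unit L2 norm: with use_idf=False the IDF weighting is disabled, and norm='l2' is TfidfTransformer's default. A quick check, assuming the df_word_counts and df_word_freq frames from the cells above:

import numpy as np

# Dividing each row of counts by its L2 norm reproduces the table above
counts = df_word_counts.to_numpy(dtype=float)
manual = counts / np.linalg.norm(counts, axis=1, keepdims=True)
np.allclose(manual, df_word_freq.to_numpy())  # True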

Term Frequency-Inverse Document Frequency

Goal

This post aims to introduce term frequency-inverse document frequency, also known as TF-IDF, which indicates the importance of a word in a document while accounting for its frequency across multiple documents, and which is used for feature creation.

Term Frequency (TF)

Term Frequency can be computed as the number of occurrences of the word, $n_{word}$, divided by the total number of words in the document, $N_{word}$.

\begin{equation*} TF = \frac{n_{word}}{N_{word}} \end{equation*}

Document Frequency (DF)

Document Frequency can be computed as the number of documents containing the word, $n_{doc}$, divided by the total number of documents, $N_{doc}$.

\begin{equation*} DF = \frac{n_{doc}}{N_{doc}} \end{equation*}

Inverse Document Frequency (IDF)

The Inverse Document Frequency is the reciprocal of DF.

\begin{equation*} IDF = \frac{N_{doc}}{n_{doc}} \end{equation*}

In practice, to keep IDF from exploding and to avoid division by zero, IDF is computed in log form, with 1 added to the denominator, as below.

\begin{equation*} IDF = \log\left(\frac{N_{doc}}{n_{doc}+1}\right) \end{equation*}
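As a worked example, the snippet below applies these formulas to the word "example" on the two toy sentences used later in this post. The tokenization here is a simple hand split; scikit-learn's own IDF (verified further below) uses a slightly different smoothed form.

import math

# Hand-tokenized toy corpus (scikit-learn's tokenizer differs slightly)
toy_docs = [['this', 'is', 'a', 'sample'],
            ['this', 'is', 'another', 'example', 'this', 'also',
             'appears', 'in', 'the', 'second', 'example']]

word = 'example'
N_doc = len(toy_docs)

# TF in the second document: occurrences / total words
tf = toy_docs[1].count(word) / len(toy_docs[1])   # 2 / 11

# DF: documents containing the word / total documents
n_doc = sum(word in doc for doc in toy_docs)
df = n_doc / N_doc                                # 1 / 2

# Log-form IDF with the +1 in the denominator from the formula above
idf = math.log(N_doc / (n_doc + 1))               # log(2 / 2) = 0.0

print(tf, df, idf)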


Libraries

In [18]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

Create a document

In [36]:
documents = ['this is a sample.', 
            'this is another example. "this" also appears in the second example.']

documents
Out[36]:
['this is a sample.',
 'this is another example. "this" also appears in the second example.']

Apply TF-IDF vectorizer

Applying TF-IDF to each sentence, we obtain a feature vector for each document.

In [37]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)  # sparse matrix of TF-IDF weights
df_tfidf = pd.DataFrame(X.todense(), columns=vectorizer.get_feature_names_out())
df_tfidf
Out[37]:
      also  another  appears  example       in        is    sample   second      the      this
0  0.00000  0.00000  0.00000  0.00000  0.00000  0.501549  0.704909  0.00000  0.00000  0.501549
1  0.28249  0.28249  0.28249  0.56498  0.28249  0.200994  0.000000  0.28249  0.28249  0.401988
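These numbers can be reproduced by hand. By default, TfidfVectorizer uses smooth_idf=True, which computes $idf(t) = \ln\frac{1+N_{doc}}{1+n_{doc}} + 1$ (exposed as vectorizer.idf_), multiplies the raw counts by it, and then L2-normalizes each row:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# scikit-learn's default (smooth) IDF: ln((1 + N) / (1 + df)) + 1
counts = CountVectorizer().fit_transform(documents).toarray().astype(float)
N = counts.shape[0]
df = (counts > 0).sum(axis=0)            # number of documents containing each word
idf = np.log((1 + N) / (1 + df)) + 1     # equals vectorizer.idf_

tfidf = counts * idf                                   # weight raw counts by IDF
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)  # L2-normalize each row
np.allclose(tfidf, X.toarray())          # True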
In [38]:
# The highest TF-IDF for each document
df_tfidf.idxmax(axis=1)
Out[38]:
0     sample
1    example
dtype: object
In [39]:
# TF-IDF is zero if the word does not appear in a document
df_tfidf == 0
Out[39]:
    also  another  appears  example     in     is  sample  second    the   this
0   True     True     True     True   True  False   False    True   True  False
1  False    False    False    False  False  False    True   False  False  False