Goal

This post introduces term frequency-inverse document frequency, also known as TF-IDF. It scores how important a word is to a document by weighing how often the word occurs in that document against how many documents in the collection contain it, and it is commonly used for feature creation.

Term Frequency (TF)

Term Frequency is computed as the number of occurrences of the word in a document, $n_{word}$, divided by the total number of words in that document, $N_{word}$.

$\begin{equation*} TF = \frac{n_{word}}{N_{word}} \end{equation*}$
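As a rough illustration of the formula above, here is a minimal sketch in plain Python. The helper name `term_frequency`, the example sentence, and the whitespace tokenization are assumptions made for this example, not part of any library.

```python
from collections import Counter

def term_frequency(document):
    """Return {word: n_word / N_word} for a single document."""
    # Assumption: lowercase the text and split on whitespace to get words.
    words = document.lower().split()
    counts = Counter(words)   # n_word for every distinct word
    total = len(words)        # N_word
    return {word: count / total for word, count in counts.items()}

tf = term_frequency("the cat sat on the mat")
print(tf["the"])  # "the" occurs 2 times among 6 words
print(tf["cat"])  # "cat" occurs 1 time among 6 words
```

Because every count is divided by the same total, the TF values of one document sum to 1.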

Document Frequency (DF)

Document Frequency is computed as the number of documents containing the word, $n_{doc}$, divided by the total number of documents, $N_{doc}$.

$\begin{equation*} DF = \frac{n_{doc}}{N_{doc}} \end{equation*}$
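The same idea can be sketched for DF. The function name `document_frequency` and the three toy documents are hypothetical, and whitespace tokenization is again an assumption.

```python
def document_frequency(word, documents):
    """Return n_doc / N_doc for one word over a list of documents."""
    # Assumption: lowercase and split on whitespace; a set keeps one copy
    # of each word, since DF only asks whether the word is present.
    tokenized = [set(doc.lower().split()) for doc in documents]
    n_doc = sum(1 for tokens in tokenized if word in tokens)
    return n_doc / len(documents)

docs = ["the cat sat", "the dog barked", "a bird sang"]
df_the = document_frequency("the", docs)    # present in 2 of 3 documents
df_bird = document_frequency("bird", docs)  # present in 1 of 3 documents
print(df_the, df_bird)
```

Note that a word repeated many times inside one document still counts only once toward $n_{doc}$.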

Inverse Document Frequency (IDF)

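The inverse document frequency is the inverse of DF, so a word that appears in few documents gets a large value. The TF-IDF score of a word in a document is the product of its TF and its IDF.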

$\begin{equation*} IDF = \frac{N_{doc}}{n_{doc}} \end{equation*}$

In practice, IDF is computed with a logarithm so that it does not explode for rare words, and 1 is added to the denominator to avoid division by zero, as below.

$\begin{equation*} IDF = \log\left(\frac{N_{doc}}{n_{doc}+1}\right) \end{equation*}$
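To get a feel for how this version behaves, the sketch below evaluates it for a few document counts. The function name `idf_smoothed` and the collection size of 1000 are made up for illustration, and the natural logarithm is an assumption because the formula does not fix a base.

```python
import math

def idf_smoothed(n_doc, N_doc):
    """log(N_doc / (n_doc + 1)), using the natural logarithm (an assumption)."""
    return math.log(N_doc / (n_doc + 1))

N_doc = 1000  # hypothetical collection size

unseen = idf_smoothed(0, N_doc)         # no division by zero for an unseen word
rare = idf_smoothed(1, N_doc)           # word found in a single document
common = idf_smoothed(999, N_doc)       # log(1000 / 1000)
everywhere = idf_smoothed(1000, N_doc)  # word found in every document

print(unseen, rare, common, everywhere)
```

One side effect of the added 1 is that a word present in every document gets a slightly negative IDF, since the ratio inside the logarithm drops below 1.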

Libraries

In [18]:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

Create a document

In [36]:

documents = ['this is a sample.', 
            'this is another example. "this" also appears in the second example.']

documents

Out[36]:

['this is a sample.',
 'this is another example. "this" also appears in the second example.']

Apply TF-IDF vectorizer

Applying the TF-IDF vectorizer to the documents gives one feature vector per document.

In [37]:

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
df_tfidf = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
df_tfidf

Out[37]:

	also	another	appears	example	in	is	sample	second	the	this
0	0.00000	0.00000	0.00000	0.00000	0.00000	0.501549	0.704909	0.00000	0.00000	0.501549
1	0.28249	0.28249	0.28249	0.56498	0.28249	0.200994	0.000000	0.28249	0.28249	0.401988
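The numbers in this table are not what the formulas earlier in the post give directly. According to the scikit-learn documentation, `TfidfVectorizer` by default uses raw counts as TF, a smoothed IDF of $\ln\left(\frac{1+N_{doc}}{1+n_{doc}}\right)+1$, and then scales each row to unit Euclidean length. The sketch below recomputes the table with the standard library only, under those assumptions; the regular expression is meant to approximate the default tokenizer (lowercased tokens of at least two word characters, which is why "a" has no column).

```python
import math
import re
from collections import Counter

documents = ['this is a sample.',
             'this is another example. "this" also appears in the second example.']

# Assumption: mimic the default tokenization (lowercase, tokens of 2+ word characters).
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in documents]
vocabulary = sorted({word for tokens in tokenized for word in tokens})
n_docs = len(documents)

# Smoothed IDF: ln((1 + N_doc) / (1 + n_doc)) + 1
idf = {}
for word in vocabulary:
    doc_count = sum(1 for tokens in tokenized if word in tokens)
    idf[word] = math.log((1 + n_docs) / (1 + doc_count)) + 1

rows = []
for tokens in tokenized:
    counts = Counter(tokens)
    weights = [counts[word] * idf[word] for word in vocabulary]  # raw count x IDF
    norm = math.sqrt(sum(w * w for w in weights))                # Euclidean length
    rows.append([w / norm for w in weights])                     # L2-normalized row

print(vocabulary)
for row in rows:
    print([round(value, 6) for value in row])
```

The row normalization explains why "this" scores 0.501549 in the first document but only 0.401988 in the second, even though it occurs twice there: the second document has more words sharing the unit length.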

In [38]:

# The highest TF-IDF for each document
df_tfidf.idxmax(axis=1)

Out[38]:

0     sample
1    example
dtype: object

In [39]:

# TF-IDF is zero if the word does not appear in a document
df_tfidf==0

Out[39]:

	also	another	appears	example	in	is	sample	second	the	this
0	True	True	True	True	True	False	False	True	True	False
1	False	False	False	False	False	False	True	False	False	False
