Term Frequency Inverse Document Frequency

Goal

This post introduces term frequency-inverse document frequency, also known as TF-IDF. TF-IDF scores how important a word is to a document by weighing how often the word appears in that document against how common the word is across a collection of documents. It is commonly used to create features from text.

Term Frequency (TF)

Term frequency is the number of occurrences of a word in a document, $n_{word}$, divided by the total number of words in that document, $N_{word}$.

$$TF = \frac{n_{word}}{N_{word}}$$
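
As a minimal sketch of the formula above, the snippet below computes TF for every word in one document. The helper name `term_frequency` and the naive tokenization (lowercasing, splitting on whitespace, stripping punctuation) are assumptions made for illustration, not part of any library.

```python
from collections import Counter


def term_frequency(document: str) -> dict[str, float]:
    """Return TF = n_word / N_word for every word in a single document."""
    # Naive tokenization (an assumption): lowercase, split on whitespace,
    # and strip common punctuation from both ends of each token.
    words = [w.strip('.,"!?').lower() for w in document.split()]
    words = [w for w in words if w]
    counts = Counter(words)   # n_word for each distinct word
    total = len(words)        # N_word
    return {word: count / total for word, count in counts.items()}


tf = term_frequency("this is a sample.")
print(tf)  # four words, each occurring once, so every TF is 0.25
```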

Document Frequency (DF)

Document frequency is the number of documents containing the word, $n_{doc}$, divided by the total number of documents, $N_{doc}$.

$$DF = \frac{n_{doc}}{N_{doc}}$$

Inverse Document Frequency (IDF)

The inverse document frequency is the inverse of DF.

$$IDF = \frac{N_{doc}}{n_{doc}}$$

In practice, IDF is usually computed on a log scale so that it does not grow too large for rare words, and 1 is added to the denominator to avoid division by zero:

$$IDF = \log\left(\frac{N_{doc}}{n_{doc} + 1}\right)$$
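
The log-scaled IDF can be sketched in a few lines of Python. The function name `inverse_document_frequency`, the toy corpus, and the whitespace tokenization are all hypothetical choices for this illustration; the natural logarithm is assumed since the formula does not specify a base.

```python
import math


def tokenize(document: str) -> set[str]:
    """Return the set of distinct lowercase words in a document (naive split)."""
    words = (w.strip('.,"!?').lower() for w in document.split())
    return {w for w in words if w}


def inverse_document_frequency(documents: list[str]) -> dict[str, float]:
    """Return IDF = log(N_doc / (n_doc + 1)) for every word in the corpus."""
    n_docs = len(documents)            # N_doc
    doc_count: dict[str, int] = {}     # n_doc for each word
    for doc in documents:
        for word in tokenize(doc):
            doc_count[word] = doc_count.get(word, 0) + 1
    return {w: math.log(n_docs / (n + 1)) for w, n in doc_count.items()}


docs = ["the cat sat", "the dog sat", "the bird flew", "a fish swam"]
idf = inverse_document_frequency(docs)
# "cat" (1 document) scores higher than "sat" (2), which scores higher than "the" (3).
print(idf["cat"], idf["sat"], idf["the"])
```

With this particular smoothing, a word that appears in every document gets a slightly negative value, because the denominator then exceeds $N_{doc}$.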

Libraries

In [18]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

Create documents

In [36]:
documents = ['this is a sample.', 
            'this is another example. "this" also appears in the second example.']

documents
Out[36]:
['this is a sample.',
 'this is another example. "this" also appears in the second example.']

Apply TF-IDF vectorizer

Applying the TF-IDF vectorizer to the documents yields one feature vector per document.

In [37]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
df_tfidf = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
df_tfidf
Out[37]:
      also  another  appears  example       in        is    sample   second      the      this
0  0.00000  0.00000  0.00000  0.00000  0.00000  0.501549  0.704909  0.00000  0.00000  0.501549
1  0.28249  0.28249  0.28249  0.56498  0.28249  0.200994  0.000000  0.28249  0.28249  0.401988
In [38]:
# The highest TF-IDF for each document
df_tfidf.idxmax(axis=1)
Out[38]:
0     sample
1    example
dtype: object
In [39]:
# TF-IDF is zero if the word does not appear in a document
df_tfidf==0
Out[39]:
    also  another  appears  example     in     is  sample  second    the   this
0   True     True     True     True   True  False   False    True   True  False
1  False    False    False    False  False  False    True   False  False  False
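
The values returned by TfidfVectorizer do not follow the simplified formulas directly. With its default settings, scikit-learn uses the raw word count as TF, a smoothed IDF of ln((1 + N_doc) / (1 + n_doc)) + 1, and then scales each row to unit L2 norm. The sketch below assumes those defaults and uses only the standard library to recompute the table by hand; variable names such as `tokenized` and `rows` are chosen for illustration.

```python
import math
import re
from collections import Counter

documents = [
    'this is a sample.',
    'this is another example. "this" also appears in the second example.',
]

# Keep lowercase tokens of two or more word characters, which mirrors the
# vectorizer's default token pattern (single-letter words like "a" are dropped).
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in documents]
vocabulary = sorted({word for tokens in tokenized for word in tokens})
n_docs = len(documents)

# Smoothed IDF: ln((1 + N_doc) / (1 + n_doc)) + 1
idf = {
    word: math.log((1 + n_docs) / (1 + sum(word in tokens for tokens in tokenized))) + 1
    for word in vocabulary
}

rows = []
for tokens in tokenized:
    counts = Counter(tokens)
    raw = [counts[word] * idf[word] for word in vocabulary]  # raw count times IDF
    norm = math.sqrt(sum(value * value for value in raw))    # L2 norm of the row
    rows.append({word: value / norm for word, value in zip(vocabulary, raw)})

for row in rows:
    # Print only the non-zero entries, rounded like the DataFrame output.
    print({word: round(value, 6) for word, value in row.items() if value > 0})
```

For the first document this gives roughly 0.501549 for "is" and "this" and 0.704909 for "sample", matching the DataFrame. "this" and "is" appear in both documents and so receive the lowest IDF, which is why "sample" and "example" come out as the highest-scoring words.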
