Term Frequency Inverse Document Frequency
Goal
This post introduces term frequency-inverse document frequency, also known as TF-IDF. It scores the importance of a word in a document by weighing how often the word occurs in that document against how many documents in the collection contain it, and it is commonly used for feature creation.
Term Frequency (TF)
Term frequency is the number of occurrences of a word in a document, $n_{word}$, divided by the total number of words in that document, $N_{word}$.
\begin{equation*} TF = \frac{n_{word}}{N_{word}} \end{equation*}
Document Frequency (DF)
Document frequency is the number of documents containing the word, $n_{doc}$, divided by the total number of documents, $N_{doc}$.
\begin{equation*} DF = \frac{n_{doc}}{N_{doc}} \end{equation*}
Inverse Document Frequency (IDF)
The inverse document frequency is the inverse of DF.
\begin{equation*} IDF = \frac{N_{doc}}{n_{doc}} \end{equation*}
In practice, IDF is computed on a log scale to keep it from growing too large for rare words, with 1 added to the denominator to avoid division by zero, as below.
\begin{equation*} IDF = \log\left(\frac{N_{doc}}{n_{doc}+1}\right) \end{equation*}
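The definitions above can be sketched in plain Python. TF-IDF is conventionally the product of TF and IDF, which is what the `tf_idf` helper below computes. The corpus, the whitespace tokenization, and all function names here are illustrative assumptions, not part of any library.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already lowercased and free of punctuation.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
    "a bird sang",
]
# Naive whitespace tokenization (an assumption made for this sketch).
tokenized = [doc.split() for doc in corpus]


def term_frequency(word, tokens):
    """TF = n_word / N_word for a single document."""
    return Counter(tokens)[word] / len(tokens)


def document_frequency(word, docs):
    """DF = n_doc / N_doc across the corpus."""
    return sum(word in tokens for tokens in docs) / len(docs)


def inverse_document_frequency(word, docs):
    """IDF = log(N_doc / (n_doc + 1)), the log form with 1 added to the denominator."""
    n_doc = sum(word in tokens for tokens in docs)
    return math.log(len(docs) / (n_doc + 1))


def tf_idf(word, tokens, docs):
    """TF-IDF as the product of TF and IDF."""
    return term_frequency(word, tokens) * inverse_document_frequency(word, docs)


for word, doc_index in [("the", 0), ("cat", 0), ("bird", 3)]:
    score = tf_idf(word, tokenized[doc_index], tokenized)
    print(f"{word!r} in document {doc_index}: {score:.4f}")
```

One consequence of this particular smoothing is worth noting: a word that appears in all but one document gets an IDF of exactly zero, and a word that appears in every document gets a negative IDF. On a corpus this small that affects common words such as "the".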
Libraries
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
Create a document
documents = ['this is a sample.',
             'this is another example. "this" also appears in the second example.']
documents
Apply TF-IDF vectorizer
Fitting the TF-IDF vectorizer on the documents produces one feature vector per document.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)
df_tfidf = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
df_tfidf
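The values in this table will not match the textbook formulas given earlier, because `TfidfVectorizer` uses different defaults: raw term counts rather than length-normalized TF, a smoothed IDF of the form $\ln\frac{1 + N_{doc}}{1 + n_{doc}} + 1$ (`smooth_idf=True`), and L2 normalization of each row (`norm='l2'`). The sketch below tries to reproduce the default output by hand under those assumptions; the variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

documents = ['this is a sample.',
             'this is another example. "this" also appears in the second example.']

# Raw term counts per document (the default tokenizer drops one-letter tokens such as "a").
counts = CountVectorizer().fit_transform(documents).toarray()

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of documents containing each term

# Smoothed IDF, as used when smooth_idf=True.
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts by IDF, then scale each row to unit Euclidean length.
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Compare against the library's own result.
reference_vectorizer = TfidfVectorizer()
reference = reference_vectorizer.fit_transform(documents).toarray()
print(np.allclose(manual, reference))
```

Because of the `+ 1` term, a word that appears in every document still receives a positive weight, unlike in the plain log form.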
# The highest TF-IDF for each document
df_tfidf.idxmax(axis=1)
# TF-IDF is zero if the word does not appear in a document
df_tfidf==0