Goal
This post introduces term frequency-inverse document frequency, also known as TF-IDF. TF-IDF indicates how important a word is to a document by weighing its frequency within that document against how often it appears across multiple documents, and it is commonly used for feature creation.
Term Frequency (TF)
Term Frequency can be computed as the number of occurrences of the word, $n_{word}$, divided by the total number of words in the document, $N_{word}$.
\begin{equation*}
TF = \frac{n_{word}}{N_{word}}
\end{equation*}
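As a rough illustration of the formula above, the following sketch computes TF for every word in a single document. The function name `term_frequency`, the lowercasing, and the whitespace tokenization are assumptions made for this example; a real pipeline would likely use a proper tokenizer.

```python
from collections import Counter


def term_frequency(document):
    """Return {word: n_word / N_word} for one document.

    Assumes simple lowercasing and whitespace tokenization.
    """
    words = document.lower().split()
    counts = Counter(words)   # n_word for every distinct word
    total = len(words)        # N_word
    return {word: count / total for word, count in counts.items()}


tf = term_frequency("the cat sat on the mat")
print(tf["the"])  # 2 occurrences out of 6 words
print(tf["cat"])  # 1 occurrence out of 6 words
```

Because every count is divided by the same total, the TF values of one document sum to 1.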
Document Frequency (DF)
Document Frequency can be computed as the number of documents containing the word, $n_{doc}$, divided by the total number of documents, $N_{doc}$.
\begin{equation*}
DF = \frac{n_{doc}}{N_{doc}}
\end{equation*}
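The DF formula can be sketched in the same spirit. The helper `document_frequency` and the three-sentence toy corpus are hypothetical and serve only to make the ratio concrete; tokenization is again plain whitespace splitting.

```python
def document_frequency(word, documents):
    """Return n_doc / N_doc for a word over a list of documents.

    Assumes a non-empty list and whitespace tokenization.
    """
    # A document counts once, no matter how often the word occurs in it.
    n_doc = sum(1 for doc in documents if word in set(doc.lower().split()))
    return n_doc / len(documents)


corpus = [
    "the cat sat on the mat",
    "the dog barked",
    "a bird sang",
]
df_the = document_frequency("the", corpus)    # appears in 2 of 3 documents
df_bird = document_frequency("bird", corpus)  # appears in 1 of 3 documents
print(df_the, df_bird)
```

Note that "the" occurs twice in the first document but still contributes only one to $n_{doc}$.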
Inverse Document Frequency (IDF)
Inverse Document Frequency is the reciprocal of DF, so words that appear in fewer documents receive larger values.
\begin{equation*}
IDF = \frac{N_{doc}}{n_{doc}}
\end{equation*}
In practice, IDF is computed on a logarithmic scale, which keeps the value from exploding for rare words, and 1 is added to the denominator to avoid division by zero when no document contains the word.
\begin{equation*}
IDF = \log\left(\frac{N_{doc}}{n_{doc}+1}\right)
\end{equation*}
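The smoothed IDF can be sketched as follows, using the natural logarithm. The function names and the toy corpus are hypothetical. The final `tf_idf` helper multiplies TF by IDF, which is the usual way the two quantities are combined; that step is an assumption of this example, since the post does not write it out.

```python
import math


def tokenize(document):
    # Assumption: lowercasing plus whitespace splitting is enough here.
    return document.lower().split()


def inverse_document_frequency(word, documents):
    """Return log(N_doc / (n_doc + 1)) using the natural logarithm."""
    n_doc = sum(1 for doc in documents if word in tokenize(doc))
    return math.log(len(documents) / (n_doc + 1))


def tf_idf(word, document, documents):
    """Assumed combination: TF of the word in one document times its IDF."""
    words = tokenize(document)
    if not words:
        return 0.0
    tf = words.count(word) / len(words)
    return tf * inverse_document_frequency(word, documents)


corpus = [
    "the cat sat on the mat",
    "the dog barked",
    "a bird sang",
]
idf_the = inverse_document_frequency("the", corpus)        # log(3 / 3)
idf_bird = inverse_document_frequency("bird", corpus)      # log(3 / 2)
idf_unseen = inverse_document_frequency("fish", corpus)    # log(3 / 1), no error
score_bird = tf_idf("bird", corpus[2], corpus)             # (1 / 3) * log(3 / 2)
print(idf_the, idf_bird, idf_unseen, score_bird)
```

One consequence of adding 1 to the denominator is that a word present in every document gets $\log(N_{doc}/(N_{doc}+1))$, which is slightly negative. Library implementations often use a different smoothing variant, so their values will not necessarily match this sketch.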
Reference