Understanding TF-IDF in NLP
19 Oct 2024
Introduction
In our previous post, we introduced One-Hot Encoding and the Bag-of-Words (BoW) model, which are simple methods of representing text as numerical vectors. While these techniques are foundational, they come with certain limitations. One major drawback of Bag-of-Words is that it treats all words equally—common words like “the” or “is” are given the same importance as more meaningful words like “science” or “NLP.”
TF-IDF (Term Frequency-Inverse Document Frequency) is an extension of BoW that aims to address this problem. By weighting words based on their frequency in individual documents versus the entire corpus, TF-IDF highlights more important words and reduces the impact of common, less meaningful ones.
TF-IDF
TF-IDF stands for Term Frequency-Inverse Document Frequency. It’s a numerical statistic used to reflect the importance of a word in a document relative to a collection of documents (a corpus). The formula is:
\[\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)\]
Where:
- \(\text{TF}(t, d)\): Term Frequency of term \(t\) in document \(d\), i.e. how often \(t\) occurs in \(d\), normalized by the length of \(d\).
- \(\text{IDF}(t)\): Inverse Document Frequency, which measures how important \(t\) is across the entire corpus.
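As a quick illustration with made-up numbers (both factors are defined precisely below): a term with \(\text{TF} = 0.03\) in a document and \(\text{IDF} = 2\) across the corpus would receive the score
\[\text{TF-IDF} = 0.03 \times 2 = 0.06\]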
Term Frequency (TF)
Term Frequency (TF) measures how frequently a term appears in a document, relative to the total number of terms in that document. The higher the frequency, the more relevant the word is assumed to be for that specific document.
\[\text{TF}(t, d) = \frac{\text{Number of occurrences of } t \text{ in } d}{\text{Total number of terms in } d}\]
For example, if the word “NLP” appears 3 times in a document of 100 words, the term frequency for “NLP” is:
\[\text{TF}(NLP, d) = \frac{3}{100} = 0.03\]
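Here is a minimal Python sketch of the same calculation, assuming simple whitespace tokenization (a simplification of real tokenizers):

```python
def term_frequency(term, document):
    """TF(t, d): occurrences of `term` divided by the total number of tokens in `document`."""
    tokens = document.lower().split()  # naive whitespace tokenization
    return tokens.count(term.lower()) / len(tokens)

# 3 occurrences in a 100-word document would give 0.03, as above.
print(term_frequency("NLP", "I love NLP and NLP loves me"))  # 2/7 ≈ 0.286
```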
Inverse Document Frequency (IDF)
Inverse Document Frequency (IDF) downweights common words that appear in many documents and upweights rare words that are more meaningful in specific contexts. The formula is:
\[\text{IDF}(t) = \log\left(\frac{N}{1 + \text{DF}(t)}\right)\]
Where:
- \(N\) is the total number of documents in the corpus.
- \(\text{DF}(t)\) is the number of documents that contain the term \(t\).
The “+1” in the denominator smooths the score and avoids division by zero for terms that appear in no documents. Words that appear in many documents (e.g., “is”, “the”) will have a lower IDF score, while rare terms will have higher IDF scores.
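A matching Python sketch, assuming the base-10 logarithm that the worked example below uses (the two-document corpus here is the one from that example):

```python
import math

def inverse_document_frequency(term, documents):
    """IDF(t) = log10(N / (1 + DF(t))), where DF(t) counts documents containing t."""
    df = sum(1 for doc in documents if term.lower() in doc.lower().split())
    return math.log10(len(documents) / (1 + df))

corpus = [
    "I love NLP and NLP loves me",
    "NLP is great and I enjoy learning NLP",
]
print(inverse_document_frequency("NLP", corpus))  # log10(2/3) ≈ -0.176
```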
Example
Let’s take an example with two documents:
- Document 1: “I love NLP and NLP loves me”
- Document 2: “NLP is great and I enjoy learning NLP”
The combined vocabulary is: “I”, “love”, “NLP”, “and”, “loves”, “me”, “is”, “great”, “enjoy”, “learning”.
For simplicity, let’s calculate the TF and IDF for the term “NLP”.
- TF for “NLP” in Document 1: The term “NLP” appears twice in Document 1, which has 7 words total, so: \[\text{TF}(NLP, d_1) = \frac{2}{7} \approx 0.286\]
- TF for “NLP” in Document 2: The term “NLP” appears twice in Document 2, which has 8 words total, so: \[\text{TF}(NLP, d_2) = \frac{2}{8} = 0.25\]
Now, let’s calculate the IDF for “NLP”. Since “NLP” appears in both documents (2 out of 2 documents), the IDF is:
\[\text{IDF}(NLP) = \log\left(\frac{2}{1 + 2}\right) = \log\left(\frac{2}{3}\right) \approx -0.176\]
The negative value here shows that “NLP” is a very common term in this corpus, and its weight will be downscaled.
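Putting the two pieces together, the TF-IDF score for “NLP” in each document follows directly from the values above:
\[\text{TF-IDF}(NLP, d_1) = 0.286 \times (-0.176) \approx -0.050\]
\[\text{TF-IDF}(NLP, d_2) = 0.25 \times (-0.176) \approx -0.044\]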
Code Example: TF-IDF with TfidfVectorizer
Now let’s use TfidfVectorizer from sklearn to automatically calculate TF-IDF scores for our documents.
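Below is a minimal sketch, assuming scikit-learn is installed. Note that TfidfVectorizer uses a slightly different, smoothed IDF by default (\(\ln\frac{1 + N}{1 + \text{DF}(t)} + 1\)) and L2-normalizes each row, so its scores stay non-negative and won’t exactly match the hand calculation above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "I love NLP and NLP loves me",
    "NLP is great and I enjoy learning NLP",
]

# Fit on the corpus and transform each document into a row of TF-IDF scores.
# Note: the default tokenizer lowercases the text and drops single-character
# tokens such as "I".
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(tfidf_matrix.toarray())              # one row of TF-IDF scores per document
```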
Each row in the resulting matrix corresponds to a document, and each column corresponds to a term in the vocabulary. The values represent the TF-IDF score of each term for each document.
Advantages of TF-IDF
- Balances Frequency: TF-IDF considers both how frequently a word appears in a document (term frequency) and how unique or common it is across all documents (inverse document frequency). This helps prioritize meaningful words.
- Reduces Impact of Stop Words: By downweighting terms that appear in many documents, TF-IDF naturally handles common stop words without needing to remove them manually.
- Efficient for Large Corpora: TF-IDF is computationally efficient and scales well to large datasets.
Limitations of TF-IDF
While TF-IDF is a significant improvement over simple Bag-of-Words, it still has some limitations:
- No Semantic Meaning: Like Bag-of-Words, TF-IDF treats words as independent features and doesn’t capture the relationships or meaning between them.
- Sparse Representations: Even with the IDF weighting, TF-IDF still generates high-dimensional and sparse vectors, especially for large vocabularies.
- Ignores Word Order: TF-IDF doesn’t account for word order, so sentences with the same words in different arrangements will have the same representation.
Conclusion
TF-IDF is a powerful and widely used method for text representation, especially in tasks like document retrieval and search engines, where distinguishing between important and common words is crucial. However, as we’ve seen, TF-IDF doesn’t capture the meaning or relationships between words, which is where word embeddings come into play.