
Understanding TF-IDF in NLP

Introduction

In our previous post, we introduced One-Hot Encoding and the Bag-of-Words (BoW) model, which are simple methods of representing text as numerical vectors. While these techniques are foundational, they come with certain limitations. One major drawback of Bag-of-Words is that it treats all words equally—common words like “the” or “is” are given the same importance as more meaningful words like “science” or “NLP.”

TF-IDF (Term Frequency-Inverse Document Frequency) is an extension of BoW that aims to address this problem. By weighting words based on their frequency in individual documents versus the entire corpus, TF-IDF highlights more important words and reduces the impact of common, less meaningful ones.

TF-IDF

TF-IDF stands for Term Frequency-Inverse Document Frequency. It’s a numerical statistic used to reflect the importance of a word in a document relative to a collection of documents (a corpus). The formula is:

\[\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)\]

Where:

  • \(\text{TF}(t, d)\): Term Frequency of term \(t\) in document \(d\), which is the number of times \(t\) appears in \(d\).
  • \(\text{IDF}(t)\): Inverse Document Frequency, which measures how important \(t\) is across the entire corpus.

Term Frequency (TF)

Term Frequency (TF) measures how frequently a term appears in a document, normalized by the total number of terms in that document. The higher the relative frequency, the more relevant the word is assumed to be for that specific document.

\[\text{TF}(t, d) = \frac{\text{Number of occurrences of } t \text{ in } d}{\text{Total number of terms in } d}\]

For example, if the word “NLP” appears 3 times in a document of 100 words, the term frequency for “NLP” is:

\[\text{TF}(NLP, d) = \frac{3}{100} = 0.03\]

Inverse Document Frequency (IDF)

Inverse Document Frequency (IDF) downweights common words that appear in many documents and upweights rare words that are more meaningful in specific contexts. The formula is:

\[\text{IDF}(t) = \log\left(\frac{N}{1 + \text{DF}(t)}\right)\]

Where:

  • \(N\) is the total number of documents in the corpus.
  • \(\text{DF}(t)\) is the number of documents that contain the term \(t\).

The “+1” in the denominator is there to avoid division by zero. Words that appear in many documents (e.g., “is”, “the”) will have a lower IDF score, while rare terms will have higher IDF scores.

Example

Let’s take an example with two documents:

  1. Document 1: “I love NLP and NLP loves me”
  2. Document 2: “NLP is great and I enjoy learning NLP”

The combined vocabulary is:

["I", "love", "NLP", "and", "loves", "me", "is", "great", "enjoy", "learning"]

For simplicity, let’s calculate the TF and IDF for the term “NLP”.

  • TF for “NLP” in Document 1: The term “NLP” appears twice in Document 1, which has 7 words total, so:
\[\text{TF}(NLP, d_1) = \frac{2}{7} \approx 0.286\]
  • TF for “NLP” in Document 2: The term “NLP” appears twice in Document 2, which has 8 words total, so:
\[\text{TF}(NLP, d_2) = \frac{2}{8} = 0.25\]

Now, let’s calculate the IDF for “NLP” (using a base-10 logarithm). Since “NLP” appears in both documents (2 out of 2 documents), the IDF is:

\[\text{IDF}(NLP) = \log\left(\frac{2}{1 + 2}\right) = \log\left(\frac{2}{3}\right) \approx -0.176\]

The negative value here shows that “NLP” is a very common term in this corpus, and its weight will be downscaled.
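
We can check this hand calculation with a few lines of Python. This is just a sketch of the formulas above, using a base-10 logarithm (which is what gives the -0.176 figure); it is not how TfidfVectorizer computes its scores internally:

import math

docs = [
    "I love NLP and NLP loves me".lower().split(),
    "NLP is great and I enjoy learning NLP".lower().split(),
]

term = "nlp"
N = len(docs)
df = sum(term in doc for doc in docs)
idf = math.log10(N / (1 + df))

for i, doc in enumerate(docs, start=1):
    tf = doc.count(term) / len(doc)
    print(f"Document {i}: TF = {tf:.3f}, IDF = {idf:.3f}, TF-IDF = {tf * idf:.3f}")

This prints:

Document 1: TF = 0.286, IDF = -0.176, TF-IDF = -0.050
Document 2: TF = 0.250, IDF = -0.176, TF-IDF = -0.044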

Code Example: TF-IDF with TfidfVectorizer

Now let’s use TfidfVectorizer from sklearn to automatically calculate TF-IDF scores for our documents.

from sklearn.feature_extraction.text import TfidfVectorizer

# Define a corpus of documents
corpus = [
    "I love NLP and NLP loves me",
    "NLP is great and I enjoy learning NLP"
]

# Initialize the TF-IDF vectorizer
vectorizer = TfidfVectorizer()
# Fit and transform the corpus into TF-IDF vectors
X = vectorizer.fit_transform(corpus)

# Display the feature names (vocabulary)
print(vectorizer.get_feature_names_out())

# Display the TF-IDF matrix
print(X.toarray())

The output of this is:

['and' 'enjoy' 'great' 'is' 'learning' 'love' 'loves' 'me' 'nlp']
[[0.30253071 0.         0.         0.         0.         0.42519636 0.42519636 0.42519636 0.60506143]
 [0.27840869 0.39129369 0.39129369 0.39129369 0.39129369 0.         0.         0.         0.55681737]]

Each row in the output corresponds to a document, and each column corresponds to a term in the vocabulary. The values represent the TF-IDF score of each term for each document.
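
Note that these numbers don’t match the textbook formula above exactly. By default, TfidfVectorizer uses a smoothed IDF, \(\ln\left(\frac{1 + N}{1 + \text{DF}(t)}\right) + 1\), and then L2-normalizes each document vector. As a rough check (a sketch assuming those defaults), we can reproduce the score for “nlp” in the first document:

import numpy as np

N = 2  # number of documents in the corpus

def idf(df):
    # sklearn's default smoothed IDF
    return np.log((1 + N) / (1 + df)) + 1

# Raw term counts and document frequencies for document 1
# ("I" is dropped by the default token pattern)
counts = {"and": 1, "love": 1, "loves": 1, "me": 1, "nlp": 2}
dfs    = {"and": 2, "love": 1, "loves": 1, "me": 1, "nlp": 2}

weights = {t: c * idf(dfs[t]) for t, c in counts.items()}
norm = np.sqrt(sum(w ** 2 for w in weights.values()))
print(weights["nlp"] / norm)  # ~0.605, matching the first row above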

Advantages of TF-IDF

  1. Balances Frequency: TF-IDF considers both how frequently a word appears in a document (term frequency) and how unique or common it is across all documents (inverse document frequency). This helps prioritize meaningful words.
  2. Reduces Impact of Stop Words: By downweighting terms that appear in many documents, TF-IDF naturally handles common stop words without needing to remove them manually.
  3. Efficient for Large Corpora: TF-IDF is computationally efficient and scales well to large datasets.

Limitations of TF-IDF

While TF-IDF is a significant improvement over simple Bag-of-Words, it still has some limitations:

  1. No Semantic Meaning: Like Bag-of-Words, TF-IDF treats words as independent features and doesn’t capture the relationships or meaning between them.
  2. Sparse Representations: Even with the IDF weighting, TF-IDF still generates high-dimensional and sparse vectors, especially for large vocabularies.
  3. Ignores Word Order: TF-IDF doesn’t account for word order, so sentences with the same words in different arrangements will have the same representation.

Conclusion

TF-IDF is a powerful and widely-used method for text representation, especially in tasks like document retrieval and search engines, where distinguishing between important and common words is crucial. However, as we’ve seen, TF-IDF doesn’t capture the meaning or relationships between words, which is where word embeddings come into play.

Turning Words into Vectors

Introduction

In our previous post, we covered the preprocessing steps necessary to convert text into a machine-readable format, like tokenization and stop word removal. But once the text is preprocessed, how do we represent it for use in machine learning models?

Before the rise of word embeddings, simpler techniques were commonly used to represent text as vectors. Today, we’ll explore two foundational techniques: One-Hot Encoding and Bag-of-Words (BoW). These methods don’t capture the semantic meaning of words as well as modern embeddings do, but they’re essential for understanding the evolution of Natural Language Processing (NLP).

One-Hot Encoding

One of the simplest ways to represent text is through One-Hot Encoding. In this approach, each word in a vocabulary is represented as a vector where all the elements are zero, except for a single element that corresponds to the word’s index.

Let’s take a small vocabulary:

["I", "love", "NLP"]

The vocabulary size is 3, and each word will be represented by a 3-dimensional vector:

I -> [1, 0, 0] 
love -> [0, 1, 0] 
NLP -> [0, 0, 1]

Each word is “hot” (1) in one specific position, while “cold” (0) everywhere else.

Example

Let’s generate one-hot encoded vectors using Python:

from sklearn.preprocessing import OneHotEncoder
import numpy as np

# Define a small vocabulary
vocab = ["I", "love", "NLP"]
# Reshape the data for OneHotEncoder
vocab_reshaped = np.array(vocab).reshape(-1, 1)

# Initialize the OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)
# Fit and transform the vocabulary
onehot_encoded = encoder.fit_transform(vocab_reshaped)
print(onehot_encoded)

The output shows one row per input word. Note that OneHotEncoder orders its columns by sorted category (“I”, “NLP”, “love”), which is why “love” maps to the third column and “NLP” to the second:

[[1. 0. 0.]
 [0. 0. 1.]
 [0. 1. 0.]]

Each word has a unique binary vector representing its position in the vocabulary.

Drawbacks of One-Hot Encoding

One-Hot Encoding is simple but comes with some limitations:

  • High Dimensionality: For large vocabularies, the vectors become huge, leading to a “curse of dimensionality”.
  • Lack of Semantic Information: One-Hot vectors don’t capture any relationships between words. “love” and “like” would have completely different vectors, even though they are semantically similar.

Bag-of-Words (BoW)

One-Hot Encoding represents individual words, but what about whole documents or sentences? That’s where the Bag-of-Words (BoW) model comes in. In BoW, the text is represented as a vector of word frequencies.

BoW counts how often each word from a given vocabulary appears in the document, without considering the order of words (hence, a “bag” of words).

Let’s take two example sentences:

  1. “I love NLP”
  2. “NLP is amazing”

The combined vocabulary for these two sentences is:

["I", "love", "NLP", "is", "amazing"]

Now, using BoW, we represent each sentence as a vector of word counts:

  1. “I love NLP” -> [1, 1, 1, 0, 0] (since “I”, “love”, and “NLP” appear once, and “is” and “amazing” don’t appear)
  2. “NLP is amazing” -> [0, 0, 1, 1, 1] (since “NLP”, “is”, and “amazing” appear once, and “I” and “love” don’t appear)

Example

We can use CountVectorizer from the sklearn library to easily apply Bag-of-Words to a corpus of text:

from sklearn.feature_extraction.text import CountVectorizer

# Define a set of documents
corpus = [
    "I love NLP",
    "NLP is amazing"
]

# Initialize the CountVectorizer
vectorizer = CountVectorizer()
# Transform the corpus into BoW vectors
X = vectorizer.fit_transform(corpus)

# Display the feature names (the vocabulary)
print(vectorizer.get_feature_names_out())

# Display the BoW representation
print(X.toarray())

The output of which looks like this. Notice that “I” is missing from the vocabulary: CountVectorizer lowercases the text and its default token pattern only keeps tokens of two or more characters:

['amazing' 'is' 'love' 'nlp']
[[0 0 1 1]
 [1 1 0 1]]

Limitations of Bag-of-Words

While BoW is a simple and powerful method, it too has its drawbacks:

  1. Sparsity: Like One-Hot Encoding, BoW produces high-dimensional and sparse vectors, especially for large vocabularies.
  2. No Word Order: BoW ignores word order. The sentence “I love NLP” is treated the same as “NLP love I”, which may not always make sense.
  3. No Semantic Relationships: Just like One-Hot Encoding, BoW doesn’t capture the meaning or relationships between words. All words are treated as independent features.

Conclusion

Both One-Hot Encoding and Bag-of-Words are simple and effective ways to represent text as numbers, but they have significant limitations, particularly in capturing semantic relationships and dealing with large vocabularies.

These methods laid the groundwork for more sophisticated representations like TF-IDF (which we’ll cover next) and eventually led to word embeddings, which capture the meaning and context of words more effectively.

Computers Understanding Text

Introduction

When working with Natural Language Processing (NLP), one of the first challenges you encounter is how to convert human-readable text into a format that machines can understand. Computers don’t natively understand words or sentences; they operate on numbers.

So, how do we get from words to something a machine can process?

This is where text preprocessing comes in.

Text preprocessing involves several steps to prepare raw text for analysis. In this post, we’ll walk through the foundational techniques in preprocessing: tokenization, lowercasing, removing stop words, and stemming/lemmatization. These steps ensure that our text is in a clean, structured format for further processing like word embeddings or more complex NLP models.

Tokenization: Breaking Down the Text

What is Tokenization?

Tokenization is the process of breaking a string of text into smaller pieces, usually words or subwords. In essence, it’s the process of splitting sentences into tokens, which are the basic units for further NLP tasks.

For example, consider the sentence:

"I love NLP!"

Tokenization would break this into:

["I", "love", "NLP", "!"]

This is a simple example where each token corresponds to a word or punctuation. However, tokenization can get more complex depending on the language and the task. For instance, some tokenizers split contractions like “can’t” into ["can", "'t"], while others might treat it as one token. Tokenization also becomes more challenging in languages that don’t have spaces between words, like Chinese or Japanese.
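
As a quick illustration of this, NLTK’s default word tokenizer (which follows Penn Treebank conventions) makes its own choice here, splitting “can’t” into “ca” and “n’t” (this assumes the punkt models are downloaded, as in the next example):

from nltk.tokenize import word_tokenize

print(word_tokenize("I can't wait!"))
# ['I', 'ca', "n't", 'wait', '!']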

Code Example: Tokenization in Python

Let’s look at a basic example of tokenization using Python’s nltk library:

import nltk
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize

sentence = "I love NLP!"
tokens = word_tokenize(sentence)
print(tokens)

The output is simply:

['I', 'love', 'NLP', '!']

Code Example: Sentence Tokenization

Tokenization can also occur at the sentence level, which means breaking down a paragraph or a larger body of text into individual sentences. This is helpful for tasks like summarization or sentiment analysis, where sentence boundaries matter.

from nltk.tokenize import sent_tokenize

text = "NLP is fun. It's amazing how machines can understand text!"
sentences = sent_tokenize(text)
print(sentences)

The output is now split on sentence boundaries:

['NLP is fun.', "It's amazing how machines can understand text!"]

Lowercasing: Making Text Uniform

In English, the words “Dog” and “dog” mean the same thing, but to a computer, they are two different tokens. Lowercasing is a simple yet powerful step in text preprocessing. By converting everything to lowercase, we reduce the complexity of the text and ensure that words like “NLP” and “nlp” are treated identically.

We can achieve this simply with the .lower() method on a string, as the short example below shows.

This step becomes crucial when dealing with large text corpora, as it avoids treating different capitalizations of the same word as distinct entities.
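
Here’s a quick illustration using a small, made-up list of tokens:

tokens = ["NLP", "Dog", "dog", "nlp"]
print([token.lower() for token in tokens])
# ['nlp', 'dog', 'dog', 'nlp']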

Removing Stop Words: Filtering Out Common Words

Stop words are commonly used words that don’t carry significant meaning in many tasks, such as “and”, “the”, and “is”. Removing stop words helps reduce noise in the data and improves the efficiency of downstream models by focusing only on the meaningful parts of the text.

Many libraries provide lists of stop words, but the ideal list can vary depending on the task.

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
tokens = ["I", "love", "NLP", "!"]
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
print(filtered_tokens)

Only the meaningful words remain; the stop word “I” has been removed:

['love', 'NLP', '!']

Stemming and Lemmatization: Reducing Words to Their Root Forms

Another key step in preprocessing is reducing words to their base or root form. There are two common approaches:

Stemming: This cuts off word endings to get to the base form, which can sometimes be rough. For example, “running” is reduced to “run”, while irregular or derived forms like “ran” and “runner” are often left untouched.

Lemmatization: This is a more refined process that looks at the word’s context and reduces it to its dictionary form. For instance, “better” would be lemmatized to “good”.

Here’s an example using nltk for both stemming and lemmatization:

from nltk.stem import PorterStemmer

ps = PorterStemmer()
words = ["running", "runner", "ran"]
stemmed_words = [ps.stem(word) for word in words]
print(stemmed_words)

The output shows how rough this can be; only “running” is actually reduced to its stem:

['run', 'runner', 'ran']

An example of lemmatization looks like this (note that we pass a part-of-speech tag for each word, since “better” only lemmatizes to “good” when treated as an adjective):

from nltk.stem import WordNetLemmatizer
import nltk
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
# Pair each word with its part of speech: "v" for verb, "a" for adjective
words = [("running", "v"), ("better", "a"), ("ran", "v")]
lemmatized_words = [lemmatizer.lemmatize(word, pos=pos) for word, pos in words]
print(lemmatized_words)

The output:

['run', 'good', 'run']

Conclusion

Text preprocessing is a crucial first step in any NLP project. By breaking down text through tokenization, making it uniform with lowercasing, and reducing unnecessary noise with stop word removal and stemming/lemmatization, we can create a clean and structured dataset for further analysis or model training. These steps form the foundation upon which more advanced techniques, such as word embeddings and machine learning models, are built.

Word Embeddings

Introduction

Word embeddings are one of the most significant advancements in natural language processing (NLP). They allow us to transform words or sentences into vectors, where each word is represented by a point in a high-dimensional space. The core idea is that words with similar meanings are close to each other in this space, making it possible to use mathematical operations on these vectors to uncover relationships between words.

In this post, we’ll explore how to create word embeddings using a pre-trained model, and we’ll perform various vector operations to see how these embeddings capture semantic relationships. We’ll cover examples like analogy generation, word similarity, and how these embeddings can be leveraged for search tasks.

What Are Word Embeddings?

Word embeddings are dense vector representations of words, where each word is mapped to a point in a continuous vector space. Unlike older techniques (such as one-hot encoding) that give each word a unique identifier, embeddings represent words in a way that captures semantic relationships, such as similarity and analogy.

For example, embeddings can represent the relationship:

king - man + woman = queen

This is made possible because words that are semantically similar (e.g., “king” and “queen”) have vector representations that are close together in space, while words that are opposites (e.g., “good” and “bad”) may have vectors pointing in opposite directions.

Gensim

Let’s begin by loading a pre-trained word embedding model. We’ll use the glove-wiki-gigaword-50 model, which provides 50-dimensional vectors for many common words.

import gensim.downloader as api

# Load the pre-trained GloVe Word2Vec model
model = api.load("glove-wiki-gigaword-50")

This might take a moment to download. It’s not too big.

Now that we have the model, let’s try converting some words into vectors.

Converting Words to Vectors

We can take individual words and get their vector representations. Let’s look at the vectors for “king,” “queen,” “man,” and “woman.”

# Example words
word1 = "king"
word2 = "queen"
word3 = "man"
word4 = "woman"

# Get the vectors for each word
vector_king = model[word1]
vector_queen = model[word2]
vector_man = model[word3]
vector_woman = model[word4]

# Print the vector for 'king'
print(f"Vector for '{word1}':\n{vector_king}")

You’ll see that each word is represented as a 50-dimensional vector. These vectors capture the meanings of the words in such a way that we can manipulate them mathematically.

Vector for 'king':
[ 0.50451   0.68607  -0.59517  -0.022801  0.60046  -0.13498  -0.08813
0.47377  -0.61798  -0.31012  -0.076666  1.493    -0.034189 -0.98173
0.68229   0.81722  -0.51874  -0.31503  -0.55809   0.66421   0.1961
-0.13495  -0.11476  -0.30344   0.41177  -2.223    -1.0756   -1.0783
-0.34354   0.33505   1.9927   -0.04234  -0.64319   0.71125   0.49159
0.16754   0.34344  -0.25663  -0.8523    0.1661    0.40102   1.1685
-1.0137   -0.21585  -0.15155   0.78321  -0.91241  -1.6106   -0.64426
-0.51042 ]

Performing Vector Arithmetic

One of the most famous examples of vector arithmetic in word embeddings is the analogy:

king - man + woman = queen

We can perform this operation by subtracting the vector for “man” from “king” and then adding the vector for “woman.” Let’s try this and see what word is closest to the resulting vector.

# Perform vector arithmetic
result_vector = vector_king - vector_man + vector_woman

# Find the closest word to the resulting vector
similar_words = model.similar_by_vector(result_vector, topn=3)

# Print the result
print("Result of 'king - man + woman':", similar_words)

You should find that “queen” appears among the words closest to the resulting vector (the original word “king” itself will typically rank at or near the top, since nothing excludes the input words), demonstrating that the model captures the gender relationship between “king” and “queen.”
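
In practice, gensim’s most_similar method is the more convenient way to run this analogy: it performs the same arithmetic but excludes the input words from the results, so “queen” should come out on top.

# Let gensim handle the arithmetic; the input words are excluded from the results
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=3))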

Measuring Word Similarity with Cosine Similarity

Another key operation you can perform on word embeddings is measuring the similarity between two words. The most common way to do this is by calculating the cosine similarity between the two vectors. The cosine similarity between two vectors is defined as:

\[\text{cosine similarity} = \frac{A \cdot B}{\|A\| \|B\|}\]

This returns a value between -1 and 1:

  • 1 means the vectors are identical (the words are very similar),
  • 0 means the vectors are orthogonal (unrelated words),
  • -1 means the vectors are pointing in opposite directions (possibly antonyms).

Let’s measure the similarity between related words like “apple” and “fruit,” and compare it to unrelated words like “apple” and “car.”

import numpy as np
from numpy.linalg import norm

# Function to calculate cosine similarity
def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (norm(vec1) * norm(vec2))

# Get vectors for 'apple', 'fruit', and 'car'
vector_apple = model['apple']
vector_fruit = model['fruit']
vector_car = model['car']

# Calculate cosine similarity
similarity_apple_fruit = cosine_similarity(vector_apple, vector_fruit)
similarity_apple_car = cosine_similarity(vector_apple, vector_car)

print(f"Cosine Similarity between 'apple' and 'fruit': {similarity_apple_fruit:.4f}")
print(f"Cosine Similarity between 'apple' and 'car': {similarity_apple_car:.4f}")

You will see that the cosine similarity between “apple” and “fruit” is much higher than that between “apple” and “car,” illustrating the semantic relationship between “apple” and “fruit.”

Cosine Similarity between 'apple' and 'fruit': 0.5918
Cosine Similarity between 'apple' and 'car': 0.3952
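
gensim also exposes this calculation directly through the model’s similarity method, which should return the same values as our hand-rolled function:

print(model.similarity('apple', 'fruit'))
print(model.similarity('apple', 'car'))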

Search Using Word Embeddings

Another powerful use of word embeddings is in search tasks. If you want to find words that are most similar to a given word, you can use the model’s similar_by_word function to retrieve the top N most similar words. Here’s how you can search for words most similar to “apple”:

# Find words most similar to 'apple'
similar_words_to_apple = model.similar_by_word('apple', topn=5)
print("Words most similar to 'apple':", similar_words_to_apple)

You can see here that “apple” is treated in the proper-noun sense, as in the company Apple.

Words most similar to 'apple': [
('blackberry', 0.7543067336082458), 
('chips', 0.7438644170761108), 
('iphone', 0.7429664134979248), 
('microsoft', 0.7334205508232117), 
('ipad', 0.7331036329269409)
]

Each of these words has strong relevance to the company.

Averaging Word Vectors

Another interesting operation is averaging word vectors. This allows us to combine the meaning of two words into a single vector. For instance, we could average the vectors for “apple” and “orange” to get a vector that represents something like “fruit.”

# Average of 'apple' and 'orange'
vector_fruit_avg = (model['apple'] + model['orange']) / 2

# Find the words closest to the average vector
similar_to_fruit_avg = model.similar_by_vector(vector_fruit_avg, topn=5)
print("Words similar to the average of 'apple' and 'orange':", similar_to_fruit_avg)

There are a number of words related to both “apple” and “orange”; the average vector gives us this intersection.

Words similar to the average of 'apple' and 'orange': [
('apple', 0.8868993520736694), 
('orange', 0.8670367002487183), 
('juice', 0.7459520101547241), 
('cherry', 0.7071465849876404), 
('cream', 0.7013142704963684)
]

Conclusion

Word embeddings are a powerful way to represent the meaning of words as vectors in a high-dimensional space. By using simple mathematical operations, such as vector arithmetic and cosine similarity, we can uncover a variety of semantic relationships between words. These operations allow embeddings to be used in tasks such as analogy generation, search, and clustering.

In this post, we explored how to use pre-trained word embeddings, perform vector operations, and leverage them for real-world tasks. These foundational concepts are what power much of the magic behind modern NLP techniques, from search engines to chatbots and more.

Straight lines

Introduction

In mathematics, the straight line equation \(y = mx + c\) is one of the simplest yet most foundational equations in both algebra and geometry. It defines a linear relationship between two variables, \(x\) and \(y\), where \(m\) represents the slope (or gradient) of the line, and \(c\) is the y-intercept, the point where the line crosses the y-axis.

This article explores key concepts related to the straight line equation, interesting properties, and how we can use Haskell to implement some useful functions.

Understanding the Equation

The equation \(y = mx + c\) allows us to describe a straight line in a two-dimensional plane. Here’s a breakdown of its components:

  • \(m\): The slope, which measures how steep the line is. It’s defined as the change in \(y\) divided by the change in \(x\), or \(\frac{\Delta y}{\Delta x}\).
  • \(c\): The y-intercept, which is the value of \(y\) when \(x = 0\).

One of the key properties of this equation is that for every unit increase in \(x\), the value of \(y\) increases by \(m\). We can illustrate this behavior using some Haskell code.

Basic Line Function in Haskell

Let’s implement the basic straight line function in Haskell. This function will take \(m\), \(c\), and \(x\) as inputs and return the corresponding \(y\) value.

lineFunction :: Float -> Float -> Float -> Float
lineFunction m c x = m * x + c

This function calculates \(y\) for any given \(x\) using the slope \(m\) and y-intercept \(c\).
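
A quick GHCi session (using an assumed slope of 2 and intercept of 1) shows the property mentioned earlier: each unit step in \(x\) increases \(y\) by \(m\).

ghci> lineFunction 2 1 0
1.0
ghci> lineFunction 2 1 1
3.0
ghci> lineFunction 2 1 2
5.0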

Parallel and Perpendicular Lines

An interesting aspect of lines is how they relate to each other. If two lines are parallel, they have the same slope. If two lines are perpendicular, the slope of one is the negative reciprocal of the other. In mathematical terms, if one line has a slope \(m_1\), the perpendicular line has a slope of \(-\frac{1}{m_1}\).

We can express this relationship in Haskell using a function to check if two lines are perpendicular.

arePerpendicular :: Float -> Float -> Bool
arePerpendicular m1 m2 = abs (m1 * m2 + 1) < epsilon
  where epsilon = 1e-6 -- avoid exact equality checks on floating-point values

This function takes two slopes and returns True if their product is (within a small tolerance) -1, and False otherwise.
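
For example, a line with slope 2 is perpendicular to a line with slope -0.5:

ghci> arePerpendicular 2 (-0.5)
True
ghci> arePerpendicular 2 3
False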

Finding the Intersection of Two Lines

To find the point where two lines intersect, we need to solve the system of equations:

\[y = m_1 x + c_1\]
\[y = m_2 x + c_2\]

By setting the equations equal to each other, we can solve for \(x\) and then substitute the result into one of the equations to find \(y\). The formula for the intersection point is:

\[x = \frac{c_2 - c_1}{m_1 - m_2}\]

Here’s a Haskell function that calculates the intersection point of two lines:

intersection :: Float -> Float -> Float -> Float -> Maybe (Float, Float)
intersection m1 c1 m2 c2
    | m1 == m2  = Nothing -- Parallel lines never intersect
    | otherwise = let x = (c2 - c1) / (m1 - m2)
                      y = m1 * x + c1
                  in Just (x, y)

This function returns Nothing if the lines are parallel and Just (x, y) if the lines intersect.
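
As a couple of illustrative calls: the lines \(y = x\) and \(y = -x + 4\) cross at \((2, 2)\), while two lines sharing a slope are parallel:

ghci> intersection 1 0 (-1) 4
Just (2.0,2.0)
ghci> intersection 2 3 2 7
Nothing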

Conclusion

The straight line equation \(y = mx + c\) is a simple but powerful tool in both mathematics and programming. We’ve explored how to implement the line equation in Haskell, find parallel and perpendicular lines, and calculate intersection points. Understanding these properties gives you a deeper appreciation of how linear relationships work, both in theory and in practice.

By writing these functions in Haskell, you can model and manipulate straight lines in code, extending these basic principles to more complex problems.