In our previous posts,
we explored traditional text representation techniques like One-Hot Encoding, Bag-of-Words, and TF-IDF, and
we introduced static word embeddings like Word2Vec and GloVe. While these techniques are powerful, they have
limitations, especially when it comes to capturing the context of words.
In this post, we’ll explore more advanced topics that push the boundaries of NLP:
Contextual Word Embeddings like ELMo, BERT, and GPT
Dimensionality Reduction techniques for visualizing embeddings
Applications of Word Embeddings in real-world tasks
Training Custom Word Embeddings on your own data
Let’s dive in!
Contextual Word Embeddings
Traditional embeddings like Word2Vec and GloVe generate a single fixed vector for each word. This means the word “bank”
will have the same vector whether it refers to a “river bank” or a “financial institution,” which is a major limitation
in understanding nuanced meanings in context.
Contextual embeddings, on the other hand, generate different vectors for the same word depending on its context.
These models are based on deep learning architectures and have revolutionized NLP by capturing the dynamic nature of
language.
ELMo (Embeddings from Language Models)
ELMo was one of the first models to introduce the idea of context-dependent word representations. Instead of a fixed
vector, ELMo generates a vector for each word that depends on the entire sentence. It uses bidirectional LSTMs
to achieve this, looking both forward and backward in the text to understand the context.
BERT (Bidirectional Encoder Representations from Transformers)
BERT takes contextual embeddings to the next level using the Transformer architecture. Unlike traditional models,
which process text in one direction (left-to-right or right-to-left), BERT is bidirectional, meaning it looks at all
the words before and after a given word to understand its meaning. BERT also uses pretraining and fine-tuning,
making it one of the most versatile models in NLP.
GPT (Generative Pretrained Transformer)
While GPT is similar to BERT in using the Transformer architecture, it is primarily unidirectional and excels at
generating text. This model has been the backbone for many state-of-the-art systems in tasks like
text generation, summarization, and dialogue systems.
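As a quick illustration, here is a minimal sketch using the Hugging Face transformers pipeline with the publicly available gpt2 checkpoint (an assumption; any causal language model would work the same way):
from transformers import pipeline

# Load a small causal language model and generate a continuation of a prompt
generator = pipeline('text-generation', model='gpt2')
result = generator("Word embeddings are", max_length=20, num_return_sequences=1)
print(result[0]['generated_text'])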
Why Contextual Embeddings Matter
Contextual embeddings are critical in modern NLP applications, such as:
Named Entity Recognition (NER): Contextual models help disambiguate words with multiple meanings.
Machine Translation: These embeddings capture the nuances of language, making translations more accurate.
Question-Answering: Systems like GPT-3 excel in understanding and responding to complex queries by leveraging context.
To experiment with BERT, you can try the transformers library from Hugging Face:
from transformers import BertTokenizer, BertModel

# Load pre-trained model tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenize input text
text = "NLP is amazing!"
encoded_input = tokenizer(text, return_tensors='pt')

# Load pre-trained BERT model
model = BertModel.from_pretrained('bert-base-uncased')

# Get word embeddings from BERT
output = model(**encoded_input)
print(output.last_hidden_state)
The output is a tensor of shape (1, sequence_length, 768): one 768-dimensional contextual embedding for every token in the input sentence.
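To see the "contextual" part in action, here is a small sketch that reuses the tokenizer and model loaded above (the two example sentences are just illustrative) and compares the vectors BERT produces for the word "bank" in different contexts:
import torch
from torch.nn.functional import cosine_similarity

sentences = ["I sat on the river bank", "I deposited money at the bank"]
bank_vectors = []

for sentence in sentences:
    encoded = tokenizer(sentence, return_tensors='pt')
    with torch.no_grad():
        hidden = model(**encoded).last_hidden_state[0]    # (sequence_length, 768)
    tokens = tokenizer.convert_ids_to_tokens(encoded['input_ids'][0])
    bank_vectors.append(hidden[tokens.index('bank')])     # vector for "bank" in this sentence

# The two "bank" vectors differ because their contexts differ
print(cosine_similarity(bank_vectors[0], bank_vectors[1], dim=0).item())
Because the surrounding words differ, the two "bank" vectors are not identical and their cosine similarity sits noticeably below 1.0, which is exactly the behaviour a static embedding like Word2Vec cannot provide.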
Dimensionality Reduction for Visualizing Embeddings
Word embeddings are usually represented as high-dimensional vectors (e.g., 300 dimensions for Word2Vec). While this is
great for models, it’s difficult for humans to interpret these vectors directly. This is where dimensionality reduction
techniques like PCA and t-SNE come in handy.
Principal Component Analysis (PCA)
PCA reduces the dimensions of the word vectors while preserving the most important information. It helps us visualize
clusters of similar words in a lower-dimensional space (e.g., 2D or 3D).
Following on from the previous example, we’ll use the simple embeddings that we’ve generated in the output variable.
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# . . .

embeddings = output.last_hidden_state.detach().numpy()

# Reduce embeddings dimensionality with PCA
# The embeddings are 3D (1, sequence_length, hidden_size), so we flatten the first two dimensions
embeddings_2d = embeddings[0]  # Remove the batch dimension, now (sequence_length, hidden_size)

# Apply PCA
pca = PCA(n_components=2)
reduced_embeddings = pca.fit_transform(embeddings_2d)

# Visualize the first two principal components of each token's embedding
plt.figure(figsize=(8, 6))
plt.scatter(reduced_embeddings[:, 0], reduced_embeddings[:, 1])

# Add labels for each token
tokens = tokenizer.convert_ids_to_tokens(encoded_input['input_ids'][0])
for i, token in enumerate(tokens):
    plt.annotate(token, (reduced_embeddings[i, 0], reduced_embeddings[i, 1]))

plt.title('PCA of BERT Token Embeddings')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.savefig('bert_token_embeddings_pca.png', format='png')
You should see a plot similar to this:
This is a scatter plot in which the 768 dimensions of each token embedding have been reduced to two principal components using Principal Component Analysis (PCA), allowing us to plot the tokens in two-dimensional space.
Some observations when looking at this chart:
Special Tokens [CLS] and [SEP]
These special tokens are essential in BERT. The [CLS] token is typically used as a summary representation for the
entire sentence (especially in classification tasks), and the [SEP] token is used to separate sentences or indicate
the end of a sentence.
In the plot, you can see [CLS] and [SEP] are far apart from other tokens, especially [SEP], which has a distinct
position in the vector space. This makes sense since their roles are unique compared to actual word tokens like “amazing”
or “is.”
Subword Tokens
Notice the token labeled ##p. This represents a subword. BERT uses a WordPiece tokenization algorithm, which
breaks rare or complex words into subword units. In this case, “NLP” has been split into nl and ##p because BERT
doesn’t have “NLP” as a whole word in its vocabulary. The fact that nl and ##p are close together in the plot
indicates that BERT keeps semantically related parts of the same word close in the vector space.
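You can verify this split directly with the tokenizer loaded earlier:
# Inspect the WordPiece tokenization of the input sentence
print(tokenizer.tokenize("NLP is amazing!"))
# ['nl', '##p', 'is', 'amazing', '!']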
Contextual Similarity
The tokens “amazing” and “is” are relatively close to each other, which reflects that they are part of the same sentence
and share a contextual relationship. Interestingly, “amazing” is a bit more isolated, which could be because it’s a more
distinctive word with a strong meaning, whereas “is” is a more common auxiliary verb and closer to other less distinctive
tokens.
Distribution and Separation
The distance between tokens shows how BERT separates different tokens in the vector space based on their contextual
meaning. For example, [SEP] is far from the other tokens because it serves a very different role in the sentence.
The overall spread of the tokens suggests that BERT embeddings can clearly distinguish between different word types
(subwords, regular words, and special tokens).
t-SNE (t-Distributed Stochastic Neighbor Embedding)
t-SNE is another popular technique for visualizing high-dimensional data. It captures both local and global
structures of the embeddings and is often used to visualize word clusters based on their semantic similarity.
I’ve continued on from the code that we’ve been using:
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

embeddings = output.last_hidden_state.detach().numpy()[0]

# use TSNE here
tsne = TSNE(n_components=2, random_state=42, perplexity=1)
reduced_embeddings = tsne.fit_transform(embeddings)

# Plot the t-SNE reduced embeddings
plt.figure(figsize=(8, 6))
plt.scatter(reduced_embeddings[:, 0], reduced_embeddings[:, 1])

# Add labels for each token
for i, token in enumerate(tokens):
    plt.annotate(token, (reduced_embeddings[i, 0], reduced_embeddings[i, 1]))

plt.title('t-SNE of BERT Token Embeddings')
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')

# Save the plot to a file
plt.savefig('bert_token_embeddings_tsne.png', format='png')
The resulting plot looks a little different from the PCA version: the embeddings end up with a noticeably different distribution across the two dimensions.
Real-World Applications of Word Embeddings
Word embeddings are foundational in numerous NLP applications:
Semantic Search: Embeddings allow search engines to find documents based on meaning rather than exact keyword matches.
Sentiment Analysis: Embeddings can capture the sentiment of text, enabling models to predict whether a review is positive or negative.
Machine Translation: By representing words from different languages in the same space, embeddings improve the accuracy of machine translation systems.
Question-Answering Systems: Modern systems like GPT-3 use embeddings to understand and respond to natural language queries.
Example: Semantic Search with Word Embeddings
In a semantic search engine, user queries and documents are both represented as vectors in the same embedding space. By
calculating the cosine similarity between these vectors, we can retrieve documents that are semantically related to the
query.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Simulate a query embedding (1D vector of size 768, similar to BERT output)
query_embedding = np.random.rand(1, 768)      # Shape: (1, 768)

# Simulate a set of 5 document embeddings (5 documents, each with a 768-dimensional vector)
document_embeddings = np.random.rand(5, 768)  # Shape: (5, 768)

# Compute cosine similarity between the query and the documents
similarities = cosine_similarity(query_embedding, document_embeddings)  # Shape: (1, 5)

# Rank documents by similarity (higher similarity first)
ranked_indices = similarities.argsort()[0][::-1]  # Sort in descending order

print("Ranked document indices (most similar to least similar):", ranked_indices)

# If you want to print the similarity scores as well
print("Similarity scores:", similarities[0][ranked_indices])
Walking through this code:
query_embedding and document_embeddings
We generate random vectors to simulate the embeddings. In a real use case, these would come from an embedding model
(e.g., BERT, Word2Vec). The query_embedding represents the vector for the user’s query, and document_embeddings
represents vectors for a set of documents.
Both query_embedding and document_embeddings must have the same dimensionality (e.g., 768 if you’re using BERT).
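As a hedged sketch of what that real embedding step could look like, here is one way to produce a 768-dimensional vector with the BERT tokenizer and model loaded earlier, by mean-pooling the last hidden state (the pooling strategy is an assumption; using the [CLS] vector is another common choice):
def embed(text):
    # Encode the text and average BERT's token embeddings into a single vector
    encoded = tokenizer(text, return_tensors='pt')
    hidden = model(**encoded).last_hidden_state  # (1, sequence_length, 768)
    return hidden.mean(dim=1).detach().numpy()   # (1, 768)

query_embedding = embed("What is NLP?")
print(query_embedding.shape)  # (1, 768)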
Cosine Similarity
The cosine_similarity() function computes the cosine similarity between the query_embedding and each document embedding.
Cosine similarity measures the cosine of the angle between two vectors, which ranges from -1 (completely dissimilar) to 1 (completely similar). In this case, we’re interested in documents that are most similar to the query (values close to 1).
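Since cosine similarity is simply the dot product of two vectors divided by the product of their norms, you can check one of the scores by hand:
import numpy as np

a, b = query_embedding[0], document_embeddings[0]
manual = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(manual, similarities[0][0])  # the two values should match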
Ranking the Documents
We use argsort() to get the indices of the document embeddings sorted in ascending order of similarity.
The [::-1] reverses this order so that the most similar documents appear first.
The ranked_indices gives the document indices, ranked from most similar to least similar to the query.
Because the embeddings are generated randomly, your exact numbers will differ, but the output looks something like this:
Ranked document indices (most similar to least similar): [3 4 1 2 0]
Similarity scores: [0.76979867 0.7686247 0.75195574 0.74263041 0.72975817]
Training Your Own Word Embeddings
While pretrained embeddings like Word2Vec and BERT are incredibly powerful, sometimes you need embeddings that are
fine-tuned to your specific domain or dataset. You can train your own embeddings using frameworks like Gensim for
Word2Vec or PyTorch for more complex models.
The following code shows training Word2Vec with Gensim:
from gensim.models import Word2Vec
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt_tab')

# Example sentences
sentences = [
    word_tokenize("I love NLP"),
    word_tokenize("NLP is amazing"),
    word_tokenize("Word embeddings are cool")
]

# Train Word2Vec model
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Get vector for a word
vector = model.wv['NLP']
print(vector)
The output here is a 100-dimensional vector that represents the word NLP.
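If you later gather more domain-specific text, you can continue training the same model rather than starting over. A minimal sketch (the extra sentence is just a placeholder):
# Extend the vocabulary with new sentences and keep training the existing model
new_sentences = [word_tokenize("Custom embeddings capture domain specific vocabulary")]
model.build_vocab(new_sentences, update=True)
model.train(new_sentences, total_examples=len(new_sentences), epochs=model.epochs)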
You can also fine-tune BERT or other transformer models on your own dataset. This is useful when you need embeddings
that are tailored to a specific domain, such as medical or legal text.
Conclusion
Word embeddings have come a long way, from static models like Word2Vec and GloVe to dynamic, context-aware
models like BERT and GPT. These techniques have revolutionized how we represent and process language in NLP.
Alongside dimensionality reduction for visualization, applications such as semantic search, sentiment analysis, and
custom embeddings training open up a world of possibilities.
In a previous post, I detailed a double-buffering
implementation written in C. The idea behind double buffering is to draw graphics off-screen, then quickly swap
(or “flip”) this off-screen buffer with the visible screen. This technique reduces flickering and provides smoother
rendering. While the C implementation was relatively straightforward using GDI functions, I decided to challenge myself
by recreating it in assembly language using MASM32.
There are some slight differences that I’ll go through.
First up, this module defines some macros that are just helpful blocks of reusable code.
szText defines a string inline
m2m performs value assignment from one memory location to another
return is a simple analog for the return keyword in C
rgb encodes 8-bit RGB components into the eax register
; Defines strings in an ad-hoc fashion
szText MACRO Name, Text:VARARG
    LOCAL lbl
    jmp lbl
    Name db Text, 0
    lbl:
ENDM

; Assigns a value from a memory location into another memory location
m2m MACRO M1, M2
    push M2
    pop M1
ENDM

; Syntax sugar for returning from a PROC
return MACRO arg
    mov eax, arg
    ret
ENDM

rgb MACRO r, g, b
    xor eax, eax
    mov ah, b
    mov al, g
    rol eax, 8
    mov al, r
ENDM
Setup
The setup is very much like its C counterpart: a window class is registered first, and then the window is created.
WM_PAINT only needs to worry about drawing the backbuffer to the window.
PaintMessage:
    invoke FlipBackBuffer, hWin
    mov eax, 1
    ret
Handling the buffer
The routine that handles the back buffer construction is called RecreateBackBuffer. It cleans up any existing buffer before creating a new one, saving the program from memory leaks.
This is just a different take on the same application written in C. Some of the control structures in assembly language
can seem a little hard to follow, but there is something elegant about their simplicity.
In this post, we’ll walk through fundamental data structures and sorting algorithms, using Python to demonstrate key
concepts and code implementations. We’ll also discuss the algorithmic complexity of various operations like searching,
inserting, and deleting, as well as the best, average, and worst-case complexities of popular sorting algorithms.
Algorithmic Complexity
When working with data structures and algorithms, it’s crucial to consider how efficiently they perform under different
conditions. This is where algorithmic complexity comes into play. It helps us measure how the time or space an
algorithm uses grows as the input size increases.
Time Complexity
Time complexity refers to the amount of time an algorithm takes to complete, usually expressed as a function of the
size of the input, \(n\). We typically use Big-O notation to describe the worst-case scenario. The goal is to
approximate how the time increases as the input size grows.
Common Time Complexities:
\(O(1)\) (Constant Time): The runtime does not depend on the size of the input. For example, accessing an element in an array by index takes the same amount of time regardless of the array’s size.
\(O(n)\) (Linear Time): The runtime grows proportionally with the size of the input. For example, searching for an element in an unsorted list takes \(O(n)\) time because, in the worst case, you have to check each element.
\(O(n^2)\) (Quadratic Time): The runtime grows quadratically with the input size. Sorting algorithms like Bubble Sort and Selection Sort exhibit \(O(n^2)\) time complexity because they involve nested loops.
\(O(\log n)\) (Logarithmic Time): The runtime grows logarithmically as the input size increases, often seen in algorithms that reduce the problem size with each step, like binary search (a sketch follows this list).
\(O(n \log n)\): This complexity appears in efficient sorting algorithms like Merge Sort and Quick Sort, combining the linear and logarithmic growth patterns.
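To make the logarithmic case concrete, here is a minimal binary search sketch over a sorted list; each iteration halves the remaining range, which is where the \(O(\log n)\) comes from:
# Binary search: repeatedly halve the search range of a sorted list
def binary_search(items, target):
    lo, hi = 0, len(items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if items[mid] == target:
            return mid
        if items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1  # target not found

print(binary_search([1, 3, 5, 7, 9, 11], 7))  # 3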
Space Complexity
Space complexity refers to the amount of memory an algorithm uses relative to the size of the input. This is also
expressed in Big-O notation. For instance, sorting an array in-place (i.e., modifying the input array) requires
\(O(1)\) auxiliary space, whereas Merge Sort requires \(O(n)\) additional space to store the temporary arrays
created during the merge process.
Why Algorithmic Complexity Matters
Understanding the time and space complexity of algorithms is crucial because it helps you:
Predict Performance: You can estimate how an algorithm will perform on large inputs, avoiding slowdowns that may arise with inefficient algorithms.
Choose the Right Tool: For example, you might choose a hash table (with \(O(1)\) lookup) over a binary search tree (with \(O(\log n)\) lookup) when you need fast access times. A short timing comparison follows this list.
Optimize Code: Knowing the time complexity helps identify bottlenecks and guides you in writing more efficient code.
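As a quick, hedged illustration of choosing the right tool, the sketch below times membership tests on a plain list (a linear scan) against a hash-based set (the collection size and repetition count are arbitrary):
import timeit

items_list = list(range(100_000))
items_set = set(items_list)

# Searching for the last element: the list scans every item, the set hashes once
print(timeit.timeit(lambda: 99_999 in items_list, number=100))
print(timeit.timeit(lambda: 99_999 in items_set, number=100))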
Data Structures
Lists
Python lists are dynamic arrays that support random access. They are versatile and frequently used due to their built-in
functionality.
# Python List Example
my_list = [1, 2, 3, 4]

my_list.append(5)   # O(1) - Insertion at the end
my_list.pop()       # O(1) - Deletion at the end
print(my_list[0])   # O(1) - Access
Complexity
Access: \(O(1)\)
Search: \(O(n)\)
Insertion (at end): \(O(1)\)
Deletion (at end): \(O(1)\)
Arrays
Arrays are fixed-size collections that store elements of the same data type. While Python lists are dynamic, we can use
the array module to simulate arrays.
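Here is a small sketch using the array module, assuming integer elements (typecode 'i'); unlike a list, an array stores a single fixed element type:
from array import array

# 'i' restricts the array to signed integers
my_array = array('i', [1, 2, 3, 4])
my_array.append(5)        # O(1) amortized - insertion at the end
print(my_array[0])        # O(1) - access by index
print(my_array.tolist())  # [1, 2, 3, 4, 5]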
Now we'll look at some very common sorting algorithms and their complexities, to better equip ourselves to choose the right algorithm for a given situation.
Bubble Sort
Repeatedly swap adjacent elements if they are in the wrong order.
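A minimal sketch is shown below; the nested loops give the \(O(n^2)\) average and worst cases, while the early exit on a pass with no swaps gives the \(O(n)\) best case:
# Bubble sort: repeatedly swap adjacent out-of-order elements
def bubble_sort(items):
    n = len(items)
    for i in range(n):
        swapped = False
        for j in range(n - i - 1):
            if items[j] > items[j + 1]:
                items[j], items[j + 1] = items[j + 1], items[j]
                swapped = True
        if not swapped:  # no swaps means the list is already sorted
            break
    return items

print(bubble_sort([5, 1, 4, 2, 8]))  # [1, 2, 4, 5, 8]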
We’ve explored a wide range of data structures and sorting algorithms, discussing their Python implementations, and
breaking down their time and space complexities. These foundational concepts are essential for any software developer to
understand, and mastering them will improve your ability to choose the right tools and algorithms for a given problem.
Below is a table outlining the complexities of these data structures:
| Data Structure | Access Time | Search Time | Insertion Time | Deletion Time | Space Complexity |
|----------------|-------------|-------------|----------------|---------------|------------------|
| List (Array) | \(O(1)\) | \(O(n)\) | \(O(n)\) | \(O(n)\) | \(O(n)\) |
| Stack | \(O(n)\) | \(O(n)\) | \(O(1)\) | \(O(1)\) | \(O(n)\) |
| Queue | \(O(n)\) | \(O(n)\) | \(O(1)\) | \(O(1)\) | \(O(n)\) |
| Set | N/A | \(O(1)\) | \(O(1)\) | \(O(1)\) | \(O(n)\) |
| Dictionary | N/A | \(O(1)\) | \(O(1)\) | \(O(1)\) | \(O(n)\) |
| Binary Tree (BST) | \(O(\log n)\) | \(O(\log n)\) | \(O(\log n)\) | \(O(\log n)\) | \(O(n)\) |
| Heap (Binary) | \(O(n)\) | \(O(n)\) | \(O(\log n)\) | \(O(\log n)\) | \(O(n)\) |
Below is a quick summary of the time complexities of the sorting algorithms we covered:
| Algorithm | Best Time Complexity | Average Time Complexity | Worst Time Complexity | Auxiliary Space |
|-----------|----------------------|-------------------------|-----------------------|-----------------|
| Bubble Sort | \(O(n)\) | \(O(n^2)\) | \(O(n^2)\) | \(O(1)\) |
| Selection Sort | \(O(n^2)\) | \(O(n^2)\) | \(O(n^2)\) | \(O(1)\) |
| Insertion Sort | \(O(n)\) | \(O(n^2)\) | \(O(n^2)\) | \(O(1)\) |
| Merge Sort | \(O(n \log n)\) | \(O(n \log n)\) | \(O(n \log n)\) | \(O(n)\) |
| Quick Sort | \(O(n \log n)\) | \(O(n \log n)\) | \(O(n^2)\) | \(O(\log n)\) |
| Heap Sort | \(O(n \log n)\) | \(O(n \log n)\) | \(O(n \log n)\) | \(O(1)\) |
| Bucket Sort | \(O(n + k)\) | \(O(n + k)\) | \(O(n^2)\) | \(O(n + k)\) |
| Radix Sort | \(O(nk)\) | \(O(nk)\) | \(O(nk)\) | \(O(n + k)\) |
Keep this table handy as a reference for making decisions on the appropriate sorting algorithm based on time and space
constraints.
In our previous post, we introduced One-Hot Encoding and
the Bag-of-Words (BoW) model, which are simple methods of representing text as numerical vectors. While these
techniques are foundational, they come with certain limitations. One major drawback of Bag-of-Words is that it treats
all words equally—common words like “the” or “is” are given the same importance as more meaningful words like
“science” or “NLP.”
TF-IDF (Term Frequency-Inverse Document Frequency) is an extension of BoW that aims to address this problem. By
weighting words based on their frequency in individual documents versus the entire corpus, TF-IDF highlights more
important words and reduces the impact of common, less meaningful ones.
TF-IDF
TF-IDF stands for Term Frequency-Inverse Document Frequency. It’s a numerical statistic used to reflect the
importance of a word in a document relative to a collection of documents (a corpus). The formula is:
\[\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)\]
where:
\(\text{TF}(t, d)\): Term Frequency of term \(t\) in document \(d\), which captures how frequently \(t\) appears in \(d\).
\(\text{IDF}(t)\): Inverse Document Frequency, which measures how important \(t\) is across the entire corpus.
Term Frequency (TF)
Term Frequency (TF) is simply a count of how frequently a term appears in a document. The higher the frequency, the
more relevant the word is assumed to be for that specific document.
\[\text{TF}(t, d) = \frac{\text{Number of occurrences of } t \text{ in } d}{\text{Total number of terms in } d}\]
For example, if the word “NLP” appears 3 times in a document of 100 words, the term frequency for “NLP” is:
\[\text{TF}(NLP, d) = \frac{3}{100} = 0.03\]
Inverse Document Frequency (IDF)
Inverse Document Frequency (IDF) downweights common words that appear in many documents and upweights rare words
that are more meaningful in specific contexts. The formula is:
\[\text{IDF}(t) = \log\left(\frac{N}{\text{DF}(t) + 1}\right)\]
where:
\(N\) is the total number of documents in the corpus.
\(\text{DF}(t)\) is the number of documents that contain the term \(t\).
The “+1” in the denominator is there to avoid division by zero. Words that appear in many documents (e.g., “is”, “the”)
will have a lower IDF score, while rare terms will have higher IDF scores.
Example
Let’s take an example with two documents:
Document 1: “I love NLP and NLP loves me”
Document 2: “NLP is great and I enjoy learning NLP”
Both documents contain "NLP", so \(N = 2\) and \(\text{DF}(NLP) = 2\), giving \(\text{IDF}(NLP) = \log\left(\frac{2}{2 + 1}\right) = \log(2/3)\), which is negative. The negative value here shows that "NLP" is a very common term in this corpus, and its weight will be downscaled.
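A quick numeric check (using the natural logarithm here; the base affects only the scale, not the sign):
import math

N = 2        # total number of documents
df_nlp = 2   # documents containing "NLP"
print(math.log(N / (df_nlp + 1)))  # approximately -0.405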
Code Example: TF-IDF with TfidfVectorizer
Now let’s use TfidfVectorizer from sklearn to automatically calculate TF-IDF scores for our documents.
from sklearn.feature_extraction.text import TfidfVectorizer

# Define a corpus of documents
corpus = [
    "I love NLP and NLP loves me",
    "NLP is great and I enjoy learning NLP"
]

# Initialize the TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the corpus into TF-IDF vectors
X = vectorizer.fit_transform(corpus)

# Display the feature names (vocabulary)
print(vectorizer.get_feature_names_out())

# Display the TF-IDF matrix
print(X.toarray())
Each row in the output corresponds to a document, and each column corresponds to a term in the vocabulary. The values
represent the TF-IDF score of each term for each document.
Advantages of TF-IDF
Balances Frequency: TF-IDF considers both how frequently a word appears in a document (term frequency) and how unique or common it is across all documents (inverse document frequency). This helps prioritize meaningful words.
Reduces Impact of Stop Words: By downweighting terms that appear in many documents, TF-IDF naturally handles common stop words without needing to remove them manually.
Efficient for Large Corpora: TF-IDF is computationally efficient and scales well to large datasets.
Limitations of TF-IDF
While TF-IDF is a significant improvement over simple Bag-of-Words, it still has some limitations:
No Semantic Meaning: Like Bag-of-Words, TF-IDF treats words as independent features and doesn’t capture the relationships or meaning between them.
Sparse Representations: Even with the IDF weighting, TF-IDF still generates high-dimensional and sparse vectors, especially for large vocabularies.
Ignores Word Order: TF-IDF doesn’t account for word order, so sentences with the same words in different arrangements will have the same representation.
Conclusion
TF-IDF is a powerful and widely-used method for text representation, especially in tasks like document retrieval and
search engines, where distinguishing between important and common words is crucial. However, as we’ve seen, TF-IDF
doesn’t capture the meaning or relationships between words, which is where word embeddings come into play.
In our previous post, we covered the preprocessing steps
necessary to convert text into a machine-readable format, like tokenization and stop word removal. But once the text is
preprocessed, how do we represent it for use in machine learning models?
Before the rise of word embeddings, simpler techniques were commonly used to represent text as vectors. Today, we’ll
explore two foundational techniques: One-Hot Encoding and Bag-of-Words (BoW). These methods don’t capture the
semantic meaning of words as well as modern embeddings do, but they’re essential for understanding the evolution of
Natural Language Processing (NLP).
One-Hot Encoding
One of the simplest ways to represent text is through One-Hot Encoding. In this approach, each word in a vocabulary
is represented as a vector where all the elements are zero, except for a single element that corresponds to the word’s
index.
Let’s take a small vocabulary:
["I", "love", "NLP"]
The vocabulary size is 3, and each word will be represented by a 3-dimensional vector:
I -> [1, 0, 0]
love -> [0, 1, 0]
NLP -> [0, 0, 1]
Each word is “hot” (1) in one specific position, while “cold” (0) everywhere else.
Example
Let’s generate one-hot encoded vectors using Python:
from sklearn.preprocessing import OneHotEncoder
import numpy as np

# Define a small vocabulary
vocab = ["I", "love", "NLP"]

# Reshape the data for OneHotEncoder
vocab_reshaped = np.array(vocab).reshape(-1, 1)

# Initialize the OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)

# Fit and transform the vocabulary
onehot_encoded = encoder.fit_transform(vocab_reshaped)
print(onehot_encoded)
The output shows one binary vector per word (OneHotEncoder sorts its categories, so the column order is "I", "NLP", "love"):
[[1. 0. 0.]
[0. 0. 1.]
[0. 1. 0.]]
Each word has a unique binary vector representing its position in the vocabulary.
Drawbacks of One-Hot Encoding
One-Hot Encoding is simple but comes with some limitations:
High Dimensionality: For large vocabularies, the vectors become huge, leading to a “curse of dimensionality”.
Lack of Semantic Information: One-Hot vectors don’t capture any relationships between words. “love” and “like” would have completely different vectors, even though they are semantically similar.
Bag-of-Words (BoW)
One-Hot Encoding represents individual words, but what about whole documents or sentences? That’s where the Bag-of-Words
(BoW) model comes in. In BoW, the text is represented as a vector of word frequencies.
BoW counts how often each word from a given vocabulary appears in the document, without considering the order of words
(hence, a “bag” of words).
Let’s take two example sentences:
“I love NLP”
“NLP is amazing”
The combined vocabulary for these two sentences is:
["I", "love", "NLP", "is", "amazing"]
Now, using BoW, we represent each sentence as a vector of word counts:
“I love NLP” -> [1, 1, 1, 0, 0] (since “I”, “love”, and “NLP” appear once, and “is” and “amazing” don’t appear)
“NLP is amazing” -> [0, 0, 1, 1, 1] (since “NLP”, “is”, and “amazing” appear once, and “I” and “love” don’t appear)
Example
We can use CountVectorizer from the sklearn library to easily apply Bag-of-Words to a corpus of text:
from sklearn.feature_extraction.text import CountVectorizer

# Define a set of documents
corpus = [
    "I love NLP",
    "NLP is amazing"
]

# Initialize the CountVectorizer
vectorizer = CountVectorizer()

# Transform the corpus into BoW vectors
X = vectorizer.fit_transform(corpus)

# Display the feature names (the vocabulary)
print(vectorizer.get_feature_names_out())

# Display the BoW representation
print(X.toarray())
While BoW is a simple and powerful method, it too has its drawbacks:
Sparsity: Like One-Hot Encoding, BoW produces high-dimensional and sparse vectors, especially for large vocabularies.
No Word Order: BoW ignores word order. The sentence “I love NLP” is treated the same as “NLP love I”, which may not always make sense.
No Semantic Relationships: Just like One-Hot Encoding, BoW doesn’t capture the meaning or relationships between words. All words are treated as independent features.
Conclusion
Both One-Hot Encoding and Bag-of-Words are simple and effective ways to represent text as numbers, but they have significant
limitations, particularly in capturing semantic relationships and dealing with large vocabularies.
These methods laid the groundwork for more sophisticated representations like TF-IDF (which we’ll cover next) and
eventually led to word embeddings, which capture the meaning and context of words more effectively.