Exploring Advanced Word Embeddings
20 Oct 2024
Introduction
In our previous posts, we explored traditional text representation techniques like One-Hot Encoding, Bag-of-Words, and TF-IDF, and we introduced static word embeddings like Word2Vec and GloVe. While these techniques are powerful, they have limitations, especially when it comes to capturing the context of words.
In this post, we’ll explore more advanced topics that push the boundaries of NLP:
- Contextual Word Embeddings like ELMo, BERT, and GPT
- Dimensionality Reduction techniques for visualizing embeddings
- Applications of Word Embeddings in real-world tasks
- Training Custom Word Embeddings on your own data
Let’s dive in!
Contextual Word Embeddings
Traditional embeddings like Word2Vec and GloVe generate a single fixed vector for each word. This means the word “bank” will have the same vector whether it refers to a “river bank” or a “financial institution,” which is a major limitation in understanding nuanced meanings in context.
Contextual embeddings, on the other hand, generate different vectors for the same word depending on its context. These models are based on deep learning architectures and have revolutionized NLP by capturing the dynamic nature of language.
ELMo (Embeddings from Language Models)
ELMo was one of the first models to introduce the idea of context-dependent word representations. Instead of a fixed vector, ELMo generates a vector for each word that depends on the entire sentence. It uses bidirectional LSTMs to achieve this, looking both forward and backward in the text to understand the context.
BERT (Bidirectional Encoder Representations from Transformers)
BERT takes contextual embeddings to the next level using the Transformer architecture. Unlike traditional models, which process text in one direction (left-to-right or right-to-left), BERT is bidirectional, meaning it looks at all the words before and after a given word to understand its meaning. BERT also uses pretraining and fine-tuning, making it one of the most versatile models in NLP.
GPT (Generative Pretrained Transformer)
While GPT is similar to BERT in using the Transformer architecture, it is primarily unidirectional and excels at generating text. This model has been the backbone for many state-of-the-art systems in tasks like text generation, summarization, and dialogue systems.
Why Contextual Embeddings Matter
Contextual embeddings are critical in modern NLP applications, such as:
- Named Entity Recognition (NER): Contextual models help disambiguate words with multiple meanings.
- Machine Translation: These embeddings capture the nuances of language, making translations more accurate.
- Question-Answering: Systems like GPT-3 excel in understanding and responding to complex queries by leveraging context.
To experiment with BERT, you can try the transformers library from Hugging Face:
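Here’s a minimal sketch, assuming the bert-base-uncased checkpoint and the sentence “NLP is amazing” (whose tokens we’ll look at shortly):

```python
import torch
from transformers import BertTokenizer, BertModel

# Load the pretrained BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Tokenize a sentence and run it through the model
inputs = tokenizer("NLP is amazing", return_tensors="pt")
with torch.no_grad():
    output = model(**inputs)

# One contextual 768-dimensional vector per token (including [CLS] and [SEP])
print(output.last_hidden_state.shape)
```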
The tensor output from this process is the model’s last hidden state: one 768-dimensional vector for each token in the sentence, including the special [CLS] and [SEP] tokens.
Visualizing Word Embeddings
Word embeddings are usually represented as high-dimensional vectors (e.g., 300 dimensions for Word2Vec). While this is great for models, it’s difficult for humans to interpret these vectors directly. This is where dimensionality reduction techniques like PCA and t-SNE come in handy.
Principal Component Analysis (PCA)
PCA reduces the dimensions of the word vectors while preserving the most important information. It helps us visualize clusters of similar words in a lower-dimensional space (e.g., 2D or 3D).
Following on from the previous example, we’ll use the simple embeddings that we’ve generated in the output variable.
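A rough sketch of how this might look with scikit-learn and matplotlib (assuming the output, inputs, and tokenizer objects from the BERT example above):

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Token-level embeddings from the BERT output: shape (num_tokens, 768)
embeddings = output.last_hidden_state[0].numpy()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())

# Reduce the 768 dimensions down to 2 principal components
pca = PCA(n_components=2)
reduced = pca.fit_transform(embeddings)

# Plot each token at its 2D position
plt.figure(figsize=(8, 6))
plt.scatter(reduced[:, 0], reduced[:, 1])
for token, (x, y) in zip(tokens, reduced):
    plt.annotate(token, (x, y))
plt.title("BERT token embeddings (PCA)")
plt.show()
```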
You should see a plot similar to this:
This is a scatter plot where the 768 dimensions of each embedding have been reduced to 2 principal components using Principal Component Analysis (PCA), allowing us to plot them in two-dimensional space.
Some observations when looking at this chart:
Special Tokens [CLS] and [SEP]
These special tokens are essential in BERT. The [CLS] token is typically used as a summary representation for the entire sentence (especially in classification tasks), and the [SEP] token is used to separate sentences or indicate the end of a sentence.
In the plot, you can see [CLS] and [SEP] are far apart from other tokens, especially [SEP], which has a distinct position in the vector space. This makes sense since their roles are unique compared to actual word tokens like “amazing” or “is.”
Subword Tokens
Notice the token labeled ##p. This represents a subword. BERT uses a WordPiece tokenization algorithm, which breaks rare or complex words into subword units. In this case, “NLP” has been split into nl and ##p because BERT doesn’t have “NLP” as a whole word in its vocabulary. The fact that nl and ##p are close together in the plot indicates that BERT keeps semantically related parts of the same word close in the vector space.
Contextual Similarity
The tokens “amazing” and “is” are relatively close to each other, which reflects that they are part of the same sentence and share a contextual relationship. Interestingly, “amazing” is a bit more isolated, which could be because it’s a more distinctive word with a strong meaning, whereas “is” is a more common auxiliary verb and closer to other less distinctive tokens.
Distribution and Separation
The distance between tokens shows how BERT separates different tokens in the vector space based on their contextual
meaning. For example, [SEP]
is far from the other tokens because it serves a very different role in the sentence.
The overall spread of the tokens suggests that BERT embeddings can clearly distinguish between different word types
(subwords, regular words, and special tokens).
t-SNE (t-Distributed Stochastic Neighbor Embedding)
t-SNE is another popular technique for visualizing high-dimensional data. It captures both local and global structures of the embeddings and is often used to visualize word clusters based on their semantic similarity.
I’ve continued on from the code that we’ve been using:
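A sketch continuing with the embeddings and tokens variables from the PCA example (the perplexity is kept very low because we only have a handful of tokens):

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# t-SNE requires perplexity < number of samples, and we only have a few tokens
tsne = TSNE(n_components=2, perplexity=2, random_state=42)
reduced_tsne = tsne.fit_transform(embeddings)

# Plot each token at its 2D position
plt.figure(figsize=(8, 6))
plt.scatter(reduced_tsne[:, 0], reduced_tsne[:, 1])
for token, (x, y) in zip(tokens, reduced_tsne):
    plt.annotate(token, (x, y))
plt.title("BERT token embeddings (t-SNE)")
plt.show()
```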
The output looks a little different to PCA, with a noticeably different distribution of the embeddings.
Real-World Applications of Word Embeddings
Word embeddings are foundational in numerous NLP applications:
- Semantic Search: Embeddings allow search engines to find documents based on meaning rather than exact keyword matches.
- Sentiment Analysis: Embeddings can capture the sentiment of text, enabling models to predict whether a review is positive or negative.
- Machine Translation: By representing words from different languages in the same space, embeddings improve the accuracy of machine translation systems.
- Question-Answering Systems: Modern systems like GPT-3 use embeddings to understand and respond to natural language queries.
Example: Semantic Search with Word Embeddings
In a semantic search engine, user queries and documents are both represented as vectors in the same embedding space. By calculating the cosine similarity between these vectors, we can retrieve documents that are semantically related to the query.
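Here’s a minimal sketch using NumPy and scikit-learn, with random vectors standing in for real embeddings:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Simulated embeddings: in practice these would come from a model such as BERT
np.random.seed(42)
query_embedding = np.random.rand(1, 768)      # the user's query
document_embeddings = np.random.rand(5, 768)  # five documents

# Cosine similarity between the query and every document
similarities = cosine_similarity(query_embedding, document_embeddings)[0]

# Rank the documents from most to least similar to the query
ranked_indices = similarities.argsort()[::-1]
print(ranked_indices)
```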
Walking through this code:
query_embedding and document_embeddings
- We generate random vectors to simulate the embeddings. In a real use case, these would come from an embedding model (e.g., BERT, Word2Vec). The query_embedding represents the vector for the user’s query, and document_embeddings represents vectors for a set of documents.
- Both query_embedding and document_embeddings must have the same dimensionality (e.g., 768 if you’re using BERT).
Cosine Similarity
- The cosine_similarity() function computes the cosine similarity between the query_embedding and each document embedding.
- Cosine similarity measures the cosine of the angle between two vectors, which ranges from -1 (completely dissimilar) to 1 (completely similar). In this case, we’re interested in documents that are most similar to the query (values close to 1).
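If you want to see what cosine_similarity() is doing under the hood, this is the equivalent calculation for a single pair of vectors:

```python
import numpy as np

def cosine(u, v):
    # cos(theta) = (u . v) / (||u|| * ||v||)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Matches the first score returned by cosine_similarity() above
print(cosine(query_embedding[0], document_embeddings[0]))
```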
Ranking the Documents
- We use argsort() to get the indices of the document embeddings sorted in ascending order of similarity.
- The [::-1] reverses this order so that the most similar documents appear first.
- The ranked_indices gives the document indices, ranked from most similar to least similar to the query.
The output is the list of document indices, ordered from most similar to least similar to the query.
Training Your Own Word Embeddings
While pretrained embeddings like Word2Vec and BERT are incredibly powerful, sometimes you need embeddings that are fine-tuned to your specific domain or dataset. You can train your own embeddings using frameworks like Gensim for Word2Vec or PyTorch for more complex models.
The following code shows training Word2Vec with Gensim:
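A minimal sketch with a made-up toy corpus (in practice you’d use your own tokenized sentences):

```python
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens
sentences = [
    ["NLP", "is", "amazing"],
    ["word", "embeddings", "capture", "semantic", "meaning"],
    ["we", "can", "train", "our", "own", "embeddings"],
]

# Train a small Word2Vec model with 100-dimensional vectors
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Retrieve the learned vector for "NLP"
print(model.wv["NLP"])
```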
The output here is a 100-dimensional vector that represents the word NLP.
Fine-Tuning BERT with PyTorch
You can also fine-tune BERT or other transformer models on your own dataset. This is useful when you need embeddings that are tailored to a specific domain, such as medical or legal text.
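As a rough sketch only (the texts and labels below are hypothetical placeholders), fine-tuning BERT for a two-class classification task might look like this:

```python
import torch
from torch.optim import AdamW
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Hypothetical domain-specific examples and labels
texts = ["The contract was terminated early.", "The patient responded well to treatment."]
labels = torch.tensor([0, 1])

inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = AdamW(model.parameters(), lr=2e-5)

# A very small training loop; real fine-tuning would batch over a full dataset
model.train()
for epoch in range(3):
    optimizer.zero_grad()
    outputs = model(**inputs, labels=labels)
    outputs.loss.backward()
    optimizer.step()
    print(f"epoch {epoch}: loss {outputs.loss.item():.4f}")
```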
Conclusion
Word embeddings have come a long way, from static models like Word2Vec and GloVe to dynamic, context-aware models like BERT and GPT. These techniques have revolutionized how we represent and process language in NLP. Alongside dimensionality reduction for visualization, applications such as semantic search, sentiment analysis, and custom embeddings training open up a world of possibilities.