Cogs and Levers A blog full of technical stuff

Double Buffering with Windows GDI in MASM

Introduction

In a previous post , I detailed a double-buffering implementation written in C. The idea behind double buffering is to draw graphics off-screen, then quickly swap (or “flip”) this off-screen buffer with the visible screen. This technique reduces flickering and provides smoother rendering. While the C implementation was relatively straightforward using GDI functions, I decided to challenge myself by recreating it in assembly language using MASM32.

There are some slight differences that I’ll go through.

The full code is available as a gist to follow along with.

Macros

First up, this module defines some macros that are just helpful blocks of reusable code.

  • szText defines a string inline
  • m2m assigns a value from one memory location to another
  • return is a simple analog for the return keyword in C
  • rgb encodes 8-bit RGB components into the eax register

; Defines strings in an ad-hoc fashion
szText MACRO Name, Text:VARARG
    LOCAL lbl
    jmp lbl
    Name db Text,0
lbl:
ENDM

; Assigns a value from a memory location into another memory location
m2m MACRO M1, M2
    push M2
    pop M1
ENDM

; Syntax sugar for returning from a PROC
return MACRO arg
    mov eax, arg
    ret
ENDM

; Encodes 8-bit RGB components into EAX as a COLORREF (0x00BBGGRR)
rgb MACRO r, g, b
    xor eax, eax
    mov ah, b
    mov al, g
    rol eax, 8
    mov al, r
ENDM

Setup

The setup is very much like its C counterpart: register a window class first, then create the window.

szText szClassName, "DoubleBufferClass"

mov wc.cbSize, sizeof WNDCLASSEX
mov wc.style, CS_HREDRAW or CS_VREDRAW or CS_BYTEALIGNWINDOW
mov wc.lpfnWndProc, offset WndProc
mov wc.cbClsExtra, NULL
mov wc.cbWndExtra, NULL
m2m wc.hInstance, hInst
mov wc.hbrBackground, NULL
mov wc.lpszMenuName, NULL
mov wc.lpszClassName, offset szClassName

invoke LoadIcon, NULL, IDI_APPLICATION
mov wc.hIcon, eax

invoke LoadCursor, NULL, IDC_ARROW
mov wc.hCursor, eax

mov wc.hIconSm, 0

invoke RegisterClassEx, ADDR wc

The szClassName gives us a reference to the name of the class to use.

invoke CreateWindowEx, WS_EX_APPWINDOW, ADDR szClassName, ADDR szDisplayName, WS_OVERLAPPEDWINDOW,
                       CW_USEDEFAULT, CW_USEDEFAULT, CW_USEDEFAULT, CW_USEDEFAULT, NULL, NULL, hInst, NULL

Message pump

We continue to render out to the window in a loop:

StillRunning:
    invoke PeekMessage, ADDR msg, NULL, 0, 0, PM_REMOVE

    cmp eax, 0
    je RenderFrame

    invoke DispatchMessage, ADDR msg

RenderFrame:

    invoke DrawRandomShape
    invoke InvalidateRect, hWnd, NULL, 0

    cmp bRunning, 1
    je StillRunning

Using InvalidateRect tells the window that there is an update to draw. This is then propagated through the WM_PAINT message.

Window proc

Each of the window messages is handled in a switch/case-like arrangement with a series of cmp and je instructions.

In higher-level MASM this chain of comparisons can be written more succinctly with the .IF/.ELSEIF syntax.

    cmp uMsg, WM_CREATE
    je CreateMessage

    cmp uMsg, WM_DESTROY
    je DestroyMessage

    cmp uMsg, WM_ERASEBKGND
    je EraseBackgroundMessage

    cmp uMsg, WM_CLOSE
    je CloseMessage

    cmp uMsg, WM_SIZE
    je SizeMessage

    cmp uMsg, WM_PAINT
    je PaintMessage

    jmp DefaultWindowHandler

We use the WM_CREATE, WM_SIZE, and WM_DESTROY messages to control when we create and destroy our back buffer.

CreateMessage:
SizeMessage:
    invoke RecreateBackBuffer, hWin
    jmp DefaultWindowHandler

DestroyMessage:

    invoke DestroyBackBuffer, hWin
    invoke PostQuitMessage, 0

    xor eax, eax
    ret

WM_PAINT only needs to worry about drawing the back buffer to the window.

PaintMessage:
    invoke FlipBackBuffer, hWin
    mov eax, 1
    ret

Handling the buffer

The routine that handles the back buffer construction is called RecreateBackBuffer. It cleans up any existing buffer before it tries to create a new one, saving the program from memory leaks:

    invoke DestroyBackBuffer, hWin
    
    invoke GetClientRect, hWin, ADDR clientRect
    invoke GetDC, hWin
    mov winDC, eax

    invoke CreateCompatibleDC, winDC
    mov memDC, eax

    mov eax, clientRect.right
    sub eax, clientRect.left
    mov ebx, clientRect.bottom
    sub ebx, clientRect.top

    invoke CreateCompatibleBitmap, winDC, eax, ebx
    mov memBitmap, eax
    invoke SelectObject, memDC, memBitmap
    mov memOldBitmap, eax
    xor eax, eax

DestroyBackBuffer is the first thing called here; it's just a basic cleanup:

    cmp memDC, NULL
    je MemoryDCDeallocated

    invoke SelectObject, memDC, memOldBitmap
    invoke DeleteObject, memBitmap
    invoke DeleteDC, memDC

    mov memDC, NULL
    mov memOldBitmap, NULL
    mov memBitmap, NULL

MemoryDCDeallocated:

    cmp winDC, NULL
    je WindowDCReleased

    invoke ReleaseDC, hWin, winDC
    mov winDC, NULL

WindowDCReleased:

    xor eax, eax
    ret

Flipping

When we want to draw that back buffer to the window, we just use BitBlt from the GDI:

    LOCAL hDC:HDC
    LOCAL ps:PAINTSTRUCT

    invoke BeginPaint, hWin, ADDR ps
    mov hDC, eax

    mov eax, clientRect.right
    sub eax, clientRect.left
    mov ebx, clientRect.bottom
    sub ebx, clientRect.top

    invoke BitBlt, hDC, 0, 0, eax, ebx, memDC, 0, 0, SRCCOPY
    invoke EndPaint, hWin, ADDR ps

    xor eax, eax
    ret

Conclusion

This is just a different take on the same application written in C. Some of the control structures in assembly language can seem a little hard to follow, but there is something elegant about their simplicity.

Data Structures, Algorithms, and Complexity

Introduction

In this post, we’ll walk through fundamental data structures and sorting algorithms, using Python to demonstrate key concepts and code implementations. We’ll also discuss the algorithmic complexity of various operations like searching, inserting, and deleting, as well as the best, average, and worst-case complexities of popular sorting algorithms.

Algorithmic Complexity

When working with data structures and algorithms, it’s crucial to consider how efficiently they perform under different conditions. This is where algorithmic complexity comes into play. It helps us measure how the time or space an algorithm uses grows as the input size increases.

Time Complexity

Time complexity refers to the amount of time an algorithm takes to complete, usually expressed as a function of the size of the input, \(n\). We typically use Big-O notation to describe the worst-case scenario. The goal is to approximate how the time increases as the input size grows.

Common Time Complexities:

  • \(O(1)\) (Constant Time): The runtime does not depend on the size of the input. For example, accessing an element in an array by index takes the same amount of time regardless of the array’s size.
  • \(O(n)\) (Linear Time): The runtime grows proportionally with the size of the input. For example, searching for an element in an unsorted list takes \(O(n)\) time because, in the worst case, you have to check each element.
  • \(O(n^2)\) (Quadratic Time): The runtime grows quadratically with the input size. Sorting algorithms like Bubble Sort and Selection Sort exhibit \(O(n^2)\) time complexity because they involve nested loops.
  • \(O(\log n)\) (Logarithmic Time): The runtime grows logarithmically as the input size increases, often seen in algorithms that reduce the problem size with each step, like binary search (see the sketch after this list).
  • \(O(n \log n)\): This complexity appears in efficient sorting algorithms like Merge Sort and Quick Sort, combining the linear and logarithmic growth patterns.
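
To make the logarithmic case concrete, here is a minimal binary search sketch. Each iteration halves the remaining search space, which is what produces the \(O(\log n)\) behavior:

def binary_search(sorted_list, target):
    lo, hi = 0, len(sorted_list) - 1
    while lo <= hi:
        mid = (lo + hi) // 2          # halve the search space each pass
        if sorted_list[mid] == target:
            return mid
        elif sorted_list[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

print(binary_search([1, 3, 5, 7, 9, 11], 7))  # 3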

Space Complexity

Space complexity refers to the amount of memory an algorithm uses relative to the size of the input. This is also expressed in Big-O notation. For instance, sorting an array in-place (i.e., modifying the input array) requires \(O(1)\) auxiliary space, whereas Merge Sort requires \(O(n)\) additional space to store the temporary arrays created during the merge process.
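
The distinction is easy to see in a pair of toy functions: the first reverses a list in place with \(O(1)\) auxiliary space, while the second allocates a whole new list and so uses \(O(n)\):

def reverse_in_place(arr):
    # O(1) auxiliary space: just two indices and a swap
    i, j = 0, len(arr) - 1
    while i < j:
        arr[i], arr[j] = arr[j], arr[i]
        i, j = i + 1, j - 1

def reversed_copy(arr):
    # O(n) auxiliary space: builds a second list of the same size
    return arr[::-1]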

Why Algorithmic Complexity Matters

Understanding the time and space complexity of algorithms is crucial because it helps you:

  • Predict Performance: You can estimate how an algorithm will perform on large inputs, avoiding slowdowns that may arise with inefficient algorithms.
  • Choose the Right Tool: For example, you might choose a hash table (with \(O(1)\) lookup) over a binary search tree (with \(O(\log n)\) lookup) when you need fast access times.
  • Optimize Code: Knowing the time complexity helps identify bottlenecks and guides you in writing more efficient code.

Data Structures

Lists

Python lists are dynamic arrays that support random access. They are versatile and frequently used due to their built-in functionality.

# Python List Example
my_list = [1, 2, 3, 4]
my_list.append(5)    # O(1) - Insertion at the end
my_list.pop()        # O(1) - Deletion at the end
print(my_list[0])    # O(1) - Access

Complexity

  • Access: \(O(1)\)
  • Search: \(O(n)\)
  • Insertion (at end): \(O(1)\) amortized
  • Deletion (at end): \(O(1)\)

Arrays

Arrays are fixed-size collections that store elements of the same data type. While Python lists are dynamic, we can use the array module to simulate arrays.

import array
my_array = array.array('i', [1, 2, 3])
my_array.append(4)
print(my_array)

Complexity

  • Access: \(O(1)\)
  • Search: \(O(n)\)
  • Insertion/Deletion: \(O(n)\)

Stacks

A Stack is a Last-In-First-Out (LIFO) structure. Python lists can be used to simulate stack operations with append() and pop().

class Stack:
    def __init__(self):
        self.stack = []

    def push(self, value):
        self.stack.append(value)

    def pop(self):
        return self.stack.pop()

my_stack = Stack()
my_stack.push(1)
my_stack.push(2)
print(my_stack.pop())  # O(1)

Complexity

  • Push/Pop: \(O(1)\)
  • Peek: \(O(1)\)
  • Search: \(O(n)\)

Queues

A Queue is a First-In-First-Out (FIFO) structure. Python’s collections.deque is efficient for this purpose.

from collections import deque
queue = deque()
queue.append(1)  # Enqueue
queue.append(2)
print(queue.popleft())  # Dequeue O(1)

Complexity

  • Enqueue/Dequeue: \(O(1)\)
  • Search: \(O(n)\)

Sets

Sets are unordered collections with no duplicates. Python’s set is implemented as a hash table.

my_set = set()
my_set.add(1)  # O(1)
my_set.remove(1)  # O(1)
print(1 in my_set)  # O(1)

Complexity

  • Add: \(O(1)\)
  • Remove: \(O(1)\)
  • Search: \(O(1)\)

Maps (Dictionaries)

Dictionaries store key-value pairs and are implemented as hash tables in Python.

my_dict = {}
my_dict['key'] = 'value'  # O(1)
del my_dict['key']  # O(1)
print('key' in my_dict)  # O(1)

Complexity

  • Insertion/Lookup/Deletion: \(O(1)\)

Trees

Trees are hierarchical data structures that allow efficient searching and sorting. A Binary Search Tree (BST) is one common example.

class Node:
    def __init__(self, key):
        self.left = None
        self.right = None
        self.value = key

def insert(root, key):
    if root is None:
        return Node(key)
    if key < root.value:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return root

root = None
root = insert(root, 10)
root = insert(root, 20)
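
The example above only covers insertion; a lookup walks the same path down the tree, discarding one subtree at each step:

def search(root, key):
    # Returns the matching node, or None if the key isn't present
    if root is None or root.value == key:
        return root
    if key < root.value:
        return search(root.left, key)
    return search(root.right, key)

print(search(root, 20).value)  # 20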

Complexity

  • Insertion/Search/Deletion: \(O(\log n)\) for balanced trees, \(O(n)\) for unbalanced trees.

Heaps

Heaps are specialized tree-based structures where the parent node is always greater (max-heap) or smaller (min-heap) than its children.

import heapq
heap = []
heapq.heappush(heap, 10)  # O(log n)
heapq.heappush(heap, 5)
print(heapq.heappop(heap))  # O(log n)

Complexity

  • Insertion/Deletion: \(O(\log n)\)
  • Peek: \(O(1)\)

Sorting Algorithms

Now we’ll talk about some very common sorting algorithms and understand their complexity, to better equip ourselves to choose the right sorting algorithm for a given job.

Bubble Sort

Repeatedly swap adjacent elements if they are in the wrong order.

def bubble_sort(arr):
    for i in range(len(arr)):
        swapped = False
        for j in range(0, len(arr) - i - 1):
            if arr[j] > arr[j + 1]:
                arr[j], arr[j + 1] = arr[j + 1], arr[j]
                swapped = True
        if not swapped:
            break  # no swaps means already sorted: this gives the O(n) best case

arr = [5, 2, 9, 1, 5, 6]
bubble_sort(arr)
print(arr)

Complexity

  • Best: \(O(n)\)
  • Average/Worst: \(O(n^2)\)
  • Space: \(O(1)\)

Selection Sort

Select the smallest element and swap it with the current element.

def selection_sort(arr):
    for i in range(len(arr)):
        min_idx = i
        for j in range(i + 1, len(arr)):
            if arr[j] < arr[min_idx]:
                min_idx = j
        arr[i], arr[min_idx] = arr[min_idx], arr[i]

arr = [64, 25, 12, 22, 11]
selection_sort(arr)
print(arr)

Complexity

  • Best/Average/Worst: \(O(n^2)\)
  • Space: \(O(1)\)

Insertion Sort

Insert each element into its correct position in the sorted portion of the array.

def insertion_sort(arr):
    for i in range(1, len(arr)):
        key = arr[i]
        j = i - 1
        while j >= 0 and key < arr[j]:
            arr[j + 1] = arr[j]
            j -= 1
        arr[j + 1] = key

arr = [12, 11, 13, 5, 6]
insertion_sort(arr)
print(arr)

Complexity

  • Best: \(O(n)\)
  • Average/Worst: \(O(n^2)\)
  • Space: \(O(1)\)

Merge Sort

Divide and conquer algorithm that splits the array and merges them back in sorted order.

def merge_sort(arr):
    if len(arr) > 1:
        mid = len(arr) // 2
        left = arr[:mid]
        right = arr[mid:]

        merge_sort(left)
        merge_sort(right)

        i = j = k = 0
        while i < len(left) and j < len(right):
            if left[i] < right[j]:
                arr[k] = left[i]
                i += 1
            else:
                arr[k] = right[j]
                j += 1
            k += 1

        while i < len(left):
            arr[k] = left[i]
            i += 1
            k += 1

        while j < len(right):
            arr[k] = right[j]
            j += 1
            k += 1

arr = [12, 11, 13, 5, 6, 7]
merge_sort(arr)
print(arr)

Complexity

  • Best/Average/Worst: \(O(n \log n)\)
  • Space: \(O(n)\)

Quick Sort

Picks a pivot and partitions the array around the pivot.

def partition(arr, low, high):
    pivot = arr[high]
    i = low - 1
    for j in range(low, high):
        if arr[j] <= pivot:
            i += 1
            arr[i], arr[j] = arr[j], arr[i]
    arr[i + 1], arr[high] = arr[high], arr[i + 1]
    return i + 1

def quick_sort(arr, low, high):
    if low < high:
        pi = partition(arr, low, high)
        quick_sort(arr, low, pi - 1)
        quick_sort(arr, pi + 1, high)

arr = [10, 7, 8, 9, 1, 5]
quick_sort(arr, 0, len(arr) - 1)
print(arr)

Complexity

  • Best/Average: \(O(n \log n)\)
  • Worst: \(O(n^2)\)
  • Space: \(O(\log n)\)

Heap Sort

Uses a heap data structure to find the maximum or minimum element.

def heapify(arr, n, i):
    largest = i
    l = 2 * i + 1
    r = 2 * i + 2

    if l < n and arr[l] > arr[largest]:
        largest = l

    if r < n and arr[r] > arr[largest]:
        largest = r

    if largest != i:
        arr[i], arr[largest] = arr[largest], arr[i]
        heapify(arr, n, largest)

def heap_sort(arr):
    n = len(arr)
    for i in range(n // 2 - 1, -1, -1):
        heapify(arr, n, i)
    for i in range(n - 1, 0, -1):
        arr[i], arr[0] = arr[0], arr[i]
        heapify(arr, i, 0)

arr = [12, 11, 13, 5, 6, 7]
heap_sort(arr)
print(arr)

Complexity

  • Best/Average/Worst: \(O(n \log n)\)
  • Space: \(O(1)\)

Bucket Sort

Distributes elements into buckets and sorts them individually.

def bucket_sort(arr):
    # Assumes values are floats in [0, 1); spread them across ten buckets
    bucket_count = 10
    bucket = [[] for _ in range(bucket_count)]

    for j in arr:
        index_b = int(bucket_count * j)
        bucket[index_b].append(j)

    for i in range(bucket_count):
        bucket[i] = sorted(bucket[i])

    k = 0
    for i in range(bucket_count):
        for j in range(len(bucket[i])):
            arr[k] = bucket[i][j]
            k += 1

arr = [0.897, 0.565, 0.656, 0.1234, 0.665]
bucket_sort(arr)
print(arr)

Complexity

  • Best/Average: \(O(n+k)\)
  • Worst: \(O(n^2)\)
  • Space: \(O(n+k)\)

Radix Sort

Sorts numbers digit by digit starting from the least significant digit.

def counting_sort(arr, exp):
    n = len(arr)
    output = [0] * n
    count = [0] * 10

    for i in range(n):
        index = arr[i] // exp
        count[index % 10] += 1

    for i in range(1, 10):
        count[i] += count[i - 1]

    i = n - 1
    while i >= 0:
        index = arr[i] // exp
        output[count[index % 10] - 1] = arr[i]
        count[index % 10] -= 1
        i -= 1

    for i in range(len(arr)):
        arr[i] = output[i]

def radix_sort(arr):
    max_val = max(arr)
    exp = 1
    while max_val // exp > 0:
        counting_sort(arr, exp)
        exp *= 10

arr = [170, 45, 75, 90, 802, 24, 2, 66]
radix_sort(arr)
print(arr)

Complexity

  • Best/Average/Worst: \(O(nk)\)
  • Space: \(O(n+k)\)

Conclusion

We’ve explored a wide range of data structures and sorting algorithms, discussing their Python implementations, and breaking down their time and space complexities. These foundational concepts are essential for any software developer to understand, and mastering them will improve your ability to choose the right tools and algorithms for a given problem.

Below is a table summarizing these complexities for the data structures we covered:

| Data Structure    | Access Time    | Search Time    | Insertion Time | Deletion Time  | Space Complexity |
|-------------------|----------------|----------------|----------------|----------------|------------------|
| List (Array)      | \(O(1)\)       | \(O(n)\)       | \(O(n)\)       | \(O(n)\)       | \(O(n)\)         |
| Stack             | \(O(n)\)       | \(O(n)\)       | \(O(1)\)       | \(O(1)\)       | \(O(n)\)         |
| Queue             | \(O(n)\)       | \(O(n)\)       | \(O(1)\)       | \(O(1)\)       | \(O(n)\)         |
| Set               | N/A            | \(O(1)\)       | \(O(1)\)       | \(O(1)\)       | \(O(n)\)         |
| Dictionary        | N/A            | \(O(1)\)       | \(O(1)\)       | \(O(1)\)       | \(O(n)\)         |
| Binary Tree (BST) | \(O(\log n)\)  | \(O(\log n)\)  | \(O(\log n)\)  | \(O(\log n)\)  | \(O(n)\)         |
| Heap (Binary)     | \(O(n)\)       | \(O(n)\)       | \(O(\log n)\)  | \(O(\log n)\)  | \(O(n)\)         |

Below is a quick summary of the time complexities of the sorting algorithms we covered:

| Algorithm      | Best Time Complexity | Average Time Complexity | Worst Time Complexity | Auxiliary Space |
|----------------|----------------------|-------------------------|-----------------------|-----------------|
| Bubble Sort    | \(O(n)\)             | \(O(n^2)\)              | \(O(n^2)\)            | \(O(1)\)        |
| Selection Sort | \(O(n^2)\)           | \(O(n^2)\)              | \(O(n^2)\)            | \(O(1)\)        |
| Insertion Sort | \(O(n)\)             | \(O(n^2)\)              | \(O(n^2)\)            | \(O(1)\)        |
| Merge Sort     | \(O(n \log n)\)      | \(O(n \log n)\)         | \(O(n \log n)\)       | \(O(n)\)        |
| Quick Sort     | \(O(n \log n)\)      | \(O(n \log n)\)         | \(O(n^2)\)            | \(O(\log n)\)   |
| Heap Sort      | \(O(n \log n)\)      | \(O(n \log n)\)         | \(O(n \log n)\)       | \(O(1)\)        |
| Bucket Sort    | \(O(n + k)\)         | \(O(n + k)\)            | \(O(n^2)\)            | \(O(n + k)\)    |
| Radix Sort     | \(O(nk)\)            | \(O(nk)\)               | \(O(nk)\)             | \(O(n + k)\)    |

Keep this table handy as a reference for making decisions on the appropriate sorting algorithm based on time and space constraints.

Happy coding!

Understanding TF-IDF in NLP

Introduction

In our previous post, we introduced One-Hot Encoding and the Bag-of-Words (BoW) model, which are simple methods of representing text as numerical vectors. While these techniques are foundational, they come with certain limitations. One major drawback of Bag-of-Words is that it treats all words equally—common words like “the” or “is” are given the same importance as more meaningful words like “science” or “NLP.”

TF-IDF (Term Frequency-Inverse Document Frequency) is an extension of BoW that aims to address this problem. By weighting words based on their frequency in individual documents versus the entire corpus, TF-IDF highlights more important words and reduces the impact of common, less meaningful ones.

TF-IDF

TF-IDF stands for Term Frequency-Inverse Document Frequency. It’s a numerical statistic used to reflect the importance of a word in a document relative to a collection of documents (a corpus). The formula is:

\[\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)\]

Where:

  • \(\text{TF}(t, d)\): Term Frequency of term \(t\) in document \(d\), which is the number of times \(t\) appears in \(d\).
  • \(\text{IDF}(t)\): Inverse Document Frequency, which measures how important \(t\) is across the entire corpus.

Term Frequency (TF)

Term Frequency (TF) measures how frequently a term appears in a document, normalized by the total number of terms in that document. The higher the frequency, the more relevant the word is assumed to be for that specific document.

\[\text{TF}(t, d) = \frac{\text{Number of occurrences of } t \text{ in } d}{\text{Total number of terms in } d}\]

For example, if the word “NLP” appears 3 times in a document of 100 words, the term frequency for “NLP” is:

\[\text{TF}(NLP, d) = \frac{3}{100} = 0.03\]

Inverse Document Frequency (IDF)

Inverse Document Frequency (IDF) downweights common words that appear in many documents and upweights rare words that are more meaningful in specific contexts. The formula is:

\[\text{IDF}(t) = \log\left(\frac{N}{1 + \text{DF}(t)}\right)\]

Where:

  • \(N\) is the total number of documents in the corpus.
  • \(\text{DF}(t)\) is the number of documents that contain the term \(t\).

The “+1” in the denominator is there to avoid division by zero. Words that appear in many documents (e.g., “is”, “the”) will have a lower IDF score, while rare terms will have higher IDF scores.

Example

Let’s take an example with two documents:

  1. Document 1: “I love NLP and NLP loves me”
  2. Document 2: “NLP is great and I enjoy learning NLP”

The combined vocabulary is:

["I", "love", "NLP", "and", "loves", "me", "is", "great", "enjoy", "learning"]

For simplicity, let’s calculate the TF and IDF for the term “NLP”.

  • TF for “NLP” in Document 1: The term “NLP” appears twice in Document 1, which has 7 words total, so:
\[\text{TF}(NLP, d_1) = \frac{2}{7} \approx 0.286\]
  • TF for “NLP” in Document 2: The term “NLP” appears twice in Document 2, which has 8 words total, so:
\[\text{TF}(NLP, d_2) = \frac{2}{8} = 0.25\]

Now, let’s calculate the IDF for “NLP”. Since “NLP” appears in both documents (2 out of 2 documents), the IDF is:

\[\text{IDF}(NLP) = \log\left(\frac{2}{1 + 2}\right) = \log\left(\frac{2}{3}\right) \approx -0.176\]

The negative value here shows that “NLP” is a very common term in this corpus, and its weight will be downscaled.
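
We can verify these numbers with a few lines of Python. This is a hand-rolled sketch of the formulas above (using log base 10, which matches the figures in this example), not a reproduction of what sklearn does internally:

import math

corpus = [
    "I love NLP and NLP loves me".lower().split(),
    "NLP is great and I enjoy learning NLP".lower().split(),
]

def tf(term, doc):
    # term frequency: occurrences of the term over the document length
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # inverse document frequency, with the +1 smoothing from the formula above
    df = sum(1 for doc in corpus if term in doc)
    return math.log10(len(corpus) / (1 + df))

for i, doc in enumerate(corpus, start=1):
    print(f"TF-IDF(nlp, d{i}) = {tf('nlp', doc) * idf('nlp', corpus):.4f}")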

Code Example: TF-IDF with TfidfVectorizer

Now let’s use TfidfVectorizer from sklearn to automatically calculate TF-IDF scores for our documents.

from sklearn.feature_extraction.text import TfidfVectorizer

# Define a corpus of documents
corpus = [
    "I love NLP and NLP loves me",
    "NLP is great and I enjoy learning NLP"
]

# Initialize the TF-IDF vectorizer
vectorizer = TfidfVectorizer()
# Fit and transform the corpus into TF-IDF vectors
X = vectorizer.fit_transform(corpus)

# Display the feature names (vocabulary)
print(vectorizer.get_feature_names_out())

# Display the TF-IDF matrix
print(X.toarray())

The output of this is:

['and' 'enjoy' 'great' 'is' 'learning' 'love' 'loves' 'me' 'nlp']
[[0.30253071 0.         0.         0.         0.         0.42519636 0.42519636 0.42519636 0.60506143]
 [0.27840869 0.39129369 0.39129369 0.39129369 0.39129369 0.         0.         0.         0.55681737]]

Each row in the output corresponds to a document, and each column corresponds to a term in the vocabulary. The values represent the TF-IDF score of each term for each document. Two things to note: the default tokenizer drops single-character tokens (which is why "I" is absent from the vocabulary), and TfidfVectorizer uses a smoothed IDF variant and L2-normalizes each row by default, so these scores won't exactly match a hand calculation with the formulas above.

Advantages of TF-IDF

  1. Balances Frequency: TF-IDF considers both how frequently a word appears in a document (term frequency) and how unique or common it is across all documents (inverse document frequency). This helps prioritize meaningful words.
  2. Reduces Impact of Stop Words: By downweighting terms that appear in many documents, TF-IDF naturally handles common stop words without needing to remove them manually.
  3. Efficient for Large Corpora: TF-IDF is computationally efficient and scales well to large datasets.

Limitations of TF-IDF

While TF-IDF is a significant improvement over simple Bag-of-Words, it still has some limitations:

  1. No Semantic Meaning: Like Bag-of-Words, TF-IDF treats words as independent features and doesn’t capture the relationships or meaning between them.
  2. Sparse Representations: Even with the IDF weighting, TF-IDF still generates high-dimensional and sparse vectors, especially for large vocabularies.
  3. Ignores Word Order: TF-IDF doesn’t account for word order, so sentences with the same words in different arrangements will have the same representation.

Conclusion

TF-IDF is a powerful and widely-used method for text representation, especially in tasks like document retrieval and search engines, where distinguishing between important and common words is crucial. However, as we’ve seen, TF-IDF doesn’t capture the meaning or relationships between words, which is where word embeddings come into play.

Turning Words into Vectors

Introduction

In our previous post, we covered the preprocessing steps necessary to convert text into a machine-readable format, like tokenization and stop word removal. But once the text is preprocessed, how do we represent it for use in machine learning models?

Before the rise of word embeddings, simpler techniques were commonly used to represent text as vectors. Today, we’ll explore two foundational techniques: One-Hot Encoding and Bag-of-Words (BoW). These methods don’t capture the semantic meaning of words as well as modern embeddings do, but they’re essential for understanding the evolution of Natural Language Processing (NLP).

One-Hot Encoding

One of the simplest ways to represent text is through One-Hot Encoding. In this approach, each word in a vocabulary is represented as a vector where all the elements are zero, except for a single element that corresponds to the word’s index.

Let’s take a small vocabulary:

["I", "love", "NLP"]

The vocabulary size is 3, and each word will be represented by a 3-dimensional vector:

I -> [1, 0, 0] 
love -> [0, 1, 0] 
NLP -> [0, 0, 1]

Each word is “hot” (1) in one specific position, while “cold” (0) everywhere else.

Example

Let’s generate one-hot encoded vectors using Python:

from sklearn.preprocessing import OneHotEncoder
import numpy as np

# Define a small vocabulary
vocab = ["I", "love", "NLP"]
# Reshape the data for OneHotEncoder
vocab_reshaped = np.array(vocab).reshape(-1, 1)

# Initialize the OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)
# Fit and transform the vocabulary
onehot_encoded = encoder.fit_transform(vocab_reshaped)
print(onehot_encoded)

The output shows the three indexed words:

[[1. 0. 0.]
 [0. 0. 1.]
 [0. 1. 0.]]

Each word has a unique binary vector marking its position in the vocabulary. Note that the rows follow the input order ("I", "love", "NLP") while the columns follow the encoder's lexicographically sorted categories ("I", "NLP", "love", since uppercase sorts before lowercase), which is why "love" maps to [0. 0. 1.] here rather than [0, 1, 0].

Drawbacks of One-Hot Encoding

One-Hot Encoding is simple but comes with some limitations:

  • High Dimensionality: For large vocabularies, the vectors become huge, leading to a “curse of dimensionality”.
  • Lack of Semantic Information: One-Hot vectors don’t capture any relationships between words. “love” and “like” would have completely different vectors, even though they are semantically similar.

Bag-of-Words (BoW)

One-Hot Encoding represents individual words, but what about whole documents or sentences? That’s where the Bag-of-Words (BoW) model comes in. In BoW, the text is represented as a vector of word frequencies.

BoW counts how often each word from a given vocabulary appears in the document, without considering the order of words (hence, a “bag” of words).

Let’s take two example sentences:

  1. “I love NLP”
  2. “NLP is amazing”

The combined vocabulary for these two sentences is:

["I", "love", "NLP", "is", "amazing"]

Now, using BoW, we represent each sentence as a vector of word counts:

  1. “I love NLP” -> [1, 1, 1, 0, 0] (since “I”, “love”, and “NLP” appear once, and “is” and “amazing” don’t appear)
  2. “NLP is amazing” -> [0, 0, 1, 1, 1] (since “NLP”, “is”, and “amazing” appear once, and “I” and “love” don’t appear)

Example

We can use CountVectorizer from the sklearn library to easily apply Bag-of-Words to a corpus of text:

from sklearn.feature_extraction.text import CountVectorizer

# Define a set of documents
corpus = [
    "I love NLP",
    "NLP is amazing"
]

# Initialize the CountVectorizer
vectorizer = CountVectorizer()
# Transform the corpus into BoW vectors
X = vectorizer.fit_transform(corpus)

# Display the feature names (the vocabulary)
print(vectorizer.get_feature_names_out())

# Display the BoW representation
print(X.toarray())

The output looks like this:

['amazing' 'is' 'love' 'nlp']
[[0 0 1 1]
 [1 1 0 1]]

Notice that "I" doesn't appear in the vocabulary: CountVectorizer's default tokenizer ignores single-character tokens, so the vectors here have four dimensions instead of the five in our manual example.

Limitations of Bag-of-Words

While BoW is a simple and powerful method, it too has its drawbacks:

  1. Sparsity: Like One-Hot Encoding, BoW produces high-dimensional and sparse vectors, especially for large vocabularies.
  2. No Word Order: BoW ignores word order. The sentence “I love NLP” is treated the same as “NLP love I”, which may not always make sense.
  3. No Semantic Relationships: Just like One-Hot Encoding, BoW doesn’t capture the meaning or relationships between words. All words are treated as independent features.

Conclusion

Both One-Hot Encoding and Bag-of-Words are simple and effective ways to represent text as numbers, but they have significant limitations, particularly in capturing semantic relationships and dealing with large vocabularies.

These methods laid the groundwork for more sophisticated representations like TF-IDF (which we’ll cover next) and eventually led to word embeddings, which capture the meaning and context of words more effectively.

Computers Understanding Text

Introduction

When working with Natural Language Processing (NLP), one of the first challenges you encounter is how to convert human-readable text into a format that machines can understand. Computers don’t natively understand words or sentences; they operate in numbers.

So, how do we get from words to something a machine can process?

This is where text preprocessing comes in.

Text preprocessing involves several steps to prepare raw text for analysis. In this post, we’ll walk through the foundational techniques in preprocessing: tokenization, lowercasing, removing stop words, and stemming/lemmatization. These steps ensure that our text is in a clean, structured format for further processing like word embeddings or more complex NLP models.

Tokenization: Breaking Down the Text

What is Tokenization?

Tokenization is the process of breaking a string of text into smaller pieces, usually words or subwords. In essence, it’s the process of splitting sentences into tokens, which are the basic units for further NLP tasks.

For example, consider the sentence:

"I love NLP!"

Tokenization would break this into:

["I", "love", "NLP", "!"]

This is a simple example where each token corresponds to a word or punctuation. However, tokenization can get more complex depending on the language and the task. For instance, some tokenizers split contractions like “can’t” into ["can", "'t"], while others might treat it as one token. Tokenization also becomes more challenging in languages that don’t have spaces between words, like Chinese or Japanese.

Code Example: Tokenization in Python

Let’s look at a basic example of tokenization using Python’s nltk library:

import nltk
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize

sentence = "I love NLP!"
tokens = word_tokenize(sentence)
print(tokens)

The output you can see is simply:

['I', 'love', 'NLP', '!']

Code Example: Sentence Tokenization

Tokenization can also occur at the sentence level, which means breaking down a paragraph or a larger body of text into individual sentences. This is helpful for tasks like summarization or sentiment analysis, where sentence boundaries matter.

from nltk.tokenize import sent_tokenize

text = "NLP is fun. It's amazing how machines can understand text!"
sentences = sent_tokenize(text)
print(sentences)

The output is now on the sentence boundary:

['NLP is fun.', "It's amazing how machines can understand text!"]

Lowercasing: Making Text Uniform

In English, the words “Dog” and “dog” mean the same thing, but to a computer, they are two different tokens. Lowercasing is a simple yet powerful step in text preprocessing. By converting everything to lowercase, we reduce the complexity of the text and ensure that words like “NLP” and “nlp” are treated identically.

We can achieve this simply with the .lower() method on a string.
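
For example, normalizing a list of tokens:

tokens = ["Dog", "NLP", "dog"]
print([token.lower() for token in tokens])  # ['dog', 'nlp', 'dog']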

This step becomes crucial when dealing with large text corpora, as it avoids treating different capitalizations of the same word as distinct entities.

Removing Stop Words: Filtering Out Common Words

Stop words are commonly used words that don’t carry significant meaning in many tasks, such as “and”, “the”, and “is”. Removing stop words helps reduce noise in the data and improves the efficiency of downstream models by focusing only on the meaningful parts of the text.

Many libraries provide lists of stop words, but the ideal list can vary depending on the task.

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
tokens = ["I", "love", "NLP", "!"]
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
print(filtered_tokens)

Only the meaningful words remain; the stop word "I" has been removed:

['love', 'NLP', '!']

Stemming and Lemmatization: Reducing Words to Their Root Forms

Another key step in preprocessing is reducing words to their base or root form. There are two common approaches:

Stemming: This cuts off word endings to get to the base form, which can be rough. For example, "running" is stemmed to "run", but an irregular form like "ran" is left untouched, as the output below shows.

Lemmatization: This is a more refined process that looks at the word’s context and reduces it to its dictionary form. For instance, “better” would be lemmatized to “good”.

Here’s an example using nltk for both stemming and lemmatization:

from nltk.stem import PorterStemmer

ps = PorterStemmer()
words = ["running", "runner", "ran"]
stemmed_words = [ps.stem(word) for word in words]
print(stemmed_words)

The output here:

['run', 'runner', 'ran']

An example of Lemmatization looks like this:

from nltk.stem import WordNetLemmatizer
import nltk
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
# The part-of-speech tag matters here: "better" only lemmatizes to "good" as an adjective
words = [("running", "v"), ("better", "a"), ("ran", "v")]
lemmatized_words = [lemmatizer.lemmatize(word, pos=pos) for word, pos in words]
print(lemmatized_words)

The output:

['run', 'good', 'run']

Conclusion

Text preprocessing is a crucial first step in any NLP project. By breaking down text through tokenization, making it uniform with lowercasing, and reducing unnecessary noise with stop word removal and stemming/lemmatization, we can create a clean and structured dataset for further analysis or model training. These steps form the foundation upon which more advanced techniques, such as word embeddings and machine learning models, are built.