
High Performance Linux IO with IO_URING

Introduction

IO_URING is an advanced asynchronous I/O interface introduced in the Linux kernel (version 5.1). It’s designed to provide significant performance improvements for I/O-bound applications, particularly those requiring high throughput and low latency.

It’s well worth taking a look at the Linux man pages for io_uring and having a read through the function interface.

In today’s article we’ll discuss IO_URING in depth and follow with some examples to see it in practice.

What is IO_URING?

IO_URING is a high-performance asynchronous I/O interface introduced in Linux kernel version 5.1. It was developed to address the limitations of traditional Linux I/O mechanisms like epoll, select, and aio. These earlier approaches often suffered from high overhead due to system calls, context switches, or inefficient batching, which limited their scalability in handling modern high-throughput and low-latency workloads.

At its core, IO_URING provides a ring-buffer-based mechanism for submitting I/O requests and receiving their completions, eliminating many inefficiencies in older methods. This allows applications to perform non-blocking, asynchronous I/O with minimal kernel involvement, making it particularly suited for applications such as databases, web servers, and file systems.

How does IO_URING work?

IO_URING’s architecture revolves around two primary shared memory ring buffers between user space and the kernel:

  1. Submission Queue (SQ):
    • The SQ is a ring buffer where applications enqueue I/O requests.
    • User-space applications write requests directly to the buffer without needing to call into the kernel for each operation.
    • The requests describe the type of I/O operation to be performed (e.g., read, write, send, receive).
  2. Completion Queue (CQ):
    • The CQ is another ring buffer where the kernel places the results of completed I/O operations.
    • Applications read from the CQ to retrieve the status of their submitted requests.

The interaction between user space and the kernel is simplified:

  • The user-space application adds entries to the Submission Queue and notifies the kernel when ready (via a single syscall like io_uring_enter).
  • The kernel processes these requests and posts results to the Completion Queue, which the application can read without additional syscalls.

Key Features

  1. Batching Requests:
    • Multiple I/O operations can be submitted in a single system call, significantly reducing syscall overhead.
  2. Zero-copy I/O:
    • Certain operations (like reads and writes) can leverage fixed buffers, avoiding unnecessary data copying between kernel and user space (see the sketch just after this list).
  3. Kernel Offloading:
    • The kernel can process requests in the background, allowing the application to continue without waiting.
  4. Efficient Polling:
    • Supports event-driven programming with low-latency polling mechanisms, reducing idle time in high-performance applications.
  5. Flexibility:
    • IO_URING supports a wide range of I/O operations, including file I/O, network I/O, and event notifications.
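Fixed buffers are worth a quick sketch. The idea, assuming a ring and file descriptor are already initialised as in the examples below, is to register buffers with the kernel once and then refer to them by index in later operations:

char buf[4096];
struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };

// register the buffer with the kernel once; it becomes index 0
io_uring_register_buffers(&ring, &iov, 1);

// a fixed-buffer write: the final argument is the registered buffer index
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_write_fixed(sqe, fd, buf, sizeof(buf), 0, 0);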

Code

Let’s get some code examples going to see exactly what we’re dealing with.

First of all, check that your kernel supports IO_URING. It should; it’s been available since kernel 5.1.

uname -r

You’ll also need liburing available to you in order to compile these examples.
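On most distributions this is packaged as liburing (with a -dev/-devel variant for the headers). When compiling, link against it; for example (the file name here is a placeholder):

gcc example.c -o example -luring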

Library setup

In this first example we won’t perform any I/O; we’ll just set up the library so that we can use these operations. All of our other examples will use this as a base.

We’ll need some basic I/O headers as well as liburing.h.

#include <liburing.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

We initialize our uring queue using io_uring_queue_init:

struct io_uring ring;
int ret;

// initialize IO_URING
if (io_uring_queue_init(8, &ring, 0) < 0) {
    perror("io_uring_queue_init");
    exit(1);
}

When we’re finished with the ring, we clean up with io_uring_queue_exit.

io_uring_queue_exit(&ring);
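Putting setup and teardown together, a minimal do-nothing skeleton looks like this:

#include <liburing.h>
#include <stdio.h>
#include <stdlib.h>

int main() {
    struct io_uring ring;

    // 8 is the submission queue depth; 0 means no special flags
    if (io_uring_queue_init(8, &ring, 0) < 0) {
        perror("io_uring_queue_init");
        exit(1);
    }

    // ... submit jobs and reap completions here ...

    io_uring_queue_exit(&ring);
    return 0;
}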

Simple Write

In this example, we’ll queue up a write of a string out to a file and that’s it.

First, we need to open the file like usual:

int fd = open(FILENAME, O_WRONLY | O_CREAT | O_TRUNC, 0644);
if (fd < 0) {
    perror("open");
    io_uring_queue_exit(&ring);
    exit(1);
}

Now, we set up the write job.

struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
if (!sqe) {
    fprintf(stderr, "io_uring_get_sqe failed\n");
    close(fd);
    io_uring_queue_exit(&ring);
    exit(1);
}

const char *message = MESSAGE;
struct iovec iov = {
    .iov_base = (void *)message,
    .iov_len = strlen(message)
};

io_uring_prep_writev(sqe, fd, &iov, 1, 0);

The io_uring_get_sqe function gets us the next available submission queue entry from the job queue. Once we have secured one of these, we fill a vector I/O structure (an iovec) with the details of our data. Here it’s just the data pointer and length.

Finally, we prepare a vector write request using io_uring_prep_writev.

We submit the job off to be processed now with io_uring_submit:

ret = io_uring_submit(&ring);
if (ret < 0) {
    perror("io_uring_submit");
    close(fd);
    io_uring_queue_exit(&ring);
    exit(1);
}

We can wait for the execution to complete; even more powerful, though, is that we can be off doing other things if we’d like!
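As an aside, if we’d rather poll than block, io_uring_peek_cqe is the non-blocking alternative; a quick sketch:

struct io_uring_cqe *cqe;

// returns 0 if a completion is ready, -EAGAIN if nothing is there yet
if (io_uring_peek_cqe(&ring, &cqe) == 0) {
    // ... process cqe->res here ...
    io_uring_cqe_seen(&ring, cqe);
}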

In order to wait for the job to finish, we use io_uring_wait_cqe:

struct io_uring_cqe *cqe;
ret = io_uring_wait_cqe(&ring, &cqe);
if (ret < 0) {
    perror("io_uring_wait_cqe");
    close(fd);
    io_uring_queue_exit(&ring);
    exit(1);
}

We check the result of the job through the io_uring_cqe structure filled by the io_uring_wait_cqe call:

if (cqe->res < 0) {
    fprintf(stderr, "Write failed: %s\n", strerror(-cqe->res));
} else {
    printf("Write completed successfully!\n");
}

Finally, we mark the uring event as consumed and close the file.

io_uring_cqe_seen(&ring, cqe);
close(fd);

The full example of this can be found here.

Multiple Operations

We can start to see some of the power of this system in this next example. We’ll submit multiple jobs for processing.

We’ve opened a source file for reading in src_fd and a destination file for writing in dest_fd.

// prepare a read operation
sqe = io_uring_get_sqe(&ring);
io_uring_prep_read(sqe, src_fd, buffer, BUF_SIZE, 0);

// submit the read request
io_uring_submit(&ring);
io_uring_wait_cqe(&ring, &cqe);

if (cqe->res < 0) {
    fprintf(stderr, "Read failed: %s\n", strerror(-cqe->res));
    io_uring_cqe_seen(&ring, cqe);
    goto cleanup;
}

// capture the byte count before the CQE is marked as consumed
int bytes_read = cqe->res;
io_uring_cqe_seen(&ring, cqe);

// prepare a write operation
sqe = io_uring_get_sqe(&ring);
io_uring_prep_write(sqe, dest_fd, buffer, bytes_read, 0);

// submit the write request
io_uring_submit(&ring);
io_uring_wait_cqe(&ring, &cqe);

if (cqe->res < 0) {
    fprintf(stderr, "Write failed: %s\n", strerror(-cqe->res));
} else {
    printf("Copy completed successfully!\n");
}
io_uring_cqe_seen(&ring, cqe);

So, this is still just sequentially executing multiple operations; we submit each job and wait for it in turn.

The full example of this can be found here.
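As an aside, IO_URING can chain operations itself: flagging an SQE with IOSQE_IO_LINK tells the kernel to start the next SQE only once this one completes, so the read and write above could go in with a single submission. A sketch, with the caveat that the write length must be chosen up front, since the read’s result isn’t known at submission time:

// read, linked to the SQE that follows it
sqe = io_uring_get_sqe(&ring);
io_uring_prep_read(sqe, src_fd, buffer, BUF_SIZE, 0);
io_uring_sqe_set_flags(sqe, IOSQE_IO_LINK);

// write, started by the kernel only after the read succeeds
sqe = io_uring_get_sqe(&ring);
io_uring_prep_write(sqe, dest_fd, buffer, BUF_SIZE, 0);

// one submission covers both operations
io_uring_submit(&ring);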

Asynchronous operations

Finally, we’ll write an example that will process multiple operations in parallel.
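Each in-flight read carries a small bookkeeping structure so that, on completion, we know which file and buffer it belongs to. The code below assumes something along these lines:

#define FILE_COUNT 3
#define BUF_SIZE   4096

struct io_request {
    int fd;        /* file being read */
    char *buffer;  /* destination buffer */
};

struct io_request requests[FILE_COUNT];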

The following for loop sets up 3 read jobs:

for (int i = 0; i < FILE_COUNT; i++) {
    int fd = open(files[i], O_RDONLY);
    if (fd < 0) {
        perror("open");
        io_uring_queue_exit(&ring);
        exit(1);
    }

    // Allocate a buffer for the read operation
    char *buffer = malloc(BUF_SIZE + 1); // +1 leaves room to NUL-terminate for printing
    if (!buffer) {
        perror("malloc");
        close(fd);
        io_uring_queue_exit(&ring);
        exit(1);
    }

    requests[i].fd = fd;
    requests[i].buffer = buffer;

    // Get an SQE (Submission Queue Entry)
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    if (!sqe) {
        fprintf(stderr, "Failed to get SQE\n");
        close(fd);
        free(buffer);
        io_uring_queue_exit(&ring);
        exit(1);
    }

    // Prepare a read operation
    io_uring_prep_read(sqe, fd, buffer, BUF_SIZE, 0);
    io_uring_sqe_set_data(sqe, &requests[i]);
}

All of the requests now get submitted for processing:

// Submit all requests
ret = io_uring_submit(&ring);
if (ret < 0) {
    perror("io_uring_submit");
    io_uring_queue_exit(&ring);
    exit(1);
}

Finally, we wait on each of the jobs to finish. The important thing to note here is that we could be busy doing other things rather than just waiting for these jobs to finish.

// wait for completions
for (int i = 0; i < FILE_COUNT; i++) {
    struct io_uring_cqe *cqe;
    ret = io_uring_wait_cqe(&ring, &cqe);
    if (ret < 0) {
        perror("io_uring_wait_cqe");
        io_uring_queue_exit(&ring);
        exit(1);
    }

    // Process the completed request
    struct io_request *req = io_uring_cqe_get_data(cqe);
    if (cqe->res < 0) {
        fprintf(stderr, "Read failed for file %d: %s\n", req->fd, strerror(-cqe->res));
    } else {
        req->buffer[cqe->res] = '\0'; // NUL-terminate so the buffer prints safely as a string
        printf("Read %d bytes from file descriptor %d:\n%s\n", cqe->res, req->fd, req->buffer);
    }
    }

    // Mark the CQE as seen
    io_uring_cqe_seen(&ring, cqe);

    // Clean up
    close(req->fd);
    free(req->buffer);
}

The entire example of this one can be found here.

Conclusion

IO_URING represents a transformative step in Linux asynchronous I/O, providing unparalleled performance and flexibility for modern applications. By minimizing syscall overhead, enabling zero-copy I/O, and allowing concurrent and batched operations, it has become a vital tool for developers working on high-performance systems.

Through the examples we’ve covered, you can see the practical power of IO_URING, from simple write operations to complex asynchronous processing. Its design not only simplifies high-throughput I/O operations but also opens up opportunities to optimize and innovate in areas like database systems, networking, and file handling.

SIMD

Introduction

SIMD (Single Instruction, Multiple Data) is a computing technique used in modern CPUs and GPUs to perform the same operation on multiple pieces of data simultaneously. SIMD instructions are critical for optimizing tasks in data-parallel applications, such as multimedia processing, scientific computing, and machine learning.

What is SIMD?

SIMD allows a single instruction to operate on multiple data elements in parallel. It is a subset of parallel computing focused on data-level parallelism. Traditional instructions operate on a single data element (Single Instruction, Single Data).

Most modern CPUs have SIMD instruction sets built into their architecture. These include:

  • Intel/AMD x86:
    • MMX (legacy)
    • SSE (Streaming SIMD Extensions)
    • AVX (Advanced Vector Extensions)
    • AVX-512 (latest in Intel’s Xeon and some desktop processors)
  • ARM:
    • NEON
  • PowerPC:
    • AltiVec (also known as VMX)
  • RISC-V:
    • Vector extensions.

When to Use SIMD

SIMD is ideal for applications with:

  1. Data Parallelism: Repeated operations on arrays or vectors (e.g., adding two arrays).
  2. Heavy Computation:
    • Multimedia processing (e.g., video encoding/decoding, image manipulation).
    • Scientific simulations (e.g., matrix operations).
    • Machine learning (e.g., tensor computations).
  3. Regular Data Access Patterns: Data laid out in contiguous memory blocks.

SIMD support in your CPU provides vector registers that store multiple data elements (e.g., 4 floats in a 128-bit register), and vectorised instructions operate on all of those elements simultaneously. SIMD requires aligned memory for optimal performance; misaligned data incurs penalties or falls back to scalar processing.
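In C, aligned storage is easy to request explicitly. A small sketch using C11’s aligned_alloc (note the size must be a multiple of the alignment):

#include <stdlib.h>

// 32-byte alignment suits the 256-bit AVX registers
float *data = aligned_alloc(32, 1024 * sizeof(float));

// ... use data ...

free(data);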

How to use it

Intel Intrinsics for AVX

The following example simply adds two vectors together, and prints the results out to the terminal.

#include <immintrin.h>
#include <stdio.h>
#include <stdlib.h>

// add two arrays of floats using AVX
void add_arrays(float* a, float* b, float* result) {
    __m256 vec_a = _mm256_load_ps(a);   // Load 8 floats into a vector register
    __m256 vec_b = _mm256_load_ps(b);   // Load 8 floats into another register
    __m256 vec_res = _mm256_add_ps(vec_a, vec_b); // SIMD addition
    _mm256_store_ps(result, vec_res);  // Store the result back to memory
}

int main() {
    // 8 floats per array, 32-byte aligned as _mm256_load_ps/_mm256_store_ps require
    float a[8] __attribute__((aligned(32))) = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f};
    float b[8] __attribute__((aligned(32))) = {8.0f, 7.0f, 6.0f, 5.0f, 4.0f, 3.0f, 2.0f, 1.0f};
    float c[8] __attribute__((aligned(32)));

    add_arrays(a, b, c);

    // print the result
    printf("Answer: ");
    for (int i = 0; i < 8; i++) {
        printf("%f ", c[i]);
    }
    printf("\n");

    return 0;
}

In order to compile this you need to use:

gcc -mavx test.c -o test
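Adding -g at compile time lets objdump interleave the original source with the disassembly, which is how the listing below was produced:

gcc -mavx -g test.c -o test
objdump -dS test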

When we disassemble this program, we can see evidence that the extended instruction set is being used:

    __m256 vec_a = _mm256_load_ps(a);   // Load 8 floats into a vector register
    118a:       c5 fc 29 44 24 c8       vmovaps %ymm0,-0x38(%rsp)
    1190:       48 8b 44 24 98          mov    -0x68(%rsp),%rax
    1195:       48 89 44 24 b8          mov    %rax,-0x48(%rsp)
    119a:       48 8b 44 24 b8          mov    -0x48(%rsp),%rax
    119f:       c5 fc 28 00             vmovaps (%rax),%ymm0
    __m256 vec_b = _mm256_load_ps(b);   // Load 8 floats into another register
    11a3:       c5 fc 29 44 24 e8       vmovaps %ymm0,-0x18(%rsp)
    11a9:       c5 fc 28 44 24 c8       vmovaps -0x38(%rsp),%ymm0
    11af:       c5 fc 29 44 24 48       vmovaps %ymm0,0x48(%rsp)
    11b5:       c5 fc 28 44 24 e8       vmovaps -0x18(%rsp),%ymm0
    11bb:       c5 fc 29 44 24 68       vmovaps %ymm0,0x68(%rsp)

Compiler Auto-Vectorisation

SIMD is so common these days that you could just write the code above in plain old C:

void add_arrays(float* a, float* b, float* result, int n) {
    for (int i = 0; i < n; i++) {
        result[i] = a[i] + b[i];
    }
}

If you compile this code with -O3 (or, with newer versions of GCC, -O2), you’ll find that vectorisation gets enabled.
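You can also ask GCC to report which loops it vectorised; for example (the file name is a placeholder):

gcc -O3 -fopt-info-vec -c add_arrays.c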

Without any optimisation, we get the following:

void add_arrays(float* a, float* b, float* result, int n) {
    1159:       55                      push   %rbp
    115a:       48 89 e5                mov    %rsp,%rbp
    115d:       48 89 7d e8             mov    %rdi,-0x18(%rbp)
    1161:       48 89 75 e0             mov    %rsi,-0x20(%rbp)
    1165:       48 89 55 d8             mov    %rdx,-0x28(%rbp)
    1169:       89 4d d4                mov    %ecx,-0x2c(%rbp)
  for (int i = 0; i < n; i ++) {
    116c:       c7 45 fc 00 00 00 00    movl   $0x0,-0x4(%rbp)
    1173:       eb 50                   jmp    11c5 <add_arrays+0x6c>
    result[i] = a[i] + b[i];
    1175:       8b 45 fc                mov    -0x4(%rbp),%eax
    1178:       48 98                   cltq
    117a:       48 8d 14 85 00 00 00    lea    0x0(,%rax,4),%rdx
    1181:       00 
    1182:       48 8b 45 e8             mov    -0x18(%rbp),%rax
    1186:       48 01 d0                add    %rdx,%rax
    1189:       f3 0f 10 08             movss  (%rax),%xmm1
    118d:       8b 45 fc                mov    -0x4(%rbp),%eax
    1190:       48 98                   cltq
    1192:       48 8d 14 85 00 00 00    lea    0x0(,%rax,4),%rdx
    1199:       00 
    119a:       48 8b 45 e0             mov    -0x20(%rbp),%rax
    119e:       48 01 d0                add    %rdx,%rax
    11a1:       f3 0f 10 00             movss  (%rax),%xmm0
    11a5:       8b 45 fc                mov    -0x4(%rbp),%eax
    11a8:       48 98                   cltq
    11aa:       48 8d 14 85 00 00 00    lea    0x0(,%rax,4),%rdx
    11b1:       00 
    11b2:       48 8b 45 d8             mov    -0x28(%rbp),%rax
    11b6:       48 01 d0                add    %rdx,%rax
    11b9:       f3 0f 58 c1             addss  %xmm1,%xmm0
    11bd:       f3 0f 11 00             movss  %xmm0,(%rax)
  for (int i = 0; i < n; i ++) {
    11c1:       83 45 fc 01             addl   $0x1,-0x4(%rbp)
    11c5:       8b 45 fc                mov    -0x4(%rbp),%eax
    11c8:       3b 45 d4                cmp    -0x2c(%rbp),%eax
    11cb:       7c a8                   jl     1175 <add_arrays+0x1c>
  }
}

The movss and addss instructions are indeed SIMD-family instructions, but they only operate on scalar values, one float at a time.

Now, if we turn the optimisation up, you’ll notice that the SIMD instructions start working on packed values.

    result[i] = a[i] + b[i];
    1260:       0f 10 04 07             movups (%rdi,%rax,1),%xmm0
    1264:       0f 10 14 06             movups (%rsi,%rax,1),%xmm2
    1268:       0f 58 c2                addps  %xmm2,%xmm0
    126b:       0f 11 04 02             movups %xmm0,(%rdx,%rax,1)

Packed instructions like addps operate on 4 floats at a time here; the wider AVX and AVX-512 forms extend this to 8 or 16.

Assembly

If you really feel the need to get that extra bit of power, you can crack out the assembly language yourself and have a go.

movaps xmm0, [a]     ; Load 4 floats from a
movaps xmm1, [b]     ; Load 4 floats from b
addps xmm0, xmm1     ; Add packed floats
movaps [result], xmm0; Store the result

For the work that it’s doing, this is very tidy code.

High Level Libraries

Finally, there are a number of high-level libraries that industrialise the usage of SIMD instructions really well. Using them makes these operations much easier to write!

  • Eigen (C++): Matrix and vector math.
  • NumPy (Python): Uses SIMD internally via BLAS.
  • OpenCV (C++): SIMD-optimized image processing.

Challenges with SIMD

Branching can be an issue, as SIMD struggles with divergent execution paths (e.g., if statements).

Alignment requirements are quite strict for optimal performance. SIMD often requires data to be aligned to specific byte boundaries (e.g., 16 bytes for SSE, 32 bytes for AVX).

SIMD operates on a fixed number of elements per operation, determined by the vector register width, so throughput scales in fixed steps rather than smoothly with problem size.

Code written with specific intrinsics or assembly may not run on CPUs with different SIMD instruction sets. So, if you’re not using one of those higher level libraries - portability can be an issue.
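One mitigation, short of reaching for a library, is to detect CPU features at runtime and dispatch to an appropriate code path. With GCC or Clang, a sketch can be as simple as:

#include <stdio.h>

int main() {
    __builtin_cpu_init(); // populate the CPU feature flags

    if (__builtin_cpu_supports("avx")) {
        printf("AVX available: take the vectorised path\n");
    } else {
        printf("No AVX: fall back to scalar code\n");
    }

    return 0;
}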

Conclusion

SIMD is a powerful tool for optimizing performance in data-parallel applications, allowing modern CPUs and GPUs to handle repetitive tasks more efficiently. By leveraging intrinsics, compiler optimizations, or high-level libraries, developers can unlock significant performance gains with relatively little effort.

However, like any optimization, SIMD has its challenges, such as branching, memory alignment, and portability. Understanding these limitations and balancing them with the benefits is key to effectively integrating SIMD into your projects.

Whether you’re working on scientific simulations, multimedia processing, or machine learning, SIMD offers a compelling way to accelerate your computations. Start small, experiment with intrinsics or auto-vectorization, and explore the high-level libraries to see how SIMD can transform your application’s performance.

Waving Flag Animation

Introduction

In a previous post we made a simple water droplet demonstration. This is all built on the VGA work that we’ve already done.

In today’s post, we’ll use this information again and put a waving flag demo together.

The effect that we’re looking to produce should look something like this:

The Idea

The waving flag effect simulates a piece of fabric rippling in the wind. Here’s the high-level approach:

  1. Flag Bitmap: Create a bitmap with alternating horizontal stripes of green and white.
  2. Wave Dynamics: Use trigonometric functions to displace pixels horizontally and vertically, simulating waves.
  3. Buffering: Use an offscreen buffer to avoid flickering during rendering.

Building the Flag

The flag is a simple bitmap composed of horizontal stripes alternating between green and white. Here’s how it’s generated:

uint8_t* make_flag() {
  int width = 200, height = 100;
  uint8_t *buffer = (uint8_t*) malloc(width * height);

  for (int y = 0; y < height; y++) {
    uint8_t col = (y / 10) % 2 == 0 ? 2 : 15; // Green and white stripes
    memset(buffer + (y * width), col, width);
  }

  return buffer;
}

Each stripe spans 10 rows, and the alternating colors give the flag a distinctive look.

Adding the Wave Effect

The waving effect is achieved by modifying the position of each pixel based on sine and cosine functions. Here’s the core logic:

void draw_flag(double theta, int x, int y, int width, int height, uint8_t *flag, uint8_t *buffer) {
  double t = 0;
  for (int xx = 0; xx < width; xx++) {
    t = theta;
    double wave_offset = 5 * sin(theta + xx * 0.1);
    for (int yy = 0; yy < height; yy++) {
      uint16_t px = xx + wave_offset;
      uint16_t py = yy + (2 * sin((theta + xx) * 0.2));

      // screen offset: (px + x) + ((py + y) * 320)
      uint16_t o = (px + x) + ((py + y) << 8) + ((py + y) << 6);
      // flag offset: xx + (yy * 200)
      uint16_t f = xx + (((yy << 6) + (yy << 5) + (yy << 2)) << 1);

      // colour-cycling value (computed but unused in this version)
      uint8_t col = 25 + (3 * sin(t)) + (6 * cos(t));

      // write the pixel and its neighbours so the displacement leaves no holes
      buffer[o] = flag[f];
      buffer[o-1] = flag[f];
      buffer[o+1] = flag[f];
      buffer[o-320] = flag[f];
      buffer[o+320] = flag[f];

      buffer[o-319] = flag[f];
      buffer[o-321] = flag[f];
      buffer[o+319] = flag[f];
      buffer[o+321] = flag[f];
      t += 0.1;
    }
  }
}

Key Features:

  • Wave Dynamics: The wave_offset creates a horizontal ripple effect based on sin(theta + xx * 0.1). A secondary vertical ripple adds realism.
  • Boundary Assumptions: The flag is positioned well inside the 320×200 screen, so displaced pixels stay within bounds.
  • Direct Pixel Copy: Pixels are copied from the flag bitmap to the appropriate buffer position.
  • Redundant Pixel Rendering: We render to all surrounding cells so the displacement doesn’t leave holes in the flag

Main Loop

The main loop ties everything together, handling synchronization, rendering, and input:

int main() {
  uint8_t *back_buffer = (uint8_t *)malloc(64000);
  uint8_t *flag = make_flag();
  double theta = 0;

  set_mcga();
  clear_buffer(0x00, back_buffer);

  while (!kbhit()) {
    draw_flag(theta, 20, 20, 200, 100, flag, back_buffer);

    wait_vsync();
    copy_buffer(vga, back_buffer);
    clear_buffer(0x00, back_buffer);

    theta += 0.1; // Animate the wave
  }

  free(flag);
  free(back_buffer);
  set_text();

  return 0;
}

Highlights:

  1. Synchronization: The wait_vsync() call ensures smooth updates.
  2. Animation: The theta value incrementally changes, creating continuous movement.
  3. Keyboard Interrupt: The kbhit() function allows the user to exit gracefully.

Conclusion

This waving flag effect combines simple algorithms with creative use of VGA mode 13h to create a visually stunning effect. By leveraging trigonometry and efficient buffer handling, we replicate the mesmerizing motion of a flag in the wind.

You can find the complete code on GitHub as a gist.

Try it out, tweak the parameters, and share your own effects! There’s a lot of joy in creating beautiful visuals with minimal resources.

How Modern Compilers Optimise Recursive Algorithms

Introduction

Modern compilers are incredibly sophisticated, capable of transforming even the most inefficient code into highly optimized machine instructions. Recursive algorithms, often seen as elegant yet potentially costly in terms of performance, present a fascinating case study for these optimizations. From reducing function call overhead to transforming recursion into iteration, compilers employ a range of techniques that balance developer productivity with runtime efficiency.

In this article, we’ll explore how GCC optimizes recursive algorithms. We’ll examine key techniques such as tail-call optimization, stack management, and inlining through a simple, easy to understand example. By the end, you’ll have a clearer understanding of the interplay between recursive algorithms and compiler optimizations, equipping you to write code that performs better while retaining clarity.

Factorial

The first example that we’ll look at is calculating a factorial.

int factorial(int n, int acc) {
  if (n == 0) {
    return acc;
  }

  return factorial(n - 1, n * acc);
}

This block of code is fairly simple: n is the number whose factorial we want to calculate, and acc is the accumulator that makes the call tail-recursive, which is exactly the behaviour we’re looking to optimise.
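For reference, the annotated listings below can be reproduced with something like the following, swapping the optimisation flag per section (exact output will vary with your GCC version):

gcc -O0 -g -c factorial.c
objdump -dS factorial.o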

-O0

First of all, we’ll compile this function with -O0 (no optimisation):

int factorial(int n, int acc) {
   0:   55                      push   %rbp
   1:   48 89 e5                mov    %rsp,%rbp
   4:   48 83 ec 10             sub    $0x10,%rsp
   8:   89 7d fc                mov    %edi,-0x4(%rbp)
   b:   89 75 f8                mov    %esi,-0x8(%rbp)
  if (n == 0) {
   e:   83 7d fc 00             cmpl   $0x0,-0x4(%rbp)
  12:   75 05                   jne    19 <factorial+0x19>
    return acc;
  14:   8b 45 f8                mov    -0x8(%rbp),%eax
  17:   eb 16                   jmp    2f <factorial+0x2f>
  }

  return factorial(n - 1, n * acc);
  19:   8b 45 fc                mov    -0x4(%rbp),%eax
  1c:   0f af 45 f8             imul   -0x8(%rbp),%eax
  20:   8b 55 fc                mov    -0x4(%rbp),%edx
  23:   83 ea 01                sub    $0x1,%edx
  26:   89 c6                   mov    %eax,%esi
  28:   89 d7                   mov    %edx,%edi
  2a:   e8 00 00 00 00          call   2f <factorial+0x2f>
}
  2f:   c9                      leave
  30:   c3                      ret

The compiler generates straightforward assembly that closely follows the original C code. No optimizations are applied to reduce function call overhead or improve performance. You would use this level of optimisation (or lack thereof) when you’re debugging and a straightforward translation of your code is useful.

Stack operations (push, mov, sub, etc.) are explicitly performed for each recursive call. This results in the largest amount of assembly code and higher function call overhead.

-O1

Next, we’ll re-compile this function at -O1 which will give us basic optimisations:

int factorial(int n, int acc) {
   0:   89 f0                   mov    %esi,%eax
  if (n == 0) {
   2:   85 ff                   test   %edi,%edi
   4:   75 01                   jne    7 <factorial+0x7>
    return acc;
  }

  return factorial(n - 1, n * acc);
}
   6:   c3                      ret
int factorial(int n, int acc) {
   7:   48 83 ec 08             sub    $0x8,%rsp
  return factorial(n - 1, n * acc);
   b:   0f af c7                imul   %edi,%eax
   e:   89 c6                   mov    %eax,%esi
  10:   83 ef 01                sub    $0x1,%edi
  13:   e8 00 00 00 00          call   18 <factorial+0x18>
}
  18:   48 83 c4 08             add    $0x8,%rsp
  1c:   c3                      ret

The first thing to notice here is the stack management at the start of the function.

-O0:

push   %rbp
mov    %rsp,%rbp
sub    $0x10,%rsp

The stack frame is explicitly set up and torn down for every function call, regardless of whether it is needed. This includes saving the base pointer and reserving 16 bytes of stack space.

We then have slower execution due to redundant stack operations and higher memory overhead.

-O1:

sub    $0x8,%rsp

The stack frame is more compact, reducing overhead. The base pointer (%rbp) is no longer saved, as it’s not strictly necessary. This gives us reduced stack usage and faster function calls.

Next up, we see optimisations around tail-call optimisation (TCO).

-O0:

call   2f <factorial+0x2f>

Recursive calls are handled traditionally, with each call creating a new stack frame.

-O1:

call   18 <factorial+0x18>

While -O1 still retains recursion, it simplifies the process by preparing for tail-call optimization. Unnecessary operations before and after the call are eliminated.

We also see some arithmetic simplification between the optimisation levels:

-O0:

mov    -0x4(%rbp),%eax
imul   -0x8(%rbp),%eax
sub    $0x1,%edx

Arithmetic operations explicitly load and store intermediate results in memory, reflecting a direct translation of the high-level code.

-O1:

imul   %edi,%eax
sub    $0x1,%edi

Intermediate results are kept in registers (%eax, %edi), avoiding unnecessary memory access.

There’s also some instruction elimination between the optimisation levels:

-O0:

mov    -0x8(%rbp),%eax
mov    %eax,%esi
mov    -0x4(%rbp),%edx

Each variable is explicitly loaded from the stack and moved between registers, leading to redundant instructions.

-O1:

mov    %esi,%eax

The compiler identifies that some operations are unnecessary and eliminates them, reducing instruction count.

We finish off with a return path optimisation.

-O0:

leave
ret

Explicit leave and ret instructions are used to restore the stack and return from the function.

-O1:

ret

The leave instruction is eliminated as it’s redundant when the stack frame is managed efficiently.

With reduced stack overhead and fewer instructions, the function executes faster and consumes less memory at -O1 compared to -O0. Now we’ll see if we can squeeze things even further.

-O2

We re-compile the same function again, turning optimisations up to -O2. The resulting generated code is this:

int factorial(int n, int acc) {
   0:   89 f0                   mov    %esi,%eax
  if (n == 0) {
   2:   85 ff                   test   %edi,%edi
   4:   74 28                   je     2e <factorial+0x2e>
   6:   8d 57 ff                lea    -0x1(%rdi),%edx
   9:   40 f6 c7 01             test   $0x1,%dil
   d:   74 11                   je     20 <factorial+0x20>
    return acc;
  }

  return factorial(n - 1, n * acc);
   f:   0f af c7                imul   %edi,%eax
  12:   89 d7                   mov    %edx,%edi
  if (n == 0) {
  14:   85 d2                   test   %edx,%edx
  16:   74 17                   je     2f <factorial+0x2f>
  18:   0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
  1f:   00 
  return factorial(n - 1, n * acc);
  20:   0f af c7                imul   %edi,%eax
  23:   8d 57 ff                lea    -0x1(%rdi),%edx
  26:   0f af c2                imul   %edx,%eax
  if (n == 0) {
  29:   83 ef 02                sub    $0x2,%edi
  2c:   75 f2                   jne    20 <factorial+0x20>
}
  2e:   c3                      ret
  2f:   c3                      ret

First we see some instruction-level parallelism here.

-O2 introduces techniques that exploit CPU-level parallelism. This is visible in the addition of the lea (load effective address) instruction and conditional branching.

-O1:

imul   %edi,%eax
sub    $0x1,%edi
call   18 <factorial+0x18>

-O2:

lea    -0x1(%rdi),%edx
imul   %edi,%eax
mov    %edx,%edi
test   %edx,%edx
jne    20 <factorial+0x20>

At -O2, the compiler begins precomputing values and uses lea to reduce instruction latency. The conditional branch (test and jne) avoids unnecessary function calls by explicitly checking the termination condition.

Next, we see the compiler partially unrolling the loop.

-O1 Recursion is preserved:

call   18 <factorial+0x18>

-O2 Loop structure replaces recursion:

imul   %edi,%eax
lea    -0x1(%rdi),%edx
sub    $0x2,%edi
jne    20 <factorial+0x20>

The recursion is transformed into a loop-like structure that uses the jne (jump if not equal) instruction to iterate until the base case is met. This eliminates much of the overhead associated with recursive function calls, such as managing stack frames.

More redundant operations are removed from the code, such as instructions for saving and restoring registers. This is particularly noticeable in how the return path is optimized.

-O1:

add    $0x8,%rsp
ret

-O2:

ret

-O2 eliminates the need for stack pointer adjustments because the compiler reduces the stack usage overall.

Finally, we see some more sophisticated conditional simplifications.

-O1:

test   %edi,%edi
jne    7 <factorial+0x7>

-O2:

test   %edi,%edi
je     2e <factorial+0x2e>

Instead of jumping to a label and performing additional instructions, -O2 jumps directly to the return sequence (2e <factorial+0x2e>). This improves branch prediction and minimizes unnecessary operations.

These transformations further reduce the number of instructions executed per recursive call, optimizing runtime efficiency while minimizing memory footprint.

-O3

When we re-compile this code with -O3, we notice that the output code is identical to -O2. This suggests that the compiler had already exhausted the performance opportunities at the previous optimisation level.

This highlights an important point: not all functions benefit from the most aggressive optimization level.

The factorial function is simple and compact, meaning that the optimizations applied at -O2 (tail-recursion transformation, register usage, and instruction alignment) have already maximized its efficiency. -O3 doesn’t introduce further changes because:

  • The function is too small to benefit from aggressive inlining.
  • There are no data-parallel computations that could take advantage of SIMD instructions.
  • Loop unrolling is unnecessary since the tail-recursion has already been transformed into a loop.

For more complex code, -O3 often shines by extracting additional performance through aggressive heuristics, but in cases like this, the improvements plateau at -O2.

Conclusion

Recursive algorithms can often feel like a trade-off between simplicity and performance, but modern compilers significantly narrow this gap. By employing advanced optimizations such as tail-call elimination, inline expansion, and efficient stack management, compilers make it possible to write elegant, recursive solutions without sacrificing runtime efficiency.

Through the examples in this article, we’ve seen how these optimizations work in practice, as well as their limitations. Understanding these techniques not only helps you write better code but also deepens your appreciation for the compilers that turn your ideas into reality. Whether you’re a developer crafting algorithms or just curious about the magic happening behind the scenes, the insights from this exploration highlight the art and science of compiler design.

Water droplet demo

Introduction

Visual effects like water droplets are mesmerizing, and they showcase how simple algorithms can produce complex, beautiful animations. In this article, I’ll walk you through creating a water droplet effect using VGA mode 13h.

We’ll rely on some of the code that we developed in the VGA routines from Watcom C article for VGA setup and utility functions, focusing on how to implement the effect itself.

The effect that we’re looking to produce should look something like this:

The Idea

The water droplet effect simulates circular ripples spreading out from random points on the screen. Here’s the high-level approach:

  1. Drops: Represent each drop with a structure containing its position, energy, and ripple generation.
  2. Drawing Ripples: Use trigonometry to create circular patterns for each ripple generation.
  3. Blur Effect: Smooth the buffer to simulate water’s fluid motion.
  4. Palette: Set a blue-themed palette to enhance the watery feel.

Setting the Water Palette

First, we set a blue-tinted gradient palette. Blue stays at maximum while red and green ramp up with the index, so colours transition from deep blue through to near-white.

void set_water_palette() {
  uint16_t i;
  uint8_t r, g, b;

  for (i = 0; i < 256; i++) {
      r = i >> 2;  // red ramps up with the index (scaled to the DAC's 0-63 range)
      g = i >> 2;  // green ramps up too
      b = 63;      // blue pinned at maximum

      set_palette(i, r, g, b);
  }
}
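The set_palette call comes from the earlier VGA article. For reference, a typical implementation writes the VGA DAC registers directly; a sketch, assuming Watcom’s outp:

#include <conio.h>
#include <stdint.h>

// program one DAC entry: write the index to 0x3c8,
// then the 6-bit r, g, b components (0-63) to 0x3c9
void set_palette(uint8_t i, uint8_t r, uint8_t g, uint8_t b) {
  outp(0x3c8, i);
  outp(0x3c9, r);
  outp(0x3c9, g);
  outp(0x3c9, b);
}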

Representing Drops

Each drop is represented by a structure that tracks:

  • (x, y): The origin of the drop.
  • e: Energy, which fades with time.
  • g: Current ripple generation.

struct drop {
  int x;    /* original x-coordinate */
  int y;    /* original y-coordinate */
  int e;    /* energy left in the drop */
  int g;    /* current generation */
};

struct drop drops[N_DROPS];

Creating and Advancing Drops

Drops are reset with random positions, maximum energy, and zero ripple generation:

void reset_drop(struct drop *d) {
  d->x = rand() % 320;
  d->y = rand() % 200;
  d->e = 200;
  d->g = 0;
}

Each frame, we reduce the drop’s energy and increment its generation. When energy is exhausted, the drop stops producing ripples:

void advance_drop(struct drop *d) {
  if (d->e > 0) {
    d->e--;
    d->g++;
  } else {
    d->g = 0;
  }
}

Drawing Ripples

Ripples are drawn using polar coordinates. We calculate x and y offsets using cosine and sine functions for each angle and scale by the current generation.

void draw_drop(struct drop *d, uint8_t *buffer) {
  // if this droplet still has some energy
  if (d->e > 0) {
    // 0 to 2π
    for (float rad = 0.0f; rad < 6.28f; rad += 0.05f) { 
      // x, y co-ordinates to go around the circle
      int xx = (int)(cos(rad) * (float)d->g);
      int yy = (int)(sin(rad) * (float)d->g);

      // translate them into the field
      xx += d->x;
      yy += d->y;

      // clip them to the visible field
      if ((xx >= 0) && (xx < 320) && (yy >= 0) && (yy < 200)) {
        uint16_t offset = xx + (yy << 6) + (yy << 8);  // VGA offset
        uint16_t c = buffer[offset];

        // clamp the pixel colour to 255
        if ((c + d->e) > 255) {
          c = 255;
        } else {
          c += d->e;
        }

        // set the pixel
        buffer[offset] = c;
      }
    }
  }
}

The colour rendered to the buffer is additive: we take the current colour at the pixel position and add to it, giving the droplets a sense of collision when they overlap.

Simulating Fluid Motion

A blur effect smooths the ripples, blending them into neighboring pixels for a more fluid appearance. This is done by averaging surrounding pixels.

void blur_buffer(uint8_t *buffer) {
  memset(buffer, 0, 320);         // Clear top border
  memset(buffer + 63680, 0, 320); // Clear bottom border

  // start one pixel in and stop one early so i±321 stays inside the buffer
  for (uint16_t i = 321; i < 63679; i++) {
    buffer[i] = (
        buffer[i - 321] + buffer[i - 320] + buffer[i - 319] +
        buffer[i - 1]   + buffer[i + 1]   +
        buffer[i + 319] + buffer[i + 320] + buffer[i + 321]
    ) >> 3;  // Average of 8 neighbors
  }
}

Main Loop

The main loop handles:

  1. Adding new drops randomly.
  2. Advancing and drawing existing drops.
  3. Applying the blur effect.
  4. Rendering the buffer to the VGA screen.

int main() {
  uint8_t *back_buffer = (uint8_t *)malloc(64000);
  uint8_t drop_index = 0;

  set_mcga();              // Switch to VGA mode
  set_water_palette();     // Set blue gradient
  clear_buffer(0x00, back_buffer); // Clear the back buffer

  while (!kbhit()) { // Continue until a key is pressed

    // Randomly reset a drop
    if ((rand() % 10) == 0) {
      reset_drop(&drops[drop_index]);

      drop_index++;
      drop_index %= N_DROPS;
    }

    // Process and draw each drop
    for (int i = 0; i < N_DROPS; i++) {
      advance_drop(&drops[i]);
      draw_drop(&drops[i], back_buffer);
    }

    blur_buffer(back_buffer);   // Apply the blur effect

    wait_vsync();               // Synchronize with vertical refresh
    copy_buffer(vga, back_buffer); // Copy back buffer to screen
    clear_buffer(0x00, back_buffer); // Clear back buffer for next frame
  }

  free(back_buffer);
  set_text(); // Return to text mode

  return 0;
}

Conclusion

This water droplet effect combines simple algorithms with creative use of VGA mode 13h to create a visually stunning effect. By leveraging circular ripples, energy fading, and a blur filter, we replicate the mesmerizing motion of water.

You can find the complete code on GitHub as a gist.

Try it out, tweak the parameters, and share your own effects! There’s a lot of joy in creating beautiful visuals with minimal resources.