Intercepting Linux Syscalls with Kernel Probes

22 Jan 2025

Introduction

In this tutorial, we will explore how to write a Linux kernel module that intercepts system calls using kernel probes (kprobes).

Instead of modifying the syscall table—a risky and outdated approach—we will use kprobes, an officially supported and safer method to trace and modify kernel behavior dynamically.

What Are System Calls?

System calls are the primary mechanism by which user-space applications interact with the operating system’s kernel. They provide a controlled gateway to hardware and kernel services. For example, opening a file uses the open syscall, while reading data from it uses the read syscall.

What Are Kernel Probes?

Kprobes are a powerful debugging and tracing mechanism in the Linux kernel. They allow developers to dynamically intercept and inject logic into almost any kernel function, including system calls. Kprobes work by placing breakpoints at specific addresses in kernel code, redirecting execution to custom handlers.

Using kprobes, you can intercept system calls like close to log parameters, modify behavior, or gather debugging information, all without modifying the syscall table or kernel memory structures.

The Code

There are a few preparation steps before we can do Linux kernel module development. If your system is already set up for this, you can skip the first section here.

Before we start, remember to do this in a safe environment. Use a virtual machine or a disposable system for development. Debugging kernel modules can lead to crashes or instability.

Prerequisites

First up, we need to install the prerequisite software in order to write and build modules:

sudo apt-get install build-essential linux-headers-$(uname -r)

Module code

Now we can write some code that will actually be our kernel module.

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/kprobes.h>

MODULE_LICENSE("GPL");

static struct kprobe kp = {
    .symbol_name = "__x64_sys_close",
};

static int handler_pre(struct kprobe *p, struct pt_regs *regs) {
    printk(KERN_INFO "Intercepted close syscall: fd=%ld\n", regs->di);
    return 0;
}

static int __init kprobe_init(void) {
    int ret;

    kp.pre_handler = handler_pre;
    ret = register_kprobe(&kp);
    if (ret < 0) {
        printk(KERN_ERR "register_kprobe failed, returned %d\n", ret);
        return ret;
    }

    printk(KERN_INFO "Kprobe registered\n");
    return 0;
}

static void __exit kprobe_exit(void) {
    unregister_kprobe(&kp);
    printk(KERN_INFO "Kprobe unregistered\n");
}

module_init(kprobe_init);
module_exit(kprobe_exit);

Breakdown

First up, we have our necessary headers for kernel development and the module license:

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/kprobes.h>

MODULE_LICENSE("GPL");

This ensures compatibility with GPL-only kernel symbols and enables proper loading of the module.

Next, the kprobe structure defines the function to be intercepted by specifying its symbol name. Here, we target __x64_sys_close:

static struct kprobe kp = {
    .symbol_name = "__x64_sys_close",
};

This tells the kernel which function to monitor dynamically.

The handler_pre function is executed before the intercepted function runs. It logs the file descriptor (fd) argument passed to the close syscall:

static int handler_pre(struct kprobe *p, struct pt_regs *regs) {
    printk(KERN_INFO "Intercepted close syscall: fd=%ld\n", regs->di);
    return 0;
}

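On paper, regs->di holds the first argument under the x86-64 calling convention. On kernels built with syscall wrappers (x86-64 since Linux 4.17), however, __x64_sys_close takes a single argument: a pointer to the struct pt_regs saved at syscall entry. In that case regs->di inside the handler is that pointer, not the file descriptor, which is why the sample log further down shows large negative values. The descriptor itself is the di field of the pointed-to structure, i.e. ((struct pt_regs *)regs->di)->di.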

The kprobe_init function initialises the kprobe, registers the handler, and logs its status. If registration fails, an error message is printed:

static int __init kprobe_init(void) {
    int ret;

    kp.pre_handler = handler_pre;
    ret = register_kprobe(&kp);
    if (ret < 0) {
        printk(KERN_ERR "register_kprobe failed, returned %d\n", ret);
        return ret;
    }

    printk(KERN_INFO "Kprobe registered\n");
    return 0;
}

The kprobe_exit function unregisters the kprobe to ensure no stale probes are left in the kernel:

static void __exit kprobe_exit(void) {
    unregister_kprobe(&kp);
    printk(KERN_INFO "Kprobe unregistered\n");
}

Finally, just like usual we define the entry and exit points for our module:

module_init(kprobe_init);
module_exit(kprobe_exit);

Building

Now that we’ve got our module code, we can build and install our module. The following Makefile will allow us to build our code (the recipe lines under each target must be indented with a tab, not spaces):

obj-m += syscall_interceptor.o

all:
        make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules

clean:
        make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean

We build the module:

make

After a successful build, you should be left with a .ko file. In my case it’s called syscall_interceptor.ko. This is the module that we’ll install into the kernel with the following:

sudo insmod syscall_interceptor.ko

Verify

Let’s check dmesg to verify it’s working. As we’ve hooked the close call we should end up with a flood of messages to verify:

dmesg | tail

You should see something like this:

[  266.615596] Intercepted close syscall: fd=-60473131794600
[  266.615596] Intercepted close syscall: fd=-60473131794600
[  266.615597] Intercepted close syscall: fd=-60473131794600
[  266.615600] Intercepted close syscall: fd=-60473131794600
[  266.615731] Intercepted close syscall: fd=-60473131925672

You can unload this module with rmmod:

sudo rmmod syscall_interceptor

Understand Kprobe Handlers

Kprobe handlers allow you to execute custom logic at various stages of the probed function’s execution:

Pre-handler: Runs before the probed instruction.
Post-handler: Runs after the probed instruction (not used in this example).
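Fault handler: Runs if an exception occurs during the probe (the fault_handler field was removed from struct kprobe in Linux 5.14, so this only applies to older kernels).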

Modify the module to add post- or fault-handling logic as needed.

Clean Up

Always unregister kprobes in the module’s exit function to prevent leaving stale probes in the kernel. Use dmesg to debug any issues during module loading or unloading.

Caveats and Considerations

System Stability: Ensure your handlers execute quickly and avoid blocking operations to prevent affecting system performance.
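Kernel Versions: Kprobes are supported in modern kernels, but symbol names vary between versions and architectures. __x64_sys_close exists on x86-64 kernels from 4.17 onwards; older kernels expose sys_close, and arm64 uses __arm64_sys_close. Check /proc/kallsyms for the name on your system.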
Ethical Usage: Always ensure you have permission to test and use such modules.

Conclusion

Using kprobes, you can safely and dynamically intercept system calls without modifying critical kernel structures. This tutorial demonstrates a clean and modern approach to syscall interception, avoiding deprecated or risky techniques like syscall table modification.

Creating extensions in C for PostgreSQL

21 Jan 2025

Introduction

PostgreSQL allows developers to extend its functionality with custom extensions written in C. This powerful feature can be used to add new functions, data types, or even custom operators to your PostgreSQL instance.

In this blog post, I’ll guide you through creating a simple “Hello, World!” C extension for PostgreSQL and demonstrate how to compile and test it in a Dockerized environment. Using Docker ensures that your local system remains clean while providing a reproducible setup for development.

Development

There are a few steps that we need to walk through in order to get your development environment up and running as well as some simple boilerplate code.

The Code

First, create a working directory for your project:

mkdir postgres_c_extension && cd postgres_c_extension

Now, create a file named example.c and add the following code:

#include "postgres.h"
#include "fmgr.h"
#include "utils/builtins.h"  // For cstring_to_text function

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(hello_world);

Datum
hello_world(PG_FUNCTION_ARGS)
{
    text *result = cstring_to_text("Hello, World!");
    PG_RETURN_TEXT_P(result);
}

This code defines a simple PostgreSQL function hello_world() that returns the text “Hello, World!”. It uses PostgreSQL’s C API, and the cstring_to_text function ensures that the string is properly converted to a PostgreSQL text type.

Let’s take a closer look at a few pieces of that code snippet.

`PG_MODULE_MAGIC`

PG_MODULE_MAGIC;

This macro is mandatory in all PostgreSQL C extensions. It acts as a marker to ensure that the extension was compiled with a compatible version of PostgreSQL. Without it, PostgreSQL will refuse to load the module, as it cannot verify compatibility.

`PG_FUNCTION_INFO_V1`

PG_FUNCTION_INFO_V1(hello_world);

This macro declares the function hello_world() as a PostgreSQL-compatible function using version 1 of PostgreSQL’s call convention. It ensures that the function can interact with PostgreSQL’s internal structures, such as argument parsing and memory management.

`Datum`

Datum hello_world(PG_FUNCTION_ARGS)

Datum is a core PostgreSQL data type that represents any value passed to or returned by a PostgreSQL function. It is a general-purpose type used internally by PostgreSQL to handle various data types efficiently.
PG_FUNCTION_ARGS is a macro that defines the function signature expected by PostgreSQL for dynamically callable functions. It gives access to the arguments passed to the function.

In this example, Datum is the return type of the hello_world function.

`PG_RETURN_TEXT_P`

text *result = cstring_to_text("Hello, World!");
PG_RETURN_TEXT_P(result);

cstring_to_text: This function converts a null-terminated C string (char *) into a PostgreSQL text type. PostgreSQL uses its own text structure to manage string data.
PG_RETURN_TEXT_P: This macro wraps a pointer to a text structure and converts it into a Datum, which is required for returning values from a PostgreSQL C function.

The flow in this function:

cstring_to_text("Hello, World!") creates a text * object in PostgreSQL’s memory context.
PG_RETURN_TEXT_P(result) ensures the text * is properly wrapped in a Datum so PostgreSQL can use the return value.

Control and SQL Files

A PostgreSQL extension requires a control file to describe its metadata and a SQL file to define the functions it provides.

Create a file named example.control:

default_version = '1.0'
comment = 'Example PostgreSQL extension'

Next, create example--1.0.sql to define the SQL function:

CREATE FUNCTION hello_world() RETURNS text
AS 'example', 'hello_world'
LANGUAGE C IMMUTABLE STRICT;

Setting Up the Build System

To build the C extension, you’ll need a Makefile. Create one in the project directory:

MODULES = example
EXTENSION = example
DATA = example--1.0.sql
PG_CONFIG = pg_config
OBJS = $(MODULES:%=%.o)

PGXS := $(shell $(PG_CONFIG) --pgxs)
include $(PGXS)

This Makefile uses PostgreSQL’s pgxs build system to compile the C code into a shared library that PostgreSQL can load.

Build Environment

To keep your development environment clean, we’ll use Docker. Create a Dockerfile to set up a build environment and compile the extension:

FROM postgres:latest

RUN apt-get update && apt-get install -y \
    build-essential \
    postgresql-server-dev-all \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /usr/src/example
COPY . .

RUN make && make install

Build the Docker image:

docker build -t postgres-c-extension .

Start a container using the custom image:

docker run --name pg-c-demo -e POSTGRES_PASSWORD=postgres -d postgres-c-extension

Testing

Access the PostgreSQL shell in the running container:

docker exec -it pg-c-demo psql -U postgres

Run the following SQL commands to create and test the extension:

CREATE EXTENSION example;
SELECT hello_world();

You should see the output:

 hello_world 
--------------
 Hello, World!
(1 row)

Cleaning Up

When you’re finished, stop and remove the container:

docker stop pg-c-demo && docker rm pg-c-demo

Conclusion

By following this guide, you’ve learned how to create a simple C extension for PostgreSQL, compile it, and test it in a Dockerized environment. This example can serve as a starting point for creating more complex extensions that add custom functionality to PostgreSQL. Using Docker ensures a clean and reproducible setup, making it easier to focus on development without worrying about system dependencies.

Understanding the ? Operator

24 Dec 2024

Introduction

The ? operator in Rust is one of the most powerful features for handling errors concisely and gracefully. However, it’s often misunderstood as just syntactic sugar for .unwrap(). In this post, we’ll dive into how the ? operator works, its differences from .unwrap(), and practical examples to highlight its usage.

What is it?

The ? operator is a shorthand for propagating errors in Rust. It simplifies error handling in functions that return a Result or Option. Here’s what it does:

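For Result:
- If the value is Ok, the expression evaluates to the inner value and execution continues.
- If the value is Err, the function returns early and hands the error to the caller.
For Option:
- If the value is Some, the expression evaluates to the inner value and execution continues.
- If the value is None, the function returns None to the caller immediately.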

This allows you to avoid manually matching on Result or Option in many cases, keeping your code clean and readable.
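The examples in this post all use Result, so here is a minimal sketch of the Option case. The helper names extension and extension_initial are made up for illustration; only the standard library is assumed.

```rust
// Hypothetical helper: the text after the last '.', if there is any.
fn extension(file_name: &str) -> Option<&str> {
    // `?` returns None from `extension` when the name contains no dot.
    let (_, ext) = file_name.rsplit_once('.')?;
    if ext.is_empty() {
        None
    } else {
        Some(ext)
    }
}

// Chains two Option-returning steps; either `?` can end the function early.
fn extension_initial(file_name: &str) -> Option<char> {
    let ext = extension(file_name)?;
    let first = ext.chars().next()?;
    Some(first.to_ascii_uppercase())
}

fn main() {
    println!("{:?}", extension_initial("notes.txt"));
    println!("{:?}", extension_initial("README"));
}
```

Note that ? on an Option only compiles inside a function that itself returns Option; mixing it with a Result return type needs an explicit conversion such as ok_or.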

How `?` Differs from `.unwrap()`

At first glance, the ? operator might look like a safer version of .unwrap(), but they serve different purposes:

Error Propagation:
- ? propagates the error to the caller, allowing the program to handle it later.
- .unwrap() panics and crashes the program if the value is Err or None.
Use in Production:
- ? is ideal for production code where you want robust error handling.
- .unwrap() should only be used when you are absolutely certain the value will never be an error (e.g., in tests or prototypes).

Examples

fn read_file(path: &str) -> Result<String, std::io::Error> {
    let contents = std::fs::read_to_string(path)?; // Propagate error if it occurs
    Ok(contents)
}

fn main() {
    match read_file("example.txt") {
        Ok(contents) => println!("File contents:\n{}", contents),
        Err(err) => eprintln!("Error reading file: {}", err),
    }
}

In this example, the ? operator automatically returns any error from std::fs::read_to_string to the caller, saving you from writing a verbose match.

The match is then left as an exercise to the calling code; in this case main.
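If main has nothing useful to do with the error beyond reporting it, main can return a Result as well, and ? then works there too. A small sketch, assuming a throwaway file in the system temp directory; line_count and the file name are invented for this example.

```rust
use std::error::Error;
use std::fs;
use std::path::Path;

// Hypothetical helper: counts the lines in a file.
fn line_count(path: &Path) -> Result<usize, Box<dyn Error>> {
    // The io::Error is converted into Box<dyn Error> by `?`.
    let contents = fs::read_to_string(path)?;
    Ok(contents.lines().count())
}

// When main returns Err, the runtime prints the error and exits non-zero.
fn main() -> Result<(), Box<dyn Error>> {
    let path = std::env::temp_dir().join("question_mark_demo.txt");
    fs::write(&path, "one\ntwo\nthree\n")?;

    let count = line_count(&path)?;
    println!("{count} lines");

    fs::remove_file(&path)?;
    Ok(())
}
```

Box<dyn Error> is a convenient catch-all for small programs; a library would more likely expose its own error type.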

How it Differs from `.unwrap()`

Compare the ? operator to .unwrap():

Using `?`:

fn safe_read_file(path: &str) -> Result<String, std::io::Error> {
    let contents = std::fs::read_to_string(path)?; // Error is propagated
    Ok(contents)
}

Using `.unwrap()`:

fn unsafe_read_file(path: &str) -> String {
    let contents = std::fs::read_to_string(path).unwrap(); // Panics on error
    contents
}

If std::fs::read_to_string fails:

The ? operator propagates the error to the caller.
.unwrap() causes the program to panic, potentially crashing your application.

Error Propagation in Action

The ? operator shines when you need to handle multiple fallible operations:

fn process_file(path: &str) -> Result<(), std::io::Error> {
    let contents = std::fs::read_to_string(path)?;
    let lines: Vec<&str> = contents.lines().collect();
    std::fs::write("output.txt", lines.join("\n"))?;
    Ok(())
}

fn main() {
    if let Err(err) = process_file("example.txt") {
        eprintln!("Error processing file: {}", err);
    }
}

Here, the ? operator simplifies error handling for both read_to_string and write, keeping the code concise and readable.

Saving typing

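Using ? is shorthand for a common error propagation pattern. The one extra thing it does is pass the error through From::from, so an error of a different type is converted to the function’s declared error type; in the example below the types already match, so the conversion is a no-op.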

Without `?`:

fn read_file(path: &str) -> Result<String, std::io::Error> {
    let contents = match std::fs::read_to_string(path) {
        Ok(val) => val,
        Err(err) => return Err(err), // Explicitly propagate the error
    };
    Ok(contents)
}

With `?`:

fn read_file(path: &str) -> Result<String, std::io::Error> {
    let contents = std::fs::read_to_string(path)?; // Implicitly propagate the error
    Ok(contents)
}
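The two versions above use the same error type on both sides, which hides the conversion step that ? performs. The sketch below makes it visible; ConfigError and parse_port are hypothetical names chosen for the example.

```rust
use std::fmt;
use std::num::ParseIntError;

// Hypothetical application-level error type.
#[derive(Debug)]
enum ConfigError {
    Parse(ParseIntError),
    OutOfRange(u32),
}

// This impl is what lets `?` turn a ParseIntError into a ConfigError.
impl From<ParseIntError> for ConfigError {
    fn from(err: ParseIntError) -> Self {
        ConfigError::Parse(err)
    }
}

impl fmt::Display for ConfigError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ConfigError::Parse(err) => write!(f, "not a number: {err}"),
            ConfigError::OutOfRange(n) => write!(f, "{n} does not fit in a port"),
        }
    }
}

fn parse_port(text: &str) -> Result<u16, ConfigError> {
    // parse() fails with ParseIntError; `?` converts it via From.
    let raw: u32 = text.trim().parse()?;
    if raw > u32::from(u16::MAX) {
        return Err(ConfigError::OutOfRange(raw));
    }
    Ok(raw as u16)
}

fn main() {
    for input in ["8080", "http", "70000"] {
        match parse_port(input) {
            Ok(port) => println!("{input} -> port {port}"),
            Err(err) => println!("{input} -> error: {err}"),
        }
    }
}
```

Without the From impl, the parse()? line would not compile, because the compiler would have no way to turn a ParseIntError into a ConfigError.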

Chaining

You can also chain multiple operations with ?, making it ideal for error-prone workflows:

async fn fetch_data(url: &str) -> Result<String, reqwest::Error> {
    let response = reqwest::get(url).await?.text().await?;
    Ok(response)
}

#[tokio::main]
async fn main() {
    match fetch_data("https://example.com").await {
        Ok(data) => println!("Fetched data: {}", data),
        Err(err) => eprintln!("Error fetching data: {}", err),
    }
}

Conclusion

The ? operator is much more than syntactic sugar for .unwrap(). It’s a powerful tool that:

Simplifies error propagation.
Keeps your code clean and readable.
Encourages robust error handling in production.

By embracing the ? operator, you can write concise, idiomatic Rust code that gracefully handles errors without sacrificing clarity or safety.

Exploring async and await in Rust

24 Dec 2024

Introduction

Rust’s async and await features bring modern asynchronous programming to the language, enabling developers to write non-blocking code efficiently. In this blog post, we’ll explore how async and await work, when to use them, and provide practical examples to demonstrate their power.

What Are `async` and `await`?

Rust uses an async and await model to handle concurrency. These features allow you to write asynchronous code that doesn’t block the thread, making it perfect for tasks like I/O operations, networking, or any scenario where waiting on external resources is necessary.

Key Concepts:

async:
- Marks a function or block as asynchronous.
- Returns a Future instead of executing immediately.
await:
- Suspends the current function until the Future completes.
- Only allowed inside an async function or block.
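To see the "returns a Future instead of executing immediately" point in isolation, here is a sketch that uses only the standard library. The block_on function is a deliberately naive stand-in for a real runtime (it just polls in a loop), and mark_started is a made-up example function.

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// A waker that does nothing; enough for a loop that keeps polling anyway.
struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

// Toy executor: polls the future until it is ready. Not a real runtime.
fn block_on<F: Future>(future: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut future = pin!(future);
    loop {
        match future.as_mut().poll(&mut cx) {
            Poll::Ready(output) => return output,
            Poll::Pending => std::thread::yield_now(),
        }
    }
}

// Sets the flag when its body actually runs.
async fn mark_started(flag: &AtomicBool) -> u32 {
    flag.store(true, Ordering::SeqCst);
    42
}

fn main() {
    let started = AtomicBool::new(false);

    // Calling the async fn only builds the future; the body has not run.
    let future = mark_started(&started);
    assert!(!started.load(Ordering::SeqCst));

    // Driving the future to completion is what executes the body.
    let value = block_on(future);
    assert!(started.load(Ordering::SeqCst));
    println!("value = {value}");
}
```

A production runtime parks the thread and relies on the waker to know when to poll again; the busy loop here is only to keep the sketch short.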

Getting Started

To use async and await, you’ll need an asynchronous runtime such as Tokio or async-std. These provide the necessary infrastructure to execute asynchronous tasks.

Practical Examples

A Basic `async` Function

use tokio::time::{sleep, Duration};

async fn say_hello() {
    println!("Hello, world!");
    sleep(Duration::from_secs(2)).await; // Non-blocking wait
    println!("Goodbye, world!");
}

#[tokio::main]
async fn main() {
    say_hello().await;
}

Explanation:

say_hello is an async function that prints messages and waits for 2 seconds without blocking the thread.
The .await keyword pauses execution until the sleep operation completes.

Running Tasks Concurrently with `join!`

use tokio::time::{sleep, Duration};

async fn task_one() {
    println!("Task one started");
    sleep(Duration::from_secs(2)).await;
    println!("Task one completed");
}

async fn task_two() {
    println!("Task two started");
    sleep(Duration::from_secs(1)).await;
    println!("Task two completed");
}

#[tokio::main]
async fn main() {
    tokio::join!(task_one(), task_two());
    println!("All tasks completed");
}

Explanation:

join! polls multiple futures concurrently on the current task and completes once all of them have finished.
Task two finishes first, even though task one started earlier, demonstrating concurrency.

Handling Errors in Asynchronous Code

async fn fetch_data(url: &str) -> Result<String, reqwest::Error> {
    let response = reqwest::get(url).await?.text().await?;
    Ok(response)
}

#[tokio::main]
async fn main() {
    match fetch_data("https://example.com").await {
        Ok(data) => println!("Fetched data: {}", data),
        Err(err) => eprintln!("Error fetching data: {}", err),
    }
}

Explanation:

Uses the reqwest crate to fetch data from a URL.
Error handling is built-in with Result and the ? operator.

Spawning Tasks with `tokio::task`

use tokio::task;
use tokio::time::{sleep, Duration};

async fn do_work(id: u32) {
    println!("Worker {} starting", id);
    sleep(Duration::from_secs(2)).await;
    println!("Worker {} finished", id);
}

#[tokio::main]
async fn main() {
    let handles: Vec<_> = (1..=5)
        .map(|id| task::spawn(do_work(id)))
        .collect();

    for handle in handles {
        handle.await.unwrap(); // Wait for each task to complete
    }
}

Explanation:

tokio::task::spawn creates lightweight, non-blocking tasks.
The await ensures all tasks complete before exiting.

Asynchronous File I/O

use tokio::fs;

async fn read_file(file_path: &str) -> Result<String, std::io::Error> {
    let contents = fs::read_to_string(file_path).await?;
    Ok(contents)
}

#[tokio::main]
async fn main() {
    match read_file("example.txt").await {
        Ok(contents) => println!("File contents:\n{}", contents),
        Err(err) => eprintln!("Error reading file: {}", err),
    }
}

Explanation:

Uses tokio::fs for non-blocking file reading.
Handles file errors gracefully with Result.

Key Points to Remember

Async Runtime:
- You need an async runtime like Tokio or async-std to execute async functions.
Concurrency:
- Rust’s async model is cooperative, meaning tasks must yield control for others to run.
Error Handling:
- Combine async with Result for robust error management.
State Sharing:
- Use Arc and Mutex for sharing state safely between async tasks.
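The Arc and Mutex pattern from the last point looks like this in a minimal form. To keep the sketch free of external crates it uses OS threads as a stand-in for async tasks, which is a substitution on my part; count_with_workers is a made-up name.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Spawns `workers` threads that each bump a shared counter `increments` times.
fn count_with_workers(workers: usize, increments: usize) -> usize {
    // Arc gives shared ownership; Mutex serialises access to the value.
    let counter = Arc::new(Mutex::new(0usize));

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            // Each worker gets its own handle to the same counter.
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..increments {
                    *counter.lock().unwrap() += 1;
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }

    let total = *counter.lock().unwrap();
    total
}

fn main() {
    println!("total = {}", count_with_workers(4, 1000));
}
```

With Tokio the shape is the same: clone the Arc and move the clone into the spawned task. If a lock has to be held across an .await, tokio::sync::Mutex is the usual choice instead of the standard library one.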

Conclusion

Rust’s async and await features empower you to write efficient, non-blocking code that handles concurrency seamlessly. By leveraging async runtimes and best practices, you can build high-performance applications that scale effortlessly.

Start experimenting with these examples and see how async and await can make your Rust code more powerful and expressive. Happy coding!

High Performance Linux IO with IO_URING

23 Dec 2024

Introduction

IO_URING is an advanced asynchronous I/O interface introduced in the Linux kernel (version 5.1). It’s designed to provide significant performance improvements for I/O-bound applications, particularly those requiring high throughput and low latency.

It’s well worth taking a look at the Linux man pages for io_uring and reading through the function interface.

In today’s article we’ll discuss IO_URING in depth and follow with some examples to see it in practice.

What is IO_URING

IO_URING is a high-performance asynchronous I/O interface introduced in Linux kernel version 5.1. It was developed to address the limitations of traditional Linux I/O mechanisms like epoll, select, and aio. These earlier approaches often suffered from high overhead due to system calls, context switches, or inefficient batching, which limited their scalability in handling modern high-throughput and low-latency workloads.

At its core, IO_URING provides a ring-buffer-based mechanism for submitting I/O requests and receiving their completions, eliminating many inefficiencies in older methods. This allows applications to perform non-blocking, asynchronous I/O with minimal kernel involvement, making it particularly suited for applications such as databases, web servers, and file systems.

How does IO_URING work?

IO_URING’s architecture revolves around two primary shared memory ring buffers between user space and the kernel:

Submission Queue (SQ):
- The SQ is a ring buffer where applications enqueue I/O requests.
- User-space applications write requests directly to the buffer without needing to call into the kernel for each operation.
- The requests describe the type of I/O operation to be performed (e.g., read, write, send, receive).
Completion Queue (CQ):
- The CQ is another ring buffer where the kernel places the results of completed I/O operations.
- Applications read from the CQ to retrieve the status of their submitted requests.

The interaction between user space and the kernel is simplified:

The user-space application adds entries to the Submission Queue and notifies the kernel when ready (via a single syscall like io_uring_enter).
The kernel processes these requests and posts results to the Completion Queue, which the application can read without additional syscalls.
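The queue mechanics described above can be modelled in a few lines. This is a single-threaded toy in Rust, not the real io_uring ABI: the actual rings live in memory shared with the kernel and are synchronised with atomics. Ring, Request, Completion and process are hypothetical names; user_data and res mirror the field names liburing uses.

```rust
// Toy fixed-size ring: `head` is the consumer index, `tail` the producer index.
struct Ring<T> {
    slots: Vec<Option<T>>,
    head: usize,
    tail: usize,
}

impl<T> Ring<T> {
    fn with_capacity(capacity: usize) -> Self {
        assert!(capacity.is_power_of_two());
        Ring { slots: (0..capacity).map(|_| None).collect(), head: 0, tail: 0 }
    }

    fn capacity(&self) -> usize {
        self.slots.len()
    }

    fn len(&self) -> usize {
        self.tail.wrapping_sub(self.head)
    }

    // Producer side: fails and hands the item back when the ring is full.
    fn push(&mut self, item: T) -> Result<(), T> {
        if self.len() == self.capacity() {
            return Err(item);
        }
        let idx = self.tail & (self.capacity() - 1);
        self.slots[idx] = Some(item);
        self.tail = self.tail.wrapping_add(1);
        Ok(())
    }

    // Consumer side: returns None when the ring is empty.
    fn pop(&mut self) -> Option<T> {
        if self.len() == 0 {
            return None;
        }
        let idx = self.head & (self.capacity() - 1);
        self.head = self.head.wrapping_add(1);
        self.slots[idx].take()
    }
}

struct Request { user_data: u64, len: i32 }
struct Completion { user_data: u64, res: i32 }

// Stand-in for the kernel: drains the SQ and posts one completion per request.
fn process(sq: &mut Ring<Request>, cq: &mut Ring<Completion>) -> usize {
    let mut done = 0;
    while cq.len() < cq.capacity() {
        match sq.pop() {
            Some(req) => {
                // Pretend every request succeeds and transfers `len` bytes.
                let _ = cq.push(Completion { user_data: req.user_data, res: req.len });
                done += 1;
            }
            None => break,
        }
    }
    done
}

fn main() {
    let mut sq = Ring::with_capacity(4);
    let mut cq = Ring::with_capacity(8);
    for (id, len) in [(1u64, 512), (2, 1024), (3, 64)] {
        if sq.push(Request { user_data: id, len }).is_err() {
            eprintln!("submission queue full");
        }
    }
    let processed = process(&mut sq, &mut cq);
    println!("processed {processed} requests");
    while let Some(c) = cq.pop() {
        println!("request {} finished with res = {}", c.user_data, c.res);
    }
}
```

The user_data value is what lets the application match a completion back to the request that produced it, since completions are not guaranteed to arrive in submission order in the real interface.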

Key Features

Batching Requests:
- Multiple I/O operations can be submitted in a single system call, significantly reducing syscall overhead.
Zero-copy I/O:
- Certain operations (like reads and writes) can leverage fixed buffers, avoiding unnecessary data copying between kernel and user space.
Kernel Offloading:
- The kernel can process requests in the background, allowing the application to continue without waiting.
Efficient Polling:
- Supports event-driven programming with low-latency polling mechanisms, reducing idle time in high-performance applications.
Flexibility:
- IO_URING supports a wide range of I/O operations, including file I/O, network I/O, and event notifications.

Code

Let’s get some code examples going to see exactly what we’re dealing with.

First of all, check to see that your kernel supports IO_URING. It should. It’s been available since 5.1.

uname -r

You’ll also need liburing available to you in order to compile these examples.

Library setup

In this first example, we won’t perform any actions, but we’ll set up the library so that we can use these operations. All of our other examples will use this as a base.

We’ll need some basic I/O headers as well as liburing.h.

#include <liburing.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

We initialize our uring queue using io_uring_queue_init:

struct io_uring ring;
int ret;

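// initialize IO_URING; liburing returns a negative errno value rather than setting errno
ret = io_uring_queue_init(8, &ring, 0);
if (ret < 0) {
    fprintf(stderr, "io_uring_queue_init: %s\n", strerror(-ret));
    exit(1);
}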

When we’re finished with the ring, we clean up with io_uring_queue_exit.

io_uring_queue_exit(&ring);

Simple Write

In this example, we’ll queue up a write of a string out to a file and that’s it.

First, we need to open the file like usual:

int fd = open(FILENAME, O_WRONLY | O_CREAT | O_TRUNC, 0644);
if (fd < 0) {
    perror("open");
    io_uring_queue_exit(&ring);
    exit(1);
}

Now, we set up the write job.

struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
if (!sqe) {
    fprintf(stderr, "io_uring_get_sqe failed\n");
    close(fd);
    io_uring_queue_exit(&ring);
    exit(1);
}

const char *message = MESSAGE;
struct iovec iov = {
    .iov_base = (void *)message,
    .iov_len = strlen(message)
};

io_uring_prep_writev(sqe, fd, &iov, 1, 0);

The io_uring_get_sqe function will get us the next available submission queue entry from the job queue. Once we have secured one of these, we then fill a vector I/O structure (an iovec) with the details of our data. Here it’s just the data pointer and length.

Finally, we prepare a vector write request using io_uring_prep_writev.

We submit the job off to be processed now with io_uring_submit:

ret = io_uring_submit(&ring);
if (ret < 0) {
    fprintf(stderr, "io_uring_submit: %s\n", strerror(-ret)); // returns -errno, errno is not set
    close(fd);
    io_uring_queue_exit(&ring);
    exit(1);
}

We can wait for the execution to complete; even more powerful though is we can be off doing other things if we’d like!

In order to wait for the job to finish, we use io_uring_wait_cqe:

struct io_uring_cqe *cqe;
ret = io_uring_wait_cqe(&ring, &cqe);
if (ret < 0) {
    fprintf(stderr, "io_uring_wait_cqe: %s\n", strerror(-ret)); // returns -errno, errno is not set
    close(fd);
    io_uring_queue_exit(&ring);
    exit(1);
}

We check the result of the job through the io_uring_cqe structure filled by the io_uring_wait_cqe call:

if (cqe->res < 0) {
    fprintf(stderr, "Write failed: %s\n", strerror(-cqe->res));
} else {
    printf("Write completed successfully!\n");
}

Finally, we mark the uring event as consumed and close the file.

io_uring_cqe_seen(&ring, cqe);
close(fd);

The full example of this can be found here.

Multiple Operations

We can start to see some of the power of this system in this next example. We’ll submit multiple jobs for processing.

We’ve opened a source file for reading in src_fd and a destination file for writing in dest_fd.

// prepare a read operation
sqe = io_uring_get_sqe(&ring);
io_uring_prep_read(sqe, src_fd, buffer, BUF_SIZE, 0);

// submit the read request
io_uring_submit(&ring);
io_uring_wait_cqe(&ring, &cqe);

// copy the result out before releasing the CQE; its slot may be reused afterwards
int bytes_read = cqe->res;
io_uring_cqe_seen(&ring, cqe);

if (bytes_read < 0) {
    fprintf(stderr, "Read failed: %s\n", strerror(-bytes_read));
    goto cleanup;
}

// prepare a write operation for exactly the number of bytes that were read
sqe = io_uring_get_sqe(&ring);
io_uring_prep_write(sqe, dest_fd, buffer, bytes_read, 0);

// submit the write request
io_uring_submit(&ring);
io_uring_wait_cqe(&ring, &cqe);

if (cqe->res < 0) {
    fprintf(stderr, "Write failed: %s\n", strerror(-cqe->res));
} else {
    printf("Copy completed successfully!\n");
}
io_uring_cqe_seen(&ring, cqe);

So, this is just sequentially executing multiple operations.

The full example of this can be found here.

Asynchronous operations

Finally, we’ll write an example that will process multiple operations in parallel.

The following for loop sets up 3 read jobs:

for (int i = 0; i < FILE_COUNT; i++) {
    int fd = open(files[i], O_RDONLY);
    if (fd < 0) {
        perror("open");
        io_uring_queue_exit(&ring);
        exit(1);
    }

    // Allocate a buffer for the read operation
    char *buffer = malloc(BUF_SIZE);
    if (!buffer) {
        perror("malloc");
        close(fd);
        io_uring_queue_exit(&ring);
        exit(1);
    }

    requests[i].fd = fd;
    requests[i].buffer = buffer;

    // Get an SQE (Submission Queue Entry)
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    if (!sqe) {
        fprintf(stderr, "Failed to get SQE\n");
        close(fd);
        free(buffer);
        io_uring_queue_exit(&ring);
        exit(1);
    }

    // Prepare a read operation
    io_uring_prep_read(sqe, fd, buffer, BUF_SIZE, 0);
    io_uring_sqe_set_data(sqe, &requests[i]);
}

All of the requests now get submitted for processing:

// Submit all requests
ret = io_uring_submit(&ring);
if (ret < 0) {
    fprintf(stderr, "io_uring_submit: %s\n", strerror(-ret)); // returns -errno, errno is not set
    io_uring_queue_exit(&ring);
    exit(1);
}

Finally, we wait on each of the jobs to finish. The important thing to note here is that we could be busy doing other things rather than just waiting for these jobs to finish.

// wait for completions
for (int i = 0; i < FILE_COUNT; i++) {
    struct io_uring_cqe *cqe;
    ret = io_uring_wait_cqe(&ring, &cqe);
    if (ret < 0) {
        fprintf(stderr, "io_uring_wait_cqe: %s\n", strerror(-ret)); // returns -errno, errno is not set
        io_uring_queue_exit(&ring);
        exit(1);
    }

    // Process the completed request
    struct io_request *req = io_uring_cqe_get_data(cqe);
    if (cqe->res < 0) {
        fprintf(stderr, "Read failed for file %d: %s\n", req->fd, strerror(-cqe->res));
    } else {
        // the buffer is not NUL-terminated, so bound the print by the byte count
        printf("Read %d bytes from file descriptor %d:\n%.*s\n", cqe->res, req->fd, cqe->res, req->buffer);
    }

    // Mark the CQE as seen
    io_uring_cqe_seen(&ring, cqe);

    // Clean up
    close(req->fd);
    free(req->buffer);
}

The entire example of this one can be found here.

Conclusion

IO_URING represents a transformative step in Linux asynchronous I/O, providing unparalleled performance and flexibility for modern applications. By minimizing syscall overhead, enabling zero-copy I/O, and allowing concurrent and batched operations, it has become a vital tool for developers working on high-performance systems.

Through the examples we’ve covered, you can see the practical power of IO_URING, from simple write operations to complex asynchronous processing. Its design not only simplifies high-throughput I/O operations but also opens up opportunities to optimize and innovate in areas like database systems, networking, and file handling.


Cogs and Levers A blog full of technical stuff
