
Loading dynamic libraries in Rust

Today’s post is going to be a quick demonstration of loading dynamic libraries at runtime in Rust.

In my earlier article, I showed how to use Glibc’s dlopen/dlsym/dlclose APIs from C to load a shared object off disk and call a function in it. Rust can do the same thing – with a bit more type safety – using the libloading crate.

This is not meant to be a full plugin framework, just a minimal “host loads a tiny library and calls one function” example, similar in spirit to the original C version.

A tiny library in Rust

We’ll start with a tiny dynamic library that exports one function, greet, which returns a C-style string. First, create the crate:

cargo new --lib rust_greeter
cd rust_greeter

Edit Cargo.toml so that the library is built as a cdylib:

[package]
name = "rust_greeter"
version = "0.1.0"
edition = "2021"

[lib]
name = "test"                
crate-type = ["cdylib"]      

Now the library code in src/lib.rs:

use std::os::raw::c_char;

#[unsafe(no_mangle)]
pub extern "C" fn greet() -> *const c_char {
    static GREETING: &str = "Hello from Rust!\0";
    GREETING.as_ptr().cast()
}

The #[unsafe(no_mangle)] form doesn’t make the function unsafe to call — it marks the attribute itself as unsafe to apply, because an unmangled symbol can collide with other symbols in the final binary. Rust 1.82 stabilised this syntax, and the 2024 edition requires it for no_mangle. It’s a small but nice modernisation that fits well when exposing C-compatible symbols from Rust.

Build:

cargo build --release

You’ll get:

target/release/libtest.so

Host program: loading the library with libloading

Create a new binary crate:

cargo new rust_host
cd rust_host

Add libloading to Cargo.toml:

[package]
name = "rust_host"
version = "0.1.0"
edition = "2021"

[dependencies]
libloading = "0.8"

And src/main.rs:

use std::error::Error;
use std::ffi::CStr;
use std::os::raw::c_char;

use libloading::{Library, Symbol};

type GreetFn = unsafe extern "C" fn() -> *const c_char;

fn main() -> Result<(), Box<dyn Error>> {
    unsafe {
        let lib = Library::new("./libtest.so")?;
        let greet: Symbol<GreetFn> = lib.get(b"greet\0")?;

        let raw = greet();
        let c_str = CStr::from_ptr(raw);
        let message = c_str.to_str()?;

        println!("{message}");
    }
    Ok(())
}

Before we can run any of this, we need to make sure the library is available to the host program. For this demo, we simply copy it into the host crate’s directory:

cp ../rust_greeter/target/release/libtest.so .

Running cargo run prints:

$ cargo run                                     
   Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.01s
    Running `target/debug/rust_host`
Hello from Rust!

Mapping back to the C version

When you look at this code, you can see that Library::new("./libtest.so") takes the place of dlopen().

We get to the symbol we want with lib.get(b"greet\0") rather than dlsym(), and there’s no explicit dlclose(): dropping the Library unloads it for us.
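
If you want the underlying unload to happen at an explicit point rather than at end of scope, libloading also exposes close(). A small sketch:

unsafe {
    let lib = Library::new("./libtest.so")?;
    {
        let greet: Symbol<GreetFn> = lib.get(b"greet\0")?;
        // ... use the symbol; it borrows the library, so it can't outlive it ...
    }
    lib.close()?;   // explicit unload; otherwise Drop does this for us
}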

Platform notes

Keep in mind that I’ve written this code on my Linux machine, so you’ll see different library file names depending on the platform you work from.

Platform   Output
Linux      libtest.so
macOS      libtest.dylib
Windows    test.dll

cdylib produces the correct format automatically.
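
If the host ever needs to build the right file name per platform at runtime, one option is a small helper using the constants the standard library already ships. This is just a sketch; dynamic_lib_name is a name invented here:

use std::env::consts::{DLL_PREFIX, DLL_SUFFIX};

/// "test" -> "libtest.so" (Linux), "libtest.dylib" (macOS), "test.dll" (Windows)
fn dynamic_lib_name(stem: &str) -> String {
    format!("{DLL_PREFIX}{stem}{DLL_SUFFIX}")
}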

Conclusion

We:

  • built a tiny Rust cdylib exporting a C-ABI function,
  • loaded it at runtime with libloading,
  • looked up a symbol by name, and
  • invoked it through a typed function pointer.

This was, in essence, a modern Rust update to that earlier C article.

Just like in the C post, this is a deliberately minimal skeleton — but enough to grow into a proper plugin architecture once you define a stable API between host and library.

Simulating a Scheduler

Introduction

Most operating systems books talk about schedulers as if they’re mysterious forces that magically decide which program runs next. You’ll usually see phrases like context switching, round-robin, priority queues, and tasks yielding the CPU—but the moment you try to understand how that works mechanically, the examples jump straight into kernel code, hardware timers, and interrupt controllers.

That’s… a lot.

Before touching real hardware, it’s far easier to learn the concept in a sandbox: a tiny world that behaves like an operating system, but without all the baggage. That’s what we’re going to build.

In this post, we’ll simulate a scheduler.

We’ll:

  • Define a small set of opcodes—like a micro-instruction set for “programs.”
  • Write a mini assembler that converts human-readable instructions to bytecodes.
  • Represent each program as a task inside a virtual machine.
  • Implement a scheduler that runs these tasks, switching between them to simulate concurrency.

The final result:

  • multiple tiny “programs”
  • all appearing to run at the same time
  • under the control of a scheduler you wrote

No threads. No OS dependencies. No unsafe code. Just pure, understandable logic.

By the time we reach the end, you’ll not only understand what a scheduler does—you’ll understand why it needs to exist, and the mental model will stay with you when you look at real operating systems later.

Let’s build a scheduler from scratch.

Programs

Before scheduling, we need programs. We need a way to define, write, and store programs so that we can load these into our computer.

Instruction Set

First up, we’re going to define the set of instructions that our “computer” will execute. It’s worth defining these in one well-known place, because a few components need to understand them: the virtual machine (which we’ll get to) will interpret them and perform actions, while our assembler will have the job of converting them into bytes on disk.

Here are the instructions:

#[repr(u8)]
#[derive(Copy, Clone, Debug)]
pub enum Instruction {
    NOP = 0x00,
    PUSH = 0x01,
    LOAD = 0x02,
    ADD = 0x03,
    SUB = 0x04,
    PRINT = 0x05,
    SLEEP = 0x06,
    YIELD = 0x07,
    HALT = 0x08,
    WORK = 0x09,
    SETPRIO = 0x0A,
}

As you can see, our computer won’t do a lot. Its functionality isn’t the subject here though; we want to see how these instructions get scheduled.

We use repr(u8) to make casting to u8 a much smoother process; that will help our assembler later on change our defined instructions into programs on disk.

Going the other way, we also need to make the bytes-on-disk to instructions-in-memory translation easy.

impl Instruction {
    pub fn from_byte(byte: u8) -> Option<Self> {
        match byte {
            0x00 => Some(Instruction::NOP),
            0x01 => Some(Instruction::PUSH),
            0x02 => Some(Instruction::LOAD),
            0x03 => Some(Instruction::ADD),
            0x04 => Some(Instruction::SUB),
            0x05 => Some(Instruction::PRINT),
            0x06 => Some(Instruction::SLEEP),
            0x07 => Some(Instruction::YIELD),
            0x08 => Some(Instruction::HALT),
            0x09 => Some(Instruction::WORK),
            0x0A => Some(Instruction::SETPRIO),
            _ => None,
        }
    }
}

Assembling Programs

With a solid set of instructions defined, we can now write some code to take a list of instructions and write them to disk. We’d call the result a binary.

pub fn translate_instructions_to_bytes(instructions: &Program) -> Vec<u8> {
    instructions
        .iter()
        .map(|i| *i as u8)
        .collect::<Vec<u8>>()
}

pub fn assemble_to_disk(instructions: &Program, path: &str) -> std::io::Result<()> {
    let bytes = translate_instructions_to_bytes(instructions);
    std::fs::write(path, bytes)
}

The Program type is simply a synonym for Vec<Instruction>.
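
In code, that’s simply:

pub type Program = Vec<Instruction>;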

We can use this now to assemble some programs:

let p1 = vec![
    Instruction::PUSH,
    Instruction::ADD,
    Instruction::SUB,
    Instruction::LOAD,
    Instruction::HALT,
];

assemble_to_disk(&p1, "./bins/p1.bin").expect("couldn't write binary");

Outside the context of our Rust program, we can verify the content of the binary with hexdump:

$ hexdump -C bins/p1.bin

00000000  01 03 04 02 08       |.....|

Those byte values marry up to what our program defines. You can see it’s important that our ISA doesn’t change now, otherwise our programs will have completely different meanings!

Of course, if it does change - we simply re-assemble our programs.

Tasks

Now we define our Task abstraction. A task is how our virtual machine will store the state of a program that’s been loaded off of disk and is executing / has executed. We only need some basics to define the state of our task.

pub struct Task {
    /// The path to the task
    pub path: String,
    
    /// Program code
    pub code: Program,
    
    /// Current program counter (instruction pointer register)
    pub pc: usize,
    
    /// Stack state
    pub stack: Vec<i32>,
    
    /// Determines if this task is still running
    pub done: bool,
    
    /// Priority of this task
    pub priority: usize,
}

We can load programs off of disk and make tasks out of them ready to execute.

impl Task {
    pub fn load(path: &str) -> std::io::Result<Task> {
        let raw_code = std::fs::read(path)?;
        let instructions = raw_code
            .iter()
            .filter_map(|&b| Instruction::from_byte(b)) // skip any bytes that aren't valid instructions
            .collect::<Vec<Instruction>>();

        Ok(Task {
            path: path.to_string(),
            code: instructions,
            pc: 0,
            stack: Vec::new(),
            done: false,
            priority: 0,
        })
    }
}

Virtual Machine

Now that we have tasks loaded from disk, we need something that can run them.

Enter the virtual machine.

This is the component responsible for:

  • tracking and storing tasks,
  • executing one instruction at a time,
  • and cooperatively multitasking between tasks (our scheduler).

Let’s look at the core VM structure:

pub struct VM {
    pub tasks: Vec<Task>,
}

impl VM {
    pub fn new() -> Self {
        Self { tasks: Vec::new() }
    }

    pub fn add_task(&mut self, task: Task) {
        self.tasks.push(task);
    }

    fn has_runnables(&self) -> bool {
        self.tasks.iter().any(|t| !t.done)
    }

    fn execute(&self, task: &Task) {
        let instruction = task.code[task.pc];
        println!("[{}]: {:?}", task.path, instruction);
    }
}

The virtual machine stores a collection of Tasks, and has three responsibilities:

  1. add_task – load a task into the VM.
  2. has_runnables – check if any task still needs CPU time.
  3. execute – run one instruction for a given task.

Note: execute receives a reference to a task (&Task). We don’t take ownership of the task, because the scheduler must be able to revisit it later and resume execution.

Scheduling — Round Robin

Now that we have tasks stored in the VM, we need to schedule them.

We’re going to implement the simplest scheduling algorithm: round robin.

Round robin is:

  • fair (every task gets a turn),
  • predictable (runs tasks in order),
  • and conceptually simple (loop through the task list over and over).

Here is the scheduler loop:

pub fn run_round_robin(&mut self) {
    while self.has_runnables() {
        // loop each task once per round
        for idx in 0..self.tasks.len() {
            {
                let task_ref: &Task = &self.tasks[idx];

                if task_ref.done {
                    continue;
                }

                // print the instruction that's about to execute
                self.execute(task_ref);
            }

            // now we advance the task forward one instruction
            let task = &mut self.tasks[idx];

            if !task.done {
                task.step();

                if task.pc >= task.code.len() {
                    task.done = true;
                }
            }
        }

        // cleanup (remove finished tasks from the VM)
        self.tasks.retain(|t| !t.done);
    }
}

A few important points:

  • Only one instruction executes per task per loop — simulating time slices.
  • Each iteration of the outer while loop represents one round of CPU scheduling.
  • retain removes finished tasks so the VM doesn’t waste time checking them.

That’s cooperative multitasking: the VM does just enough work per task, then moves on.
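
One thing the loop leans on that we haven’t shown is task.step(). A minimal version of that helper — assuming a step simply advances the program counter, since execute already handles the printing — might be:

impl Task {
    /// Advance this task by one instruction.
    pub fn step(&mut self) {
        self.pc += 1;
    }
}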

Test Programs

We need some test programs to feed into this system. Here’s the code (using our assembler) that creates three programs: p1, p2, and p3.

let p1 = vec![
    Instruction::PUSH,
    Instruction::ADD,
    Instruction::SUB,
    Instruction::LOAD,
    Instruction::HALT,
];

let p2 = vec![
    Instruction::SLEEP,
    Instruction::SLEEP,
    Instruction::SLEEP,
    Instruction::NOP,
    Instruction::HALT,
];

let p3 = vec![
    Instruction::PUSH,
    Instruction::LOAD,
    Instruction::HALT,
];

assemble_to_disk(&p1, "./bins/p1.bin").expect("couldn't write binary");
assemble_to_disk(&p2, "./bins/p2.bin").expect("couldn't write binary");
assemble_to_disk(&p3, "./bins/p3.bin").expect("couldn't write binary");

With these programs, you’ll now be able to see how they “animate” through the scheduler as it chooses which to execute.

Running It

Let’s put it all together.

We’ll load three programs — each assembled into a .bin file — and execute them through our VM:

let mut vm = VM::new();

vm.add_task(Task::load("./bins/p1.bin").expect("couldn't read binary"));
vm.add_task(Task::load("./bins/p2.bin").expect("couldn't read binary"));
vm.add_task(Task::load("./bins/p3.bin").expect("couldn't read binary"));

vm.run_round_robin();

Running this produces output like the following:

[./bins/p1.bin]: PUSH
[./bins/p2.bin]: SLEEP
[./bins/p3.bin]: PUSH
[./bins/p1.bin]: ADD
[./bins/p2.bin]: SLEEP
[./bins/p3.bin]: LOAD
[./bins/p1.bin]: SUB
[./bins/p2.bin]: SLEEP
[./bins/p3.bin]: HALT
[./bins/p1.bin]: LOAD
[./bins/p2.bin]: NOP
[./bins/p1.bin]: HALT
[./bins/p2.bin]: HALT

Each task gets one instruction at a time.
They appear to multitask, but really, we are just interleaving execution.

This is exactly how early cooperative schedulers worked.

Different Algorithms

We implemented Round Robin, but operating systems use a variety of scheduling strategies. Each one optimizes something different — fairness, throughput, responsiveness, or predictability.

Here’s a breakdown of the most common ones you’ll see in real operating systems:

First-Come, First-Served (FCFS)

The simplest possible scheduler.

  • Tasks are run in the order they arrive.
  • No preemption — once a task starts running, it keeps the CPU until completion.

Pros:

  • Very predictable.
  • Easy to implement.

Cons:

  • Terrible response time — one long task can block all others (the “convoy effect”).

Used in: batch systems, print queues, embedded devices.

Shortest Job First (SJF) / Shortest Remaining Time First (SRTF)

Runs the shortest task first.

  • SJF — non-preemptive (once a task starts, it finishes).
  • SRTF — preemptive (if a new shorter task arrives, preempt the current one).

Pros:

  • Great throughput (lowest total completion time).

Cons:

  • Requires knowing how long tasks will run (hard in general).
  • Small tasks can starve large tasks.

Used in: job schedulers, long-running batch systems, HPC.

Round Robin (RR) (what we implemented)

Each task gets a fixed unit of time (called a quantum), and then moves to the back of the queue.

Pros:

  • Very fair.
  • Great for interactive workloads (UI responsiveness).

Cons:

  • If the quantum is too short: too much overhead (context switching).
  • If the quantum is too long: it behaves like FCFS.

Used in: timesharing OS kernels, early Unix schedulers.
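
In our VM the quantum is effectively one instruction per turn. A sketch of a larger quantum, reusing our Task fields and the step helper sketched earlier (QUANTUM and run_slice are names invented here), might look like:

// give each task up to QUANTUM instructions per turn
const QUANTUM: usize = 4;

fn run_slice(task: &mut Task) {
    for _ in 0..QUANTUM {
        if task.done {
            break;
        }
        task.step(); // (printing via execute omitted for brevity)
        if task.pc >= task.code.len() {
            task.done = true;
        }
    }
}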

Priority Scheduling

Each task has a priority number.

  • Always selects the runnable task with the highest priority.
  • Can be preemptive or non-preemptive.

Pros:

  • High-importance tasks get CPU time first.

Cons:

  • Starvation — low priority tasks may never run.

Used in: realtime systems, audio/video processing, embedded control software.
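
Our Task struct already carries a priority field, so a non-preemptive selection pass for our VM could look like this sketch (pick_highest_priority is a name invented here, not something the VM defines):

/// Choose the index of the runnable task with the highest priority value.
fn pick_highest_priority(tasks: &[Task]) -> Option<usize> {
    tasks
        .iter()
        .enumerate()
        .filter(|(_, t)| !t.done)
        .max_by_key(|(_, t)| t.priority)
        .map(|(idx, _)| idx)
}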

Multi-Level Feedback Queue (MLFQ) (how modern OS schedulers work)

Combines priority and round robin.

  • Multiple queues (high priority to low priority)
  • Round Robin within each queue
  • Tasks that use a lot of CPU get demoted to lower priority queues
  • Tasks that frequently yield or sleep get promoted (interactive = fast)

Pros:

  • Gives priority to interactive tasks
  • Penalizes CPU-bound tasks
  • Adapts automatically over time (no tuning per process)

Cons:

  • Harder to implement
  • Requires tracking task behavior (history)

Used in: Windows, macOS, Linux (CFS is conceptually similar).

Comparison

Algorithm                   Preemptive  Goal / Optimization               Fairness  Weaknesses
FCFS                        N           Simplicity                        Low       One long task blocks everything
SJF / SRTF                  Y/N         Lowest total completion time      Low       Starvation of long tasks
Round Robin                 Y           Responsiveness / interactivity    High      Requires good quantum tuning
Priority Scheduling         Y/N         Importance / latency sensitivity  Low       Starvation of low priorities
Multi-Level Feedback Queue  Y           Realistic, adaptive fairness      High      More complex to implement

TL;DR

Different schedulers optimize for different outcomes:

  • Fairness? → Round Robin
  • Highest priority first? → Priority Scheduling
  • Best throughput? → Shortest Job First / SRTF
  • Real OS behavior? → Multi-Level Feedback Queue

Schedulers are tradeoffs — once you understand what they optimize, you understand why real operating systems don’t use just one mechanism.

Conclusion

We started with nothing but a tiny instruction set and finished with:

  • a program format,
  • an assembler,
  • a task abstraction,
  • a virtual machine,
  • and a functioning scheduler.

The magic was never “multitasking” — it was switching to the next task at the right moment.

Schedulers are simple at their heart:

Run something for a bit. Save its state. Move on.

Now that we’ve built a round-robin scheduler, it’s easy to extend:

  • task priorities (SETPRIO),
  • YIELD instruction support,
  • blocking + sleeping tasks,
  • time-based preemption (tick interrupts).

But those are chapters for another day.

Build your own x86 Kernel Part 5

Introduction

In the last post we landed in 32-bit protected mode. Before moving on though, let’s tidy up this project a little bit so we can start managing larger pieces of code cleanly.

So far, everything has been hard-wired: ORG directives, absolute addresses, and flat binaries. That works for a boot sector, but we’re about to deal with paging, higher memory, and higher-level languages - so we need a proper linker script to take over memory layout.

Linker

A new file has been added to manage how the result of stage2.asm is laid out, and we call it stage2.ld.

The linker script gives us precise control over where each section of code and data ends up in memory — something ORG statements alone can’t handle once we start mixing multiple object files, sections, and languages.

Up until now, our binaries were assembled as raw flat images: every byte went exactly where NASM told it. But in larger systems (especially when we introduce C or Rust later), each file gets compiled separately into relocatable object files (.o). The linker then combines them — and the .ld script tells it exactly how to do that.

What does a linker script do?

At a high level, the linker script acts as a map for your binary. It tells the linker:

  • where the program starts (ENTRY(start2))
  • where each section should live in memory (. = 0x8000)
  • how to group sections from different files (.text, .data, .bss, etc.)
  • which global symbols to export for use in assembly or C (e.g. _stack_top, _kernel_end)

This becomes essential when we start paging or using virtual addresses — because physical load addresses and virtual execution addresses will differ.

Example

Here’s a minimal example of what we might use in stage2.ld.

ENTRY(start2)

SECTIONS
{
  . = 0x8000;            /* Stage 2 load address */

  .text : {
    *(.text16*)
    *(.text32*)
    *(.text*)
  }

  .rodata : { *(.rodata*) }
  .data   : { *(.data*)   }
  .bss (NOLOAD) : { *(.bss*) *(COMMON) }

  _stack_top = . + 0x1000;   /* simple 4 KiB stack symbol */
}

A file like this replaces the hard-coded layout logic from our assembly. NASM now just emits relocatable code, and the linker (ld) uses this script to position everything properly inside the final binary.

Updating the Makefile

We now assemble stage2.asm into an object file and link it with the new script to produce an ELF image:

boot/stage2.o: boot/stage2.asm
    $(NASM) -f elf32 -g -F dwarf -Wall -O0 $< -o $@

build/stage2.elf: boot/stage2.o boot/stage2.ld
    ld -T boot/stage2.ld -m elf_i386 -nostdlib -o $@ $<

boot/stage2.bin: build/stage2.elf
    objcopy -O binary $< $@
    truncate -s 8192 $@

This new process might look like extra work, but it pays off the moment we start mixing in C code and paging structures. The linker will take care of symbol addresses and memory offsets — no more hardcoded numbers scattered through our assembly.

Paging and Long Mode

Now that our build is structured and linkable, we can continue where we left off in Stage 2 — by preparing to enter 64-bit long mode.

To do that, we first need to enable paging and set up basic 64-bit page tables.

Modern x86 CPUs can only enter 64-bit mode after paging and PAE (Physical Address Extension) are enabled. That means we’ll build just enough of a paging hierarchy to identity-map the first few megabytes of physical memory — enough to run the kernel.

Understanding Paging and PAE

To execute 64-bit code, the CPU insists on PAE paging being enabled and on using the long-mode paging format (the 4-level tree). Concretely:

  • CR4.PAE = 1 (turn on PAE)
  • EFER.LME = 1 (allow long mode)
  • CR0.PG = 1 (turn on paging)
  • Then a far jump into a 64-bit code segment (CS.L=1) to start executing in 64-bit.

The 4-level tree (long-mode paging)

Long mode uses this hierarchy (all entries are 64-bit):

Level  Entries  Coverage per entry  Entry Size
PML4   512      512 GiB             8 bytes
PDPT   512      1 GiB               8 bytes
PD     512      2 MiB               8 bytes
PT     512      4 KiB               8 bytes

These tables are used in a hierarchy:

Virtual addr bits:  [47:39]   [38:30]   [29:21]   [20:12]   [11:0]
                    PML4 idx  PDPT idx  PD idx    PT idx    page offset

Tables:               PML4  ->  PDPT  ->  PD  ->  PT  ->  4 KiB page
                                           \-> 2 MiB page (PS=1, no PT)

For a minimal setup, we’ll skip the PT level by using 2 MiB pages: create PML4[0] -> PDPT[0] -> PD[0] with PS=1, which identity-maps 0x00000000–0x001FFFFF. (The setup code below also fills PD[1], extending the identity map to 0x003FFFFF.)

That’s enough for Stage 2 and the jump.
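
To make the index arithmetic concrete, here’s a small Rust sketch (purely illustrative; nothing in our build uses it) that slices a virtual address exactly the way the diagram above shows:

/// Split a canonical 48-bit virtual address into its paging indices.
fn split_vaddr(vaddr: u64) -> (usize, usize, usize, usize, usize) {
    let pml4 = ((vaddr >> 39) & 0x1FF) as usize; // bits 47:39
    let pdpt = ((vaddr >> 30) & 0x1FF) as usize; // bits 38:30
    let pd   = ((vaddr >> 21) & 0x1FF) as usize; // bits 29:21
    let pt   = ((vaddr >> 12) & 0x1FF) as usize; // bits 20:12
    let off  = (vaddr & 0xFFF) as usize;         // bits 11:0
    (pml4, pdpt, pd, pt, off)
}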

Why not a PT?

You only need a PT (page table) if you want 4 KiB granularity (e.g., guard pages, mapping tiny regions, marking some pages NX later, etc.). Early on, 2 MiB pages are simpler and faster to set up:

  • Use 2 MiB page (no PT):
    • Fewer entries to touch
    • Great for identity-mapping “just get me to 64-bit”
  • Use 4 KiB pages (needs PT):
    • Fine-grained control (per-page permissions)
    • Slightly more code and memory for the PT

Paging entries

Each paging entry (in PML4, PDPT, PD, and PT) follows the same 64-bit structure and flag semantics listed below. The only difference is which address bits and specific flag bits are valid at that level.

Each paging entry = 64 bits (low 32 bits + high 32 bits):

Bits   Name                       Meaning
0      P (Present)                Must be 1 for valid entries
1      RW (Writable)              1 = writable, 0 = read-only
2      US (User/Supervisor)       0 = kernel-only, 1 = user-accessible
3      PWT (Page Write-Through)   Cache control bit (leave 0)
4      PCD (Page Cache Disable)   Cache disable bit (leave 0)
5      A (Accessed)               CPU sets when accessed
6      D (Dirty)                  For pages only (set by CPU)
7      PS (Page Size)             0 = points to next table, 1 = large page (2 MiB or 1 GiB); acts as PAT in a PTE
8      G (Global)                 Optional: prevents TLB flush on CR3 reload
9–11   Available                  Ignored by hardware; usable for OS bookkeeping
12–51  Physical Address           Base address of next-level table or physical page
52–62  Available / Reserved       Ignored by hardware
63     NX (No Execute)            1 = non-executable (if EFER.NXE = 1)

How that applies per level:

Level  Structure             What the address field points to              Special bits
PML4E  PML4 entry            Physical base of PDPT                          P, RW, US as above
PDPTE  PDPT entry            Physical base of PD (or 1 GiB page if PS = 1)  PS = 1 → 1 GiB page
PDE    Page directory entry  Physical base of PT (or 2 MiB page if PS = 1)  PS = 1 → 2 MiB page
PTE    Page table entry      Physical base of 4 KiB page                    PS ignored
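
As a cross-check of the entry layout, here’s a small Rust sketch (illustrative only) that assembles a 2 MiB PD entry, mirroring the PTE_P/PTE_RW/PTE_PS flags our assembly defines below:

const PTE_P: u64 = 1 << 0;  // present
const PTE_RW: u64 = 1 << 1; // writable
const PTE_PS: u64 = 1 << 7; // page size (2 MiB at the PD level)

/// Build a PD entry that maps a 2 MiB page at `phys`.
fn pde_2mib(phys: u64) -> u64 {
    assert_eq!(phys & 0x1F_FFFF, 0, "2 MiB pages must be 2 MiB-aligned");
    phys | PTE_P | PTE_RW | PTE_PS
}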

Setup

To set up these page entries, we configure the flags that we need at each level.

; zero 4 KiB table at EDI
%macro ZERO_PAGE 1
  mov edi, %1
  xor eax, eax
  mov ecx, 4096/4
  rep stosd
%endmacro

%define PTE_P     (1 << 0)   ; present
%define PTE_RW    (1 << 1)   ; writable
%define PTE_PS    (1 << 7)   ; page size (1 = 2 MiB)
%define PTE_FLAGS (PTE_P | PTE_RW)

setup_paging:
  ZERO_PAGE pml4
  ZERO_PAGE pdpt
  ZERO_PAGE pd

  ; PML4[0] -> PDPT
  mov   eax, pdpt
  or    eax, PTE_FLAGS
  mov   [pml4], eax
  mov   dword [pml4 + 4], 0      ; high dword = 0

  ; PDPT[0] -> PD
  mov   eax, pd
  or    eax, PTE_FLAGS
  mov   [pdpt], eax
  mov   dword [pdpt + 4], 0

  ; PD[0] -> 2 MiB page starting at 0x00000000
  mov   eax, 0x00000000 | PTE_FLAGS | PTE_PS
  mov   [pd], eax
  mov   dword [pd + 4], 0

  ; PD[1] -> 2 MiB page starting at 0x00200000
  mov   eax, 0x00200000 | PTE_FLAGS | PTE_PS
  mov   [pd + 8], eax
  mov   dword [pd + 8 + 4], 0

  ret

We can then wrap up the setting of this structure behind a function call, and enable paging:

    call  setup_paging
    
    mov   eax, pml4
    mov   cr3, eax           ; set root paging structure
    
    mov   eax, cr4
    or    eax, (1 << 5)      ; PAE = Physical Address Extension
    mov   cr4, eax
    
    mov   ecx, 0xC0000080    ; EFER MSR
    rdmsr
    or    eax, (1 << 8)      ; enable long mode (LME)
    wrmsr
    
    mov   eax, cr0
    or    eax, (1 << 31)     ; enable paging
    mov   cr0, eax

At this point, all of our paging setup is complete. We now have the first 2 MiB identity mapped, ready for our kernel to use.

Long Mode

We’re almost ready to head over to long mode now.

We do need to add two more descriptors to our GDT: one for 64-bit code, and one for 64-bit data.

; GDT
gdt:
    dq 0x0000000000000000         ; 0x00: null
    dq 0x00CF9A000000FFFF         ; 0x08: 32-bit code (base=0, limit=4GiB)
    dq 0x00CF92000000FFFF         ; 0x10: 32-bit data (base=0, limit=4GiB)
    dq 0x00209A0000000000         ; 0x18: 64-bit code  (L=1, D=0, G=0 ok)
    dq 0x0000920000000000         ; 0x20: 64-bit data  (L ignored)

gdt_desc:
    dw gdt_end - gdt - 1
    dd gdt
gdt_end:

For the 64-bit code descriptor: Access = 0x9A, Flags = 0x20 (L=1). Granularity (G) and Limit are ignored in long mode, so this minimalist form is fine.

Now we can push on to long mode.

lgdt  [gdt_desc]            ; GDT that includes 0x18 (64-bit CS)
jmp   0x18:long_entry       ; load CS with L=1 → CPU switches to 64-bit

This means that we need a new block of 64-bit code to jump to:

BITS 64
SECTION .text64

extern _stack_top    ; from your linker script

long_entry:
  mov     ax, 0x20         ; 64-bit data selector
  mov     ds, ax
  mov     es, ax
  mov     ss, ax
  mov     fs, ax
  mov     gs, ax

  mov     rsp, _stack_top  ; top of your stage2 stack from .ld

  lea     rsi, [rel msg_long]
  call    serial_puts64

.hang:
  hlt
  jmp     .hang

%include "boot/serial64.asm"

SECTION .rodata

msg_long db "64-bit Long mode: Enabled", 13, 10, 0

Of course, we’ve had to re-implement the serial library to support 64-bit mode as well.

BITS 64

%define COM1 0x3F8

serial_wait_tx64:
  push rdx
  push rax

  mov  dx, COM1+5
.wait:
  in   al, dx
  test al, 0x20
  jz   .wait
  
  pop  rax
  pop  rdx

  ret

; AL = byte
serial_putc64:
  push rdx
  
  call serial_wait_tx64
  
  mov  dx, COM1
  out  dx, al
  
  pop  rdx
  
  ret

; RSI -> zero-terminated string
serial_puts64:
  push rax

.next:
  lodsb
  test al, al
  jz   .done
  ; translate '\n' -> "\r\n"
  cmp  al, 10
  jne  .send
  push rax
  mov  al, 13
  call serial_putc64
  pop  rax
.send:
  call serial_putc64
  jmp  .next
.done:
  pop  rax

  ret

Building and running

Once we’ve got this built, we can see that we’ve successfully jumped across to long mode.

qemu-system-x86_64 -drive file=os.img,format=raw,if=ide,media=disk -serial stdio -debugcon file:debug.log -global isa-debugcon.iobase=0xe9 -display none -no-reboot -no-shutdown -d guest_errors,cpu_reset -D qemu.log
Booting ...
Starting Stage2 ...
Stage2: OK
A20 Line: Enabled
GDT: Loaded
Protected Mode: Enabled
Paging: Enabled
64-bit Long mode: Enabled

Conclusion

We cleaned up the build with a linker script, set up a minimal long-mode paging tree (PML4 → PDPT → PD with 2 MiB pages), extended the GDT with 64-bit descriptors, and executed the CR4/EFER/CR0 sequence to reach 64-bit long mode—while keeping serial output alive the whole way. The result is a small but realistic bootstrap that moves from BIOS real mode to protected mode to long mode using identity-mapped memory and clean section layout.

In the next part we’ll start acting like an OS: map more memory (and likely move to a higher-half layout), add early IDT exception handlers, and bring up a simple 64-bit “kernel entry” that can print, panic cleanly, and prepare for timers/interrupts.

Build your own x86 Kernel Part 4

Introduction

In Part 3 we finalised our boot loader, so that it now successfully loads Stage 2 for us. In this post, we’ll focus on setting up the system so that we can unlock more advanced features.

Inside of Stage 2 we’ll look at setting up the following:

  • Enable the A20 line
  • Set up a Global Descriptor Table (GDT)
  • Switch to 32-bit Protected Mode

By the end of this article, we’ll at least be in 32-bit protected mode.

A20 Line

Before we can enter 32-bit protected mode, we need to enable the A20 line.

Back in the original Intel 8086, there were only 20 address lines — A0 through A19 — meaning it could address 1 MiB of memory (from 0x00000 to 0xFFFFF). When Intel introduced the 80286, it gained more address lines and could access memory above 1 MiB. However, to remain compatible with older DOS software that relied on address wrap-around (where 0xFFFFF + 1 rolled back to 0x00000), IBM added a hardware gate: A20.

When the A20 line is disabled, physical address bit 20 is forced to 0. So addresses “wrap” every 1 MiB — 0x100000 looks the same as 0x000000.

When A20 is enabled, memory above 1 MiB becomes accessible. Protected-mode code, paging, and modern kernels all assume that A20 is on.
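
Put another way, the gate simply masks physical bit 20. A one-line model of the behaviour (a sketch, not anything we run):

// with A20 disabled, physical bit 20 is forced to zero
fn a20_off(addr: u32) -> u32 {
    addr & !(1 << 20)
}

// a20_off(0x100000) == 0x000000: 1 MiB wraps back to 0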

Enabling A20

To enable the A20 Line, we use the Fast A20 Gate (port 0x92).

Most modern systems and emulators expose bit 1 of port 0x92 (the “System Control Port A”) as a direct A20 enable bit.

  • Bit 0 — system reset (don’t touch this)
  • Bit 1 — A20 gate (1 = enabled)

We add the following to do this:

%define A20_GATE  0x92

in    al, A20_GATE      ; read system control port A
or    al, 0x02          ; set bit 1 (A20 enable)
and   al, 0xFE          ; clear bit 0 (reset)
out   A20_GATE, al

Global Descriptor Table (GDT)

When the CPU is in real mode, memory addressing is done through segment:offset pairs. Each segment register (CS, DS, SS, etc.) represents a base address (shifted left by 4), and the offset is added to that. This gives you access to 1 MiB of address space — the legacy 8086 model.

When we switch to protected mode, the segmentation model changes. Instead of using raw segment values, each segment register now holds a selector — an index into a table called the Global Descriptor Table (GDT).

The GDT tells the CPU what each segment means:

  • Its base address
  • size (limit)
  • access rights
  • flags like “code or data”, “read/write”, or “privilege level”

The descriptor layout in 32-bit mode looks like this:

Bits   Field         Description
0-15   Limit (low)   Segment limit (low 16 bits)
16-31  Base (low)    Segment base address (low 16 bits)
32-39  Base (mid)    Segment base (middle 8 bits)
40-47  Access Byte   Type, privilege level, presence
48-51  Limit (high)  High 4 bits of segment limit
52-55  Flags         Granularity, 32-bit flag, etc.
56-63  Base (high)   Segment base (high 8 bits)

In our boot setup, we’ll create a very simple GDT with:

  • A null descriptor (required; selector 0 is invalid by design).
  • A code segment descriptor — flat 4 GiB region, readable, executable.
  • A data segment descriptor — flat 4 GiB region, readable, writable.

This gives us a flat memory model, where all segments start at base 0 and cover the entire address space. That makes protected mode addressing behave almost like real mode linear memory, simplifying everything until paging and virtual memory come later.

Once that GDT is loaded with lgdt, we can safely set the PE (Protection Enable) bit in CR0 and perform a far jump into 32-bit protected mode code.

Defining the GDT

We define our GDT as three quad words. One for null, one for code, and one for data.

align 8
; --- GDT for entering 32-bit PM (null, code, data) ---
gdt32:
    dq 0x0000000000000000         ; null
    dq 0x00CF9A000000FFFF         ; 0x08: 32-bit code, base=0, limit=4GiB
    dq 0x00CF92000000FFFF         ; 0x10: 32-bit data, base=0, limit=4GiB

gdt32_desc:
    dw gdt32_end - gdt32 - 1      ; limit = (size of GDT - 1)
    dd gdt32                      ; base  = address of GDT
gdt32_end:

Breaking down the 32-bit code GDT:

0x00CF9A000000FFFF

If we split this into its component fields (low bits first):

[FFFF] [0000] [00][9A] [CF][00]

We can now start to map these to the fields:

Field                   Value   Meaning
Limit (low 16)          0xFFFF  segment limit = 0xFFFF
Base (low 16)           0x0000  base = 0x00000000
Base (mid 8)            0x00    base = 0x00000000
Access Byte             0x9A    flags that define “code, ring 0, present”
Limit (high 4) + flags  0xCF    limit high nibble = 0xF, flags = 0xC
Base (high 8)           0x00    base = 0x00000000

The 20-bit limit of 0xFFFFF and the granularity bit (G=1) combine to make the segment effectively 4 GiB in size ((0xFFFFF + 1) × 4 KiB = 4 GiB).
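
As a sanity check on that breakdown, here’s a small Rust sketch (illustrative only; gdt_descriptor is a name invented here) that packs the fields back into the 64-bit descriptor:

/// Pack a 32-bit GDT descriptor from its fields.
fn gdt_descriptor(base: u32, limit: u32, access: u8, flags: u8) -> u64 {
    let (base, limit) = (base as u64, limit as u64);
    (limit & 0xFFFF)                     // limit 15:0
        | ((base & 0xFFFF) << 16)        // base 15:0
        | (((base >> 16) & 0xFF) << 32)  // base 23:16
        | ((access as u64) << 40)        // access byte
        | (((limit >> 16) & 0xF) << 48)  // limit 19:16
        | (((flags as u64) & 0xF) << 52) // flags nibble
        | (((base >> 24) & 0xFF) << 56)  // base 31:24
}

// gdt_descriptor(0, 0xFFFFF, 0x9A, 0xC) == 0x00CF9A000000FFFF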

Loading the GDT

Now that we have our GDT defined, we can use lgdt to load it.

cli
lgdt  [gdt32_desc]
mov   eax, cr0
or    eax, 1                   ; CR0.PE=1
mov   cr0, eax

The operand to lgdt is a 6-byte pseudo-descriptor: a 16-bit limit first, then a 32-bit linear address pointing to where the GDT starts.

Protected Mode

With the GDT now loaded, we’re free to push over to protected mode. This is 32-bit protected mode, so we’re jumping into code that needs the [BITS 32] directive.

  ; selectors: 0x08 = code32, 0x10 = data32
  jmp   0x08:pm_entry            ; far jump to load 32-bit CS

[BITS 32]
pm_entry:
  mov   ax, 0x10                 ; 0x10 = data32
  mov   ds, ax
  mov   es, ax
  mov   ss, ax
  mov   fs, ax
  mov   gs, ax
  mov   esp, 0x90000             ; temporary 32-bit stack  

.hang:
  hlt
  jmp   .hang

We make our far jump into 32-bit land. This jump both updates CS and flushes the prefetch queue — it’s the required way to officially enter protected mode.

Immediately after, we set all of our segment selectors to 0x10, which is the data GDT entry.

We’re now in 32-bit protected mode.

Stage 2 (full listing)

Our current code for Stage 2 now looks like this:

; ---------------------------------------------------------
; boot/stage2.asm — loaded by MBR at 0000:8000 (LBA 1..16)
; ---------------------------------------------------------
BITS 16
ORG  0x8000

%define A20_GATE          0x92

start2:
  cli
  xor   ax, ax
  mov   ds, ax        ; ds = 0 so labels assembled with ORG work as absolute
  mov   es, ax
  cld                 ; count upwards
  sti

  call  serial_init

  mov   si, stage2_msg
  call  serial_puts

  in    al, A20_GATE          ; A20 fast
  or    al, 0x02
  and   al, 0xFE
  out   A20_GATE, al

  mov   si, a20_msg
  call  serial_puts

  cli
  lgdt  [gdt32_desc]
  mov   eax, cr0
  or    eax, 1                   ; CR0.PE=1
  mov   cr0, eax

  mov   si, gdt_msg
  call  serial_puts

  ; selectors: 0x08 = code32, 0x10 = data32
  jmp   0x08:pm_entry            ; far jump to load 32-bit CS

stage2_msg db "Stage2: OK", 13, 10, 0
a20_msg    db "A20 Line: Enabled", 13, 10, 0
gdt_msg    db "GDT: Loaded", 13, 10, 0

%include "boot/serial16.asm"

[BITS 32]
pm_entry:
  mov   ax, 0x10                 ; 0x10 = data32
  mov   ds, ax
  mov   es, ax
  mov   ss, ax
  mov   fs, ax
  mov   gs, ax
  mov   esp, 0x90000             ; temporary 32-bit stack  

  mov   esi, pm_msg
  call  serial_puts32

.hang:
  hlt
  jmp   .hang


align 8
; --- GDT for entering 32-bit PM (null, code, data) ---
gdt32:
    dq 0x0000000000000000         ; null
    dq 0x00CF9A000000FFFF         ; 0x08: 32-bit code, base=0, limit=4GiB
    dq 0x00CF92000000FFFF         ; 0x10: 32-bit data, base=0, limit=4GiB

gdt32_desc:
    dw gdt32_end - gdt32 - 1
    dd gdt32
gdt32_end:

pm_msg db "Entered protected mode ...", 13, 10, 0

%include "boot/serial32.asm"

Notes

I’ve had to duplicate the serial assembly file. Originally it was 16 bits only, but now we need 32-bit support.

These routines look a lot like their 16-bit counterparts:

; ---------------------------------------------------------
; serial32.asm — COM1 (0x3F8) UART helpers for 32-bit PM
; ---------------------------------------------------------
[BITS 32]

%define COM1 0x3F8
; LSR bits: 0x20 = THR empty, 0x40 = TSR empty

; init: 115200 8N1, FIFO on
serial_init32:
    push eax
    push edx
    ; IER=0 (disable UART interrupts)
    mov  dx, COM1 + 1
    xor  eax, eax
    out  dx, al
    ; DLAB=1
    mov  dx, COM1 + 3
    mov  al, 0x80
    out  dx, al
    ; divisor = 1 (DLL=1, DLM=0)
    mov  dx, COM1 + 0
    mov  al, 0x01
    out  dx, al
    mov  dx, COM1 + 1
    xor  al, al
    out  dx, al
    ; 8N1, DLAB=0
    mov  dx, COM1 + 3
    mov  al, 0x03
    out  dx, al
    ; FIFO enable/clear, 14-byte trigger
    mov  dx, COM1 + 2
    mov  al, 0xC7
    out  dx, al
    ; MCR: DTR|RTS|OUT2
    mov  dx, COM1 + 4
    mov  al, 0x0B
    out  dx, al
    pop  edx
    pop  eax
    ret

; wait until THR empty
serial_wait_tx32:
    push eax
    push edx
    mov  dx, COM1 + 5
.wait:
    in   al, dx
    test al, 0x20
    jz   .wait
    pop  edx
    pop  eax
    ret

; putc: AL = character
serial_putc32:
    push edx
    call serial_wait_tx32
    mov  dx, COM1
    out  dx, al
    pop  edx
    ret

; putc with '\n' -> "\r\n"
serial_putc_nl32:
    cmp  al, 10              ; '\n'
    jne  .send
    push eax
    mov  al, 13              ; '\r'
    call serial_putc32
    pop  eax
.send:
    jmp  serial_putc32

; puts: ESI -> zero-terminated string
serial_puts32:
    push eax
    push esi
.next:
    lodsb                    ; AL = [ESI], ESI++
    test al, al
    jz   .done
    call serial_putc_nl32
    jmp  .next
.done:
    pop  esi
    pop  eax
    ret

Running

Getting this built and running now, we can see that we’re successfully in 32-bit protected mode.

➜ make run  
qemu-system-x86_64 -drive file=os.img,format=raw,if=ide,media=disk -serial stdio -debugcon file:debug.log -global isa-debugcon.iobase=0xe9 -display none -no-reboot -no-shutdown -d guest_errors,cpu_reset -D qemu.log
Booting ...
Starting Stage2 ...
Stage2: OK
A20 Line: Enabled
GDT: Loaded
Entered protected mode ...

Conclusion

We’ve now built the minimal foundation of a protected-mode operating system: flat memory model, GDT, and a working serial console. From this point on, we can start using true 32-bit instructions and data structures. In the next post, we’ll extend this with an Interrupt Descriptor Table (IDT), the Programmable Interval Timer (PIT), and paging, preparing the system for 64-bit long mode.

Build your own x86 Kernel Part 3

Introduction

In Part 2 we wired up a serial port so we could see life signs from our bootloader. Now we’re going to take the next big step — load a second stage from disk. We’ll keep stage2 simple for now - we’ll just prove that control has been transferred.

Our 512-byte boot sector is tiny, so it’ll stay simple: it loads the next few sectors (Stage 2) into memory and jumps there. Stage 2 will then have a lot more room to move when it comes to setting up the processor.

Finishing Stage 1

Before we can get moving with Stage 2, the first stage of our boot process still has a few things left to do.

The BIOS hands our boot sector the boot drive number in DL (e.g., 0x80 for the first HDD, 0x00 for the floppy).

We need to stash that away for later.

mov   [boot_drive], dl

Disk Address Packet (DAP)

The BIOS, via int 0x13 (AH=0x42), provides an extended read function that will allow us to read Stage 2 off disk and into memory. It uses a 16-byte structure pointed to by DS:SI:

Offset  Size  Field
0x00    1     Size of packet (16 for basic)
0x01    1     Reserved (0)
0x02    2     Number of sectors to read
0x04    2     Buffer offset (destination)
0x06    2     Buffer segment (destination)
0x08    8     Starting LBA (64-bit, little-endian)

We fill this structure like so:

%define STAGE2_SEG      0x0000
%define STAGE2_OFF      0x8000
%define STAGE2_LBA      1
%define STAGE2_SECTORS  16

mov   si, dap           ; DAP for stage2 -> 0000:8000
mov   byte [si], 16
mov   byte [si + 1], 0
mov   word [si + 2], STAGE2_SECTORS
mov   word [si + 4], STAGE2_OFF
mov   word [si + 6], STAGE2_SEG
mov   dword [si + 8], STAGE2_LBA
mov   dword [si + 12], 0
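
For readers more comfortable with struct layouts, the same 16-byte packet could be sketched in Rust notation (purely illustrative; the boot sector itself stays in assembly):

// hypothetical mirror of the BIOS Disk Address Packet, to visualise the offsets
#[repr(C, packed)]
struct DiskAddressPacket {
    size: u8,         // 0x00: packet size (16)
    reserved: u8,     // 0x01: must be zero
    sectors: u16,     // 0x02: number of sectors to read
    buf_offset: u16,  // 0x04: destination offset
    buf_segment: u16, // 0x06: destination segment
    lba: u64,         // 0x08: starting LBA (little-endian)
}

// core::mem::size_of::<DiskAddressPacket>() == 16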

And now we can actually load it off of disk:

mov   dl, [boot_drive]
mov   ah, 0x42
mov   si, dap
int   0x13
jc    disk_error

This call reads the requested number of sectors, starting at the given LBA, into the segment:offset buffer specified in the DAP.

If you recall, we set up our stack at 0x7000. By loading Stage 2 at 0x8000 with 16 sectors (8 KiB), Stage 2 will occupy 0x8000..0x9FFF, so there won’t be a collision.

After this call we either have Stage 2 successfully loaded at STAGE2_SEG:STAGE2_OFF or the carry flag will be set; in which case, we have an error.

If everything has gone ok, we can use a far jmp to transfer control there in real mode.

jmp   STAGE2_SEG:STAGE2_OFF

Now that we’ve got a bit more space to work with, we can set more things up (video, disk I/O, the A20 line, a GDT, etc.).

Boot loader

Here’s a full rundown of the boot loader so far:

; ---------------------------------------------------------
; boot/boot.asm: Main boot loader
; ---------------------------------------------------------
;
BITS 16
ORG  0x7C00

%define STAGE2_SEG      0x0000
%define STAGE2_OFF      0x8000
%define STAGE2_LBA      1
%define STAGE2_SECTORS  16

main:
  cli
  
  xor   ax, ax
  mov   ss, ax
  mov   bp, 0x7000
  mov   sp, bp            ; temp stack setup (so it's below code)

  mov   ds, ax            ; DS = 0 -> labels are absolute 0x7Cxx
  mov   es, ax            ; ES = 0

  cld                     ; lods/stos auto-increment

  sti

  mov   [boot_drive], dl  ; remember the BIOS drive

  call  serial_init

  mov   si, boot_msg
  call  serial_puts

  mov   si, dap           ; DAP for stage2 -> 0000:8000
  mov   byte [si], 16
  mov   byte [si + 1], 0
  mov   word [si + 2], STAGE2_SECTORS
  mov   word [si + 4], STAGE2_OFF
  mov   word [si + 6], STAGE2_SEG
  mov   dword [si + 8], STAGE2_LBA
  mov   dword [si + 12], 0

  mov   ax, STAGE2_SEG
  mov   es, ax
  mov   dl, [boot_drive]
  mov   ah, 0x42
  mov   si, dap
  int   0x13
  jc    disk_error

  mov   si, stage2_msg
  call  serial_puts
  jmp   STAGE2_SEG:STAGE2_OFF

disk_error:
  mov   si, derr_msg
  call  serial_puts

.hang:
  hlt
  jmp   .hang

%include "boot/serial.asm"

boot_msg    db "Booting ...", 13, 10, 0
stage2_msg  db "Starting Stage2 ...", 13, 10, 0
derr_msg    db "Disk error!", 13, 10, 0

boot_drive  db 0
dap:        db 16, 0
            dw 0, 0, 0
            dd 0, 0

times 510-($-$$) db 0
dw 0AA55h

If we were to run this now, without a Stage 2 in place, we should pretty reliably get a “Disk error!”:

qemu-system-x86_64 -drive file=os.img,format=raw,if=ide,media=disk -serial stdio -debugcon file:debug.log -global isa-debugcon.iobase=0xe9 -display none -no-reboot -no-shutdown -d guest_errors,cpu_reset -D qemu.log
Booting ...
Disk error!

Stage 2

Our Stage 2 runs in real mode, but it’s free of the 512-byte limit that our boot loader had. We’ll keep the implementation very simple right now, just to prove that we’ve jumped over to Stage 2 - and fill it out later.

; ---------------------------------------------------------
; boot/stage2.asm — loaded by MBR at 0000:8000 (LBA 1..16)
; ---------------------------------------------------------
BITS 16
ORG  0x8000           ; the offset where we were loaded to by MBR

start2:
  cli
  xor   ax, ax
  mov   ds, ax        ; ds = 0 so labels assembled with ORG work as absolute
  mov   es, ax
  cld                 ; count upwards
  sti

  call  serial_init

  mov   si, stage2_msg
  call  serial_puts

.hang:
  hlt
  jmp   .hang


stage2_msg db "Stage2: OK", 13, 10, 0

%include "boot/serial.asm"

Building

We need to include Stage 2 as a part of the build now in the Makefile. Not only do we need to assemble this, but it needs to make it into our final os image:

boot/boot.bin: boot/boot.asm
    $(NASM) -f bin $< -o $@

boot/stage2.bin: boot/stage2.asm
    $(NASM) -f bin $< -o $@
    truncate -s 8192  $@

os.img: boot/boot.bin boot/stage2.bin
    rm -f $@
    dd if=boot/boot.bin   of=$@ bs=512 count=1 conv=notrunc
    dd if=boot/stage2.bin of=$@ bs=512 seek=1  conv=notrunc
    truncate -s $$((32*512)) $@

After stage2.bin is assembled, you can see we pad it out to a full 8 KiB, which is our 16 sectors. This gets appended after the boot loader in the image.

With this very simple Stage 2 in place, if we give it a quick build and run we should be able to confirm that we’re up and running in Stage 2.

➜  make run    
qemu-system-x86_64 -drive file=os.img,format=raw,if=ide,media=disk -serial stdio -debugcon file:debug.log -global isa-debugcon.iobase=0xe9 -display none -no-reboot -no-shutdown -d guest_errors,cpu_reset -D qemu.log
Booting ...
Starting Stage2 ...
Stage2: OK

Conclusion

We’ve made it to Stage 2, and we’ve got a great base to work from. In the upcoming posts in this series we’ll start to use Stage 2 to set up more of the boot process.