Writing “pure syscall” assembly can be fun and educational — right up until you find yourself rewriting strlen, strcmp, line input, formatting, and file handling for the tenth time.
If you’re building tooling (monitors, debuggers, CLIs, experiments), the fastest path is often to write your core logic in assembly and call out to glibc for the boring parts.
In today’s article, we’ll walk through a basic example to get you up and running. You should quickly see just how thin the C language really is as a layer over assembly and the machine itself.
Hello, world
We’ll start with a simple “Hello, world” style application.
BITS 64
DEFAULT REL

extern puts
global main

section .rodata
msg db "Hello from NASM + glibc (puts)!", 0

section .text
main:
    ; puts(const char *s)
    lea rdi, [rel msg]
    call puts wrt ..plt    ; <-- PIE-friendly call via PLT
    xor eax, eax           ; return 0
    ret
Let’s break this down.
BITS 64
DEFAULT REL
First, we tell the assembler that we’re generating code for x86-64 using the BITS directive.
DEFAULT REL changes the default addressing mode in 64-bit assembly from absolute addressing to RIP-relative addressing. This is an important step when writing modern position-independent code (PIC), and allows the resulting executable to work correctly with security features like Address Space Layout Randomisation (ASLR).
extern puts
Functions that are implemented outside our module are resolved at link time. Since the implementation of puts lives inside glibc, we declare it as an external symbol.
global main
The true entry point of a Linux program is _start. When you write a fully standalone binary, you need to define this yourself.
Because we’re linking against glibc, the C runtime provides the startup code for us. Internally, this eventually calls our main function. To make this work, we simply mark main as global so the linker can find it.
section .rodata
msg db "Hello from NASM + glibc (puts)!", 0
Here we define our string in the read-only data section (.rodata). From a C perspective, this is equivalent to storing a const char *.
section .text
main:
This marks the beginning of our executable code and defines the main entry point.
lea rdi, [rel msg]
call puts wrt ..plt
This is where we actually print the message.
According to the x86-64 System V ABI (used by Linux and glibc), function arguments are passed in registers using the following order:
rdi
rsi
rdx
rcx
r8
r9
Floating-point arguments are passed in XMM registers.
We load the address of our string into rdi, then call puts.
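To get a feel for the register order beyond a single argument, here is a sketch of a three-argument printf call (the format string and values are invented for illustration; note that variadic functions additionally expect al to hold the number of vector registers used):

```nasm
extern printf

section .rodata
fmt db "%s %d", 10, 0      ; format string: a string, an int, a newline
txt db "value:", 0

section .text
    lea rdi, [rel fmt]     ; 1st argument -> rdi
    lea rsi, [rel txt]     ; 2nd argument -> rsi
    mov edx, 42            ; 3rd argument -> rdx
    xor eax, eax           ; al = 0: no vector args (variadic convention)
    call printf wrt ..plt
```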
The wrt ..plt modifier tells NASM to generate a call through the Procedure Linkage Table (PLT). This is required for producing position-independent executables (PIE), which are the default on many modern Linux systems. Without this, the linker may fail or produce non-relocatable binaries.
xor eax, eax
ret
Finally, we return zero from main by clearing eax. Control then returns to glibc, which performs cleanup and exits back to the operating system.
Building
We first assemble the file into an object file:
nasm -felf64 hello.asm -o hello.o
Next, we link it using gcc. This automatically pulls in glibc and the required runtime startup code:
gcc hello.o -o hello
On many modern Linux distributions, position-independent executables are enabled by default. If you encounter relocation errors during linking, you can explicitly enable PIE support:
gcc -fPIE -pie hello.o -o hello
Or temporarily disable it while experimenting:
gcc -no-pie hello.o -o hello
The PLT-based call form shown earlier works correctly in both cases.
Conclusion
Calling glibc from NASM is one of those “unlock” moments.
You retain full control over registers, memory layout, and calling conventions — while gaining access to decades of well-tested functionality for free.
Instead of rewriting basic infrastructure, you can focus your energy on the interesting low-level parts of your project.
For tools like debuggers, monitors, loaders, and CLIs, this hybrid approach often provides the best balance between productivity and control.
In the next article, we’ll build a small interactive REPL in NASM using getline, strcmp, and printf, and start layering real debugger-style functionality on top.
Assembly doesn’t have to be painful — it just needs the right leverage.
In a previous post I walked through building PostgreSQL extensions in C. It worked, but the process reminded me why systems programming slowly migrated away from raw C for anything larger than a weekend hack. Writing even a trivial function required boilerplate macros, juggling PG_FUNCTION_ARGS, and carefully tiptoeing around memory contexts.
This time, we’re going to do the same thing again — but in Rust.
Using the pgrx framework, you can build fully-native Postgres extensions with:
no hand-written SQL wrappers
no PGXS Makefiles
no manual tuple construction
no palloc/pfree memory management
a hot-reloading development Postgres
and zero unsafe code unless you choose to use it
Let’s walk through the entire process: installing pgrx, creating a project, adding a function, and calling it from Postgres.
1. Installing pgrx
Install the pgrx cargo subcommand:
cargo install --locked cargo-pgrx
Before creating an extension, pgrx needs to know which versions of Postgres you want to target.
Since I’m running PostgreSQL 17, I simply asked pgrx to download and manage its own copy:
cargo pgrx init --pg17 download
This is important.
Instead of installing into /usr/share/postgresql (which requires root and is generally a bad idea), pgrx keeps everything self-contained under:
2. Creating the Project
Create the extension project (named hello_rustpg here, to match the function we'll define shortly):
cargo pgrx new hello_rustpg
cd hello_rustpg
When you compile the project, pgrx automatically generates SQL wrappers and installs everything into its own Postgres instance.
3. A Minimal Rust Function
Open src/lib.rs and add:
use pgrx::prelude::*;

pgrx::pg_module_magic!();

#[pg_extern]
fn hello_rustpg() -> &'static str {
    "Hello from Rust + pgrx on Postgres 17!"
}
That’s all you need.
pgrx generates the SQL wrapper for you, handles type mapping, and wires everything into Postgres.
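To get a feel for that type mapping, here's a small sketch of our own (the function name add_nums is hypothetical, not part of the original example): pgrx maps Rust integer types to the corresponding SQL types without any glue code on our part.

```rust
// Hypothetical example: pgrx maps i32 <-> INTEGER and i64 <-> BIGINT for us.
#[pg_extern]
fn add_nums(a: i32, b: i32) -> i64 {
    a as i64 + b as i64
}
```

After a rebuild, SELECT add_nums(2, 3); should return 5.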
4. Running It Inside Postgres
Start your pgrx-managed Postgres 17 instance:
cargo pgrx run pg17
Inside psql:
CREATE EXTENSION hello_rustpg;
SELECT hello_rustpg();
Result:
hello_rustpg
-------------------------------
Hello from Rust + pgrx on Postgres 17!
(1 row)
Done. A working native extension — no Makefiles, no C, no segfaults.
5. Returning a Table From Rust
Let’s do something a little more interesting: return rows.
Replace your src/lib.rs with:
use pgrx::prelude::*;
use pgrx::spi::SpiResult;

pgrx::pg_module_magic!(name, version);

#[pg_extern]
fn hello_hello_rustpg() -> &'static str {
    "Hello, hello_rustpg"
}

#[pg_extern]
fn list_tables() -> TableIterator<'static, (name!(schema, String), name!(table, String))> {
    let sql = "
        SELECT schemaname::text AS schemaname,
               tablename::text  AS tablename
        FROM pg_tables
        WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
        ORDER BY schemaname, tablename;
    ";

    let rows = Spi::connect(|client| {
        client
            .select(sql, None, &[])?
            .map(|row| -> SpiResult<(String, String)> {
                let schema: Option<String> = row["schemaname"].value()?;
                let table: Option<String> = row["tablename"].value()?;
                Ok((schema.expect("schemaname null"), table.expect("tablename null")))
            })
            .collect::<SpiResult<Vec<_>>>()
    })
    .expect("SPI failed");

    TableIterator::new(rows.into_iter())
}
Re-run:
cargo pgrx run pg17
Then:
SELECT * FROM list_tables();
If you don’t have any tables, your list will be empty. Otherwise you’ll see something like:
schema | table
--------+-------------
public | names
public | order_items
public | orders
public | users
(4 rows)
This is the point where Rust starts to feel like cheating:
you’re returning tuples without touching TupleDesc, heap_form_tuple(), or any of Postgres’s internal APIs.
6. Accessing Catalog Metadata (Optional but Fun)
Here’s one more example: listing foreign keys.
#[pg_extern]
fn list_foreign_keys() -> TableIterator<
    'static,
    (
        name!(table_name, String),
        name!(column_name, String),
        name!(foreign_table_name, String),
        name!(foreign_column_name, String),
    ),
> {
    let sql = r#"
        SELECT
            tc.table_name::text   AS table_name,
            kcu.column_name::text AS column_name,
            ccu.table_name::text  AS foreign_table_name,
            ccu.column_name::text AS foreign_column_name
        FROM information_schema.table_constraints AS tc
        JOIN information_schema.key_column_usage AS kcu
            ON tc.constraint_name = kcu.constraint_name
            AND tc.table_schema = kcu.table_schema
        JOIN information_schema.constraint_column_usage AS ccu
            ON ccu.constraint_name = tc.constraint_name
            AND ccu.table_schema = tc.table_schema
        WHERE tc.constraint_type = 'FOREIGN KEY'
        ORDER BY tc.table_name, kcu.column_name;
    "#;

    let rows = Spi::connect(|client| {
        client
            .select(sql, None, &[])?
            .map(|row| -> SpiResult<(String, String, String, String)> {
                let t: Option<String> = row["table_name"].value()?;
                let c: Option<String> = row["column_name"].value()?;
                let ft: Option<String> = row["foreign_table_name"].value()?;
                let fc: Option<String> = row["foreign_column_name"].value()?;
                Ok((
                    t.expect("null"),
                    c.expect("null"),
                    ft.expect("null"),
                    fc.expect("null"),
                ))
            })
            .collect::<SpiResult<Vec<_>>>()
    })
    .expect("SPI failed");

    TableIterator::new(rows.into_iter())
}
This begins to show how easy it is to build introspection tools — or even something more adventurous, like treating your relational schema as a graph.
7. Testing in Rust
pgrx includes a brilliant test harness.
Add this:
#[cfg(any(test, feature = "pg_test"))]
#[pg_schema]
mod tests {
    use super::*;
    use pgrx::prelude::*;

    #[pg_test]
    fn test_hello_rustpg() {
        assert_eq!(hello_rustpg(), "Hello from Rust + pgrx on Postgres 17!");
    }
}

/// Required by `cargo pgrx test`
#[cfg(test)]
pub mod pg_test {
    pub fn setup(_opts: Vec<&str>) {}
    pub fn postgresql_conf_options() -> Vec<&'static str> {
        vec![]
    }
}
Then run:
cargo pgrx test pg17
These are real Postgres-backed tests.
It’s one of the biggest advantages of building extensions in Rust.
Conclusion
After building extensions in both C and Rust, I’m firmly in the Rust + pgrx camp.
You still get:
full access to Postgres internals
native performance
the ability to drop into unsafe when needed
But you also get:
safety
ergonomics
powerful testing
a private Postgres instance during development
drastically simpler code
In the next article I’ll push further and treat foreign keys as edges — effectively turning a relational schema into a graph.
But for now, this is a clean foundation: a native PostgreSQL extension written in Rust, tested, and running on Postgres 17.
Today’s post is going to be a quick demonstration of loading dynamic libraries at runtime in Rust.
In my earlier article, I showed how to use Glibc’s
dlopen/dlsym/dlclose
APIs from C to load a shared object off disk and call a function in it. Rust can do the same thing – with a bit more
type safety – using the libloading crate.
This is not meant to be a full plugin framework, just a minimal “host loads a tiny library and calls one function”
example, similar in spirit to the original C version.
A tiny library in Rust
We’ll start with a tiny dynamic library that exports one function, greet, which returns a C-style string:
cargo new --lib rust_greeter
cd rust_greeter
Edit Cargo.toml so that the library is built as a cdylib (naming the library test here so the output file is libtest.so, matching the loader code below):
[lib]
name = "test"
crate-type = ["cdylib"]
Then put the function itself in src/lib.rs:
use std::os::raw::c_char;

#[unsafe(no_mangle)]
pub extern "C" fn greet() -> *const c_char {
    static GREETING: &str = "Hello from Rust!\0";
    GREETING.as_ptr().cast()
}
The #[unsafe(no_mangle)] form is the Rust 2024 spelling of the no_mangle attribute: because exporting an unmangled symbol can collide with other symbols in the process, the attribute itself must now be marked unsafe to apply. It does not make the function unsafe to call; it simply acknowledges the linkage hazard. It’s a small but nice modernisation that fits well when exposing C-compatible symbols from Rust.
Before we can run any of this, the host program needs to be able to find the library, so we copy the freshly built .so into the host program’s folder:
cp ../rust_greeter/target/release/libtest.so .
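With the library in place, the host itself is only a few lines. Here's a sketch of what rust_host's src/main.rs can look like (it assumes the libloading crate has been added as a dependency in its Cargo.toml):

```rust
use libloading::{Library, Symbol};
use std::ffi::CStr;
use std::os::raw::c_char;

fn main() {
    unsafe {
        // Load the shared object from the current directory
        let lib = Library::new("./libtest.so").expect("failed to load library");

        // Look up the `greet` symbol as a typed C function pointer
        let greet: Symbol<unsafe extern "C" fn() -> *const c_char> =
            lib.get(b"greet\0").expect("failed to find symbol");

        // Call it and convert the returned C string into &str
        let msg = CStr::from_ptr(greet()).to_str().unwrap();
        println!("{}", msg);
    } // `lib` is dropped here, unloading the library
}
```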
Running cargo run prints:
$ cargo run
Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.01s
Running `target/debug/rust_host`
Hello from Rust!
Mapping back to the C version
When you look at this code, you can see that Library::new("./libtest.so") now takes the place of dlopen().
We can get to the symbol that we want to call with lib.get(b"greet\0") rather than dlsym(), and we clean everything
up now by just dropping the library.
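For comparison, the raw glibc route is still available from Rust via plain FFI; this is essentially what libloading wraps for us. A self-contained sketch, loading libm and calling cos so we don't need our own library for the demonstration (it assumes a glibc Linux system where dlopen lives in libc):

```rust
use std::ffi::{c_char, c_void, CString};
use std::mem;

extern "C" {
    fn dlopen(filename: *const c_char, flags: i32) -> *mut c_void;
    fn dlsym(handle: *mut c_void, symbol: *const c_char) -> *mut c_void;
    fn dlclose(handle: *mut c_void) -> i32;
}

const RTLD_NOW: i32 = 2; // from <dlfcn.h> on Linux

fn main() {
    unsafe {
        // dlopen() - load the shared object
        let path = CString::new("libm.so.6").unwrap();
        let handle = dlopen(path.as_ptr(), RTLD_NOW);
        assert!(!handle.is_null(), "dlopen failed");

        // dlsym() - look up a symbol by name
        let name = CString::new("cos").unwrap();
        let sym = dlsym(handle, name.as_ptr());
        assert!(!sym.is_null(), "dlsym failed");

        // cast the raw pointer to a typed function pointer and call it
        let cos: extern "C" fn(f64) -> f64 = mem::transmute(sym);
        println!("cos(0.0) = {}", cos(0.0));

        // dlclose() - unload
        dlclose(handle);
    }
}
```

On a glibc Linux system this prints cos(0.0) = 1.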
Platform notes
Keep in mind that I’ve written this code on my Linux machine; the dynamic library’s filename will differ depending on the platform you build on.
Platform   Output
Linux      libtest.so
macOS      libtest.dylib
Windows    test.dll
cdylib produces the correct format automatically.
Conclusion
We:
built a tiny Rust cdylib exporting a C-ABI function,
loaded it at runtime with libloading,
looked up a symbol by name, and
invoked it through a typed function pointer.
This is essentially a modern Rust update of that earlier C article.
Just like in the C post, this is a deliberately minimal skeleton — but enough to grow into a proper plugin architecture
once you define a stable API between host and library.
Most operating systems books talk about schedulers as if they’re mysterious forces that magically decide which program
runs next. You’ll usually see phrases like context switching, round-robin, priority queues, and
tasks yielding the CPU—but the moment you try to understand how that works mechanically, the examples jump straight
into kernel code, hardware timers, and interrupt controllers.
That’s… a lot.
Before touching real hardware, it’s far easier to learn the concept in a sandbox: a tiny world that behaves like an
operating system, but without all the baggage. That’s what we’re going to build.
In this post, we’ll simulate a scheduler.
We’ll:
Define a small set of opcodes—like a micro-instruction set for “programs.”
Write a mini assembler that converts human-readable instructions to bytecodes.
Represent each program as a task inside a virtual machine.
Implement a scheduler that runs these tasks, switching between them to simulate concurrency.
The final result:
multiple tiny “programs”
all appearing to run at the same time
under the control of a scheduler you wrote
No threads. No OS dependencies. No unsafe code. Just pure, understandable logic.
By the time we reach the end, you’ll not only understand what a scheduler does—you’ll understand why it needs to exist,
and the mental model will stay with you when you look at real operating systems later.
Let’s build a scheduler from scratch.
Programs
Before scheduling, we need programs. We need a way to define, write, and store programs so that we can load these into
our computer.
Instruction Set
First up, we’re going to define a set of instructions that our “computer” will execute. We’ll define these as a well
known common set of instructions as there’s a few components that will need to understand these. The virtual machine
that we’ll get to will use these to interpret and perform action, but our assembler will have the job of converting
those instructions into bytes on disk.
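Here's a sketch of what that instruction set can look like (the opcode names match the trace output later in the post; the specific byte values are an assumption):

```rust
/// Opcodes for our toy machine. One byte each.
#[repr(u8)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Instruction {
    Nop = 0x00,
    Push = 0x01,
    Add = 0x02,
    Sub = 0x03,
    Load = 0x04,
    Sleep = 0x05,
    Halt = 0xFF,
}

impl Instruction {
    /// Translate a byte read off disk back into an instruction.
    pub fn from_byte(b: u8) -> Option<Instruction> {
        match b {
            0x00 => Some(Instruction::Nop),
            0x01 => Some(Instruction::Push),
            0x02 => Some(Instruction::Add),
            0x03 => Some(Instruction::Sub),
            0x04 => Some(Instruction::Load),
            0x05 => Some(Instruction::Sleep),
            0xFF => Some(Instruction::Halt),
            _ => None,
        }
    }
}

fn main() {
    // assembling: enum -> byte
    let byte = Instruction::Push as u8;
    // loading: byte -> enum
    assert_eq!(Instruction::from_byte(byte), Some(Instruction::Push));
    assert_eq!(Instruction::from_byte(0x7F), None); // invalid bytes are filtered out
    println!("round-trip ok");
}
```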
As you can see, our computer won’t do a lot. Its functionality isn’t the subject here though; we want to see
how these instructions get scheduled.
We use repr(u8) to make casting to u8 a much smoother process; that will help our assembler later on change our
defined instructions into programs on disk.
Conversely, we need to make the bytes-on-disk to instructions-in-memory translation easier.
Those byte values marry up to what our program defines. You can see it’s important that our ISA
doesn’t change now, otherwise our programs will have completely different meanings!
Of course, if it does change - we simply re-assemble our programs.
Tasks
Now we define our Task abstraction. A task is how our virtual machine will store the state of a program that’s been
loaded off of disk and is executing / has executed. We only need some basics to define the state of our task.
pub struct Task {
    /// The path to the task
    pub path: String,
    /// Program code
    pub code: Program,
    /// Current program counter (instruction pointer register)
    pub pc: usize,
    /// Stack state
    pub stack: Vec<i32>,
    /// Determines if this task is still running
    pub done: bool,
    /// Priority of this task
    pub priority: usize,
}
We can load programs off of disk and make tasks out of them ready to execute.
impl Task {
    pub fn load(path: &str) -> std::io::Result<Task> {
        let raw_code = std::fs::read(path)?;

        let instructions = raw_code
            .iter()
            .filter_map(|&b| Instruction::from_byte(b)) // filter out bytes that are invalid
            .collect::<Vec<Instruction>>();

        Ok(Task {
            path: path.to_string(),
            code: instructions,
            pc: 0,
            stack: Vec::new(),
            done: false,
            priority: 0,
        })
    }
}
Virtual Machine
Now that we have tasks loaded from disk, we need something that can run them.
Enter the virtual machine.
This is the component responsible for:
tracking and storing tasks,
executing one instruction at a time,
and cooperatively multitasking between tasks (our scheduler).
The virtual machine stores a collection of Tasks, and has three responsibilities:
add_task – load a task into the VM.
has_runnables – check if any task still needs CPU time.
execute – run one instruction for a given task.
Note: execute receives a reference to a task (&Task). We don’t take ownership of the task, because the scheduler must be able to revisit it later and resume execution.
Scheduling — Round Robin
Now that we have tasks stored in the VM, we need to schedule them.
We’re going to implement the simplest scheduling algorithm: round robin.
Round robin is:
fair (every task gets a turn),
predictable (runs tasks in order),
and conceptually simple (loop through the task list over and over).
Here is the scheduler loop:
pub fn run_round_robin(&mut self) {
    while self.has_runnables() {
        // loop each task once per round
        for idx in 0..self.tasks.len() {
            {
                let task_ref: &Task = &self.tasks[idx];
                if task_ref.done {
                    continue;
                }
                // print the instruction that's about to execute
                self.execute(task_ref);
            }

            // now we advance the task forward one instruction
            let task = &mut self.tasks[idx];
            if !task.done {
                task.step();
                if task.pc >= task.code.len() {
                    task.done = true;
                }
            }
        }

        // cleanup (remove finished tasks from the VM)
        self.tasks.retain(|t| !t.done);
    }
}
A few important points:
Only one instruction executes per task per loop — simulating time slices.
Each iteration of the outer while loop represents one round of CPU scheduling.
retain removes finished tasks so the VM doesn’t waste time checking them.
That’s cooperative multitasking: the VM does just enough work per task, then moves on.
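To see the whole loop in one place, here's a compressed, self-contained sketch combining a stripped-down Task, VM, and the round-robin loop (the task names and programs are invented for the demo):

```rust
#[derive(Debug, Clone, Copy)]
enum Instruction { Push, Add, Halt, Nop }

struct Task {
    path: String,
    code: Vec<Instruction>,
    pc: usize,
    done: bool,
}

struct Vm { tasks: Vec<Task> }

impl Vm {
    fn has_runnables(&self) -> bool { self.tasks.iter().any(|t| !t.done) }

    fn run_round_robin(&mut self) {
        while self.has_runnables() {
            for idx in 0..self.tasks.len() {
                if self.tasks[idx].done { continue; }
                // "execute" one instruction: printing stands in for real work
                let t = &self.tasks[idx];
                println!("[{}]: {:?}", t.path, t.code[t.pc]);
                // advance the task one instruction
                let t = &mut self.tasks[idx];
                t.pc += 1;
                if t.pc >= t.code.len() { t.done = true; }
            }
            // drop finished tasks at the end of the round
            self.tasks.retain(|t| !t.done);
        }
    }
}

fn main() {
    use Instruction::*;
    let mut vm = Vm { tasks: vec![
        Task { path: "p1".into(), code: vec![Push, Add, Halt], pc: 0, done: false },
        Task { path: "p2".into(), code: vec![Nop, Halt], pc: 0, done: false },
    ]};
    vm.run_round_robin();
}
```

Running it interleaves the two tasks one instruction at a time: [p1]: Push, [p2]: Nop, [p1]: Add, [p2]: Halt, [p1]: Halt.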
Test Programs
We need some test programs to feed into this system. Using our assembler, we create three programs called p1, p2, and p3; running them through the scheduler produces the following interleaved trace:
[./bins/p1.bin]: PUSH
[./bins/p2.bin]: SLEEP
[./bins/p3.bin]: PUSH
[./bins/p1.bin]: ADD
[./bins/p2.bin]: SLEEP
[./bins/p3.bin]: LOAD
[./bins/p1.bin]: SUB
[./bins/p2.bin]: SLEEP
[./bins/p3.bin]: HALT
[./bins/p1.bin]: LOAD
[./bins/p2.bin]: NOP
[./bins/p1.bin]: HALT
[./bins/p2.bin]: HALT
Each task gets one instruction at a time.
They appear to multitask, but really, we are just interleaving execution.
This is exactly how early cooperative schedulers worked.
Different Algorithms
We implemented Round Robin and looked at Priority Scheduling, but operating systems use a variety of scheduling strategies.
Each one optimizes something different — fairness, throughput, responsiveness, or predictability.
Here’s a breakdown of the most common ones you’ll see in real operating systems:
First-Come, First-Served (FCFS)
The simplest possible scheduler.
Tasks are run in the order they arrive.
No preemption — once a task starts running, it keeps the CPU until completion.
Pros:
Very predictable.
Easy to implement.
Cons:
Terrible response time — one long task can block all others (the “convoy effect”).
Used in: batch systems, print queues, embedded devices.
Shortest Job First (SJF) / Shortest Remaining Time First (SRTF)
Runs the shortest task first.
SJF — non-preemptive (once a task starts, it finishes).
SRTF — preemptive (if a new shorter task arrives, preempt the current one).
Pros:
Great throughput (lowest total completion time).
Cons:
Requires knowing how long tasks will run (hard in general).
Small tasks can starve large tasks.
Used in: job schedulers, long-running batch systems, HPC.
Round Robin (RR), the one we implemented
Each task gets a fixed unit of time (called a quantum), and then moves to the back of the queue.
Pros:
Very fair.
Great for interactive workloads (UI responsiveness).
Cons:
If the quantum is too short: too much overhead (context switching).
If the quantum is too long: it behaves like FCFS.
Used in: timesharing OS kernels, early Unix schedulers.
Priority Scheduling
Each task has a priority number.
Always selects the runnable task with the highest priority.
Can be preemptive or non-preemptive.
Pros:
High-importance tasks get CPU time first.
Cons:
Starvation — low priority tasks may never run.
Used in: realtime systems, audio/video processing, embedded control software.
Multi-Level Feedback Queue (MLFQ), how modern OS schedulers work
Combines priority and round robin.
Multiple queues (high priority to low priority)
Round Robin within each queue
Tasks that use a lot of CPU get demoted to lower priority queues
Tasks that frequently yield or sleep get promoted (interactive = fast)
Pros:
Gives priority to interactive tasks
Penalizes CPU-bound tasks
Adapts automatically over time (no tuning per process)
Cons:
Harder to implement
Requires tracking task behavior (history)
Used in: Windows, macOS, Linux (CFS is conceptually similar).
Comparison
Algorithm                    Preemptive   Goal / Optimization                Fairness   Weaknesses
FCFS                         N            Simplicity                         Low        One long task blocks everything
SJF / SRTF                   Y/N          Lowest total completion time       Low        Starvation of long tasks
Round Robin                  Y            Responsiveness / interactivity     High       Requires good quantum tuning
Priority Scheduling          Y/N          Importance / latency sensitivity   Low        Starvation of low priorities
Multi-Level Feedback Queue   Y            Realistic, adaptive fairness       High       More complex to implement
TL;DR
Different schedulers optimize for different outcomes:
Fairness? → Round Robin
Highest priority first? → Priority Scheduling
Best throughput? → Shortest Job First / SRTF
Real OS behavior? → Multi-Level Feedback Queue
Schedulers are tradeoffs — once you understand what they optimize, you understand why real operating systems don’t use just one mechanism.
Conclusion
We started with nothing but a tiny instruction set and finished with:
a program format,
an assembler,
a task abstraction,
a virtual machine,
and a functioning scheduler.
The magic was never “multitasking” — it was switching to the next task at the right moment.
Schedulers are simple at their heart:
Run something for a bit. Save its state. Move on.
Now that we’ve built a round-robin scheduler, it’s easy to extend: priority scheduling, smarter time slices, or even a multi-level feedback queue are all natural next steps.
In the last post we landed in 32-bit protected mode.
Before moving on though, let’s tidy up this project a little bit so we can start managing larger pieces of code cleanly.
So far, everything has been hard-wired: ORG directives, absolute addresses, and flat binaries. That works for a boot
sector, but we’re about to deal with paging, higher memory, and higher-level languages - so we need a proper linker
script to take over memory layout.
Linker
A new file has been added to manage how the result of stage2.asm is laid out, and we call it stage2.ld.
The linker script gives us precise control over where each section of code and data ends up in memory — something ORG
statements alone can’t handle once we start mixing multiple object files, sections, and languages.
Up until now, our binaries were assembled as raw flat images: every byte went exactly where NASM told it. But in larger
systems (especially when we introduce C or Rust later), each file gets compiled separately into relocatable object
files (.o). The linker then combines them — and the .ld script tells it exactly how to do that.
What does a linker script do?
At a high level, the linker script acts as a map for your binary. It tells the linker:
where the program starts (ENTRY(start2))
where each section should live in memory (. = 0x8000)
how to group sections from different files (.text, .data, .bss, etc.)
which global symbols to export for use in assembly or C (e.g. _stack_top, _kernel_end)
This becomes essential when we start paging or using virtual addresses — because physical load addresses and virtual
execution addresses will differ.
Example
Here’s a minimal example of what we might use in stage2.ld.
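Something like the following would do the job. This is a sketch: the entry symbol, load address, and exported symbols come from the list above, while the stack size and exact section grouping are assumptions.

```ld
ENTRY(start2)

SECTIONS
{
    . = 0x8000;                 /* where stage 1 loads us */

    .text   : { *(.text*) }
    .rodata : { *(.rodata*) }
    .data   : { *(.data*) }
    .bss    : { *(.bss*) *(COMMON) }

    . = ALIGN(16);
    . += 0x1000;                /* 4 KiB of stack */
    _stack_top = .;
    _stack = _stack_top;        /* referenced as the stack top in stage2.asm */

    _kernel_end = .;
}
```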
A file like this replaces the hard-coded layout logic from our assembly. NASM now just emits relocatable code, and the
linker (ld) uses this script to position everything properly inside the final binary.
Updating the Makefile
We now assemble stage2.asm into an object file and link it with the new script to produce an ELF image:
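For example (the paths and file names here are assumptions based on this project's layout):

```shell
# assemble to a relocatable object instead of a flat binary
nasm -f elf64 boot/stage2.asm -o build/stage2.o

# link with the script controlling the layout
ld -T boot/stage2.ld -o build/stage2.elf build/stage2.o

# flatten the ELF into the raw image we put on disk
objcopy -O binary build/stage2.elf build/stage2.bin
```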
This new process might look like extra work, but it pays off the moment we start mixing in C code and paging structures.
The linker will take care of symbol addresses and memory offsets — no more hardcoded numbers scattered through our
assembly.
Paging and Long Mode
Now that our build is structured and linkable, we can continue where we left off in Stage 2 — by preparing to enter
64-bit long mode.
To do that, we first need to enable paging and set up basic 64-bit page tables.
Modern x86 CPUs can only enter 64-bit mode after paging and PAE (Physical Address Extension) are enabled.
That means we’ll build just enough of a paging hierarchy to identity-map the first few megabytes of physical memory —
enough to run the kernel.
Understanding Paging and PAE
To execute 64-bit code, the CPU insists on PAE paging being enabled and on using the long-mode paging format
(the 4-level tree). Concretely:
CR4.PAE = 1 (turn on PAE)
EFER.LME = 1 (allow long mode)
CR0.PG = 1 (turn on paging)
Then a far jump into a 64-bit code segment (CS.L=1) to start executing in 64-bit.
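In NASM, that sequence looks roughly like this (a sketch; pml4 is assumed to be the physical address of the top-level paging table):

```nasm
mov eax, cr4
or  eax, 1 << 5          ; CR4.PAE = 1
mov cr4, eax

mov eax, pml4            ; physical address of the PML4 table
mov cr3, eax             ; tell the CPU where the paging tree lives

mov ecx, 0xC0000080      ; IA32_EFER MSR
rdmsr
or  eax, 1 << 8          ; EFER.LME = 1 (allow long mode)
wrmsr

mov eax, cr0
or  eax, 1 << 31         ; CR0.PG = 1 (turn on paging)
mov cr0, eax
```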
The 4-level tree (long-mode paging)
Long mode uses a 4-level hierarchy, PML4 → PDPT → PD → PT (all entries are 64-bit):
For a minimal setup, we’ll skip the PT by using one 2 MiB page: create PML4[0] -> PDPT[0] -> PD[0] with PS=1,
which identity-maps 0x00000000–0x001FFFFF.
That’s enough for Stage 2 and the jump.
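A minimal table setup matching that description might look like this (a sketch; the flag value 0x03 is Present | Writable, and 0x83 adds PS for the 2 MiB page):

```nasm
align 4096
pml4:   dq pdpt + 0x03           ; PML4[0] -> PDPT (present | writable)
        times 511 dq 0
pdpt:   dq pd + 0x03             ; PDPT[0] -> PD (present | writable)
        times 511 dq 0
pd:     dq 0x0000000000000083    ; PD[0]: 2 MiB page at 0x0 (present | writable | PS)
        times 511 dq 0
```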
Why not a PT?
You only need a PT (page table) if you want 4 KiB granularity (e.g., guard pages, mapping tiny regions,
marking some pages NX later, etc.). Early on, 2 MiB pages are simpler and faster to set up:
Use 2 MiB page (no PT):
Fewer entries to touch
Great for identity-mapping “just get me to 64-bit”
Use 4 KiB pages (needs PT):
Fine-grained control (per-page permissions)
Slightly more code and memory for the PT
Paging entries
Each paging entry (in PML4, PDPT, PD, and PT) follows the same 64-bit structure and flag semantics laid out below. The
only difference is which address bits and specific flag bits are valid at that level.
Each paging entry = 64 bits (low 32 bits + high 32 bits)
Bits    Name                              Meaning
0       P (Present)                       Must be 1 for valid entries
1       RW (Writable)                     1 = writable, 0 = read-only
2       US (User/Supervisor)              0 = kernel-only, 1 = user-accessible
3       PWT (Page Write-Through)          Cache control bit (leave 0)
4       PCD (Page Cache Disable)          Cache disable bit (leave 0)
5       A (Accessed)                      CPU sets when accessed
6       D (Dirty)                         For pages only (set by CPU)
7       PS (Page Size)                    0 = points to next table, 1 = large page (2 MiB or 1 GiB)
8       G (Global)                        Optional: prevents TLB flush on CR3 reload
9–11    Available (ignored by hardware)   You can use for OS bookkeeping
12–51   Physical Address                  Base address of next-level table or physical page
52–58   Available (ignored)               —
59      PAT (Page Attribute Table)        Rarely used; controls memory type
60–62   Ignored / Reserved                —
63      NX (No Execute)                   1 = non-executable (if EFER.NXE = 1)
How that applies per level:
Level   Structure              What address field points to                    Special bits
PML4E   PML4 entry             Physical base of PDPT                           P, RW, US same
PDPTE   PDPT entry             Physical base of PD (or 1 GiB page if PS = 1)   PS = 1 → 1 GiB page
PDE     Page directory entry   Physical base of PT (or 2 MiB page if PS = 1)   PS = 1 → 2 MiB page
PTE     Page table entry       Physical base of 4 KiB page                     PS ignored
Setup
To setup these page entries, we configure the flags that we need at each level.
For the 64-bit code descriptor: Access = 0x9A, Flags = 0x20 (L=1). Granularity (G) and Limit are ignored in long mode,
so this minimalist form is fine.
Now we can push into long mode.
lgdt [gdt_desc]        ; GDT that includes 0x18 (64-bit CS)
jmp 0x18:long_entry    ; load CS with L=1 → CPU switches to 64-bit
This means that we need a new block of 64-bit code to jump to:
BITS 64
SECTION .text64

extern _stack                 ; from your linker script

long_entry:
    mov ax, 0x20              ; 64-bit data selector
    mov ds, ax
    mov es, ax
    mov ss, ax
    mov fs, ax
    mov gs, ax

    mov rsp, _stack           ; top of your stage2 stack from .ld

    lea rsi, [rel msg_long]
    call serial_puts64

.hang:
    hlt
    jmp .hang

%include "boot/serial64.asm"

SECTION .rodata
msg_long db "64-bit Long mode: Enabled", 13, 10, 0
Of course, we’ve also had to re-implement the serial library so it works in 64-bit mode.
We cleaned up the build with a linker script, set up a minimal long-mode paging tree (PML4 → PDPT → PD with 2 MiB pages),
extended the GDT with 64-bit descriptors, and executed the CR4/EFER/CR0 sequence to reach 64-bit long mode—while
keeping serial output alive the whole way. The result is a small but realistic bootstrap that moves from BIOS real mode
to protected mode to long mode using identity-mapped memory and clean section layout.
In the next part we’ll start acting like an OS: map more memory (and likely move to a higher-half layout), add early
IDT exception handlers, and bring up a simple 64-bit “kernel entry” that can print, panic cleanly, and prepare for
timers/interrupts.