In a previous post, we built a dynamic variant type in C
that could safely hold integers, floats, strings, and arrays — all while tracking its current type.
That implementation used a simple, effective trick from the C toolbox: the tagged union.
But what is a tagged union, really? What does it look like in memory? How does it behave across different languages
like Rust or Swift? And how can understanding this help you write better low-level code?
In this article, we’ll go deep into the mechanics of tagged unions — demystifying how they work, how they’re laid out
in memory, and how other languages leverage this idea at a higher level.
What is a Tagged Union?
A tagged union is a structure that lets you store different types of data in the same memory space — but only
one of them at a time.
It consists of two parts:
A tag (also called a discriminator or type id) — this tells you what kind of value the union currently holds.
A union — this is a single memory region that can represent multiple types, but only one is valid at any given time.
Together, they form a tagged union, also known as:
Variant type (C++)
Discriminated union (TypeScript)
Sum type (Haskell, Rust)
enum with payloads (Rust, Swift, F#)
A Simple C Implementation
Let’s take a minimal version of the ced_var_t type we saw earlier:
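Based on the snippets below, a minimal sketch of that type looks roughly like this (the exact member and constant names are assumptions, not the original library code):

#include <stdint.h>

typedef enum {
    TYPE_INT32,
    TYPE_FLOAT,
    TYPE_STRING
} value_type_t;

typedef struct {
    value_type_t type;   /* the tag: which union member is currently valid */
    union {
        int32_t i32;
        float   f32;
        char   *str;
    } data;              /* the payload: one memory region, many possible types */
} tagged_value_t;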
When you store a value, you must set both correctly:
tagged_value_t v;
v.type = TYPE_INT32;
v.data.i32 = 42;
When you want to read the value, you must check the tag first.
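For example, a read might branch on the tag like this (a sketch that assumes the struct above and <stdio.h>):

switch (v.type) {
case TYPE_INT32:
    printf("int32: %d\n", v.data.i32);
    break;
case TYPE_FLOAT:
    printf("float: %f\n", v.data.f32);
    break;
case TYPE_STRING:
    printf("string: %s\n", v.data.str);
    break;
}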
What Does This Look Like in Memory?
Let’s visualize what’s in memory for the above:
+--------+---------+------------------+
| tag | padding | data |
| (4B) | (4B) | (8B - union) |
+--------+---------+------------------+
Because str is a char*, which is typically 8 bytes on 64-bit systems, the entire union needs to reserve
8 bytes. So the full struct is:
4 bytes for type (as “tag”)
4 bytes padding (for alignment)
8 bytes for union
Total: 16 bytes
Even if you’re only storing a small int32_t, the full memory block remains the same.
Keep in mind! This is a classic example of how low-level memory alignment and padding rules affect real-world struct layout — and why tools like sizeof() sometimes surprise you.
Why Is This Useful?
Tagged unions give you flexibility:
Store different types dynamically
Inspect and branch on their type at runtime
Compact representation (compared to void* + metadata hacks)
Zero-cost abstraction in low-level code
They’re great for interpreters, configuration systems, scripting languages, or anything where types can vary
dynamically.
But It’s Also Dangerous
C won’t stop you from misusing the union:
v.type = TYPE_FLOAT;
v.data.i32 = 1234; // Invalid — tag and payload don't match!
Accessing a union member that doesn’t match the tag is undefined behavior. It might work. Or it might crash. Or
silently corrupt your data.
That’s why you must manage the tag manually, and treat it as the single source of truth.
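One common mitigation is to funnel every write and read through small helpers, so the tag and payload can never drift apart. A minimal sketch (the function names are made up for illustration, and it assumes the struct sketched earlier):

#include <assert.h>

void set_int32(tagged_value_t *v, int32_t value) {
    v->type = TYPE_INT32;    /* tag and payload are always updated together */
    v->data.i32 = value;
}

int32_t get_int32(const tagged_value_t *v) {
    assert(v->type == TYPE_INT32);   /* refuse to read a mismatched payload */
    return v->data.i32;
}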
How Rust Does It Better
In Rust, you’d write this as:
enum Value {
    Int32(i32),
    Float(f32),
    Str(String),
}
Rust does three major things better:
Enforces valid construction — you can’t create an invalid state.
Requires match exhaustiveness — you have to handle all cases.
Uses niche optimization — it can sometimes omit the tag entirely.
For example, Option<&T> takes no extra space. Rust uses the null pointer as the tag for None.
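You can check this yourself with std::mem::size_of:

use std::mem::size_of;

fn main() {
    // Niche optimization: the all-zero bit pattern (a null pointer) is
    // impossible for &u32, so Rust reuses it to encode None.
    assert_eq!(size_of::<Option<&u32>>(), size_of::<&u32>());
    println!("&u32: {} bytes, Option<&u32>: {} bytes",
             size_of::<&u32>(), size_of::<Option<&u32>>());
}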
Comparison with Other Languages
Language        Feature                     Example
C               Tagged union (manual)       struct + union + enum
Rust            Sum type with safety        enum Value { Int(i32), ... }
Swift           enum with associated data   enum Value { case int(Int), ... }
TypeScript      Discriminated union         { kind: "int", value: number }
Haskell/OCaml   Algebraic data type         data Value = Int Int | Float Float
Performance Considerations
Tagged unions are generally compact, but there are trade-offs:
The size of the union is dictated by the largest member.
On 64-bit platforms, alignment often causes padding (e.g., 4-byte tag + 4-byte pad + 8-byte payload).
If a variant holds something heap-allocated (like strings), you may incur pointer-chasing and memory fragmentation.
You can use tools like sizeof(), offsetof(), or pahole to understand the exact layout.
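For instance, with the struct sketched earlier in scope, a quick check might look like this (the exact numbers depend on the ABI and compiler):

#include <stdio.h>
#include <stddef.h>

int main(void) {
    printf("sizeof(tagged_value_t)        = %zu\n", sizeof(tagged_value_t));         /* typically 16 */
    printf("offsetof(tagged_value_t, data) = %zu\n", offsetof(tagged_value_t, data)); /* typically 8: 4-byte tag + 4 bytes padding */
    return 0;
}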
Debugging Tips
If you’re building or debugging a tagged union system in C:
Use assert() to guard tag-to-union access:
assert(v->type == TYPE_INT32);
Use valgrind to catch misuses or memory leaks.
Inspect raw memory using gdb, hexdump, or print helpers.
Consider printing the tag as a string to make logs easier to read.
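A small helper along these lines does the trick (a sketch; adapt the names to your own enum):

const char *type_name(value_type_t t) {
    switch (t) {
    case TYPE_INT32:  return "int32";
    case TYPE_FLOAT:  return "float";
    case TYPE_STRING: return "string";
    default:          return "unknown";
    }
}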
Rust is famous for giving you memory safety without a garbage collector. But when you start doing lower-level work —
self-referential structs, async state machines, or FFI — you run into a powerful but mysterious feature: Pin.
In this article, we’ll answer the following:
What does it mean for a type to be pin-safe?
Why would a type need to be pinned in the first place?
How do you build one safely — without fighting the borrow checker?
We’ll walk through simple examples first, then build up to a self-referential type.
What is a Pin-safe Type?
A pin-safe type is a type that can be safely used with Rust’s Pin API.
In other words, it promises not to move itself in memory after being pinned, and it uses unsafe code responsibly to uphold that guarantee.
You create a Pin-safe type when:
You need to guarantee that a value won’t move in memory after being created.
You want to allow self-referencing inside a struct (e.g., a field pointing to another field).
You’re building async state machines, generators, or intrusive data structures.
Self-Referential Structs: The Core Problem
Let’s look at a classic case. Say you have a struct like this:
struct Example {
    a: String,
    b: *const String, // b points to a
}
This is a self-referential struct: b stores a pointer to another field inside the same struct.
Seems harmless?
Here’s the catch: Rust moves values around freely — into function calls, collections, etc. If you set up a pointer
inside a struct and then move the struct, your pointer is now invalid. This opens the door to use-after-free bugs.
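Here is a tiny demonstration of the issue. Because b is a raw pointer, the compiler accepts this; nothing below dereferences the stale pointer, so it is safe to run (a sketch, not code from the original post):

fn main() {
    struct Example {
        a: String,
        b: *const String,
    }

    let mut ex = Example { a: String::from("hello"), b: std::ptr::null() };
    ex.b = &ex.a; // b points at a

    // Moving the struct copies its bytes to a new location,
    // but b still holds the old address of a.
    let moved = ex;
    println!("a now lives at    {:p}", &moved.a);
    println!("b still points to {:p}", moved.b); // stale in general
}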
Rust’s borrow checker normally prevents you from doing this. But sometimes you do need this — and that’s where Pin
comes in.
Pin to the Rescue
Pin<T> says: "Once this value is pinned, it cannot be moved again."
This is perfect for self-referential types — it guarantees their memory address won’t change.
But you have to build your type carefully to uphold this contract.
A Pin-safe Self-Referential Type
Now let’s build a Pin-safe type step-by-step.
Step 1: Define the structure
use std::pin::Pin;
use std::marker::PhantomPinned;

struct SelfRef {
    data: String,
    data_ref: *const String, // raw pointer, not a safe Rust reference
    _pin: PhantomPinned,     // opt-out of Unpin
}
data: holds some content.
data_ref: stores a pointer to that data.
PhantomPinned: tells Rust this type is not safe to move after being pinned.
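The next step is to construct the value in a way that upholds the pin contract: pin first, then wire up the pointer. A sketch of that standard pattern (not necessarily the original post's exact code):

impl SelfRef {
    fn new(data: String) -> Pin<Box<SelfRef>> {
        // Pin on the heap first, so the value already has a stable address.
        let mut boxed = Box::pin(SelfRef {
            data,
            data_ref: std::ptr::null(),
            _pin: PhantomPinned,
        });

        let ptr: *const String = &boxed.data;
        unsafe {
            // Sound because we only patch the pointer; we never move the value out.
            let mut_ref = Pin::as_mut(&mut boxed);
            Pin::get_unchecked_mut(mut_ref).data_ref = ptr;
        }
        boxed
    }
}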
In Part 4, we introduced named words and a
dictionary that allowed our VM to call subroutines by name. But there was still one major gap:
We couldn’t write Forth-like code.
You still had to manually build vec![Instruction::Push(5), ...] in Rust. That changes now.
In this post, we’ll add a hand-rolled parser that understands simple Forth syntax — including word definitions like
: square dup * ; — and emits instructions automatically.
By the end of this part, you’ll be able to write:
5 square : square dup * ;
And run it on your virtual machine with no hardcoded addresses or manual instruction building.
The Goal
Our parser will:
Tokenize simple Forth input
Track whether we’re inside a : definition
Split instructions into main and definitions
Insert a Halt after top-level code to prevent fall-through
Track the correct addresses for word definitions
Build a final list of instructions ready to run
Let's walk through the updated code piece by piece.
Parser Support
First of all, we define our Parser struct — this separates the parsing logic from the VM runtime.
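Based on how it's used in parse and finalize below, the struct looks roughly like this (a sketch; the original may differ slightly):

use std::collections::HashMap;

struct Parser {
    main: Vec<Instruction>,             // top-level code, executed first
    definitions: Vec<Instruction>,      // compiled word bodies, placed after the Halt
    dictionary: HashMap<String, usize>, // word name -> starting address
}

impl Parser {
    fn new() -> Self {
        Parser {
            main: Vec::new(),
            definitions: Vec::new(),
            dictionary: HashMap::new(),
        }
    }
}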
The heart of the parser is the parse method. We split the input on whitespace and interpret each token in turn.
fn parse(&mut self, input: &str) {
    let mut tokens = input.split_whitespace().peekable();
    let mut defining: Option<String> = None;
    let mut buffer: Vec<Instruction> = Vec::new();

    while let Some(token) = tokens.next() {
        match token {
            ":" => {
                // Beginning of a word definition
                let name = tokens.next().expect("Expected word name after ':'");
                defining = Some(name.to_string());
                buffer.clear();
            }
            ";" => {
                // End of word definition
                if let Some(name) = defining.take() {
                    buffer.push(Instruction::Return);
                    let addr = self.main.len() + self.definitions.len() + 1; // +1 for HALT
                    self.dictionary.insert(name, addr);
                    self.definitions.extend(buffer.drain(..));
                } else {
                    panic!("Unexpected ';' outside of word definition");
                }
            }
            word => {
                // Otherwise, parse an instruction
                let instr = if let Ok(n) = word.parse::<i32>() {
                    Instruction::Push(n)
                } else {
                    match word {
                        "dup" => Instruction::Dup,
                        "drop" => Instruction::Drop,
                        "swap" => Instruction::Swap,
                        "over" => Instruction::Over,
                        "+" => Instruction::Add,
                        "*" => Instruction::Mul,
                        "depth" => Instruction::Depth,
                        _ => Instruction::CallWord(word.to_string()),
                    }
                };

                // Add to appropriate section
                if defining.is_some() {
                    buffer.push(instr);
                } else {
                    self.main.push(instr);
                }
            }
        }
    }
}
Breakdown of the cases:
: begins a new named word definition.
; ends the definition, emits a Return, and stores the word’s starting address in the dictionary.
A number becomes a Push(n) instruction.
Built-in words like + and * become direct Instruction variants.
Any unknown token is assumed to be a user-defined word, and gets translated to CallWord("name").
Finalizing the Program
Once parsing is complete, we combine the main program with definitions — separated by a Halt to ensure we don’t fall through.
fn finalize(self) -> (Vec<Instruction>, HashMap<String, usize>) {
    let mut instructions = self.main;
    instructions.push(Instruction::Halt); // Halts after main program
    instructions.extend(self.definitions);
    (instructions, self.dictionary)
}
Main Program
Our main() function now uses the parser to construct the program from a Forth-style string.
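A sketch of what that looks like (VM::new and add_word follow the API from the earlier parts; the exact wiring is an assumption):

fn main() {
    let source = "5 square : square dup * ;";

    let mut parser = Parser::new();
    parser.parse(source);
    let (program, dictionary) = parser.finalize();

    let mut vm = VM::new(program);
    for (name, addr) in dictionary {
        vm.add_word(&name, addr); // register each parsed word with the VM
    }
    vm.run();

    println!("Final stack: {:?}", vm.stack);
}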
In Part 3, we introduced control flow and
subroutines into our virtual machine. That gave us branching logic and reusable code blocks — a huge step forward.
But one core Forth idea is still missing: the ability to define and name new words.
In this part, we’ll add a dictionary to our VM and support calling reusable routines by name. This will allow us to
define Forth-style words like:
: square dup * ;
5 square
Let’s get into it.
The Concept of a “Word”
In Forth, a word is any named function — even built-ins like + and * are just words. User-defined words are
created using : and ;, and then they behave just like native instructions.
To support this, we need:
A dictionary mapping word names to addresses
An instruction that can call a word by name
A way to define new words at specific locations in the program
Extending the Instruction Set
First, we extend our enum to support calling named words:
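Based on the variants used throughout the series, the enum now looks roughly like this; the only new variant is CallWord (the rest is a sketch of what we already had):

#[derive(Debug, Clone)]
enum Instruction {
    Push(i32),
    Dup,
    Drop,
    Swap,
    Over,
    Add,
    Mul,
    Depth,
    IfZero(isize),
    Jump(isize),
    Call(usize),
    Return,
    Halt,
    CallWord(String), // NEW: call a word by name, resolved through the dictionary
}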
This works just like Call, but performs a dictionary lookup first.
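Inside the run loop, the handler might look like this (a sketch; it assumes the VM keeps the dictionary in a field, matches on a copy of the current instruction, and records the address of the next instruction for Return):

Instruction::CallWord(name) => {
    // Resolve the name, then behave exactly like Call.
    let addr = *self
        .dictionary
        .get(&name)
        .unwrap_or_else(|| panic!("Unknown word: {}", name));
    self.return_stack.push(self.ip + 1); // where to resume after Return
    self.ip = addr;
    continue;                            // skip the usual ip += 1
}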
Example: Defining square
Here’s a complete program that defines and calls a square word:
let program = vec![
    Instruction::Push(5),
    Instruction::CallWord("square".to_string()),
    Instruction::Halt,
    // : square dup * ;
    Instruction::Dup,
    Instruction::Mul,
    Instruction::Return,
];

let mut vm = VM::new(program);
vm.add_word("square", 3); // definition starts at index 3
vm.run();

println!("Final stack: {:?}", vm.stack);
Output:
[25]
We’ve now made it possible to extend the language from within the language — a hallmark of Forth.
Optional: Parsing : square dup * ;
Currently we define words manually by inserting them into the dictionary, but in true Forth style we’d like to write:
: square dup * ;
5 square
To support that, we’ll need a minimal parser or macro-assembler to convert high-level Forth code into VM instructions.
This will be the focus of a future post.
Conclusion
In this post, we gave our VM the ability to define and call named words, which turns our stack machine into
something far more expressive and composable.
Our VM now supports:
Arithmetic
Stack manipulation
Control flow and subroutines
A dictionary of named routines
In Part 5, we’ll push even further — implementing a simple parser that can read actual Forth-like text,
resolve words, and build programs dynamically.
We’re getting very close to having a minimal, working Forth interpreter — and it’s all built in Rust.
In Part 2, we extended our Forth-style virtual
machine with a bunch of classic stack manipulation words — from OVER and ROT to 2DUP, 2SWAP, and more.
This gave our machine more expressive power, but it still lacked something crucial: control flow. In this part,
we’ll fix that.
By adding branching and subroutine support, we allow our VM to make decisions and reuse logic — two
foundational ideas in all real programming languages.
Control Flow in Stack Machines
Stack machines like Forth typically handle control flow through explicit instruction manipulation — that is,
jumping to new parts of the program and returning when done.
We’ll implement:
Instruction       Stack Effect   Description
IfZero(offset)    ( n -- )       Jumps by offset if the top of the stack is zero
Jump(offset)      ( -- )         Always jumps by offset
Call(addr)        ( -- )         Saves the return address and jumps to addr
Return            ( -- )         Pops the return address and jumps to it
These instructions give us the power to create conditionals and function-like routines.
In order to support our ability to call subroutines, our virtual machine needs another stack. This stack is in charge
of remembering where we came from so that we can return back to the correct place. The return stack is just another
piece of state management for the virtual machine:
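The addition is essentially one extra Vec on the VM (a sketch; the field names are assumptions based on the snippets below):

struct VM {
    program: Vec<Instruction>,
    stack: Vec<i32>,          // the data stack
    return_stack: Vec<usize>, // NEW: return addresses for Call/Return
    ip: usize,                // instruction pointer
}

With that in place, the unconditional jump is the simplest of the new instructions:

Instruction::Jump(offset) => {
    self.ip = ((self.ip as isize) + offset) as usize;
    continue;
}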
We use continue here because we don’t want to execute the usual ip += 1 after a jump.
IFZERO
Conditionally jumps based on the top stack value.
Stack effect: ( n -- )
Instruction::IfZero(offset) => {
    let cond = self.stack.pop().expect("Stack underflow on IFZERO");
    if cond == 0 {
        self.ip = ((self.ip as isize) + offset) as usize;
        continue;
    }
}
If the value is zero, we adjust ip by the offset. If not, we let the loop continue as normal.
CALL
Pushes the current instruction pointer onto the return stack and jumps to the absolute address.
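A sketch of both handlers (the exact bookkeeping depends on how the rest of the loop is written; here Call records the address of the next instruction so that Return can jump straight back to it):

Instruction::Call(addr) => {
    self.return_stack.push(self.ip + 1); // where to resume after Return
    self.ip = addr;
    continue;
}
Instruction::Return => {
    self.ip = self.return_stack.pop().expect("Return stack underflow on RETURN");
    continue;
}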
If you accidentally used Call(5), you’d be jumping to Return, skipping your routine completely — a classic off-by-one bug that’s easy to spot once you think in terms of instruction addresses.
Conclusion
With these new control flow instructions, we’ve unlocked a huge amount of expressive power. Our VM can now:
Execute conditional logic
Jump forwards and backwards
Encapsulate and reuse stack behavior with subroutines
In the next part, we’ll take the leap into defining named words, allowing us to simulate real Forth syntax like:
: square dup * ;
5 square
We’ll build a dictionary, wire up some simple parsing, and move closer to an interactive REPL.