Cogs and Levers A blog full of technical stuff

Kerberos on Linux

Introduction

Kerberos is one of those protocols that sounds mysterious until you see it in action. The moment you type kinit, run klist, and watch a ticket pop up, it clicks: this is Single Sign-On in its rawest form. In this post we’ll set up a tiny realm on a Debian test box (koffing.local), get a ticket-granting ticket (TGT), and then use it for SSH without typing a password.

What is Kerberos?

Born at MIT’s Project Athena in the 1980s, Kerberos solved campus-wide single sign-on over untrusted networks. It matured through v4 to Kerberos 5 (the standard you use today). It underpins enterprise SSO in Windows domains (Active Directory) and many UNIX shops.

Kerberos authenticates clients to services without sending reusable secrets. You authenticate once to the KDC, get a TGT (Ticket Granting Ticket), then use it to obtain per-service tickets from the TGS (Ticket Granting Service).

Services trust the KDC, not your password.

Core terms

  • Realm: Admin boundary (e.g., LOCAL).
  • Principal: Identity in the realm, like michael@LOCAL (user) or host/koffing.local@LOCAL (service).
  • KDC: The authentication authority. Runs on koffing.local as krb5kdc and kadmind.
  • TGT: Your “hall pass.” Lets you ask the KDC for service tickets.
  • Service ticket: What you present to a service (e.g., SSHD on koffing.local) to prove identity.
  • Keytab: File holding long-term service keys (like for sshd). Lets the service authenticate without storing a password.

Here’s a visual representation of how the Kerberos flow operates:

sequenceDiagram
  participant U as User
  participant AS as KDC/AS
  participant TGS as KDC/TGS
  participant S as Service (e.g., SSHD)
  U->>AS: AS-REQ (I am michael)
  AS-->>U: AS-REP (TGT + session key)
  U->>TGS: TGS-REQ (I want ticket for host/koffing.local)
  TGS-->>U: TGS-REP (service ticket)
  U->>S: AP-REQ (here's my service ticket)
  S-->>U: AP-REP (optional) + access granted

OK, with all of that out of the way, we can get to setting up.

Setup

There are a few packages to install and a little bit of configuration. All of these instructions are written for a Debian/Ubuntu flavour of Linux; they shouldn't be too far off for other distributions.

Install the packages

We install the Key Distribution Center daemon krb5-kdc, the administration server krb5-admin-server, and the client utilities krb5-user.

sudo apt update
sudo apt install -y krb5-kdc krb5-admin-server krb5-user

Configure your realm

The fully qualified name of the virtual machine I'm testing all of this on is koffing.local. Change these values to suit your environment.

Edit /etc/krb5.conf and make sure it looks like this:

[libdefaults]
  default_realm = LOCAL
  rdns = false
  dns_lookup_kdc = false
  forwardable = true

[realms]
  LOCAL = {
    kdc = koffing.local
    admin_server = koffing.local
  }

[domain_realm]
  .local = LOCAL
  koffing.local = LOCAL

Make sure your host resolves correctly:

hostname -f        # should print: koffing.local (for me)

getent hosts koffing.local
# If needed, add to /etc/hosts:
# 127.0.1.1   koffing.local koffing

Create the KDC database

Now we initialize the database that will hold all of your principals, policies, realms, etc.

sudo mkdir -p /var/lib/krb5kdc
sudo kdb5_util create -s -r LOCAL
# set the KDC master password when prompted

Start the daemons:

sudo systemctl enable --now krb5-kdc krb5-admin-server
sudo systemctl status krb5-kdc krb5-admin-server --no-pager

Add principals

Create an admin and a user:

sudo kadmin.local -q "addprinc admin/admin"
sudo kadmin.local -q "addprinc michael"

Hello, Kerberos!

Now it’s time to give this a quick test. You can get a ticket with the following:

kdestroy
kinit michael
klist

You should see something similar to the following:

Ticket cache: FILE:/tmp/krb5cc_1000
Default principal: michael@LOCAL

Valid starting     Expires            Service principal
13/09/25 16:14:32  14/09/25 02:14:32  krbtgt/LOCAL@LOCAL
	renew until 14/09/25 16:14:28

That’s your TGT — Kerberos is alive.

Troubleshooting

Kerberos is famously unforgiving about typos and hostname mismatches. Here are some quick checks if things go sideways:

Check hostnames / FQDNs

hostname -f # should print koffing.local
getent hosts koffing.local
If these don’t line up, Kerberos tickets won’t match the service principal name.

Check if the KDC is running

sudo systemctl status krb5-kdc krb5-admin-server --no-pager

Look at logs (Debian uses journalctl instead of flat log files):

sudo journalctl -u krb5-kdc -u krb5-admin-server -b --no-pager

Verbose kinit to see exactly what’s happening:

KRB5_TRACE=/dev/stderr kinit -V michael

This will show you which hostnames it resolves, which tickets it requests, and where it fails.

List all principals in the KDC database:

sudo kadmin.local -q "listprincs"

Clear your credential cache if tickets get stale:

kdestroy

The two most common pitfalls are:

  • Hostname mismatch (the FQDN doesn’t match the service principal name).
  • Realm mismatch (default_realm not set in /etc/krb5.conf).

SSO

So, we’ve got the proof of concept going, but it would be good to see this in action. What we’ll cover in this next section is getting the sshd service to trust our Kerberos tickets. This will allow for passwordless SSH for the user.

Add the host service principal and keytab

To get the KDC to vouch for services, those services need principal definitions of their own. A principal is any Kerberos identity: users get user principals (as we saw above), and services need service principals.

sudo kadmin.local -q "addprinc -randkey host/koffing.local"

For SSH on my virtual machine koffing.local, the conventional name is:

host/koffing.local@LOCAL

  • The host/ prefix is the standard for SSH, rsh, and other “host-based” services.
  • The FQDN (koffing.local) must match what the client thinks it is connecting to.
  • @LOCAL is your realm.

When a client does ssh michael@koffing.local, the SSH server needs to prove “I really am host/koffing.local, trusted by the KDC.”

Now we need a keytab.

sudo kadmin.local -q "ktadd -k /etc/krb5.keytab host/koffing.local"

A keytab is a file that stores one or more Kerberos keys (like passwords, but in cryptographic form). Unlike users (who can type passwords into kinit), services can’t type passwords interactively. So the KDC generates a random key for host/koffing.local@LOCAL (-randkey) and you export it into /etc/krb5.keytab with ktadd.

Now sshd can silently use that keytab to decrypt tickets clients send it.

Enable GSSAPI in sshd

The global /etc/ssh/sshd_config needs a couple of flags flicked. The SSH daemon doesn’t implement Kerberos directly, so it uses the GSSAPI library functions provided by MIT Kerberos (or Heimdal) to handle ticket validation. GSSAPI isn’t a protocol itself; it’s an API or abstraction layer.

Once we’ve flipped these switches, we are telling sshd to accept authentication from any GSSAPI mechanism, which in practice means Kerberos tickets.

# GSSAPI options
GSSAPIAuthentication yes
GSSAPICleanupCredentials yes

This setup is done on every server you want this SSO-style login for. It can look a bit confusing in my example because everything is on the one machine.

Configure your SSH client

There is also configuration to do on the client side. Clients that want to connect with this type of authentication need the following settings in their ~/.ssh/config:

Host koffing.local
  GSSAPIAuthentication yes
  GSSAPIDelegateCredentials yes

Testing

kdestroy
kinit michael
ssh michael@koffing.local

If everything lines up, ssh should not prompt for a password. Your Kerberos TGT has been used to authenticate silently.

Where Kerberos Fits

Kerberos is ideal for LAN-based authentication: it provides fast, passwordless single sign-on for services like SSH, Postgres, and intranet HTTP apps. But it isn’t designed for cross-organization web or mobile use.

Modern protocols like OIDC (OpenID Connect) build on OAuth 2.0 to provide authentication and federation across the public internet. They use signed tokens, redirect flows, and JSON-based metadata — making them better suited for SaaS, cloud apps, and mobile clients.

In short: Kerberos is the right tool inside the castle walls; OIDC is the right tool when your users are everywhere.

Wrap-up

We’ve stood up a Kerberos realm (LOCAL), issued a TGT for a user (michael), and used it for passwordless SSH into the same box. That’s enough to demystify Kerberos: no secrets flying across the network, just short-lived tickets granted by a trusted KDC.

There’s plenty more we could do from here: service principals for HTTP or Postgres, or even cross-realm trust.

Hello, Jail: A Quick Introduction to FreeBSD Jails

FreeBSD Jails are one of the earliest implementations of operating system-level virtualization—dating back to the early 2000s, long before Docker popularized the idea of lightweight containers. Despite their age, jails remain a powerful, flexible, and minimal way to isolate services and processes on FreeBSD systems.

This post walks through a minimal “Hello World” setup using Jails, with just enough commentary to orient new users and show where jails shine in the modern world of virtualization.

Why Jails?

A FreeBSD jail is a chroot-like environment with its own file system, users, network interfaces, and process table. But unlike chroot, jails extend control to include process isolation, network access, and fine-grained permission control. They’re more secure, more flexible, and more deeply integrated into the FreeBSD base system.

Here’s how jails compare with some familiar alternatives:

  • Versus VMs: Jails don’t emulate hardware or run separate kernels. They’re faster to start, lighter on resources, and simpler to manage. But they’re limited to the same FreeBSD kernel as the host.
  • Versus Docker: Docker containers typically run on a Linux host and rely on a container runtime, layered filesystems, and extensive tooling. Jails are simpler, arguably more robust, and don’t require external daemons. However, they lack some of the ecosystem and portability benefits that Docker brings.

If you’re already running FreeBSD and want to isolate services or test systems with minimal overhead, jails are a perfect fit.

Setup

Let’s build a bare-bones jail. The goal here is simplicity: get a jail running with minimal commands. This is the BSD jail equivalent of “Hello, World.”

# Make a directory to hold the jail
mkdir -p /home/michael/src/jails/hw

# Install a minimal FreeBSD userland into that directory
sudo bsdinstall jail /home/michael/src/jails/hw

# Start the jail with a name, IP address, and a shell
sudo jail -c name=hw host.hostname=hw.example.org \
    ip4.addr=192.168.1.190 \
    path=/home/michael/src/jails/hw \
    command=/bin/sh

You now have a running jail named hw, with a hostname and IP, running a shell isolated from the host system.

192.168.1.190 is just a static address I picked arbitrarily. You’ll want to pick an address that is reachable on your local network.

Poking Around

With the jail up and running, you can start working with it. To enter the jail:

sudo jexec hw /bin/sh

jexec lets you run any command you need inside the jail:

sudo jexec hw ls /

Querying

You can list running jails with:

jls

You should see something like this:

JID  IP Address      Hostname                      Path
2    192.168.1.190   hw.example.org                /home/michael/src/jails/hw

You can also look at what’s currently running in the jail:

ps -J hw

You should see the /bin/sh process:

PID TT  STAT    TIME COMMAND
2390  5  I+J  0:00.01 /bin/sh

Finishing up

To terminate the jail:

sudo jail -r hw

This is a minimal setup with no automated networking, no jail management frameworks, and no persistent configuration. And that’s exactly the point: you can get a working jail in three commands and tear it down just as easily.

When to Use Jails

Jails make sense when:

  • You want process and network isolation on FreeBSD without the overhead of full VMs.
  • You want to run multiple versions of a service (e.g., Postgres 13 and 15) on the same host.
  • You want stronger guarantees than chroot provides for service containment.
  • You’re building or testing FreeBSD-based systems and want a reproducible sandbox.

For more complex jail setups, FreeBSD offers tools like ezjail, iocage, and bastille that add automation and persistence. But it’s worth knowing how the pieces fit together at the core.

Conclusion

FreeBSD jails offer a uniquely minimal, powerful, and mature alternative to both VMs and containers. With just a few commands, you can create a secure, isolated environment for experimentation, testing, or even production workloads.

This post only scratched the surface, but hopefully it’s enough to get you curious. If you’re already on FreeBSD, jails are just sitting there, waiting to be used—no extra software required.

Hooking open() with LD_PRELOAD

Introduction

Modern Linux systems provide a fascinating feature for overriding shared library behavior at runtime: LD_PRELOAD. This environment variable lets you inject a custom shared library before anything else is loaded — meaning you can intercept and modify calls to common functions like open, read, connect, and more.

In this post, we’ll walk through hooking the open() function using LD_PRELOAD and a simple shared object. No extra tooling required — just a few lines of C, and the ability to compile a .so file.

Intercepting open()

Let’s write a tiny library that intercepts calls to open() and prints the file path being accessed. We’ll also forward the call to the real open() so the program behaves normally.

Create a file named hook_open.c with the following:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdarg.h>
#include <dlfcn.h>
#include <fcntl.h>

int open(const char *pathname, int flags, ...) {
    static int (*real_open)(const char *, int, ...) = NULL;
    if (!real_open)
        real_open = (int (*)(const char *, int, ...)) dlsym(RTLD_NEXT, "open");

    /* The optional mode argument is only passed when O_CREAT is set. */
    mode_t mode = 0;
    if (flags & O_CREAT) {
        va_list args;
        va_start(args, flags);
        mode = (mode_t) va_arg(args, int);
        va_end(args);
    }

    fprintf(stderr, "[HOOK] open() called with path: %s\n", pathname);
    return real_open(pathname, flags, mode);
}

This function matches the signature of open, grabs the “real” function using dlsym(RTLD_NEXT, ...), and then forwards the call after logging it.

Note: we only read the optional mode argument (via va_list) when O_CREAT is set, since callers only pass it in that case.

Compiling the Hook

Compile your code into a shared object:

gcc -fPIC -shared -o hook_open.so hook_open.c -ldl

Now you can use this library with any dynamically linked program that calls open.

Testing with a Simple Program

Try running a standard tool like cat to confirm that it’s using open():

LD_PRELOAD=./hook_open.so cat hook_open.c

You should see:

[HOOK] open() called with path: hook_open.c
#define _GNU_SOURCE
...

Each time the program calls open(), your hook intercepts it, logs the call, and passes control along.

Notes and Gotchas

  • This only works with dynamically linked binaries — statically linked programs don’t go through the dynamic linker.
  • Some programs (like ls) may use openat() instead of open(). You can hook that too, using the same method.
  • If your hook causes a crash or hangs, it’s often due to incorrect use of va_arg or missing dlsym resolution.

Where to Go From Here

You can expand this basic example to:

  • Block access to specific files
  • Redirect file paths
  • Inject fake contents
  • Hook other syscalls like connect(), write(), execve()

LD_PRELOAD is a powerful mechanism for debugging, sandboxing, and learning how programs interact with the system. Just don’t forget — you’re rewriting the behavior of fundamental APIs at runtime.

With great power comes great segfaults!

Hexagonal Architecture in Rust

Introduction

Hexagonal Architecture, also known as Ports and Adapters, is a compelling design pattern that encourages the decoupling of domain logic from infrastructure concerns.

In this post, I’ll walk through a Rust project called banker that adopts this architecture, showing how it helps keep domain logic clean, composable, and well-tested.

You can follow along with the full code in my GitHub repository and get this running locally.

Project Structure

The banker project is organized as a set of crates:

crates/
├── banker-core       # The domain and business logic
├── banker-adapters   # Infrastructure adapters (e.g. in-memory repo)
├── banker-fixtures   # Helpers and test data
└── banker-http       # Web interface via Axum

Each crate plays a role in isolating logic boundaries:

  • banker-core defines the domain entities, business rules, and traits (ports).
  • banker-adapters implements the ports with concrete infrastructure (like an in-memory repository).
  • banker-fixtures provides test helpers and mock repositories.
  • banker-http exposes an HTTP API with axum, calling into the domain via ports.

Structurally, the project flows as follows:

graph TD
  subgraph Core
    BankService
    AccountRepo[AccountRepo trait]
  end
  subgraph Adapters
    HTTP[HTTP Handler]
    InMemory[InMemoryAccountRepo]
    Fixtures[Fixture Test Repo]
  end
  HTTP -->|calls| BankService
  BankService -->|trait| AccountRepo
  InMemory -->|implements| AccountRepo
  Fixtures -->|implements| AccountRepo

Defining the Domain (banker-core)

In Hexagonal Architecture, the domain represents the core of your application—the rules, behaviors, and models that define what your system actually does. It’s intentionally isolated from infrastructure concerns like databases or HTTP. This separation ensures the business logic remains testable, reusable, and resilient to changes in external technology choices.

The banker-core crate contains the central business model:

pub struct AccountId(pub String);

pub struct Account {
    pub id: AccountId,
    pub balance_cents: i64,
}

pub trait AccountRepo {
    fn get(&self, id: &AccountId) -> Result<Option<Account>>;
    fn upsert(&self, account: &Account) -> Result<()>;
}

The Bank service orchestrates operations:

pub struct Bank<R: AccountRepo> {
    repo: R,
}

impl<R: AccountRepo> Bank<R> {
    pub fn deposit(&self, cmd: Deposit) -> Result<Account, BankError> {
        let mut acct = self.repo.get(&cmd.id)?.ok_or(BankError::NotFound)?;
        acct.balance_cents += cmd.amount_cents;
        self.repo.upsert(&acct)?;
        Ok(acct)
    }
    // ... open and withdraw omitted for brevity
}

The Bank struct acts as the use-case layer, coordinating logic between domain entities and ports.

Implementing Adapters

In Hexagonal Architecture, adapters are the glue between your domain and the outside world. They translate external inputs (like HTTP requests or database queries) into something your domain understands—and vice versa. Adapters implement the domain’s ports (traits), allowing your application core to remain oblivious to how and where the data comes from.

The in-memory repository implements the AccountRepo trait and lives in banker-adapters:

pub struct InMemoryAccountRepo {
    inner: Arc<Mutex<HashMap<AccountId, Account>>>,
}

impl AccountRepo for InMemoryAccountRepo {
    fn get(&self, id: &AccountId) -> Result<Option<Account>> {
        Ok(self.inner.lock().unwrap().get(id).cloned())
    }
    fn upsert(&self, account: &Account) -> Result<()> {
        self.inner.lock().unwrap().insert(account.id.clone(), account.clone());
        Ok(())
    }
}

This adapter is used both in the HTTP interface and in tests.

Testing via Fixtures

banker-fixtures provides helpers to test the domain independently of any infrastructure:

pub fn deposit(bank: &Bank<impl AccountRepo>, id: &AccountId, amt: i64) -> Account {
    bank.deposit(Deposit { id: id.clone(), amount_cents: amt }).unwrap()
}

#[test]
fn withdrawing_too_much_fails() {
    let bank = Bank::new(InMemRepo::new());
    let id = rand_id("acc");
    open(&bank, &id);
    deposit(&bank, &id, 100);

    let err = bank.withdraw(Withdraw { id, amount_cents: 200 }).unwrap_err();
    assert!(matches!(err, BankError::InsufficientFunds));
}

Connecting via Transport

The outermost layer of a hexagonal architecture typically handles transport—the mechanism through which external actors interact with the system. In our case, that’s HTTP, implemented using the axum framework. This layer invokes domain services via the ports defined in banker-core, ensuring the business logic remains insulated from the specifics of web handling.

In banker-http, we wire up the application for HTTP access using axum:

#[tokio::main]
async fn main() -> Result<()> {
    let state = AppState {
        bank: Arc::new(Bank::new(InMemoryAccountRepo::new())),
    };
    let app = Router::new()
        .route("/open", post(open))
        .route("/deposit", post(deposit))
        .route("/withdraw", post(withdraw))
        .with_state(state);
    axum::serve(tokio::net::TcpListener::bind("127.0.0.1:8080").await?, app).await?;
    Ok(())
}

Each handler invokes domain logic through the Bank service, returning simple JSON responses.

This is one example of a primary adapter—other adapters (e.g., CLI, gRPC) could be swapped in without changing the core.

Takeaways

  • Traits in Rust are a perfect match for defining ports.
  • Structs implementing those traits become adapters—testable and swappable.
  • The core domain crate (banker-core) has no dependencies on infrastructure or axum.
  • Tests can exercise the domain logic via fixtures and in-memory mocks.

Hexagonal Architecture in Rust isn’t just theoretical—it’s ergonomic. With traits, lifetimes, and ownership semantics, you can cleanly separate concerns while still writing expressive, high-performance code.

Backpropagation from Scratch

Introduction

One of the most powerful ideas behind deep learning is backpropagation—the algorithm that lets a neural network learn from its mistakes. But while modern tools like PyTorch and TensorFlow make it easy to use backprop, they also hide the magic.

In this post, we’ll strip things down to the fundamentals and implement a neural network from scratch in NumPy to solve the XOR problem.

Along the way, we’ll dig into what backprop really is, how it works, and why it matters.

What Is Backpropagation?

Backpropagation is a method for computing how to adjust the weights—the tunable parameters of a neural network—so that it improves its predictions. It does this by minimizing a loss function, which measures how far off the network’s outputs are from the correct answers. To do that, it calculates gradients, which tell us how much each weight contributes to the overall error and how to adjust it to reduce that error.

Think of it like this:

  • In calculus, we use derivatives to understand how one variable changes with respect to another.
  • In neural networks, we want to know: How much does this weight affect the final error?
  • Enter the chain rule—a calculus technique that lets us break down complex derivatives into manageable parts.

The Chain Rule

Mathematically, if

\[z = f(g(x))\]

then:

\[\frac{dz}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx}\]

Backpropagation applies the chain rule across all the layers in a network, allowing us to efficiently compute the gradient of the loss function for every weight.
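As a quick sanity check, here's the chain rule verified numerically in Python (the particular functions f and g are arbitrary choices for illustration):

```python
import math

# z = f(g(x)) with f(u) = u**2 and g(x) = sin(x)
def g(x): return math.sin(x)
def f(u): return u * u

def dz_dx_chain(x):
    # df/dg = 2*g(x), dg/dx = cos(x); multiply per the chain rule
    return 2 * g(x) * math.cos(x)

def dz_dx_numeric(x, h=1e-6):
    # central finite difference on the composite z(x) = f(g(x))
    return (f(g(x + h)) - f(g(x - h))) / (2 * h)

x = 0.7
print(abs(dz_dx_chain(x) - dz_dx_numeric(x)))  # the two agree to high precision
```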

Neural Network Flow

graph TD
  A[Input Layer] --> B[Hidden Layer]
  B --> C[Output Layer]
  C --> D[Loss Function]
  D -->|Backpropagate| C
  C -->|Backpropagate| B
  B -->|Backpropagate| A

We push inputs forward through the network to get predictions (forward pass), then pull error gradients backward to adjust the weights (backward pass).

Solving XOR with a Neural Network

The XOR problem is a classic test for neural networks. It looks like this:

Input Output
[0, 0] 0
[0, 1] 1
[1, 0] 1
[1, 1] 0

A simple linear model can’t solve XOR because it’s not linearly separable. But with a small neural network—just one hidden layer—we can crack it.
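To see why, here's a quick brute-force check (the grid of candidate weights is my choice): no linear threshold unit of the form w·x + b > 0 reproduces the XOR outputs.

```python
import numpy as np
from itertools import product

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# Search a coarse grid of linear classifiers: predict 1 when w.x + b > 0
found = False
grid = np.linspace(-2, 2, 21)
for w1, w2, b in product(grid, grid, grid):
    pred = (X @ np.array([w1, w2]) + b > 0).astype(int)
    if np.array_equal(pred, y):
        found = True
        break

print(found)  # no linear separator matches XOR
```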

We’ll walk through our implementation step by step.

Activation Functions

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    return x * (1 - x)

def mse_loss(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

We’re using the sigmoid function for both hidden and output layers.

The sigmoid activation function is defined as:

\[\sigma(x) = \frac{1}{1 + e^{-x}}\]

Its smooth curve is perfect for computing gradients.

Its derivative, used during backpropagation, is:

\[\sigma'(x) = \sigma(x) \cdot (1 - \sigma(x))\]
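This identity is easy to check numerically against a finite-difference slope (the grid of test points below is my choice). Note that the post's sigmoid_derivative(x) expects x to already be an activation σ(z); the identity itself looks like this:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.linspace(-4, 4, 9)
h = 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)  # finite-difference slope
analytic = sigmoid(x) * (1 - sigmoid(x))               # the identity above

print(np.max(np.abs(numeric - analytic)))  # near zero
```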

The mse_loss function computes the mean squared error between the network’s predictions and the known correct values (y).

Mathematically, the mean squared error is given by:

\[\text{MSE}(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\]

Where:

  • \(y_i\) is the actual target value (y_true),
  • \(\hat{y}_i\) is the network’s predicted output (y_pred),
  • \(n\) is the number of training samples.
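The np.mean one-liner and the written-out sum are the same computation; here's a small check with made-up predictions:

```python
import numpy as np

y_true = np.array([[0.0], [1.0], [1.0], [0.0]])
y_pred = np.array([[0.1], [0.8], [0.9], [0.2]])  # illustrative values

mse = np.mean((y_true - y_pred) ** 2)

# The same quantity, written out as the sum in the formula above
n = y_true.size
mse_manual = sum((y_true.flat[i] - y_pred.flat[i]) ** 2 for i in range(n)) / n

print(mse, mse_manual)  # both 0.025
```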

Data and Network Setup

X = np.array([
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1]
])

y = np.array([
    [0],
    [1],
    [1],
    [0]
])

The X matrix defines all of our inputs. You can see these as the bit pairs that you’d normally pass through an XOR operation. The y matrix then defines the “well known” outputs.

np.random.seed(42)
input_size = 2
hidden_size = 2
output_size = 1
learning_rate = 0.1

The input_size is the number of input features. We have two values going in as an input here.

The hidden_size is the number of “neurons” in the hidden layer. Hidden layers are where the network transforms input into internal features. XOR requires non-linear transformation, so at least one hidden layer is essential. Setting this to 2 keeps the network small, but expressive enough to learn XOR.

output_size is the number of output neurons. XOR is a binary classification problem so we only need a single output.

Finally, learning_rate controls how fast the network learns. This value scales the size of the weight updates during training. By increasing this value, we get the network to learn faster but we risk overshooting optimal values. Lower values are safer, but slower.

W1 = np.random.randn(input_size, hidden_size)
b1 = np.zeros((1, hidden_size))
W2 = np.random.randn(hidden_size, output_size)
b2 = np.zeros((1, output_size))

We initialize weights randomly and biases to zero. The small network has two hidden units.

Training Loop

We run a “forward pass” and a “backward pass” many times (we refer to these as epochs).

Forward pass

The forward pass takes the input X, feeds it through the network layer by layer, and computes the output a2. Then it calculates how far off the prediction is using a loss function.

# Forward pass
z1 = np.dot(X, W1) + b1
a1 = sigmoid(z1)

z2 = np.dot(a1, W2) + b2
a2 = sigmoid(z2)

loss = mse_loss(y, a2)

In this step, we are calculating the loss for the current set of weights.

This loss is a measure of how “wrong” the network is, and it’s what drives the learning process in the backward pass.

Backward pass

The backward pass is how the network learns—by adjusting the weights based on how much they contributed to the final error. This is done by applying the chain rule in reverse across the network.

# Step 1: Derivative of loss with respect to output (a2)
d_loss_a2 = 2 * (a2 - y) / y.size

This computes the gradient of the mean squared error loss with respect to the output. It answers: How much does a small change in the output affect the loss?

\[\frac{\partial \text{Loss}}{\partial \hat{y}} = \frac{2}{n} (\hat{y} - y)\]

# Step 2: Derivative of sigmoid at output layer
d_a2_z2 = sigmoid_derivative(a2)
d_z2 = d_loss_a2 * d_a2_z2

Now we apply the chain rule. Since the output passed through a sigmoid function, we compute the derivative of the sigmoid to see how a change in the pre-activation \(z_2\) affects the output.

# Step 3: Gradients for W2 and b2
d_W2 = np.dot(a1.T, d_z2)
d_b2 = np.sum(d_z2, axis=0, keepdims=True)

  • a1.T is the transposed output from the hidden layer.
  • d_z2 is the error signal coming back from the output.
  • The dot product calculates how much each weight in W2 contributed to the error.
  • The bias gradient is simply the sum across all samples.

# Step 4: Propagate error back to hidden layer
d_a1 = np.dot(d_z2, W2.T)
d_z1 = d_a1 * sigmoid_derivative(a1)

Now we move the error back to the hidden layer:

  • d_a1 is the effect of the output error on the hidden layer output.
  • We multiply by the derivative of the hidden layer activation to get the true gradient of the hidden pre-activations.

# Step 5: Gradients for W1 and b1
d_W1 = np.dot(X.T, d_z1)
d_b1 = np.sum(d_z1, axis=0, keepdims=True)

  • X.T is the input data, transposed.
  • We compute how each input feature contributed to the hidden layer error.

This entire sequence completes one application of backpropagation—moving from output to hidden to input layer, using the chain rule and computing gradients at each step.

The final gradients (d_W1, d_W2, d_b1, d_b2) are then used in the weight update step:

# Apply the gradients to update the weights
W2 -= learning_rate * d_W2
b2 -= learning_rate * d_b2
W1 -= learning_rate * d_W1
b1 -= learning_rate * d_b1

This updates the model just a little bit—nudging the weights toward values that reduce the overall loss.
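For completeness, here is a minimal sketch that assembles the snippets above into one runnable script. The epoch count of 10,000 and the print interval are my choices; everything else mirrors the pieces shown so far.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    # expects x to already be an activation sigmoid(z)
    return x * (1 - x)

def mse_loss(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

np.random.seed(42)
learning_rate = 0.1
W1 = np.random.randn(2, 2)
b1 = np.zeros((1, 2))
W2 = np.random.randn(2, 1)
b2 = np.zeros((1, 1))

for epoch in range(10000):
    # Forward pass
    a1 = sigmoid(np.dot(X, W1) + b1)
    a2 = sigmoid(np.dot(a1, W2) + b2)
    loss = mse_loss(y, a2)

    # Backward pass: chain rule from output back to input
    d_z2 = 2 * (a2 - y) / y.size * sigmoid_derivative(a2)
    d_W2 = np.dot(a1.T, d_z2)
    d_b2 = np.sum(d_z2, axis=0, keepdims=True)
    d_z1 = np.dot(d_z2, W2.T) * sigmoid_derivative(a1)
    d_W1 = np.dot(X.T, d_z1)
    d_b1 = np.sum(d_z1, axis=0, keepdims=True)

    # Weight update
    W2 -= learning_rate * d_W2
    b2 -= learning_rate * d_b2
    W1 -= learning_rate * d_W1
    b1 -= learning_rate * d_b1

    if epoch % 1000 == 0:
        print(f"Epoch {epoch}, Loss: {loss:.4f}")

print("\nFinal predictions:")
print(a2)
```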

Final Predictions

print("\nFinal predictions:")
print(a2)

When we ran this code, we saw:

Epoch 0, Loss: 0.2558
...
Epoch 9000, Loss: 0.1438

Final predictions:
[[0.1241]
 [0.4808]
 [0.8914]
 [0.5080]]

Interpreting the Results

The network is getting better, but not perfect. Let’s look at what these predictions mean:

Input Expected Predicted Interpreted
[0, 0] 0 0.1241 0
[0, 1] 1 0.4808 ~0.5
[1, 0] 1 0.8914 1
[1, 1] 0 0.5080 ~0.5

It’s nailed [1, 0] and is close on [0, 0], but it’s uncertain about [0, 1] and [1, 1]. That’s okay—XOR is a tough problem when learning from scratch with minimal capacity.

This ambiguity is actually a great teaching point: neural networks don’t just “flip a switch” to get things right. They learn gradually, and sometimes unevenly, especially when training conditions (like architecture or learning rate) are modest.

You can tweak the hidden layer size, activation functions, or even the optimizer to get better results—but the core algorithm stays the same: forward pass, loss computation, backpropagation, weight update.

Conclusion

As it stands, this tiny XOR network is a full demonstration of what makes neural networks learn.

You’ve now seen backpropagation from the inside.

A full version of this program can be found as a gist.