In Part 3 we finalised our boot loader so that it now
successfully loads Stage 2 for us. In this post, we’ll focus on setting up the system so that we can unlock its more
advanced features.
Inside of Stage 2 we’ll look at setting up the following:
Enable the A20 line
Set up a Global Descriptor Table (GDT)
Switch to 32-bit Protected Mode
By the end of this article, we’ll at least be in 32-bit protected mode.
A20 Line
Before we can enter 32-bit protected mode, we need to enable the A20 line.
Back in the original Intel 8086, there were only 20 address lines — A0 through A19 — meaning it could address
1 MiB of memory (from 0x00000 to 0xFFFFF). When Intel introduced the 80286, it gained more address lines
and could access memory above 1 MiB. However, to remain compatible with older DOS software that relied on address
wrap-around (where 0xFFFFF + 1 rolled back to 0x00000), IBM added a hardware gate: A20.
When the A20 line is disabled, physical address bit 20 is forced to 0.
So addresses “wrap” every 1 MiB — 0x100000 looks the same as 0x000000.
When A20 is enabled, memory above 1 MiB becomes accessible.
Protected-mode code, paging, and modern kernels all assume that A20 is on.
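The wrap-around is easy to model outside the machine. The sketch below is only an illustration of the rule described above (bit 20 forced to 0 when the gate is closed); the function name `physical_address` is made up for this example.

```python
A20_BIT = 1 << 20  # physical address line 20


def physical_address(addr: int, a20_enabled: bool) -> int:
    """Model the address the memory bus sees for a given A20 gate state."""
    if a20_enabled:
        return addr
    return addr & ~A20_BIT  # gate closed: bit 20 is forced to 0


print(hex(physical_address(0x100000, a20_enabled=False)))  # 0x0
print(hex(physical_address(0x100000, a20_enabled=True)))   # 0x100000
```

With the gate closed, the first byte above 1 MiB aliases address 0, which is exactly the behaviour old DOS software relied on.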
Enabling A20
To enable the A20 Line, we use the Fast A20 Gate (port 0x92).
Most modern systems and emulators expose bit 1 of port 0x92 (the “System Control Port A”) as a direct A20 enable bit.
Bit 0 — system reset (don’t touch this)
Bit 1 — A20 gate (1 = enabled)
We add the following to do this:
%define A20_GATE 0x92

    in   al, A20_GATE       ; read system control port A
    or   al, 0x02           ; set bit 1 (A20 enable)
    and  al, 0xFE           ; clear bit 0 (reset)
    out  A20_GATE, al
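To see what the read-modify-write sequence does to the port value, here is the same bit manipulation as a small Python sketch. `fast_a20_value` is a hypothetical helper for illustration; it touches no hardware.

```python
A20_ENABLE = 0x02  # bit 1: A20 gate
RESET_BIT = 0x01   # bit 0: system reset, must stay clear


def fast_a20_value(current: int) -> int:
    """Return the byte to write back to port 0x92, given the byte read from it."""
    return (current | A20_ENABLE) & ~RESET_BIT & 0xFF


print(hex(fast_a20_value(0x00)))  # 0x2
print(hex(fast_a20_value(0x01)))  # 0x2
```

All other bits of the port are preserved, which is why the code reads the port first rather than writing a constant.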
Global Descriptor Table (GDT)
When the CPU is in real mode, memory addressing is done through segment:offset pairs. Each segment register
(CS, DS, SS, etc.) represents a base address (shifted left by 4), and the offset is added to that. This gives you
access to 1 MiB of address space — the legacy 8086 model.
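As a quick illustration of the segment:offset arithmetic described above, here is a minimal Python sketch. The helper name `real_mode_linear` is an assumption for this example.

```python
def real_mode_linear(segment: int, offset: int) -> int:
    """Linear address for a real-mode segment:offset pair."""
    return ((segment & 0xFFFF) << 4) + (offset & 0xFFFF)


print(hex(real_mode_linear(0x0000, 0x7C00)))  # 0x7c00
print(hex(real_mode_linear(0x07C0, 0x0000)))  # 0x7c00
print(hex(real_mode_linear(0xFFFF, 0xFFFF)))  # 0x10ffef
```

The first two pairs name the same byte. The last one lands just above 1 MiB, which is the region where the state of the A20 line decides whether the address wraps.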
When we switch to protected mode, the segmentation model changes. Instead of using raw segment values, each segment
register now holds a selector — an index into a table called the Global Descriptor Table (GDT).
The GDT tells the CPU what each segment means:
Its base address
size (limit)
access rights
flags like “code or data”, “read/write”, or “privilege level”
The descriptor layout in 32-bit mode looks like this:
Bits     Field          Description
0-15     Limit (low)    Segment limit (low 16 bits)
16-31    Base (low)     Segment base address (low 16 bits)
32-39    Base (mid)     Segment base (middle 8 bits)
40-47    Access Byte    Type, privilege level, presence
48-51    Limit (high)   High 4 bits of segment limit
52-55    Flags          Granularity, 32-bit flag, etc.
56-63    Base (high)    Segment base (high 8 bits)
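The layout above can be checked by packing the fields into a 64-bit value. This is a sketch under the assumption that the fields are supplied as plain integers; `make_descriptor` is a hypothetical helper, not part of the boot code.

```python
def make_descriptor(base: int, limit: int, access: int, flags: int) -> int:
    """Pack base/limit/access/flags into a 64-bit GDT descriptor."""
    desc = limit & 0xFFFF                    # bits 0-15:  limit (low)
    desc |= (base & 0xFFFF) << 16            # bits 16-31: base (low)
    desc |= ((base >> 16) & 0xFF) << 32      # bits 32-39: base (mid)
    desc |= (access & 0xFF) << 40            # bits 40-47: access byte
    desc |= ((limit >> 16) & 0xF) << 48      # bits 48-51: limit (high)
    desc |= (flags & 0xF) << 52              # bits 52-55: flags
    desc |= ((base >> 24) & 0xFF) << 56      # bits 56-63: base (high)
    return desc


print(f"{make_descriptor(0, 0xFFFFF, 0x9A, 0xC):#018x}")  # 0x00cf9a000000ffff
print(f"{make_descriptor(0, 0xFFFFF, 0x92, 0xC):#018x}")  # 0x00cf92000000ffff
```

A flat segment with base 0, a 20-bit limit of 0xFFFFF and flags 0xC produces the two constants used for the code and data entries in this article.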
In our boot setup, we’ll create a very simple GDT with:
A null descriptor (required; selector 0 is invalid by design).
A code segment descriptor: flat 4 GiB region, executable, readable.
A data segment descriptor: flat 4 GiB region, readable, writable.
This gives us a flat memory model, where all segments start at base 0 and cover the entire address space.
That makes protected mode addressing behave almost like real mode linear memory, simplifying everything until paging
and virtual memory come later.
Once that GDT is loaded with lgdt, we can safely set the PE (Protection Enable) bit in CR0 and perform a far jump
into 32-bit protected mode code.
Defining the GDT
We define our GDT as three quad words. One for null, one for code, and one for data.
align 8
; --- GDT for entering 32-bit PM (null, code, data) ---
gdt32:
    dq 0x0000000000000000       ; null
    dq 0x00CF9A000000FFFF       ; 0x08: 32-bit code, base=0, limit=4GiB
    dq 0x00CF92000000FFFF       ; 0x10: 32-bit data, base=0, limit=4GiB
gdt32_end:                      ; marks the end of the table itself

gdt32_desc:
    dw gdt32_end - gdt32 - 1    ; limit = (size of GDT - 1)
    dd gdt32                    ; base = address of GDT
Breaking down the 32-bit code GDT:
0x00CF9A000000FFFF
If we split this into bytes (little-endian in memory):
[FFFF] [0000] [00][9A] [CF][00]
We can now start to map these to the fields:
Field                     Value    Meaning
Limit (low 16)            0xFFFF   segment limit = 0xFFFF
Base (low 16)             0x0000   base = 0x00000000
Base (mid 8)              0x00     base = 0x00000000
Access Byte               0x9A     flags that define “code, ring 0, present”
Limit (high 4) + flags    0xCF     limit high nibble=0xF, flags=0xC
Base (high 8)             0x00     base = 0x00000000
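The full 20-bit limit is 0xFFFFF (the low 16 bits, 0xFFFF, plus the high nibble, 0xF). Combined with the granularity
bit (G=1), which counts the limit in 4 KiB pages, this makes the segment 4 GiB in size
((0xFFFFF + 1) × 4 KiB = 4 GiB).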
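Going the other way, the breakdown can be reproduced by pulling the fields back out of the 64-bit value. The sketch below is illustrative only; `decode_descriptor` and the dictionary keys are names chosen for this example.

```python
def decode_descriptor(desc: int) -> dict:
    """Split a 64-bit GDT descriptor into its fields."""
    limit = (desc & 0xFFFF) | (((desc >> 48) & 0xF) << 16)
    base = ((desc >> 16) & 0xFFFFFF) | (((desc >> 56) & 0xFF) << 24)
    access = (desc >> 40) & 0xFF
    flags = (desc >> 52) & 0xF
    granular = bool(flags & 0x8)  # G bit: limit counted in 4 KiB pages
    return {
        "base": base,
        "limit": limit,
        "access": access,
        "flags": flags,
        "present": bool(access & 0x80),
        "ring": (access >> 5) & 0x3,
        "executable": bool(access & 0x08),
        "size_bytes": (limit + 1) * (4096 if granular else 1),
    }


code = decode_descriptor(0x00CF9A000000FFFF)
print(hex(code["limit"]), hex(code["access"]), code["size_bytes"] == 4 * 1024**3)
```

For the code descriptor this reports a limit of 0xFFFFF, an access byte of 0x9A and a size of exactly 4 GiB.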
Loading the GDT
Now that we have our GDT defined, we can use lgdt to load it.
The operand to lgdt is a 6-byte structure in memory: a 16-bit limit first, followed by a 32-bit linear address
(in 32-bit mode) of where the GDT starts. In our listing this structure is gdt32_desc.
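The 6-byte operand is simple enough to build by hand. The following sketch packs it with Python's `struct` module; the base address 0x8100 is an invented value for illustration, since the real one depends on where the assembler places the table.

```python
import struct

DESCRIPTOR_SIZE = 8  # each GDT entry is one quad word


def gdt_pointer(gdt_base: int, entry_count: int) -> bytes:
    """Build the 6-byte lgdt operand: 16-bit limit, then 32-bit base."""
    limit = entry_count * DESCRIPTOR_SIZE - 1  # size of the GDT minus one
    return struct.pack("<HI", limit, gdt_base)


pointer = gdt_pointer(0x8100, entry_count=3)  # 0x8100 is a made-up address
print(pointer.hex())  # 170000810000
```

Three entries give a limit of 23 (0x17), stored little-endian ahead of the base address.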
Protected Mode
With the GDT now loaded, we’re free to push over to protected mode. This is 32-bit protected mode, so we’re jumping into
code that needs the [BITS 32] directive.
Once CR0.PE is set, we make our far jump into 32-bit land. This jump both loads CS with the code selector (0x08) and
flushes the prefetch queue; it’s the required way to complete the switch into protected mode.
Immediately afterwards, we set all of our data segment registers to 0x10, which is the data entry in the GDT.
We’re now in 32-bit protected mode.
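The values 0x08 and 0x10 are selectors rather than plain indexes: the low three bits carry the requested privilege level and a table indicator. A small sketch of that split, with `decode_selector` as a hypothetical helper name:

```python
def decode_selector(selector: int) -> dict:
    """Split a segment selector into index, table indicator and RPL."""
    return {
        "index": selector >> 3,       # which descriptor in the table
        "ti": (selector >> 2) & 0x1,  # 0 = GDT, 1 = LDT
        "rpl": selector & 0x3,        # requested privilege level
    }


print(decode_selector(0x08))  # {'index': 1, 'ti': 0, 'rpl': 0}
print(decode_selector(0x10))  # {'index': 2, 'ti': 0, 'rpl': 0}
```

So 0x08 picks GDT entry 1 (code) and 0x10 picks GDT entry 2 (data), both at ring 0.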
Stage 2 (full listing)
Our current code for Stage 2 now looks like this:
; ---------------------------------------------------------
; boot/stage2.asm — loaded by MBR at 0000:8000 (LBA 1..16)
; ---------------------------------------------------------
BITS 16
ORG 0x8000

%define A20_GATE 0x92

start2:
    cli
    xor ax, ax
    mov ds, ax              ; ds = 0 so labels assembled with ORG work as absolute
    mov es, ax
    cld                     ; count upwards
    sti

    call serial_init
    mov si, stage2_msg
    call serial_puts

    in  al, A20_GATE        ; A20 fast
    or  al, 0x02
    and al, 0xFE
    out A20_GATE, al

    mov si, a20_msg
    call serial_puts

    cli
    lgdt [gdt32_desc]

    mov si, gdt_msg         ; report before setting PE, so that nothing
    call serial_puts        ; runs between "mov cr0" and the far jump

    mov eax, cr0
    or  eax, 1              ; CR0.PE=1
    mov cr0, eax

    ; selectors: 0x08 = code32, 0x10 = data32
    jmp 0x08:pm_entry       ; far jump to load 32-bit CS

stage2_msg db "Stage2: OK", 13, 10, 0
a20_msg    db "A20 Line: Enabled", 13, 10, 0
gdt_msg    db "GDT: Loaded", 13, 10, 0

%include "boot/serial16.asm"

[BITS 32]
pm_entry:
    mov ax, 0x10            ; 0x10 = data32
    mov ds, ax
    mov es, ax
    mov ss, ax
    mov fs, ax
    mov gs, ax
    mov esp, 0x90000        ; temporary 32-bit stack

    mov esi, pm_msg
    call serial_puts32

.hang:
    hlt
    jmp .hang

align 8
; --- GDT for entering 32-bit PM (null, code, data) ---
gdt32:
    dq 0x0000000000000000   ; null
    dq 0x00CF9A000000FFFF   ; 0x08: 32-bit code, base=0, limit=4GiB
    dq 0x00CF92000000FFFF   ; 0x10: 32-bit data, base=0, limit=4GiB
gdt32_end:                  ; marks the end of the table itself

gdt32_desc:
    dw gdt32_end - gdt32 - 1
    dd gdt32

pm_msg db "Entered protected mode ...", 13, 10, 0

%include "boot/serial32.asm"
Notes
I’ve had to duplicate the serial assembly file. Originally it was 16-bit only, but now we need 32-bit support.
These routines look a lot like their 16-bit counterparts.
We’ve now built the minimal foundation of a protected-mode operating system: flat memory model, GDT, and a working
serial console. From this point on, we can start using true 32-bit instructions and data structures. In the next post,
we’ll extend this with an Interrupt Descriptor Table (IDT), the Programmable Interval Timer (PIT), and paging, preparing
the system for 64-bit long mode.
In Part 2 we wired up a serial port so we could see life
signs from our bootloader. Now we’re going to take the next big step: loading a second stage from disk. We’ll keep
Stage 2 simple for now; it just needs to prove that control has been transferred.
Our 512-byte boot sector is tiny, so it’ll stay simple: it loads the next few sectors (Stage 2) into memory and jumps
there. Stage 2 will then have a lot more room to set up the processor.
Finishing Stage 1
Before we can get moving with Stage 2, the first stage of our boot process still has a few things left to do.
The BIOS hands our boot sector the boot drive number in DL (e.g., 0x80 for the first HDD, 0x00 for the floppy).
We need to stash that away for later.
    mov [boot_drive], dl
Disk Address Packet (DAP)
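The BIOS provides an extended read function via int 0x13 (AH=0x42) that lets us read Stage 2 off disk and into
memory. It takes a 16-byte structure, the Disk Address Packet, pointed to by DS:SI: a size byte (16), a reserved byte
(0), a 16-bit sector count, the 16-bit destination offset and segment, and a 64-bit starting LBA.
This call reads count sectors, starting at the given LBA, into the segment:offset specified in the DAP.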
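To make the 16-byte layout concrete, here is a sketch that packs the same fields the boot loader writes into its DAP. It uses Python's `struct` module purely for illustration; `make_dap` is a hypothetical helper and the BIOS is not involved.

```python
import struct


def make_dap(count: int, offset: int, segment: int, lba: int) -> bytes:
    """Pack a Disk Address Packet for int 0x13, AH=0x42."""
    return struct.pack(
        "<BBHHHQ",
        16,       # size of the packet in bytes
        0,        # reserved, must be zero
        count,    # number of sectors to read
        offset,   # destination offset
        segment,  # destination segment
        lba,      # 64-bit starting LBA
    )


dap = make_dap(count=16, offset=0x8000, segment=0x0000, lba=1)
print(len(dap), dap.hex())
```

The packet for Stage 2 (16 sectors from LBA 1 to 0000:8000) is exactly 16 bytes long, with every multi-byte field stored little-endian.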
If you recall, we set up our stack at 0x7000. By loading Stage 2 at 0x8000 with 16 sectors (8 KiB), Stage 2 will
occupy 0x8000..0x9FFF, so it won’t collide with the stack or the boot sector.
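The address arithmetic can be double-checked with a few lines of Python. This is only a sanity check of the numbers quoted above; the helper names `region` and `overlaps` are invented for the sketch.

```python
SECTOR_SIZE = 512
STACK_TOP = 0x7000  # stack grows downward from here


def region(start: int, size: int) -> tuple:
    """Inclusive (first, last) address range for a block of memory."""
    return (start, start + size - 1)


def overlaps(a: tuple, b: tuple) -> bool:
    """True when two inclusive ranges share at least one address."""
    return a[0] <= b[1] and b[0] <= a[1]


boot_sector = region(0x7C00, SECTOR_SIZE)
stage2 = region(0x8000, 16 * SECTOR_SIZE)

print([hex(x) for x in stage2])       # ['0x8000', '0x9fff']
print(overlaps(boot_sector, stage2))  # False
print(STACK_TOP < boot_sector[0])     # True
```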
After this call we either have Stage 2 successfully loaded at STAGE2_SEG:STAGE2_OFF or the carry flag will be set; in
which case, we have an error.
If everything has gone OK, we can use a far jmp to transfer control there, still in real mode.
    jmp STAGE2_SEG:STAGE2_OFF
Now that we’ve got a bit more space to work with, we can set some more things up (video, disk i/o, a20 lines, gdt, etc.).
Boot loader
Here’s a full rundown of the boot loader so far:
; ---------------------------------------------------------
; boot/boot.asm: Main boot loader
; ---------------------------------------------------------
;
BITS 16
ORG 0x7C00

%define STAGE2_SEG     0x0000
%define STAGE2_OFF     0x8000
%define STAGE2_LBA     1
%define STAGE2_SECTORS 16

main:
    cli
    xor ax, ax
    mov ss, ax
    mov bp, 0x7000
    mov sp, bp                      ; temp stack setup (so it's below code)
    mov ds, ax                      ; DS = 0 -> labels are absolute 0x7Cxx
    mov es, ax                      ; ES = 0
    cld                             ; lods/stos auto-increment
    sti

    mov [boot_drive], dl            ; remember the BIOS drive

    call serial_init
    mov si, boot_msg
    call serial_puts

    mov si, dap                     ; DAP for stage2 -> 0000:8000
    mov byte  [si], 16
    mov byte  [si+1], 0
    mov word  [si+2], STAGE2_SECTORS
    mov word  [si+4], STAGE2_OFF
    mov word  [si+6], STAGE2_SEG
    mov dword [si+8], STAGE2_LBA
    mov dword [si+12], 0

    mov ax, STAGE2_SEG
    mov es, ax
    mov dl, [boot_drive]
    mov ah, 0x42
    mov si, dap
    int 0x13
    jc  disk_error

    mov si, stage2_msg
    call serial_puts
    jmp STAGE2_SEG:STAGE2_OFF

disk_error:
    mov si, derr_msg
    call serial_puts

.hang:
    hlt
    jmp .hang

%include "boot/serial.asm"

boot_msg   db "Booting ...", 13, 10, 0
stage2_msg db "Starting Stage2 ...", 13, 10, 0
derr_msg   db "Disk error!", 13, 10, 0
boot_drive db 0

dap:
    db 16, 0
    dw 0, 0, 0
    dd 0, 0

times 510-($-$$) db 0
dw 0AA55h
If we were to run this now without a Stage 2 in place, we should pretty reliably get a Disk error!:
Our Stage 2 runs in real mode, but it’s free of the 512-byte limit that our boot loader had. We’ll keep the
implementation very simple right now, just to prove that we’ve jumped over to Stage 2 - and fill it out later.
; ---------------------------------------------------------
; boot/stage2.asm — loaded by MBR at 0000:8000 (LBA 1..16)
; ---------------------------------------------------------
BITS 16
ORG 0x8000                  ; the offset where we were loaded to by MBR

start2:
    cli
    xor ax, ax
    mov ds, ax              ; ds = 0 so labels assembled with ORG work as absolute
    mov es, ax
    cld                     ; count upwards
    sti

    call serial_init
    mov si, stage2_msg
    call serial_puts

.hang:
    hlt
    jmp .hang

stage2_msg db "Stage2: OK", 13, 10, 0

%include "boot/serial.asm"
Building
We now need to include Stage 2 as part of the build in the Makefile. Not only do we need to assemble it, but it also
needs to make it into our final os image:
After stage2.bin is assembled, we pad it out to the full 8 KiB, which is our 16 sectors. This gets appended
after the boot loader in the image.
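The build itself uses the Makefile and shell tools. Purely to illustrate the resulting image layout, here is a Python sketch that performs the same padding and concatenation on byte strings; `build_image` is a hypothetical helper and the inputs are dummy data, not real assembled binaries.

```python
SECTOR_SIZE = 512
STAGE2_SECTORS = 16


def build_image(boot_sector: bytes, stage2: bytes) -> bytes:
    """Boot sector at LBA 0, then Stage 2 zero-padded to 16 sectors."""
    if len(boot_sector) != SECTOR_SIZE:
        raise ValueError("boot sector must be exactly 512 bytes")
    stage2_size = STAGE2_SECTORS * SECTOR_SIZE
    if len(stage2) > stage2_size:
        raise ValueError("Stage 2 does not fit in 16 sectors")
    return boot_sector + stage2.ljust(stage2_size, b"\x00")


# Dummy inputs: an empty boot sector with a signature, and two bytes of "code"
image = build_image(bytes(510) + b"\x55\xAA", b"\xFA\xF4")
print(len(image) // SECTOR_SIZE)  # 17
```

Stage 2 always starts at byte 512 of the image, which is LBA 1, matching STAGE2_LBA in the boot loader.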
With this very simple Stage 2 in place, a quick build and run should confirm that we are up
and running in Stage 2.
➜ make run
qemu-system-x86_64 -drive file=os.img,format=raw,if=ide,media=disk -serial stdio -debugcon file:debug.log -global isa-debugcon.iobase=0xe9 -display none -no-reboot -no-shutdown -d guest_errors,cpu_reset -D qemu.log
Booting ...
Starting Stage2 ...
Stage2: OK
Conclusion
We’ve made it to Stage 2, and we’ve got a great base to work from here. In upcoming posts in this series we’ll start
to use Stage 2 to set up more of the boot process.
In our previous post we successfully set up our development
and build environment, booting a very basic boot loader in QEMU.
In today’s post, we’re going to add some serial integration so that we can see some “signs of life” when our boot
loader runs.
Stack Setup
Before moving on any further, it’ll be a good move for us to set up a temporary stack. It’ll live for as long as our
boot loader lives. We know our boot code is loaded at 0x7C00. The stack grows downward, so we place it below the boot
sector at 0x7000. This keeps it out of the way of our code/data and gives us space to work with. We disable
interrupts while changing SS:SP (an interrupt during this window would push onto an uninitialized stack), then
re-enable them once the stack and data segments are valid.
main:
    cli                 ; disable interrupts
    xor ax, ax
    mov ss, ax
    mov sp, 0x7000      ; stack at 0000:7000 (grows downward)
    mov ds, ax          ; DS = 0 so [label] addresses resolve correctly
    mov es, ax          ; ES = 0
    cld                 ; string ops auto-increment
    sti                 ; re-enable interrupts
We make sure we can’t be interrupted while doing this, so we clear the interrupt flag with cli. Next, set up the
stack so that SS:SP points to 0000:7000. Making ds and es point to the same segment as our code 0000
simplifies things for us. cld makes sure that our lods and stos operations always count ascending. Finally, we
re-enable interrupts.
It’ll look something like this:
graph TB
A0["0x0000 — 0x03FF IVT (Interrupt Vector Table)"]
A1["0x0400 — ~0x04FF BDA (BIOS Data Area)"]
A2["..."]
S["0x7000 (SS:SP start) Stack top → grows downward"]
GAP["0x7000 — 0x7BFF Gap (free space)"]
BOOT["0x7C00 — 0x7DFF Boot sector (512 bytes)"]
A3["... up to conventional memory"]
A0 --> A1 --> A2 --> S --> GAP --> BOOT --> A3
Serial
UART (the serial port) is the early
debugging channel that we’ll use. It’s the standard for debugging x86 and embedded work, so it’s perfect for what we’re
doing.
Registers
We write to and read from the UART registers to set up communications over serial. Here’s each of the registers.
COM1 is defined at a base address of 0x3F8. The following map is an offset from that base.
Offset   Register
+0       THR/RBR or DLL (when DLAB=1): Transmit Holding / Receive Buffer / Divisor Latch Low
+1       IER or DLM (when DLAB=1): Interrupt Enable / Divisor Latch High
+2       IIR/FCR: Interrupt ID / FIFO Control
+3       LCR: Line Control (word length, parity, stop, DLAB)
+4       MCR: Modem Control (DTR, RTS, OUT1, OUT2, LOOP)
+5       LSR: Line Status (TX empty, RX ready, etc.)
+6       MSR: Modem Status
+7       SCR: Scratch
Init
We can now walk through the init code for the serial line.
%define COM1 0x3F8

serial_init:
    ; 1) Disable UART-generated interrupts (clear IER)
    mov dx, COM1 + 1
    xor ax, ax              ; AL=0
    out dx, al              ; IER = 0 (no UART IRQs)

    ; 2) Enable DLAB so we can set the baud rate divisor
    mov dx, COM1 + 3
    mov al, 0x80            ; LCR: DLAB=1
    out dx, al

    ; 3) Set divisor to 1 -> 115200 baud (on standard PC clock)
    mov dx, COM1 + 0
    mov al, 0x01            ; DLL = 1
    out dx, al
    mov dx, COM1 + 1
    xor al, al              ; DLM = 0
    out dx, al

    ; 4) 8 data bits, no parity, 1 stop bit; clear DLAB to use data regs
    mov dx, COM1 + 3
    mov al, 0x03            ; LCR: 8N1 (DLAB=0)
    out dx, al

    ; 5) Enable FIFO, clear RX/TX FIFOs, set 14-byte RX threshold
    mov dx, COM1 + 2
    mov al, 0xC7            ; FCR: 1100_0111b
    out dx, al

    ; 6) Modem Control: assert DTR, RTS, and OUT2
    mov dx, COM1 + 4
    mov al, 0x0B            ; MCR: DTR|RTS|OUT2
    out dx, al
    ret
Each of these steps is needed in the init phase:
IER=0 (no UART IRQs): We’re going to use polling (check LSR bits) in early boot, so we explicitly disable UART interrupts.
DLAB=1, set divisor: Standard PC UART clock (1.8432 MHz / 16 = 115200). A divisor of 1 yields 115200 baud. Later you can choose 2 (57600), 12 (9600), etc.
LCR=0x03 (8N1): The classic “8 data bits, No parity, 1 stop.” Clearing DLAB returns access to THR/RBR/IER instead of the divisor latches.
FCR=0xC7: Enables the FIFO, clears both FIFOs, and sets the RX trigger level to 14 bytes. (On 8250/16450 parts without FIFOs this is ignored—harmless.)
MCR=0x0B: Asserts DTR and RTS so the other side knows we’re ready; sets OUT2, which on PCs typically gates the UART interrupt line (even if we aren’t using IRQs yet, OUT2 is commonly left on).
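The magic numbers in the steps above can be rebuilt from their individual bits. The sketch below does that in Python as a cross-check; the constant and function names are chosen for this example and do not come from the boot code.

```python
UART_CLOCK_HZ = 1_843_200  # standard PC UART clock


def baud_for_divisor(divisor: int) -> int:
    """Baud rate produced by a given divisor latch value."""
    return UART_CLOCK_HZ // (16 * divisor)


def divisor_bytes(divisor: int) -> tuple:
    """Split a divisor into (DLL, DLM), the low and high latch bytes."""
    return (divisor & 0xFF, (divisor >> 8) & 0xFF)


# FCR bits: enable FIFO, clear RX FIFO, clear TX FIFO, 14-byte trigger level
FCR_VALUE = 0x01 | 0x02 | 0x04 | 0xC0
# MCR bits: DTR, RTS, OUT2
MCR_VALUE = 0x01 | 0x02 | 0x08

print(baud_for_divisor(1), baud_for_divisor(2), baud_for_divisor(12))
print(divisor_bytes(1), hex(FCR_VALUE), hex(MCR_VALUE))
```

Divisors 1, 2 and 12 give 115200, 57600 and 9600 baud, and the combined bit masks come out as 0xC7 and 0x0B, the values written by serial_init.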
Waiting
Because working with the UART is asynchronous, we need to wait until the transmitter holding register (THR) is ready
before writing each byte. So this waits for the THR empty flag (bit 5 of the LSR).
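The polling logic is a simple loop on one status bit. Here is a sketch of it in Python against a simulated status register; `wait_for_thr` and the `read_lsr` callback are hypothetical stand-ins for the port read the real routine performs.

```python
LSR_THR_EMPTY = 0x20  # bit 5 of the Line Status Register


def wait_for_thr(read_lsr, max_polls: int = 1000) -> int:
    """Poll the LSR until THR is empty; return how many reads it took."""
    for polls in range(1, max_polls + 1):
        if read_lsr() & LSR_THR_EMPTY:
            return polls
    raise TimeoutError("transmitter never became ready")


# Simulated LSR: busy for two reads, then THR empty
fake_lsr = iter([0x00, 0x00, 0x60])
print(wait_for_thr(lambda: next(fake_lsr)))  # 3
```

The real boot code has no timeout; the `max_polls` guard is an addition for the sketch so that it cannot spin forever.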
Now that we have a way to talk to the serial line, we can use it to prove signs of life in our bootloader.
After our stack is set up, we can start using these functions.
    call serial_init            ; initialize serial
    mov si, msg_alive           ; si = our string to print
    call serial_puts            ; print the string

    hlt                         ; halt
.halt:
    jmp .halt

%include "boot/serial.asm"      ; serial functions

msg_alive db "Serial communications are alive!", 0

times 510-($-$$) db 0
dw 0AA55h
To clean up the boot code, I tucked all of the serial communication code away into an asm file of its own. It’s still
assembled as part of the boot.asm as it’s just text included.
Setting this up for a run, you should see a message in your console.
➜ make run
qemu-system-x86_64 -drive file=os.img,format=raw -serial stdio -debugcon file:debug.log -global isa-debugcon.iobase=0xe9 -display none -no-reboot -no-shutdown -d guest_errors,cpu_reset -D qemu.log
Serial communications are alive!
We are alive!
Conclusion
These recent additions put us in a very strong position. Being able to write breadcrumbs to the console to check
how far execution has got is an invaluable debugging and diagnostic tool.
We’ll continue to build on this when we return for Stage 2.
One of the best ways to learn how computers work is to get as close to the hardware as possible. Writing assembly
language with no other tools or libraries really helps you to understand exactly what makes them tick. I’m building
this article series to walk through the full setup of an x86 system to go from
power on to a minimal running operating system.
I’ll gradually build this from the ground up, introducing concepts as we go through these articles.
Today, we’ll get all the tooling and build environment set up so we can develop comfortably.
Tools
Before we begin, we need some tools installed.
QEMU for virtualising the system that will run our operating system
NASM to be our assembler
Make to manage our build chain
Get these installed on your respective system, and we can get started getting the project directory setup.
Project Setup
First up, let’s create our project directory and get our Makefile and bootloader started.
mkdir byo_os
mkdir byo_os/boot
cd byo_os
Boot loader
A boot loader is the very first piece of software that runs when a computer starts. Its job is to prepare the CPU and
memory so that a full operating system can take over. When the machine powers on, the BIOS (or UEFI) firmware
looks for a bootable program and transfers control to it.
In this tutorial we’re building a BIOS-style boot loader.
When a machine boots in legacy BIOS mode, the firmware reads the first 512 bytes of the boot device — called the
boot sector — into memory at address 0x7C00 and jumps there. Those 512 bytes must end with the magic signature
0xAA55, which tells the BIOS that this sector is bootable. From that point, our code is executing directly on the CPU
in 16-bit real mode, with no operating system or filesystem support at all.
Modern systems use UEFI, which is the successor to BIOS. UEFI firmware looks for a structured executable (a
PE/COFF file) stored on a FAT partition and provides a much richer environment — including APIs for disk I/O,
graphics, and memory services. It’s powerful, but it’s also more complex and hides many of the low-level details we
want to understand.
Starting with BIOS keeps things simple: one sector, one jump, and full control. Once we’ve built a working
real-mode boot loader and kernel, it’ll be easy to explore a UEFI variant later because the CPU initialization concepts
remain the same — only the firmware interface changes.
Here is our first boot loader.
; ./boot/boot.asm
ORG 0x7C00                  ; our code starts at 0x7C00
BITS 16                     ; we're in 16-bit real mode

main:
    cli                     ; no interrupts
    hlt                     ; stop the processor
.halt:
    jmp .halt

times 510-($-$$) db 0       ; pad out to 510 bytes
dw 0AA55h                   ; 2 byte signature
Our boot loader must be 512 bytes. We ensure that it is with times 510-($-$$) db 0. This directive pads our
boot loader out to 510 bytes, leaving space for the final 2 signature bytes dw 0AA55h which all boot loaders must
finish with.
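The padding rule is easy to verify in a few lines of Python. The sketch below builds a 512-byte sector from raw machine code and appends the signature; `make_boot_sector` is a hypothetical helper, and the four code bytes are the encodings of cli, hlt and a short jump to itself.

```python
SECTOR_SIZE = 512
SIGNATURE = b"\x55\xAA"  # dw 0AA55h, stored little-endian


def make_boot_sector(code: bytes) -> bytes:
    """Zero-pad code to 510 bytes and append the boot signature."""
    if len(code) > SECTOR_SIZE - len(SIGNATURE):
        raise ValueError("code does not fit in a boot sector")
    return code.ljust(SECTOR_SIZE - len(SIGNATURE), b"\x00") + SIGNATURE


# cli (FA), hlt (F4), jmp short to itself (EB FE)
sector = make_boot_sector(b"\xFA\xF4\xEB\xFE")
print(len(sector), sector[510:].hex())  # 512 55aa
```

Because x86 is little-endian, the word 0xAA55 appears on disk as the byte 0x55 followed by 0xAA.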
Building
With this code written, we need to be able to build and run it. Using a Makefile is an easy way to wrap up all of
these actions so we don’t need to remember all of the build steps.
This will build a boot/boot.bin for us, and it will also pack it into an os.img which we will use to run our os.
The key lines in making the os image are the dd and truncate. They get our 512 byte boot sector first in the image,
and then the truncate extends the image to 32 sectors (16 KB total) by padding it with zeros. The extra space
simulates a small disk, leaving room for later stages like a kernel or filesystem. The first 512 bytes remain our boot
sector; the rest is just blank space the BIOS ignores for now.
-drive file=os.img,format=raw Attach a raw disk image as the primary drive. When QEMU boots in BIOS mode, it loads the first sector (the MBR) if it ends with the signature 0xAA55.
-serial stdio redirect the guest’s COM1 serial port (I/O 0x3F8) to this terminal’s stdin/stdout, so any serial output from the guest appears in your console.
-debugcon file:debug.log will dump the debug console into a file called debug.log
-global isa-debugcon.iobase=0xe9 Map QEMU’s simple debug console to I/O port 0xE9. Any out 0xE9, al from your code is appended to debug.log
-display none Disables the graphical display window. No VGA text output will be visible unless you use -nographic, serial, or the 0xE9 debug console
-no-reboot on a guest reboot request, do not reboot; QEMU exits instead (handy for catching triple-fault loops).
-no-shutdown on a guest power-off, don’t quit QEMU; keep it running so logs/console remain available.
-d guest_errors,cpu_reset Enables QEMU’s internal debug logging for guest faults and CPU resets (for example, triple faults). The messages are written to the file specified by -D
-D qemu.log Write QEMU’s debug logs (from -d) to qemu.log instead of stderr.
We plan to print with BIOS INT 0x10 later on, so this command will evolve as we go.
Running
Let’s give it a go.
By running make you should see output like this:
➜ make
nasm -f bin boot/boot.asm -o boot/boot.bin
rm -f os.img
dd if=boot/boot.bin of=os.img bs=512 count=1 conv=notrunc
1+0 records in
1+0 records out
512 bytes copied, 9.4217e-05 s, 5.4 MB/s
truncate -s $((32*512)) os.img
You can then run this with make run:
➜ make run
qemu-system-x86_64 -drive file=os.img,format=raw -serial stdio -debugcon file:debug.log -global isa-debugcon.iobase=0xe9 -display none -no-reboot -no-shutdown -d guest_errors,cpu_reset -D qemu.log
And there you have it. Our bootloader ran very briefly, and now our machine is halted.
Conclusion
We’ve managed to set up our build environment and get a very simple boot loader executing in QEMU. In further
tutorials we’ll look at integrating the serial COM1 port so that we can get some signs of life reported out to the
console.
The Naive Bayes classifier is one of the simplest algorithms in machine learning, yet it’s surprisingly powerful.
It answers the question:
“Given some evidence, what is the most likely class?”
It’s naive because it assumes that features are conditionally independent given the class. That assumption rarely
holds in the real world — but the algorithm still works remarkably well for many tasks such as spam filtering, document
classification, and sentiment analysis.
At its core, Naive Bayes is just counting, multiplying probabilities, and picking the largest one.
Bayes’ Rule Refresher
First, let’s start with a quick definition of terms.
Class is the label that we’re trying to predict. In our example below, the class will be either “spam” or “ham”
(not spam).
The features are the observed pieces of evidence. For text, features are usually the words in a message.
P is shorthand for “probability”.
P(Class) = the prior probability: how likely a class is before seeing any features.
P(Features | Class) = the likelihood: how likely it is to see those words if the class is true.
P(Features) = the evidence: how likely the features are overall, across the classes. This acts as a normalising constant so probabilities sum to 1.
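Putting these terms together gives Bayes’ rule in its standard form. The second line states the naive independence assumption for a message made up of words; the symbols w_1 … w_n are notation introduced here for illustration.

```latex
\[
P(\text{Class} \mid \text{Features})
  = \frac{P(\text{Features} \mid \text{Class}) \, P(\text{Class})}{P(\text{Features})}
\]

\[
P(\text{Features} \mid \text{Class})
  = \prod_{i=1}^{n} P(w_i \mid \text{Class})
\]
```

Since P(Features) is the same for every class, comparing classes only requires the numerator.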
So both classes land on the same score — a perfect tie, in this example.
Python Demo (from scratch)
Here’s a tiny implementation that mirrors the example above:
from collections import Counter, defaultdict

# Training data
docs = [
    ("spam", "buy cheap"),
    ("spam", "cheap pills"),
    ("ham", "meeting schedule"),
    ("ham", "project meeting"),
]

class_counts = Counter()
word_counts = defaultdict(Counter)

# Build counts
for label, text in docs:
    class_counts[label] += 1
    for word in text.split():
        word_counts[label][word] += 1


def classify(text, alpha=1.0):
    words = text.split()
    scores = {}
    total_docs = sum(class_counts.values())
    vocab = {w for counts in word_counts.values() for w in counts}
    V = len(vocab)

    for label in class_counts:
        # Prior
        score = class_counts[label] / total_docs
        total_words = sum(word_counts[label].values())

        for word in words:
            count = word_counts[label][word]
            # Laplace smoothing
            score *= (count + alpha) / (total_words + alpha * V)

        scores[label] = score

    # Pick the class with the highest score
    return max(scores, key=scores.get), scores


print(classify("cheap project"))
print(classify("project schedule"))
print(classify("cheap schedule"))
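With this training data and Laplace smoothing, "cheap project" does not come out as an exact tie: it scores about
0.015 for spam against 0.010 for ham, because "cheap" appears twice in the spam documents while "project" appears only
once in ham. "project schedule" is clearly ham (about 0.020 against 0.005), and "cheap schedule" is classed as spam
for the same reason as "cheap project".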
Real-World Notes
Naive Bayes is fast, memory-efficient, and easy to implement.
Works well for text classification, document tagging, and spam filtering.
The independence assumption is rarely true, but in practice that often matters less than you’d expect; the classifier still performs surprisingly well.
In production, you’d tokenize better, remove stop words, and work with thousands of documents.
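One practical adjustment at that scale, which is not part of the demo above, is to add log-probabilities instead of multiplying raw probabilities, since a long product of small numbers can underflow to zero. A minimal sketch, assuming the prior and the smoothed per-word likelihoods have already been computed (`log_score` is a hypothetical helper):

```python
import math


def log_score(prior: float, likelihoods: list) -> float:
    """Sum of logs: equivalent to the product, but safe from underflow."""
    return math.log(prior) + sum(math.log(p) for p in likelihoods)


# Same numbers as a two-word message: prior 0.5, likelihoods 0.3 and 0.1
spam = log_score(0.5, [0.3, 0.1])
ham = log_score(0.5, [0.1, 0.2])
print(spam > ham)  # True

# A long message would underflow as a product but not as a sum of logs
print(math.prod([1e-5] * 100), log_score(0.5, [1e-5] * 100))
```

Because the logarithm is monotonic, the class with the highest log score is the same class that would have won on the raw product.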
Conclusion
Building a Naive Bayes classifier from first principles is a great exercise because it shows how machine learning can be
just careful counting and probability. With priors, likelihoods, and a dash of smoothing, you get a surprisingly useful
classifier — all without heavy math or libraries.