Understanding CPUs, Branches, and Branch Prediction

15 Mar, 2025

This post is a journey into the heart of how computers think—specifically, how CPUs handle decisions in your code. To ground things, I’ll reference a small project I built: a credit card validator in x86 assembly, finished around March 15, 2025. It checks card numbers using the Luhn algorithm and identifies types like Visa or MasterCard. But that’s just the appetizer—the real meal is understanding CPUs, instructions, branches, pipelines, and branch prediction from scratch. By the end, you’ll know why my validator started slow and how I made it snappier, plus get a solid grip on what’s happening under your code’s hood. Let’s start at square one.

What’s a CPU, Anyway?

The Central Processing Unit (CPU) is the brain of your computer. It’s a tiny chip—smaller than a postage stamp—that executes every instruction in your programs, from opening a browser to running a game. Think of it as a tireless worker reading a to-do list: “Add these numbers,” “Move this data,” “Check if X is true.” That list is your program, written in a language the CPU understands called machine code—raw 0s and 1s.

But programmers don’t write 0s and 1s (unless they’re masochists). We use higher-level languages like Python or C, which get translated into machine code by compilers. My validator used assembly language, a step above machine code where instructions like mov (move data) or add (add numbers) replace cryptic binary. For example, here’s a snippet from my project’s luhn.asm:

mov eax, ebx    ; Copy value from ebx to eax
add edx, eax    ; Add eax to edx

The CPU I targeted was x86 (specifically i386, a 32-bit architecture common in older Intel/AMD chips). It has registers—tiny, super-fast storage slots like eax and edx, each holding 32 bits (4 bytes) of data. The CPU fetches instructions, uses registers to juggle numbers, and churns out results.

Instructions and the Flow of Code

A program is just a sequence of instructions. Most are straightforward: “Add 5 to eax,” “Store this in memory.” The CPU plows through them in order, one after another. But sometimes, you need decisions—“If the number’s even, do this; if odd, do that.” That’s where branches come in.

A branch is an instruction that can change the flow. There are two flavors:

Unconditional: “Jump to this spot, no questions asked.” Like jmp .next_digit in my code—it always leaps.
Conditional: “Jump only if something’s true.” Like cmp eax, 9; jle .add_doubled—compares eax to 9 and jumps if less-or-equal.

Here’s a piece of my validator’s Luhn loop, checking digits:

cmp ecx, [esp]       ; Compare loop counter to end point
je .check_sum        ; If equal, jump to finish
movzx eax, byte [esi] ; Load next digit
sub eax, '0'         ; Convert ASCII to number
test ecx, 1          ; Is position odd?
jz .add_digit        ; If zero (even), skip doubling

Branches let code adapt, but they’re troublemakers. The CPU doesn’t know the next step until it crunches the condition, and that delay messes with its rhythm.

The Pipeline: Speed Through Overlap

To understand why, meet the pipeline. CPUs don’t execute one instruction at a time—they’re multitaskers. Imagine a car assembly line: one worker welds the frame, another bolts the seats, another paints. By the time the first car’s painted, three more are in progress. The CPU’s pipeline does this with instructions, splitting execution into stages. For x86, it’s roughly:

Fetch: Grab the instruction from memory.
Decode: Break it down (e.g., add eax, 5 means “add 5 to register eax”).
Execute: Do the work—add, move, compare.

This overlap aims for one instruction finished per cycle. But branches? They’re the wrench in the gears.

Branches Break the Flow

When the CPU hits a branch—like je .check_sum—it’s stuck. Should it fetch the next line or jump elsewhere? It can’t decide until the condition (cmp ecx, [esp]) resolves in the execute stage. Without help, it stalls, waiting 2-3 cycles before fetching the right instruction. In my validator, every digit had multiple branches, and card type checks piled on more:

detect_card_type:
    movzx eax, byte [ebx]  ; Load first digit
    sub eax, '0'           ; Convert to number
    cmp eax, 4             ; Is it 4?
    je .visa               ; Yes, it’s Visa
    cmp eax, 5             ; Is it 5?
    je .check_mastercard   ; Maybe MasterCard

For a 16-digit card, that’s dozens of decision points. Stalls galore—think of it as the CPU twiddling its thumbs while the pipeline sits empty.

Branch Prediction: The CPU’s Educated Guess

Enter branch prediction. Instead of stalling, the CPU guesses which way the branch goes and fetches ahead. It’s like betting on a coin flip—right guess, you win; wrong guess, you pay. In my validator, prediction was a lifesaver, but it’s not foolproof.

How It Guesses

CPUs use tricks to predict:

Static Prediction:
- Dumb but simple: “Always assume not taken” (keep going) or “backward jumps are taken” (loops repeat).
- Early CPUs loved this. Useless for my random card type checks, though.
Dynamic Prediction:
- Tracks history with a Branch History Table (BHT)—a small memory noting past branch behavior.
- 1-bit version: “Last time taken = predict taken.” Too flip-floppy for my Luhn loop’s every-other-digit pattern.
Two-Bit Predictor:
- Smarter: a 2-bit counter per branch (0-3). States: strongly not taken (0), weakly not taken (1), weakly taken (2), strongly taken (3).
- Taken bumps it up, not taken bumps it down. Takes two flips to change its mind—perfect for steady loops.
Target Prediction:
- A Branch Target Buffer (BTB) guesses where taken branches go (e.g., .visa’s address), not just if they’re taken.

The Cost of Being Wrong

A wrong guess—a misprediction—is brutal. The pipeline’s loaded with junk from the wrong path. It flushes everything post-fetch and restarts, costing 5-20 cycles depending on pipeline depth (i386’s on the lower end). In my original Luhn loop:

.luhn_loop:
    cmp ecx, [esp]       ; End?
    je .check_sum
    test ecx, 1          ; Odd position?
    jz .add_digit
    shl eax, 1           ; Double
    cmp eax, 9           ; > 9?
    jle .add_doubled
    sub eax, 9           ; Adjust

Four branches per digit, 16 digits—64 chances to guess wrong. Random numbers meant ~50% accuracy, so 32 mispredictions at 5-10 cycles each = 160-320 wasted cycles. Card type’s four-branch chain added another 20-40. Ouch.

Fixing the Validator: Less Guessing, More Winning

I couldn’t let the CPU flounder. Here’s how I cut branches:

Lookup Table:
- Luhn’s doubling (0→0, 1→2, ..., 5→1, ...) became a precomputed array:
```
SECTION .data
luhn_db db 0, 2, 4, 6, 8, 1, 3, 5, 7, 9
...
movzx eax, byte [luhn_db + eax]  ; Grab result
```
- Nuked two branches per doubled digit. Predictor’s workload dropped.

Jump Table:

Card type’s if-else mess became:

SECTION .data
jmp_tbl dd .unknown, .unknown, .unknown, .check_amex,
        dd .visa, .check_mastercard, .check_discover
...
cmp eax, 6
ja .unknown
jmp [jmp_tbl + eax * 4]

One branch instead of four. CPU guesses once, then zooms.

Slick Arithmetic:

MasterCard’s range check:

.check_mastercard:
    movzx eax, byte [ebx + 1]
    sub eax, '0'
    sub eax, 1        ; 1-5 → 0-4
    cmp eax, 4
    jbe .mastercard

Two branches to one. Less room for error.

Total branches fell from ~36 to ~18 per run, halving misprediction costs to 90-180 cycles. Not bad for a little assembly hack.

Why It All Matters

CPUs are built for speed, but branches throw curveballs. Pipelines crave steady flow; prediction keeps them humming when branches hit. My validator’s tweaks didn’t just save time—they showed how code shapes hardware behavior. Fewer branches = fewer guesses = happier CPU. Next time you write a loop or an if-chain, think: “Could I dodge a decision?” Your pipeline will thank you.

#Low Level #NASM #Optimizations #Problem Solving