Act II · Region 02

Inside the Machine

The transformer block, earned one piece at a time

The showpiece region. You will assemble a modern 2026 transformer block from first principles — attention, multi-head, GQA, SwiGLU, RoPE, NoPE, local-global, tied embeddings — with a playable model for every component and the math made visible, not hidden.

  1. Attention, from first principles
     Drag a query vector and watch the softmax weights redraw live (worked example after this list)
  2. Attention in production
     Prefill vs decode, the KV cache as a tensor, one kernel call serving many users (sketch below)
  3. Multi-head attention
     Why parallel heads learn different patterns
  4. From MHA to GQA
     Collapse 32 heads into 8 KV groups and the KV cache shrinks 4× (arithmetic below)
  5. SwiGLU vs GeLU
     The gated linear unit and why it wins (sketch below)
  6. RoPE as rotation
     Position as literal rotation in the plane (sketch below)
  7. NoPE layers
     Sometimes the best position encoding is none
  8. Local + global attention
     Gemma 3's 5:1 interleaving trick (sketch below)
  9. Tied embeddings
     Halve the embedding parameter count with one assignment (sketch below)
  10. Build-a-block capstone
      Assemble the canonical 2026 transformer layer (skeleton below)
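The sketches promised above, in lesson order. Lesson 1's query-drag interactive reduces to a few lines of linear algebra; a minimal NumPy sketch of scaled dot-product attention (shapes and values are illustrative, not taken from the lesson):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d = 4                                # head dimension (illustrative)
keys = rng.standard_normal((5, d))   # 5 key vectors
values = rng.standard_normal((5, d)) # matching value vectors
query = rng.standard_normal(d)       # the vector you drag in the demo

scores = keys @ query / np.sqrt(d)   # scaled dot products, shape (5,)
weights = softmax(scores)            # these redraw as the query moves
output = weights @ values            # attention output: weighted mix of values
```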
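For lesson 2, the prefill/decode split is easiest to see in code. A hedged sketch with names of my own choosing, not the lesson's: prefill computes K and V for the whole prompt at once; each decode step then appends a single (k, v) pair to the cache and attends over everything cached so far.

```python
import numpy as np

def decode_step(q, k_new, v_new, k_cache, v_cache):
    # Append this step's key/value to the cache, then attend over all of it.
    k_cache = np.vstack([k_cache, k_new])        # (t+1, d)
    v_cache = np.vstack([v_cache, v_new])
    scores = k_cache @ q / np.sqrt(q.shape[-1])  # one row per cached token
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ v_cache, k_cache, v_cache         # output plus the grown cache
```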
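The 4× figure in lesson 4 is just counting: the KV cache scales with the number of KV heads, so sharing 8 KV heads across 32 query heads divides the cache by 4. A back-of-envelope sketch (the config numbers are illustrative, not a specific model):

```python
# Per-token KV cache: 2 tensors (K and V) x kv_heads x head_dim x bytes, per layer.
head_dim, n_layers, bytes_per_elem = 128, 32, 2   # fp16, illustrative config

def kv_bytes_per_token(n_kv_heads):
    return 2 * n_kv_heads * head_dim * bytes_per_elem * n_layers

mha = kv_bytes_per_token(32)  # MHA: every query head keeps its own K and V
gqa = kv_bytes_per_token(8)   # GQA: 4 query heads share each KV head
print(mha // gqa)             # 4, i.e. the cache shrinks 4x
```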
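Lesson 5's gated unit fits in three lines. A sketch of a SwiGLU feed-forward in NumPy (weight names are placeholders): where a GeLU MLP applies one nonlinearity to one projection, SwiGLU multiplies a SiLU-activated gate path against a linear up path before projecting back down.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))   # x * sigmoid(x), a.k.a. Swish

def swiglu_ffn(x, w_gate, w_up, w_down):
    # The gate path modulates the up path elementwise, then project to d_model.
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down
```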
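Lesson 6's "position as rotation" is literal: RoPE pairs up feature dimensions and rotates each pair by an angle proportional to the token's position, with a different frequency per pair. A minimal sketch under common conventions (base 10000 is the usual default; the function name and even/odd pairing are my choices):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    # Rotate each (even, odd) feature pair of x by pos * freq radians.
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)   # one frequency per pair
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin        # a 2-D rotation, pair by pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```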
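The 5:1 pattern in lesson 8 is a layer schedule: five sliding-window (local) layers for every full (global) attention layer. One way to write that schedule down (the layer count is illustrative; Gemma 3 is the lesson's reference point):

```python
n_layers = 24
kinds = ["global" if (i + 1) % 6 == 0 else "local" for i in range(n_layers)]
# ['local', 'local', 'local', 'local', 'local', 'global', 'local', ...]
print(kinds.count("local"), kinds.count("global"))   # 20 4, a 5:1 ratio
```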
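Lesson 9's "one assignment" is exactly that. In PyTorch the input embedding and the output head are both (vocab, d_model) matrices, so pointing one at the other shares the storage (sizes here are illustrative):

```python
import torch.nn as nn

vocab, d_model = 32000, 1024
embed = nn.Embedding(vocab, d_model)             # token id -> vector
lm_head = nn.Linear(d_model, vocab, bias=False)  # vector -> logits

lm_head.weight = embed.weight   # the one assignment: one matrix, two roles
```

This removes one of the two vocab × d_model matrices, which is where the halved embedding parameter count comes from.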
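And the capstone's target shape, stripped to a skeleton: a pre-norm residual block, attention then gated MLP. This sketch uses stock PyTorch modules as stand-ins (plain multi-head attention in place of the GQA + RoPE attention the capstone builds; nn.RMSNorm needs PyTorch ≥ 2.4), so the structure is the point, not the parts:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlockSkeleton(nn.Module):
    def __init__(self, d=1024, hidden=4096, heads=8):
        super().__init__()
        self.norm1 = nn.RMSNorm(d)   # pre-norm before attention
        self.norm2 = nn.RMSNorm(d)   # pre-norm before the MLP
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.gate = nn.Linear(d, hidden, bias=False)   # SwiGLU gate path
        self.up = nn.Linear(d, hidden, bias=False)     # SwiGLU up path
        self.down = nn.Linear(hidden, d, bias=False)   # back to d_model

    def forward(self, x):             # x: (batch, seq, d)
        h = self.norm1(x)
        a, _ = self.attn(h, h, h)     # stand-in for GQA + RoPE attention
        x = x + a                     # residual connection
        h = self.norm2(x)
        return x + self.down(F.silu(self.gate(h)) * self.up(h))
```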