Act II · Region 02

Inside the Machine

The transformer block, earned one piece at a time

The showpiece region. You will assemble a modern 2026 transformer block from first principles — attention, multi-head, GQA, SwiGLU, RoPE, NoPE, local-global, tied embeddings — with a playable model for every component and the math made visible, not hidden.

  1. Attention, from first principles
     Drag a query vector and watch the softmax weights redraw live (worked example after this list)
  2. Attention in production
     Prefill vs decode, the KV cache as a tensor, one kernel call serving many users (sketch below)
  3. Multi-head attention
     Why parallel heads learn different patterns
  4. From MHA to GQA
     Collapse 32 heads into 8 KV groups and the KV cache shrinks 4× (arithmetic below)
  5. SwiGLU vs GeLU
     The gated linear unit and why it wins (sketch below)
  6. RoPE as rotation
     Position as literal rotation in the plane (sketch below)
  7. NoPE layers
     Sometimes the best position encoding is none
  8. Local + global attention
     Gemma 3's 5:1 interleaving trick (sketch below)
  9. Tied embeddings
     Halve the embedding parameter count with one assignment (sketch below)
  10. Build-a-block capstone
      Assemble the canonical 2026 transformer layer (skeleton below)
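The sketches promised above, in lesson order. Lesson 1's query-drag interactive reduces to a few lines of linear algebra; a minimal NumPy sketch of scaled dot-product attention (shapes and values are illustrative, not taken from the lesson):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d = 4                                # head dimension (illustrative)
keys = rng.standard_normal((5, d))   # 5 key vectors
values = rng.standard_normal((5, d)) # matching value vectors
query = rng.standard_normal(d)       # the vector you drag in the demo

scores = keys @ query / np.sqrt(d)   # scaled dot products, shape (5,)
weights = softmax(scores)            # these redraw as the query moves
output = weights @ values            # attention output: weighted mix of values
```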
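For lesson 2, the prefill/decode split is easiest to see in code. A hedged sketch with names of my own choosing, not the lesson's: prefill computes K and V for the whole prompt at once; each decode step then appends a single (k, v) pair to the cache and attends over everything cached so far.

```python
import numpy as np

def decode_step(q, k_new, v_new, k_cache, v_cache):
    # Append this step's key/value to the cache, then attend over all of it.
    k_cache = np.vstack([k_cache, k_new])        # (t+1, d)
    v_cache = np.vstack([v_cache, v_new])
    scores = k_cache @ q / np.sqrt(q.shape[-1])  # one row per cached token
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ v_cache, k_cache, v_cache         # output plus the grown cache
```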
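The 4× figure in lesson 4 is just counting: the KV cache scales with the number of KV heads, so sharing 8 KV heads across 32 query heads divides the cache by 4. A back-of-envelope sketch (the config numbers are illustrative, not a specific model):

```python
# Per-token KV cache: 2 tensors (K and V) x kv_heads x head_dim x bytes, per layer.
head_dim, n_layers, bytes_per_elem = 128, 32, 2   # fp16, illustrative config

def kv_bytes_per_token(n_kv_heads):
    return 2 * n_kv_heads * head_dim * bytes_per_elem * n_layers

mha = kv_bytes_per_token(32)  # MHA: every query head keeps its own K and V
gqa = kv_bytes_per_token(8)   # GQA: 4 query heads share each KV head
print(mha // gqa)             # 4, i.e. the cache shrinks 4x
```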
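Lesson 5's gated unit fits in three lines. A sketch of a SwiGLU feed-forward in NumPy (weight names are placeholders): where a GeLU MLP applies one nonlinearity to one projection, SwiGLU multiplies a SiLU-activated gate path against a linear up path before projecting back down.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))   # x * sigmoid(x), a.k.a. Swish

def swiglu_ffn(x, w_gate, w_up, w_down):
    # The gate path modulates the up path elementwise, then project to d_model.
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down
```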
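Lesson 6's "position as rotation" is literal: RoPE pairs up feature dimensions and rotates each pair by an angle proportional to the token's position, with a different frequency per pair. A minimal sketch under common conventions (base 10000 is the usual default; the function name and even/odd pairing are my choices):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    # Rotate each (even, odd) feature pair of x by pos * freq radians.
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)   # one frequency per pair
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin        # a 2-D rotation, pair by pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```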
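The 5:1 pattern in lesson 8 is a layer schedule: five sliding-window (local) layers for every full (global) attention layer. One way to write that schedule down (the layer count is illustrative; Gemma 3 is the lesson's reference point):

```python
n_layers = 24
kinds = ["global" if (i + 1) % 6 == 0 else "local" for i in range(n_layers)]
# ['local', 'local', 'local', 'local', 'local', 'global', 'local', ...]
print(kinds.count("local"), kinds.count("global"))   # 20 4, a 5:1 ratio
```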
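Lesson 9's "one assignment" is exactly that. In PyTorch the input embedding and the output head are both (vocab, d_model) matrices, so pointing one at the other shares the storage (sizes here are illustrative):

```python
import torch.nn as nn

vocab, d_model = 32000, 1024
embed = nn.Embedding(vocab, d_model)             # token id -> vector
lm_head = nn.Linear(d_model, vocab, bias=False)  # vector -> logits

lm_head.weight = embed.weight   # the one assignment: one matrix, two roles
```

This removes one of the two vocab × d_model matrices, which is where the halved embedding parameter count comes from.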
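And the capstone's target shape, stripped to a skeleton: a pre-norm residual block, attention then gated MLP. This sketch uses stock PyTorch modules as stand-ins (plain multi-head attention in place of the GQA + RoPE attention the capstone builds; nn.RMSNorm needs PyTorch ≥ 2.4), so the structure is the point, not the parts:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlockSkeleton(nn.Module):
    def __init__(self, d=1024, hidden=4096, heads=8):
        super().__init__()
        self.norm1 = nn.RMSNorm(d)   # pre-norm before attention
        self.norm2 = nn.RMSNorm(d)   # pre-norm before the MLP
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.gate = nn.Linear(d, hidden, bias=False)   # SwiGLU gate path
        self.up = nn.Linear(d, hidden, bias=False)     # SwiGLU up path
        self.down = nn.Linear(hidden, d, bias=False)   # back to d_model

    def forward(self, x):             # x: (batch, seq, d)
        h = self.norm1(x)
        a, _ = self.attn(h, h, h)     # stand-in for GQA + RoPE attention
        x = x + a                     # residual connection
        h = self.norm2(x)
        return x + self.down(F.silu(self.gate(h)) * self.up(h))
```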