Multi-head attention
Split d_model into h independent heads, each free to learn syntax, coreference, or position patterns, and see how the W_O projection recombines them.
Lab 02 · Attention Under the Microscope · 60–90 min
One head only sees one thing
The attention mechanism you learned about in the last lesson computes a single weighted average over values. “Single” is the problem. A real sentence has many relationships happening at once — syntax, semantics, coreference, positional nearness, dependency structure — and a single weighted average cannot express them all simultaneously without collapsing them.
The fix, almost trivially, is to run several attention heads in parallel, each with its own learned projections. Each head can specialise in a different pattern. Their outputs are then concatenated and linearly mixed into the final layer output.
The learned weight matrices W_Q, W_K, and W_V are different for each head. That is how the heads end up attending to different things even though the input is identical.
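The parallel-heads scheme described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: it assumes standard scaled dot-product attention inside each head, uses random matrices in place of learned weights, and omits the causal mask that a model like GPT-2 applies. The names `multi_head_attention`, `W_Q`, `W_K`, `W_V`, and `d_k` are chosen here for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    """X: (seq, d_model); W_Q, W_K, W_V: (h, d_model, d_k); W_O: (h * d_k, d_model)."""
    h, _, d_k = W_Q.shape
    head_outputs, head_weights = [], []
    for i in range(h):
        # Every head sees the same input but projects it with its own matrices.
        Q, K, V = X @ W_Q[i], X @ W_K[i], X @ W_V[i]
        A = softmax(Q @ K.T / np.sqrt(d_k))      # (seq, seq) attention weights
        head_weights.append(A)
        head_outputs.append(A @ V)               # (seq, d_k) weighted average of values
    concat = np.concatenate(head_outputs, axis=-1)  # (seq, h * d_k)
    return concat @ W_O, np.stack(head_weights)

# Random stand-ins for learned weights (assumption: toy sizes for readability).
rng = np.random.default_rng(0)
seq, d_model, h = 5, 16, 4
d_k = d_model // h
X = rng.normal(size=(seq, d_model))
W_Q, W_K, W_V = (rng.normal(size=(h, d_model, d_k)) for _ in range(3))
W_O = rng.normal(size=(h * d_k, d_model))

out, weights = multi_head_attention(X, W_Q, W_K, W_V, W_O)
print(out.shape, weights.shape)
```

Because each head has its own projections, the per-head weight matrices in `weights` differ from one another even though `X` is shared, and each row of every matrix still sums to 1.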
Real heads, real patterns
The probe below shows what GPT-2 small (the 124-million-parameter model that started the modern era) actually does on real sentences. Pick a sentence, pick a layer, pick a head — you're looking at the unmodified attention weights from a forward pass.
Each sentence below was chosen because it surfaces a phenomenon documented in the interpretability literature. The featured-heads cards beneath the probe link each visible pattern back to the paper that first described it.
The mixing happens at the end
After each head computes its own weighted sum, their outputs are concatenated and projected through a final matrix W_O. This is not decorative: the projection is what lets the downstream FFN receive a single unified vector that contains information from all heads. Without it, the layer would output a giant concatenated blob with no interaction between heads.
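One way to see what the output projection does is to note that concatenating the heads and multiplying by the projection matrix gives the same result as sending each head through its own block of rows of that matrix and summing. The sketch below checks this numerically; the per-head outputs and `W_O` are random stand-ins, and the sizes are arbitrary choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
seq, h, d_k = 3, 4, 2
d_model = h * d_k

# Stand-ins for the per-head outputs, each of shape (seq, d_k).
heads = [rng.normal(size=(seq, d_k)) for _ in range(h)]
W_O = rng.normal(size=(h * d_k, d_model))

# View 1: concatenate along the feature axis, then project.
mixed = np.concatenate(heads, axis=-1) @ W_O

# View 2: split W_O into h row blocks of shape (d_k, d_model),
# project each head with its own block, and add the results.
blocks = np.split(W_O, h, axis=0)
summed = sum(head @ block for head, block in zip(heads, blocks))

print(np.allclose(mixed, summed))
```

Every coordinate of the projected output is therefore a learned linear combination of features from all heads, which is the mixing the text refers to.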
A subtle and important fact: the head dimension is usually d_k = d_model / h, so that after concatenation the shape is the same as the input. Whenever d_model / h = 128, each head operates in a 128-dimensional subspace. This is also how Phi-4-mini is configured: 24 query heads, each with d_k = 128. You've now seen where that number comes from.
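The arithmetic is small enough to check directly. The helper below, with the hypothetical name `head_dim`, assumes the even split d_k = d_model / h. The GPT-2 small figures (768-wide model, 12 heads) are that model's published configuration. The second call only uses the two numbers given in the text, 24 heads of width 128, and derives the concatenated width from them rather than quoting a model width.

```python
def head_dim(d_model: int, h: int) -> int:
    """Per-head width when d_model is split evenly across h heads."""
    if d_model % h != 0:
        raise ValueError(f"d_model={d_model} is not divisible by h={h}")
    return d_model // h

# GPT-2 small: a 768-wide model with 12 heads per layer.
gpt2_small = head_dim(768, 12)

# 24 heads of width 128 concatenate to 24 * 128 columns,
# so splitting that width back across 24 heads recovers 128.
concat_width = 24 * 128
wide = head_dim(concat_width, 24)

print(gpt2_small, concat_width, wide)
```

The divisibility check reflects the design choice in the text: the split only leaves the layer's input and output shapes equal when h divides d_model exactly.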