Microscale
Act II · Inside the Machine
lesson multi-head · 8 min · 40 xp

Multi-head attention

Why parallel heads learn different patterns

One head only sees one thing

The attention mechanism you learned about in the last lesson computes a single weighted average over values. “Single” is the problem. A real sentence has many relationships happening at once — syntax, semantics, coreference, positional nearness, dependency structure — and a single weighted average cannot express them all simultaneously without collapsing them.

The fix, almost trivially, is to run several attention heads in parallel, each with its own learned projections. Each head can specialise in a different pattern. Their outputs are then concatenated and linearly mixed into the final layer output.

\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\, W^O
\text{head}_i = \text{Attention}(Q W_i^Q,\; K W_i^K,\; V W_i^V)

The learned weight matrices $W_i^Q$, $W_i^K$, $W_i^V$ are different for each head; that's how the heads end up attending to different things even though the input is identical.
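The two formulas above can be sketched in a few lines of plain NumPy. This is a minimal illustration, not any particular library's API: for compactness it stores all heads' projections in single $(d_{\text{model}}, d_{\text{model}})$ matrices and splits them after the matmul, which is how most real implementations do it too.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, h):
    """X: (seq, d_model). Wq/Wk/Wv/Wo: (d_model, d_model).
    Each weight matrix packs all h heads side by side."""
    seq, d_model = X.shape
    d_h = d_model // h
    # Project once, then split into h heads of width d_h.
    Q = (X @ Wq).reshape(seq, h, d_h)
    K = (X @ Wk).reshape(seq, h, d_h)
    V = (X @ Wv).reshape(seq, h, d_h)
    heads = []
    for i in range(h):
        scores = Q[:, i] @ K[:, i].T / np.sqrt(d_h)   # (seq, seq)
        weights = softmax(scores, axis=-1)
        heads.append(weights @ V[:, i])               # (seq, d_h)
    # Concatenate per-head outputs, then mix them with W^O.
    return np.concatenate(heads, axis=-1) @ Wo        # (seq, d_model)

rng = np.random.default_rng(0)
d_model, h, seq = 16, 4, 5
X = rng.standard_normal((seq, d_model))
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) * 0.1
                  for _ in range(4))
out = multi_head_attention(X, Wq, Wk, Wv, Wo, h=h)
```

Note that the loop over heads is for clarity only; in practice the per-head attention is computed as one batched matmul.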

Four heads, four archetypes

On the right is a sentence processed by four attention heads in parallel. Click through them to see the pattern each one has learned. These are hand-wired demonstrations of patterns that real attention heads learn; you can find almost all of them in Clark et al.'s 2019 analysis of BERT heads. Some heads are extremely boring (“always attend to the previous token”); others are sophisticated (“attend from verb to subject”). Both are useful, and a real transformer has heads of both kinds and many more besides.

The researcher who cited the paper won the prize
Head 1: Previous token
Simply attends to the immediately preceding token. Sounds trivial, but it's one of the most common patterns real attention heads learn — and it's how positional information propagates.
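A hand-wired head like this one is just a particular attention-weight matrix: row $i$ puts all of its weight on token $i-1$. The sketch below builds that matrix directly for the demo sentence (skipping the learned projections, since the pattern here is fixed by construction).

```python
import numpy as np

tokens = "The researcher who cited the paper won the prize".split()
n = len(tokens)

# Hand-wired "previous token" head: row i attends entirely to token
# i-1. The first token has no predecessor, so it attends to itself.
A = np.zeros((n, n))
A[0, 0] = 1.0
for i in range(1, n):
    A[i, i - 1] = 1.0

# Every row is a valid attention distribution: non-negative, sums to 1.
# With this pattern, each position's output is simply the previous
# token's value vector: the head copies information one step forward.
for i in range(1, n):
    print(f"{tokens[i]:>10} attends to {tokens[int(A[i].argmax())]}")
```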

The mixing happens at the end

After each head computes its own weighted sum, their outputs are concatenated and projected through a final matrix $W^O$. This is not decorative: the $W^O$ projection is what lets the downstream FFN receive a single unified vector that contains information from all heads. Without it, the layer would output a giant concatenated blob with no interaction between heads.

A subtle and important fact: the head dimension $d_h$ is usually $d_{\text{model}}/h$, so that after concatenation the shape is the same as the input. If $d_{\text{model}} = 3072$ and $h = 24$, each head operates in a 128-dimensional subspace. Notice that this is also how the modern Phi-4-mini is configured: 24 query heads, each with $d_h = 128$. You've now seen where that number comes from.

Why not give every head access to the full $d_{\text{model}}$? You could, but the total parameter count of the QKV projections would blow up as $O(h \cdot d_{\text{model}}^2)$. The $d_h = d_{\text{model}}/h$ choice keeps it at $O(d_{\text{model}}^2)$ regardless of head count: a design decision that makes extra heads free in parameters, and one that pays for itself forever.
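The arithmetic behind that claim is worth doing once by hand. Using the Phi-4-mini-style numbers from above (the "full-width heads" variant is hypothetical, for comparison only):

```python
d_model, h = 3072, 24            # Phi-4-mini-style configuration
d_h = d_model // h               # 128: each head's subspace width

# Q, K, V projections with d_h = d_model / h:
# h heads, 3 matrices each, shape (d_model, d_h).
params_split = h * 3 * d_model * d_h       # = 3 * d_model**2, no h

# Hypothetical full-width heads, shape (d_model, d_model) each:
params_full = h * 3 * d_model * d_model    # = 3 * h * d_model**2

print(params_full // params_split)         # 24: grows linearly with h
```

So the standard split makes the QKV parameter cost independent of how many heads you choose, while full-width heads would multiply it by $h$.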
Comprehension check · 1 / 3

Why do we use multiple attention heads instead of one big one?