Two big matrices that do almost the same job
A transformer begins and ends with an embedding matrix. Up front, the input embedding turns a token ID into a vector. At the output, the unembedding (LM head) turns a vector back into a logit over the vocabulary.
For Phi-4-mini's configuration (a vocabulary of 200,064 tokens and a hidden size of 3072), each of those matrices has ~615M parameters. Together they cost about 1.2 billion parameters out of a ~3.8B total. That's 32% of the model doing vocabulary bookkeeping.
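The arithmetic is worth doing once by hand. A quick sketch, using Phi-4-mini's published vocabulary and hidden sizes (the ~3.8B total is the rounded figure from above):

```python
# Parameter cost of the embedding and unembedding matrices.
vocab_size = 200_064   # Phi-4-mini vocabulary size
hidden_size = 3072     # Phi-4-mini hidden dimension
total_params = 3.8e9   # rounded model total

embedding_params = vocab_size * hidden_size    # input embedding
unembedding_params = vocab_size * hidden_size  # LM head, if untied
both = embedding_params + unembedding_params

print(f"each matrix: {embedding_params / 1e6:.0f}M parameters")
print(f"both: {both / 1e9:.2f}B of {total_params / 1e9:.1f}B "
      f"({both / total_params:.0%})")
```

Each matrix is vocab × hidden, so the cost scales linearly with both the vocabulary and the model width.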
The tied-embedding trick is the simplest optimisation in this whole course: set them equal. Use the transpose of the input embedding matrix as the unembedding. One matrix, half the parameters, and no quality drop on any public benchmark (often a tiny improvement, thanks to the regularising effect of sharing).
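In PyTorch, tying is one line: point the LM head's weight at the embedding table. A minimal sketch with illustrative sizes (not Phi-4-mini's):

```python
import torch.nn as nn

vocab_size, hidden_size = 1000, 64  # toy sizes for illustration

embedding = nn.Embedding(vocab_size, hidden_size)
lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

# Tie the weights. nn.Linear stores its weight as
# (out_features, in_features) = (vocab, hidden), the same shape as the
# embedding table, so direct assignment works — and since Linear
# computes x @ W.T, the forward pass multiplies by the transpose of
# the embedding matrix, exactly as described above.
lm_head.weight = embedding.weight

# Both modules now share one tensor: one matrix, half the parameters.
assert lm_head.weight.data_ptr() == embedding.weight.data_ptr()
```

This is the same mechanism the `transformers` library uses when a config sets `tie_word_embeddings: true`: gradients from both the input and output side flow into the single shared tensor.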
Who ties, who doesn't
- Tied: Phi-4-mini, Gemma 3, SmolLM3, many sub-4B models
- Untied: Llama 3.x (deliberately — they can absorb the cost at their scale)
- Mixed: some models share them during pretraining and un-tie for fine-tuning
At larger scales (14B+), the embedding matrix becomes a smaller fraction of total params, so untying costs relatively less and the slight capacity advantage of separate matrices can be worth it. For SLMs, though, tying is almost always the right call — a 1B model cannot afford to spend a third of its capacity on vocabulary bookkeeping.
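The scale argument is easy to sanity-check numerically. A rough sketch with illustrative round numbers (only the ~4B row matches a real published config, Phi-4-mini's; the others are hypothetical):

```python
# Embedding share of total parameters at different model scales.
# Only the "~4B" row uses real (Phi-4-mini) numbers; the rest are
# illustrative round figures, not published configs.
configs = {
    "~1B":  dict(vocab=128_000, hidden=2048, total=1.2e9),
    "~4B":  dict(vocab=200_064, hidden=3072, total=3.8e9),
    "~14B": dict(vocab=100_000, hidden=5120, total=14e9),
}

for name, c in configs.items():
    untied = 2 * c["vocab"] * c["hidden"]  # embedding + LM head
    print(f"{name}: untied embeddings = {untied / c['total']:.0%} of params")
```

The fraction falls steeply with scale, which is exactly why frontier-scale models can shrug off untying while a 1B model cannot.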