Two big matrices that do almost the same job
A transformer begins and ends with an embedding matrix. Up front, the input embedding turns a token ID into a vector. At the output, the unembedding (LM head) turns a vector back into a logit over the vocabulary.
For Phi-4-mini's configuration (a vocabulary of 200,064 tokens and a hidden size of 3072), each of those matrices has ~615M parameters. Together they cost about 1.2 billion parameters out of a ~3.8B total. That's 32% of the model doing vocabulary bookkeeping.
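The arithmetic is worth doing once by hand. A quick sketch, using Phi-4-mini's published vocabulary and hidden sizes (the ~3.8B total is the rounded figure from above):

```python
# Parameter cost of the embedding and unembedding matrices.
vocab_size = 200_064   # Phi-4-mini vocabulary size
hidden_size = 3072     # Phi-4-mini hidden dimension
total_params = 3.8e9   # rounded model total

embedding_params = vocab_size * hidden_size    # input embedding
unembedding_params = vocab_size * hidden_size  # LM head, if untied
both = embedding_params + unembedding_params

print(f"each matrix: {embedding_params / 1e6:.0f}M parameters")
print(f"both: {both / 1e9:.2f}B of {total_params / 1e9:.1f}B "
      f"({both / total_params:.0%})")
```

Each matrix is vocab × hidden, so the cost scales linearly with both the vocabulary and the model width.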
The tied-embedding trick is the simplest optimisation in this whole course: set them equal. Use the transpose of the input embedding matrix as the unembedding. One matrix, half the parameters, and no quality drop on any public benchmark (often a tiny improvement, thanks to the regularising effect of sharing).
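In PyTorch, tying is one line: point the LM head's weight at the embedding table. A minimal sketch with illustrative sizes (not Phi-4-mini's):

```python
import torch.nn as nn

vocab_size, hidden_size = 1000, 64  # toy sizes for illustration

embedding = nn.Embedding(vocab_size, hidden_size)
lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

# Tie the weights. nn.Linear stores its weight as
# (out_features, in_features) = (vocab, hidden), the same shape as the
# embedding table, so direct assignment works — and since Linear
# computes x @ W.T, the forward pass multiplies by the transpose of
# the embedding matrix, exactly as described above.
lm_head.weight = embedding.weight

# Both modules now share one tensor: one matrix, half the parameters.
assert lm_head.weight.data_ptr() == embedding.weight.data_ptr()
```

This is the same mechanism the `transformers` library uses when a config sets `tie_word_embeddings: true`: gradients from both the input and output side flow into the single shared tensor.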
Who ties, who doesn't
- Tied: Phi-4-mini, Gemma 3, SmolLM3, many sub-4B models
- Untied: Llama 3.x (deliberately — they can absorb the cost at their scale)
- Mixed: some models share them during pretraining and un-tie for fine-tuning
At larger scales (14B+), the embedding matrix becomes a smaller fraction of total params, so untying costs relatively less and the slight capacity advantage of separate matrices can be worth it. For SLMs, though, tying is almost always the right call — a 1B model cannot afford to spend a third of its capacity on vocabulary bookkeeping.
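The scale argument is easy to sanity-check numerically. A rough sketch with illustrative round numbers (only the ~4B row matches a real published config, Phi-4-mini's; the others are hypothetical):

```python
# Embedding share of total parameters at different model scales.
# Only the "~4B" row uses real (Phi-4-mini) numbers; the rest are
# illustrative round figures, not published configs.
configs = {
    "~1B":  dict(vocab=128_000, hidden=2048, total=1.2e9),
    "~4B":  dict(vocab=200_064, hidden=3072, total=3.8e9),
    "~14B": dict(vocab=100_000, hidden=5120, total=14e9),
}

for name, c in configs.items():
    untied = 2 * c["vocab"] * c["hidden"]  # embedding + LM head
    print(f"{name}: untied embeddings = {untied / c['total']:.0%} of params")
```

The fraction falls steeply with scale, which is exactly why frontier-scale models can shrug off untying while a 1B model cannot.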