Don't update the weights — update a low-rank patch
A fine-tuning update to a dense layer is just $W' = W + \Delta W$, where $\Delta W$ is the learned adjustment. For a square $d \times d$ matrix the update has $d^2$ parameters. For a Llama attention projection with $d = 4096$, that's 16.7 million trainable parameters per projection. Multiply by the number of projections across all layers and you're looking at billions of trainable params, the same as full fine-tuning.
Don't store the update as a full matrix. Store it as a product of two skinny matrices:

$$\Delta W = BA, \qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times d},\quad r \ll d$$
For $d = 4096$ and $r = 8$: $2dr = 65{,}536$ parameters vs $d^2 \approx 16.7$ million — a 256× reduction. The matrix $BA$ has rank at most $r$ by construction, so it literally cannot express anything outside that low-rank subspace. That's exactly the constraint we wanted.
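The arithmetic is easy to check directly. A minimal sketch using the numbers from the text; the rank check uses smaller dimensions to keep it cheap:

```python
import numpy as np

d, r = 4096, 8
full_params = d * d            # dense update: one d x d matrix
lora_params = d * r + r * d    # B (d x r) plus A (r x d)
print(full_params)             # 16777216  (~16.7M)
print(lora_params)             # 65536
print(full_params // lora_params)  # 256

# rank(BA) <= r by construction, whatever B and A contain
rng = np.random.default_rng(0)
B = rng.standard_normal((64, 8))
A = rng.standard_normal((8, 64))
print(np.linalg.matrix_rank(B @ A))  # 8
```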
The init trick and the zero-at-start property
LoRA starts with $B = 0$ and $A$ drawn from a small random init. The product $BA$ is zero at initialization, which means the fine-tuned model starts identical to the base. Training only ever departs from the base within the rank-$r$ subspace defined by $B$ and $A$. This is why LoRA fine-tuning is much less prone to catastrophic forgetting than full FT: the base weights are literally frozen, and the only learnable additions are constrained to rank $r$.
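A minimal numpy sketch of this init (the function names here are illustrative, not from any library): with $B$ zeroed out, the adapted layer reproduces the base layer exactly at step 0.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 8

W = rng.standard_normal((d, d))         # frozen base weight
A = rng.standard_normal((r, d)) * 0.01  # small random init
B = np.zeros((d, r))                    # zero init => BA = 0

def base_forward(x):
    return x @ W.T

def lora_forward(x):
    # frozen path plus the low-rank adapter path
    return x @ W.T + x @ A.T @ B.T

x = rng.standard_normal((4, d))
# identical outputs at initialization: the model starts as the base
assert np.allclose(base_forward(x), lora_forward(x))
```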
At inference you can merge $\Delta W$ into $W$ directly: $W \leftarrow W + BA$, after which the model has zero inference overhead. Or you can keep the adapter unmerged and hot-swap it for a different task, which is the basis for vLLM's multi-LoRA serving.
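The merge is a one-time addition into the dense weight; a sketch showing that the merged matmul matches the unmerged two-path computation (any $\alpha/r$ scaling factor, if used, multiplies $BA$ here too):

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 64, 8
W = rng.standard_normal((d, d))
A = rng.standard_normal((r, d))
B = rng.standard_normal((d, r))   # pretend training moved B off zero

W_merged = W + B @ A              # one-time merge: zero overhead afterwards

x = rng.standard_normal((4, d))
unmerged = x @ W.T + x @ A.T @ B.T   # adapter kept separate (hot-swappable)
merged = x @ W_merged.T              # single dense matmul
assert np.allclose(unmerged, merged)
```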
Practical rank choices
- r = 8 — style / instruction tuning (cheapest, still effective)
- r = 16 — most specialization tasks (my default)
- r = 32–64 — adding new domain knowledge, larger behavioral shifts
- r = 128+ — approaching full-FT territory; worth considering if you have compute
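One way to sanity-check a rank choice is to compute what it costs: per-projection trainable parameters grow linearly in $r$. Using the $d = 4096$ projection from earlier:

```python
d = 4096
full = d * d                      # dense update for one projection
for r in (8, 16, 32, 64, 128):
    lora = 2 * d * r              # B plus A
    print(f"r={r:>3}: {lora:>9,} params ({100 * lora / full:.1f}% of full)")
# r=  8:    65,536 params (0.4% of full)
```

Even $r = 128$ stays at about 6% of the dense update's parameter count, which is why the list above can afford to be generous at the high end.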
The scaling factor $\alpha$ controls the effective update magnitude: $\Delta W = \frac{\alpha}{r} BA$. A common default is $\alpha = 16$.