Act VI · Making It Yours
lesson lora · 11 min · 55 xp

LoRA, visualized

Watch BA form and the rank slider do work

Don't update the weights — update a low-rank patch

A fine-tuning update to a dense layer is just W \leftarrow W + \Delta W, where \Delta W is the learned adjustment. For a square matrix W \in \mathbb{R}^{d \times d} the update has d^2 parameters. For a Llama attention projection with d = 4096, that's 16.7 million trainable parameters per projection. Multiply by the number of projections across all layers and you're looking at billions of trainable params — the same as full fine-tuning.
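A quick back-of-envelope check of those counts. The layer/projection numbers below are an assumption for illustration (a Llama-7B-like shape: 32 layers, q/k/v/o projections, all d × d); the text only fixes d = 4096.

```python
# Full fine-tuning parameter count for the attention projections alone,
# assuming a Llama-7B-like shape (32 layers, 4 square projections per
# layer, d = 4096). Illustrative only, not exact for any checkpoint.
d = 4096
per_projection = d * d            # one dense update: 16,777,216 params
layers, projections = 32, 4       # assumed: q, k, v, o in every layer
total = per_projection * projections * layers

print(f"{per_projection:,} per projection")   # -> 16,777,216 per projection
print(f"{total:,} across attention")          # -> 2,147,483,648 across attention
```

Attention projections alone already land in the billions, which is why updating every weight directly is off the table for most setups.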

historical note
2021 · Edward Hu and team, Microsoft
LoRA was motivated by an earlier observation from Armen Aghajanyan's “Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning”. Aghajanyan measured the intrinsic dimension of fine-tuning — the smallest subspace of weight space in which you can still reach good task performance. For many tasks, that dimension is a few hundred, not billions. Hu et al. turned this empirical finding into an architecture: if the fine-tuning update lives in a low-rank subspace, parameterise it as a low-rank product.
◆ paper
Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning
Aghajanyan, Zettlemoyer, Gupta · 2020
arxiv:2012.13255
The paper that made LoRA's core assumption empirically plausible. Showed that for many tasks, the intrinsic dimension of fine-tuning is in the hundreds, not millions — strong evidence that a rank-8 or rank-16 adapter should suffice.
◆ paper
LoRA: Low-Rank Adaptation of Large Language Models
Hu, Shen, Wallis, Allen-Zhu, Li, Wang, Wang, Chen · 2021
arxiv:2106.09685
The original LoRA paper. Shows that freezing the base model and learning only a low-rank update to each attention projection matches full fine-tuning on GPT-3 with 10,000× fewer trainable parameters.

Don't store the update as a full matrix. Store it as a product of two skinny matrices:

the LoRA parameterisation
\Delta W \;=\; BA, \quad B \in \mathbb{R}^{d \times r}, \; A \in \mathbb{R}^{r \times d}, \quad r \ll d
\text{params}(B) + \text{params}(A) \;=\; 2dr \quad \ll \quad d^2

For d = 4096 and r = 8: 2 \cdot 4096 \cdot 8 = 65{,}536 parameters vs 4096^2 = 16{,}777{,}216 — a 256× reduction. The matrix BA has rank at most r by construction, so it literally cannot express anything outside that low-rank subspace. That's exactly the constraint we wanted.
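The arithmetic in one place, matching the d = 4096, r = 8 example above:

```python
# LoRA adapter size vs a full dense update for one d x d projection.
d, r = 4096, 8
lora_params = 2 * d * r       # B is d x r, A is r x d
full_params = d * d

print(lora_params)                   # -> 65536
print(full_params)                   # -> 16777216
print(full_params // lora_params)    # -> 256
```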

[interactive: drag the rank slider — B ∈ ℝ^{12×4} · A ∈ ℝ^{4×12} = ΔW = BA ∈ ℝ^{12×12}; at d = 12, r = 4 (display only — real d is thousands): full FT params 144, LoRA params 96, savings 1.5×]
Notice how BA always has rank at most r — a mathematical constraint, not a suggestion. You literally cannot express a full-rank update this way. That's exactly why it's cheap.
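You can verify the rank constraint directly. A toy sketch at the widget's dimensions (d = 12, r = 4), with random matrices standing in for trained weights:

```python
import numpy as np

# Multiply a 12x4 by a 4x12: the 12x12 product can never exceed rank 4,
# because every column of BA is a linear combination of B's 4 columns.
rng = np.random.default_rng(0)
d, r = 12, 4
B = rng.standard_normal((d, r))
A = rng.standard_normal((r, d))
delta_w = B @ A

print(delta_w.shape)                      # -> (12, 12)
print(np.linalg.matrix_rank(delta_w))     # -> 4
```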

The init trick and the zero-at-start property

LoRA starts with B = 0 and A \sim \mathcal{N} (small random init). The product BA is zero at initialization, which means the fine-tuned model starts identical to the base. Training only ever departs from the base within the low-rank subspace defined by BA. This is why LoRA is far less prone to catastrophic forgetting than full fine-tuning: the base weights are literally frozen, and the only learnable additions are constrained to rank r.
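The zero-at-start property is easy to check numerically. A minimal numpy sketch (toy dimensions, no training loop):

```python
import numpy as np

# LoRA init: B is all zeros, A is small random. Because B = 0, the
# product BA vanishes and the adapted forward pass equals the frozen base.
rng = np.random.default_rng(0)
d, r = 64, 8
W = rng.standard_normal((d, d))            # frozen base weight
A = rng.standard_normal((r, d)) * 0.01     # small random init
B = np.zeros((d, r))                       # zero init

x = rng.standard_normal(d)
base_out = W @ x
lora_out = W @ x + B @ (A @ x)             # adapter contributes nothing yet

print(np.allclose(base_out, lora_out))     # -> True
```

Gradient updates then move B away from zero, and only at that point does the model's behavior begin to diverge from the base.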

At inference you can merge BA into W directly: W_\text{merged} = W + BA, after which the model has zero inference overhead. Or you can keep the adapter unmerged and hot-swap it for a different task — the basis for vLLM's multi-LoRA serving.
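The merge is a single addition, and the merged layer reproduces base-plus-adapter output exactly. A sketch with random stand-ins for trained B and A:

```python
import numpy as np

# Merging a trained adapter: W_merged = W + BA. One matmul at inference,
# identical output to running the base and the adapter side by side.
rng = np.random.default_rng(1)
d, r = 64, 8
W = rng.standard_normal((d, d))
B = rng.standard_normal((d, r))    # stand-in for trained factors
A = rng.standard_normal((r, d))

W_merged = W + B @ A
x = rng.standard_normal(d)

print(np.allclose(W_merged @ x, W @ x + B @ (A @ x)))  # -> True
```

Unmerging is just the reverse subtraction, which is what makes hot-swapping adapters over one shared base practical.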

Practical rank choices

  • r = 8 — style / instruction tuning (cheapest, still effective)
  • r = 16 — most specialization tasks (my default)
  • r = 32–64 — adding new domain knowledge, larger behavioral shifts
  • r = 128+ — approaching full-FT territory; worth considering if you have compute

The scaling factor \alpha controls the effective update magnitude: the applied update is \frac{\alpha}{r} BA. A common default is \alpha = 2r, which keeps the effective multiplier at 2 no matter which rank you choose.
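Why tie \alpha to r? With \alpha = 2r, changing the rank doesn't silently change the update's scale, so rank sweeps stay comparable. A one-liner check:

```python
# With the alpha = 2r convention, the effective multiplier alpha / r
# stays at 2.0 for every rank in the list above.
for r in (8, 16, 32, 64):
    alpha = 2 * r
    print(r, alpha / r)   # -> 2.0 for every r
```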