Don't update the weights — update a low-rank patch
A fine-tuning update to a dense layer is just $W' = W + \Delta W$, where $\Delta W$ is the learned adjustment. For a square $d \times d$ matrix the update has $d^2$ parameters. For a Llama attention projection with $d = 4096$, that's 16.7 million trainable parameters per projection. Multiply by the number of projections across all layers and you're looking at billions of trainable params, the same as full fine-tuning.
Don't store the update as a full matrix. Store it as a product of two skinny matrices:

$$\Delta W = BA, \qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times d},\quad r \ll d$$
For $d = 4096$ and $r = 8$: $2dr = 65{,}536$ parameters vs $d^2 \approx 16.7$ million — a 256× reduction. The matrix $BA$ has rank at most $r$ by construction, so it literally cannot express anything outside that low-rank subspace. That's exactly the constraint we wanted.
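The arithmetic is easy to check directly. A minimal sketch using the numbers from the text; the rank check uses smaller dimensions to keep it cheap:

```python
import numpy as np

d, r = 4096, 8
full_params = d * d            # dense update: one d x d matrix
lora_params = d * r + r * d    # B (d x r) plus A (r x d)
print(full_params)             # 16777216  (~16.7M)
print(lora_params)             # 65536
print(full_params // lora_params)  # 256

# rank(BA) <= r by construction, whatever B and A contain
rng = np.random.default_rng(0)
B = rng.standard_normal((64, 8))
A = rng.standard_normal((8, 64))
print(np.linalg.matrix_rank(B @ A))  # 8
```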
The init trick and the zero-at-start property
LoRA starts with $B = 0$ and $A$ drawn from a small random init. The product $BA$ is zero at initialization, which means the fine-tuned model starts identical to the base. Training only ever departs from the base within the rank-$r$ subspace defined by $B$ and $A$. This is why LoRA fine-tuning is much less prone to catastrophic forgetting than full FT: the base weights are literally frozen, and the only learnable additions are constrained to rank $r$.
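A minimal numpy sketch of this init (the function names here are illustrative, not from any library): with $B$ zeroed out, the adapted layer reproduces the base layer exactly at step 0.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 8

W = rng.standard_normal((d, d))         # frozen base weight
A = rng.standard_normal((r, d)) * 0.01  # small random init
B = np.zeros((d, r))                    # zero init => BA = 0

def base_forward(x):
    return x @ W.T

def lora_forward(x):
    # frozen path plus the low-rank adapter path
    return x @ W.T + x @ A.T @ B.T

x = rng.standard_normal((4, d))
# identical outputs at initialization: the model starts as the base
assert np.allclose(base_forward(x), lora_forward(x))
```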
At inference you can merge $\Delta W$ into $W$ directly: $W \leftarrow W + BA$, after which the model has zero inference overhead. Or you can keep the adapter unmerged and hot-swap it for a different task, which is the basis for vLLM's multi-LoRA serving.
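The merge is a one-time addition into the dense weight; a sketch showing that the merged matmul matches the unmerged two-path computation (any $\alpha/r$ scaling factor, if used, multiplies $BA$ here too):

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 64, 8
W = rng.standard_normal((d, d))
A = rng.standard_normal((r, d))
B = rng.standard_normal((d, r))   # pretend training moved B off zero

W_merged = W + B @ A              # one-time merge: zero overhead afterwards

x = rng.standard_normal((4, d))
unmerged = x @ W.T + x @ A.T @ B.T   # adapter kept separate (hot-swappable)
merged = x @ W_merged.T              # single dense matmul
assert np.allclose(unmerged, merged)
```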
Practical rank choices
- r = 8 — style / instruction tuning (cheapest, still effective)
- r = 16 — most specialization tasks (my default)
- r = 32–64 — adding new domain knowledge, larger behavioral shifts
- r = 128+ — approaching full-FT territory; worth considering if you have compute
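One way to sanity-check a rank choice is to compute what it costs: per-projection trainable parameters grow linearly in $r$. Using the $d = 4096$ projection from earlier:

```python
d = 4096
full = d * d                      # dense update for one projection
for r in (8, 16, 32, 64, 128):
    lora = 2 * d * r              # B plus A
    print(f"r={r:>3}: {lora:>9,} params ({100 * lora / full:.1f}% of full)")
# r=  8:    65,536 params (0.4% of full)
```

Even $r = 128$ stays at about 6% of the dense update's parameter count, which is why the list above can afford to be generous at the high end.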
The scaling factor $\alpha$ controls the effective update magnitude: $\Delta W = \frac{\alpha}{r} BA$. A common default is $\alpha = 16$.