lesson qlora · 10 min · 50 xp

QLoRA and the NF4 grid

How QLoRA combines 4-bit NormalFloat quantization with LoRA adapters — the NF4 data type, double quantization, and paged optimizers

LoRA solves params but not memory

LoRA gets you down to tens of millions of trainable parameters, but the frozen base weights still live in VRAM at 16-bit precision. A 7B model in FP16 is 14 GB just for weights; add optimizer state, gradients, and activations, and you're past a 24 GB consumer GPU before training starts.
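
The arithmetic behind that 14 GB figure can be sketched in a few lines. This is a rough budget, not a measurement: the 40M adapter size is an assumed stand-in for "tens of millions", the dtypes for gradients and AdamW moments are assumptions, and activations are left out because they depend on batch size and sequence length.

```python
# Rough VRAM budget for LoRA fine-tuning a 7B model with an FP16 base.
# Activations are left out: they depend on batch size and sequence length.
GB = 1e9  # decimal gigabytes, so 7e9 params * 2 bytes = 14 GB


def tensor_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory in GB for n_params values stored at bytes_per_param each."""
    return n_params * bytes_per_param / GB


base_params = 7e9
lora_params = 40e6  # assumed adapter size ("tens of millions")

budget = {
    "base weights (FP16, frozen)": tensor_gb(base_params, 2),
    "LoRA weights (BF16)": tensor_gb(lora_params, 2),
    "LoRA gradients (BF16)": tensor_gb(lora_params, 2),
    "AdamW moments (2 x FP32)": tensor_gb(lora_params, 8),
}
total_gb = sum(budget.values())

for name, gb in budget.items():
    print(f"{name:30s} {gb:6.2f} GB")
print(f"{'total before activations':30s} {total_gb:6.2f} GB")
```

Under these assumptions the adapter and its optimizer state cost well under 1 GB, so the frozen base weights are the fixed cost worth attacking.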

QLoRA (Dettmers et al. 2023) finishes the job. Quantize the frozen base weights to 4 bits, keep the LoRA adapter in BF16, and dequantize on the fly during the forward pass. Memory for the base drops 4× relative to FP16; you can fine-tune a 70B model on a single 48 GB GPU or a 7B on a 12 GB consumer card.
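
A toy version of that forward pass may help fix the idea. This is a minimal sketch in pure Python, not how any library implements it: lists stand in for tensors, the helper names are hypothetical, and the 16-level codebook is a uniform placeholder rather than the real NF4 table.

```python
# Toy QLoRA-style linear layer: frozen 4-bit base + trainable low-rank adapter.
# The codebook below is a uniform placeholder, NOT the real NF4 table.
CODEBOOK = [-1 + 2 * i / 15 for i in range(16)]  # 16 levels in [-1, 1]


def quantize_row(row):
    """Return (4-bit codes, absmax scale) for one row of weights."""
    scale = max(abs(w) for w in row) or 1.0  # avoid dividing by zero
    codes = [min(range(16), key=lambda k: abs(CODEBOOK[k] - w / scale)) for w in row]
    return codes, scale


def dequantize_row(codes, scale):
    """Map 4-bit codes back to approximate weights."""
    return [CODEBOOK[c] * scale for c in codes]


def matvec(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def qlora_forward(q_rows, lora_a, lora_b, x, alpha_over_r=1.0):
    """y = dequant(W_q) x + (alpha / r) * B (A x)."""
    w = [dequantize_row(codes, scale) for codes, scale in q_rows]  # on the fly
    base = matvec(w, x)
    update = matvec(lora_b, matvec(lora_a, x))
    return [b + alpha_over_r * u for b, u in zip(base, update)]


W = [[0.5, -0.25, 0.1], [-0.8, 0.3, 0.0]]  # frozen base weights, 2 x 3
q_rows = [quantize_row(row) for row in W]  # what is actually stored
A = [[0.1, 0.2, -0.1]]                     # rank-1 adapter, 1 x 3
B = [[0.0], [0.0]]                         # 2 x 1, zero-initialised as in LoRA
x = [1.0, 2.0, 3.0]

y = qlora_forward(q_rows, A, B, x)
y_base_only = matvec([dequantize_row(c, s) for c, s in q_rows], x)
print(y)
```

Only `A` and `B` would receive gradients; the codes and scales in `q_rows` never change. With `B` at its zero initialisation the adapter contributes nothing, so `y` equals the dequantized base output.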

Three tricks that make it work

4-bit NormalFloat (NF4): use a quantile-based 4-bit grid rather than a uniform one.
Double quantization: quantize the quantization constants themselves, saving about 0.37 bits per weight.
Paged optimizers: let AdamW states page out to CPU RAM (via NVIDIA unified memory) when VRAM pressure spikes.
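
The double-quantization saving is simple bookkeeping. The sketch below assumes the block sizes reported in the QLoRA paper (64 weights per block, 256 first-level constants per second-level block, 8-bit first-level constants); other configurations would give different numbers.

```python
# Per-weight overhead of storing blockwise quantization constants.
# Block sizes are assumptions taken from the QLoRA paper's setup.
WEIGHT_BLOCK = 64   # weights sharing one absmax constant
CONST_BLOCK = 256   # first-level constants sharing one second-level constant


def overhead_bits_single() -> float:
    """One FP32 absmax constant per block of weights."""
    return 32 / WEIGHT_BLOCK


def overhead_bits_double() -> float:
    """8-bit first-level constants plus a shared FP32 second-level constant."""
    return 8 / WEIGHT_BLOCK + 32 / (WEIGHT_BLOCK * CONST_BLOCK)


single = overhead_bits_single()
double = overhead_bits_double()
saved = single - double

print(f"single quantization: {single:.3f} bits/weight")
print(f"double quantization: {double:.3f} bits/weight")
print(f"saved:               {saved:.3f} bits/weight")
```

Under these assumptions the overhead falls from 0.5 to about 0.127 bits per weight, which is where the roughly 0.4-bit saving comes from.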

The NF4 grid is the most interesting piece — let's look at it.

Figure: neural network weights ≈ unit normal distribution. Panels: weight distribution (unit normal), uniform 4-bit grid, NF4 grid (quantile-based).

At a glance: 4 bits per weight, 4× compression vs FP16, quality loss < 1 pt.

Why the NF4 grid is information-theoretically optimal for normal data

Trained neural network weights are approximately $\mathcal{N}(0, \sigma^2)$ distributed. Suppose you have only $2^4 = 16$ possible values to represent every weight. Where should those 16 values sit?

the optimal-bin argument

Information-theoretically, you want each bin to hold equal probability mass. If one bin covers 40% of the weight distribution and another covers 1%, the 40% bin lumps a large share of the weights onto a single value, while the 1% bin spends a whole code on values that rarely occur. Each of the 16 bins should instead cover the same mass: $1/16$.
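
One way to make this precise is through the entropy of the 4-bit code. Writing $p_i$ for the probability that a weight lands in bin $i$ (notation introduced here for illustration):

$$
H = -\sum_{i=1}^{16} p_i \log_2 p_i \;\le\; \log_2 16 = 4 \text{ bits}, \qquad \text{with equality iff } p_i = \tfrac{1}{16} \text{ for all } i.
$$

For a unit normal with CDF $\Phi$, equal-mass bins have their edges at $e_i = \Phi^{-1}(i/16)$ for $i = 1, \dots, 15$. This is optimality in the sense of using all 4 bits of each code; it is not the same criterion as directly minimising squared reconstruction error.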

Equal probability mass means the bins should be denser where the distribution is peaked (near zero) and sparser in the tails. That is exactly what the NF4 grid does: its 16 values sit at quantiles of a unit normal, so each bin holds about 1/16 of the total probability mass and no bins are wasted in empty tails.
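
A quantile-based grid of this kind can be built with the standard library alone. This is a simplified sketch, not the exact NF4 table: it places one value at the mid-quantile of each equal-mass bin and rescales to [-1, 1], whereas the published NF4 construction is slightly asymmetric so that zero is exactly representable.

```python
from statistics import NormalDist

unit_normal = NormalDist(mu=0.0, sigma=1.0)
K = 16  # 2**4 representable values

# Bin edges at equal-probability quantiles: 1/16, 2/16, ..., 15/16.
edges = [unit_normal.inv_cdf(i / K) for i in range(1, K)]

# Probability mass of each bin, from the CDF at consecutive edges.
cdf_at_edges = [0.0] + [unit_normal.cdf(e) for e in edges] + [1.0]
masses = [hi - lo for lo, hi in zip(cdf_at_edges, cdf_at_edges[1:])]

# One representative value per bin (its mid-quantile), rescaled to [-1, 1].
raw = [unit_normal.inv_cdf((i + 0.5) / K) for i in range(K)]
quantile_grid = [v / raw[-1] for v in raw]

# Uniform 4-bit grid over the same range, for comparison.
uniform_grid = [-1 + 2 * i / (K - 1) for i in range(K)]

gaps = [b - a for a, b in zip(quantile_grid, quantile_grid[1:])]
uniform_step = uniform_grid[1] - uniform_grid[0]

print("quantile grid:", [round(v, 3) for v in quantile_grid])
print(f"gap around zero: {gaps[K // 2 - 1]:.3f}")
print(f"gap in the tail: {gaps[0]:.3f}")
print(f"uniform step:    {uniform_step:.3f}")
```

The gap between the two values straddling zero comes out smaller than the uniform step, and the gap in the tail comes out larger, which is the "denser near zero, sparser in the tails" behaviour described above.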
