QLoRA and the NF4 grid
How QLoRA combines 4-bit NormalFloat quantization with LoRA adapters — the NF4 data type, double quantization, and paged optimizers
LoRA solves params but not memory
LoRA gets you down to tens of millions of trainable parameters, but the base weights still live in VRAM at full precision. A 7B model in FP16 is 14 GB just for weights; add optimizer state, gradients, and activations, and you're past a 24 GB consumer GPU before training starts.
QLoRA (Dettmers et al. 2023) finishes the job. Quantize the base weights to 4 bits, keep the LoRA adapter in BF16, dequantize on-the-fly during the forward pass. Memory for the base drops 4×; you can fine-tune a 70B model on a single 48 GB GPU or a 7B on a 12 GB consumer card.
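The weight-memory figures above can be checked with a few lines of arithmetic. This is a rough sketch, not a full memory model: it counts only the base weights, takes 1 GB to mean 10^9 bytes, and ignores quantization constants, adapters, gradients, optimizer state, and activations. The helper name `weight_memory_gb` is made up for illustration.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Memory for the raw weights only, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

fp16_7b = weight_memory_gb(7e9, 16)    # 7B model, FP16 weights
nf4_7b = weight_memory_gb(7e9, 4)      # same model, 4-bit weights
nf4_70b = weight_memory_gb(70e9, 4)    # 70B model, 4-bit weights

print(f"7B  FP16:  {fp16_7b:.1f} GB")
print(f"7B  4-bit: {nf4_7b:.1f} GB")
print(f"70B 4-bit: {nf4_70b:.1f} GB")
```

The 4× reduction is just the ratio 16 bits / 4 bits. The 70B 4-bit figure leaves some headroom on a 48 GB card for adapters and activations, and the 7B figure does the same on a 12 GB card.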
Three tricks that make it work
- 4-bit NormalFloat (NF4) — use a quantile-based 4-bit grid rather than a uniform one.
- Double quantization — quantize the quantization constants themselves, saving ~0.4 bits/weight.
- Paged optimizers — let AdamW states spill to CPU unified memory when VRAM pressure spikes.
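The ~0.4 bits/weight saving from double quantization is simple arithmetic over the per-block constants. The sketch below assumes the settings reported in the QLoRA paper, which this text does not state: one FP32 constant per block of 64 weights, re-quantized to 8 bits, with one extra FP32 constant per group of 256 first-level constants. The function names are hypothetical.

```python
def single_quant_overhead(block_size=64, const_bits=32):
    """Bits per weight spent on one full-precision constant per block."""
    return const_bits / block_size

def double_quant_overhead(block_size=64, const_bits=8,
                          second_block=256, second_bits=32):
    """Bits per weight when the constants are themselves quantized.

    Each block keeps a `const_bits` constant, and every `second_block`
    of those constants shares one `second_bits` second-level constant.
    """
    first_level = const_bits / block_size
    second_level = second_bits / (block_size * second_block)
    return first_level + second_level

single = single_quant_overhead()
double = double_quant_overhead()
saving = single - double

print(f"single: {single:.3f} bits/weight")
print(f"double: {double:.3f} bits/weight")
print(f"saving: {saving:.3f} bits/weight")
```

Under these assumptions the overhead falls from 0.5 to roughly 0.127 bits per weight, a saving of about 0.37, which rounds to the ~0.4 quoted above.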
The NF4 grid is the most interesting piece — let's look at it.
Why the NF4 grid is information-theoretically optimal for normal data
Trained neural network weights are approximately normally distributed. Suppose you have only $2^4 = 16$ possible values to represent every weight. Where should those 16 values sit?
Information-theoretically, you want each bin to hold equal probability mass. If one bin covers 40% of the weight distribution and another covers 1%, the 40% bin lumps a large share of the weights onto a single value, while the 1% bin spends a whole code on values that rarely occur. Every bin should cover equal mass: $P(w \in \text{bin}_i) = 1/16$ for each of the 16 bins.
Equal probability mass means the bins should be denser where the distribution is peaked (near zero) and sparser in the tails. That's exactly what the NF4 grid does: its 16 values sit at the quantiles of a unit normal. Each bin holds about 1/16 of the total probability mass; no bins are wasted in empty tails.
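To make the quantile idea concrete, here is a minimal sketch that builds a 16-value grid from the midpoint quantiles of a unit normal, rescales it to [-1, 1], and uses it to quantize a block of weights with absmax scaling. It is a simplified construction for illustration, not the exact NF4 table: the published grid is built slightly differently (it is asymmetric and includes an exact zero). All function names here are hypothetical.

```python
from statistics import NormalDist

def quantile_grid(levels=16):
    """Values at the midpoint quantiles of N(0, 1), rescaled to [-1, 1]."""
    unit_normal = NormalDist(mu=0.0, sigma=1.0)
    # Centre of each of the `levels` equal-probability slices.
    probs = [(i + 0.5) / levels for i in range(levels)]
    raw = [unit_normal.inv_cdf(p) for p in probs]
    largest = max(abs(v) for v in raw)
    return [v / largest for v in raw]

def quantize_block(weights, grid):
    """Scale a block by its absmax, then snap each weight to the nearest grid value."""
    absmax = max(abs(w) for w in weights) or 1.0  # avoid dividing by zero
    codes = [
        min(range(len(grid)), key=lambda i: abs(grid[i] - w / absmax))
        for w in weights
    ]
    return codes, absmax

def dequantize_block(codes, absmax, grid):
    """Map 4-bit codes back to approximate weights."""
    return [grid[c] * absmax for c in codes]

grid = quantile_grid()
gaps = [b - a for a, b in zip(grid, grid[1:])]
print("gap at centre:", round(gaps[7], 3), " gap in tail:", round(gaps[-1], 3))

block = [0.02, -0.01, 0.3, -0.25, 0.0, 0.07]
codes, absmax = quantize_block(block, grid)
restored = dequantize_block(codes, absmax, grid)
print(codes)
print([round(r, 3) for r in restored])
```

The gap between neighbouring grid values is smallest around zero and widest at the ends, which is the "denser near the peak, sparser in the tails" behaviour described above. Each code fits in 4 bits, and the only extra storage per block is the single `absmax` constant.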