Microscale
Act VI: Making It Yours
lesson qlora · 10 min · 50 xp

QLoRA and the NF4 grid

Quantile-based 4-bit quantization

LoRA solves params but not memory

LoRA gets you down to tens of millions of trainable parameters, but the base weights still live in VRAM at full precision. A 7B model in FP16 is 14 GB just for weights; add optimizer state, gradients, and activations, and you're past a 24 GB consumer GPU before training starts.

QLoRA (Dettmers et al. 2023) finishes the job. Quantize the base weights to 4 bits, keep the LoRA adapter in BF16, and dequantize on-the-fly during the forward pass. Memory for the base drops 4×; you can fine-tune a 65B model on a single 48 GB GPU, or a 7B on a 12 GB consumer card.
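The arithmetic is quick to sanity-check. A rough sketch in decimal GB; the ~0.5% trainable-adapter fraction is an assumed, illustrative number (real adapters vary with rank and target modules):

```python
# Back-of-envelope memory for a 7B model's weights.
params = 7e9

fp16_gb = params * 2 / 1e9        # FP16: 2 bytes per weight
nf4_gb = params * 0.5 / 1e9       # NF4: 4 bits = 0.5 bytes per weight

# Assumed ~0.5% of params trained as a LoRA adapter, stored in BF16
adapter_gb = params * 0.005 * 2 / 1e9

print(f"FP16 base:    {fp16_gb:.1f} GB")    # 14.0 GB
print(f"NF4 base:     {nf4_gb:.1f} GB")     # 3.5 GB
print(f"BF16 adapter: {adapter_gb:.2f} GB") # 0.07 GB
```

The base model shrinks 4×, and the adapter is a rounding error on top.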

Three tricks that make it work

  1. 4-bit NormalFloat (NF4) — use a quantile-based 4-bit grid rather than a uniform one.
  2. Double quantization — quantize the quantization constants themselves, saving ~0.4 bits/weight.
  3. Paged optimizers — let AdamW states spill to CPU unified memory when VRAM pressure spikes.
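In the Hugging Face stack, all three tricks are toggled from a quantization config. A sketch, assuming the transformers + bitsandbytes setup (`output_dir` here is a placeholder):

```python
import torch
from transformers import BitsAndBytesConfig, TrainingArguments

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # trick 1: quantile-based 4-bit grid
    bnb_4bit_use_double_quant=True,         # trick 2: quantize the quant constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize into BF16 for matmuls
)

# trick 3: paged AdamW lets optimizer state spill to CPU memory under pressure
args = TrainingArguments(output_dir="out", optim="paged_adamw_32bit")
```

Pass `bnb_config` as `quantization_config` when loading the base model, then attach a PEFT LoRA adapter on top.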

The NF4 grid is the most interesting piece — let's look at it.
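You can build a simplified version of the grid with nothing but the standard library. This is not the exact NF4 codebook (which also pins an exact zero and rescales the levels to [-1, 1]); it just shows the quantile placement:

```python
from statistics import NormalDist

nd = NormalDist(0, 1)
k = 16  # 4 bits -> 16 levels

# Place level i at the probability midpoint of equal-mass bin i,
# i.e. at the quantile for mass (i + 0.5) / 16
levels = [nd.inv_cdf((i + 0.5) / k) for i in range(k)]

# Quantile spacing: dense near zero, sparse in the tails
gaps = [b - a for a, b in zip(levels, levels[1:])]
print(f"center gap: {gaps[7]:.3f}, edge gap: {gaps[0]:.3f}")
```

The gap between adjacent levels is smallest at the center of the distribution and several times wider at the edges, which is the whole point.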

[Figure: weight distribution (unit normal) with two candidate grids overlaid. The uniform 4-bit grid wastes bins in the tails; the NF4 grid is quantile-spaced, dense where the data is. Key numbers: 4 bits per weight, 4× compression vs FP16, < 1 pt quality loss.]

Why the NF4 grid is information-theoretically optimal for normal data

Trained neural network weights are approximately $\mathcal{N}(0, \sigma^2)$ distributed. Suppose you have only $2^4 = 16$ possible values to represent every weight. Where should those 16 values sit?

the optimal-bin argument

Information-theoretically, you want each bin to hold equal probability mass. If one bin covers 40% of the weight distribution and another covers 1%, the 40% bin is wasting coding budget on a narrow range and the 1% bin is wasting it on empty territory. Every bin should cover equal mass: $1/16$.

Equal probability mass means the bins should be denser where the distribution is peaked (near zero) and sparser in the tails. That's exactly what the NF4 grid does — its 16 values are at the quantiles of a unit normal. Each bin holds about 1/16 of the total probability mass; no bins are wasted in empty tails.
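The equal-mass claim is easy to check empirically. A sketch, assuming a unit-normal weight sample and nearest-level rounding (the uniform grid's span of [-4, 4] is an illustrative choice):

```python
import random
from collections import Counter
from statistics import NormalDist

random.seed(0)
weights = [random.gauss(0, 1) for _ in range(20_000)]

def bin_fractions(xs, grid):
    # fraction of samples whose nearest grid level is level i
    c = Counter(min(range(len(grid)), key=lambda i: abs(grid[i] - x)) for x in xs)
    return [c.get(i, 0) / len(xs) for i in range(len(grid))]

uniform = [-4 + 8 * i / 15 for i in range(16)]              # evenly spaced
nd = NormalDist(0, 1)
quantile = [nd.inv_cdf((i + 0.5) / 16) for i in range(16)]  # equal-mass placement

u = bin_fractions(weights, uniform)
q = bin_fractions(weights, quantile)
print(f"uniform:  busiest bin {max(u):.3f}, emptiest {min(u):.4f}")
print(f"quantile: busiest bin {max(q):.3f}, emptiest {min(q):.4f}")
# uniform: central bins hog ~20% each while tail bins sit nearly empty;
# quantile: every bin holds roughly 1/16 ≈ 0.0625
```

The uniform grid spends most of its 16 codes on values that almost never occur; the quantile grid spreads the coding budget evenly across the actual weight mass.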