Microscale
Act VII · Packing for Travel
lesson bitnet · 10 min · 50 xp

BitNet 1.58, ternary

Weights collapse to {-1, 0, +1}

What if a weight could only be -1, 0, or +1?

Every quantization scheme in the last two lessons is post-hoc compression: train in FP16, then shrink. BitNet b1.58 (Microsoft, 2024) takes a completely different approach: train natively at 1.58 bits per weight. Every weight is one of three values: {-1, 0, +1}; activations are INT8. A ternary weight carries log₂(3) ≈ 1.58 bits of information — hence the name.

The shocking part: this works. BitNet b1.58 2B4T (2 billion params, 4 trillion training tokens) is within 1–2 points of equivalent FP16 models on MMLU, GSM8K, HumanEval+, and others. The 2B model fits in 0.4 GB on disk. It runs on a phone. It runs on a microcontroller.
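A quick check of that disk figure: log₂(3) ≈ 1.58 is the ideal, and since 3⁵ = 243 fits in one byte, a practical format can pack five ternary weights per byte (1.6 bits/weight). A toy sketch (the base-3 packing scheme here is illustrative, not bitnet.cpp's actual format):

```python
import math

# Ideal information content of one ternary weight
bits_per_weight = math.log2(3)          # ≈ 1.585

def pack5(trits):
    """Pack five values from {-1, 0, +1} into one byte, base-3 (3**5 = 243 <= 256)."""
    assert len(trits) == 5
    b = 0
    for t in trits:
        b = b * 3 + (t + 1)             # map {-1, 0, +1} -> {0, 1, 2}
    return b

def unpack5(byte):
    """Recover the five trits from a packed byte."""
    out = []
    for _ in range(5):
        out.append(byte % 3 - 1)
        byte //= 3
    return out[::-1]

# 2B parameters at 1.6 bits/weight ≈ 0.4 GB on disk
size_gb = 2e9 * 1.6 / 8 / 1e9           # = 0.4
```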

The killer move — multiplications become additions

A standard matmul for an FP16 layer performs tens of millions of FP16 multiply-accumulates. On 7nm silicon, each multiply burns real energy. Now look at BitNet: every weight is in {-1, 0, +1}, so every “multiplication” is one of three operations:

  • Skip (weight is 0): nothing to do.
  • Add the activation (weight is +1).
  • Subtract the activation (weight is −1).

No floating-point multiplications. The entire matmul becomes a tree of additions and sign flips. On 7nm hardware, a multiplication costs roughly 40× more energy than an addition; the BitNet paper reports ~40× less energy per multiply and ~3× less per add — and when you scale to 30B parameters, total energy consumption drops roughly 39× vs FP16.
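As a toy illustration of the inner loop (not a production kernel), a ternary dot product needs no multiplies at all:

```python
import numpy as np

def ternary_dot(w, x):
    """Dot product where every w[i] is in {-1, 0, +1}: add, subtract, or skip."""
    acc = 0
    for wi, xi in zip(w, x):
        if wi == 1:
            acc += xi        # weight +1: add the activation
        elif wi == -1:
            acc -= xi        # weight -1: subtract the activation
        # weight 0: skip entirely
    return acc

w = np.array([1, 0, -1, 1, 0, -1])
x = np.array([3, 7, 2, -4, 5, 1])
assert ternary_dot(w, x) == int(w @ x)   # matches the multiply-based dot
```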

[Figure: ternary weight distribution — 100 cells shown; each cell is one weight drawn from {-1, 0, +1}]

  • bits/weight: 1.58 bits
  • 2B model size: 0.4 GB
  • size vs FP16: 5×
  • energy ratio: 40×

Why you can't retrofit BitNet to an existing model

BitNet only works if you train from scratch at ternary precision. You can't take a regular Llama and quantize it to ternary — the accumulated rounding error is catastrophic, because the FP16 training never gave the weights the opportunity to compensate for the precision constraint.

Training with ternary weights requires a straight-through estimator: during the forward pass, round to {-1, 0, +1}; during the backward pass, pretend the rounding was the identity so gradients flow. The network learns during training to arrange its weights such that ternary rounding doesn't break the computation.
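A framework-free sketch of the two passes (illustrative; real implementations fold this into autograd, typically via the idiom `w + (quantise(w) - w).detach()` in PyTorch):

```python
import numpy as np

def ternarise(w):
    """Forward pass: absmean-scale, then round to the nearest of {-1, 0, +1}."""
    gamma = np.mean(np.abs(w)) + 1e-8
    return np.clip(np.round(w / gamma), -1, 1)

def ste_backward(grad_wq):
    """Backward pass: pretend ternarisation was the identity, so the gradient
    reaches the latent full-precision weights unchanged."""
    return grad_wq

w = np.array([0.9, -0.04, -1.3, 0.5])   # latent FP weights kept by the optimiser
w_q = ternarise(w)                       # what the forward pass actually uses
```

The optimiser keeps updating the latent FP copy; only the forward pass ever sees the ternary projection.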

The exact recipe: weights are quantised with absmean — each weight matrix is scaled by the mean absolute value of its entries, then rounded to the nearest of {-1, 0, +1}. Activations take a different path: per-token absmax scaling to INT8, because activation distributions have outliers that ternary cannot tolerate. And the linear is wrapped in a BitLinear layer that applies a LayerNorm before quantising — the norm is load-bearing: it pre-conditions the distribution so absmean rounding does not collapse whole rows to zero. Drop the LayerNorm and training diverges within a few thousand steps.
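A minimal numpy sketch of that forward path (the function names here are mine, and the real BitLinear also handles training-time STE and implementation details this omits):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def absmean_ternary(W):
    """Weights: scale by the mean |w| of the matrix, round to {-1, 0, +1}."""
    gamma = np.mean(np.abs(W)) + 1e-8
    return np.clip(np.round(W / gamma), -1, 1), gamma

def absmax_int8(x):
    """Activations: per-token absmax scaling into the INT8 range [-127, 127]."""
    scale = 127.0 / (np.max(np.abs(x), axis=-1, keepdims=True) + 1e-8)
    return np.clip(np.round(x * scale), -127, 127), scale

def bit_linear(x, W):
    x = layer_norm(x)                 # load-bearing: pre-conditions the stats
    Wq, gamma = absmean_ternary(W)
    xq, scale = absmax_int8(x)
    y = xq @ Wq.T                     # add/subtract-only integer matmul
    return y * gamma / scale          # dequantise back to floats
```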

How bad is post-hoc ternarisation? Taking an FP16 Llama-2-7B and rounding every weight to {-1, 0, +1} produces a model that outputs near-random tokens — WikiText-2 perplexity blows up from ~5.7 to 104+. Microsoft's own ablation: to match FP16 quality, their 1.58-bit runs needed roughly the same token budget as the baseline at small scale and up to ~2× the tokens at the 3B mark to close the gap. Ternary is not a compression step applied to weights — it is a constraint the optimiser has to solve through, and the solution only exists in a basin the FP16 optimum never touches.

Why this might be the direction

If BitNet holds up at 30B and 70B scales, it rewrites the economics of inference. A model that runs on add-and-subtract silicon is orders of magnitude cheaper to serve than one that needs FP16 multiplications. This directly attacks the inference-economics driver from Act I.

Ecosystem support is still thin in April 2026 — limited inference engines, limited fine-tuning tooling, almost no HuggingFace PEFT integration. But the direction is clear, and BitNet b1.58 2B4T is the proof of concept.

The real unlock is silicon. Every GPU shipped since 2012 is optimised for dense FP multiply-accumulate; its tensor cores are literally systolic arrays of multipliers. Run BitNet on an H100 and you still pay the multiplier tax because the hardware has no “add-or-skip” primitive — the reported energy wins assume custom silicon or at minimum a kernel that packs ternary weights into bitfields and uses popcount-style accumulation. Microsoft's bitnet.cpp CPU runtime (late 2024) shows 2–6× speedups on x86 and ARM for the 2B model by exactly this trick — but the factor-of-40 energy story stays theoretical until an accelerator ships with native 1.58-bit arithmetic units. The 2026 question is whether the next wave of inference ASICs (Groq, Cerebras, Tenstorrent, Etched) bets on it. If they do, BitNet goes from curiosity to dominant architecture in 18 months. If they don't, it stays a research win stuck behind hardware that was built for the wrong workload.
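The bitfield idea can be sketched as two bitmasks per weight row, one marking the +1 positions and one marking the -1 positions, so the dot product reduces to two masked accumulations (a pure-Python illustration of the concept; real kernels do this over SIMD registers):

```python
def pack_row(w):
    """Encode a ternary weight row as a (+1 mask, -1 mask) pair of bitfields."""
    plus = minus = 0
    for i, v in enumerate(w):
        if v == 1:
            plus |= 1 << i
        elif v == -1:
            minus |= 1 << i
    return plus, minus

def masked_dot(plus, minus, x):
    """Dot product via the masks: add where the +1 bit is set, subtract where -1 is."""
    acc = 0
    for i, xi in enumerate(x):
        if (plus >> i) & 1:
            acc += xi
        elif (minus >> i) & 1:
            acc -= xi
    return acc

w = [1, 0, -1, -1, 1, 0]
x = [4, 9, 2, -1, 3, 7]
plus, minus = pack_row(w)
assert masked_dot(plus, minus, x) == sum(wi * xi for wi, xi in zip(w, x))
```

Two bits per weight is what this storage costs; getting to 1.58 requires the denser base-3 packing, at the price of an unpack step.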