Eight stations, two lanterns
Picture a small workshop. Along one wall, eight lab stations in a row — each with its own bench, its own instruments, its own quiet specialization. Above them, a router with just two lanterns. Every token that arrives gets exactly two stations lit for it, and only those two do any work.
That is the whole idea of a Mixture of Experts (MoE). The model holds a huge library of parameters — Mixtral keeps 46.7 B worth of weights behind the curtain — but only touches a small slice per token. 12.9 B are active at any moment. The gap between reachable and active parameters is where the compute economics of 2023–2026 quietly lives.
Tiny glyphs inside each station (dictionary, flask, music note, …) are decorative — real experts rarely specialize this cleanly. Section 3's DeepDive on interpretability covers what mechanistic work actually finds.
The router equation
An MoE layer replaces the single dense FFN of a standard transformer with a pool of $N$ experts and a tiny router that decides which to use. Formally, for a token representation $x$:

$$y = \sum_{i=1}^{N} \operatorname{softmax}(W_g x)_i \, E_i(x)$$

The router is just a learned linear layer $W_g$. Apply softmax; you have a distribution over experts. Multiply each expert's output by its probability; sum. That is all.
- $W_g$ — the router weights. Tiny: typically < 0.1% of the layer's parameters.
- $E_i$ — the $i$-th expert, usually a SwiGLU-style FFN. The big objects.
- $\operatorname{softmax}$ over the logits → per-expert probabilities. Top-$k$ kept, the rest zeroed.
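The equation above can be sketched in a few lines of NumPy. This is a toy illustration with invented dimensions and random weights; a plain ReLU FFN stands in for the SwiGLU experts:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 16, 8                          # hidden size, number of experts (both invented)

W_g = rng.normal(size=(N, d))         # router: a single learned linear layer
experts = [(rng.normal(size=(32, d)), rng.normal(size=(d, 32)))
           for _ in range(N)]         # each expert: a tiny 2-layer FFN

def moe_forward(x):
    """Full-softmax MoE: every expert runs, weighted by its router probability."""
    logits = W_g @ x
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                        # softmax over all N experts
    y = np.zeros(d)
    for p, (W1, W2) in zip(probs, experts):
        y += p * (W2 @ np.maximum(W1 @ x, 0))   # ReLU FFN as a stand-in
    return y, probs

x = rng.normal(size=d)
y, probs = moe_forward(x)             # probs sums to 1 across the 8 experts
```

Note that this version runs every expert for every token; real models avoid exactly that, as the top-$k$ section explains.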
The pool size $N$ varies wildly across 2026 models. Mixtral keeps $N = 8$. DeepSeek-V3 uses $N = 257$ (256 routed + 1 shared). Llama 4 Maverick lives at $N = 129$ (128 routed + 1 shared). More experts means finer-grained specialization — but also harder routing, more aggressive load-balance problems, and a bigger gap between training and inference regimes.
Top-k, and the softmax that almost is
The formula above runs softmax over all $N$ experts. In practice, no model does that. Every real MoE layer keeps only the top-$k$ experts by router score, zeros the rest, and renormalises:

$$y = \sum_{i \in \mathcal{T}} \frac{\exp\!\big((W_g x)_i\big)}{\sum_{j \in \mathcal{T}} \exp\!\big((W_g x)_j\big)} \, E_i(x)$$

where $\mathcal{T}$ is the set of $k$ highest-scoring experts. Now only $k$ experts are actually run; the rest might as well not exist for this token. That is the whole compute story.
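The masking-and-renormalising step is small enough to show directly (a sketch; `topk_route` is an invented helper name):

```python
import numpy as np

def topk_route(logits, k=2):
    """Keep the k highest router scores; softmax over the survivors only."""
    top = np.argsort(logits)[-k:]                 # indices of the k best experts
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                                  # renormalise over top-k
    return top, w

logits = np.array([0.1, 2.0, -1.0, 1.5, 0.3, -0.2, 0.0, 0.9])
idx, w = topk_route(logits, k=2)                  # only experts 1 and 3 would run
```

Everything outside `idx` is simply never evaluated — that is where the compute savings come from.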
Dense transformer
- One tall FFN, every parameter active for every token.
- FLOPs per token scale with $d_\text{model} \cdot d_\text{ff} \cdot \text{layers}$.
- Archetype: Llama 3 8B, Phi-4, Qwen3-4B.

MoE transformer
- Eight columns, two lit. Memory scales with $N$; compute with $k$.
- Capacity grows without paying the full compute price — memory bandwidth becomes the new binding constraint.
- Archetype: Mixtral, DeepSeek-V3, Qwen3-235B-A22B.
Why not sum over all experts? Two reasons. Compute — evaluating every expert for every token throws away the MoE bargain entirely. Specialization — if every expert contributes on every token, nothing is really an expert; you just trained a wider ensemble.
Load balance, or why routers collapse
MoE has one catastrophic failure mode, and it shows up within the first few thousand training steps: a self-reinforcing loop where a few experts get more tokens, train faster on those tokens, get preferred by the router, get more tokens, train even faster — and the rest of the pool starves.
1. Random init gives expert 3 a slight edge on some token type.
2. The router routes more tokens there; expert 3 gets more gradient signal.
3. Expert 3 becomes strictly better at those tokens; the router's preference sharpens.
4. Go to step 2. Expert 7 never receives tokens and its parameters stay at their initialization.
Left alone, the router collapses onto 2–3 experts and the rest of the pool is effectively dead weight. You spent the memory on a wide pool; you got a narrow model with a large router. Switch Transformer's response was an auxiliary loss that pulls routing toward uniform:

$$\mathcal{L}_\text{aux} = \alpha \cdot N \cdot \sum_{i=1}^{N} f_i \cdot P_i$$

where $f_i$ is the fraction of tokens in the current batch routed to expert $i$, and $P_i$ is the average router probability of expert $i$ across the batch. The product is minimized when both are $1/N$ — i.e. perfectly uniform routing. $\alpha$ is usually a small coefficient (~0.01) so the load-balance term doesn't swamp the language-modelling loss.
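The loss is a one-liner in code. A minimal sketch, assuming hard top-1 assignments per token (`load_balance_loss` is an invented name):

```python
import numpy as np

def load_balance_loss(router_probs, assignments, N, alpha=0.01):
    """alpha * N * sum_i f_i * P_i, the Switch Transformer auxiliary loss."""
    f = np.bincount(assignments, minlength=N) / len(assignments)  # fraction of tokens per expert
    P = router_probs.mean(axis=0)                                 # mean router prob per expert
    return alpha * N * float(np.sum(f * P))

# Perfectly uniform routing: f_i = P_i = 1/N, so the loss bottoms out at exactly alpha.
uniform = load_balance_loss(np.full((8, 8), 1 / 8), np.arange(8), N=8)

# Collapsed routing: every token to expert 0 drives the loss up by a factor of N.
one_hot = np.zeros((8, 8)); one_hot[:, 0] = 1.0
collapsed = load_balance_loss(one_hot, np.zeros(8, dtype=int), N=8)
```

The factor-of-$N$ gap between the collapsed and uniform cases is the gradient pressure that keeps starving experts alive.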
With no intervention, the router collapses onto 2 experts — the other 6 of 8 are effectively dead within a few hundred tokens.
DeepSeekMoE's trick: shared + routed
Look back at the ten-token panel above. Three of the ten tokens are the word the. In a Mixtral-style MoE, those three instances each burn top-2 slots of the routed pool — the router has no choice but to route them somewhere, and whichever experts happen to win end up carrying the “generic glue token” load forever. They stop specializing.
DeepSeekMoE's insight (arXiv 2401.06066): split the pool in two.
- Shared experts (typically 1, always active). Carry the general competence every token needs.
- Routed experts (many, top-$k$ per token). Free to specialize because the generic load was handled elsewhere.
DeepSeek-V3 ships 1 shared + 256 routed with top-8 routed per token — so 9 experts fire out of 257. Llama 4 Maverick ships the same shape: 1 shared + 128 routed, top-1 routed — just 2 experts fire out of 129. Both treat the shared expert as a parameter-efficient dense backbone, with the routed pool layered on for capacity.
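The shared + routed split slots straight into the earlier forward pass. A sketch with toy sizes (`shared_plus_routed` and the 8-expert pool are stand-ins for DeepSeek-V3's 256 routed / top-8):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_routed, k = 16, 8, 2             # invented toy dimensions

def make_ffn(seed):
    r = np.random.default_rng(seed)
    W1, W2 = r.normal(size=(32, d)), r.normal(size=(d, 32))
    return lambda x: W2 @ np.maximum(W1 @ x, 0)

shared = make_ffn(0)                              # always active
routed = [make_ffn(i + 1) for i in range(n_routed)]
W_g = rng.normal(size=(n_routed, d))              # router sees only the routed pool

def shared_plus_routed(x):
    logits = W_g @ x
    top = np.argsort(logits)[-k:]                 # top-k routed experts
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()
    y = shared(x)                                 # generic load handled unconditionally
    for wi, i in zip(w, top):
        y += wi * routed[i](x)                    # specialists layered on top
    return y

y = shared_plus_routed(rng.normal(size=d))
```

The design choice to keep the shared expert outside the router entirely is what frees the routed pool from carrying glue tokens like the.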
The total-vs-active paradox
Every modern MoE model lives on a two-axis budget: total parameters decide what the model can learn; active parameters per token decide what the model costs to run. The gap between them is the compute economics of the 2026 frontier. Browse the stamp cards.
Two patterns leap out. First: total parameters have climbed much faster than active parameters over the past two years. Mixtral (early 2024) was 46.7 B total; Kimi K2 (mid-2025) is 1 T total — a 20× increase. Active parameters moved from 12.9 B to 32 B — only ~2.5×. The MoE bargain is doing most of the work.
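The gap is easy to quantify from the figures quoted above (a back-of-the-envelope calculation, nothing more):

```python
models = {                      # (total params, active params) in billions, from the text
    "Mixtral 8x7B": (46.7, 12.9),
    "Kimi K2":      (1000.0, 32.0),
}
active_frac = {name: active / total for name, (total, active) in models.items()}
# Mixtral runs roughly 28% of its weights per token; Kimi K2 runs about 3%.
```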
Second: the shared-expert pattern has spread quickly since DeepSeekMoE introduced it. Three of the five cards above — DeepSeek-V3, Llama 4 Maverick, and Kimi K2 — carry a shared expert; only Mixtral itself (early 2024) and Qwen3-235B (2025) keep a pure routed pool. At scale, the shared expert seems to earn its place.
Three tiers. Three ways to test the same ideas.
Recall checks the facts. Apply runs the router on new numbers. Reason asks about scenarios the lesson didn't cover — you'll have to transfer the mechanism.