lesson moe · 22 min · 65 xp

Eight stations, two lanterns

Why DeepSeek-V3 claims 671B parameters but only activates 37B per token. Top-k routing, shared experts, and the load-balance thermostat

Eight stations, two lanterns

Picture a small workshop. Along one wall, eight lab stations in a row — each with its own bench, its own instruments, its own quiet specialization. Above them, a router with just two lanterns. Every token that arrives gets exactly two stations lit for it, and only those two do any work.

That is the whole idea of a Mixture of Experts (MoE). The model holds a huge library of parameters — Mixtral keeps 46.7 B worth of weights behind the curtain — but only touches a small slice per token. 12.9 B are active at any moment. The gap between reachable and active parameters is where the compute economics of 2023–2026 quietly lives.

MMXXVI

historical note

January 2024 · Mistral AI, Paris

Mixtral 8×7B was the first open-weight model to put a real MoE in every serious inference stack. Eight experts per FFN layer, a router that picks two, a routing decision made per token per layer. Its paper and weights made “MoE” a word SLM practitioners had to know. Everything since — DeepSeek-V3, Llama 4, Qwen3-235B, Kimi K2 — builds on the same skeleton.

◆ paper

Mixtral of Experts

Jiang, Sablayrolles, Roux, Mensch, … · 2024 · arXiv

arxiv:2401.04088

The canonical reference for “sparse MoE at inference time” in the open-weight era. Section 2.1 of the paper spells out the router equation we earn in the next section, and Section 3 shows that the routing is more syntactic than semantic — the specialization story is messier than the metaphor.

the router's desk

one token enters · the router chooses top-2 · the rest dim

token preset

top-k

bias · E10.00

bias · E20.00

bias · E30.00

bias · E40.00

bias · E50.00

bias · E60.00

bias · E70.00

bias · E80.00

total parameters reachable

46.9 B

active parameters this token

13.3 B (28%)

this gap is the whole trick. total params grow with experts; active params grow only with

k

total

46.9B

active

13.3B

Tiny glyphs inside each station (dictionary, flask, music note, …) are decorative — real experts rarely specialize this cleanly. Section 3's DeepDive on interpretability covers what mechanistic work actually finds.

The router equation

An MoE layer replaces the single dense FFN of a standard transformer with a pool of $N$ experts and a tiny router that decides which to use. Formally, for a token representation $x$ :

h \;=\; \sum_{i=1}^{N} \text{softmax}(W_r \cdot x)_i \,\cdot\, E_i(x)

The router is just a learned linear layer $W_r \in \mathbb{R}^{N \times d}$ . Apply softmax; you have a distribution over experts. Multiply each expert's output by its probability; sum. That is all.

the three moving parts

$W_r$ — the router weights. Tiny.Typically < 0.1% of the layer's parameters.
$E_i(\cdot)$ — the $i$ -th expert, usually a SwiGLU-style FFN. The big objects.
softmax over logits → per-expert probabilities. Top- $k$ kept, the rest zeroed.

The pool size varies wildly across 2026 models. Mixtral keeps $N = 8$ . DeepSeek-V3 uses $N = 257$ (256 routed + 1 shared). Llama 4 Maverick lives at $N = 129$ (128 routed + 1 shared). More experts means finer-grained specialization — but also harder routing, more aggressive load-balance problems, and a bigger gap between training and inference regimes.

Top-k, and the softmax that almost is

The formula above runs softmax over all $N$ experts. In practice, no model does that. Every real MoE layer keeps only the top- $k$ experts by router score, zeros the rest, and renormalises:

\alpha_i \;=\; \begin{cases} \frac{\exp(\ell_i)}{\sum_{j \in T_k} \exp(\ell_j)} & i \in T_k \\ 0 & \text{otherwise} \end{cases}

where $T_k$ is the set of $k$ highest-scoring experts. Now only $k$ experts are actually run; the rest might as well not exist for this token. That is the whole compute story.

dense FFN

One tall FFN, every parameter active for every token.

Flops per token scale with $d_{model}^2$ · layers.

Archetype: Llama 3 8B, Phi-4, Qwen3-4B.

MoE with top-k

Eight columns, two lit. Memory scales with $N$ ; compute with $k$ .

Capacity grows without paying the full compute price — memory bandwidth becomes the new binding constraint.

Archetype: Mixtral, DeepSeek-V3, Qwen3-235B-A22B.

Why not sum over all experts? Two reasons. Compute — evaluating every expert for every token throws away the MoE bargain entirely. Specialization — if every expert contributes on every token, nothing is really an expert; you just trained a wider ensemble.

◆ paper

Switch Transformer: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

Fedus, Zoph, Shazeer · 2022 · JMLR

arxiv:2101.03961

Switch made the radical case for

k = 1

— route each token to exactly one expert. One lantern per station. Simpler, cheaper, and surprisingly competitive. Mixtral returned to

k = 2

after Switch's ablations showed a small but consistent quality gap at fixed active-param count.

Load balance, or why routers collapse

MoE has one catastrophic failure mode, and it shows up within the first few thousand training steps: a self-reinforcing loop where a few experts get more tokens, train faster on those tokens, get preferred by the router, get more tokens, train even faster — and the rest of the pool starves.

Random init gives expert 3 a slight edge on some token type.
The router routes more tokens there; expert 3 gets more gradient signal.
Expert 3 becomes strictly better at those tokens; router's preference sharpens.
Goto 2. Expert 7 never receives tokens and its parameters stay at their initialization.

Left alone, the router collapses onto 2–3 experts and the other 5 are effectively dead weight. You spent the memory on a wide pool; you got a narrow model with a large router. Switch Transformer's response was an auxiliary loss that pulls routing toward uniform:

\mathcal{L}_{aux} \;=\; \alpha \cdot N \cdot \sum_{i=1}^{N} f_i \cdot P_i

where $f_i$ is the fraction of tokens in the current batch routed to expert $i$ , and $P_i$ is the average router probability of expert $i$ across the batch. The product $f_i \cdot P_i$ is minimized when both are $1/N$ — i.e. perfectly uniform routing. $\alpha$ is usually a small coefficient (~0.01) so the load-balance term doesn't swamp the language-modelling loss.

◆ paper

DeepSeek-V3 Technical Report — auxiliary-loss-free load balancing

DeepSeek-AI · 2024 · arXiv

arxiv:2412.19437

DeepSeek-V3 retired the aux loss altogether. Instead, each expert carries a learned bias term added to its router logit. If expert

i

is under-loaded this step, its bias creeps up; if it's over-loaded, down. A thermostat, not a gradient. Section 2.1.2 of the tech report argues this avoids the subtle gradient interference that the aux loss causes with the LM loss.

routing over 200 tokens · 8 experts

with no intervention, the router collapses onto 2 experts — 3 of 8 are effectively dead within a few hundred tokens.

DeepSeekMoE's trick: shared + routed

Look back at the ten-token panel above. Three of the ten tokens are the word the. In a Mixtral-style MoE, those three instances each burn top-2 slots of the routed pool — the router has no choice but to route them somewhere, and whichever experts happen to win end up carrying the “generic glue token” load forever. They stop specializing.

DeepSeekMoE's insight (arXiv 2401.06066): split the pool in two.

Shared experts (typically 1, always active). Carry the general competence every token needs.
Routed experts (many, top- $k$ per token). Free to specialize because the generic load was handled elsewhere.

DeepSeek-V3 ships 1 shared + 256 routed with top-8 routed per token — so 9 experts fire out of 257. Llama 4 Maverick ships the same shape: 1 shared + 128 routed, top-1 routed — just 2 experts fire out of 129. Both treat the shared expert as a parameter-efficient dense backbone, with the routed pool layered on for capacity.

two routers · ten tokens · same inputs

load on left · routed experts absorb every token, including the three instances of

\texttt{the}

— common tokens steal capacity from the pool that should be specializing.

load on right · the shared expert soaks up general structure (syntax, punctuation, the word the), freeing the eight routed experts to commit to narrower specializations.

◆ paper

DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts

Dai, Deng, Zhao, Xu, Gao, Zhang, Li, Wen, Liang, Luo · 2024 · arXiv

arxiv:2401.06066

The paper that formalised fine-grained segmentation (each routed expert is

1/m

the width of a classical expert, with

m

times as many of them) plus the shared-expert isolation. Table 3 is the killer ablation: at fixed active parameters, fine-grained + shared beats Mixtral-style across every benchmark they tested.

The total-vs-active paradox

Every modern MoE model lives on a two-axis budget: total parameters decide what the model can learn; active parameters per token decide what the model costs to run. The gap between them is the compute economics of the 2026 frontier. Browse the stamp cards.

2026 moe · stamp card

Mixtral 8×7B

total

46.7 B

active / token

12.9 B

experts

active

The archetype. 8 experts, top-2, no shared. Coarse-grained by modern standards.

2026 moe · stamp card

DeepSeek-V3

total

671 B

active / token

37 B

experts

256 routed + 1 shared

active

8 + 1

Fine-grained. 9 experts active out of 257 — 3.5% of the parameters fire per token.

2026 moe · stamp card

Llama 4 Maverick

total

~400 B

active / token

17 B

experts

128 routed + 1 shared

active

1 + 1

Meta's 2025 MoE. Aggressive top-1 routing lets active parameter count crash to 17 B.

2026 moe · stamp card

Qwen3-235B-A22B

total

235 B

active / token

22 B

experts

128

active

Mid-pool MoE. No shared expert — Alibaba kept it Mixtral-shaped but widened the pool.

2026 moe · stamp card

Kimi K2

total

1 T

active / token

32 B

experts

384 routed + 1 shared

active

8 + 1

Moonshot's trillion-parameter MoE. Inherits the DeepSeekMoE shared-plus-routed pattern; active footprint stays at a serving-feasible 32 B.

Two patterns leap out. First: total parameters have climbed much faster than active parameters over the past two years. Mixtral (early 2024) was 46.7 B total; Kimi K2 (mid-2025) is 1 T total — a 20× increase. Active parameters moved from 12.9 B to 32 B — only ~2.5×. The MoE bargain is doing most of the work.

Second: the shared-expert pattern has spread quickly since DeepSeekMoE introduced it. Three of the five cards above — DeepSeek-V3, Llama 4 Maverick, and Kimi K2 — carry a shared expert; only the two outliers Mixtral (early 2024) and Qwen3-235B (2025) retain the Mixtral-style pure routed pool. At scale, the shared expert seems to earn its place.

A practical consequence for serving.An MoE's inference footprint is dominated by total parameters (you must hold all the experts in HBM, in case any of them gets called) but its flop budget per token is dominated by active parameters. This is why DeepSeek-V3 at 671 B fits on an 8×H100 node despite being ~80× the weights of Llama-3-8B — it holds the library, but each token only reads a small shelf. Serving throughput on MoE is limited by HBM bandwidth (the library size) and by compute (the shelf size), in that order.

comprehension check

Three tiers. Three ways to test the same ideas.

Recall checks the facts. Apply runs the router on new numbers. Reason asks about scenarios the lesson didn't cover — you'll have to transfer the mechanism.

comprehension · pick a tiereach tests the same ideas at a different depth

Recall: facts the lesson stated

Recall · question 1 / 3

In a standard top- $k$ MoE layer, which experts contribute to the output for a given token?

this lesson appears in

lesson moe · 22 min · 65 xp

Eight stations, two lanterns

Why DeepSeek-V3 claims 671B parameters but only activates 37B per token. Top-k routing, shared experts, and the load-balance thermostat