Attention gets all the press, but every transformer block follows attention with a feed-forward network (FFN) — a two-layer MLP applied independently to each token position. FFNs account for two-thirds of the parameters in a modern transformer. Their design matters a lot.
Classical FFN:
FFN(x) = W_2 σ(W_1 x + b_1) + b_2
where σ is some non-linearity. In the original Transformer it was ReLU. GPT-2 used GeLU ("Gaussian Error Linear Unit"), a smoother cousin. Modern models — Llama, Qwen, Phi, SmolLM — use a different construction entirely: the gated linear unit, specifically the SwiGLU variant introduced by Shazeer (2020).
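As a concrete reference point, here is a minimal sketch of the classical FFN in NumPy. The dimensions and the tanh-approximate GeLU formula are illustrative choices, not taken from the text above:

```python
import numpy as np

def gelu(z):
    # tanh approximation of GeLU, as popularized by GPT-2
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

def ffn_classic(x, W1, b1, W2, b2):
    # two-layer MLP applied independently to a single token vector x
    return W2 @ gelu(W1 @ x + b1) + b2

# toy dimensions (illustrative): d_model = 8, intermediate d_ff = 32
rng = np.random.default_rng(0)
d_model, d_ff = 8, 32
x  = rng.standard_normal(d_model)
W1 = rng.standard_normal((d_ff, d_model)); b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_model, d_ff)); b2 = np.zeros(d_model)
print(ffn_classic(x, W1, b1, W2, b2).shape)  # (8,)
```

Note the shape convention: the intermediate dimension d_ff is conventionally 4× the model width, which is where the FFN's parameter share comes from.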
SwiGLU — the multiplicative gate
SwiGLU(x) = W_2 ((W_1 x) ⊙ Swish(W_3 x))
Two projections are computed from the same input: W_1 x is the candidate value, W_3 x is fed through Swish and becomes the gate. An element-wise multiply combines them, then W_2 projects down. The Swish activation is Swish(z) = z · σ(z), where σ is the sigmoid.
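The same sketch style works for SwiGLU, following the gate/value convention used above (W_1 as value, W_3 through Swish as gate; weight shapes are illustrative):

```python
import numpy as np

def swish(z):
    # Swish(z) = z * sigmoid(z)
    return z / (1.0 + np.exp(-z))

def ffn_swiglu(x, W1, W2, W3):
    # value path W1 x, gate path Swish(W3 x); element-wise product, then W2 projects down
    return W2 @ ((W1 @ x) * swish(W3 @ x))

# 1-D sanity check: with identity weights the output is x * Swish(x)
x = np.array([2.0])
I = np.eye(1)
print(ffn_swiglu(x, I, I, I))  # [2 * Swish(2)] ≈ [3.5232]
```

Note that there are no biases and no non-linearity on the value path: all the non-linearity comes from the multiplicative gate.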
historical note
2020 · Noam Shazeer, Google
SwiGLU came from a paper with the memorable title “GLU Variants Improve Transformer” (arXiv:2002.05202). Shazeer systematically tested many Gated Linear Unit variants — GeGLU, SwiGLU, ReGLU, Bilinear — against each other on a transformer language model. All the GLU variants beat the classical ReLU/GeLU FFN. Shazeer famously closed the paper with: “We offer no explanation as to why these architectures seem to work; we attribute their success, as all else, to divine benevolence.” That one-line abdication has now been cited thousands of times in papers that use SwiGLU.
Llama, Qwen, Phi, Mistral, SmolLM all use SwiGLU as the FFN activation. Gemma uses GeGLU (GeLU in place of Swish). The differences between GLU variants are small; the main win is having a gated linear unit at all.
[Figure: activation functions (ReLU, GeLU, Swish)]
gradients — what the optimizer sees
Swish and GeLU are smooth everywhere — no kink at zero like ReLU. Smooth gradients → well-conditioned Hessians → larger stable learning rates → faster, less-chaotic training. For deeply overtrained SLMs that run many optimization steps, smoothness compounds.
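The kink is easy to see numerically. In this sketch (step sizes are arbitrary choices), ReLU's slope jumps from 0 to 1 across zero, while Swish's slope passes smoothly through 0.5:

```python
import numpy as np

def relu(z):  return np.maximum(z, 0.0)
def swish(z): return z / (1.0 + np.exp(-z))

def slope(f, z, h=1e-6):
    # central-difference estimate of f'(z)
    return (f(z + h) - f(z - h)) / (2 * h)

for name, f in [("ReLU", relu), ("Swish", swish)]:
    print(f"{name}: f'(-1e-4) = {slope(f, -1e-4):.3f}, f'(+1e-4) = {slope(f, 1e-4):.3f}")
# ReLU:  f'(-1e-4) = 0.000, f'(+1e-4) = 1.000   (unit jump at zero)
# Swish: f'(-1e-4) = 0.500, f'(+1e-4) = 0.500   (smooth through zero)
```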
Why not just bigger ReLU?
SwiGLU introduces a third weight matrix, so at a fixed parameter budget the FFN intermediate dimension must shrink to about two-thirds of its usual size to compensate — Llama uses 8/3·d in place of the classical 4·d. Published ablations consistently find SwiGLU still wins at fixed parameter count (Shazeer 2020; Touvron et al. 2023). The multiplicative interaction is worth the extra matrix.
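The bookkeeping is easy to verify. Taking a width of d = 768 (chosen here only because 8d/3 comes out to an integer), shrinking the intermediate dimension to two-thirds makes the three-matrix SwiGLU FFN match the two-matrix classical FFN exactly, ignoring biases:

```python
d = 768                               # model width (illustrative)
d_ff_classic = 4 * d                  # classical intermediate size: 4d = 3072
d_ff_swiglu = 2 * d_ff_classic // 3   # shrink to 2/3: 8d/3 = 2048

params_classic = 2 * d * d_ff_classic   # W1 and W2
params_swiglu  = 3 * d * d_ff_swiglu    # W1, W2, and W3
print(params_classic, params_swiglu)    # 4718592 4718592
assert params_classic == params_swiglu
```

In practice the intermediate dimension is then rounded to a hardware-friendly multiple, so real models match the classical budget only approximately.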
You'll also see ReGLU (ReLU in place of Swish) in older models; as noted above, Gemma uses GeGLU, and the differences between GLU variants are small compared with the win of having a gate at all.
Shazeer's 2020 paper famously declines to explain why these architectures work (the "divine benevolence" line quoted above). Empirically they win. The bilinear-expressiveness story is the most-cited post-hoc theoretical justification, not a first-principles derivation.
comprehension check
comprehension · 1 / 2
What is the key structural difference between a classical FFN and a gated linear unit (SwiGLU)?