Microscale
Act II · Inside the Machine
lesson swiglu · 7 min · 40 xp

SwiGLU vs GeLU

The gated linear unit and why it wins

The FFN is half of every transformer block

Attention gets all the press, but every transformer block follows attention with a feed-forward network (FFN) — a two-layer MLP applied independently to each token position. FFNs account for roughly two-thirds of the non-embedding parameters in a modern transformer, so their design matters a lot.

Classical FFN:

\text{FFN}(x) = W_2 \, \sigma(W_1 x + b_1) + b_2

where σ is some non-linearity. In the original Transformer it was ReLU. GPT-2 used GeLU ("Gaussian error linear unit"), a smoother cousin. Modern models — Llama, Qwen, Phi, SmolLM — use a different construction entirely: the gated linear unit, specifically the SwiGLU variant introduced by Shazeer (2020).
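A minimal NumPy sketch of this classical FFN, applied to a single token vector (toy sizes and random placeholder weights; the GeLU here is the tanh approximation GPT-2 used):

```python
import numpy as np

def gelu(z):
    # tanh approximation of GeLU, as used in GPT-2
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

def ffn_classic(x, W1, b1, W2, b2):
    # FFN(x) = W2 @ gelu(W1 @ x + b1) + b2
    return W2 @ gelu(W1 @ x + b1) + b2

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32                    # toy sizes; real models use d_ff ≈ 4 * d_model
x = rng.standard_normal(d_model)
W1, b1 = rng.standard_normal((d_ff, d_model)) * 0.1, np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_model, d_ff)) * 0.1, np.zeros(d_model)
print(ffn_classic(x, W1, b1, W2, b2).shape)   # same dimension out as in
```

The expand-then-contract shape (d → 4d → d) is the classical pattern that the gated variant below reworks.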

SwiGLU — the multiplicative gate

\text{SwiGLU}(x) = W_2 \big( (W_1 x) \odot \text{Swish}(W_3 x) \big)

Two projections are computed from the same input: W₁x is the candidate value, and W₃x is fed through Swish to become the gate. An element-wise multiply combines them, then W₂ projects back down. The Swish activation is Swish(z) = z · σ(z), where σ is the sigmoid.
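The same, as a NumPy sketch (toy sizes and random weights; in practice implementations often fuse the W₁ and W₃ matmuls into one):

```python
import numpy as np

def swish(z):
    # Swish(z) = z * sigmoid(z)
    return z / (1.0 + np.exp(-z))

def swiglu_ffn(x, W1, W2, W3):
    # value branch W1 @ x, gate branch Swish(W3 @ x); multiply, project down
    return W2 @ ((W1 @ x) * swish(W3 @ x))

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32
x = rng.standard_normal(d_model)
W1 = rng.standard_normal((d_ff, d_model)) * 0.1
W3 = rng.standard_normal((d_ff, d_model)) * 0.1
W2 = rng.standard_normal((d_model, d_ff)) * 0.1
y = swiglu_ffn(x, W1, W2, W3)
print(y.shape)

# the gate really gates: since Swish(0) = 0, a zero gate projection
# zeroes the output regardless of the value branch
assert np.allclose(swiglu_ffn(x, W1, W2, np.zeros_like(W3)), 0.0)
```

Note there are no biases: following Shazeer (2020), models like Llama drop them from the gated FFN.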

historical note
2020 · Noam Shazeer, Google
SwiGLU came from a paper with the memorable title “GLU Variants Improve Transformer” (arXiv:2002.05202). Shazeer systematically tested many Gated Linear Unit variants — GeGLU, SwiGLU, ReGLU, Bilinear — against each other on a transformer language model. All the GLU variants beat the classical ReLU/GeLU FFN. Shazeer famously closed the paper with: “We offer no explanation as to why these architectures seem to work; we attribute their success, as all else, to divine benevolence.” That one-line abdication has now been cited thousands of times in papers that use SwiGLU.
◆ paper
GLU Variants Improve Transformer
Noam Shazeer · 2020
arxiv:2002.05202
activation functions
[figure: ReLU, GeLU, and Swish plotted against input x]
gradients — what the optimizer sees
[figure: derivatives of ReLU, GeLU, and Swish against input x]
Swish and GeLU are smooth everywhere — no kink at zero like ReLU. Smooth gradients → well-conditioned Hessians → larger stable learning rates → faster, less-chaotic training. For deeply overtrained SLMs that run many optimization steps, smoothness compounds.
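The kink is easy to see numerically: a central-difference derivative of ReLU jumps from 0 to 1 across zero, while Swish's derivative passes through 0.5 smoothly (a small illustrative check, not part of the lesson's plots):

```python
import numpy as np

def swish(z):
    return z / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(z, 0.0)

def num_deriv(f, z, h=1e-6):
    # central finite difference
    return (f(z + h) - f(z - h)) / (2.0 * h)

left, right = -1e-3, 1e-3                     # just either side of zero
print(num_deriv(relu, left), num_deriv(relu, right))    # 0.0 vs 1.0: a hard kink
print(num_deriv(swish, left), num_deriv(swish, right))  # both ≈ 0.5: smooth
```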

Why not just bigger ReLU?

SwiGLU introduces a third weight matrix, so at a fixed parameter budget the FFN intermediate dimension must shrink to about two-thirds of its usual size to compensate. Published ablations consistently find (Shazeer 2020; Touvron et al. 2023) that SwiGLU still wins at fixed parameter count. The multiplicative interaction is worth the extra matrix.
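The budget arithmetic, sketched in Python (d_model chosen at Llama-7B scale purely for illustration, biases ignored):

```python
# fixed-budget comparison: classical two-matrix FFN vs three-matrix SwiGLU
d_model = 4096                 # illustrative, Llama-7B-scale
d_ff_classic = 4 * d_model     # the usual 4x expansion

# classical FFN: W1 (d_ff x d_model) + W2 (d_model x d_ff)
classic = 2 * d_model * d_ff_classic

# SwiGLU adds W3, so shrink the hidden width to 2/3 of 4 * d_model
d_ff_glu = 2 * d_ff_classic // 3
swiglu = 3 * d_model * d_ff_glu

# 3 * (2/3 * 4d) * d = 8d^2 on both sides, up to integer rounding of d_ff_glu
print(classic, swiglu)
```

(Real models round d_ff up to a hardware-friendly multiple — Llama-7B uses 11008 rather than exactly 2/3 · 16384 — but the two-thirds rule is the starting point.)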

You'll also see GeGLU (GeLU in place of Swish) in Gemma, ReGLU (ReLU) in older models, and plain SwiGLU in Llama, Qwen, Phi, Mistral, SmolLM. The differences between GLU variants are small; the main win is having a gated linear unit at all.

Shazeer's 2020 paper famously offers no explanation, attributing the success "as all else, to divine benevolence." Empirically, GLU variants win. The bilinear-expressiveness story is the most-cited post-hoc theoretical justification, not a first-principles derivation.
comprehension check
comprehension · 1 / 2

What is the key structural difference between a classical FFN and a gated linear unit (SwiGLU)?