Attention gets all the press, but every transformer block follows attention with a feed-forward network (FFN) — a two-layer MLP applied independently to each token position. FFNs account for two-thirds of the parameters in a modern transformer. Their design matters a lot.
Classical FFN:
FFN(x) = W_2 σ(W_1 x + b_1) + b_2
where σ is some non-linearity. In the original Transformer it was ReLU. GPT-2 used GeLU ("Gaussian Error Linear Unit"), a smoother cousin. Modern models — Llama, Qwen, Phi, SmolLM — use a different construction entirely: the gated linear unit, specifically the SwiGLU variant introduced by Shazeer (2020).
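As a concrete reference point, here is a minimal sketch of the classical FFN in NumPy. The dimensions and the tanh-approximate GeLU formula are illustrative choices, not taken from the text above:

```python
import numpy as np

def gelu(z):
    # tanh approximation of GeLU, as popularized by GPT-2
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

def ffn_classic(x, W1, b1, W2, b2):
    # two-layer MLP applied independently to a single token vector x
    return W2 @ gelu(W1 @ x + b1) + b2

# toy dimensions (illustrative): d_model = 8, intermediate d_ff = 32
rng = np.random.default_rng(0)
d_model, d_ff = 8, 32
x  = rng.standard_normal(d_model)
W1 = rng.standard_normal((d_ff, d_model)); b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_model, d_ff)); b2 = np.zeros(d_model)
print(ffn_classic(x, W1, b1, W2, b2).shape)  # (8,)
```

Note the shape convention: the intermediate dimension d_ff is conventionally 4× the model width, which is where the FFN's parameter share comes from.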
SwiGLU — the multiplicative gate
SwiGLU(x) = W_2 ((W_1 x) ⊙ Swish(W_3 x))
Two projections are computed from the same input: W_1 x is the candidate value, W_3 x is fed through Swish and becomes the gate. An element-wise multiply combines them, then W_2 projects down. The Swish activation is Swish(z) = z · σ(z), where σ is the sigmoid.
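The same sketch style works for SwiGLU, following the gate/value convention used above (W_1 as value, W_3 through Swish as gate; weight shapes are illustrative):

```python
import numpy as np

def swish(z):
    # Swish(z) = z * sigmoid(z)
    return z / (1.0 + np.exp(-z))

def ffn_swiglu(x, W1, W2, W3):
    # value path W1 x, gate path Swish(W3 x); element-wise product, then W2 projects down
    return W2 @ ((W1 @ x) * swish(W3 @ x))

# 1-D sanity check: with identity weights the output is x * Swish(x)
x = np.array([2.0])
I = np.eye(1)
print(ffn_swiglu(x, I, I, I))  # [2 * Swish(2)] ≈ [3.5232]
```

Note that there are no biases and no non-linearity on the value path: all the non-linearity comes from the multiplicative gate.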
historical note
2020 · Noam Shazeer, Google
SwiGLU came from a paper with the memorable title “GLU Variants Improve Transformer” (arXiv:2002.05202). Shazeer systematically tested many Gated Linear Unit variants — GeGLU, SwiGLU, ReGLU, Bilinear — against each other on a transformer language model. All the GLU variants beat the classical ReLU/GeLU FFN. Shazeer famously closed the paper with: “We offer no explanation as to why these architectures seem to work; we attribute their success, as all else, to divine benevolence.” That one-line abdication has now been cited thousands of times in papers that use SwiGLU.
Llama, Qwen, Phi, Mistral, SmolLM all use SwiGLU as the FFN activation. Gemma uses GeGLU (GeLU in place of Swish). The differences between GLU variants are small; the main win is having a gated linear unit at all.
[Figure: activation functions (ReLU, GeLU, Swish)]
gradients — what the optimizer sees
Swish and GeLU are smooth everywhere — no kink at zero like ReLU. Smooth gradients → well-conditioned Hessians → larger stable learning rates → faster, less-chaotic training. For deeply overtrained SLMs that run many optimization steps, smoothness compounds.
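The kink is easy to see numerically. In this sketch (step sizes are arbitrary choices), ReLU's slope jumps from 0 to 1 across zero, while Swish's slope passes smoothly through 0.5:

```python
import numpy as np

def relu(z):  return np.maximum(z, 0.0)
def swish(z): return z / (1.0 + np.exp(-z))

def slope(f, z, h=1e-6):
    # central-difference estimate of f'(z)
    return (f(z + h) - f(z - h)) / (2 * h)

for name, f in [("ReLU", relu), ("Swish", swish)]:
    print(f"{name}: f'(-1e-4) = {slope(f, -1e-4):.3f}, f'(+1e-4) = {slope(f, 1e-4):.3f}")
# ReLU:  f'(-1e-4) = 0.000, f'(+1e-4) = 1.000   (unit jump at zero)
# Swish: f'(-1e-4) = 0.500, f'(+1e-4) = 0.500   (smooth through zero)
```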
Why not just bigger ReLU?
SwiGLU introduces a third weight matrix, so at a fixed parameter budget the FFN intermediate dimension must shrink to about two-thirds of its usual size to compensate — Llama uses 8/3·d in place of the classical 4·d. Published ablations consistently find SwiGLU still wins at fixed parameter count (Shazeer 2020; Touvron et al. 2023). The multiplicative interaction is worth the extra matrix.
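The bookkeeping is easy to verify. Taking a width of d = 768 (chosen here only because 8d/3 comes out to an integer), shrinking the intermediate dimension to two-thirds makes the three-matrix SwiGLU FFN match the two-matrix classical FFN exactly, ignoring biases:

```python
d = 768                               # model width (illustrative)
d_ff_classic = 4 * d                  # classical intermediate size: 4d = 3072
d_ff_swiglu = 2 * d_ff_classic // 3   # shrink to 2/3: 8d/3 = 2048

params_classic = 2 * d * d_ff_classic   # W1 and W2
params_swiglu  = 3 * d * d_ff_swiglu    # W1, W2, and W3
print(params_classic, params_swiglu)    # 4718592 4718592
assert params_classic == params_swiglu
```

In practice the intermediate dimension is then rounded to a hardware-friendly multiple, so real models match the classical budget only approximately.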
You'll also see ReGLU (ReLU in place of Swish) in older models; as noted above, Gemma uses GeGLU, and the differences between GLU variants are small compared with the win of having a gate at all.
Shazeer's 2020 paper famously declines to explain why these architectures work (the "divine benevolence" line quoted above). Empirically they win. The bilinear-expressiveness story is the most-cited post-hoc theoretical justification, not a first-principles derivation.
comprehension check
comprehension · 1 / 2
What is the key structural difference between a classical FFN and a gated linear unit (SwiGLU)?