Act IV · How They Learn
lesson textbooks · 9 min · 45 xp

The textbook hypothesis

Phi's synthetic data recipe

The data is the architecture you don't see

Phi-1 (2023): a 1.3B-parameter model trained on a few billion tokens of hand-curated and teacher-generated Python tutorials, reaching coding performance previously seen only in models of 13B+ parameters. The paper was titled “Textbooks Are All You Need”, and its core claim was sharp: data quality, not quantity, is the binding constraint.

Phi-4 (2024) and Phi-4-mini (2025) scaled the idea. The Phi-4 technical report describes 50 distinct synthetic dataset types, totalling ~400 billion synthetic tokens, generated by a pipeline that prompts GPT-4 for structured lessons, self-critiques them, and rewrites them. Phi-4 ends up exceeding its own teacher on STEM benchmarks — which only makes sense if you think of the teacher as a data-generation tool, not a capability ceiling.

Why this works — information theory, not magic

Web data contains enormous structural noise: boilerplate, SEO spam, near-duplicates, low-signal chatter. A small model's capacity is finite — every parameter the model spends modeling junk is a parameter not spent modeling signal. When you replace web noise with curated teacher-generated lessons, you raise the signal-to-noise ratio of each training step.

There's a second, subtler effect. Synthetic data can be generated at targeted difficulty and with a targeted distribution. Web data is whatever the internet happens to be. Synthetic data can be concentrated on exactly the skills you want (chains of reasoning, math manipulation, code + explanation) in exactly the proportions you need.
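The arithmetic behind this is worth making concrete. The skill categories and proportions below are hypothetical, purely for illustration (they are not from the Phi papers), but they show the mechanism: at a fixed token budget, choosing the mixture multiplies the tokens spent on the skills you target.

```python
# Illustrative arithmetic only: hypothetical skill mixes, not Phi's actual data.
budget = 100e9  # 100B training tokens

web_mix = {"boilerplate": 0.55, "chatter": 0.25, "code": 0.10,
           "reasoning": 0.05, "math": 0.05}          # what the crawl gives you
target_mix = {"reasoning": 0.40, "math": 0.30, "code": 0.30}  # what you choose

for skill in target_mix:
    web_tokens = web_mix.get(skill, 0.0) * budget
    syn_tokens = target_mix[skill] * budget
    print(f"{skill:9s}: {web_tokens / 1e9:4.0f}B (web) vs "
          f"{syn_tokens / 1e9:4.0f}B (synthetic) -> {syn_tokens / web_tokens:.0f}x")
```

Under these made-up numbers, the same 100B-token budget delivers 8x the reasoning tokens when the mixture is chosen rather than inherited from the crawl.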

The actual Phi-1 pipeline was more surgical than “ask GPT-4 for textbooks.” Gunasekar et al. (2023) trained a small educational-value classifier — a random-forest probe over embedding features — on a few thousand human-labelled web pages. They then ran it over The Stack and a web crawl, filtering down to the top-scoring ~6B tokens (from a pool of hundreds of billions), and mixed those with ~1B tokens of GPT-3.5-generated synthetic textbooks and exercises. The classifier is the load-bearing piece: without it, the synthetic data drifts toward the topics the teacher happens to over-generate. With it, every token earns its place by an explicit criterion you can audit.
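A minimal sketch of that filter, assuming scikit-learn for the random-forest probe: train on embedding features of labelled pages, score a large pool, keep only the top slice. Real embeddings would come from a pretrained encoder; here they are synthetic Gaussian features so the sketch runs standalone, and the sizes are toy-scale.

```python
# Sketch of an educational-value filter: random-forest probe over embeddings.
# Embeddings are stand-in random features (a real pipeline uses a pretrained
# encoder); educational docs are shifted +0.5 in the first 8 dims.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
d = 64  # embedding dimension

def fake_embeddings(n, educational):
    emb = rng.normal(size=(n, d))
    emb[:, :8] += 0.5 if educational else -0.5
    return emb

# a few thousand "human-labelled" pages to train the probe
X = np.vstack([fake_embeddings(1000, True), fake_embeddings(1000, False)])
y = np.array([1] * 1000 + [0] * 1000)
probe = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# score a large unlabelled pool and keep only the top-scoring ~3%
pool = np.vstack([fake_embeddings(10_000, True), fake_embeddings(10_000, False)])
scores = probe.predict_proba(pool)[:, 1]
k = int(0.03 * len(pool))
keep_idx = np.argsort(scores)[-k:]          # indices of the top slice
print(f"kept {k} of {len(pool)} docs; "
      f"mean score kept = {scores[keep_idx].mean():.2f} "
      f"vs pool = {scores.mean():.2f}")
```

The design choice that matters is the explicit threshold: every kept document has a score you can inspect, which is what makes the criterion auditable.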

The failure mode is distribution narrowness. Phi-1.5 and Phi-2 post strong GSM8K and HumanEval numbers but underperform on TriviaQA and cultural/historical prose — the teacher's “textbook” prior simply doesn't cover that surface area. Bubeck et al. later conceded that Phi models are “spiky” — excellent on the axes the pipeline targeted, mediocre on axes it didn't. This is why you still see the Llama line trained on messy web data: generality and long-tail world-knowledge coverage are the things web data buys that textbook data cannot.

[Figure: two token streams compared side by side.]
Left, “raw web signal”: cookie policy click here subscribe the quadratic formula is used to solve — the signal (orange) is buried in boilerplate (rust); a small model's capacity leaks into the noise.
Right, “synthetic textbook signal”: Given the quadratic ax²+bx+c=0 derive its roots by completing the square — every token carries signal; the small model spends every parameter on the thing you care about.

Is this 'just distillation'?

A sharp reader will object: “you're generating training data with GPT-4, so you're distilling GPT-4. The student can only be as good as its teacher.”

Phi-4 exceeds GPT-4 on STEM benchmarks despite being distilled from it. How? The data generation is not a one-shot rewrite. It's a pipeline: prompt, critique, rewrite, self-improve, re-generate with variations. The final training dataset is the result of many passes through the teacher, each one smoothing out the teacher's own blind spots in a different way. Plus the Phi team layers in curated web data and verified math — the student learns from sources the teacher never saw cleanly.
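The shape of that pipeline can be sketched in a few lines. Everything here is hypothetical scaffolding: `teacher()` is a stand-in for whatever LLM call the pipeline uses, stubbed out so the sketch runs without any API.

```python
# Generate-critique-rewrite loop, the core shape of the synthetic pipeline.
# `teacher` is a hypothetical stand-in for an LLM call, stubbed for this sketch.
def teacher(prompt: str) -> str:
    return f"[teacher output for: {prompt[:40]}...]"  # placeholder response

def synthesize_lesson(topic: str, rounds: int = 2) -> str:
    draft = teacher(f"Write a textbook-style lesson on {topic}.")
    for _ in range(rounds):
        # the teacher criticizes its own draft, then rewrites against
        # that critique -- each pass is a chance to catch a blind spot
        critique = teacher(f"List errors and weak steps in:\n{draft}")
        draft = teacher(f"Rewrite the lesson, fixing these issues:\n"
                        f"{critique}\n---\n{draft}")
    return draft

lesson = synthesize_lesson("completing the square")
print(type(lesson).__name__)  # the final draft is what enters the training set
```

Note that the student never sees the intermediate drafts: only the final, multiply-revised text becomes training data, which is why the student isn't capped by any single teacher output.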

The deeper point: synthetic data shapes the training distribution in ways that no teacher alone could. It's not capped by the teacher's one-shot ability.

The limit is mode collapse. If you iterate on generations of synthetic data without injecting real feedback, the distribution narrows and the student gets worse over time (Shumailov et al. 2024). Every Phi release mixes synthetic with real data precisely to avoid this.
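The narrowing effect is easy to see in a toy model. The sketch below is in the spirit of Shumailov et al. (2024) but is not their LLM experiment: each “generation” fits a Gaussian to samples drawn from the previous generation's fit. Pure self-training shrinks the variance over generations; mixing in fresh real data holds it near the truth.

```python
# Toy model-collapse demo: recursive Gaussian fitting. Pure synthetic
# self-training narrows the distribution; a 50% real-data mix does not.
import numpy as np

rng = np.random.default_rng(0)
REAL_MU, REAL_SIGMA = 0.0, 1.0  # the "real data" distribution

def run(generations=50, n=50, real_frac=0.0):
    mu, sigma = REAL_MU, REAL_SIGMA
    for _ in range(generations):
        n_real = int(n * real_frac)
        real = rng.normal(REAL_MU, REAL_SIGMA, size=n_real)
        synthetic = rng.normal(mu, sigma, size=n - n_real)
        samples = np.concatenate([real, synthetic])
        mu, sigma = samples.mean(), samples.std()  # fit the next generation
    return sigma

pure = np.mean([run(real_frac=0.0) for _ in range(300)])
mixed = np.mean([run(real_frac=0.5) for _ in range(300)])
print(f"final sigma -- pure synthetic: {pure:.2f}, 50% real mix: {mixed:.2f}")
```

With no real data, each finite-sample fit slightly underestimates the spread, and the bias compounds multiplicatively across generations; the real-data injection anchors the fit to the true distribution, which is the role the curated web and verified-math data play in the Phi mixes.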
comprehension check
comprehension · 1 / 1

What's the core claim of the 'textbook hypothesis'?