The data is the architecture you don't see
Phi-1 (2023): a 1.3B-parameter model trained on a few billion tokens of hand-curated and synthetically generated Python tutorials, reaching performance previously seen only in 13B+ models. The paper was titled “Textbooks Are All You Need” and the core claim was sharp: data quality, not quantity, is the binding constraint.
Phi-4 (2024) and Phi-4-mini (2025) scaled the idea. The Phi-4 technical report describes 50 distinct synthetic dataset types, totalling ~400 billion synthetic tokens, generated by a pipeline that prompts GPT-4 for structured lessons, self-critiques them, and rewrites them. Phi-4 ends up exceeding its own teacher on STEM benchmarks — which only makes sense if you think of the teacher as a data-generation tool, not a capability ceiling.
Why this works — information theory, not magic
Web data contains enormous structural noise: boilerplate, SEO spam, near-duplicates, low-signal chatter. A small model's capacity is finite — every parameter the model spends modeling junk is a parameter not spent modeling signal. When you replace web noise with curated teacher-generated lessons, you raise the signal-to-noise ratio of each training step.
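To make the "replace noise with signal" step concrete, here is a minimal sketch of the kind of mechanical cleanup that precedes any learned filtering: exact near-duplicate removal via normalized hashing, plus a crude boilerplate heuristic. This is illustrative only — the function names, thresholds, and features are mine, not from any Phi paper — but it shows why filtering is cheap relative to the capacity it reclaims.

```python
import hashlib
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def dedup_exact(docs: list[str]) -> list[str]:
    """Drop documents whose normalized form has already been seen."""
    seen, kept = set(), []
    for doc in docs:
        h = hashlib.sha256(normalize(doc).encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(doc)
    return kept

def boilerplate_ratio(doc: str, min_words: int = 5) -> float:
    """Fraction of lines that look like nav/boilerplate (very short lines)."""
    lines = [l for l in doc.splitlines() if l.strip()]
    if not lines:
        return 1.0
    short = sum(1 for l in lines if len(l.split()) < min_words)
    return short / len(lines)

def filter_corpus(docs: list[str], max_boilerplate: float = 0.5) -> list[str]:
    """Dedup, then keep documents whose boilerplate ratio is acceptable."""
    return [d for d in dedup_exact(docs) if boilerplate_ratio(d) <= max_boilerplate]
```

Real pipelines use fuzzy dedup (MinHash) and learned quality scores, but the shape is the same: every dropped junk document frees training steps for signal.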
There's a second, subtler effect. Synthetic data can be generated at targeted difficulty and targeted distribution. Web data is whatever the internet happens to be. Synthetic data can be fat-tailed on exactly the skills you want (chains of reasoning, math manipulation, code + explanation) in exactly the proportions you need.
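The point about targeted proportions can be stated as code: with synthetic generation, the skill mix is a design parameter you set up front rather than whatever the crawl contains. The skill names and weights below are hypothetical, purely to illustrate the mechanism.

```python
import random

# Hypothetical target mix: deliberately fat-tailed on reasoning-heavy
# skills, unlike the web's natural distribution. These names and
# proportions are illustrative, not from the Phi reports.
TARGET_MIX = {
    "chain_of_thought_math": 0.40,
    "code_with_explanation": 0.30,
    "multi_step_word_problems": 0.20,
    "general_prose": 0.10,
}

def sample_generation_plan(n_examples: int, seed: int = 0) -> dict[str, int]:
    """Allocate a synthetic-data budget according to the target skill mix.

    Each allocated slot would become a generation prompt sent to the
    teacher model; the distribution is chosen, not inherited.
    """
    rng = random.Random(seed)
    skills = list(TARGET_MIX)
    weights = [TARGET_MIX[s] for s in skills]
    plan = {s: 0 for s in skills}
    for _ in range(n_examples):
        plan[rng.choices(skills, weights=weights)[0]] += 1
    return plan
```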
The actual Phi-1 pipeline was more surgical than “ask GPT-4 for textbooks.” Gunasekar et al. 2023 trained a small educational-value classifier — a random-forest probe over embedding features — on a few thousand human-labelled web pages. They then ran it over The Stack and filtered the web crawl down to the top-scoring ~6B tokens (from a pool of hundreds of billions), and mixed that with ~1B tokens of GPT-3.5-generated synthetic textbooks and exercises. The classifier is the load-bearing piece: without it, the synthetic data drifts toward topics the teacher happens to over-generate. With it, every token earns its place by an explicit criterion you can audit.
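The classifier-then-filter stage can be sketched in a few lines. The real system used a random forest over a code model's embeddings; here, a hand-rolled perceptron over cheap surface features stands in so the sketch runs anywhere. Everything below — the features, the training loop, the keep fraction — is my illustrative stand-in, not the Phi code.

```python
def features(doc: str) -> list[float]:
    """Cheap surface features standing in for learned embeddings."""
    words = doc.split()
    n = max(len(words), 1)
    return [
        len(words) / 100.0,                       # document length
        sum(w.isalpha() for w in words) / n,      # clean-word ratio
        doc.count("def ") + doc.count("return"),  # crude code-ness signal
    ]

def train_perceptron(labeled, epochs=20, lr=0.1):
    """Fit a linear probe on a few human-labelled (doc, is_educational) pairs."""
    w = [0.0] * len(features(labeled[0][0]))
    b = 0.0
    for _ in range(epochs):
        for doc, y in labeled:
            x = features(doc)
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            err = y - pred
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

def filter_top(pool, w, b, keep_fraction=0.3):
    """Score every document in the pool and keep only the top fraction."""
    score = lambda d: sum(wi * xi for wi, xi in zip(w, features(d))) + b
    scored = sorted(pool, key=score, reverse=True)
    k = max(1, int(len(scored) * keep_fraction))
    return scored[:k]
```

The auditability point falls out of the structure: the probe's score is an explicit, inspectable criterion, so you can sample what it kept and what it rejected and argue about the threshold.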
The failure mode is distribution narrowness. Phi-1.5 and Phi-2 post strong GSM8K and HumanEval numbers but underperform on TriviaQA and cultural/historical prose — the teacher's “textbook” prior simply doesn't cover that surface area. Bubeck et al. later conceded that Phi models are “spiky” — excellent on the axes the pipeline targeted, mediocre on axes it didn't. This is why you still see the Llama line trained on messy web data: generality and long-tail world-knowledge coverage are the things web data buys that textbook data cannot.
Is this 'just distillation'?
A sharp reader will object: “you're generating training data with GPT-4, so you're distilling GPT-4. The student can only be as good as its teacher.”
Phi-4 exceeds GPT-4 on STEM benchmarks despite being distilled from it. How? The data generation is not a one-shot rewrite. It's a pipeline: prompt, critique, rewrite, self-improve, re-generate with variations. The final training dataset is the result of many passes through the teacher, each one smoothing out the teacher's own blind spots in a different way. The Phi team also layers in curated web data and verified math — the student learns from sources the teacher never saw cleanly.
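The multi-pass structure is the key objection-killer, so it's worth writing down. This is a schematic of a generate-critique-rewrite loop, not the actual Phi-4 pipeline; `teacher` is a placeholder for any LLM call (here a trivial stub so the sketch runs). Each round forces the teacher to attack its own draft, so the final text reflects several passes rather than one-shot ability.

```python
def refine(teacher, topic: str, rounds: int = 3) -> str:
    """Generate a lesson, then repeatedly self-critique and rewrite it.

    `teacher` is any callable prompt -> text. The prompts here are
    illustrative placeholders, not the actual Phi-4 prompt templates.
    """
    draft = teacher(f"Write a short lesson on: {topic}")
    for _ in range(rounds):
        critique = teacher(f"List flaws in this lesson:\n{draft}")
        draft = teacher(
            f"Rewrite the lesson fixing these flaws:\n{critique}\n\nLesson:\n{draft}"
        )
    return draft
```

Three rounds means seven teacher calls per lesson — the dataset encodes far more teacher compute per token than any single forward pass, which is one way a student can surpass its teacher's one-shot output.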
The deeper point: synthetic data shapes the training distribution in ways that no teacher alone could. It's not capped by the teacher's one-shot ability.