The $1 Pretraining Run

the aha moment

Train a 10M-parameter GPT-2 from scratch on TinyStories for about 20 minutes and watch the loss fall as the model's samples go from random noise to coherent English. Then train a second copy on corrupted data and see the textbook hypothesis (data quality beats scale) show up as a measured gap between the two runs. Compute cost: well under $1.
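To make the two-run comparison concrete, here is a minimal sketch of the idea, not the lab's actual code. It assumes a simple word-replacement scheme for "corrupted data" and compares the tail of two loss curves. The function names `corrupt_words` and `loss_gap`, the corruption rate, and the loss values are all hypothetical and chosen only for illustration; the lab may corrupt data and measure the gap differently.

```python
import random
from statistics import mean


def corrupt_words(text: str, rate: float, seed: int = 0) -> str:
    """Replace roughly `rate` of the words with words drawn at random
    from the same text. Hypothetical corruption scheme for illustration."""
    rng = random.Random(seed)
    words = text.split()
    if not words:
        return text
    vocab = sorted(set(words))
    return " ".join(
        rng.choice(vocab) if rng.random() < rate else w for w in words
    )


def loss_gap(clean_losses, corrupted_losses, window: int = 3) -> float:
    """Mean of the last `window` corrupted-run losses minus the same
    mean for the clean run. A positive value means the clean run won."""
    return mean(corrupted_losses[-window:]) - mean(clean_losses[-window:])


story = "Once upon a time a little dog found a red ball in the park"
print(corrupt_words(story, rate=0.3))

# Made-up loss values, only to show the calculation.
clean = [10.0, 4.0, 2.0, 2.0, 2.0]
noisy = [10.0, 5.0, 3.0, 3.0, 3.0]
print(loss_gap(clean, noisy))
```

Averaging over the last few logged steps rather than reading a single final value keeps the comparison from hinging on step-to-step noise in either curve.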

Open in Colab View on GitHub

the facts

Time: 90–120 min
Hardware: GPU · Mac · Colab · CPU
Act: IV · How They Learn
Status: Live
Artifact: Two trained 10M-param models + loss curves + side-by-side generation samples.

run it locally

Clone the labs repo and run this lab as a script or open it as a notebook:

git clone https://github.com/iqbal-sk/Microscale-labs.git
cd Microscale-labs
just setup-auto      # auto-detects CPU / CUDA / Mac
just run 05
# or:  jupyter lab labs/05-dollar-pretraining/lab.py

Full install options (uv, pip, or the platform-specific CUDA paths) are in the labs README.

read alongside

Lesson · 10 min · 50 xp

Scaling laws, alive

Visualize LLM scaling laws interactively — why Chinchilla's 20:1 ratio broke, how inference-optimal uses 1000× more tokens per parameter, and the crossover

Lesson · 9 min · 45 xp

The textbook hypothesis

How Microsoft Phi proved data quality beats scale — educational-value classifiers, synthetic textbook generation, mode collapse risks, and Phi-4 exceeding GPT-4
