the aha moment
Build 20 preference pairs for a narrow task (a chosen vs a rejected response per prompt), run TRL's DPOTrainer on Qwen3-0.6B for 100 steps, and watch the model's behaviour shift from generic chatbot to something that specifically matches your chosen examples. Alignment stops being abstract and becomes a trained adapter you can A/B test.
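A preference pair is just a prompt plus the response you prefer and one you don't. A minimal sketch of assembling the dataset in the `prompt`/`chosen`/`rejected` format that TRL's DPOTrainer consumes (the example prompts below are placeholders, not from the lab):

```python
import json

# One preference pair: a prompt, the response you prefer, and one you reject.
# TRL's DPO dataset format uses exactly these three string fields.
def make_pair(prompt: str, chosen: str, rejected: str) -> dict:
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

# Placeholder examples -- in the lab you write ~20 of these for YOUR narrow task.
pairs = [
    make_pair(
        "Summarize this changelog entry in one line: ...",
        "Fixed a race condition in the cache eviction path.",   # terse, specific
        "Great question! There are many things to consider...", # generic chatbot
    ),
    make_pair(
        "Name this function: def f(xs): return [x*x for x in xs]",
        "square_all",
        "Sure! Here are some thoughts on naming functions in general...",
    ),
]

# Save as JSONL; load_dataset("json", data_files="pairs.jsonl") reads it back.
with open("pairs.jsonl", "w") as f:
    for p in pairs:
        f.write(json.dumps(p) + "\n")
```

From there, training is a handful of lines (untested sketch, assuming `trl`, `peft`, and `transformers` are installed): load `Qwen/Qwen3-0.6B`, pass a LoRA `peft_config` and the JSONL dataset to `DPOTrainer` with `max_steps=100`, then save the adapter.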
the facts
- Time: 90 min
- Hardware: GPU · Mac · Colab
- Act: VI · Making It Yours
- Status: Live
- Artifact: a DPO-aligned LoRA adapter + a before/after comparison report
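The comparison-report artifact can be as simple as a markdown table of responses. A hypothetical helper (not from the lab code) that assembles it, assuming you have already collected responses from the base model and the DPO adapter for the same prompts:

```python
def comparison_report(prompts: list[str], before: dict, after: dict) -> str:
    """Build a markdown before/after report.

    before/after map each prompt to the base-model and DPO-adapter response.
    """
    lines = ["# Before/after: base model vs DPO adapter", ""]
    for i, p in enumerate(prompts, 1):
        lines += [
            f"## Prompt {i}",
            f"> {p}",
            "",
            f"**Base:** {before[p]}",
            "",
            f"**DPO:** {after[p]}",
            "",
        ]
    return "\n".join(lines)

# Toy usage with one prompt and canned responses:
prompts = ["Summarize: ..."]
report = comparison_report(
    prompts,
    before={"Summarize: ...": "Great question! Let me explain at length..."},
    after={"Summarize: ...": "Fixed cache eviction race."},
)
print(report)
```

Reading the two columns side by side is the fastest way to see whether the adapter actually moved toward your chosen examples or just got terser everywhere.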
run it locally
Clone the labs repo and run this lab as a script or open it as a notebook:
```shell
git clone https://github.com/iqbal-sk/Microscale-labs.git
cd Microscale-labs
just setup-auto   # auto-detects CPU / CUDA / Mac
just run 08       # or: jupyter lab labs/08-dpo-alignment/lab.py
```
Full install options (uv, pip, or the platform-specific CUDA paths) are in the labs README.
read alongside