MicroscaleLabs
Lab 02 · 60–90 min · CPU · Mac · GPU · Colab

Attention Under the Microscope

Act II · Inside the Machine
the aha moment

Load Qwen3-0.6B, register forward hooks on every attention layer, and extract all 448 attention patterns (28 layers × 16 heads) as heatmaps. Find the previous-token head, find the induction head, find the uniform global-attention head. Zero one out and measure the perplexity hit. Some heads matter, most don't — and now you can prove it.
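The hook mechanism is the whole trick: PyTorch lets you attach a callback to any module that fires on every forward pass. A minimal sketch of the capture step, using a toy nn.MultiheadAttention stand-in rather than Qwen3's real attention modules (whose names and shapes differ):

```python
import torch
import torch.nn as nn

# Toy attention layer standing in for one of Qwen3-0.6B's 28 layers.
attn = nn.MultiheadAttention(embed_dim=32, num_heads=4, batch_first=True)
captured = {}

def save_pattern(name):
    def hook(module, inputs, output):
        # nn.MultiheadAttention returns (attn_output, attn_weights).
        captured[name] = output[1].detach()
    return hook

handle = attn.register_forward_hook(save_pattern("layer0"))

x = torch.randn(1, 10, 32)  # (batch, seq_len, embed_dim)
attn(x, x, x, need_weights=True, average_attn_weights=False)
handle.remove()

# One pattern per head: (batch, heads, query_pos, key_pos)
print(captured["layer0"].shape)  # → torch.Size([1, 4, 10, 10])
```

Each captured (query_pos, key_pos) slice is one heatmap in the gallery; every row sums to 1 because attention weights are a softmax over key positions.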

Open in Colab · View on GitHub
the facts
Time: 60–90 min
Hardware: CPU · Mac · GPU · Colab
Act: II · Inside the Machine
Status: Live
Artifact: A 28×16 gallery of attention-head heatmaps and an ablation-impact grid.
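The ablation-impact number behind the grid reduces to one formula: perplexity is the exponential of the mean next-token cross-entropy, computed once with the model intact and once with a head's output zeroed by a hook. A sketch of just the metric, with random stand-in logits instead of real model outputs:

```python
import torch
import torch.nn.functional as F

def perplexity(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """exp(mean next-token cross-entropy). logits: (seq, vocab), targets: (seq,)."""
    return F.cross_entropy(logits, targets).exp().item()

# In the lab, `logits_ablated` would come from a forward pass with one
# head zeroed via a hook; here both are random stand-ins.
torch.manual_seed(0)
vocab, seq = 100, 50
targets = torch.randint(0, vocab, (seq,))
ppl_base = perplexity(torch.randn(seq, vocab), targets)
ppl_ablated = perplexity(torch.randn(seq, vocab), targets)
print(f"Δppl from ablation: {ppl_ablated - ppl_base:+.2f}")
```

Sanity check on the definition: uniform logits over a vocabulary of V tokens give perplexity exactly V, which is why a large Δppl after zeroing a single head is strong evidence that head was doing real work.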
run it locally

Clone the labs repo and run this lab as a script or open it as a notebook:

git clone https://github.com/iqbal-sk/Microscale-labs.git
cd Microscale-labs
just setup-auto      # auto-detects CPU / CUDA / Mac
just run 02
# or:  jupyter lab labs/02-attention-microscope/lab.py

Full install options (uv, pip, or the platform-specific CUDA paths) are in the labs README.

read alongside