Microscale: a field journal for small language models



How can you ship a model you can't explain?

It sounds like you've worked near these architectures long enough to know the names — MoE, MLA, RoPE, FlashAttention, LoRA — but not to explain why each one exists. This is the layer underneath.

54 lessons · 13 labs · every claim ties to a shipped model or a paper you can trace


or browse the full map

new to all this? start here

The Primer — understand LLMs from scratch

No deep-learning background needed. Eight short reads plus a glossary, in plain English.


or open the workbench · 13 hands-on labs

the atlas

Nine regions. One specimen at a time.

Click a region to enter. Each lesson is a playable model, not a lecture — you'll earn the mathematics the same way you earn the badge.

try a specimen

tokenizer

Interactive demo (o200k_base vocabulary): type any sentence to see its tokens, its character count, and its characters per token.

The ▁ symbol is a visual stand-in for a leading space; this is how tokenizers encode word boundaries. A token that starts with ▁ means “this piece is the start of a new word”. The number under each chip is the actual integer vocab ID that gets fed into the transformer's embedding table. Try typing a rare or non-English word and watch it fragment into small pieces: that's BPE gracefully degrading to sub-word coverage instead of hitting an unknown-token wall. Chips with a dashed style (for example, E3 81) are partial UTF-8 bytes: the token doesn't correspond to a standalone character on its own, only in combination with its neighbours. That's what byte-level BPE can do to CJK and emoji: it splits one visible character across several tokens.
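The partial-byte case can be reproduced without a tokenizer. The sketch below uses only the Python standard library; the helper names `utf8_hex` and `is_standalone` are made up for illustration, and the split point is a hypothetical token boundary, not one taken from o200k_base. It shows that the hiragana character あ is three UTF-8 bytes, and that the first two (E3 81) do not decode to a character by themselves.

```python
def utf8_hex(text: str) -> list[str]:
    """Return the UTF-8 bytes of `text` as upper-case hex pairs."""
    return [f"{b:02X}" for b in text.encode("utf-8")]


def is_standalone(byte_seq: bytes) -> bool:
    """True if the bytes decode to valid text without any neighbours."""
    try:
        byte_seq.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


char = "あ"                      # one visible character
raw = char.encode("utf-8")       # three bytes: E3 81 82

# Hypothetical token boundary after the second byte (an assumption for
# illustration; a real vocabulary decides where the split falls).
head, tail = raw[:2], raw[2:]

print(utf8_hex(char))            # ['E3', '81', '82']
print(is_standalone(head))       # False: E3 81 is an incomplete sequence
print(is_standalone(head + tail))  # True: the pieces only make sense together
```

A byte-level tokenizer that places a boundary inside such a sequence yields tokens like `head`, which is why those chips cannot be rendered as a character.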

Every lesson embeds a playable version of whatever it teaches.

read the lesson · Tokens and probabilities

who this is for

You already use language models. Maybe you fine-tune them. Maybe you serve them. Maybe you read release notes and still skip the architectural footnotes.

Microscale is for the point where that stops being enough — when “MoE,” “KV cache,” “RoPE scaling,” “FlashAttention,” and “LoRA rank” need to become mechanisms you can reason about, not terms you recognize.

Bring Python, tensors, linear algebra, and patience. If you want prompt tips, you want a different site.

on the workbench

A different way to learn sits next to the reading path. Twelve specimens wait on the workbench — all 448 attention heads of a 600M model classifying themselves into previous-token and induction patterns, a 10M transformer descending from noise to coherent English in twenty minutes of consumer GPU time, a 2 MB LoRA adapter that reshapes a model's voice on twenty cooking examples, your own GPU's bandwidth plotted on a roofline against your own model's arithmetic intensity.

Every one produces a number or a file you keep. None of them require a datacentre.
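The roofline specimen rests on one comparison, which can be sketched in a few lines. This is a minimal illustration of the standard roofline model, not the lab's own code; the function names and the hardware figures are hypothetical placeholders that you would replace with your own measurements.

```python
def attainable_flops(peak_flops: float,
                     bandwidth_bytes_per_s: float,
                     intensity_flops_per_byte: float) -> float:
    """Roofline bound: the lower of the compute roof and the memory roof."""
    memory_roof = bandwidth_bytes_per_s * intensity_flops_per_byte
    return min(peak_flops, memory_roof)


def ridge_point(peak_flops: float, bandwidth_bytes_per_s: float) -> float:
    """Arithmetic intensity (FLOPs/byte) at which a kernel stops being memory-bound."""
    return peak_flops / bandwidth_bytes_per_s


# Hypothetical GPU: 100 TFLOP/s peak compute, 1 TB/s memory bandwidth.
PEAK = 100e12
BANDWIDTH = 1e12

# A low-intensity workload (2 FLOPs per byte moved) sits under the memory roof.
low = attainable_flops(PEAK, BANDWIDTH, 2.0)

# A high-intensity workload (500 FLOPs per byte moved) hits the compute roof.
high = attainable_flops(PEAK, BANDWIDTH, 500.0)

print(ridge_point(PEAK, BANDWIDTH))  # intensity where the two roofs meet
print(low, high)
```

Plotting `attainable_flops` against intensity on log-log axes gives the familiar roofline shape; a workload whose intensity falls left of the ridge point is limited by bandwidth rather than compute.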

open the workbench · microscale.academy/labs

a note from the cartographer

This journal is organised as a slow path, not a dense reference. There is a canonical order through the regions, but you are free to wander. Nothing is locked; progress rings appear only to help you find your way back.

leave a note for the cartographer · notes@microscale.academy