Tokens and probabilities
How BPE tokenization works, what softmax probabilities mean, and how LLMs choose the next token — from logits to nucleus sampling
Lab 01 · The Token Tax · 30 min
A token is usually not a word
Everyone's mental model of tokenisation is the same at first: “the model reads words”. And for a surprisingly long time, that's good enough. But it isn't quite right. Modern language models read sub-word pieces — fragments somewhere between characters and words, carefully chosen so that common sequences collapse into a single unit while rare or novel ones fall back to smaller pieces.
Why? Two reasons, both practical:
- Open vocabulary. With whole-word tokens, any word the model has never seen becomes [UNK]. Sub-word tokens decompose gracefully: “unfortunateness” becomes something like un+fortun+ate+ness, and each of those pieces is already in the model's vocabulary.
- Density. Frequent sequences like “the” or “ing” get a single token, so the model's context window covers more meaning per slot than it would with character-level tokens.
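The open-vocabulary point can be made concrete with a small sketch. Note the assumptions: both vocabularies here are hypothetical, and the segmentation is greedy longest-match, a simpler stand-in for real BPE that happens to show the same fallback behaviour.

```python
# Toy illustration only: greedy longest-match segmentation, not real BPE.
# Both vocabularies below are hypothetical.
WORD_VOCAB = {"the", "cat", "unfortunate"}
SUBWORD_VOCAB = {"un", "fortun", "ate", "ness"} | set("abcdefghijklmnopqrstuvwxyz")


def word_level(word: str) -> list[str]:
    """Whole-word lookup: anything unseen collapses to [UNK]."""
    return [word] if word in WORD_VOCAB else ["[UNK]"]


def subword_level(word: str) -> list[str]:
    """Repeatedly take the longest vocabulary entry that prefixes the rest."""
    pieces = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in SUBWORD_VOCAB:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No entry matches, not even a single character.
            pieces.append("[UNK]")
            i += 1
    return pieces


print(word_level("unfortunateness"))     # ['[UNK]']
print(subword_level("unfortunateness"))  # ['un', 'fortun', 'ate', 'ness']
print(subword_level("xkdftw"))           # ['x', 'k', 'd', 'f', 't', 'w']
```

The word-level tokenizer loses the unseen word entirely, while the sub-word one always has somewhere to fall back to, down to single characters.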
Byte-pair encoding, in one paragraph
The dominant algorithm is called BPE (byte-pair encoding). You start with every character — or every byte — as its own token. Then you count the most frequent adjacent pair in your training corpus, merge it into a single symbol, and repeat. After tens of thousands of rounds you end up with a vocabulary where some entries are whole words (the, and), some are common suffixes (-ing, -ed), and some are rare-but-useful character sequences that serve as a fallback for anything unfamiliar.
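The merge loop described above fits in a few lines. This is a minimal character-level sketch for intuition, not how production tokenizers are trained: `train_bpe` and the tiny corpus are illustrative names and data, and real implementations work on bytes, pre-split text with a regex, and run far more merges.

```python
from collections import Counter


def train_bpe(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn up to `num_merges` merges from a list of words (toy version)."""
    # Each distinct word is a tuple of symbols, weighted by its frequency.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)  # ties go to the first pair seen
        merges.append(best)
        # Rewrite every word with the winning pair fused into one symbol.
        merged_words = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_words[tuple(out)] += freq
        words = merged_words
    return merges


corpus = ["hug"] * 10 + ["pug"] * 5 + ["pun"] * 12 + ["bun"] * 4 + ["hugs"] * 5
print(train_bpe(corpus, 3))  # [('u', 'g'), ('u', 'n'), ('h', 'ug')]
```

Each merge adds one entry to the vocabulary, so the number of rounds directly sets the vocabulary size.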
Every modern tokenizer (o200k_base, Llama's SentencePiece, Gemma's, Qwen's) is a variation on this same 1994 compression algorithm.
The real thing: try it yourself
The playground below is not a toy. It runs the actual BPE merge tables used by OpenAI's production models. When you type a sentence, each keystroke runs through the same 200,064-entry vocabulary that GPT-4o uses, and the integer underneath each chip is the real token ID: the number that gets fed into the transformer's embedding lookup.
Four tokenizers are available: GPT-4o (200k vocab, the current frontier), GPT-4 / GPT-3.5 (100k, cl100k_base), GPT-3 (50k, p50k_base), and GPT-2 (50k, the historical baseline from 2019). Try the same sentence through each of them and watch the token count shift. Larger vocabularies produce fewer tokens per sentence because they have learned more common sequences as single units.
What to try in the playground
Four quick experiments worth the minute they take:
- Common English: type “the cat sat on the mat”. You'll see 6 tokens in o200k_base, one per word, because every word is in the high-frequency vocabulary.
- Something technical: try “transformer” or “backpropagation”. Watch how some words stay as one token while others fragment. The ones that stay whole appeared often enough in training for the vocabulary to learn them as single units.
- A rare or invented word: type “unfurnishednesses” or “xkdftw”. BPE gracefully decomposes into smaller and smaller pieces until it hits character-level fragments — this is the fallback that makes it “open vocabulary”.
- Non-English text: try Japanese (“こんにちは”), Arabic, Hindi, or emoji (🚀🎉). The older tokenizers (GPT-2, GPT-3) will explode the token count because they weren't trained on much non-English data. GPT-4o's o200k_base was specifically designed to be better at this. Try the same Japanese string through GPT-2 and then through GPT-4o and compare the token counts.
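One way to see why non-English text is expensive: a byte-level BPE tokenizer (GPT-2's is one) starts from UTF-8 bytes, so when few learned merges apply, the token count drifts toward the byte count. The sketch below only counts bytes with the standard library; it does not run any real tokenizer, and `utf8_byte_count` is an illustrative helper name.

```python
def utf8_byte_count(text: str) -> int:
    """Bytes a byte-level tokenizer starts from before any merges apply."""
    return len(text.encode("utf-8"))


for sample in ["hello", "こんにちは", "🚀🎉"]:
    # Columns: text, Unicode code points, UTF-8 bytes
    print(sample, len(sample), utf8_byte_count(sample))
# hello 5 5
# こんにちは 5 15
# 🚀🎉 2 8
```

Five Japanese characters already cost fifteen bytes, so a vocabulary with few Japanese merges has three times as much raw material to cover as it does for five ASCII letters.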
Probability over a vocabulary — how big is big?
Once the text is tokenized, the model's job at each step is to output a probability distribution over the entire vocabulary. For a 200,064-token vocabulary, that's a vector of 200,064 positive real numbers that must sum to 1.
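The step from raw scores (logits) to that distribution is softmax. Here is a minimal sketch on a hypothetical four-token vocabulary; the logit values are made up, and a real model would produce one logit per vocabulary entry.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Turn arbitrary real-valued scores into positive numbers summing to 1."""
    m = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1, -3.0]  # hypothetical scores for a 4-token vocabulary
probs = softmax(logits)
print([round(p, 3) for p in probs])
print(sum(probs))  # 1.0, up to floating-point rounding
```

Every entry comes out strictly positive, which is why even wildly implausible tokens keep a sliver of probability.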
The cost of predicting over a big vocabulary is real — it's the last matmul in the whole network (the output projection, sometimes called the “LM head”), and for a ~3B model with a 200k vocab it can easily be 10–15% of total parameters. This is why the tied embeddings trick, which we cover in Act II, is such a valuable optimization — you save the cost of one of the two big vocab-sized matrices by sharing weights.
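A quick back-of-the-envelope check of that share. The hidden size of 2048 and the round 3 billion total are assumptions chosen for illustration, not the dimensions of any specific model, and `lm_head_params` is a hypothetical helper that ignores any bias term.

```python
def lm_head_params(vocab_size: int, d_model: int) -> int:
    """Parameters in a vocab_size x d_model output projection (no bias)."""
    return vocab_size * d_model


VOCAB_SIZE = 200_064          # from the text
D_MODEL = 2048                # assumed hidden size
TOTAL_PARAMS = 3_000_000_000  # assumed "~3B" model

head = lm_head_params(VOCAB_SIZE, D_MODEL)
share = head / TOTAL_PARAMS
print(f"{head:,} parameters, {share:.1%} of the model")
# 409,731,072 parameters, 13.7% of the model
```

Without weight tying the input embedding table is the same size again, which is where the saving from sharing the two matrices comes from.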
The distribution is not the answer; it is the material the answer is sampled from. A decoder draws one token from that 200,064-wide vector, appends it to the context, and runs the whole forward pass again for the next position. How it draws matters. Argmax (greedy) picks the single highest-probability token every time: deterministic, often dull, and pathologically prone to loops. Temperature divides the logits by T before softmax, so T < 1 sharpens the distribution and T > 1 flattens it. Top-p (nucleus) sampling keeps only the smallest set of tokens whose probabilities sum to at least p (typically 0.9) and renormalises. The point is to cut off the fat tail of 199,000 tokens, each with a tiny probability, that collectively hold 10% of the mass and almost always represent noise. These aren't stylistic knobs; they are the reason the same model under two different sampling settings feels like two different products.
- Neural Machine Translation of Rare Words with Subword Units. Sennrich, Haddow, Birch · 2015 · ACL 2016. BPE re-applied from data compression to NLP. Every modern tokenizer is a descendant.
- SentencePiece: A simple and language independent subword tokenizer and detokenizer. Kudo, Richardson · 2018 · EMNLP 2018 (system demonstrations). The library behind Llama, Gemma, Qwen tokenizers. Implements BPE and unigram-LM tokenization.
- LLaMA: Open and Efficient Foundation Language Models. Touvron, Lavril, Izacard et al. · 2023 · arXiv (Meta AI / FAIR). Llama 1: 32k vocab via SentencePiece BPE; the recipe later families inherit.
- Language Model Tokenizers Introduce Unfairness Between Languages. Petrov, La Malfa, Torr, Bibi · 2023 · NeurIPS 2023. Differences up to 15× in tokenization length between languages on the same tokenizer.