A language model is a probability distribution
Forget the architecture for a moment. Forget “parameters” and “training”. A language model, at its most reduced, is a function that answers one question: given the words so far, what comes next?
Hand it a sequence of words and it returns a probability distribution over every possible next word in its vocabulary. That's it. Every large language model — every Phi, every Llama, every GPT — is a machine that computes exactly this. The architecture is how it computes it; the training is how it gets good at it.
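That contract fits in a few lines. A minimal sketch, assuming a hypothetical `languageModel` stub (the words and numbers here are made up; a real model computes them):

```javascript
// A language model, reduced to its contract: context in, distribution out.
// This stub always returns the same distribution; real models compute one
// conditioned on the context.
function languageModel(context) {
  // Map from candidate next word to its probability; a valid
  // distribution's values must sum to 1.
  return new Map([["the", 0.5], ["a", 0.3], ["cat", 0.2]]);
}

const dist = languageModel(["once", "upon"]);
const total = [...dist.values()].reduce((sum, p) => sum + p, 0);
console.log(total); // ≈ 1 — it's a probability distribution
```

Everything else in this essay is about what goes on inside that function body.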
Here is the specimen
Below is a toy model — hand-written in a few lines of JavaScript, running in your browser, knowing nothing beyond a small table of word pairs. We're using a toy so you can see the shape of the question before we invest any machinery in answering it. A real SLM does exactly what you're about to do, only over a 200,000-token vocabulary and contextualized across thousands of prior tokens.
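If you can't run the page, here is a stand-in with the same shape: a bigram model over a hand-written table of word-pair counts (the table entries are ours, not the specimen's), normalized on demand into a distribution over next words.

```javascript
// A toy bigram model: counts of which word followed which in some
// imagined corpus. This table IS the model.
const pairs = {
  the: { cat: 3, dog: 1 },
  cat: { sat: 2, ran: 2 },
  sat: { on: 4 },
  on:  { the: 4 },
};

// Given the last word of the context, return { nextWord: probability }.
function nextWordDistribution(word) {
  const counts = pairs[word] ?? {};
  const total = Object.values(counts).reduce((sum, c) => sum + c, 0);
  const dist = {};
  for (const [next, c] of Object.entries(counts)) dist[next] = c / total;
  return dist;
}

console.log(nextWordDistribution("the")); // { cat: 0.75, dog: 0.25 }
```

A real model differs in scale (a learned table with billions of parameters instead of four rows) and in context (it conditions on the whole preceding sequence, not just one word) — but the output is the same kind of object: a distribution over what comes next.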
Two ways to pick
Once the model hands you a distribution, you have to decide what to do with it. Two classical choices:
- Greedy (argmax): always pick the single highest-probability token. Deterministic. Often produces robotic, repetitive text.
- Stochastic sampling: pick each token with probability equal to its model score. Adds variety but can veer off-topic.
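Both strategies fit in a few lines each. A sketch over a distribution shaped like the toy's output (function names are ours):

```javascript
// Greedy decoding: always take the highest-probability word. Deterministic.
function pickGreedy(dist) {
  let best = null, bestP = -Infinity;
  for (const [word, p] of Object.entries(dist)) {
    if (p > bestP) { best = word; bestP = p; }
  }
  return best;
}

// Stochastic sampling: draw one word, each with probability equal to
// its score. `rand` is injectable so the draw can be made repeatable.
function pickSampled(dist, rand = Math.random) {
  let r = rand();
  for (const [word, p] of Object.entries(dist)) {
    r -= p;
    if (r <= 0) return word;
  }
  return Object.keys(dist).pop(); // guard against floating-point drift
}

const dist = { cat: 0.75, dog: 0.25 };
console.log(pickGreedy(dist));             // "cat", every time
console.log(pickSampled(dist, () => 0.9)); // "dog" for this fixed draw
```

Run `pickSampled` with the default `Math.random` and you get "cat" about three times in four — which is exactly the variety, and the risk, the bullet above describes.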
Real deployments almost always use a middle ground — techniques called top-k, top-p (nucleus sampling), and temperature. These are all ways to reshape the distribution before sampling from it (cut off the tail, sharpen or soften the peaks). We come back to them in Act VIII.
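Two of those reshaping moves, sketched directly on probabilities (real implementations divide the logits by the temperature before the softmax; raising each probability to 1/T and renormalizing is the equivalent operation on this side of it):

```javascript
// Temperature: raise each probability to 1/T, then renormalize.
// T < 1 sharpens the peaks; T > 1 flattens them toward uniform.
function applyTemperature(dist, T) {
  const scaled = Object.fromEntries(
    Object.entries(dist).map(([w, p]) => [w, p ** (1 / T)])
  );
  const total = Object.values(scaled).reduce((sum, p) => sum + p, 0);
  for (const w of Object.keys(scaled)) scaled[w] /= total;
  return scaled;
}

// Top-k: keep only the k most probable words, renormalize the survivors.
function topK(dist, k) {
  const kept = Object.entries(dist)
    .sort((a, b) => b[1] - a[1])
    .slice(0, k);
  const total = kept.reduce((sum, [, p]) => sum + p, 0);
  return Object.fromEntries(kept.map(([w, p]) => [w, p / total]));
}

const dist = { cat: 0.5, dog: 0.3, fish: 0.2 };
console.log(topK(dist, 2));                // cat ≈ 0.625, dog ≈ 0.375, no fish
console.log(applyTemperature(dist, 0.5));  // cat's share grows past 0.6
```

Note the composition: you reshape first, then hand the result to whichever picker you chose above. Greedy is the degenerate case — temperature approaching zero.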
Why this framing matters
Everything that looks magical about a chatbot — reasoning, planning, “personality” — emerges from repeated applications of this one operation. Reason step by step? That's the model predicting each reasoning step as the next token. Call a tool? That's the model predicting the JSON that happens to be a function call. Refuse a harmful request? That's the model predicting “I can't help with that” because its training shaped the distribution to make those tokens highly probable after certain prompts.
Keep this in mind as we go. Every architectural lever, every training trick, every specialization technique in Microscale is, in the end, about reshaping this distribution to land in the places you want.