A language model is a probability distribution
Forget the architecture for a moment. Forget “parameters” and “training”. A language model, at its most reduced, is a function that answers one question: given the words so far, what comes next?
Hand it a sequence of words and it returns a probability distribution over every possible next word in its vocabulary. That's it. Every large language model — every Phi, every Llama, every GPT — is a machine that computes exactly this. The architecture is how it computes it; the training is how it gets good at it.
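That contract fits in a few lines. A minimal sketch, assuming a hypothetical `languageModel` stub (the words and numbers here are made up; a real model computes them):

```javascript
// A language model, reduced to its contract: context in, distribution out.
// This stub always returns the same distribution; real models compute one
// conditioned on the context.
function languageModel(context) {
  // Map from candidate next word to its probability; a valid
  // distribution's values must sum to 1.
  return new Map([["the", 0.5], ["a", 0.3], ["cat", 0.2]]);
}

const dist = languageModel(["once", "upon"]);
const total = [...dist.values()].reduce((sum, p) => sum + p, 0);
console.log(total); // ≈ 1 — it's a probability distribution
```

Everything else in this essay is about what goes on inside that function body.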
Here is the specimen
Below is a toy model — hand-written in a few lines of JavaScript, running in your browser, knowing nothing beyond a small table of word pairs. We're using a toy so you can see the shape of the question before we invest any machinery in answering it. A real SLM does exactly what you're about to do, only over a 200,000-token vocabulary and contextualized across thousands of prior tokens.
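If you can't run the page, here is a stand-in with the same shape: a bigram model over a hand-written table of word-pair counts (the table entries are ours, not the specimen's), normalized on demand into a distribution over next words.

```javascript
// A toy bigram model: counts of which word followed which in some
// imagined corpus. This table IS the model.
const pairs = {
  the: { cat: 3, dog: 1 },
  cat: { sat: 2, ran: 2 },
  sat: { on: 4 },
  on:  { the: 4 },
};

// Given the last word of the context, return { nextWord: probability }.
function nextWordDistribution(word) {
  const counts = pairs[word] ?? {};
  const total = Object.values(counts).reduce((sum, c) => sum + c, 0);
  const dist = {};
  for (const [next, c] of Object.entries(counts)) dist[next] = c / total;
  return dist;
}

console.log(nextWordDistribution("the")); // { cat: 0.75, dog: 0.25 }
```

A real model differs in scale (a learned table with billions of parameters instead of four rows) and in context (it conditions on the whole preceding sequence, not just one word) — but the output is the same kind of object: a distribution over what comes next.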
Two ways to pick
Once the model hands you a distribution, you have to decide what to do with it. Two classical choices:
- Greedy (argmax): always pick the single highest-probability token. Deterministic. Often produces robotic, repetitive text.
- Stochastic sampling: pick each token with probability equal to its model score. Adds variety but can veer off-topic.
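Both strategies fit in a few lines each. A sketch over a distribution shaped like the toy's output (function names are ours):

```javascript
// Greedy decoding: always take the highest-probability word. Deterministic.
function pickGreedy(dist) {
  let best = null, bestP = -Infinity;
  for (const [word, p] of Object.entries(dist)) {
    if (p > bestP) { best = word; bestP = p; }
  }
  return best;
}

// Stochastic sampling: draw one word, each with probability equal to
// its score. `rand` is injectable so the draw can be made repeatable.
function pickSampled(dist, rand = Math.random) {
  let r = rand();
  for (const [word, p] of Object.entries(dist)) {
    r -= p;
    if (r <= 0) return word;
  }
  return Object.keys(dist).pop(); // guard against floating-point drift
}

const dist = { cat: 0.75, dog: 0.25 };
console.log(pickGreedy(dist));             // "cat", every time
console.log(pickSampled(dist, () => 0.9)); // "dog" for this fixed draw
```

Run `pickSampled` with the default `Math.random` and you get "cat" about three times in four — which is exactly the variety, and the risk, the bullet above describes.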
Real deployments almost always use a middle ground — techniques called top-k, top-p (nucleus sampling), and temperature. These are all ways to reshape the distribution before sampling from it (cut off the tail, sharpen or soften the peaks). We come back to them in Act VIII.
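Two of those reshaping moves, sketched directly on probabilities (real implementations divide the logits by the temperature before the softmax; raising each probability to 1/T and renormalizing is the equivalent operation on this side of it):

```javascript
// Temperature: raise each probability to 1/T, then renormalize.
// T < 1 sharpens the peaks; T > 1 flattens them toward uniform.
function applyTemperature(dist, T) {
  const scaled = Object.fromEntries(
    Object.entries(dist).map(([w, p]) => [w, p ** (1 / T)])
  );
  const total = Object.values(scaled).reduce((sum, p) => sum + p, 0);
  for (const w of Object.keys(scaled)) scaled[w] /= total;
  return scaled;
}

// Top-k: keep only the k most probable words, renormalize the survivors.
function topK(dist, k) {
  const kept = Object.entries(dist)
    .sort((a, b) => b[1] - a[1])
    .slice(0, k);
  const total = kept.reduce((sum, [, p]) => sum + p, 0);
  return Object.fromEntries(kept.map(([w, p]) => [w, p / total]));
}

const dist = { cat: 0.5, dog: 0.3, fish: 0.2 };
console.log(topK(dist, 2));                // cat ≈ 0.625, dog ≈ 0.375, no fish
console.log(applyTemperature(dist, 0.5));  // cat's share grows past 0.6
```

Note the composition: you reshape first, then hand the result to whichever picker you chose above. Greedy is the degenerate case — temperature approaching zero.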
Why this framing matters
Everything that looks magical about a chatbot — reasoning, planning, “personality” — emerges from repeated applications of this one operation. Reason step by step? That's the model predicting each reasoning step as the next token. Call a tool? That's the model predicting the JSON that happens to be a function call. Refuse a harmful request? That's the model predicting “I can't help with that” because its training shaped the distribution to make those tokens highly probable after certain prompts.
Keep this in mind as we go. Every architectural lever, every training trick, every specialization technique in Microscale is, in the end, about reshaping this distribution to land in the places you want.