Microscale
Act V · Where They Break
lesson cot-regression · 9 min · 45 xp

Chain-of-thought regression

Small models write fluent nonsense

Fluent prose is not the same as reasoning

Here is a grade-school word problem — the kind every reasoning model claims to handle. Read it carefully.

A store sells apples for $0.50 each and pears for $0.75 each. Alex buys 6 apples and 4 pears. She pays with a $10 bill. How much change does she get?

The correct answer is $4.00. Click “Show the small model” and read what a 2022-era 1B-parameter model actually produced. The output looks like reasoning. It is not.

Click “Show the small model” above to reveal a 2022-era 1B model's reasoning chain for this problem. Then click “Also show a larger model” to put a 70B trace next to it and compare.

What went wrong in step 5?

The small model multiplied and added plausibly, right up to the last step: “pays with $10, cost $5.50, change is $10 + $5.50”. It added when it should have subtracted. Notice that the fluency is intact: the model shows no visible confusion, and it produces the wrong sign in a tone of complete confidence.
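The arithmetic is small enough to verify mechanically. A few lines of Python reproduce both the correct final step and the sign-flipped one; incidentally, the true total is $6.00, so the trace's claimed $5.50 had already drifted before the sign error:

```python
# Verify the word problem, then reproduce the small model's
# sign-flip on the final step.

APPLE_PRICE, PEAR_PRICE = 0.50, 0.75
PAID = 10.00

total = 6 * APPLE_PRICE + 4 * PEAR_PRICE   # 3.00 + 3.00 = 6.00
correct_change = PAID - total              # subtract: $4.00
sign_flipped = PAID + total                # the model's operator: $16.00

print(f"total ${total:.2f}, correct ${correct_change:.2f}, flipped ${sign_flipped:.2f}")
```

Both wrong answers (the $5.50 total and the flipped sign) are delivered in the same fluent voice as the right steps, which is exactly why the trace reads as reasoning.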

Wei et al. 2022 (“Chain-of-Thought Prompting Elicits Reasoning in Large Language Models”) documented this precisely. On GSM8K, prompting a 7B LaMDA with few-shot chain-of-thought exemplars actually hurt accuracy by about 3 points versus asking for a direct answer; the same exemplars on the 62B variant added ~7 points; on 137B they added ~18 points. (The phrase “let's think step by step” belongs to Kojima et al. 2022's later zero-shot variant; Wei et al. used worked examples in the prompt.) Below roughly 10B parameters at 2022 training quality, CoT was a regression. The visible output looked like reasoning because the training data contained lots of reasoning-shaped prose. But the latent computation driving the output was not actually doing the reasoning; the model was pattern-matching on reasoning style without tracking the numerical content. This is the phenomenon the literature calls CoT regression: for small models, the chain is strictly worse than no chain at all.

The modern fix — distillation of reasoning traces

Today's 3B reasoning-tuned SLMs (Phi-4-mini-reasoning, DeepSeek-R1-distill variants) handle this same problem. How? Not by being smarter, but by being trained on correct CoT traces from much larger teachers. The student inherits the teacher's trajectory through the reasoning space and learns to reproduce it.

This is not reasoning learned from scratch; it is imitation of correct reasoning. It works for in-distribution problems, ones similar to the teacher's training data. It is fragile on out-of-distribution problems where the teacher was not confident either. But it is enough to close the embarrassing gap on grade-school math for SLMs in 2026.
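The data side of the recipe can be sketched in a few lines. Everything below is illustrative: the `TeacherTrace` container and the prompt/target format are assumptions made for this lesson, not any particular library's API. The core move is just turning a large teacher's correct trace into a supervised (prompt, target) pair for the student:

```python
# Minimal sketch of reasoning-trace distillation data prep.
# Field names and template are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class TeacherTrace:
    problem: str
    steps: list[str]   # the large teacher's chain of thought
    answer: str

def to_sft_example(trace: TeacherTrace) -> dict:
    """Format one teacher trace as a (prompt, target) pair for
    supervised fine-tuning of the student."""
    target = "\n".join(trace.steps) + f"\nAnswer: {trace.answer}"
    return {"prompt": trace.problem, "target": target}

trace = TeacherTrace(
    problem="6 apples at $0.50 and 4 pears at $0.75; pay with $10. Change?",
    steps=[
        "Apples: 6 * 0.50 = 3.00",
        "Pears: 4 * 0.75 = 3.00",
        "Total: 3.00 + 3.00 = 6.00",
        "Change: 10.00 - 6.00 = 4.00",
    ],
    answer="$4.00",
)
example = to_sft_example(trace)
```

The student is then fine-tuned to emit `target` given `prompt`, which is why it reproduces the teacher's trajectory on familiar problems and falters on unfamiliar ones.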

A diagnostic test: ask the model to solve a problem using exactly one step. If it cannot compress, its CoT is probably imitation-only; the model is not reasoning so much as reciting a memorized, reasoning-shaped chain.
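The probe is easy to automate, assuming `model` is any callable from a prompt string to a reply string. The stubs below stand in for real models; the heuristic (a single-line reply counts as compressed) is a simplifying assumption for this sketch:

```python
# Sketch of the one-step compression diagnostic.

def one_step_probe(model, problem: str) -> bool:
    """Ask for the answer in exactly one step; return True if the
    reply is a single non-empty line (i.e., the model compressed)."""
    prompt = f"{problem}\nAnswer in exactly one step."
    reply = model(prompt).strip()
    return bool(reply) and len(reply.splitlines()) == 1

problem = "6 apples at $0.50 and 4 pears at $0.75; pay with $10. Change?"

# Stub that cannot compress: it always recites a full chain.
reciter = lambda p: "Step 1: ...\nStep 2: ...\nAnswer: $4.00"
# Stub that can: it answers in one line.
solver = lambda p: "Change: $4.00"

reciter_compresses = one_step_probe(reciter, problem)  # False
solver_compresses = one_step_probe(solver, problem)    # True
```

A model that fails the probe may still score well on benchmarks whose problems resemble its teacher's traces; the probe targets the imitation, not the accuracy.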