Microscale
Act V · Where They Break
lesson cot-regression · 9 min · 45 xp

Chain-of-thought regression

Small models write fluent nonsense

Fluent prose is not the same as reasoning

Here is a grade-school word problem — the kind every reasoning model claims to handle. Read it carefully.

A store sells apples for $0.50 each and pears for $0.75 each. Alex buys 6 apples and 4 pears. She pays with a $10 bill. How much change does she get?

The correct answer is $4.00. Click “Show the small model” and read what a 2022-era 1B-parameter model actually produced. The output looks like reasoning. It is not.

Click “Show the small model” above to reveal a 2022-era 1B model's reasoning chain for this problem. Then click “Also show a larger model” to put a 70B trace next to it and compare.

What went wrong in step 5?

The small model multiplied and added plausibly, right up to the last step: “pays with $10, cost $5.50, change is $10 + $5.50”. It added when it should have subtracted. Notice that the fluency is intact: the model shows no visible confusion, and it produces the wrong sign in a tone of complete confidence.
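The arithmetic is small enough to verify mechanically. A few lines of Python reproduce both the correct final step and the sign-flipped one; incidentally, the true total is $6.00, so the trace's claimed $5.50 had already drifted before the sign error:

```python
# Verify the word problem, then reproduce the small model's
# sign-flip on the final step.

APPLE_PRICE, PEAR_PRICE = 0.50, 0.75
PAID = 10.00

total = 6 * APPLE_PRICE + 4 * PEAR_PRICE   # 3.00 + 3.00 = 6.00
correct_change = PAID - total              # subtract: $4.00
sign_flipped = PAID + total                # the model's operator: $16.00

print(f"total ${total:.2f}, correct ${correct_change:.2f}, flipped ${sign_flipped:.2f}")
```

Both wrong answers (the $5.50 total and the flipped sign) are delivered in the same fluent voice as the right steps, which is exactly why the trace reads as reasoning.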

Wei et al. 2022 (“Chain-of-Thought Prompting Elicits Reasoning in Large Language Models”) documented this precisely. On GSM8K, prompting a 7B LaMDA with few-shot chain-of-thought exemplars actually hurt accuracy by about 3 points versus asking for a direct answer; the same exemplars on the 62B variant added ~7 points; on 137B they added ~18 points. (The phrase “let's think step by step” belongs to Kojima et al. 2022's later zero-shot variant; Wei et al. used worked examples in the prompt.) Below roughly 10B parameters at 2022 training quality, CoT was a regression. The visible output looked like reasoning because the training data contained lots of reasoning-shaped prose. But the latent computation driving the output was not actually doing the reasoning; the model was pattern-matching on reasoning style without tracking the numerical content. This is the phenomenon the literature calls CoT regression: for small models, the chain is strictly worse than no chain at all.

The modern fix — distillation of reasoning traces

Today's 3B reasoning-tuned SLMs (Phi-4-mini-reasoning, DeepSeek-R1-distill variants) handle this same problem. How? Not by being smarter, but by being trained on correct CoT traces from much larger teachers. The student inherits the teacher's trajectory through the reasoning space and learns to reproduce it.

This is not reasoning learned from scratch; it is imitation of correct reasoning. It works for in-distribution problems, ones similar to the teacher's training data. It is fragile on out-of-distribution problems where the teacher was not confident either. But it is enough to close the embarrassing gap on grade-school math for SLMs in 2026.
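The data side of the recipe can be sketched in a few lines. Everything below is illustrative: the `TeacherTrace` container and the prompt/target format are assumptions made for this lesson, not any particular library's API. The core move is just turning a large teacher's correct trace into a supervised (prompt, target) pair for the student:

```python
# Minimal sketch of reasoning-trace distillation data prep.
# Field names and template are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class TeacherTrace:
    problem: str
    steps: list[str]   # the large teacher's chain of thought
    answer: str

def to_sft_example(trace: TeacherTrace) -> dict:
    """Format one teacher trace as a (prompt, target) pair for
    supervised fine-tuning of the student."""
    target = "\n".join(trace.steps) + f"\nAnswer: {trace.answer}"
    return {"prompt": trace.problem, "target": target}

trace = TeacherTrace(
    problem="6 apples at $0.50 and 4 pears at $0.75; pay with $10. Change?",
    steps=[
        "Apples: 6 * 0.50 = 3.00",
        "Pears: 4 * 0.75 = 3.00",
        "Total: 3.00 + 3.00 = 6.00",
        "Change: 10.00 - 6.00 = 4.00",
    ],
    answer="$4.00",
)
example = to_sft_example(trace)
```

The student is then fine-tuned to emit `target` given `prompt`, which is why it reproduces the teacher's trajectory on familiar problems and falters on unfamiliar ones.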

A diagnostic test: ask the model to solve a problem using exactly one step. If it cannot compress, its CoT is probably imitation-only; the model is not reasoning so much as reciting a memorized, reasoning-shaped chain.
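The probe is easy to automate, assuming `model` is any callable from a prompt string to a reply string. The stubs below stand in for real models; the heuristic (a single-line reply counts as compressed) is a simplifying assumption for this sketch:

```python
# Sketch of the one-step compression diagnostic.

def one_step_probe(model, problem: str) -> bool:
    """Ask for the answer in exactly one step; return True if the
    reply is a single non-empty line (i.e., the model compressed)."""
    prompt = f"{problem}\nAnswer in exactly one step."
    reply = model(prompt).strip()
    return bool(reply) and len(reply.splitlines()) == 1

problem = "6 apples at $0.50 and 4 pears at $0.75; pay with $10. Change?"

# Stub that cannot compress: it always recites a full chain.
reciter = lambda p: "Step 1: ...\nStep 2: ...\nAnswer: $4.00"
# Stub that can: it answers in one line.
solver = lambda p: "Change: $4.00"

reciter_compresses = one_step_probe(reciter, problem)  # False
solver_compresses = one_step_probe(solver, problem)    # True
```

A model that fails the probe may still score well on benchmarks whose problems resemble its teacher's traces; the probe targets the imitation, not the accuracy.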