Every inference request costs tokens. Most engineers count the tokens in the answer. Fewer count the tokens in the thinking.
That asymmetry is the mistake.
For most of the transformer era, scaling meant one thing: more parameters. More weights, more pretraining compute, more capability.
OpenAI's o1 introduced a second axis: test-time compute. The model generates a chain of thought before producing its answer. That chain — the reasoning trace — is not output. It is computation. The model is not writing down its work for the reader. It is doing work, in token space, that would otherwise require a larger model.
The original OpenAI post put it plainly: "the performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute)."
These are two independent levers on the same objective.
A reasoning model generates tokens in two phases:
1. Reasoning tokens — the internal chain of thought. Not shown to the user in most deployments. Consumed, then discarded.
2. Output tokens — the visible answer.
The reasoning tokens are not decorative. They are where the model corrects itself, explores branches, verifies intermediate steps, and backtracks from wrong paths. Research on o1-like models identifies at least four distinct reasoning patterns in those tokens: correcting mistakes, trying different approaches, verification, and exploration.
Strip the reasoning tokens and the model answers like a smaller one. Keep them and a small model can produce answers that would otherwise require a larger model.
Standard inference cost: for a fixed prompt, proportional to output length.
Reasoning inference cost: for the same prompt, proportional to reasoning length plus output length. Reasoning length scales with problem difficulty, not with output length.
A simple arithmetic question might consume 20 reasoning tokens before a 5-token answer. A hard coding problem might consume 8,000 reasoning tokens before a 200-token answer. The ratio is not fixed. It is determined by the problem.
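The arithmetic behind those two examples can be sketched in a few lines. This is a minimal illustration, not any provider's billing logic: the `Usage` type, its field names, and the flat `price_per_million` rate are assumptions made for the example, as is the idea that reasoning and output tokens are billed at the same rate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    """Token counts for one request (hypothetical field names)."""
    reasoning_tokens: int
    output_tokens: int

    @property
    def generated_tokens(self) -> int:
        # Both phases are generated by the model, so both count toward cost.
        return self.reasoning_tokens + self.output_tokens

    @property
    def reasoning_ratio(self) -> float:
        # Reasoning tokens spent per visible output token.
        return self.reasoning_tokens / self.output_tokens


def generation_cost(usage: Usage, price_per_million: float) -> float:
    """Cost of generated tokens at an assumed flat price per million tokens."""
    return usage.generated_tokens * price_per_million / 1_000_000


# The two cases from the text.
arithmetic = Usage(reasoning_tokens=20, output_tokens=5)
coding = Usage(reasoning_tokens=8_000, output_tokens=200)

print(arithmetic.generated_tokens, arithmetic.reasoning_ratio)
print(coding.generated_tokens, coding.reasoning_ratio)
```

The arithmetic question spends 4 reasoning tokens per output token; the coding problem spends 40. Counting only the 200 visible tokens of the coding answer would miss 8,000 of the 8,200 tokens actually generated.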
This has a concrete engineering consequence: reasoning models are unpredictable in latency and cost. For a given prompt, you do not know in advance how long the model will think. The variance is not a bug — it is the mechanism.
Two distinct strategies exist for allocating more test-time compute:
Sequential scaling: longer chains of thought. More tokens per reasoning trace. The model explores more deeply. Billed cost grows linearly with trace length, but each new token attends to everything before it, so attention compute over the whole trace grows quadratically with context, and very long chains get more expensive per token as they grow.
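A rough way to see where the quadratic term comes from is to count how many positions attention reads while a trace is generated. The sketch below is a deliberately simplified model, not a FLOPs calculator: it assumes each generated token attends to the prompt plus every token generated so far (itself included), and it ignores feed-forward cost, batching, and any attention optimizations. The function name is made up for the example.

```python
def attention_reads(trace_length: int, prompt_length: int = 0) -> int:
    """Total positions attended to while generating `trace_length` tokens.

    Simplified model: the token at step i attends to the prompt plus the
    i tokens generated so far, so the total is
    trace_length * prompt_length + (1 + 2 + ... + trace_length).
    """
    if trace_length < 0 or prompt_length < 0:
        raise ValueError("lengths must be non-negative")
    return trace_length * prompt_length + trace_length * (trace_length + 1) // 2


short_trace = attention_reads(1_000)
long_trace = attention_reads(8_000)
print(short_trace, long_trace, long_trace / short_trace)
```

Under this count a 1,000-token trace reads 500,500 positions and an 8,000-token trace reads 32,004,000: eight times the tokens, roughly 64 times the attention work.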
Parallel scaling: multiple independent samples, then selection of the best. Best-of-N sampling. The model thinks multiple times, in parallel, and a verifier picks the winner. Cost grows linearly with N. No quadratic penalty across samples, but less coherent exploration, because no sample builds on another.
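Best-of-N can be sketched as a small higher-order function. This is a minimal illustration under stated assumptions: `sample` and `score` are hypothetical callables standing in for a model call and a verifier, and the toy versions below exist only so the example runs. The candidates are drawn in a loop for simplicity; because they are independent, a real system could issue them concurrently.

```python
import itertools
from typing import Callable


def best_of_n(
    prompt: str,
    sample: Callable[[str], str],
    score: Callable[[str, str], float],
    n: int,
) -> str:
    """Draw n independent candidates and return the highest-scoring one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Each candidate is generated without seeing the others.
    candidates = [sample(prompt) for _ in range(n)]
    # The verifier scores every candidate; ties go to the earliest one.
    return max(candidates, key=lambda answer: score(prompt, answer))


# Toy stand-ins (hypothetical): a "model" that cycles through canned answers
# and a "verifier" that knows the right answer to this one prompt.
_canned = itertools.cycle(["4", "5", "6"])


def toy_sample(prompt: str) -> str:
    return next(_canned)


def toy_score(prompt: str, answer: str) -> float:
    return 1.0 if answer == "5" else 0.0


print(best_of_n("2+3=?", toy_sample, toy_score, n=3))
```

The design choice that matters is the verifier: selection is only as good as `score`, and generation cost is N times the cost of a single sample.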
A 2024 research paper quantified the tradeoff: by optimally scaling test-time compute, a smaller model can outperform a much larger one in a FLOPs-matched evaluation. A smaller model, given enough thinking budget, can beat a larger model given none.
More thinking is not always better.
For simple problems, extended reasoning chains introduce redundancy — the model recalculates what it already knows, re-verifies correct answers, or oscillates between solutions it cannot distinguish. The 2024 paper "Do NOT Think That Much for 2+3=?" quantified this: reasoning token consumption is not calibrated to problem difficulty by default. The model does not know when to stop.
This is an inference budget problem. You want to allocate more thinking to hard problems and less to easy ones. Unconstrained, the model cannot make that judgment reliably.
The engineering consequence: for production systems, test-time compute requires a budget, not just a context window.
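One way to make "a budget" concrete is a per-request cap on reasoning tokens that depends on a difficulty tier and on how much of an overall allowance is left. The sketch below is an assumption-laden illustration: the tier names, the cap values, and the `BudgetTracker` class are all hypothetical, and how a cap is actually enforced (for example through a provider's token-limit setting) varies by provider and is not shown. Deciding which tier a request belongs to is a separate problem and is taken as an input here.

```python
from dataclasses import dataclass

# Hypothetical per-request reasoning caps, in tokens, by difficulty tier.
TIER_CAPS = {"easy": 256, "medium": 2_048, "hard": 16_384}


@dataclass
class BudgetTracker:
    """Tracks a shared allowance of reasoning tokens (illustrative only)."""
    remaining_tokens: int

    def allocate(self, tier: str) -> int:
        """Return the reasoning-token cap for one request in `tier`."""
        if tier not in TIER_CAPS:
            raise KeyError(f"unknown tier: {tier}")
        # Never hand out more than the tier allows or the allowance holds.
        return min(TIER_CAPS[tier], self.remaining_tokens)

    def record(self, reasoning_tokens_used: int) -> None:
        """Deduct what a finished request actually consumed."""
        self.remaining_tokens = max(0, self.remaining_tokens - reasoning_tokens_used)


tracker = BudgetTracker(remaining_tokens=10_000)
print(tracker.allocate("easy"))   # capped by the tier
print(tracker.allocate("hard"))   # capped by what is left
```

Separating `allocate` from `record` reflects the point made above: the cap is set before the request, but actual consumption is only known afterwards.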
If you are running a reasoning model in production:
A reasoning model does not have more knowledge than a base model. It has more compute at inference time, structured as token generation. Thinking tokens are not output. They are a compute budget, expressed in the only currency a transformer understands.
Understanding that changes how you price, route, and design systems around these models.