Thinking Tokens Are Compute

Every inference request costs tokens. Most engineers count the tokens in the answer. Fewer count the tokens in the thinking.

That asymmetry is the mistake.

Two Axes of Scaling

For most of the transformer era, scaling meant one thing: more parameters. More weights, more pretraining compute, more capability.

OpenAI's o1 introduced a second axis: test-time compute. The model generates a chain of thought before producing its answer. That chain — the reasoning trace — is not output. It is computation. The model is not writing down its work for the reader. It is doing work, in token space, that would otherwise require a larger model.

The original OpenAI post put it plainly: "the performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute)."

These are two independent levers on the same objective.

What Thinking Tokens Actually Are

A reasoning model generates tokens in two phases:

1. Reasoning tokens — the internal chain of thought. Not shown to the user in most deployments. Consumed, then discarded.

2. Output tokens — the visible answer.

The reasoning tokens are not decorative. They are where the model corrects itself, explores branches, verifies intermediate steps, and backtracks from wrong paths. Research on o1-like models identifies at least four distinct reasoning patterns in those tokens: correcting mistakes, trying different approaches, verification, and exploration.

Strip the reasoning tokens and the model answers at its base capability. Keep them and the same model can produce an answer that would otherwise require a larger model.

The Cost Model Is Different

Standard inference cost: for a fixed prompt, proportional to output length.

Reasoning inference cost: for the same prompt, proportional to reasoning length plus output length. Reasoning length scales with problem difficulty, not output length.

A simple arithmetic question might consume 20 reasoning tokens before a 5-token answer. A hard coding problem might consume 8,000 reasoning tokens before a 200-token answer. The ratio is not fixed. It is determined by the problem.
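The arithmetic is worth running once. The sketch below uses the two token counts from the paragraph above; the price is a hypothetical placeholder, not any provider's rate, and it assumes reasoning tokens are billed at the same rate as answer tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    name: str
    reasoning_tokens: int
    answer_tokens: int

    @property
    def generated_tokens(self) -> int:
        # Everything the model generates is paid for, visible or not.
        return self.reasoning_tokens + self.answer_tokens

    @property
    def reasoning_ratio(self) -> float:
        # Reasoning tokens spent per visible answer token.
        return self.reasoning_tokens / self.answer_tokens


def generation_cost(req: Request, price_per_million: float) -> float:
    """Dollar cost of the generated tokens at a flat per-token price."""
    return req.generated_tokens * price_per_million / 1_000_000


PRICE_PER_MILLION = 10.0  # hypothetical price in dollars, for illustration only

easy = Request("arithmetic", reasoning_tokens=20, answer_tokens=5)
hard = Request("coding", reasoning_tokens=8_000, answer_tokens=200)

for req in (easy, hard):
    cost = generation_cost(req, PRICE_PER_MILLION)
    print(f"{req.name}: {req.generated_tokens} tokens, "
          f"{req.reasoning_ratio:.0f}:1 reasoning to answer, ${cost:.5f}")
```

The easy request generates 25 tokens at a 4:1 ratio. The hard one generates 8,200 at 40:1. The visible answer is 40 times longer; the bill is 328 times larger.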

This has a concrete engineering consequence: reasoning models are unpredictable in latency and cost. For a given prompt, you do not know in advance how long the model will think. The variance is not a bug — it is the mechanism.

Sequential vs. Parallel Test-Time Compute

Two distinct strategies exist for allocating more test-time compute:

Sequential scaling: longer chains of thought. More tokens per reasoning trace. The model explores more deeply. Billed cost grows linearly with trace length, but each new token attends to every token before it, so total attention compute grows roughly quadratically with trace length. Very long chains get more expensive per token as they grow.

Parallel scaling: multiple independent samples, then selection of the best. Best-of-N sampling. The model thinks multiple times, in parallel, and a verifier picks the winner. Cost grows linearly with N. No quadratic penalty. The samples do not share intermediate results, so exploration is less coherent.
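A rough way to see the difference between the two strategies is to count attention work directly. The sketch below is a simplified model, not a FLOPs calculator: it assumes each generated token attends once to every position before it and to itself, and it ignores prompt prefill and the per-token feed-forward cost, which is linear and identical in both cases. The function names are illustrative.

```python
def attention_units(trace_len: int, prompt_len: int = 0) -> int:
    """Token-pair attention reads needed to generate trace_len tokens.

    Generated token i (1-indexed) attends to prompt_len + i positions,
    so the total is trace_len * prompt_len + trace_len * (trace_len + 1) / 2.
    """
    return trace_len * prompt_len + trace_len * (trace_len + 1) // 2


def sequential_units(total_tokens: int, prompt_len: int = 0) -> int:
    """One long chain that spends the whole token budget."""
    return attention_units(total_tokens, prompt_len)


def parallel_units(total_tokens: int, n_samples: int, prompt_len: int = 0) -> int:
    """n_samples independent chains that split the same budget evenly."""
    per_sample = total_tokens // n_samples
    return n_samples * attention_units(per_sample, prompt_len)


budget = 8_000
print("sequential:", sequential_units(budget))
print("parallel x8:", parallel_units(budget, n_samples=8))
```

With an 8,000-token budget, the single chain does about 32.0 million token-pair reads under this model. Eight 1,000-token chains do about 4.0 million. Same billed tokens, roughly an eighth of the attention work.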

Snell et al. (2024) quantified the tradeoff: by optimally scaling test-time compute, a smaller model can outperform a much larger one in a FLOPs-matched evaluation. The result holds on problems where the smaller model already has a non-trivial success rate. On the hardest problems, more pretraining compute remained the better investment.

The Overthinking Problem

More thinking is not always better.

For simple problems, extended reasoning chains introduce redundancy — the model recalculates what it already knows, re-verifies correct answers, or oscillates between solutions it cannot distinguish. The 2024 paper "Do NOT Think That Much for 2+3=?" quantified this: reasoning token consumption is not calibrated to problem difficulty by default. The model does not know when to stop.

This is an inference budget problem. You want to allocate more thinking to hard problems and less to easy ones. Unconstrained, the model cannot make that judgment reliably.

The engineering consequence: for production systems, test-time compute requires a budget, not just a context window.
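A budget can be as simple as a lookup table keyed by problem class. The sketch below is one possible shape, not a recommendation: the class names, the token caps, and the `plan_request` helper are all hypothetical, and the difficulty label is assumed to come from an upstream classifier that is not shown.

```python
from dataclasses import dataclass
from enum import Enum


class Difficulty(Enum):
    EASY = "easy"
    MEDIUM = "medium"
    HARD = "hard"


# Hypothetical caps on reasoning tokens per problem class.
# Real values should come from measured traffic, not from this table.
REASONING_BUDGET = {
    Difficulty.EASY: 0,
    Difficulty.MEDIUM: 2_000,
    Difficulty.HARD: 8_000,
}


@dataclass(frozen=True)
class Plan:
    use_reasoning_model: bool
    max_generated_tokens: int  # reasoning cap plus answer allowance


def plan_request(difficulty: Difficulty, answer_allowance: int = 500) -> Plan:
    """Turn a difficulty label into a routing decision and a hard token cap."""
    budget = REASONING_BUDGET[difficulty]
    return Plan(
        use_reasoning_model=budget > 0,
        max_generated_tokens=budget + answer_allowance,
    )


for level in Difficulty:
    print(level.value, plan_request(level))
```

The cap is passed to whatever limit parameter the serving API exposes. A zero budget routes the query to a standard model instead of paying for thinking it does not need.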

What This Changes

If you are running a reasoning model in production:

1. Budget reasoning tokens explicitly. Most APIs expose a reasoning token limit or a max completion tokens parameter. Set it. Know what hard and easy problems cost separately.

2. Do not route all queries to a reasoning model. Reasoning models on simple queries are expensive and often no more accurate than standard models. Route by problem class, not model capability.

3. Latency is not predictable from input length. A short prompt can produce a very long reasoning trace. P99 latency for reasoning models is typically far above the median. Design accordingly.

4. Parallel sampling is underused. For problems with a verifiable answer (code, math), best-of-N with a lightweight verifier often beats a single long chain at lower total cost.
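Best-of-N with a verifier fits in a few lines. The sketch below is a minimal illustration under stated assumptions: `fake_sample` is a deterministic stand-in for a real model call, `check_answer` is a toy verifier for a single arithmetic question, and both names are hypothetical. A real sampler would call a model with a nonzero temperature.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Tuple


def best_of_n(
    prompt: str,
    sample: Callable[[str, int], str],
    score: Callable[[str], float],
    n: int = 4,
) -> Tuple[str, float]:
    """Draw n independent candidates and return the best one with its score.

    Ties go to the earliest candidate. If nothing verifies, the caller
    sees the low score and can fall back or retry.
    """
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(lambda seed: sample(prompt, seed), range(n)))
    best = max(candidates, key=score)
    return best, score(best)


def fake_sample(prompt: str, seed: int) -> str:
    # Stand-in for a model call: proposes a different answer for each seed.
    return str(40 + seed)


def check_answer(candidate: str) -> float:
    # Lightweight verifier for the toy task "What is 6 * 7?".
    return 1.0 if candidate == "42" else 0.0


print(best_of_n("What is 6 * 7?", fake_sample, check_answer, n=4))
```

The verifier does the selecting, so it has to be cheap and trustworthy. Unit tests for code and exact-match checks for math qualify. A second model call usually does not count as lightweight.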

The Core Insight

A reasoning model does not have more knowledge than a base model. It has more compute at inference time, structured as token generation. Thinking tokens are not output. They are a compute budget, expressed in the only currency a transformer understands.

Understanding that changes how you price, route, and design systems around these models.

References

OpenAI, 2024. "Learning to reason with LLMs." OpenAI Blog, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/
Snell, C., Lee, J., Xu, K., Kumar, A., 2024. "Scaling LLM Test-Time Compute Optimally Can be More Effective than Scaling Model Parameters." arXiv:2408.03314. https://openreview.net/forum?id=4FWAwZtd2n
Zeng, Z., Cheng, Q., Yin, Z., Zhou, Y., Qiu, X., 2025. "Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities?" ACL 2025. https://aclanthology.org/2025.acl-long.232/
Chen, X. et al., 2024. "Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs." arXiv:2412.21187. https://arxiv.org/abs/2412.21187
Yang, W., Ma, S., Lin, Y., Wei, F., 2025. "Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning." arXiv:2502.18080. https://arxiv.org/abs/2502.18080