Speculative Decoding: The Free Tokens

Decoding is slow because it is sequential. The model generates one token, feeds it back, generates the next. Every token requires a full forward pass. You cannot parallelize across the output sequence.

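Speculative decoding breaks this constraint, not by changing the model, but by exploiting idle GPU compute. At small batch sizes, decoding is limited by memory bandwidth: every step streams the full set of weights, so the arithmetic units sit mostly idle.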

The mechanism

A small, fast draft model proposes K tokens ahead. The large target model then verifies all K in a single forward pass. Tokens the target accepts are kept. At the first rejection, the target's own token replaces the draft's and the remaining draft tokens are discarded. Then the process repeats.
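The draft-then-verify loop described above can be sketched for the greedy case as follows. This is a minimal illustration, not any engine's implementation: `target_model` and `draft_model` are hypothetical toy functions standing in for real networks, and the list comprehension in step 2 only emulates what a real engine computes in one batched forward pass.

```python
from typing import Callable, List

Token = int
# Assumption: a "model" is any function mapping a token prefix to its greedy next token.
NextToken = Callable[[List[Token]], Token]


def speculative_greedy(target: NextToken, draft: NextToken,
                       prompt: List[Token], k: int, max_new: int) -> List[Token]:
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # 1. The draft proposes k tokens autoregressively (k cheap passes).
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))

        # 2. The target scores all k+1 positions. A real engine gets these
        #    from a single forward pass; this loop only emulates the result.
        verified = [target(out + proposal[:i]) for i in range(k + 1)]

        # 3. Keep the longest prefix on which draft and target agree.
        n_ok = 0
        while n_ok < k and proposal[n_ok] == verified[n_ok]:
            n_ok += 1

        # 4. Append the accepted tokens plus one target token: the correction
        #    at the first mismatch, or a bonus token if all k were accepted.
        out.extend(proposal[:n_ok])
        out.append(verified[n_ok])
    return out[:len(prompt) + max_new]


def target_model(prefix: List[Token]) -> Token:
    # Hypothetical toy target: next token is the prefix sum modulo 7.
    return sum(prefix) % 7


def draft_model(prefix: List[Token]) -> Token:
    # Hypothetical toy draft: mirrors the target unless the last token is 3.
    return 0 if prefix[-1] == 3 else sum(prefix) % 7


def plain_greedy(model: NextToken, prompt: List[Token], max_new: int) -> List[Token]:
    out = list(prompt)
    for _ in range(max_new):
        out.append(model(out))
    return out
```

Comparing `speculative_greedy(target_model, draft_model, [1, 2], 4, 20)` with `plain_greedy(target_model, [1, 2], 20)` shows the two produce the same sequence even though the toy draft is sometimes wrong.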

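The key insight: verifying K drafted tokens in one target pass costs about the same as generating a single token, because both are dominated by reading the weights. Each round therefore yields between 1 and K+1 tokens for roughly the cost of one target pass plus K cheap draft passes. That is the free lunch.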

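With the standard acceptance rule, the output matches what the target model alone would produce. Under greedy decoding the tokens are identical (up to floating-point effects); under sampling, the rejection-sampling scheme of Leviathan et al. (2023) preserves the target's output distribution exactly. No quality tradeoff. Just speed.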
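For sampled (non-greedy) decoding, the acceptance rule from Leviathan et al. (2023) can be sketched as below. The function names and the plain-list representation of probabilities are assumptions made for illustration; real engines apply the same rule to tensors of logits.

```python
import random
from typing import List


def accept_or_resample(p: List[float], q: List[float], drafted: int,
                       rng: random.Random) -> int:
    """p: target probabilities, q: draft probabilities, drafted: token sampled from q."""
    # Accept the drafted token with probability min(1, p/q).
    if rng.random() < min(1.0, p[drafted] / q[drafted]):
        return drafted
    # On rejection, resample from the residual max(0, p - q).
    # rng.choices normalizes the weights itself.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(range(len(p)), weights=residual, k=1)[0]


def sample_token(p: List[float], q: List[float], rng: random.Random) -> int:
    # One full step: draw from the draft distribution, then verify against the target.
    drafted = rng.choices(range(len(q)), weights=q, k=1)[0]
    return accept_or_resample(p, q, drafted, rng)
```

Drawing many tokens with `sample_token` and tallying them should reproduce `p`, not `q`, which is the sense in which the scheme is lossless.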

When it works

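The mechanism only pays off when the draft's acceptance rate is high enough to cover its own cost. As a rule of thumb the break-even sits around 0.55–0.60, though the exact figure depends on K and on how cheap the draft model is relative to the target. Below break-even, running the draft and verifying its proposals costs more than it saves.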
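To see how the break-even moves, here is a small sketch of the expected-speedup estimate from Leviathan et al. (2023). It rests on simplifying assumptions: each drafted token is accepted independently with probability `alpha`, one verification pass costs the same as one ordinary target step, and `c` (an assumed parameter) is the cost of one draft pass as a fraction of one target pass.

```python
def expected_tokens_per_round(alpha: float, k: int) -> float:
    """Expected tokens emitted per draft-and-verify round (between 1 and k + 1)."""
    if alpha >= 1.0:
        return float(k + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def expected_speedup(alpha: float, k: int, c: float) -> float:
    """Speedup over plain decoding; values below 1.0 mean a slowdown."""
    # One round costs k draft passes (k * c) plus one target pass (1).
    return expected_tokens_per_round(alpha, k) / (k * c + 1.0)


if __name__ == "__main__":
    for alpha in (0.3, 0.5, 0.6, 0.8):
        print(alpha, round(expected_speedup(alpha, k=4, c=0.1), 2))
```

Under these assumptions a cheap draft breaks even at a lower acceptance rate than an expensive one, so any single threshold is best treated as a starting point for measurement.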

High acceptance requires predictable output: long-form generation, coding completions, structured text. Tasks where the next token is, in context, fairly obvious. The draft model finds its footing, proposes confidently, and the target mostly agrees.

Tasks with high entropy — open-ended reasoning, creative generation, ambiguous instruction — produce low acceptance rates. Speculative decoding helps less here.

The architecture implication

Speculative decoding tends to conflict with high-concurrency serving. Continuous batching, one of the techniques that makes vLLM fast at scale, fills idle GPU compute with other requests. Speculative decoding wants that same idle compute to verify drafted tokens. The two compete, and once the batch is large enough to saturate the GPU, verification is no longer close to free.
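A crude roofline-style model makes the competition concrete. Everything here is an assumption for illustration: the hardware numbers are hypothetical, and the model ignores KV-cache traffic, attention cost, and the draft model entirely. It only asks how much more a verification pass costs than a plain decode step at a given batch size.

```python
# Hypothetical hardware and model figures, chosen only for illustration.
WEIGHT_BYTES = 14e9          # e.g. a 7B-parameter model in 16-bit weights
MEM_BANDWIDTH = 1e12         # bytes per second
FLOPS_PER_TOKEN = 1.4e10     # roughly 2 FLOPs per parameter per token
PEAK_FLOPS = 3e14            # FLOPs per second


def pass_time(tokens_in_pass: int) -> float:
    """Time of one forward pass: whichever of memory or compute is the bottleneck."""
    memory_time = WEIGHT_BYTES / MEM_BANDWIDTH
    compute_time = tokens_in_pass * FLOPS_PER_TOKEN / PEAK_FLOPS
    return max(memory_time, compute_time)


def verify_cost_ratio(batch: int, k: int) -> float:
    """Cost of verifying k drafted tokens per sequence, relative to a plain decode step."""
    plain = pass_time(batch)              # one token per sequence
    verify = pass_time(batch * (k + 1))   # k drafted tokens plus one per sequence
    return verify / plain


if __name__ == "__main__":
    for batch in (1, 16, 64, 256):
        print(batch, round(verify_cost_ratio(batch, k=4), 2))
```

In this toy model the ratio stays at 1.0 while the pass is memory-bound and climbs once the batch pushes it into the compute-bound regime, which is the point at which speculation stops paying for itself.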

This makes speculative decoding primarily a single-user or low-concurrency tool. It fits local inference (llama.cpp, Ollama, self-hosted on a personal server) far better than it fits a production API endpoint serving dozens of users at once.

Variants worth knowing

EAGLE-3: trains a draft head on the target model's own hidden states, achieving 60–66% first-token acceptance. Higher quality drafts, no separate model.
n-gram / prompt-lookup: reuses spans from the input prompt as draft tokens. No draft model and negligible drafting cost, though the target must still verify the proposals. Excellent for RAG and document Q&A where the answer echoes the source.
Medusa: adds multiple decoding heads to the target model itself. No separate draft model. Requires training the extra heads, optionally alongside fine-tuning the backbone.
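Of the variants above, prompt-lookup is simple enough to sketch in a few lines. This is an illustrative version only: the function name, the default `ngram` and `k` values, and the linear scan are assumptions, and production implementations differ in how they index and rank matches.

```python
from typing import List


def prompt_lookup_draft(tokens: List[int], ngram: int = 3, k: int = 5) -> List[int]:
    """Propose up to k draft tokens by finding the last `ngram` tokens earlier in the text.

    `tokens` is the prompt followed by everything generated so far.
    Returns an empty list when there is no earlier match.
    """
    if len(tokens) <= ngram:
        return []
    tail = tokens[-ngram:]
    # Scan right to left so the most recent earlier occurrence wins.
    for start in range(len(tokens) - ngram - 1, -1, -1):
        if tokens[start:start + ngram] == tail:
            # Draft whatever followed that occurrence last time.
            return tokens[start + ngram:start + ngram + k]
    return []


if __name__ == "__main__":
    history = [5, 6, 7, 8, 9, 1, 2, 5, 6, 7]
    print(prompt_lookup_draft(history, ngram=3, k=2))
```

The proposals are then verified by the target exactly like tokens from a draft model, so a bad match costs a wasted verification slot but never changes the output.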

The practical takeaway

If you run a local inference stack for single-user workloads — coding assistant, document Q&A, long-form generation — speculative decoding is worth enabling. Check your engine's docs; llama.cpp and vLLM both support it natively.

Measure your acceptance rate. If it sits above 0.6, you will see real gains. If it sits below 0.5, move on.
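If your engine does not report an acceptance rate directly, it can be derived from two counters per round. The sketch below uses a hypothetical `AcceptanceStats` helper; the names are not taken from any engine. Note that this accepted-over-proposed ratio is the simple figure most tools log, and it is not identical to the per-token acceptance probability used in the theory, because tokens after the first rejection are never evaluated.

```python
from dataclasses import dataclass


@dataclass
class AcceptanceStats:
    proposed: int = 0   # draft tokens offered to the target
    accepted: int = 0   # draft tokens the target kept
    rounds: int = 0     # draft-and-verify rounds observed

    def record(self, n_proposed: int, n_accepted: int) -> None:
        self.proposed += n_proposed
        self.accepted += n_accepted
        self.rounds += 1

    @property
    def acceptance_rate(self) -> float:
        return self.accepted / self.proposed if self.proposed else 0.0

    @property
    def tokens_per_round(self) -> float:
        # Every round also emits one target token (correction or bonus).
        return (self.accepted + self.rounds) / self.rounds if self.rounds else 0.0


if __name__ == "__main__":
    stats = AcceptanceStats()
    for n_accepted in (4, 1, 2):
        stats.record(n_proposed=4, n_accepted=n_accepted)
    print(stats.acceptance_rate, stats.tokens_per_round)
```

Tokens per round is often the more useful number, since it maps directly onto how many target passes were saved.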

The free lunch exists. It just requires the right menu.

References

Leviathan, Y., Kalman, M., & Matias, Y. (2023). "Fast Inference from Transformers via Speculative Decoding." ICML 2023, Proceedings of Machine Learning Research, Vol. 202, pp. 19274-19286. https://arxiv.org/abs/2211.17192
Li, Y., Wei, F., Zhang, C., & Zhang, H. (2025). "EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test." https://arxiv.org/abs/2503.01840
Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C.H., Gonzalez, J.E., Zhang, H., & Stoica, I. (2023). "Efficient Memory Management for Large Language Model Serving with PagedAttention." ACM SOSP 2023. https://arxiv.org/abs/2309.06180
Cai, T., Li, Y., Geng, Z., Peng, H., Lee, J.D., Chen, D., & Dao, T. (2024). "Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads." https://arxiv.org/abs/2401.10774