
Speculative Decoding: The Free Tokens

inspiration | devinfo.dev | May 25, 2026 | devinfo.dev:2026.0007

Speculative decoding can cut inference latency by 2–3x without changing a single output token. The gain is real. So is the catch.

Decoding is slow because it is sequential. The model generates one token, feeds it back, generates the next. Every token requires a full forward pass. You cannot parallelize across the output sequence.

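Speculative decoding works around this constraint, not by changing the model, but by exploiting idle GPU compute. Single-stream decoding is limited by memory bandwidth: the GPU spends most of each step loading weights, and its arithmetic units sit largely unused.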

The mechanism

A small, fast draft model proposes K tokens ahead. The large target model then verifies all K in a single forward pass. Tokens the target accepts are kept. The first rejected token is replaced with the target's own choice, and every draft token after it is discarded. Then the process repeats.
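The loop above can be sketched in a few lines for the greedy case. This is a toy illustration, not any engine's implementation: the "models" are hypothetical functions that map a token list to the next token, and the per-prefix verification loop stands in for what a real engine would do in one batched forward pass.

```python
from typing import Callable, List

# Hypothetical stand-in for a model: token sequence -> greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], max_new: int, k: int) -> List[int]:
    tokens = list(prompt)
    goal = len(prompt) + max_new
    while len(tokens) < goal:
        # 1. The draft proposes k tokens autoregressively (cheap).
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))

        # 2. The target scores every prefix. A real engine does this in ONE
        #    batched forward pass; this loop only simulates the result.
        verified = [target(tokens + proposal[:i]) for i in range(k + 1)]

        # 3. Keep the longest prefix on which draft and target agree.
        n_ok = 0
        while n_ok < k and proposal[n_ok] == verified[n_ok]:
            n_ok += 1
        tokens.extend(proposal[:n_ok])

        # 4. Append the target's own token: the correction at the first
        #    mismatch, or an extra token if all k proposals were accepted.
        tokens.append(verified[n_ok])
    return tokens[:goal]


def greedy_decode(target: NextToken, prompt: List[int], max_new: int) -> List[int]:
    tokens = list(prompt)
    for _ in range(max_new):
        tokens.append(target(tokens))
    return tokens


# Toy models (assumptions for the demo): the draft agrees with the target
# except when the sequence length is a multiple of 4.
def toy_target(seq: List[int]) -> int:
    return (sum(seq) * 31 + len(seq)) % 50


def toy_draft(seq: List[int]) -> int:
    return toy_target(seq) if len(seq) % 4 else 0
```

Every token that ends up in the output is one the target itself would have produced for that prefix, which is why the result matches plain greedy decoding regardless of how good or bad the draft is.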

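The key insight: verifying K proposed tokens in one target pass costs about the same as generating a single token, and far less than K separate generation passes. In the best case, when every proposal is accepted, you get K tokens (plus one extra token from the verification pass itself) for roughly the cost of one target pass and K cheap draft passes. That is the free lunch.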
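A rough way to quantify the lunch, under a simplifying assumption not made in the text above: suppose each draft token is accepted independently with the same probability α. The number of tokens produced per target pass is then the accepted prefix plus the one token the target supplies itself (a correction or an extra token), giving

```latex
\mathbb{E}[\text{tokens per target pass}]
  = \sum_{i=0}^{K} \alpha^{i}
  = \frac{1 - \alpha^{K+1}}{1 - \alpha},
  \qquad 0 \le \alpha < 1 .
```

For example, with α = 0.8 and K = 4 this gives (1 − 0.8⁵) / 0.2 ≈ 3.36 tokens per target pass, against a ceiling of K + 1 = 5. Real acceptance is not independent across positions, so treat this as an estimate rather than a prediction.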

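Under greedy decoding, the output is identical, token for token, to what the target model would produce alone. Under sampling, the standard accept/reject rule preserves the target model's output distribution exactly. No quality tradeoff. Just speed.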
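For sampled (non-greedy) decoding, the guarantee is distributional rather than token for token. A minimal sketch of the usual accept-or-resample rule follows, assuming the draft and target distributions for one position are available as plain dictionaries; the function names are illustrative, not taken from any library.

```python
import random
from typing import Dict


def accept_or_resample(token: str, p_target: Dict[str, float],
                       q_draft: Dict[str, float], rng: random.Random) -> str:
    """Decide the fate of one draft token that was sampled from q_draft."""
    p = p_target.get(token, 0.0)
    q = q_draft[token]  # > 0, because the token was drawn from q_draft
    # Accept with probability min(1, p/q).
    if rng.random() < min(1.0, p / q):
        return token
    # Rejected: resample from the residual max(0, p - q), renormalized.
    residual = {t: max(0.0, p_target[t] - q_draft.get(t, 0.0)) for t in p_target}
    total = sum(residual.values())
    candidates = list(residual)
    weights = [residual[t] / total for t in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


def speculative_sample(p_target: Dict[str, float], q_draft: Dict[str, float],
                       rng: random.Random) -> str:
    """Draw one token: propose from the draft, then accept or resample."""
    proposed = rng.choices(list(q_draft), weights=list(q_draft.values()), k=1)[0]
    return accept_or_resample(proposed, p_target, q_draft, rng)
```

The probability of emitting a token x is min(p(x), q(x)) from the accept branch plus max(0, p(x) − q(x)) from the resample branch, which sums to p(x). The draft only affects how often the fast path is taken, not what comes out.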

When it works

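As a rule of thumb, the mechanism only pays off when the draft model's acceptance rate exceeds roughly 0.55–0.60. Below that, the overhead of running the draft and verifying its proposals costs more than it saves. The exact break-even depends on how cheap the draft is relative to the target and on how many tokens it proposes per step.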
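To see how the break-even moves, here is a back-of-the-envelope cost model. It is a sketch under stated assumptions: acceptance is independent per token with probability `alpha`, `draft_cost` is the cost of one draft pass as a fraction of one target pass, and `verify_cost` is the cost of the K-token verification pass relative to a single-token target pass. All three parameters are hypothetical knobs, not measured values.

```python
def expected_tokens(alpha: float, k: int) -> float:
    """Expected tokens emitted per draft-then-verify cycle."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def speedup(alpha: float, k: int, draft_cost: float,
            verify_cost: float = 1.0) -> float:
    """Tokens per unit cost, relative to plain decoding (1 token per pass)."""
    cycle_cost = k * draft_cost + verify_cost  # in units of one target pass
    return expected_tokens(alpha, k) / cycle_cost


def break_even_alpha(k: int, draft_cost: float, verify_cost: float = 1.0,
                     tol: float = 1e-6) -> float:
    """Smallest alpha with speedup >= 1, found by bisection.

    Assumes the cycle is worthwhile at alpha = 1, i.e.
    k * draft_cost + verify_cost < k + 1.
    """
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if speedup(mid, k, draft_cost, verify_cost) < 1.0:
            lo = mid
        else:
            hi = mid
    return hi
```

Calling `break_even_alpha(4, 0.1)` and then `break_even_alpha(4, 0.3)` shows the threshold rising as the draft gets more expensive. This idealized model ignores scheduling and memory overheads, so thresholds observed in practice tend to sit higher than what it returns.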

High acceptance requires predictable output: long-form generation, coding completions, structured text. Tasks where the next token is, in context, fairly obvious. The draft model finds its footing, proposes confidently, and the target mostly agrees.

Tasks with high entropy — open-ended reasoning, creative generation, ambiguous instruction — produce low acceptance rates. Speculative decoding helps less here.

The architecture implication

Speculative decoding conflicts with high-concurrency serving. Continuous batching, the technique that makes vLLM fast at scale, fills idle GPU compute with other requests. Speculative decoding wants that same idle compute for draft verification. The two compete.

This means speculative decoding is a single-user or low-concurrency tool. It fits local inference — llama.cpp, Ollama, self-hosted on a personal server — far better than it fits a production API endpoint serving dozens of users.

Variants worth knowing

The practical takeaway

If you run a local inference stack for single-user workloads — coding assistant, document Q&A, long-form generation — speculative decoding is worth enabling. Check your engine's docs; llama.cpp and vLLM both support it natively.

Measure your acceptance rate. If it sits above 0.6, you will see real gains. If it sits below 0.5, move on.
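If your engine exposes per-step counts of proposed and accepted draft tokens, the check is a few lines. The input format and function names below are hypothetical; how you obtain the counts depends on your engine's own logs or metrics. The "benchmark" verdict for the 0.5–0.6 band is an assumption, since the thresholds above leave that range open.

```python
from typing import Iterable, Tuple


def acceptance_rate(cycles: Iterable[Tuple[int, int]]) -> float:
    """cycles: (tokens proposed, tokens accepted) per draft-verify step."""
    proposed = accepted = 0
    for n_proposed, n_accepted in cycles:
        proposed += n_proposed
        accepted += n_accepted
    return accepted / proposed if proposed else 0.0


def verdict(rate: float) -> str:
    if rate > 0.6:
        return "enable"      # real gains expected
    if rate < 0.5:
        return "skip"        # overhead likely outweighs the benefit
    return "benchmark"       # in between: measure end-to-end latency


# Example with made-up counts: three steps of 4 proposals each.
rate = acceptance_rate([(4, 3), (4, 4), (4, 1)])
decision = verdict(rate)
```

Measure on prompts that look like your actual workload; acceptance on a coding task says little about acceptance on open-ended chat.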

The free lunch exists. It just requires the right menu.

References