Autoregressive decoding is sequential by construction. The model generates one token, feeds it back as input, and generates the next. Every token requires a full forward pass. That is the latency floor — and it is a hardware problem, not a model problem.
A large model serving a single request is memory-bandwidth-bound. It moves billions of parameters from VRAM to the compute units for each token it generates, and that movement takes time regardless of how fast the arithmetic hardware runs. Adding more compute alone does not speed up autoregressive decoding.
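A back-of-the-envelope sketch of that floor, assuming every weight is read from memory once per generated token and ignoring KV-cache traffic and compute time. The model size, weight precision, and bandwidth below are hypothetical numbers chosen for illustration, not measurements.

```python
def min_seconds_per_token(num_params: float, bytes_per_param: float,
                          bandwidth_bytes_per_s: float) -> float:
    """Lower bound on per-token latency if all weights are read once per token."""
    return num_params * bytes_per_param / bandwidth_bytes_per_s


# Hypothetical setup: 70B parameters, 16-bit weights, 2 TB/s memory bandwidth.
floor = min_seconds_per_token(70e9, 2, 2e12)
print(f"{floor * 1e3:.0f} ms per token, at most {1 / floor:.1f} tokens/s")
```

Under these assumed numbers the bound is about 70 ms per token. Faster arithmetic units do not change it, because the formula contains no compute term at all.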
Speculative decoding attacks the latency floor from a different angle.
A small, fast draft model generates a sequence of candidate tokens, typically 4 to 8. The large model then verifies all of them in a single forward pass. Tokens that match what the large model would have generated are accepted. At the first mismatch, the rejected token and every draft token after it are discarded, the large model's own token takes that position, and decoding continues from there.
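The draft-then-verify loop can be sketched for the greedy case as follows. This is a toy under stated assumptions, not a production implementation: the two "models" are hypothetical functions that map a token list to the next token, and the verification loop calls the target once per position where a real system would score all drafted positions in one batched forward pass.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # maps a token sequence to the next token


def speculative_step(prefix: List[int], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[int]:
    """Return the tokens committed by one draft-then-verify round (greedy)."""
    # 1. The draft model proposes k tokens autoregressively.
    draft: List[int] = []
    for _ in range(k):
        draft.append(draft_next(prefix + draft))

    # 2. The target model checks each drafted position in order.
    committed: List[int] = []
    for tok in draft:
        expected = target_next(prefix + committed)
        if tok != expected:
            committed.append(expected)  # first mismatch: keep the target's token
            return committed            # later draft tokens are dropped
        committed.append(tok)

    # 3. All k drafts matched, so the same pass also yields one extra token.
    committed.append(target_next(prefix + committed))
    return committed


def generate(prefix: List[int], draft_next: NextToken, target_next: NextToken,
             k: int, max_new: int) -> Tuple[List[int], int]:
    out, rounds = list(prefix), 0
    while len(out) - len(prefix) < max_new:
        out += speculative_step(out, draft_next, target_next, k)
        rounds += 1
    return out[:len(prefix) + max_new], rounds


# Toy models (hypothetical): the target counts upward mod 10; the draft agrees
# except right after token 6, where it guesses wrong.
def target(tokens: List[int]) -> int:
    return (tokens[-1] + 1) % 10


def draft(tokens: List[int]) -> int:
    return 0 if tokens[-1] == 6 else (tokens[-1] + 1) % 10


tokens, rounds = generate([0], draft, target, k=4, max_new=12)
print(tokens, rounds)
```

In this toy run the 12 new tokens are identical to what the target alone would produce, but they are committed in 3 verification rounds instead of 12 sequential target steps.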
The key insight: the large model's forward pass over k tokens costs only marginally more than a single-token pass. The model is memory-bandwidth-bound, not compute-bound. Verifying 5 tokens in one pass is nearly as cheap as verifying 1.
When the draft model's acceptance rate is high — meaning its predictions frequently match the large model's — the system generates multiple tokens per large-model pass. Throughput scales with acceptance rate.
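That scaling can be written down under the simplifying assumption used by Leviathan et al. (2023): each draft token is accepted independently with probability α. The symbols α (acceptance rate) and γ (number of drafted tokens per round) are notation introduced here for illustration. The expected number of tokens produced per large-model pass is then:

```latex
\mathbb{E}[\text{tokens per pass}] = \frac{1 - \alpha^{\gamma + 1}}{1 - \alpha}
```

For example, α = 0.8 and γ = 5 gives (1 − 0.8^6) / 0.2 ≈ 3.69 tokens per pass, against exactly 1 for plain autoregressive decoding. At α = 0 the expression collapses to 1: every draft is rejected and only the large model's own token is kept.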
Leviathan et al. (2023) demonstrated 2.3–3.4x speedups on summarization and translation tasks using T5-small as the draft model for T5-XXL. Chen et al. (2023) at DeepMind achieved roughly 2.5x speedup on HumanEval code generation, with the distribution of outputs provably identical to standard autoregressive decoding. SpecTr (Sun et al., 2023), a multi-draft variant that proposes several candidate sequences at once and selects among them via optimal transport, reported a 2.13x wall-clock speedup over autoregressive decoding, a further 1.37x over standard speculative decoding.
Critically: the output distribution is mathematically unchanged. Speculative decoding is lossless acceleration. It is not an approximation.
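A small numerical check of that claim for the sampling case, using the acceptance rule described by Leviathan et al. (2023) and Chen et al. (2023): a token x drawn from the draft distribution q is accepted with probability min(1, p(x)/q(x)), and on rejection a replacement is drawn from the normalized residual max(0, p − q). The function name and the example distributions are hypothetical; the sketch computes the resulting distribution exactly and does not sample.

```python
from typing import List


def speculative_output_distribution(p: List[float], q: List[float]) -> List[float]:
    """Exact distribution of the token emitted at one position.

    p: target-model probabilities, q: draft-model probabilities.
    """
    # P(x is drafted and accepted) = q(x) * min(1, p(x) / q(x)) = min(p(x), q(x))
    accepted = [min(pi, qi) for pi, qi in zip(p, q)]
    reject_prob = 1.0 - sum(accepted)

    # On rejection, resample from the normalized residual max(0, p - q).
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(residual)
    if total == 0.0:  # p == q: nothing is ever rejected
        return accepted
    return [a + reject_prob * r / total for a, r in zip(accepted, residual)]


p = [0.5, 0.3, 0.2]  # hypothetical target distribution
q = [0.2, 0.2, 0.6]  # hypothetical, poorly matched draft distribution
print(speculative_output_distribution(p, q))
```

Even with a badly matched draft, the emitted distribution equals p up to floating-point rounding. A poor draft costs speed, not correctness.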
The draft model must satisfy two properties simultaneously: it must be fast, and its predictions must frequently match the large model's.
Speed is obvious — if the draft model is nearly as slow as the target, the savings evaporate. The draft model should be at least an order of magnitude smaller.
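To see how quickly the savings evaporate, here is a sketch of the expected wall-clock improvement factor from Leviathan et al. (2023), assuming independent acceptance with rate `alpha` (strictly below 1), `gamma` drafted tokens per round, and a cost ratio `c` defined as the time of one draft-model step divided by the time of one target-model pass. The parameter names and example values are illustrative assumptions.

```python
def expected_speedup(alpha: float, gamma: int, c: float) -> float:
    """Expected wall-clock speedup over plain autoregressive decoding.

    alpha: per-token acceptance rate, 0 <= alpha < 1
    gamma: number of tokens drafted per round
    c:     draft step time / target pass time
    """
    tokens_per_round = (1 - alpha ** (gamma + 1)) / (1 - alpha)
    cost_per_round = gamma * c + 1  # gamma draft steps plus one target pass
    return tokens_per_round / cost_per_round


for c in (0.01, 0.05, 0.3, 0.9):
    print(f"c={c:<4} speedup={expected_speedup(0.8, 5, c):.2f}x")
```

With an acceptance rate of 0.8 held fixed, a draft that costs 5% of a target pass yields close to a 3x gain, while a draft that costs 90% of a target pass makes decoding slower than not speculating at all.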
Acceptance rate is less obvious. It depends on the similarity of the draft and target models' distributions. Draft models trained as smaller versions of the same architecture from the same data family work best. A completely unrelated model will have low acceptance rate and deliver poor speedup.
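For a single position, "similarity" has a concrete form under the standard accept/reject rule min(1, p(x)/q(x)): the acceptance probability is the overlap between the two distributions, the sum over tokens of min(p(x), q(x)), which equals one minus their total variation distance. The sketch below uses made-up three-token distributions to compare a closely matched draft with an unrelated one.

```python
from typing import List


def acceptance_rate(p: List[float], q: List[float]) -> float:
    """Probability that a token drawn from draft q is accepted by target p."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))


target_probs = [0.6, 0.3, 0.1]       # hypothetical target distribution
close_draft = [0.5, 0.35, 0.15]      # hypothetical draft from the same family
unrelated_draft = [0.1, 0.1, 0.8]    # hypothetical unrelated draft

print(acceptance_rate(target_probs, close_draft))
print(acceptance_rate(target_probs, unrelated_draft))
```

The closely matched draft is accepted about 90% of the time and the unrelated one about 30% of the time, which is the gap between a useful speedup and almost none.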
This is why model families that span sizes while sharing a tokenizer and training lineage, such as Llama 3.2 1B alongside Llama 3.1 70B, can realize speculative decoding gains with little extra work.
A variant eliminates the external draft model entirely. Zhang et al. (2023) showed that a model can draft using a subset of its own layers — skipping selected intermediate layers during the draft phase — then verify with the full network. This delivers up to 1.99x speedup on LLaMA-2 with no additional model to load or maintain.
The tradeoff: lower acceptance rate than a dedicated external draft model, but zero memory overhead and no need to source a compatible draft.
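The layer-skipping idea can be illustrated with a deliberately tiny toy. It assumes a network built from residual blocks, so that a skipped block behaves as the identity, and it uses scalars in place of hidden-state vectors. The block weights and the choice of which blocks to skip are hypothetical; this sketch makes no attempt to reproduce how Zhang et al. (2023) choose the skipped layers.

```python
from typing import Callable, FrozenSet, List

Block = Callable[[float], float]


def run(blocks: List[Block], x: float,
        skip: FrozenSet[int] = frozenset()) -> float:
    """Apply residual blocks in order; a skipped block acts as the identity."""
    for i, block in enumerate(blocks):
        if i not in skip:
            x = x + block(x)  # residual connection
    return x


# Hypothetical 6-block "network": odd-numbered blocks contribute little.
scales = (0.10, 0.01, 0.10, 0.01, 0.10, 0.01)
blocks: List[Block] = [lambda x, s=s: s * x for s in scales]

full = run(blocks, 1.0)                               # verification pass
draft = run(blocks, 1.0, skip=frozenset({1, 3, 5}))   # draft pass, half the blocks
print(full, draft)
```

The draft pass runs half the blocks and lands within a few percent of the full output, using the same weights throughout. How close the two stay in a real model depends on which layers are skipped.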
Speculative decoding reframes latency optimization. The question is no longer "how do I make the large model faster?" It is "how do I find a small model whose predictions closely match the large one?"
That is a data and training problem, not a hardware problem. And that is a more tractable problem.
The fast path to lower latency is not a better GPU. It is a better draft model.
1. Leviathan, Y., Kalman, M., & Matias, Y. (2023). Fast Inference from Transformers via Speculative Decoding. Proceedings of the 40th International Conference on Machine Learning (ICML 2023). https://proceedings.mlr.press/v202/leviathan23a/leviathan23a.pdf
2. Chen, C., Borgeaud, S., Irving, G., Lespiau, J.-B., Sifre, L., & Jumper, J. (2023). Accelerating Large Language Model Decoding with Speculative Sampling. arXiv:2302.01318. https://arxiv.org/abs/2302.01318
3. Xia, H., Ge, T., Wang, P., Chen, S.-Q., Wei, F., & Sui, Z. (2022). Speculative Decoding: Exploiting Speculative Execution for Accelerating Seq2seq Generation. arXiv:2203.16487. https://arxiv.org/abs/2203.16487
4. Zhang, J., Wang, J., Li, H., Shou, L., Chen, K., Chen, G., & Mehrotra, S. (2023). Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding. arXiv:2309.08168. https://arxiv.org/abs/2309.08168
5. Sun, Z., Suresh, A. T., Ro, J. H., Beirami, A., Jain, H., & Yu, F. (2023). SpecTr: Fast Speculative Decoding via Optimal Transport. NeurIPS 2023. https://papers.nips.cc/paper_files/paper/2023/file/6034a661584af6c28fd97a6f23e56c0a-Paper-Conference.pdf