#speculative-decoding
inspiration
The Draft Model Does the Work
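Speculative decoding uses a small draft model to propose several tokens ahead and a large model to verify them in parallel. The large model runs once per block of drafted tokens, not once per token. That single change converts a sequential bottleneck into a parallel verification step. Speedups are typically in the 2–3x range and depend on how often the drafted tokens are accepted. With an exact acceptance rule, the output matches what the large model would have produced on its own, so the speedup costs nothing in quality.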
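The propose-then-verify loop can be sketched in a few lines. This is a minimal illustration of the greedy variant only, not a production implementation: `target_next` and `draft_next` are hypothetical toy stand-ins for the large and draft models, and the list comprehension that scores every drafted position stands in for what a real model would do in a single batched forward pass. Sampling-based decoding would replace the exact-match check with a rejection-sampling acceptance rule.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]


def target_next(prefix: List[Token]) -> Token:
    """Hypothetical toy 'large' model: deterministic next token."""
    return (sum(prefix) * 31 + len(prefix)) % 50


def draft_next(prefix: List[Token]) -> Token:
    """Hypothetical toy 'draft' model: agrees with the target
    except when the prefix length is a multiple of 4."""
    guess = target_next(prefix)
    return guess if len(prefix) % 4 else (guess + 1) % 50


def greedy_decode(model: NextToken, prompt: List[Token], n_new: int) -> List[Token]:
    """Baseline: one model call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(model(seq))
    return seq


def speculative_decode(
    target: NextToken,
    draft: NextToken,
    prompt: List[Token],
    n_new: int,
    k: int = 4,
) -> Tuple[List[Token], int]:
    """Greedy speculative decoding; returns (tokens, verification passes)."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(seq) < goal:
        # 1. The draft model proposes k tokens, one after another.
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(seq + proposal))

        # 2. The target scores all k + 1 positions. A real model would
        #    compute these in one forward pass; here we simulate it.
        verified = [target(seq + proposal[:i]) for i in range(k + 1)]
        target_passes += 1

        # 3. Accept the longest prefix on which draft and target agree.
        n_ok = 0
        while n_ok < k and proposal[n_ok] == verified[n_ok]:
            n_ok += 1

        # 4. Keep the accepted tokens plus one token from the target:
        #    its correction at the first mismatch, or a bonus token
        #    when every drafted token was accepted.
        seq.extend(proposal[:n_ok])
        seq.append(verified[n_ok])
    return seq[:goal], target_passes


if __name__ == "__main__":
    prompt = [1, 2, 3]
    tokens, passes = speculative_decode(target_next, draft_next, prompt, n_new=20)
    print("same as target-only decoding:", tokens == greedy_decode(target_next, prompt, 20))
    print("target passes:", passes, "for 20 new tokens")
```

Every iteration appends at least one token chosen by the target, so the loop never needs more verification passes than plain decoding needs model calls, and the result is identical to decoding with the target alone. The better the draft agrees with the target, the more tokens each pass yields.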