#speculative-decoding

inspiration
The Draft Model Does the Work

Speculative decoding uses a small draft model to propose several tokens and a large model to verify them in parallel. The large model runs once per block of drafted tokens, not once per token. That single change converts a sequential bottleneck into a parallel verification step, and it can deliver a 2–3x latency reduction at zero quality cost, because every token that is kept is one the large model itself accepts.
July 4, 2026
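
The propose-then-verify loop described above can be sketched in a few lines. This is a minimal illustration and not a production implementation. It assumes greedy decoding, so "verify" simply means checking that each drafted token equals the large model's own top choice; sampling-based variants use a probabilistic acceptance rule instead. The names `speculative_decode`, `greedy_decode`, `toy_target`, `toy_draft`, and the block size `k` are hypothetical, and the "models" are toy functions over integer tokens.

```python
from typing import Callable, List, Tuple

# Hypothetical model interface: maps a token prefix to its greedy next token.
Model = Callable[[List[int]], int]


def greedy_decode(model: Model, prompt: List[int], max_new: int) -> List[int]:
    """Baseline: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(
    target: Model, draft: Model, prompt: List[int], max_new: int, k: int = 4
) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (tokens, number of verification passes)."""
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < max_new:
        # 1. The draft model proposes k tokens, one after another (cheap).
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))

        # 2. The target model scores all k + 1 positions. A real system does
        #    this in ONE batched forward pass; this toy loop only simulates
        #    that, so it is counted as a single pass.
        target_passes += 1
        verified = [target(tokens + proposal[:i]) for i in range(k + 1)]

        # 3. Keep the longest prefix where draft and target agree, then append
        #    the target's own token (a correction, or a bonus token if all k matched).
        n = 0
        while n < k and proposal[n] == verified[n]:
            n += 1
        tokens.extend(proposal[:n])
        tokens.append(verified[n])

    return tokens[: len(prompt) + max_new], target_passes


def toy_target(prefix: List[int]) -> int:
    """Stand-in for the large model (assumption: a simple deterministic rule)."""
    return (prefix[-1] * 3 + 1) % 17


def toy_draft(prefix: List[int]) -> int:
    """Stand-in for the draft model: agrees with the target most of the time."""
    return 0 if prefix[-1] % 5 == 0 else toy_target(prefix)


if __name__ == "__main__":
    out, passes = speculative_decode(toy_target, toy_draft, [1], max_new=12, k=4)
    print(out == greedy_decode(toy_target, [1], 12), passes)
```

In this sketch the output is identical to plain greedy decoding with the target model, since a drafted token is kept only when the target would have produced it anyway. The saving comes from needing fewer target passes than generated tokens; how many fewer depends on how often the draft model agrees.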
