#latency — devinfo.dev

inspiration
Speculative Decoding Is a Bet on the Draft

Speculative decoding makes a large model generate faster by letting a small model guess ahead. It is lossless — the output is identical to decoding from the large model alone. But the entire speedup is a function of how often the draft is right, which makes the technique only as good as the match between your draft and target models.
July 15, 2026
inspiration
The Draft Model Does the Work

Speculative decoding uses a small draft model to propose tokens and a large model to verify them in parallel. The large model runs once per batch, not once per token. That single change converts a sequential bottleneck into a parallel verification step — and delivers 2–3x latency reduction at zero quality cost.
July 4, 2026
inspiration
Speculative Decoding: The Free Tokens

Speculative decoding cuts inference latency 2–3x without changing a single output token. The gain is real. So is the catch.
May 25, 2026