Speculative decoding is an inference technique for large language models where a fast “draft” model proposes multiple next tokens and a slower, higher-quality “target” model verifies or corrects them, increasing throughput and reducing latency without changing the final output distribution.
What is speculative decoding?
During standard autoregressive decoding, an LLM generates one token at a time, and each step requires a full forward pass. Speculative decoding speeds this up by using two models (or two modes of the same model). First, a small draft model generates a short candidate sequence (e.g., 4–20 tokens). Then the target model verifies all of those positions in a single forward pass. If the target model agrees with the draft for the first k tokens, those k tokens are accepted in bulk; at the first disagreement, the process falls back to the target model’s preferred token and discards the rest of the draft. Because verification can accept multiple tokens per target-model call, the expensive model is invoked fewer times, improving tokens/second and end-to-end generation latency.
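The draft-and-verify loop can be sketched in Python for the simple greedy case. Here `draft_next` and `target_next` are hypothetical stand-ins for real models, each mapping a token sequence to its greedy next token; real systems verify all drafted positions in one batched forward pass rather than a loop:

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """One round of greedy speculative decoding (illustrative sketch).

    Returns the tokens accepted this round: all drafted tokens up to the
    first divergence, the target's own token at that point, and a bonus
    target token if every drafted token matched.
    """
    # 1. Draft: the cheap model proposes k tokens autoregressively.
    drafted = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        drafted.append(t)
        ctx.append(t)

    # 2. Verify: the target model checks each drafted position.
    #    (In a real system this is ONE batched forward pass over all
    #    k positions, which is where the speedup comes from.)
    accepted = []
    ctx = list(prefix)
    for t in drafted:
        correct = target_next(ctx)
        if t == correct:          # agreement: accept the drafted token
            accepted.append(t)
            ctx.append(t)
        else:                     # divergence: take the target's token, stop
            accepted.append(correct)
            return accepted
    # All k drafted tokens matched; the verification pass also yields
    # the target's next token for free.
    accepted.append(target_next(ctx))
    return accepted
```

Note the invariant: the output is always exactly what sequential decoding with the target model alone would have produced; the draft model only changes how many target calls that takes.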
Where it’s used and why it matters
Speculative decoding is used in high-traffic LLM serving (chat, coding copilots, summarization) where inference cost dominates. It matters because it improves latency and reduces GPU spend while preserving output quality, since the target model remains the authority over every emitted token. Gains are largest when decoding is memory-bandwidth-bound (small to moderate batch sizes), and it can be combined with KV-cache optimizations, quantization, and paged attention to maximize serving efficiency.
Types / variants
- Two-model speculative decoding: separate draft and target models.
- Self-speculation: one model in a cheaper mode (e.g., fewer layers) drafts, full mode verifies.
- Lookahead decoding: a related draft-free approach in which the model itself generates and verifies multiple candidate n-grams in parallel, with no separate draft model.
FAQs
Does speculative decoding change model accuracy?
If implemented correctly, it preserves the target model’s distribution; quality should match baseline decoding.
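The guarantee comes from a modified rejection-sampling acceptance rule: accept a drafted token with probability min(1, p/q), where p and q are the target and draft probabilities, and on rejection resample from the renormalized residual max(0, p − q). A minimal sketch under those definitions, with dict-based probability tables standing in for real model outputs:

```python
import random

def sample_token(drafted_token, p_target, q_draft, vocab):
    """Accept or reject one drafted token so that the result is
    distributed exactly according to the target model.

    p_target, q_draft: dicts mapping token -> probability under the
    target and draft models for the current position (illustrative
    stand-ins for real model logits).
    """
    p = p_target[drafted_token]
    q = q_draft[drafted_token]
    if random.random() < min(1.0, p / q):
        return drafted_token, True   # accepted as-is
    # Rejected: resample from the residual distribution max(0, p - q),
    # renormalized. This correction step is what makes the overall
    # output distribution match the target model exactly.
    residual = [max(0.0, p_target[t] - q_draft[t]) for t in vocab]
    resampled = random.choices(vocab, weights=residual, k=1)[0]
    return resampled, False
```

Averaged over many positions, tokens drawn this way are distributed as if the target model had sampled them directly, regardless of how good or bad the draft model is.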
How do I choose the draft model?
Pick a much faster model that is reasonably aligned with the target so many drafted tokens are accepted; acceptance rate drives speedup.
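The link between acceptance rate and speedup can be made concrete. Under the simplifying assumption, standard in the speculative decoding literature, that each drafted token is accepted independently with probability alpha, the expected number of tokens produced per target-model call with draft length gamma is (1 − alpha^(gamma+1)) / (1 − alpha):

```python
def expected_tokens_per_target_call(alpha, gamma):
    """Expected tokens generated per target-model forward pass.

    Assumes (a simplification) that each drafted token is accepted
    independently with probability `alpha`; `gamma` is the number of
    drafted tokens per step. The "+1" accounts for the token the
    target model itself contributes on every call.
    """
    if alpha == 1.0:
        return gamma + 1
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)
```

For example, with an 80% acceptance rate and 4 drafted tokens per step, each target call yields about 3.4 tokens instead of 1, so the target model runs roughly 3.4x less often; at alpha = 0 the formula degrades gracefully to 1 token per call, i.e., no speedup.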
When is speculative decoding not helpful?
If the draft model is too weak (low acceptance) or the target model is already extremely fast, gains may be limited.