What is Speculative Decoding?
Speculative decoding is an LLM inference optimization in which a small, fast draft model proposes several tokens and the large target model verifies them in a single forward pass. Accepted tokens are emitted as if the target had generated them. Typical speedups on token generation are 1.5-3×, with no quality loss compared to running the target model alone.
Overview
Speculative decoding (Leviathan et al., 2022) is one of the most important production inference optimizations for LLMs. The core insight: most tokens are easy to predict, so the target model usually agrees with what a much cheaper model would have produced. By using a small, fast draft model to propose several tokens at once and verifying them in a single target-model forward pass, you effectively parallelize part of generation. When implemented correctly, the technique provably preserves the target model's output distribution: under greedy decoding the tokens are identical to running the target alone, and under sampling they follow the same distribution. Production inference engines (vLLM, TensorRT-LLM) and providers such as Together AI use speculative decoding extensively. For applications where latency matters, it is one of the highest-ROI inference optimizations available.
How speculative decoding works
Three components: (1) Target model: the large LLM you actually want output from; (2) Draft model: a small, fast model (often from the same family as the target but much smaller, e.g., Llama 7B as draft for a Llama 70B target); (3) Verification: the rule that accepts or rejects draft tokens.
Process: the draft model generates K candidate tokens (typically 4-8) sequentially. The target model then scores all K positions in a single forward pass, which is much faster than K separate forward passes. Under greedy decoding, a draft token is accepted if it matches the token the target would have generated; under sampling, it is accepted with a probability that depends on the ratio of the target's probability to the draft's. At the first rejection, the remaining draft tokens are discarded and the target's own token is used instead; if all K are accepted, the same pass yields one additional target token. Repeat. The expected speedup depends on how often draft tokens are accepted, typically 1.5-3× in practice with no quality loss vs the target model alone.
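
The draft-then-verify loop can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular engine implements it: decoding is greedy, the "models" are hypothetical stand-in functions (`toy_target`, `toy_draft`) that map a token list to the next token, and the single batched verification pass is simulated with a loop. It also includes the extra target token that a verification pass yields when every draft token is accepted.

```python
from typing import Callable, List

# Hypothetical stand-in for a model: maps a token sequence to the next
# token it would pick under greedy decoding.
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], k: int) -> List[int]:
    """Run one draft-then-verify round and return the newly emitted tokens."""
    # 1. The draft proposes k tokens sequentially (cheap calls).
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # 2. The target checks every position. A real engine obtains all of
    #    these predictions from ONE forward pass; this loop simulates it.
    emitted: List[int] = []
    for i in range(k):
        expected = target(context + proposed[:i])
        if proposed[i] != expected:
            # First mismatch: emit the target's token, drop the rest.
            emitted.append(expected)
            return emitted
        emitted.append(proposed[i])

    # 3. Everything accepted: the same pass also gives one more target token.
    emitted.append(target(context + proposed))
    return emitted


def speculative_generate(target: NextToken, draft: NextToken,
                         prompt: List[int], max_new_tokens: int,
                         k: int = 4) -> List[int]:
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(target, draft, out, k))
    return out[:len(prompt) + max_new_tokens]


def greedy_generate(target: NextToken, prompt: List[int],
                    max_new_tokens: int) -> List[int]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(target(out))
    return out


# Toy models, purely for illustration: the target counts upward modulo 10,
# and the draft agrees except that it guesses wrong right after a 4.
def toy_target(tokens: List[int]) -> int:
    return (tokens[-1] + 1) % 10


def toy_draft(tokens: List[int]) -> int:
    return 0 if tokens[-1] == 4 else (tokens[-1] + 1) % 10
```

With these toy models, `speculative_generate(toy_target, toy_draft, [0], 12, k=3)` and `greedy_generate(toy_target, [0], 12)` return the same sequence. The draft's wrong guess only costs the discarded tokens of that round; it never changes the output.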
Speculative decoding variants
Multiple variants address different needs. (1) Standard speculative decoding (Leviathan et al., 2022): small draft model + target model verification. (2) EAGLE / Medusa: train special draft heads attached to the target model rather than using a separate draft model. Tighter integration, often higher speedup. (3) Lookahead decoding: generate candidates from N-grams rather than a draft model. (4) Self-speculative decoding: use a subset of the target model's layers as the draft, so no separate model is needed. Production frameworks support various combinations. Frontier model inference (Anthropic Claude, OpenAI GPT, Google Gemini) likely uses some form of speculative decoding internally, though the providers do not publish specifics; open-source production serving (vLLM, TensorRT-LLM) supports speculative decoding as a configurable feature.
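
To make the idea of drafting without a draft model concrete, here is a small sketch of an N-gram drafter. It is a simplification and an assumption on our part: it proposes tokens by looking up the most recent earlier occurrence of the last `n` tokens in the context (often called prompt-lookup drafting), whereas lookahead decoding proper builds its N-gram pool through Jacobi iteration. The function name and parameters are hypothetical.

```python
from typing import List


def ngram_draft(tokens: List[int], n: int = 2, k: int = 4) -> List[int]:
    """Propose up to k tokens by reusing what followed the last n tokens earlier."""
    if len(tokens) <= n:
        return []
    suffix = tokens[-n:]
    # Scan right to left over earlier positions, skipping the suffix itself.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == suffix:
            return tokens[start + n:start + n + k]
    return []  # No earlier match: nothing to propose this round.
```

The proposals are then verified by the target model exactly as draft-model proposals would be, so a bad guess costs time but never changes the output. This kind of drafter tends to do well when the output repeats spans of the input, such as code edits or summarization with quotes.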
When speculative decoding helps and when it doesn't
Speculative decoding helps most when generation is latency-critical and the workload has a high acceptance rate. Well-aligned domains (typical chat, RAG over documents) see good speedups. It helps less when: the workload has low draft acceptance (highly creative or domain-specific tasks where the small draft model frequently disagrees with the target); throughput matters more than latency (large batch sizes already amortize per-request cost); or memory is the binding constraint (running a draft model adds memory pressure). As of 2026, most modern inference engines support speculative decoding, usually as a configuration option rather than a universal default.
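
The dependence on acceptance rate can be made quantitative with the simple model from Leviathan et al. The sketch below assumes each draft token is accepted independently with probability `alpha`, that `k` tokens are drafted per round, and that one draft step costs `cost_ratio` times one target pass. It also assumes a verification pass costs about the same as a normal single-token pass, which is roughly true at small batch sizes and less so when the GPU is already compute-bound. The function names are ours.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target pass: (1 - alpha^(k+1)) / (1 - alpha)."""
    if alpha >= 1.0:
        return float(k + 1)  # Every draft token accepted, plus the bonus token.
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def expected_speedup(alpha: float, k: int, cost_ratio: float) -> float:
    """Expected wall-clock speedup over plain autoregressive decoding."""
    # One round costs k draft steps plus one target pass.
    round_cost = k * cost_ratio + 1
    return expected_tokens_per_pass(alpha, k) / round_cost


# High-acceptance workload vs. low-acceptance workload, same draft cost.
print(round(expected_speedup(alpha=0.8, k=4, cost_ratio=0.1), 2))  # 2.4
print(round(expected_speedup(alpha=0.4, k=4, cost_ratio=0.1), 2))  # 1.18
```

Under these assumptions, dropping the acceptance rate from 0.8 to 0.4 takes the speedup from about 2.4× to about 1.2×, which is why benchmarking on your own traffic matters more than any headline number.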
Use cases
- Latency-critical LLM serving (chat applications, real-time generation)
- Production inference at scale (faster generation can raise throughput per GPU when batches are small enough to leave compute headroom)
- Token-by-token streaming UX (lower inter-token latency; time to first token is largely unchanged because it is dominated by prompt prefill)
- Cost-optimized inference (faster generation = lower cost per token at fixed GPU utilization)
Examples in production
Leviathan et al. (Google, 2022)
The original paper, 'Fast Inference from Transformers via Speculative Decoding', introduced the technique that has since become a production standard.
vLLM
vLLM supports speculative decoding as a configurable production feature: typical deployments enable it for latency-critical workloads.
Together AI
Together AI uses speculative decoding extensively in its managed inference, which contributes to the cost-quality advantage it offers over frontier API alternatives.
Speculative Decoding compared to alternatives
| Alternative | Choose Speculative Decoding when | Choose alternative when |
|---|---|---|
| Standard autoregressive generation (generate one token at a time with the target model only) | Speculative decoding for latency-critical or cost-sensitive serving | Standard generation when memory is the binding constraint or the workload doesn't benefit from speculation |
| Continuous batching (dynamically batch concurrent requests for throughput) | Speculative decoding for per-request latency | Continuous batching for throughput; both can be combined |
Common pitfalls
- Mismatched draft and target models: speedup depends on draft acceptance rate; draft too different from target produces poor speedup
- Memory pressure from draft model on smaller GPUs: adding draft model uses memory that could serve more concurrent requests
- Incorrect implementation that breaks the mathematical equivalence guarantee
- Assuming speculative decoding always helps: workload characteristics matter
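
The equivalence guarantee is the pitfall that is easiest to check in isolation. When sampling, the rule from Leviathan et al. is: accept a drafted token x with probability min(1, p(x)/q(x)), where p is the target's distribution and q the draft's; on rejection, resample from the normalized positive part of p - q. The sketch below implements that rule for a single position and computes the resulting output distribution exactly, so you can confirm it equals p. Distributions are plain Python lists and the function names are ours.

```python
import random
from typing import List


def residual(p: List[float], q: List[float]) -> List[float]:
    """Normalized max(0, p - q): the distribution to resample from on rejection."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    # total is 0 only when p == q, in which case a rejection never happens.
    return [d / total for d in diff] if total > 0 else list(p)


def verify_token(token: int, p: List[float], q: List[float],
                 rng: random.Random) -> int:
    """Accept a token drafted from q, or resample; returns the token to emit."""
    # q[token] > 0 because the draft actually sampled this token.
    if rng.random() < min(1.0, p[token] / q[token]):
        return token
    return rng.choices(range(len(p)), weights=residual(p, q))[0]


def output_distribution(p: List[float], q: List[float]) -> List[float]:
    """Exact distribution of the emitted token when the draft samples from q."""
    # Probability of drafting x AND accepting it is q(x) * min(1, p(x)/q(x)).
    accepted = [qi * min(1.0, pi / qi) if qi > 0 else 0.0
                for pi, qi in zip(p, q)]
    reject_mass = 1.0 - sum(accepted)
    return [a + reject_mass * r for a, r in zip(accepted, residual(p, q))]


p = [0.6, 0.3, 0.1]  # target distribution (illustrative numbers)
q = [0.2, 0.5, 0.3]  # draft distribution (illustrative numbers)
print(output_distribution(p, q))  # matches p up to floating-point error
```

A check like this is a cheap regression test: common bugs, such as resampling from p instead of the residual or accepting on an exact match only, make the result visibly differ from p.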
Questions about Speculative Decoding.
Expected speedup is typically 1.5-3× in practice. Higher speedups (3-5×) are achievable with well-tuned draft models and high-acceptance workloads; lower speedups (1.2-1.5×) are common on workloads where draft acceptance is low.
A draft model is typically 5-15% the size of the target model. For a Llama 70B target, Llama 7B or 8B works well as the draft. The draft must be fast enough to make speculation worthwhile but accurate enough to achieve a high acceptance rate.
For latency-critical workloads, speculative decoding is usually worth enabling. Modern inference engines (vLLM, TensorRT-LLM) expose it as a configuration option. The trade-off: it uses additional GPU memory for the draft model, which can reduce concurrent serving capacity. Benchmark on your specific workload.
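
A rough way to size the memory trade-off is to count weight bytes only. The sketch below assumes 16-bit weights (2 bytes per parameter) and ignores the draft model's KV cache and activations, so treat the result as a lower bound. The helper name is ours.

```python
def weight_memory_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes); weights only."""
    return params_billion * bytes_per_param


draft_gb = weight_memory_gb(7)    # 7B draft at 16-bit
target_gb = weight_memory_gb(70)  # 70B target at 16-bit
print(draft_gb, target_gb, draft_gb / target_gb)
```

For a 7B draft next to a 70B target this comes to about 14 GB on top of 140 GB, roughly 10% extra. Whether that is acceptable depends on how much of the remaining memory you would otherwise spend on KV cache for concurrent requests.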
EAGLE and Medusa apply the same idea with a different implementation: they train special draft heads attached to the target model rather than using a separate draft model. This gives tighter integration, often higher speedup, and less memory overhead. They are newer techniques and less universally supported than standard speculative decoding.
Frontier model providers (Anthropic Claude, OpenAI GPT, Google Gemini) very likely use speculative decoding or close variants as part of their inference optimization, though specifics aren't published.