How much speedup can we expect?

Typically 1.5-3× in practice. Higher speedups (3-5×) are achievable with well-tuned draft models and high-acceptance workloads. Expect lower speedups (1.2-1.5×) on workloads where draft acceptance is low.
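
These ranges can be related to the expected-speedup analysis in Leviathan et al. The sketch below assumes each draft token is accepted independently with probability alpha, that the draft proposes k tokens per round, and that one draft step costs cost_ratio times one target forward pass. The function and parameter names are illustrative, and the independence assumption is a simplification of real workloads.

```python
def expected_tokens_per_round(alpha: float, k: int) -> float:
    """Expected tokens emitted per target forward pass.

    Counts the accepted draft tokens plus the one token the target
    always contributes (a correction or a bonus token).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def expected_speedup(alpha: float, k: int, cost_ratio: float) -> float:
    """Expected wall-clock speedup over plain autoregressive decoding.

    One round costs k draft steps plus one target pass, measured in
    units of a single target forward pass.
    """
    return expected_tokens_per_round(alpha, k) / (k * cost_ratio + 1.0)


if __name__ == "__main__":
    for alpha in (0.4, 0.8):
        print(alpha, round(expected_speedup(alpha, k=4, cost_ratio=0.1), 2))
```

Under these assumptions, an acceptance rate of 0.8 with k=4 and a draft that costs 10% of a target pass gives roughly 2.4×, while an acceptance rate of 0.4 gives roughly 1.2×.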

What's the right draft model size?

Typically 5-15% of the size of the target model. For a Llama 70B target, a Llama 7B or 8B draft from the same generation (so that both share a tokenizer) works well. The draft must be fast enough to make speculation worthwhile but accurate enough to achieve a high acceptance rate.

Should we enable speculative decoding in production?

For latency-critical workloads, yes. Modern inference engines (vLLM, TensorRT-LLM) expose it as a configuration option. The trade-off: the draft model uses additional GPU memory, which can reduce concurrent serving capacity. Benchmark on your specific workload.
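
The memory trade-off can be estimated with back-of-the-envelope arithmetic before benchmarking. The sketch below counts weight memory only and assumes 2 bytes per parameter (FP16/BF16). It ignores the draft model's KV cache, activations, and any sharding across GPUs, so treat the result as a lower bound. The helper names are hypothetical.

```python
from typing import Tuple


def weight_memory_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes).

    The default of 2 bytes per parameter corresponds to FP16/BF16 weights.
    """
    return params_billion * bytes_per_param


def draft_overhead(
    target_params_billion: float,
    draft_params_billion: float,
    bytes_per_param: float = 2.0,
) -> Tuple[float, float]:
    """Return (extra GB for draft weights, extra memory as a fraction of target weights)."""
    target_gb = weight_memory_gb(target_params_billion, bytes_per_param)
    draft_gb = weight_memory_gb(draft_params_billion, bytes_per_param)
    return draft_gb, draft_gb / target_gb


if __name__ == "__main__":
    extra_gb, fraction = draft_overhead(70, 7)
    print(f"draft weights: {extra_gb:.0f} GB ({fraction:.0%} of target weights)")
```

For a 7B draft next to a 70B target, this works out to about 14 GB of extra weights, or 10% of the target's weight memory. That memory would otherwise be available for KV cache and concurrent requests.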

Do frontier models use speculative decoding?

Probably, though specifics aren't published. Serving stacks for frontier models (Anthropic Claude, OpenAI GPT, Google Gemini) rely on sophisticated inference optimization that very likely includes speculative decoding or close variants.

What is Speculative Decoding?

What's the difference between speculative decoding and EAGLE / Medusa?

Same idea, different implementation. EAGLE and Medusa train special draft heads attached to the target model rather than using a separate draft model. This gives tighter integration, often higher speedup, and less memory overhead. They are newer techniques and less universally supported than standard speculative decoding.

Speculative decoding is an LLM inference optimization in which a small, fast 'draft model' proposes multiple tokens, the large target model verifies them in a single forward pass, and the accepted tokens are returned. It typically achieves a 1.5-3× speedup on token generation with no quality loss compared to running the target model alone.

Last updated 2026-04-29, BearPlex AI Engineering Team

Overview

Speculative decoding (Leviathan et al., 2022) is one of the most important production inference optimizations for LLMs. The core insight: most tokens are easy to predict, so the target model usually agrees with a much cheaper model. A small, fast draft model proposes several tokens at once and the target model verifies them in a single forward pass, which effectively parallelizes generation. When implemented correctly, the technique is mathematically guaranteed to preserve the target model's output distribution; under greedy decoding it yields the same tokens as running the target model alone, up to floating-point effects. Production inference engines (vLLM, TensorRT-LLM) and managed providers such as Together AI use speculative decoding. For applications where latency matters, it is one of the highest-ROI inference optimizations available.

How speculative decoding works

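Three components: (1) the target model, the large LLM you actually want to run inference with; (2) the draft model, a small fast model (often from the same family as the target but much smaller, e.g., Llama 7B as draft for a Llama 70B target); (3) verification, which accepts or rejects draft tokens.

Process: the draft model generates K candidate tokens (typically 4-8) sequentially. The target model scores all K positions in a single forward pass, which is much faster than K separate forward passes. Under greedy decoding, a draft token is accepted if it matches the token the target would have generated; with sampling, it is accepted with probability min(1, p/q), where p and q are the target and draft probabilities of that token. At the first rejection, the remaining draft tokens are discarded and a replacement token is taken from the target (resampled from an adjusted distribution when sampling). If all K tokens are accepted, the same pass yields one extra target token. Repeat. The expected speedup depends on the draft model's acceptance rate, typically 1.5-3× in practice with no quality loss versus the target model alone.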
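
The draft-then-verify loop described above can be sketched with toy models. This is a minimal illustration of the greedy variant only. The "models" are hypothetical deterministic functions from a token prefix to the next token, and the single batched target pass of a real engine is emulated here with repeated calls. All names are assumptions made for the example.

```python
from typing import Callable, List, Sequence

NextToken = Callable[[Sequence[int]], int]


def target_model(ctx: Sequence[int]) -> int:
    """Hypothetical toy target: a deterministic function of the prefix."""
    return (31 * sum(ctx) + len(ctx)) % 50


def draft_model(ctx: Sequence[int]) -> int:
    """Hypothetical toy draft: agrees with the target except at every fourth position."""
    guess = target_model(ctx)
    return guess if len(ctx) % 4 else (guess + 1) % 50


def speculative_step(prefix: List[int], draft: NextToken, target: NextToken, k: int) -> List[int]:
    """Run one draft-then-verify round and return the tokens it emits."""
    # 1. The draft proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))
    # 2. The target scores positions 0..k. A real engine does this in one
    #    batched forward pass; the toy version calls the target k + 1 times.
    target_choices = [target(prefix + proposed[:i]) for i in range(k + 1)]
    # 3. Accept draft tokens until the first mismatch.
    emitted: List[int] = []
    for i in range(k):
        if proposed[i] != target_choices[i]:
            break
        emitted.append(proposed[i])
    # 4. Always add one target token: the correction at the first mismatch,
    #    or a bonus token when every draft token was accepted.
    emitted.append(target_choices[len(emitted)])
    return emitted


def speculative_generate(prompt: List[int], draft: NextToken, target: NextToken,
                         k: int, max_new_tokens: int) -> List[int]:
    """Generate max_new_tokens tokens using repeated speculative rounds."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(out, draft, target, k))
    return out[: len(prompt) + max_new_tokens]


def greedy_generate(prompt: List[int], model: NextToken, max_new_tokens: int) -> List[int]:
    """Plain one-token-at-a-time greedy decoding, used as the reference."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(model(out))
    return out


if __name__ == "__main__":
    prompt = [1, 2, 3]
    fast = speculative_generate(prompt, draft_model, target_model, k=4, max_new_tokens=20)
    print(fast == greedy_generate(prompt, target_model, 20))
```

Because every emitted token is either a draft token that matched the target's choice or the target's own token, the greedy speculative output equals plain greedy decoding with the target. The speedup in a real system comes from step 2 being one forward pass instead of several.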

Speculative decoding variants

Multiple variants address different needs. (1) Standard speculative decoding (Leviathan et al., 2022): a small draft model plus target model verification. (2) EAGLE / Medusa: train special draft heads attached to the target model rather than using a separate draft model, giving tighter integration and often higher speedup. (3) Lookahead decoding: generate candidates from N-grams rather than a draft model. (4) Self-speculative decoding: use a subset of the target model's layers as the draft, so no separate model is needed. Production frameworks support various combinations. Frontier model inference (Anthropic Claude, OpenAI GPT, Google Gemini) very likely uses some form of speculative decoding internally, though details are not published; open-source production serving (vLLM, TensorRT-LLM) supports speculative decoding as a configurable feature.
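
To illustrate drafting without a draft model, the sketch below proposes tokens by matching the trailing n-gram of the sequence against earlier text and copying what followed it. This is a simplified prompt-lookup style of n-gram drafting, not the full Lookahead decoding algorithm, and the function name and parameters are hypothetical. The proposals would still be verified by the target model exactly as in standard speculative decoding.

```python
from typing import List, Sequence


def ngram_propose(tokens: Sequence[int], ngram_size: int = 2, k: int = 4) -> List[int]:
    """Propose up to k draft tokens from an earlier occurrence of the trailing n-gram.

    Searches backwards for the most recent earlier position where the last
    ngram_size tokens appeared, and returns the tokens that followed it.
    Returns an empty list when there is no earlier match.
    """
    if len(tokens) <= ngram_size:
        return []
    tail = list(tokens[-ngram_size:])
    # Start just before the trailing n-gram so it cannot match itself.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if list(tokens[start:start + ngram_size]) == tail:
            follow = start + ngram_size
            return list(tokens[follow:follow + k])
    return []


if __name__ == "__main__":
    history = [5, 6, 7, 8, 9, 5, 6]
    print(ngram_propose(history))  # the earlier "5, 6" was followed by 7, 8, 9, 5
```

This kind of proposer costs almost nothing to run and needs no extra GPU memory, but it only helps when the output repeats spans from the context, as in summarization or code editing.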

When speculative decoding helps and when it doesn't

Speculative decoding helps most when generation is latency-critical and the workload has a high acceptance rate. Well-aligned domains (typical chat, RAG over documents) see good speedups. It helps less when: the workload has low draft acceptance (highly creative or domain-specific tasks where the small draft model frequently disagrees with the target); throughput matters more than latency (large batch sizes already amortize per-request cost); or memory is the binding constraint (running a draft model adds memory pressure). As of 2026, modern inference engines widely support speculative decoding, typically as a setting you enable and tune per workload.

Use cases

Latency-critical LLM serving (chat applications, real-time generation)
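Production inference at scale (faster generation can mean higher throughput per GPU when large batches are not already saturating it)
Token-by-token streaming UX (lower inter-token latency; TTFT is dominated by prompt prefill and is largely unaffected)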
Cost-optimized inference (faster generation = lower cost per token at fixed GPU utilization)

Examples in production

Leviathan et al. (Google, 2022)

The original 'Fast Inference from Transformers via Speculative Decoding' paper, which introduced the technique that has become a production standard.

vLLM

vLLM supports speculative decoding as a configurable production feature. Typical deployments enable it for latency-critical workloads.

Together AI

Together AI uses speculative decoding extensively in its managed inference, which contributes to the cost-quality advantage it offers over frontier API alternatives.

Speculative Decoding compared to alternatives

Alternative	Choose speculative decoding when	Choose the alternative when
Standard autoregressive generation (generates one token at a time with the target model only)	Serving is latency-critical or cost-sensitive	Memory is the binding constraint or the workload doesn't benefit from speculation
Continuous batching (dynamically batches concurrent requests for throughput)	Per-request latency is the priority	Throughput is the priority; the two can be combined

Common pitfalls

Mismatched draft and target models: speedup depends on the draft acceptance rate, and a draft too different from the target produces poor speedup
Memory pressure from the draft model on smaller GPUs: the draft model uses memory that could otherwise serve more concurrent requests
Incorrect implementation that breaks the mathematical equivalence guarantee
Assuming speculative decoding always helps: workload characteristics matter


FAQ

Questions about Speculative Decoding.

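Does speculative decoding reduce output quality?

No. When implemented correctly, speculative decoding is mathematically guaranteed to preserve the target model's output distribution, and under greedy decoding it produces the same tokens as running the target model alone (up to floating-point effects). The technique is exact, not approximate.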
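
The exactness claim can be checked for a single token using the accept-or-resample rule from Leviathan et al.: a draft token x sampled from the draft distribution q is accepted with probability min(1, p(x)/q(x)), and on rejection a token is resampled from the normalized residual max(0, p - q). The sketch below computes the resulting output distribution in closed form rather than by sampling. The function name and the example distributions are illustrative.

```python
from typing import List


def speculative_output_distribution(p: List[float], q: List[float]) -> List[float]:
    """Exact distribution of the token emitted by one accept-or-resample step.

    p is the target distribution and q is the draft distribution over the
    same vocabulary.
    """
    # Probability that token x is drafted and accepted:
    # q[x] * min(1, p[x] / q[x]) = min(p[x], q[x]).
    accepted = [min(pi, qi) for pi, qi in zip(p, q)]
    reject_prob = 1.0 - sum(accepted)
    # On rejection, resample from the residual max(0, p - q), normalized.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    residual_mass = sum(residual)
    if residual_mass == 0.0:
        # p and q are identical, so every draft token is accepted.
        return accepted
    return [a + reject_prob * r / residual_mass for a, r in zip(accepted, residual)]


if __name__ == "__main__":
    target_p = [0.7, 0.2, 0.1]
    draft_q = [0.3, 0.3, 0.4]
    print(speculative_output_distribution(target_p, draft_q))
```

Even with a draft distribution that differs substantially from the target, the emitted token follows the target distribution, which is why a poor draft model costs speed but not quality.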

Need help implementing Speculative Decoding?

BearPlex builds production AI systems that use Speculative Decoding for Fortune 500s and high-growth scale-ups. Outcome-based pricing. 90-day embedded sprints.
