AI engineering glossary

What is LLM Inference?

Inference is the process of running a trained machine learning model on new input to generate output. For LLMs specifically, this means taking a prompt, processing it through the model's billions of parameters, and producing the response one token at a time. Cost and latency are determined mainly by model size, input length, and output length.

Last updated 2026-04-28, BearPlex AI Engineering Team

Overview

Inference is what happens every time you send a prompt to an LLM. It's distinct from training (which builds the model from data) and fine-tuning (which adjusts a pre-trained model for a specific task). Inference is the recurring cost line on every production AI system: at scale, cumulative inference spend often exceeds training spend by 10× or more over a model's lifetime. Inference engineering (the discipline of making inference fast, cheap, and reliable at scale) has become its own specialty. Frameworks like vLLM, TensorRT-LLM, llama.cpp, and MLC, along with managed services like Together AI, Fireworks, and Anyscale, have professionalized inference optimization.

How LLM inference works

LLM inference has two distinct phases: (1) Prefill: the entire input prompt is processed through the model in parallel, producing a key-value (KV) cache that captures what the model has 'understood' so far; (2) Decoding: the model generates output tokens one at a time, each token requiring a full forward pass through the model's layers (with the KV cache reused to avoid reprocessing the prompt). Prefill is compute-bound and parallelizable, so it is fast per token. Decoding is memory-bandwidth-bound and serial, so it is slow per token. This is why time-to-first-token depends mostly on input length, while inter-token latency is roughly constant (it grows only slowly as the context lengthens). It is also why batching matters so much for throughput: during decoding, a batch of requests shares each read of the model weights.
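
The compute-bound versus memory-bandwidth-bound split can be made concrete with a back-of-envelope estimate. The sketch below is a rough lower bound for a single request at batch size 1, not a benchmark. It assumes about 2 FLOPs per parameter per prompt token during prefill and one full read of the weights per decoded token. The function name and the default hardware figures (400 TFLOPS, 2,000 GB/s) are hypothetical placeholders. Attention cost, KV cache reads, and kernel overheads are ignored.

```python
def estimate_latency(params_billion, prompt_tokens, output_tokens,
                     bytes_per_param=2.0, peak_tflops=400.0,
                     mem_bandwidth_gbs=2000.0):
    """Rough lower-bound latency for one request at batch size 1.

    Hardware defaults are hypothetical; substitute your accelerator's numbers.
    """
    params = params_billion * 1e9

    # Prefill is compute-bound: ~2 FLOPs per parameter per prompt token,
    # with all prompt tokens processed in parallel.
    prefill_s = (2 * params * prompt_tokens) / (peak_tflops * 1e12)

    # Decoding is memory-bandwidth-bound: each new token requires
    # streaming every weight from memory once.
    per_token_s = (params * bytes_per_param) / (mem_bandwidth_gbs * 1e9)

    return {
        "ttft_s": prefill_s,
        "tpot_s": per_token_s,
        "total_s": prefill_s + output_tokens * per_token_s,
    }
```

Under these assumptions, a 7B-parameter model in FP16 with a 1,000-token prompt gives roughly 35 ms of prefill and 7 ms per output token. Doubling the prompt doubles the prefill estimate and leaves the per-token estimate unchanged, which matches the behavior described above.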

Inference performance metrics

Production teams measure inference performance with: (1) TTFT (time to first token): how long from request submission to the first response token; this dominates perceived latency for chat applications; (2) TPS or TPOT (tokens per second or time per output token, which are reciprocals of each other): how fast tokens stream after the first one; (3) Throughput: requests per second the system can handle; (4) Concurrency: how many simultaneous requests the system serves before degrading. Different applications optimize different metrics: chat apps want low TTFT; batch processing wants high throughput; code completion wants both low TTFT and high TPS.
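
As a minimal sketch, the per-request metrics can be derived from token arrival timestamps recorded on the client. The function name and the input shape (a request timestamp plus a list of token timestamps, all in seconds) are assumptions for illustration. Throughput and concurrency are system-level numbers and need aggregation across many requests, which this sketch does not attempt.

```python
def stream_metrics(request_time, token_times):
    """Compute TTFT, TPOT, and TPS for one streamed response.

    request_time: when the request was submitted (seconds).
    token_times:  arrival time of each output token (seconds, ascending).
    """
    if not token_times:
        raise ValueError("need at least one token timestamp")

    # TTFT: submission to first token.
    ttft = token_times[0] - request_time

    # TPOT: average gap between tokens after the first one.
    n_after_first = len(token_times) - 1
    if n_after_first == 0:
        return {"ttft_s": ttft, "tpot_s": None, "tps": None}
    tpot = (token_times[-1] - token_times[0]) / n_after_first

    # TPS is the reciprocal of TPOT (undefined if all tokens arrived at once).
    tps = 1.0 / tpot if tpot > 0 else None
    return {"ttft_s": ttft, "tpot_s": tpot, "tps": tps}
```

Measuring from the client side includes network and queueing time, which is usually what the user actually experiences.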

Inference optimization techniques

Production inference uses several optimization layers. Quantization (running the model in INT8, INT4, or FP8 instead of FP16/BF16) reduces memory footprint and memory-bandwidth demand, improving throughput at a small accuracy cost. Speculative decoding uses a small draft model to propose tokens that the large model verifies in parallel, often producing a 1.5-2× speedup. Continuous batching adds and removes requests from the running batch at each decoding step instead of waiting for a fixed batch to fill or finish. KV cache management (for example, vLLM's PagedAttention) allocates the cache in fixed-size blocks to reduce wasted GPU memory, and some systems can offload parts of it to CPU memory. Tensor parallelism and pipeline parallelism distribute large models across multiple GPUs. Production systems combine multiple techniques to hit performance targets.
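
To illustrate one of these techniques, here is a toy sketch of greedy speculative decoding. It is not any library's implementation. Both "models" are hypothetical functions that map a token list to the next token, and the verification loop stands in for what a real engine does in a single batched forward pass of the large model. The sketch also omits the extra "bonus" token that real implementations emit when every drafted token is accepted.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # greedy next-token function


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding; output matches plain greedy decoding
    with the target model."""
    tokens = list(prompt)
    goal = len(prompt) + max_new
    while len(tokens) < goal:
        # 1. The cheap draft model proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(tokens))):
            proposal.append(draft(tokens + proposal))

        # 2. The target model checks each proposed position. A real engine
        #    scores all positions in one forward pass; this loop simulates it.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(tokens + accepted)
            if tok == expected:
                accepted.append(tok)       # draft was right: keep it
            else:
                accepted.append(expected)  # first mismatch: use target's token
                break
        tokens.extend(accepted)
    return tokens


# Hypothetical toy models for demonstration only.
def toy_target(tokens: List[int]) -> int:
    return (tokens[-1] * 3 + 1) % 17


def toy_draft(tokens: List[int]) -> int:
    # Agrees with the target, except that it guesses 0 whenever the
    # context length is a multiple of 3.
    return toy_target(tokens) if len(tokens) % 3 else 0
```

Each round emits at least one token that the target model would have produced anyway, so the result is unchanged. The speedup comes from how many drafted tokens are accepted per verification pass.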

Use cases

  • Real-time chat where TTFT must be under 500ms
  • Batch document processing where throughput matters more than latency
  • Edge deployment of small models on consumer hardware (llama.cpp, MLC)
  • Cost-optimized backend services using quantized models
  • Multi-tenant SaaS that needs predictable per-request economics

Examples in production

vLLM

Open-source inference engine from UC Berkeley that introduced PagedAttention: a key-value cache management technique that dramatically improved throughput.

Together AI

Managed inference platform offering optimized serving for open-source models with quantization, speculative decoding, and continuous batching.

Anthropic

Claude is served on custom inference infrastructure including AWS Trainium and Google TPU, with prompt caching and extended thinking as production-facing features.

Inference compared to alternatives

Training: the process of building or updating model weights from data
  • Choose Inference when: inference happens every time you use the model, typically the dominant production cost
  • Choose alternative when: training is one-time (or periodic) and dominates initial compute investment

Fine-tuning: adjusting a pre-trained model on task-specific data
  • Choose Inference when: inference is what runs after fine-tuning is done; every user request hits inference
  • Choose alternative when: fine-tuning is preparation; inference is the live system

Common pitfalls

  • Optimizing TPS without measuring TTFT: your user-perceived latency may not improve
  • Aggressive quantization (INT4) on tasks that require high precision (math, coding) without benchmark validation
  • Ignoring batch dynamics: single-request latency is misleading for production sizing
  • Underprovisioning GPU memory: out-of-memory errors during long contexts kill production reliability
  • Self-hosting open-source models without realizing the hidden ops cost vs managed inference services
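
The GPU memory pitfall is easier to size with the standard KV cache formula: two tensors (keys and values) per layer, each holding one vector per KV head per token. The sketch below assumes a dense transformer with a uniform layer layout. The function name and the example dimensions are hypothetical, so read the real values from your model's configuration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Approximate KV cache size in bytes.

    bytes_per_value=2 corresponds to FP16/BF16 cache entries.
    """
    # The leading 2 counts one key tensor plus one value tensor per layer.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Hypothetical model shape: 32 layers, 8 KV heads, head dimension 128.
per_token = kv_cache_bytes(32, 8, 128, seq_len=1)
long_context = kv_cache_bytes(32, 8, 128, seq_len=32_768)
```

With these hypothetical dimensions the cache costs 128 KiB per token, so a single 32K-token context needs 4 GiB on top of the model weights, and that figure multiplies with the number of concurrent requests.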

FAQ

Questions about Inference.

Self-host when you have (a) very high volume that justifies the engineering investment, (b) data residency requirements that prevent sending requests to a third party, or (c) a need for custom optimization that managed services do not offer. Use managed services (OpenAI, Anthropic, Together, Fireworks, Anyscale) when you want predictable economics without an inference engineering team. Most BearPlex clients use managed services for the first 6-12 months and only consider self-hosting if cost crosses a meaningful threshold.

For chat, TTFT under 500ms feels instant; 500-1000ms feels responsive; 1000-2000ms is noticeably slow; 2000ms+ feels broken. For frontier model chat (GPT-4, Claude Sonnet), TTFT under 1 second is achievable for short prompts. Long prompts (10K+ tokens) push TTFT to 2-5 seconds; prompt caching can drop this back under 1 second.

Quantization costs some accuracy, but less than people often assume. INT8 quantization typically loses <1% on most benchmarks. INT4 loses 1-3% on most tasks but can be much worse on math, coding, and reasoning. Always benchmark your specific use case before deploying a quantized model: generic accuracy numbers don't predict performance on your task.
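
To see where the loss comes from, here is a minimal sketch of symmetric absmax INT8 quantization on a flat list of weights. It is an illustration under simplifying assumptions: one scale for the whole tensor and pure Python, whereas production schemes typically use per-channel or per-group scales and calibrated methods. The function names are hypothetical. The sketch measures rounding error on weights, which is not the same thing as accuracy on your task.

```python
def quantize_int8(weights):
    """Symmetric absmax quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the INT8 range.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from integers and the shared scale."""
    return [v * scale for v in q]


def max_abs_error(weights):
    """Largest round-trip error; bounded by half a quantization step."""
    q, scale = quantize_int8(weights)
    restored = dequantize(q, scale)
    return max(abs(w - r) for w, r in zip(weights, restored))
```

Because a single outlier sets the scale for every other weight, tensors with a wide value range lose more precision on their small entries. That is one reason accuracy impact varies by model and task.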

Prompt caching cuts cost dramatically when requests share a long prefix. Anthropic and OpenAI both offer prompt caching that discounts cached prefix tokens, by up to roughly 90% depending on provider and model. For applications where the same long system prompt or document context is sent across many requests, prompt caching often pays for itself within hours. The cache-friendly structure is a stable prefix (system prompt, retrieved documents) followed by a variable suffix (user message), because a cache hit requires the start of the prompt to match exactly.
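
The arithmetic behind that saving can be sketched in a few lines. The function name and the price used in the example are hypothetical, and the 90% discount is taken from the figure quoted above. The sketch covers input tokens only and ignores output tokens and the premium that some providers charge for writing to the cache.

```python
def input_cost_usd(prefix_tokens, suffix_tokens, price_per_mtok,
                   cached_discount=0.9, cache_hit=True):
    """Input-token cost of one request with a stable prefix and variable suffix.

    price_per_mtok: uncached input price in USD per million tokens.
    cached_discount: fraction taken off the price of cached prefix tokens.
    """
    if cache_hit:
        prefix_price = price_per_mtok * (1 - cached_discount)
    else:
        prefix_price = price_per_mtok
    # Suffix tokens (the user message) are always billed at the full price.
    total = prefix_tokens * prefix_price + suffix_tokens * price_per_mtok
    return total / 1_000_000


# Hypothetical example: 10,000-token prefix, 200-token user message, $3 per Mtok.
uncached = input_cost_usd(10_000, 200, 3.0, cache_hit=False)
cached = input_cost_usd(10_000, 200, 3.0, cache_hit=True)
```

With these hypothetical numbers the input cost per request falls from about $0.0306 to about $0.0036, roughly 88% less. The saving shrinks as the variable suffix grows relative to the stable prefix.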
