What is Quantization in LLMs?

Quantization is the process of reducing the numerical precision of a neural network's weights and activations, typically from 16-bit floating point (FP16/BF16) to 8-bit integer (INT8) or 4-bit integer (INT4). It reduces model size, memory-bandwidth demand, and inference cost, with modest accuracy trade-offs that are usually acceptable for production deployment.

Last updated 2026-04-29. BearPlex AI Engineering Team

Overview

Quantization is one of the most important production deployment optimizations for LLMs. The economics are compelling. Relative to FP16, INT8 quantization typically halves weight memory and the bandwidth needed to read it, and significantly improves throughput at under 1% accuracy cost; INT4 quantization typically cuts weight memory to a quarter at 1-3% accuracy cost. For LLM serving, where memory bandwidth is often the dominant constraint, this translates to 1.5-3× more requests per second per GPU. Production quantization techniques have matured rapidly: GPTQ (2022), SmoothQuant (2022), and AWQ (2023) are post-training quantization methods that work without retraining; QLoRA (2023) enables fine-tuning on top of a quantized base model. At BearPlex, quantization is a standard part of the production deployment toolkit: we benchmark quantized vs full-precision performance on every self-hosted deployment.
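
The halving and quartering above is simple arithmetic on bits per weight. The sketch below is a rough estimate only: the function name is illustrative, and it counts weight storage alone, ignoring quantization scales, the KV cache, and activations.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes).

    Assumption: counts only the weights themselves; real deployments also
    store scales/zero points and need room for the KV cache and activations.
    """
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    # A 70B-parameter model at three precisions.
    for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
        print(f"{name}: {weight_memory_gb(70e9, bits):.0f} GB")
    # FP16: 140 GB
    # INT8: 70 GB
    # INT4: 35 GB
```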

How quantization works

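Neural network weights are typically stored as 16-bit or 32-bit floating-point numbers. Quantization replaces them with lower-precision integers (INT8 with 256 possible values, INT4 with 16 values), using a calibration process that picks a scale (and optionally a zero point) mapping the floating-point range onto the integer range. What happens at inference depends on the scheme. When both weights and activations are quantized (for example, 8-bit weights and activations with SmoothQuant), the matrix multiplications themselves run in INT8, with dequantization back to float for operations that require it. With weight-only quantization (the common case for GPTQ and AWQ at INT4), weights are stored as integers and dequantized on the fly while the arithmetic runs in FP16/BF16; the gain comes from moving fewer bytes. The key technical challenge is preserving accuracy: naive quantization loses information about the weight distribution, especially in the tails, because a few outliers stretch the scale and leave little resolution for the many small values. Modern quantization methods (GPTQ, AWQ) use calibration on sample data to preserve accuracy where it matters most, and they are now mature enough that INT8 quantization is essentially free for most production use cases.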
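
To make the mapping concrete, here is a minimal sketch of naive symmetric (absmax) quantization, plus a group-wise variant. It is not GPTQ or AWQ: there is no calibration data, and the function names, the group size of 3, and the example weights are illustrative assumptions. It shows how a single outlier stretches the scale so that small weights collapse to zero, and how giving each small group its own scale limits the damage.

```python
def quantize_symmetric(weights, bits=8):
    """Map floats to signed integer codes in [-qmax, qmax] with one shared scale."""
    qmax = 2 ** (bits - 1) - 1          # 127 for INT8, 7 for INT4
    absmax = max(abs(w) for w in weights)
    scale = absmax / qmax if absmax > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from integer codes."""
    return [q * scale for q in quantized]


def quantize_groupwise(weights, bits=4, group_size=3):
    """Quantize each group with its own scale; return the dequantized reconstruction."""
    reconstructed = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        quantized, scale = quantize_symmetric(group, bits)
        reconstructed.extend(dequantize(quantized, scale))
    return reconstructed


def max_abs_error(original, reconstructed):
    return max(abs(a - b) for a, b in zip(original, reconstructed))


if __name__ == "__main__":
    weights = [0.02, -0.05, 0.11, -0.08, 0.03, 2.5]   # 2.5 is an outlier
    codes, scale = quantize_symmetric(weights, bits=4)
    # With one scale for the whole tensor, the five small weights all get code 0.
    print("per-tensor INT4 codes:", codes)
    print("per-tensor max error:", max_abs_error(weights, dequantize(codes, scale)))
    grouped = quantize_groupwise(weights, bits=4, group_size=3)
    print("group-wise max error, first group:", max_abs_error(weights[:3], grouped[:3]))
```

Per-group scales cost a little extra storage, which is why effective bits per weight in practice sit slightly above the nominal 4 or 8.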

Quantization methods in production

  • GPTQ: post-training quantization that minimizes layer-by-layer reconstruction error using calibration data; widely supported, good quality.
  • AWQ (Activation-aware Weight Quantization): uses activation magnitudes to identify the most important weight channels and scales them to protect them during quantization; often higher quality than GPTQ at INT4.
  • SmoothQuant: applies pre-quantization scaling that shifts quantization difficulty from activations to weights, making activations more quantization-friendly.
  • FP8: emerging hardware-supported 8-bit floating point that retains float-like properties; supported by H100 and newer GPUs.
  • GGUF (llama.cpp format): a file format with a family of quantization types for CPU and consumer GPU inference, common for edge deployment.
  • bitsandbytes: a quantization library integrated with Hugging Face Transformers; it quantizes at load time without calibration data, so it is simpler to use than GPTQ/AWQ but typically lower quality.

For production GPU serving, AWQ and GPTQ dominate; for edge and consumer deployment, GGUF dominates.
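
As one illustration of these methods, the pre-quantization scaling in SmoothQuant can be sketched in a few lines. The sketch assumes the per-channel formula from the SmoothQuant paper, s_j = max|X_j|^α / max|W_j|^(1-α), with α = 0.5; the function names and the tiny matrices are hypothetical, and every channel is assumed to have a non-zero maximum. Dividing activations by s and multiplying the matching weight rows by s leaves the product unchanged while shrinking the activation outliers.

```python
def smooth_scales(X, W, alpha=0.5):
    """Per-input-channel scales. X is tokens x in_features, W is in_features x out_features."""
    scales = []
    for j in range(len(W)):
        act_max = max(abs(row[j]) for row in X)   # largest activation in channel j
        w_max = max(abs(v) for v in W[j])         # largest weight in row j
        scales.append((act_max ** alpha) / (w_max ** (1 - alpha)))
    return scales


def apply_smoothing(X, W, scales):
    """Divide activation channels by s_j and multiply the matching weight rows by s_j."""
    X_s = [[x / s for x, s in zip(row, scales)] for row in X]
    W_s = [[w * s for w in row] for row, s in zip(W, scales)]
    return X_s, W_s


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def abs_max(M):
    return max(abs(v) for row in M for v in row)


if __name__ == "__main__":
    X = [[0.1, 20.0], [-0.2, -35.0]]   # channel 1 carries activation outliers
    W = [[0.5, -0.3], [0.02, 0.04]]
    X_s, W_s = apply_smoothing(X, W, smooth_scales(X, W))
    print("activation range before/after:", abs_max(X), abs_max(X_s))
    print("weight range before/after:", abs_max(W), abs_max(W_s))
    print("outputs match:", matmul(X, W), matmul(X_s, W_s))
```

After smoothing, activations and weights have comparable ranges, so both can be quantized to INT8 with far less clipping or rounding loss than the raw activations would suffer.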

When quantization helps and when it hurts

Quantization helps most when memory bandwidth is the constraint, which is true for most LLM inference workloads, especially at the decoding stage, where each decoding step reads the full set of weights. Production wins: 1.5-3× higher throughput per GPU, the ability to fit larger models on the same hardware (e.g., serving a 70B model on a single A100 80GB with INT4), and lower per-request cost. Quantization hurts on math and reasoning tasks, where small precision differences matter more (INT4 can lose 5-10% on math benchmarks while staying within 1-3% on general benchmarks); on tasks requiring long-context recall (quantization can degrade attention behavior at the long-context tail); and on very small models (a 1.5B INT4 model often loses meaningfully more than a 70B INT4 model). Always benchmark on your specific task before committing to quantization in production.
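
A back-of-the-envelope model shows why bandwidth-bound decoding benefits. If every generated token must read all weights once, single-stream decode speed is bounded by memory bandwidth divided by weight bytes. The sketch below is an upper bound under that assumption only; the function names are illustrative, and the 2,000 GB/s bandwidth figure is an assumed round number for an A100 80GB-class GPU, so check your hardware's specification.

```python
def weight_gb(num_params, bits_per_weight):
    """Weight storage in GB (10^9 bytes), ignoring scales and zero points."""
    return num_params * bits_per_weight / 8 / 1e9


def decode_tokens_per_second_bound(num_params, bits_per_weight, bandwidth_gb_per_s):
    """Upper bound on batch-1 decode speed if each token reads every weight once."""
    return bandwidth_gb_per_s / weight_gb(num_params, bits_per_weight)


def fits_in_memory(num_params, bits_per_weight, gpu_memory_gb, reserve_gb=0.0):
    """True if the weights fit, leaving reserve_gb for KV cache and activations."""
    return weight_gb(num_params, bits_per_weight) + reserve_gb <= gpu_memory_gb


if __name__ == "__main__":
    bandwidth = 2000.0  # GB/s, assumed; not a measured value
    for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
        bound = decode_tokens_per_second_bound(7e9, bits, bandwidth)
        print(f"7B {name}: at most {bound:.0f} tokens/s per stream")
    print("70B FP16 fits on 80 GB:", fits_in_memory(70e9, 16, 80))
    print("70B INT4 fits on 80 GB:", fits_in_memory(70e9, 4, 80, reserve_gb=20))
```

The bound scales linearly with bits per weight, so INT4 looks 4× faster than FP16 on paper. Measured gains are smaller, in line with the 1.5-3× range above, because dequantization overhead, KV cache traffic, and compute-bound phases such as prefill are not captured here.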

Use cases

  • Self-hosted LLM serving where GPU cost is a major operational expense
  • Edge and consumer-hardware deployment via llama.cpp / GGUF formats
  • Fitting larger models on smaller GPUs (e.g., 70B on a single A100 80GB)
  • Reducing inference latency for memory-bandwidth-bound workloads
  • QLoRA fine-tuning of large models on consumer hardware

Examples in production

GPTQ paper (IST Austria)

Frantar et al.'s 'GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers' (2022) introduced one of the most widely used post-training quantization methods.

AWQ paper (MIT)

Lin et al.'s 'AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration' (2023) introduced activation-aware quantization that often beats GPTQ at INT4.

llama.cpp

Open-source inference engine that pioneered consumer-hardware LLM deployment via aggressive quantization (GGUF format with 2-bit through 8-bit options).

Quantization compared to alternatives

| Alternative | Choose quantization when | Choose the alternative when |
|---|---|---|
| Distillation: train a smaller dense model on outputs from a larger model | You want a cheap reduction in inference cost without retraining | You need a fundamentally smaller model architecture |
| Pruning: remove unimportant weights to reduce model size | You want broader hardware support and easier production tooling | Rarely: pruning is seldom used in production LLMs compared with quantization, whose tooling is more mature |

Common pitfalls

  • Aggressive quantization (INT4) on tasks requiring high precision (math, coding) without benchmark validation
  • Assuming quantized models behave identically to full precision on all tasks: they often differ subtly
  • Quantizing very small models: accuracy loss is disproportionate at small scales
  • Ignoring activation precision: weight-only quantization (the common case) doesn't quantize activations; mixed strategies need care
  • Benchmarking only on standard tasks: your specific use case may be more or less quantization-sensitive
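
Several of these pitfalls come down to not measuring on your own task. The sketch below is one hedged way to gate a rollout: it compares exact-match accuracy of full-precision and quantized outputs on the same labeled examples. The function names, the exact-match metric, and the 1% default threshold are assumptions for illustration; substitute whatever metric and tolerance fit your use case.

```python
def accuracy(predictions, references):
    """Fraction of predictions that exactly match the reference answers."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def agreement(predictions_a, predictions_b):
    """Fraction of examples where two models give the same answer."""
    same = sum(a == b for a, b in zip(predictions_a, predictions_b))
    return same / len(predictions_a)


def quantization_report(baseline_preds, quantized_preds, references, max_drop=0.01):
    """Summarize the accuracy cost of quantization and whether it is within budget."""
    base = accuracy(baseline_preds, references)
    quant = accuracy(quantized_preds, references)
    return {
        "baseline_accuracy": base,
        "quantized_accuracy": quant,
        "drop": base - quant,
        # Agreement can be low even when accuracy matches: a sign of subtle drift.
        "agreement": agreement(baseline_preds, quantized_preds),
        "acceptable": (base - quant) <= max_drop,
    }


if __name__ == "__main__":
    references = ["4", "9", "16", "25"]
    full_precision = ["4", "9", "16", "25"]   # hypothetical model outputs
    int4 = ["4", "9", "15", "25"]             # hypothetical model outputs
    print(quantization_report(full_precision, int4, references))
```

Running the same report separately for math, coding, long-context, and general prompts makes task-specific sensitivity visible instead of averaging it away.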

FAQ

Questions about Quantization.

INT8 quantization typically costs less than 1% accuracy on most benchmarks. INT8 has become essentially free for production deployment of most modern LLMs. We always benchmark to confirm, but the historical pattern is reliable: INT8 wins on the cost-quality trade-off for almost all production cases.

INT4 quantization typically costs 1-3% accuracy on most general benchmarks, and sometimes 5-10% on math, coding, and reasoning tasks. Always benchmark on your specific task: INT4 sensitivity varies meaningfully by task type. AWQ tends to handle INT4 better than other methods.

AWQ usually wins on INT4 quality; GPTQ has broader tooling support and is well-tested at INT8. Both are good choices. We benchmark both on the specific task and pick whichever performs better: the quality differences are typically modest.

Fine-tuned models can be quantized: quantization is a post-training step that works with any model, including fine-tuned ones. Standard pattern: fine-tune in BF16 or FP16, then quantize the result for deployment. QLoRA is a related but different approach: train a LoRA adapter on a quantized base model, then deploy the quantized base and the LoRA adapter together.
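
The difference between the two patterns is easier to see in a toy forward pass. The sketch below is a simplified illustration, not the QLoRA implementation: it uses plain row-wise symmetric INT4 instead of the NF4 data type QLoRA uses, and the function names, matrices, and alpha value are hypothetical. The frozen base weights are stored as integer codes and dequantized for the matmul, while the small LoRA matrices A and B stay in floating point and are the only trainable part.

```python
def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def quantize_rowwise_int4(W):
    """Store each row as INT4 codes in [-7, 7] plus one float scale per row."""
    packed = []
    for row in W:
        absmax = max(abs(w) for w in row) or 1.0   # avoid a zero scale
        scale = absmax / 7
        packed.append(([round(w / scale) for w in row], scale))
    return packed


def dequantize_rowwise(packed):
    return [[q * scale for q in codes] for codes, scale in packed]


def qlora_forward(packed_base, A, B, x, alpha=16.0):
    """y = dequant(W_q) x + (alpha / r) * B (A x); A is r x in, B is out x r."""
    rank = len(A)
    base_out = matvec(dequantize_rowwise(packed_base), x)   # frozen quantized base
    update = matvec(B, matvec(A, x))                        # trainable low-rank path
    return [b + (alpha / rank) * u for b, u in zip(base_out, update)]


if __name__ == "__main__":
    W = [[0.5, -0.25], [0.1, 0.7]]
    packed = quantize_rowwise_int4(W)
    x = [1.0, 2.0]
    A = [[1.0, 0.0]]            # rank-1 adapter
    B_init = [[0.0], [0.0]]     # LoRA initializes B to zero, so training starts at the base model
    B_trained = [[0.5], [0.0]]  # hypothetical values after training
    print("base only:   ", qlora_forward(packed, A, B_init, x))
    print("with adapter:", qlora_forward(packed, A, B_trained, x))
```

With B at zero the output equals the quantized base model's output, which is why the adapter can be trained on top of a frozen quantized base and shipped alongside it.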
