What is PEFT (Parameter-Efficient Fine-Tuning)?
Parameter-Efficient Fine-Tuning (PEFT) is a family of fine-tuning techniques (including LoRA, QLoRA, prefix tuning, and prompt tuning) that update only a small subset of a model's parameters (typically 0.1-3%) while keeping the rest frozen, dramatically reducing memory requirements, training time, and storage cost compared to full fine-tuning.
Overview
PEFT is the dominant approach for fine-tuning large language models in production because full fine-tuning is prohibitively expensive at modern model scales. Fine-tuning all parameters of a 70B-parameter model requires hundreds of gigabytes of GPU memory, which puts it out of reach for most engineering teams. PEFT methods add small trainable components (typically 0.1-3% of total parameters) while keeping the base model frozen, which enables fine-tuning on a single consumer GPU for smaller models or on modest enterprise hardware for large ones. The Hugging Face PEFT library provides a common implementation of these methods, and the open-source community has settled on LoRA and QLoRA as the production defaults. At BearPlex, PEFT is the starting point for nearly every fine-tuning engagement; full fine-tuning is reserved for the rare cases where it is clearly justified.
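The "hundreds of gigabytes" figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes 12 bytes of training state per parameter (16-bit weights, 16-bit gradients, and two 32-bit Adam moments) and ignores activations and any 32-bit master copy of the weights; these are illustrative assumptions, not measurements, and the function name is hypothetical.

```python
# Rough memory estimate for full fine-tuning with Adam.
# Assumed bytes per parameter (illustrative, not measured):
#   2 (16-bit weights) + 2 (16-bit gradients) + 8 (two 32-bit Adam moments) = 12
# Activations and any 32-bit master copy of the weights are ignored,
# so real usage is higher than this estimate.

def full_finetune_gb(n_params: float, bytes_per_param: int = 12) -> float:
    """Return approximate training-state memory in gigabytes (10^9 bytes)."""
    return n_params * bytes_per_param / 1e9


full_ft_70b = full_finetune_gb(70e9)
print(f"70B full fine-tuning: roughly {full_ft_70b:.0f} GB of training state")
```

Even under these optimistic assumptions the estimate is far beyond a single GPU, which is the gap PEFT methods are meant to close.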
PEFT methods overview
(1) LoRA (Low-Rank Adaptation): adds small trainable low-rank matrices alongside frozen weight matrices to adapt the model's behavior; the most widely used PEFT method; typically trains 0.1-3% of parameters. (2) QLoRA (Quantized LoRA): combines LoRA with 4-bit quantization of the base model, dramatically reducing memory; enables fine-tuning of 65-70B models on a single 48GB GPU. (3) Prefix Tuning: prepends learnable prefix vectors to the input of every layer; smaller parameter footprint than LoRA but often weaker results. (4) Prompt Tuning: learns soft prompts (continuous embedding vectors) prepended to the input; very small parameter footprint, lower quality than LoRA. (5) Adapter Tuning: inserts small trainable adapter modules between transformer layers; predates LoRA and is now less common in production. (6) (IA)³ (Infused Adapter by Inhibiting and Amplifying Inner Activations): learns vectors that rescale inner activations; even smaller footprint than LoRA. LoRA and QLoRA dominate production usage; the others are niche.
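To make the LoRA idea concrete, here is a minimal pure-Python sketch of a single LoRA-adapted linear layer. It follows the formulation in the LoRA paper (frozen weight W plus a scaled low-rank update B·A, with B initialised to zero), but the class and function names are hypothetical and this is not how any particular library implements it.

```python
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Toy LoRA layer: y = W x + (alpha / r) * B (A x), with W frozen."""

    def __init__(self, weight, rank, alpha, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight            # frozen base weights, never updated
        self.scale = alpha / rank       # LoRA scaling factor
        # A (r x d_in) gets a small random init; B (d_out x r) starts at zero,
        # so the adapted layer initially behaves exactly like the base layer.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_params(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


def lora_fraction(d_in, d_out, rank):
    """Trainable LoRA parameters as a fraction of the frozen matrix size."""
    return rank * (d_in + d_out) / (d_in * d_out)


layer = LoRALinear(weight=[[1.0, 2.0], [3.0, 4.0]], rank=1, alpha=2.0)
print(layer.forward([1.0, 1.0]))        # identical to the base layer before training
print(lora_fraction(4096, 4096, 16))    # fraction trained for one 4096 x 4096 matrix
```

For a 4096 × 4096 matrix at rank 16, the adapter trains about 0.78% as many parameters as the matrix it adapts, which sits inside the 0.1-3% range quoted above.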
PEFT vs full fine-tuning trade-offs
PEFT advantages: dramatically lower memory use (often 10-100× less GPU RAM), faster training (sometimes 3-5×), much smaller artifacts (a LoRA adapter for a 70B model is often under 1GB, versus roughly 140GB for the model's 16-bit weights), and easy adapter swapping at inference time (one base model plus many specialized adapters for different tasks). PEFT trade-offs: quality is usually slightly lower than full fine-tuning (typically 1-3% on benchmarks, sometimes equivalent), the base model's fundamental behavior cannot be changed as deeply, and hyperparameters (rank, alpha, target modules) may need more tuning for best results. For 95%+ of production fine-tuning use cases, PEFT (specifically LoRA or QLoRA) wins on the cost-quality trade-off. Full fine-tuning is reserved for cases where the additional 1-3% quality matters and the budget supports it.
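The artifact-size comparison can be reproduced with simple arithmetic. The shapes below (80 layers, four adapted 8192 × 8192 projections per layer, rank 16) are illustrative assumptions chosen to resemble a 70B-class model, not the exact architecture of any real one.

```python
# Estimate LoRA adapter size versus base-model size at 16-bit precision.
# All shapes are hypothetical and chosen only for illustration.

BYTES_16BIT = 2


def lora_adapter_params(d_in, d_out, rank, matrices_per_layer, n_layers):
    """Each adapted matrix adds rank * (d_in + d_out) trainable parameters."""
    return rank * (d_in + d_out) * matrices_per_layer * n_layers


adapter_params = lora_adapter_params(
    d_in=8192, d_out=8192, rank=16, matrices_per_layer=4, n_layers=80
)
adapter_mb = adapter_params * BYTES_16BIT / 1e6   # adapter checkpoint size
base_gb = 70e9 * BYTES_16BIT / 1e9                # 70B weights at 16 bits

print(f"adapter: {adapter_params:,} params, about {adapter_mb:.0f} MB")
print(f"base model: about {base_gb:.0f} GB")
```

Under these assumptions the adapter lands well under 1GB while the base weights take 140GB, which is why storing and swapping many adapters per base model is cheap.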
Production PEFT patterns
Common production patterns: (1) Single LoRA per task: fine-tune one LoRA adapter per use case and swap adapters at inference time using LoRA-aware serving (vLLM supports this); (2) QLoRA for cost-optimized training: a 4-bit base model plus a LoRA adapter trains on dramatically smaller hardware; (3) DPO + LoRA: preference tuning via DPO using LoRA adapters, the standard pattern for open-source preference alignment; (4) Multi-LoRA serving: a single base model serves many tenants, each with its own LoRA adapter. Hugging Face PEFT, Unsloth (faster training), Axolotl (configuration-driven), and LLaMA-Factory (UI-driven) are the main libraries; vLLM and TGI both support LoRA-aware inference for production deployment.
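The multi-LoRA serving pattern boils down to a lookup from tenant to adapter in front of one shared base model. The sketch below is a toy routing layer with hypothetical class, tenant, and model names; it is not the API of vLLM or TGI, which handle adapter loading and batching internally.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Adapter:
    name: str
    base_model: str  # identifier of the base model the adapter was trained on


class AdapterRegistry:
    """Maps tenants to LoRA adapters that share a single base model."""

    def __init__(self, base_model: str):
        self.base_model = base_model
        self._by_tenant: Dict[str, Adapter] = {}

    def register(self, tenant: str, adapter: Adapter) -> None:
        # Adapters only make sense on the base model they were trained on.
        if adapter.base_model != self.base_model:
            raise ValueError(
                f"{adapter.name} targets {adapter.base_model}, "
                f"but this server runs {self.base_model}"
            )
        self._by_tenant[tenant] = adapter

    def resolve(self, tenant: str) -> Optional[Adapter]:
        """Return the tenant's adapter, or None to fall back to the base model."""
        return self._by_tenant.get(tenant)


registry = AdapterRegistry(base_model="example-base-v1")
registry.register("acme", Adapter("acme-support", "example-base-v1"))
print(registry.resolve("acme"))
print(registry.resolve("unknown-tenant"))
```

Rejecting mismatched adapters at registration time is a design choice in this sketch: it surfaces a base-model mismatch before any request is served rather than as degraded output later.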
Use cases
- Fine-tuning open-source LLMs on a single consumer or enterprise GPU
- Training many task-specific adapters that share a single base model
- Multi-tenant SaaS where each customer gets a custom LoRA adapter
- Domain-specific fine-tuning when full fine-tuning is too expensive
- Rapid iteration during fine-tuning experiments where training speed matters
Examples in production
Hugging Face PEFT
Open-source library that standardized PEFT methods for the LLM community; used in nearly every open-source fine-tuning project.
Microsoft (LoRA paper)
Hu et al.'s 'LoRA: Low-Rank Adaptation of Large Language Models' (2021) introduced the technique that became the dominant PEFT approach.
QLoRA paper (UW)
Dettmers et al.'s 'QLoRA: Efficient Finetuning of Quantized LLMs' (2023) made 65B-model fine-tuning accessible on a single 48GB GPU, democratizing large-model fine-tuning.
PEFT compared to alternatives
| Alternative | Choose PEFT when | Choose alternative when |
|---|---|---|
| Full fine-tuning (train all model parameters) | Use PEFT for 95%+ of production fine-tuning: much better cost-quality trade-off | Use full fine-tuning when 1-3% additional quality matters and budget supports it |
| Prompt engineering (modify model behavior via system prompts and few-shot examples) | Use PEFT when prompting can't reliably achieve desired behavior or per-call cost matters | Use prompting first; reach for PEFT when prompting hits its limits |
Common pitfalls
- Setting LoRA rank too low: this often underfits; rank 16-64 is a common starting point
- Targeting the wrong modules: adapting only attention layers, only MLP layers, or all linear layers makes a meaningful difference
- Skipping hyperparameter tuning: LoRA hyperparameters (rank, alpha, dropout, target modules) matter more than people often realize
- Mixing LoRA adapters trained on different base models: adapters are model-version-specific
- Trying full fine-tuning when PEFT would have worked: this wastes budget for marginal quality gains
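Several of these pitfalls come down to not sweeping LoRA hyperparameters systematically. A minimal sketch of a sweep grid follows; the keys `r`, `lora_alpha`, `lora_dropout`, and `target_modules` mirror keyword arguments of `peft.LoraConfig`, while `label` is a hypothetical bookkeeping field. The module names are Llama-style names and the alpha = 2 × rank rule is a common heuristic; both are assumptions to adjust for your model.

```python
from itertools import product

# Candidate values to compare on an eval set (illustrative, not recommendations).
RANKS = [16, 32, 64]
TARGET_SETS = {
    # Llama-style module names; other architectures name these differently.
    "attention_only": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "all_linear": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
}


def build_sweep(ranks=RANKS, target_sets=TARGET_SETS, dropout=0.05):
    """Return one config dict per (rank, target-module set) combination."""
    configs = []
    for rank, (label, modules) in product(ranks, target_sets.items()):
        configs.append({
            "label": f"r{rank}-{label}",   # bookkeeping only, not a LoraConfig field
            "r": rank,
            "lora_alpha": 2 * rank,        # heuristic assumption: alpha = 2 * rank
            "lora_dropout": dropout,
            "target_modules": list(modules),
        })
    return configs


sweep = build_sweep()
for cfg in sweep:
    print(cfg["label"], cfg["r"], cfg["lora_alpha"], len(cfg["target_modules"]))
```

Each dictionary (minus `label`) could be passed as keyword arguments to a LoRA configuration object, with every run scored on the same held-out eval set.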
Questions about PEFT.
Can I fine-tune a 70B model on a single GPU?
Yes, with QLoRA on a 48GB or 80GB GPU. The base model is loaded with 4-bit quantization (~35GB for 70B) and the LoRA adapter trains alongside it. Without QLoRA this is impractical on one GPU: full fine-tuning a 70B model requires 8+ A100 80GB GPUs.
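The ~35GB figure follows directly from storing 70B parameters at 4 bits each. The sketch below adds a rough allowance for the adapter's own training state; the adapter size (84M parameters) and the 12 bytes per trainable parameter are hypothetical assumptions, and activations, quantization constants, and framework overhead are ignored.

```python
# Rough QLoRA memory split: quantized frozen base + small trainable adapter.
# Figures are estimates under the stated assumptions, not measurements.

def qlora_memory_gb(n_base_params, n_adapter_params,
                    base_bits=4, adapter_bytes_per_param=12):
    """Return (frozen base GB, adapter training-state GB)."""
    frozen_gb = n_base_params * base_bits / 8 / 1e9
    # Adapter weights + gradients + Adam moments, assumed 12 bytes per parameter.
    trainable_gb = n_adapter_params * adapter_bytes_per_param / 1e9
    return frozen_gb, trainable_gb


frozen_gb, trainable_gb = qlora_memory_gb(n_base_params=70e9, n_adapter_params=84e6)
print(f"4-bit base: about {frozen_gb:.0f} GB, adapter state: about {trainable_gb:.1f} GB")
```

The remaining headroom on a 48GB card goes to activations and overhead, which is why 48GB rather than 35GB is the practical minimum quoted above.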
What LoRA rank should I use?
Start with rank 16 or 32. Higher ranks (64-128) provide more capacity but add more parameters to train. For most tasks, rank 16-32 is sufficient; for tasks requiring deep behavior changes, rank 64-128 may help. Always measure on your eval set rather than guessing.
Can I serve multiple LoRA adapters from a single base model?
Yes. Modern inference engines (vLLM, TGI) support multi-LoRA serving, where one base model handles requests using many different loaded LoRA adapters. This is the standard pattern for multi-tenant SaaS where each customer needs custom behavior: a single base model serves all customers, and the per-customer LoRA adapter is applied at request time.