Can I fine-tune a 70B model with PEFT on a single GPU?

Yes, with QLoRA on a 48GB or 80GB GPU. The base model is loaded with INT4 quantization (~35GB for 70B), and the LoRA adapter trains on top of it. Without QLoRA this is out of reach for a single GPU: full fine-tuning of a 70B model requires 8+ A100 80GB GPUs.
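The memory figures above follow from simple arithmetic. The sketch below reproduces them; the 16-bytes-per-parameter figure for full fine-tuning is a common rule of thumb for mixed-precision Adam and is an assumption here, not a number from this page. It also ignores activations and framework overhead.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory needed just to hold the weights, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


PARAMS_70B = 70e9

int4_gb = weight_memory_gb(PARAMS_70B, 4)    # quantized base model used by QLoRA
fp16_gb = weight_memory_gb(PARAMS_70B, 16)   # half-precision weights

# Assumption: ~16 bytes per parameter for full fine-tuning with Adam in mixed
# precision (fp16 weights and gradients, fp32 master weights, two fp32
# optimizer moments). Activations are not counted.
full_ft_gb = PARAMS_70B * 16 / 1e9

print(f"INT4 weights: {int4_gb:.0f} GB")
print(f"FP16 weights: {fp16_gb:.0f} GB")
print(f"Full fine-tuning state: ~{full_ft_gb:.0f} GB")
```

Under these assumptions the quantized weights fit on one 48GB card with room left for the adapter and activations, while the full fine-tuning state has to be sharded across many 80GB GPUs.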

What LoRA rank should I use?

Start with rank 16 or 32. Higher ranks (64-128) provide more capacity but add more parameters to train. For most tasks, rank 16-32 is sufficient; for tasks requiring deep behavior changes, rank 64-128 may help. Always measure on your eval set rather than guessing.
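To make the capacity-versus-cost trade-off concrete, here is a small sketch of how the number of trainable parameters grows with rank for a single adapted weight matrix. The 4096 hidden size is an illustrative assumption, not a property of any particular model.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA learns A (rank x d_in) and B (d_out x rank), so the count grows
    linearly with rank.
    """
    return rank * (d_in + d_out)


# Hypothetical square projection; 4096 is an illustrative hidden size.
D = 4096
full = D * D

for rank in (16, 32, 64, 128):
    added = lora_params(D, D, rank)
    print(f"rank {rank:>3}: {added:>9,} params ({added / full:.2%} of the full matrix)")
```

Doubling the rank doubles the trainable parameters, so a sweep over 16, 32, and 64 is cheap to reason about before running it against the eval set.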

Can I serve multiple LoRA adapters with a single base model?

Yes: modern inference engines (vLLM, TGI) support multi-LoRA serving, where one base model handles requests for many different LoRA adapters. This is the standard pattern for multi-tenant SaaS where each customer needs custom behavior: a single base model serves all customers, and the per-customer LoRA adapter is loaded at request time.
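The idea of loading per-customer adapters at request time can be sketched as a small least-recently-used cache. This is a toy illustration only: the `AdapterCache` class and `load_fn` callable are hypothetical names, and real engines such as vLLM and TGI manage adapter loading internally with their own APIs.

```python
from collections import OrderedDict


class AdapterCache:
    """Toy LRU cache mapping tenant IDs to loaded adapters.

    `load_fn` is a hypothetical callable that fetches adapter weights.
    """

    def __init__(self, load_fn, max_loaded: int = 2):
        self.load_fn = load_fn
        self.max_loaded = max_loaded
        self._loaded = OrderedDict()

    def get(self, tenant_id: str):
        if tenant_id in self._loaded:
            self._loaded.move_to_end(tenant_id)  # mark as most recently used
        else:
            if len(self._loaded) >= self.max_loaded:
                self._loaded.popitem(last=False)  # evict least recently used
            self._loaded[tenant_id] = self.load_fn(tenant_id)
        return self._loaded[tenant_id]

    def loaded_tenants(self):
        return list(self._loaded)


loads = []


def fake_load(tenant_id):
    """Stand-in loader that records each load instead of reading weights."""
    loads.append(tenant_id)
    return f"adapter-for-{tenant_id}"


cache = AdapterCache(fake_load, max_loaded=2)
for tenant in ["acme", "globex", "acme", "initech"]:
    cache.get(tenant)

print(cache.loaded_tenants())  # tenants whose adapters are currently resident
print(loads)                   # order in which adapters were actually loaded
```

The base model is never reloaded in this scheme; only the small adapters move in and out of memory.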

What is PEFT (Parameter-Efficient Fine-Tuning)?

Is LoRA the same as PEFT?

LoRA is a specific PEFT method, not the full category. PEFT is the broader family: LoRA, QLoRA, prefix tuning, prompt tuning, adapters, IA³. LoRA happens to be the most widely used PEFT method in production, so the terms often get used interchangeably.

Parameter-Efficient Fine-Tuning (PEFT) is a family of fine-tuning techniques (including LoRA, QLoRA, prefix tuning, and prompt tuning) that update only a small subset of a model's parameters (typically 0.1-3%) while keeping the rest frozen, dramatically reducing memory requirements, training time, and storage cost compared to full fine-tuning.

Last updated 2026-04-29 · BearPlex AI Engineering Team

Overview

PEFT is the dominant approach for fine-tuning large language models in production because the economics of full fine-tuning are prohibitive at modern model scales. A 70B-parameter model requires hundreds of gigabytes of GPU memory to fine-tune all parameters, putting it out of reach for most engineering teams. PEFT methods train only a small set of parameters (typically 0.1-3% of the total) while keeping the base model frozen, which enables fine-tuning on a single consumer GPU for smaller models or on modest enterprise hardware for large models. The Hugging Face PEFT library has standardized these methods, and the open-source community has settled on LoRA and QLoRA as the production defaults. At BearPlex, PEFT is the starting point for nearly every fine-tuning engagement; full fine-tuning is reserved for the rare cases where it's clearly justified.

PEFT methods overview

(1) LoRA (Low-Rank Adaptation): adds small trainable low-rank matrices alongside the frozen weights to adapt the model's behavior; the most widely used PEFT method; typically trains 0.1-3% of parameters.
(2) QLoRA (Quantized LoRA): combines LoRA with INT4 quantization of the base model, dramatically reducing memory; enables 70B-model fine-tuning on a single 48GB or 80GB GPU.
(3) Prefix Tuning: prepends learnable prefix tokens to every layer's input; smaller parameter footprint than LoRA but often weaker results.
(4) Prompt Tuning: learns soft prompts (continuous embedding vectors) prepended to the input; very small parameter footprint, lower quality than LoRA.
(5) Adapter Tuning: inserts small trainable adapter modules between transformer layers; predates LoRA, less common in production now.
(6) IA³ (Infused Adapter by Inhibiting and Amplifying Inner Activations): learns vectors that rescale inner activations; even smaller footprint than LoRA.
LoRA and QLoRA dominate production usage; the others are niche.
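As a rough illustration of what "small trainable matrices" means for LoRA, the sketch below implements the forward pass for one adapted layer in plain Python: the frozen weight W is left alone and a low-rank product B·A, scaled by alpha / rank, is added to its output. The dimensions and values are toy assumptions; real implementations use tensor libraries and apply this per targeted module.

```python
import random


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, alpha, rank):
    """h = W x + (alpha / rank) * B (A x); W stays frozen, only A and B train."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


random.seed(0)
d_in, d_out, rank, alpha = 6, 4, 2, 4  # toy sizes for illustration

W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]     # frozen
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]   # trainable
B = [[0.0] * rank for _ in range(d_out)]                                  # trainable, zero init
x = [random.gauss(0, 1) for _ in range(d_in)]

# With B initialized to zero the adapter is a no-op, so training starts
# from exactly the base model's behavior.
assert lora_forward(W, A, B, x, alpha, rank) == matvec(W, x)

# Pretend one training step moved B away from zero: the output now shifts.
B_trained = [row[:] for row in B]
B_trained[0][0] = 1.0
print(lora_forward(W, A, B_trained, x, alpha, rank)[0] - matvec(W, x)[0])
```

The adapter artifact is just A and B for each adapted layer, which is why it is so much smaller than the base model.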

PEFT vs full fine-tuning trade-offs

PEFT advantages: dramatically lower memory (often 10-100× less GPU RAM), faster training (sometimes 3-5×), much smaller artifacts (a LoRA adapter for a 70B model is often <1GB vs the model's 140GB in FP16), and easy swapping of adapters at inference time (one base model plus many specialized adapters for different tasks). PEFT trade-offs: usually slightly lower quality than full fine-tuning (typically 1-3% on benchmarks, sometimes equivalent), less ability to change the base model's fundamental behavior, and sometimes more hyperparameter tuning (rank, alpha, target modules) to get the best results. For 95%+ of production fine-tuning use cases, PEFT (specifically LoRA or QLoRA) wins on the cost-quality trade-off. Full fine-tuning is reserved for cases where the additional 1-3% quality matters and the budget supports it.
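The artifact-size claim can be checked with a back-of-the-envelope estimate. The layer count, hidden size, and projection shapes below are illustrative assumptions for a 70B-class model with grouped-query attention, and only attention projections are adapted; adapting all linear layers would give a larger adapter.

```python
def adapter_size_mb(shapes, rank, num_layers, bytes_per_param=2):
    """Approximate on-disk size of a LoRA adapter in decimal megabytes.

    shapes: (d_in, d_out) for each adapted matrix in one transformer layer.
    """
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes)
    return per_layer * num_layers * bytes_per_param / 1e6


# Hypothetical 70B-class layout: 80 layers, hidden size 8192, and
# 1024-wide key/value projections. Illustrative numbers only.
attention_shapes = [(8192, 8192), (8192, 1024), (8192, 1024), (8192, 8192)]

size_mb = adapter_size_mb(attention_shapes, rank=16, num_layers=80)
base_model_mb = 140_000  # the 140GB figure quoted above (70B params at 2 bytes each)

print(f"adapter: ~{size_mb:.0f} MB, about 1/{base_model_mb / size_mb:.0f} of the base model")
```

Under these assumptions a rank-16 adapter is on the order of a hundred megabytes, comfortably under the <1GB figure.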

Production PEFT patterns

Common production patterns:
(1) Single LoRA per task: fine-tune one LoRA adapter per use case and swap at inference time using LoRA-aware serving (vLLM supports this).
(2) QLoRA for cost-optimized training: an INT4 base model plus a LoRA adapter trains on dramatically smaller hardware.
(3) DPO + LoRA: preference tuning via DPO using LoRA adapters; the standard pattern for open-source preference alignment.
(4) Multi-LoRA serving: a single base model serves many tenants, each with their own LoRA adapter.
Hugging Face PEFT, Unsloth (faster training), Axolotl (configuration-driven), and LLaMA-Factory (UI-driven) are the main libraries; vLLM and TGI both support LoRA-aware inference for production deployment.
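A minimal sketch of the QLoRA pattern using Hugging Face `transformers` and `peft` might look like the following. It assumes `torch`, `transformers`, `peft`, and `bitsandbytes` are installed and a CUDA GPU is available. The hyperparameter values and the Llama-style module names (`q_proj`, `k_proj`, `v_proj`, `o_proj`) are illustrative assumptions, and `build_qlora_model` is a hypothetical helper name, not part of any library.

```python
# Hyperparameters kept as plain data so they can be logged and swept.
# All values are illustrative starting points, not recommendations.
LORA_HPARAMS = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    # Assumption: Llama-style attention module names; other architectures differ.
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
}


def build_qlora_model(model_name: str):
    """Load a 4-bit base model and attach a LoRA adapter.

    Imports are deferred so this file can be imported without the GPU stack.
    """
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    # 4-bit NF4 quantization for the frozen base weights.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_name, quantization_config=bnb_config, device_map="auto"
    )
    model = prepare_model_for_kbit_training(model)

    # Only the LoRA matrices are trainable after this call.
    lora_config = LoraConfig(task_type="CAUSAL_LM", **LORA_HPARAMS)
    return get_peft_model(model, lora_config)
```

Calling `build_qlora_model` with a model identifier returns a model that can be passed to a standard training loop; the resulting adapter can then be saved separately from the base weights.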

Use cases

Fine-tuning open-source LLMs on a single consumer or enterprise GPU
Training many task-specific adapters that share a single base model
Multi-tenant SaaS where each customer gets a custom LoRA adapter
Domain-specific fine-tuning when full fine-tuning is too expensive
Rapid iteration during fine-tuning experiments where training speed matters

Examples in production

Hugging Face PEFT

Open-source library that standardized PEFT methods for the LLM community; used in nearly every open-source fine-tuning project.

Microsoft (LoRA paper)

Hu et al.'s 'LoRA: Low-Rank Adaptation of Large Language Models' (2021) introduced the technique that became the dominant PEFT approach.

QLoRA paper (UW)

Dettmers et al.'s 'QLoRA: Efficient Finetuning of Quantized LLMs' (2023) made 65B-model fine-tuning accessible on a single 48GB GPU, democratizing large-model fine-tuning.

PEFT compared to alternatives

Alternative	Choose PEFT when	Choose alternative when
Full fine-tuning (train all model parameters)	Use PEFT for 95%+ of production fine-tuning: much better cost-quality trade-off	Use full fine-tuning when 1-3% additional quality matters and budget supports it
Prompt engineering (modify model behavior via system prompts and few-shot examples)	Use PEFT when prompting can't reliably achieve desired behavior or per-call cost matters	Use prompting first; reach for PEFT when prompting hits its limits

Common pitfalls

Setting LoRA rank too low: often underfits; rank 16-64 is a common starting point
Targeting the wrong modules: adapting only attention layers, only MLP layers, or all linear layers makes a meaningful difference
Skipping hyperparameter tuning: LoRA hyperparameters (rank, alpha, dropout, target modules) matter more than people often realize
Mixing LoRA adapters trained on different base models: adapters are model-version-specific
Trying full fine-tuning when PEFT would have worked: wastes budget for marginal quality gains
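The base-model mismatch pitfall can be guarded against with a simple check before loading. The sketch below assumes the adapter was saved by Hugging Face PEFT, which writes an `adapter_config.json` containing a `base_model_name_or_path` field; the helper name and the model names in the example are made up for illustration.

```python
import json


def check_adapter_matches_base(adapter_config_json: str, base_model_name: str) -> None:
    """Raise if an adapter was trained against a different base model."""
    config = json.loads(adapter_config_json)
    trained_on = config.get("base_model_name_or_path")
    if trained_on != base_model_name:
        raise ValueError(
            f"adapter was trained on {trained_on!r}, not {base_model_name!r}"
        )


# Stand-in for the contents of an adapter_config.json; model names are made up.
example_config = json.dumps(
    {"base_model_name_or_path": "example-org/base-7b-v1", "r": 16}
)

check_adapter_matches_base(example_config, "example-org/base-7b-v1")  # passes silently

try:
    check_adapter_matches_base(example_config, "example-org/base-7b-v2")
except ValueError as err:
    print(err)
```

A string comparison will not catch every case (for example, a local path pointing at the same weights), so treat it as a first line of defense rather than a guarantee.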

Related terms

LoRA, Fine-tuning, Quantization, DPO

Related BearPlex services

Model Engineering & Fine-Tuning, RLHF & AI Alignment
