What are the disadvantages of LoRA?

Four matter in production. It underperforms full fine-tuning when training data exceeds adapter capacity. It pays a loss penalty at very large batch sizes regardless of rank. It learns structurally different solutions (intruder dimensions, per NeurIPS 2025 work) that accumulate across sequential fine-tunes and hurt continual learning. And it adds hyperparameter surface: rank, alpha, target modules, plus an optimal learning rate around 10x the full fine-tuning rate.

What is the difference between LoRA and QLoRA?

LoRA trains low-rank adapters on top of a frozen 16-bit base model. QLoRA quantizes that frozen base to 4-bit NF4 first, then trains the same adapters, cutting memory enough that the original paper fine-tuned a 65B model on a single 48GB GPU; its Guanaco model reached 99.3% of ChatGPT's Vicuna benchmark score after 24 hours of training on one GPU. Use QLoRA when GPU memory is the binding constraint; use plain LoRA when you have headroom.

What rank should I use for LoRA?

Start at 16 for style and format work, 32 to 64 for harder domain tasks, with alpha set equal to or double the rank. Treat rank as a capacity dial: Thinking Machines found high ranks (256 and up) were needed to match full fine-tuning on larger datasets. Two settings matter more than rank: put adapters on all layers rather than attention only, and raise the learning rate to roughly 10x what you would use for full fine-tuning.

When should you use full fine-tuning instead of LoRA?

Three cases with evidence behind them: continued pretraining or any dataset large enough to exceed adapter capacity, training plans built on very large batch sizes, and programs that fine-tune the same model repeatedly, where LoRA's intruder dimensions accumulate. A fourth is operational rather than quality-driven: shipping one standalone checkpoint to people who will never manage adapters.

How much GPU memory do you need to fine-tune a 7B model?

Roughly 60GB for a standard full fine-tune of Llama-2-7B versus about 23GB with LoRA, per the LoRA survey (arXiv 2407.11046). That difference is why LoRA fits on a single 24GB consumer card and full fine-tuning does not. QLoRA pushes lower still by holding the base model in 4-bit. Add headroom for long sequences and larger batches.

Can you serve multiple LoRA adapters on one GPU?

Yes. vLLM serves adapters per request on a shared base model, loads and unloads them dynamically at runtime, and as of 2026 supports multi-LoRA on mixture-of-experts bases as well. This is the standard multi-tenant pattern: one deployment, one base model, one adapter per customer or task, selected at request time.

Does LoRA cause catastrophic forgetting?

Less than full fine-tuning on the same data; that finding has held up repeatedly. The subtlety from the NeurIPS 2025 intruder-dimensions paper is that LoRA's forgetting concentrates in specific singular directions the adapter introduces, which degrade the model's fit to its pretraining distribution and stack up if you fine-tune sequentially. For a one-shot adapter this is rarely a practical problem; for repeated rounds on the same weights it is.

Do OpenAI and other managed fine-tuning services use LoRA?

OpenAI does not say. Its remaining lineup is supervised fine-tuning and DPO on the GPT-4.1 family and reinforcement fine-tuning on o4-mini, with the method undisclosed, no weights returned to you, and the platform winding down (no new training jobs after January 6, 2027). Thinking Machines' Tinker is explicitly LoRA-only, which is how it shares one compute pool across many customers' training runs, and it covers open-weight families up to Qwen3-235B MoE scale.

How much does LoRA fine-tuning cost in 2026?

Ground it in GPU-hours: on-demand H100s rent for about $4 per GPU-hour and A100 80GB for $2.79 (Lambda, July 2026). Adapter runs on small and mid-size models typically finish in single-digit GPU-hours, so most experiments land in the tens of dollars. Full fine-tuning multiplies both the hardware (an 8x H100 node is about $32 per hour) and the duration, which is what pushes a single experiment into the thousands.

Start a conversation

Decision framework

LoRA vs Full Fine-Tuning: Which to Choose in 2026

TL;DR

Default to LoRA for production fine-tuning in 2026 and treat full fine-tuning as a deliberate exception, not a gold standard you are settling below. Thinking Machines' September 2025 research showed LoRA matches full fine-tuning on typical post-training datasets when adapters are applied to every layer (not just attention) with a learning rate about 10x the full fine-tuning optimum, and it matches even at rank 1 for reinforcement learning. Full fine-tuning still wins in three regimes: training data that exceeds adapter capacity (continued pretraining on billions of domain tokens), very large batch sizes, and repeated sequential fine-tuning of the same model, where LoRA's 'intruder dimensions' accumulate. The economics land hard on LoRA's side: at July 2026 on-demand rates (about $4 per H100 GPU-hour), an adapter experiment costs tens of dollars while a multi-day full fine-tune on an 8x H100 node runs into the thousands.

Side-by-side comparison

Dimension	LoRA / QLoRA (Parameter-Efficient Fine-Tuning)	Full Fine-Tuning
Parameters updated	Low-rank adapter matrices, a fraction of a percent of weights	All weights
Quality on typical post-training datasets	Parity with full fine-tuning when adapters cover all layers (Thinking Machines, Sept 2025)	The baseline LoRA is measured against
Quality when data exceeds adapter capacity	Falls behind; training efficiency degrades	Keeps scaling; the clear winner
RL post-training (RLHF, verifier RL)	Matches full fine-tuning even at rank 1	No measured advantage over LoRA
Large-batch training	Pays a loss penalty as batch size grows, at any rank	Tolerates large batches better
GPU memory to fine-tune a 7B model	About 23GB, fits a 24GB card; QLoRA lower still	About 60GB (LoRA survey, arXiv 2407.11046)
Largest model on a single GPU	65B on one 48GB GPU via QLoRA (original paper)	Not viable; 70B-class needs multi-GPU nodes
Compute per training pass	Roughly two-thirds of full fine-tuning FLOPs	Full forward and backward on every parameter
Cost per experiment at July 2026 rental rates	Tens of dollars: single-digit H100 GPU-hours at about $4/hr	Thousands: multi-day runs on an 8x H100 node at about $32/hr
Artifact size, 70B class	Megabytes to low gigabytes per adapter	About 140GB per checkpoint in 16-bit
Multi-tenant serving	One base model, per-request adapters in vLLM, dynamic runtime loading, MoE bases supported	One full deployment per task
Forgetting of base capabilities	Forgets less on the same training data	Forgets more, but with no intruder dimensions
Repeated sequential fine-tuning	Intruder dimensions accumulate and degrade continual learning (NeurIPS 2025)	More robust across repeated rounds
Hyperparameters	Rank, alpha, target modules; optimal learning rate about 10x the full fine-tuning rate	Standard training knobs only
Managed options in 2026	Tinker (LoRA-only, open-weight families up to Qwen3-235B MoE); Unsloth, Axolotl, TRL for self-serve	Mostly DIY on your own cluster; OpenAI's managed SFT/DPO/RFT does not disclose its method or return weights

LoRA / QLoRA (Parameter-Efficient Fine-Tuning)

Train small adapters on a frozen base. The production default, now with research to back it.

LoRA (Low-Rank Adaptation) freezes the base model and trains small low-rank matrices injected into its layers, updating a fraction of a percent of the parameters. QLoRA quantizes the frozen base to 4-bit NF4 first; the original paper fine-tuned a 65B model on a single 48GB GPU. The research picture sharpened in 2025. Thinking Machines' 'LoRA Without Regret' showed LoRA matches full fine-tuning on typical post-training datasets when adapters cover all layers (MLP and MoE blocks included, not attention-only), and their LoRA-only Tinker API now runs managed fine-tuning on open-weight families up to Qwen3-235B MoE scale on exactly that bet. The trade is not free: NeurIPS 2025 work on 'intruder dimensions' showed LoRA reaches similar task performance through structurally different weight updates that behave worse under repeated sequential fine-tuning. In production, the operational upsides usually dominate: adapters are megabytes instead of a full model copy, vLLM loads them per request so one base model serves many customers or tasks, and each experiment costs single-digit GPU-hours at July 2026 rental rates.

Pros

Matches full fine-tuning quality on typical post-training datasets when adapters cover all layers, per Thinking Machines' September 2025 study
Roughly two-thirds of the training FLOPs of full fine-tuning per pass
QLoRA fits very large models on one card: the original paper tuned a 65B model on a single 48GB GPU
Adapter artifacts are megabytes to low gigabytes, versus about 140GB for a 70B model checkpoint in 16-bit
Multi-adapter serving is built into vLLM: per-request adapters, dynamic runtime loading, and MoE base support as of 2026
Matches full fine-tuning in policy-gradient RL post-training even at rank 1
Forgets less of the base model's pretraining than full fine-tuning on the same data
Cheap iteration: at about $4 per on-demand H100 GPU-hour (Lambda, July 2026), most adapter experiments cost tens of dollars

Cons

Falls behind full fine-tuning once the dataset exceeds adapter capacity, with worsening training efficiency
Pays a growing loss penalty at very large batch sizes, independent of rank
Learns structurally different solutions (intruder dimensions) that degrade the model as a fit to its pretraining distribution
Intruder dimensions accumulate across sequential fine-tunes, hurting continual learning
More hyperparameter surface: rank, alpha, target modules, and an optimal learning rate about 10x the full fine-tuning rate
Attention-only adapter placement, still the default in many older tutorials, measurably underperforms
Adapters are pinned to one base model version; a base upgrade means retraining every adapter

Best for

→ Most production fine-tunes: tone, format, domain workflows, classification, and agent behavior on roughly 500 to 50,000 examples
→ Multi-tenant SaaS where each customer gets an adapter and one base model serves everyone via vLLM per-request loading
→ RL-based post-training (RLHF, verifier-based RL), where low-rank adapters match full fine-tuning

Worst for

→ Continued pretraining on billions of domain tokens, where the data exceeds what adapters can absorb
→ Programs that fine-tune the same model sequentially many times without resetting to the base
→ Shipping a single standalone checkpoint to third parties who will not manage base-plus-adapter pairs

Cost model

GPU rental, July 2026 on-demand (Lambda): about $4 per H100 GPU-hour, $2.79 per A100 80GB GPU-hour. Typical adapter runs finish in single-digit GPU-hours, so experiments land in the tens of dollars; the QLoRA paper's Guanaco run took 24 hours on a single GPU.

Time to value

Hours per experiment; a first production-grade adapter typically ships within days, including evaluation.

Full Fine-Tuning

Update every weight. The expensive exception with real, defensible territory.

Full fine-tuning updates every parameter of the model, which makes it the highest-capacity form of adaptation and the only one that fully reshapes what the model knows. In 2026 it is the exception in applied work, but its territory is real and backed by evidence rather than habit. When training data outgrows adapter capacity (continued pretraining on a large legal, clinical, or codebase corpus), full fine-tuning pulls ahead and LoRA's training efficiency deteriorates. It tolerates very large batch sizes better. And NeurIPS 2025 analysis showed its weight updates stay spectrally similar to the pretrained model, avoiding the intruder dimensions that make LoRA-adapted models drift under repeated sequential fine-tuning. The bill is the deterrent: a standard full fine-tune of even a 7B model needs roughly 60GB of training memory per the LoRA survey, 70B-class models are multi-GPU distributed-training territory (FSDP or DeepSpeed on one or more 8x 80GB nodes), and every experiment repeats that spend. Managed providers hide the infrastructure but not the constraint: OpenAI's endpoints (SFT and DPO on the GPT-4.1 family, reinforcement fine-tuning on o4-mini) trade control for convenience, never hand you the weights, and are winding down anyway (no new training jobs after January 6, 2027).

Pros

Highest adaptation capacity: keeps improving as training data scales past what adapters can absorb
Handles very large batch training with less quality penalty than LoRA
Weight updates stay structurally close to the pretrained model: no intruder dimensions
More robust across repeated sequential fine-tuning rounds on the same model
Produces one standalone checkpoint, the simplest artifact to distribute, archive, or audit
No adapter-specific hyperparameters: standard training knobs only

Cons

Memory-hungry: roughly 60GB to fully fine-tune a 7B model per the LoRA survey, versus about 23GB with LoRA
70B-class models require multi-GPU distributed training; there is no single-card path
Every experiment re-spends the full training cost, so iteration is slow and expensive
Artifacts are full model copies (about 140GB for a 70B model in 16-bit), multiplying storage and deployment per task
One model per task: nothing equivalent to vLLM's per-request multi-adapter serving
Forgets more of the base model's general capabilities than LoRA trained on the same data

Best for

→ Continued pretraining or domain adaptation on corpora far beyond adapter capacity
→ Building a model you will distribute as a single open checkpoint to people who will never manage adapters
→ Long-running programs that re-fine-tune the same model repeatedly, where LoRA's intruder dimensions accumulate

Worst for

→ Style, format, and workflow adaptation on small and mid-size datasets, where LoRA reaches parity for a fraction of the cost
→ Multi-tenant products that need per-customer behavior on shared serving infrastructure
→ Teams that need cheap, fast iteration loops to find the right data mix before committing compute

Cost model

Compute-bound: an 8x H100 node rents for about $32 per hour on demand (8 x $3.99, Lambda, July 2026), and serious runs take days, so budgets run from thousands to tens of thousands of dollars per training cycle before evaluation reruns.

Time to value

Days to weeks per cycle once distributed training is set up; the setup itself often takes longer than a first LoRA experiment end to end.

Decision scenarios

Teaching Llama 4 Scout or Qwen3 a support workflow with 20,000 labeled tickets

→ LoRA / QLoRA (Parameter-Efficient Fine-Tuning)

Squarely inside adapter capacity. Apply LoRA to all layers, start at rank 16 to 32, and expect parity with full fine-tuning at a small fraction of the cost.

Multi-tenant SaaS where 200 customers each need their own model behavior

→ LoRA / QLoRA (Parameter-Efficient Fine-Tuning)

Train one adapter per customer and serve them all from a single base model with vLLM's per-request LoRA loading. Two hundred full fine-tunes would mean two hundred separate deployments.

Continued pretraining on tens of billions of tokens of legal or clinical text

→ Full Fine-Tuning

This is the capacity regime where LoRA measurably falls behind and its training efficiency degrades. Budget for multi-node full fine-tuning, or scope down to retrieval if the budget is not there.

RLHF or verifier-based RL on top of an instruction-tuned model

→ LoRA / QLoRA (Parameter-Efficient Fine-Tuning)

Thinking Machines showed policy-gradient RL matches full fine-tuning even at rank 1, because RL absorbs far less information per episode than supervised training does per token. Paying for full fine-tuning here buys nothing.

Releasing an open-weight vertical model that customers download and run themselves

→ Full Fine-Tuning

Ship one standalone checkpoint. Asking third parties to manage base-plus-adapter pairing creates support burden and version drift you will own forever.

A model you plan to re-fine-tune every quarter as the product changes

→ Both

If you must stack updates onto the same weights, full fine-tuning is more robust: sequential LoRA rounds accumulate intruder dimensions that hurt continual learning. The cheaper pattern that usually works is to keep the base frozen and train each quarter's adapter fresh from the clean base instead of fine-tuning sequentially.

One 48GB GPU, a large open-weight model to adapt, no cluster budget

→ LoRA / QLoRA (Parameter-Efficient Fine-Tuning)

QLoRA exists for exactly this: the original paper fine-tuned a 65B model on a single 48GB card by holding the base in 4-bit NF4 and training adapters on top.

A large training set plus a training plan built on very large batch sizes

→ Full Fine-Tuning

LoRA pays a growing loss penalty as batch size increases, and the gap does not close with higher rank. If throughput targets force large batches, full fine-tuning holds quality better.

FAQ

Common questions

On typical post-training datasets, yes, if it is configured correctly. Thinking Machines' September 2025 study found LoRA matches full fine-tuning when adapters are applied to all layers (especially MLP and MoE blocks, not just attention) and the dataset fits within adapter capacity. It stops being equal in two regimes: datasets large enough to exceed what the adapter can absorb, and very large batch sizes. Benchmark on your own eval set before deciding either way.

Related comparisons

Related services

Featured case studies

Get a recommendation tailored to your situation

BearPlex builds production AI systems using both approaches. We'll tell you which fits your case in a 30-minute scoping call.

Talk to BearPlex See case studies

LoRA vs Full Fine-Tuning: Which to Choose in 2026

Side-by-side comparison

LoRA / QLoRA (Parameter-Efficient Fine-Tuning)

Pros

Cons

Best for

Worst for

Full Fine-Tuning

Pros

Cons

Best for

Worst for

Decision scenarios

Common questions

Related comparisons

Related services

Featured case studies

Related reading

Get a recommendation tailored to your situation