What's the difference between LoRA, QLoRA, and full fine-tuning?

Full fine-tuning updates all model weights (best quality, highest cost, needs huge GPU clusters). LoRA freezes the base model and trains small adapter matrices (90% of the quality, 1% of the cost). QLoRA quantizes the base model to 4-bit and trains LoRA adapters (fits on a single consumer GPU). We use LoRA for most production work.

Which open models do you fine-tune?

Llama 4 (Meta license), Qwen 3 (strong reasoning), Gemma 3 (Google), Phi-4 (SLM for edge deployment), DeepSeek, and Mistral (commercial use allowed). For closed models we fine-tune via OpenAI and Anthropic fine-tuning APIs where available.

How much training data do we need?

For style/format fine-tuning: 500-2,000 high-quality examples typically suffice. For knowledge injection: you probably want RAG instead of fine-tuning. For domain reasoning: 5,000-50,000 examples + synthetic data augmentation. Quality >> quantity: 500 curated examples beat 50,000 messy ones.

Where do you deploy fine-tuned models?

Sovereign deployment by default: your VPC on AWS/GCP/Azure via vLLM, TGI, or TensorRT-LLM. Anyscale or Modal for serverless inference. Together.ai or Fireworks for managed endpoints. Full stack on-prem for regulated industries. We optimize for your specific latency/cost/compliance constraints.

What does a fine-tuning engagement cost?

Fine-tuning engagements run a 4-8 week cycle at a fixed scope quoted up front. Included: the data curation pipeline, training infrastructure setup, LoRA/QLoRA training, the evaluation harness, deployment automation, and handover. Compute costs are passthrough at our discounted GPU rates.

Start a conversation

Model engineering

Generic in. Yours out.

A foundation model knows everything in general and your business in particular not at all. We tune, evaluate, and serve models shaped to your data, your domain, and your latency budget.

Scope the fine-tune

See the same prompt twice

Of full fine-tune quality, via LoRA

Of full fine-tune cost

4 to 8 wks

Fixed scope to handover

Yours

Weights, evals, and pipeline

Same prompt, two models

One prompt. Two different machines.

The cheapest way to see what tuning buys is to ask both models the same question. The base model gives you prose and a hedge. Your model gives you the record the next system in the chain expects.

The prompt, identical both timesIllustrative

Classify this claim note and produce the triage record: “Insured reports water damage in the kitchen after an appliance failure. No structural issues visible. Property is tenant occupied.”

Stock open-weight model

This appears to be a property damage claim, most likely related to water damage. Depending on the policy wording, it could fall under accidental damage or escape of water, and coverage may vary if the property is tenanted. It would be best to review the policy schedule and consult a claims handler before assigning a category. Would you like a general overview of how water damage claims are typically handled?

No category assignedAsks a question backGeneric vocabulary

Both outputs are illustrative. The point is structural: the format, the taxonomy, and the routing rules live in the tuned weights, not in an ever-longer prompt.

A raw unshaped block of dim light, unworked potential

A finished lens focusing scattered light into one sharp point

The ingot: raw, capable, unshaped

Fine-tune when prompts stop being enough. Not before.

The honest rule we work to: tune when you need an output format prompts cannot enforce, when you hold more than a thousand high-quality examples, or when a small tuned model has to meet a latency or cost budget a large one cannot. Otherwise prompt engineering and RAG come first; they are cheaper and faster to iterate.

The engagement

The only deadline is a number going down.

A fine-tuning engagement is a 4 to 8 week cycle at a fixed scope, and it is run off one chart: the tuned model's error on your golden set. Every milestone lives on that curve.

Data curated

Your examples cleaned, deduplicated, and formatted into instruction sets. Quality decides the ceiling: 500 curated examples beat 50,000 messy ones.

First adapter trained

LoRA or QLoRA adapters on the chosen open-weight base, scored against the golden set on every run, so the curve starts moving early.

Eval gate

Nothing ships until the tuned model beats the base on your golden set without regressing on general logic or inventing answers.

Shipped

Deployed to your infrastructure, with the eval harness and the training pipeline handed over so the next run is yours to make.

The curve’s shape is illustrative; the sequence and the gate are how every engagement runs. If the number refuses to come down, you find out in week two, not at handover.

Which lever

Four ways to tune. One honest ledger.

Fine-tuning is not one technique, it is a set of levers with very different bills. Open each one: when it is the right tool, and what it actually costs in data and compute.

The right tool when prompts cannot enforce the output format you need, when the model must speak your domain natively, or when a small tuned model has to hit a latency budget a large one cannot.

What it costs in data

Modest. Style and format tuning typically lands with 500 to 2,000 high-quality examples.

What it costs in compute

Light. The base stays frozen while small adapter matrices train; QLoRA quantises the base to 4-bit and fits on a single GPU. Roughly 90% of full fine-tune quality at about 1% of the cost.

The workhorse. This is what we use for most production work.

The bases we pull from the shelf: Llama 4, Qwen 3, Gemma 3, Phi-4, DeepSeek, Mistral. And the closed-model fine-tuning APIs where they fit better.

Serve it

Tuned is half the job. Served is the other half.

A model that only runs on our machines is not yours. Every engagement ends with the instrument deployed where you need it and the workshop handed over with it.

Off the bench

Your VPC

vLLM, TGI, or TensorRT-LLM on AWS, GCP, or Azure. Sovereign by default: nothing leaves your perimeter.

Your provider

Serverless inference on Anyscale or Modal; managed endpoints on Together.ai or Fireworks when you want the ops handled.

On-prem and edge

Full stack on-premise for regulated industries. Quantised to 4-bit or 8-bit so it runs on smaller GPUs and edge devices.

Production

What travels with the model

Weights and adaptersThe eval harnessThe training pipelineThe data curation pipelineDeployment automation

A 4 to 8 week cycle at a fixed scope quoted up front. Compute is passthrough at our discounted GPU rates, and everything the engagement produces is yours from the first commit.

Scope the fine-tune

The evidence

A model ships with its evaluation record.

You should never have to take a tuned model on faith. Every engagement is gated by the same sheet, and the sheet ships with the model so the next team can re-run every line.

Evaluation recordships with every engagement

RefCheckMethodThe gate

E-01Golden setA held-out set of examples your team signs off as correct, never seen in training.The tuned model must beat the base model on it, run after run.

E-02Format complianceEvery output parsed against the schema it has to satisfy downstream.Outputs that do not parse do not ship.

E-03Domain accuracyLLM-as-judge scoring against reference answers, spot-checked by humans.Judged accuracy holds across the full golden set, not a demo subset.

E-04General-logic regressionStandard reasoning checks re-run on the tuned model after every training run.No regression on general logic. Specialising must not cost the model its reasoning.

E-05Hallucination probesStress tests built to tempt the model into inventing answers.Invented answers fail the run, however fluent they sound.

The harness, the golden set, and the regression suite are part of the handover. Re-run every line whenever you like.

See what we have shipped

FAQ

Common questions about model engineering.

What teams ask before they tune their first model.

Fine-tune when: (1) you need consistent output format that prompts can't enforce, (2) you have >1000 high-quality training examples, (3) latency or cost of a large model is prohibitive and a fine-tuned small model suffices. Otherwise, invest in prompt engineering and RAG first, cheaper and faster to iterate.

Make the model yours.

If a prompt can do the job, we will tell you and save you the engagement. If it cannot, we will tune a model that can, gate it on your golden set, and hand you the weights.

Scope the fine-tune

Read the case studies

Generic in. Yours out.

One prompt. Two different machines.

The only deadline is a number going down.

Data curated

First adapter trained

Eval gate

Shipped

Four ways to tune. One honest ledger.

Tuned is half the job. Served is the other half.

A model ships with its evaluation record.

Common questions about model engineering.

Related reading

Make the model yours.