Generic in. Yours out.
A foundation model knows everything in general and your business in particular not at all. We tune, evaluate, and serve models shaped to your data, your domain, and your latency budget.
One prompt. Two different machines.
The cheapest way to see what tuning buys is to ask both models the same question. The base model gives you prose and a hedge. Your model gives you the record the next system in the chain expects.
Classify this claim note and produce the triage record: “Insured reports water damage in the kitchen after an appliance failure. No structural issues visible. Property is tenant occupied.”
This appears to be a property damage claim, most likely related to water damage. Depending on the policy wording, it could fall under accidental damage or escape of water, and coverage may vary if the property is tenanted. It would be best to review the policy schedule and consult a claims handler before assigning a category. Would you like a general overview of how water damage claims are typically handled?
Both outputs are illustrative. The point is structural: the format, the taxonomy, and the routing rules live in the tuned weights, not in an ever-longer prompt.


Fine-tune when prompts stop being enough. Not before.
The honest rule we work to: tune when you need an output format prompts cannot enforce, when you hold more than a thousand high-quality examples, or when a small tuned model has to meet a latency or cost budget a large one cannot. Otherwise prompt engineering and RAG come first; they are cheaper and faster to iterate.
The only deadline is a number going down.
A fine-tuning engagement is a 4 to 8 week cycle at a fixed scope, and it is run off one chart: the tuned model's error on your golden set. Every milestone lives on that curve.
Data curated
Your examples cleaned, deduplicated, and formatted into instruction sets. Quality decides the ceiling: 500 curated examples beat 50,000 messy ones.
First adapter trained
LoRA or QLoRA adapters on the chosen open-weight base, scored against the golden set on every run, so the curve starts moving early.
Eval gate
Nothing ships until the tuned model beats the base on your golden set without regressing on general logic or inventing answers.
Shipped
Deployed to your infrastructure, with the eval harness and the training pipeline handed over so the next run is yours to make.
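As a rough illustration of the "Data curated" step, the sketch below cleans, deduplicates, and formats raw rows into instruction records. It is a minimal example under stated assumptions, not the production pipeline: the field names (`input`, `output`, `instruction`, `response`), the sample rows, and the category label are all hypothetical.

```python
import json
import re


def normalise(text: str) -> str:
    """Collapse whitespace and lowercase so near-identical rows compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def curate(raw_examples: list) -> list:
    """Clean, deduplicate, and format raw rows into instruction records."""
    seen = set()
    curated = []
    for row in raw_examples:
        prompt = (row.get("input") or "").strip()
        answer = (row.get("output") or "").strip()
        if not prompt or not answer:
            continue  # drop incomplete rows rather than guess the missing half
        key = (normalise(prompt), normalise(answer))
        if key in seen:
            continue  # drop duplicates that differ only in case or spacing
        seen.add(key)
        curated.append({"instruction": prompt, "response": answer})
    return curated


# Hypothetical raw rows: one clean, one near-duplicate, one incomplete.
raw = [
    {"input": "Classify: water damage in kitchen", "output": "category: escape_of_water"},
    {"input": "classify:  water damage in kitchen ", "output": "category: escape_of_water"},
    {"input": "Classify: broken window", "output": ""},
]

records = curate(raw)
# One JSON object per line (JSONL) is a common on-disk format for instruction sets.
jsonl = "\n".join(json.dumps(r) for r in records)
```

Exact-match deduplication after normalisation is the simplest option; a real curation pass might add fuzzy matching or label review on top.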
The curve’s shape is illustrative; the sequence and the gate are how every engagement runs. If the number refuses to come down, you find out in week two, not at handover.
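The eval gate described above can be sketched as a single pass/fail check. This is an illustration only: the `Scores` fields, the tolerance defaults, and the numbers in the usage lines are made-up placeholders, not measured results or the actual harness.

```python
from dataclasses import dataclass


@dataclass
class Scores:
    golden_error: float      # error rate on the golden set, lower is better
    regression_pass: float   # pass rate on a general-capability regression suite
    unsupported_rate: float  # share of answers flagged as invented


def passes_gate(base: Scores, tuned: Scores,
                max_regression_drop: float = 0.01,
                max_unsupported: float = 0.02) -> bool:
    """Return True only if the tuned model clears all three conditions."""
    # 1. The tuned model must beat the base on the golden set.
    beats_base = tuned.golden_error < base.golden_error
    # 2. General capability may not drop by more than the allowed tolerance.
    no_regression = tuned.regression_pass >= base.regression_pass - max_regression_drop
    # 3. The share of invented answers must stay under the ceiling.
    no_invention = tuned.unsupported_rate <= max_unsupported
    return beats_base and no_regression and no_invention


# Placeholder numbers, purely for illustration.
base = Scores(golden_error=0.30, regression_pass=0.90, unsupported_rate=0.01)
candidate = Scores(golden_error=0.12, regression_pass=0.895, unsupported_rate=0.01)
overfit = Scores(golden_error=0.10, regression_pass=0.80, unsupported_rate=0.01)

ship_candidate = passes_gate(base, candidate)
ship_overfit = passes_gate(base, overfit)
```

The second model has the lower golden-set error but fails the gate, because it pays for that gain with a drop on the regression suite.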
Four ways to tune. One honest ledger.
Fine-tuning is not one technique; it is a set of levers with very different bills. Open each one to see when it is the right tool and what it actually costs in data and compute.
The right tool when prompts cannot enforce the output format you need, when the model must speak your domain natively, or when a small tuned model has to hit a latency budget a large one cannot.
Modest. Style and format tuning typically lands with 500 to 2,000 high-quality examples.
Light. The base stays frozen while small adapter matrices train; QLoRA quantises the base to 4-bit and fits on a single GPU. Roughly 90% of full fine-tune quality at about 1% of the cost.
The workhorse. This is what we use for most production work.
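To see why adapter training is light, count the trainable parameters. For one frozen weight matrix of shape d_out × d_in, a LoRA adapter of rank r trains two small matrices with r × (d_in + d_out) parameters in total. The sketch below works through one example; the layer size and rank are assumptions chosen for illustration, and parameter count is only a proxy for cost, not the cost itself.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Parameters updated when the whole weight matrix is trained."""
    return d_in * d_out


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two adapter matrices: (rank x d_in) and (d_out x rank)."""
    return rank * (d_in + d_out)


# Assumed example: one 4096 x 4096 projection with a rank-16 adapter.
d_in, d_out, rank = 4096, 4096, 16

full = full_params(d_in, d_out)                   # 16,777,216
lora = lora_trainable_params(d_in, d_out, rank)   # 131,072
fraction = lora / full                            # 0.0078125, i.e. under 1%
```

For this assumed layer the adapter trains under 1% of the parameters a full update would touch, which is the same order of magnitude as the cost figure quoted above.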
The bases we pull from the shelf: Llama 4, Qwen 3, Gemma 3, Phi-4, DeepSeek, Mistral. And the closed-model fine-tuning APIs where they fit better.
Tuned is half the job. Served is the other half.
A model that only runs on our machines is not yours. Every engagement ends with the instrument deployed where you need it and the workshop handed over with it.
Off the bench
vLLM, TGI, or TensorRT-LLM on AWS, GCP, or Azure. Sovereign by default: nothing leaves your perimeter.
Serverless inference on Anyscale or Modal; managed endpoints on Together.ai or Fireworks when you want the ops handled.
Full stack on-premise for regulated industries. Quantised to 4-bit or 8-bit so it runs on smaller GPUs and edge devices.
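The effect of quantisation on hardware requirements comes down to simple arithmetic: weight memory is roughly parameter count × bits per weight ÷ 8. The sketch below applies that to a hypothetical 7-billion-parameter model. It is a lower bound that ignores the KV cache, activations, and the small overhead of quantisation scales, so real footprints will be higher.

```python
def weight_memory_gb(num_params: int, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Hypothetical model size, chosen only for illustration.
num_params = 7_000_000_000

fp16_gb = weight_memory_gb(num_params, 16)  # 14.0
int8_gb = weight_memory_gb(num_params, 8)   # 7.0
int4_gb = weight_memory_gb(num_params, 4)   # 3.5
```

Going from 16-bit to 4-bit cuts the weight footprint to a quarter, which is what brings smaller GPUs and edge devices into range.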
Production
What travels with the model
Weights and adapters
The eval harness
The training pipeline
The data curation pipeline
Deployment automation
A 4 to 8 week cycle at a fixed scope quoted up front. Compute is passthrough at our discounted GPU rates, and everything the engagement produces is yours from the first commit.
A model ships with its evaluation record.
You should never have to take a tuned model on faith. Every engagement is gated by the same sheet, and the sheet ships with the model so the next team can re-run every line.
The harness, the golden set, and the regression suite are part of the handover. Re-run every line whenever you like.
See what we have shipped
Common questions about model engineering.
What teams ask before they tune their first model.
Full fine-tuning updates all model weights: best quality, highest cost, and it needs large GPU clusters. LoRA freezes the base model and trains small adapter matrices, for roughly 90% of the quality at about 1% of the cost. QLoRA quantises the base model to 4-bit and trains LoRA adapters on top, so it fits on a single consumer GPU. We use LoRA for most production work.
Llama 4 (Meta license), Qwen 3 (strong reasoning), Gemma 3 (Google), Phi-4 (SLM for edge deployment), DeepSeek, and Mistral (commercial use allowed). For closed models we fine-tune via OpenAI and Anthropic fine-tuning APIs where available.
For style and format fine-tuning, 500 to 2,000 high-quality examples typically suffice. For knowledge injection, you probably want RAG instead of fine-tuning. For domain reasoning, expect 5,000 to 50,000 examples plus synthetic data augmentation. Quality beats quantity: 500 curated examples beat 50,000 messy ones.
Sovereign deployment by default: your VPC on AWS, GCP, or Azure via vLLM, TGI, or TensorRT-LLM. Anyscale or Modal for serverless inference. Together.ai or Fireworks for managed endpoints. Full stack on-prem for regulated industries. We optimise for your specific latency, cost, and compliance constraints.
Fine-tuning engagements run a 4-8 week cycle at a fixed scope quoted up front. Included: the data curation pipeline, training infrastructure setup, LoRA/QLoRA training, the evaluation harness, deployment automation, and handover. Compute costs are passthrough at our discounted GPU rates.
Make the model yours.
If a prompt can do the job, we will tell you and save you the engagement. If it cannot, we will tune a model that can, gate it on your golden set, and hand you the weights.





