Question 1

When should we fine-tune vs use prompting?

Accepted Answer

Prompt first, fine-tune second. We recommend fine-tuning when (1) you would need 20+ few-shot examples to reach the accuracy you need, making prompts unwieldy and expensive; (2) per-call cost matters at scale (millions of requests per month, where a smaller fine-tuned model is cheaper to run than GPT-4); (3) you need rigid format compliance that prompting can't reliably achieve; or (4) you work in a specialized domain where base-model performance is borderline. We don't fine-tune for fast-changing knowledge: that's what RAG is for.
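
Criterion (2) comes down to simple arithmetic. The sketch below is only an illustration: every price, token count, and hosting fee in it is a made-up placeholder, not a quote for any real model or provider, and the function names are hypothetical.

```python
def monthly_cost(requests, tokens_per_request, price_per_million_tokens, fixed_monthly=0.0):
    """Total monthly spend: fixed hosting cost plus per-token usage."""
    return fixed_monthly + requests * tokens_per_request / 1_000_000 * price_per_million_tokens


def break_even_requests(tokens_prompted, price_prompted, tokens_tuned, price_tuned, fixed_tuned):
    """Monthly request volume above which the fine-tuned option is cheaper.

    Returns None when the fine-tuned option never pays off, i.e. its
    per-request cost is not lower than the prompted option's.
    """
    per_request_prompted = tokens_prompted / 1_000_000 * price_prompted
    per_request_tuned = tokens_tuned / 1_000_000 * price_tuned
    saving = per_request_prompted - per_request_tuned
    if saving <= 0:
        return None
    return fixed_tuned / saving


# Placeholder scenario: a long few-shot prompt on a large hosted model versus
# a short prompt on a small fine-tuned model with a fixed hosting fee.
volume = break_even_requests(
    tokens_prompted=3000, price_prompted=5.0,   # assumed: 3k tokens/request, $5 per 1M tokens
    tokens_tuned=500, price_tuned=0.5,          # assumed: 500 tokens/request, $0.50 per 1M tokens
    fixed_tuned=2000.0,                         # assumed: $2,000/month hosting
)
print(f"Fine-tuning pays off above roughly {volume:,.0f} requests/month")
```

Under these placeholder numbers the break-even sits in the low hundreds of thousands of requests per month, which is why the criterion only bites at high volume.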

Question 2

Do BearPlex fine-tuning engineers work with closed-source frontier models?

Accepted Answer

Yes, when the platform supports it: OpenAI fine-tuning (GPT-4o, GPT-3.5-turbo), AWS Bedrock custom models (Claude variants in private preview), Azure OpenAI fine-tuning, and Google Vertex AI fine-tuning. Closed-source fine-tuning is more constrained (less control over hyperparameters, longer iteration loops), but it is sometimes the right answer for clients already on those platforms.

Question 3

How long does a typical fine-tuning project take?

Accepted Answer

Initial production fine-tuned model: 4-8 weeks for a typical engagement. The training itself is often less than a day; the time goes to dataset construction (weeks), evaluation harness design (1-2 weeks), and iteration based on eval results (weeks). Engagements that try to skip the dataset and evaluation phases reliably ship worse models.

Question 4

What model sizes do you typically fine-tune?

Accepted Answer

Most production work is in the 7B-13B range (Llama 3.1 8B, Mistral 7B, Qwen 2.5 7B, Phi-3) with LoRA or QLoRA. We fine-tune 70B+ models when the complexity of the client's task justifies it (typically agent and reasoning tasks). For closed-source platforms, we fine-tune whatever the platform supports: typically GPT-4o or smaller GPT variants on OpenAI, and Claude variants where available on Bedrock.
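
Part of what makes LoRA practical at this scale is how few parameters it trains. For a frozen weight matrix of shape d_out x d_in, a rank-r adapter adds two small matrices with r * (d_in + d_out) parameters in total. The sketch below works through that arithmetic for an assumed, simplified model: 32 layers, hidden size 4096, and adapters on four square 4096 x 4096 projections per layer. Real architectures differ (grouped-query attention, for example, makes some projections smaller), so treat the numbers as illustrative.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters a rank-`rank` LoRA adapter adds to one d_out x d_in matrix."""
    return rank * (d_in + d_out)


def adapter_total(layers, shapes_per_layer, rank):
    """Total LoRA parameters across all layers, given the adapted (d_in, d_out) shapes."""
    per_layer = sum(lora_params(d_in, d_out, rank) for d_in, d_out in shapes_per_layer)
    return layers * per_layer


# Assumed toy architecture, not any specific released model.
LAYERS = 32
SHAPES = [(4096, 4096)] * 4          # four adapted projections per layer
RANK = 16

trainable = adapter_total(LAYERS, SHAPES, RANK)
frozen = LAYERS * sum(d_in * d_out for d_in, d_out in SHAPES)

print(f"trainable adapter params: {trainable:,}")   # 16,777,216
print(f"adapted frozen params:    {frozen:,}")      # 2,147,483,648
print(f"ratio: {trainable / frozen:.4%}")
```

In this toy setup the adapter is 1/128 of the weights it adapts, which is the main reason optimizer state and gradients fit on far smaller hardware than full fine-tuning would need.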

Question 5

Can you fine-tune on our private data without it leaving our infrastructure?

Accepted Answer

Yes. For clients with strict data residency requirements (healthcare, financial services, government), we run training on customer infrastructure (an AWS, GCP, or Azure account they own, or an on-prem GPU cluster) rather than third-party fine-tuning APIs. Models stay in customer accounts; we provide the engineering, and the customer keeps the artifacts.

Question 6

Do you handle DPO and RLHF, or just supervised fine-tuning?

Accepted Answer

Both. SFT is the starting point for most engagements. DPO (and increasingly ORPO and KTO) comes into play when you have preference data, typically pairs of 'good' and 'bad' responses from real usage. Full RLHF with reward modeling is rare in client work because the data and infrastructure investment is significant; most clients hit their goals with SFT + DPO.
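
For readers new to DPO, the per-pair loss is small enough to write out. This is a minimal sketch in plain Python, assuming the summed token log-probabilities of each response are already available from the model being trained (the policy) and from a frozen reference copy. The function name, argument names, and the default `beta=0.1` are illustrative choices, not a description of any particular library's API.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair: -log(sigmoid(beta * margin_difference))."""
    # How much more (or less) likely each response is under the policy than the reference.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) == log(1 + exp(-x)), computed in a numerically stable form.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


# Policy identical to the reference: the loss equals log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))

# Policy has shifted probability toward the chosen response: the loss drops.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

The loss falls as the policy raises the chosen response's likelihood relative to the reference and lowers the rejected one's, which is how preference pairs steer the model without a separate reward model.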

Question 7

How do you measure fine-tuned model quality vs the base model?

Accepted Answer

Always against a held-out evaluation set the model has never seen. We design the evaluation harness BEFORE training, then measure: task accuracy on held-out data, regression on general capabilities (we use MMLU-Pro and a domain-specific subset to catch capability loss), and inference cost/latency. A fine-tuned model that beats the base on the target task but breaks on out-of-distribution inputs is not a win.
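
The acceptance logic described here can be sketched as a small gate that compares the two models on all three axes. Everything in it is illustrative: the field names, the thresholds, and the scores are hypothetical placeholders, not BearPlex's actual harness or results.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    task_accuracy: float   # accuracy on the held-out target-task set
    general_score: float   # score on a general-capability benchmark
    latency_ms: float      # measured inference latency


def compare(base, tuned, min_task_gain=0.02, max_general_drop=0.01, latency_budget_ms=500.0):
    """Return (accepted, reasons). All thresholds are placeholders to set per project."""
    reasons = []
    if tuned.task_accuracy - base.task_accuracy < min_task_gain:
        reasons.append("no meaningful gain on the target task")
    if base.general_score - tuned.general_score > max_general_drop:
        reasons.append("regression on general capabilities")
    if tuned.latency_ms > latency_budget_ms:
        reasons.append("latency over budget")
    return (not reasons, reasons)


# Made-up scores: the tuned model wins on the task but loses general capability.
base = EvalResult(task_accuracy=0.70, general_score=0.60, latency_ms=400.0)
tuned = EvalResult(task_accuracy=0.85, general_score=0.52, latency_ms=120.0)

accepted, reasons = compare(base, tuned)
print(accepted, reasons)
```

With these placeholder scores the tuned model is rejected despite its task gain, matching the point that a win on the target task alone is not enough.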

Question 8

Where are BearPlex fine-tuning engineers based?

Accepted Answer

Primarily Lahore, Pakistan (HQ) with client-facing presence in Austin and Doha. Time zone overlap with US clients is 5-9 hours; we structure engagements with daily 2-3 hour overlap windows for synchronous work, async handoff for the rest.

Question 9

Can your fine-tuning engineers work with our existing MLOps infrastructure?

Accepted Answer

Yes. We work with whatever you have: Kubeflow, SageMaker, Vertex AI Pipelines, custom Airflow + Slurm setups, Modal, Ray. If you don't have MLOps infrastructure, we can pair a fine-tuning engineer with one of our MLOps engineers to stand it up alongside the model work.

Skill	Proficiency	Typical tools
LoRA / QLoRA fine-tuning	Expert	Hugging Face PEFT · Unsloth · LLaMA-Factory
Full fine-tuning at scale	Expert	DeepSpeed ZeRO · FSDP · Megatron-LM
DPO / ORPO preference alignment	Expert	TRL · Axolotl · OpenRLHF
Supervised fine-tuning (SFT)	Expert	TRL · Hugging Face Transformers · Axolotl
Dataset construction and curation	Expert	Argilla · Label Studio · custom pipelines
Evaluation harness design	Expert	lm-evaluation-harness · OpenAI Evals · custom rubrics
Hyperparameter sweeps and tracking	Expert	Weights & Biases · MLflow · Ray Tune
Production model serving	Advanced	vLLM · TGI · Triton Inference Server
Continued pre-training	Advanced	Megatron-LM · DeepSpeed · custom data pipelines
Quantization (GPTQ, AWQ, INT4)	Advanced	AutoGPTQ · AutoAWQ · bitsandbytes
Fine-tuned model evaluation against base model	Expert	custom A/B harnesses · MMLU subsets · domain benchmarks
GPU cluster management and cost optimization	Advanced	Slurm · Ray · Modal · spot instance orchestration
