When does self-hosted economics become favorable?

Roughly 1M requests/month is the typical break-even point. Below that, the operational overhead of self-hosting outweighs the cost savings; above it, self-hosting starts to win, and by 10M requests/month it is usually substantially cheaper. The exact break-even point depends on model choice, prompt caching usage, and infrastructure sizing.
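To show how a break-even volume falls out of the numbers, here is a minimal sketch that compares a linear per-request managed cost with a fixed monthly self-hosted cost. The function name and all dollar figures are hypothetical placeholders chosen for illustration, not measured prices.

```python
def break_even_requests(managed_cost_per_request: float,
                        self_hosted_fixed_monthly: float,
                        self_hosted_cost_per_request: float = 0.0) -> float:
    """Monthly request volume at which managed and self-hosted cost the same."""
    saving_per_request = managed_cost_per_request - self_hosted_cost_per_request
    if saving_per_request <= 0:
        raise ValueError("self-hosting never breaks even at these prices")
    return self_hosted_fixed_monthly / saving_per_request


# Hypothetical inputs: $0.005 per managed request, $5,000/month of fixed
# self-hosted cost (GPUs plus the share of engineering time spent on ops).
volume = break_even_requests(0.005, 5_000.0)
print(f"Break-even at about {volume:,.0f} requests/month")
```

With these assumed inputs the break-even lands near 1M requests/month. Cheaper managed models or heavy prompt caching push it higher; a smaller self-hosted footprint pulls it lower.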

Can we self-host Claude or GPT?

No: both are closed-source and managed-only. Self-hosting requires open-source models. The good news: open-source frontier models (Llama 3.3, Qwen 2.5, DeepSeek-V3) have reached production quality competitive with managed frontier on most tasks.

Should we use AWS Bedrock or self-hosted vLLM?

Different trade-offs. Bedrock provides managed access to frontier models (Claude, Llama, Mistral) with HIPAA / FedRAMP options: useful when you want managed simplicity but need cloud-specific compliance. Self-hosted vLLM gives you full control and the lowest cost at scale, but requires real ops investment. We use Bedrock when a customer wants managed deployment with cloud-specific compliance, and self-hosted vLLM when they need full sovereignty or scale economics.

What infrastructure do we need for self-hosted Llama 3.3 70B?

Minimum: 1× A100 80GB or H100 80GB GPU per inference replica (with INT4 quantization) for moderate throughput. For higher throughput: H100 cluster. For very high throughput: TensorRT-LLM optimization plus multi-GPU setup. We size infrastructure based on the customer's specific request volume and latency requirements.
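The single-GPU minimum follows from simple arithmetic on weight size. The sketch below is a rough estimate only: it counts weight memory and reserves an assumed 20% of GPU memory for KV cache and activations. That 20% is a placeholder, and real headroom depends on context length and batch size.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


def fits_on_gpu(params_billions: float, bits_per_weight: int,
                gpu_memory_gb: float, overhead_fraction: float = 0.2) -> bool:
    """True if the weights fit after reserving memory for KV cache and activations.

    overhead_fraction is an assumed placeholder, not a measured value.
    """
    usable_gb = gpu_memory_gb * (1 - overhead_fraction)
    return weight_memory_gb(params_billions, bits_per_weight) <= usable_gb


# A 70B-parameter model on one 80 GB GPU, at INT4 and at FP16.
for bits in (4, 16):
    size = weight_memory_gb(70, bits)
    print(f"{bits}-bit weights: {size:.0f} GB, fits on 80 GB: {fits_on_gpu(70, bits, 80)}")
```

At INT4 the weights take about 35 GB and fit on a single 80 GB card. At FP16 they take about 140 GB, which is why unquantized serving needs multiple GPUs.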

Should we run a hybrid (managed + self-hosted)?

Often yes. Common pattern: self-hosted smaller fine-tuned model for high-volume routine tasks (customer support classification, simple chatbot responses); managed frontier API for complex tasks where quality matters more than cost; centralized routing layer that decides which to use per request. This gives the best of both: economics where they matter, quality where it matters.
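A routing layer of this kind can start out very simple. The following is an illustrative sketch, not a production design: the task labels, the prompt-length threshold, and the backend names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    task: str    # e.g. "support_classification" or "document_analysis"
    prompt: str


# Hypothetical labels for routine, high-volume work.
ROUTINE_TASKS = {"support_classification", "simple_chat"}


def route(request: Request, max_routine_prompt_chars: int = 4000) -> str:
    """Return the name of the backend that should serve this request."""
    is_routine = request.task in ROUTINE_TASKS
    is_short = len(request.prompt) <= max_routine_prompt_chars
    if is_routine and is_short:
        return "self_hosted"   # smaller fine-tuned model, cheap per request
    return "managed"           # frontier API for complex or unusual requests


print(route(Request("support_classification", "Where is my order?")))
print(route(Request("document_analysis", "Summarize the attached contract.")))
```

In practice the routing rule is often driven by measured quality per task type rather than a fixed label set, but a static table like this is a reasonable starting point.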

How does BearPlex help with this decision?

We run a build-vs-buy analysis as part of our Discovery Sprint engagements. We model TCO under multiple scenarios (managed-only, self-hosted-only, hybrid), benchmark task quality on the customer's specific workload, and recommend a path. The right answer is data-driven and depends on the customer's specific scale, sovereignty needs, and operational capacity.

Decision framework

Self-Hosted vs Managed LLM: Which to Choose in 2026

TL;DR

Use managed LLMs (Anthropic API, OpenAI, AWS Bedrock, Vertex AI) for the first 6-18 months of any AI initiative: the operational simplicity is dramatic. Switch to self-hosted (open-source models on vLLM / TGI / SageMaker) only when (a) data residency / sovereignty requires it, (b) cost crosses a threshold where self-hosted economics dominate (typically 1M+ requests/month), or (c) you need customization that managed APIs don't support. The hybrid path (managed for some workloads, self-hosted for others) wins more often than either pure approach.

Side-by-side comparison

Dimension	Managed LLMs (API-based)	Self-Hosted LLMs (Open-Source)
Setup time	Hours	Weeks to months
Operational burden	Zero	Real (GPU mgmt, capacity, serving)
Access to frontier quality	Yes (Claude Opus, GPT-5, Gemini 2.5)	Open-source frontier (Llama 3.3, Qwen 2.5, DeepSeek-V3), often within 5-15% of managed frontier
Cost at low volume (10K req/month)	$10-100/month	$2K-5K/month minimum
Cost at high volume (10M req/month)	$5K-50K/month	$5K-15K/month
Data sovereignty	Data leaves your infrastructure	Full sovereignty
Compliance (HIPAA, FedRAMP, etc.)	Limited: depends on provider BAAs	Whatever you certify yourself
Customization	Limited (mostly prompting + light fine-tuning)	Full (fine-tuning, weight modification, custom models)
Scaling	Provider handles	Your responsibility
Engineering expertise required	Standard backend engineering	ML / inference infrastructure expertise
Best for	First 6-18 months, simplicity wins	Sovereignty, scale, customization needs

Managed LLMs (API-based)

Anthropic, OpenAI, Bedrock, Vertex AI: zero infrastructure ownership.

Managed LLM APIs let you use frontier models without operating any infrastructure. Anthropic Claude (Sonnet, Opus, Haiku), OpenAI (GPT-4o, GPT-5, o-series), Google Gemini, AWS Bedrock (multi-vendor), and Azure OpenAI provide the dominant managed options. The simplicity is dramatic: sign up, get an API key, ship code. You get frontier model quality without standing up GPU clusters, managing model serving, handling capacity planning, or maintaining inference infrastructure. Most production AI in 2026 starts on managed APIs, and most stays there permanently. Costs scale linearly with usage, which is good at low-to-medium scale and increasingly painful at very high scale.

Pros

Zero infrastructure burden: start in hours, not weeks
Access to frontier model quality (GPT-5, Claude Opus, Gemini 2.5)
Predictable per-token economics at small-to-medium scale
Provider handles capacity planning, scaling, reliability
Faster access to new model capabilities (released to managed APIs first)
Compliance certifications managed by provider (SOC 2, HIPAA BAA, etc.)

Cons

Vendor lock-in (model-specific code patterns; switching providers requires rework)
Cost scales linearly: at very high volume, becomes dominant cost line
Data leaves your infrastructure (problematic for sovereignty / residency requirements)
Limited customization (can't modify model weights, limited fine-tuning options)
Rate limits and capacity constraints during demand spikes
Some regulated environments preclude managed deployment entirely

Best for

→ First 6-18 months of any AI initiative
→ Workloads where simplicity and frontier quality matter more than per-call cost
→ Teams without ML / inference infrastructure expertise

Worst for

→ Sovereignty / data residency requirements that prevent sending data to third party
→ Workloads at very high volume where per-call cost dominates economics
→ Use cases requiring deep model customization beyond fine-tuning

Cost model

Per-token pricing: $0.15-$15 per 1M tokens depending on model. Prompt caching can reduce by 50-90% on cached prefixes.
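To make the per-token model concrete, here is a rough estimator for monthly spend with an optional prompt-caching discount. The prices and token counts in the example are hypothetical values chosen within the range above, and the model is simplified: some providers also charge extra for cache writes, which this sketch ignores.

```python
def monthly_managed_cost(requests_per_month: int,
                         input_tokens: int, output_tokens: int,
                         input_price_per_m: float, output_price_per_m: float,
                         cached_fraction: float = 0.0,
                         cache_discount: float = 0.0) -> float:
    """Estimate monthly spend in dollars for a per-token-priced API.

    cached_fraction: share of input tokens served from a cached prefix.
    cache_discount: price reduction on those tokens (0.9 means 90% cheaper).
    """
    cached = input_tokens * cached_fraction
    uncached = input_tokens - cached
    billable_input = uncached + cached * (1 - cache_discount)
    input_cost = billable_input * input_price_per_m / 1e6
    output_cost = output_tokens * output_price_per_m / 1e6
    return requests_per_month * (input_cost + output_cost)


# Hypothetical workload: 1M requests/month, 1,000 input and 200 output tokens
# each, priced at $3 per 1M input tokens and $15 per 1M output tokens.
no_cache = monthly_managed_cost(1_000_000, 1000, 200, 3.0, 15.0)
with_cache = monthly_managed_cost(1_000_000, 1000, 200, 3.0, 15.0,
                                  cached_fraction=0.8, cache_discount=0.9)
print(f"No caching:   ${no_cache:,.0f}/month")
print(f"With caching: ${with_cache:,.0f}/month")
```

Under these assumptions, caching 80% of the input at a 90% discount cuts the bill from about $6,000 to about $3,840 per month. Output tokens are not discounted, so the overall saving is smaller than the discount on the cached prefix alone.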

Time to value

Hours to days from sign-up to production-ready integration.

Self-Hosted LLMs (Open-Source)

Llama, Mistral, Qwen, DeepSeek: full control, real ops.

Self-hosted LLMs run open-source models (Llama 3.3, Mistral, Qwen 2.5, DeepSeek-V3) on your infrastructure using inference engines (vLLM, TGI, Triton Inference Server). The cost economics improve dramatically at scale: once you're past 1M requests/month, self-hosted Llama 3.3 70B on appropriate infrastructure can cost 5-20× less than equivalent managed API usage. The trade-off is real ops investment: capacity planning, GPU management, model versioning, serving optimization, and the engineering work that goes into running production LLM infrastructure. For sovereign deployment requirements, self-hosted is often the only viable architecture.

Pros

Full data sovereignty: data never leaves your infrastructure
Cost economics dominate at scale (1M+ requests/month)
Full control over model behavior (can fine-tune, modify, customize)
No rate limits: capacity is whatever you provision
Required for many regulated environments (sovereignty, air-gapped)
Open-source models are now competitive with frontier (Llama 3.3, Qwen 2.5, DeepSeek-V3)

Cons

Real operational burden (GPU clusters, capacity planning, model serving)
Slower access to the newest capabilities (open-source typically lags frontier 6-12 months)
Engineering investment required (inference engineers, MLOps capacity)
Higher upfront cost (capacity provisioning) before scale economics kick in
Compliance certifications are your responsibility (not vendor's)
Self-hosted Claude / GPT not possible: open-source models only

Best for

→ Sovereign deployment requirements (healthcare, financial services, government)
→ Workloads at high volume (1M+ requests/month) where cost matters
→ Use cases requiring deep model customization

Worst for

→ Early-stage AI initiatives where simplicity matters more than cost optimization
→ Teams without ML / inference infrastructure expertise
→ Workloads requiring frontier model quality beyond what open-source provides

Cost model

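Infrastructure cost: $2K-50K+ monthly for production deployments, depending on scale. Because this cost is largely fixed per provisioned GPU, the amortized cost per request falls as volume rises and can become very small at high utilization.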
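The sketch below illustrates how a mostly fixed infrastructure bill translates into cost per request. The replica capacity (5M requests/month) and the replica price ($5,000/month) are hypothetical placeholders; real capacity depends on model size, hardware, and latency targets.

```python
import math


def self_hosted_monthly_cost(requests_per_month: int,
                             replica_capacity: int,
                             cost_per_replica: float,
                             min_replicas: int = 1) -> float:
    """Monthly infrastructure cost when capacity is bought in whole replicas."""
    needed = math.ceil(requests_per_month / replica_capacity)
    return max(min_replicas, needed) * cost_per_replica


def amortized_cost_per_request(requests_per_month: int,
                               replica_capacity: int,
                               cost_per_replica: float) -> float:
    """Infrastructure cost spread evenly over the month's requests."""
    total = self_hosted_monthly_cost(requests_per_month, replica_capacity, cost_per_replica)
    return total / requests_per_month


# Hypothetical: each replica serves up to 5M requests/month and costs $5,000/month.
for volume in (10_000, 1_000_000, 10_000_000):
    total = self_hosted_monthly_cost(volume, 5_000_000, 5_000.0)
    per_request = amortized_cost_per_request(volume, 5_000_000, 5_000.0)
    print(f"{volume:>10,} req/month: ${total:,.0f} total, ${per_request:.4f} per request")
```

The total bill rises in steps as replicas are added, while the per-request figure drops sharply between low and high volume. That pattern is what drives the scale economics described above.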

Time to value

Weeks to months for production-ready deployment.

Decision scenarios

Series B SaaS adding AI features for the first time, 100K requests/month

→ Managed LLMs (API-based)

Managed APIs (Anthropic Claude or OpenAI). Volume too low for self-hosted economics; team likely lacks inference infrastructure expertise; ship fast.

Healthcare client requiring all PHI processing to stay in their VPC

→ Self-Hosted LLMs (Open-Source)

Self-hosted Llama 3.3 or Qwen 2.5 in customer VPC. Managed APIs (even with HIPAA BAA) don't satisfy this sovereignty requirement.

Production system at 5M requests/month with cost pressure

→ Self-Hosted LLMs (Open-Source)

Self-hosted economics dominate at this scale. Self-hosted Llama 3.3 70B can serve this workload at 1/5 to 1/10 the cost of frontier API usage.

Government agency with FedRAMP High requirements and sovereignty constraints

→ Self-Hosted LLMs (Open-Source)

Self-hosted in GovCloud or on-prem. Most managed AI services don't have FedRAMP High authorization yet.

Mixed workload: customer support chatbot (high volume) + document analysis (low volume, high value)

→ Both

Hybrid: self-hosted fine-tuned 7B-13B model for high-volume chatbot deflection, frontier managed API (Claude or GPT) for high-value document analysis. Common production pattern.

Fast-moving AI startup needing to iterate on product-market fit

→ Managed LLMs (API-based)

Managed APIs. Optimize for iteration speed; switch to self-hosted only after PMF and economics justify the operational investment.

Enterprise with multiple AI initiatives wanting consistent governance and economics

→ Both

Build internal AI platform supporting both managed (with BAA) and self-hosted models. Route requests to whichever fits the use case best. Common pattern for large enterprise AI platforms.

FAQ

Common questions

Open-source models increasingly hold up against managed frontier models: Llama 3.3 70B, Qwen 2.5 72B, and DeepSeek-V3 are competitive on most general tasks. The gap is most visible on the newest capabilities (extended thinking, very long context, brand-new features), where managed providers ship first. For most production tasks, open-source frontier models are within 5-15% of managed frontier models, and sometimes effectively equivalent on the specific task.

Get a recommendation tailored to your situation

BearPlex builds production AI systems using both approaches. We'll tell you which fits your case in a 30-minute scoping call.

