Decision framework

Self-Hosted vs Managed LLM: Which to Choose in 2026

TL;DR

Use managed LLMs (Anthropic API, OpenAI, AWS Bedrock, Vertex AI) for the first 6-18 months of any AI initiative: the operational simplicity is dramatic. Switch to self-hosted (open-source models on vLLM / TGI / SageMaker) only when (a) data residency / sovereignty requires it, (b) cost crosses a threshold where self-hosted economics dominate (typically 1M+ requests/month), or (c) you need customization that managed APIs don't support. The hybrid path (managed for some workloads, self-hosted for others) wins more often than either pure approach.
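The three triggers above can be read as a simple rule. The sketch below is one hypothetical encoding of it: the `Workload` fields and function names are illustrative assumptions, and the 1M requests/month threshold is the rough figure from the text, not a hard cutoff.

```python
from dataclasses import dataclass

# Rough threshold quoted in the text; a heuristic, not a hard rule.
SELF_HOSTED_VOLUME_THRESHOLD = 1_000_000  # requests per month


@dataclass
class Workload:
    """Hypothetical description of one workload (field names are illustrative)."""
    requests_per_month: int
    requires_data_residency: bool = False
    needs_unsupported_customization: bool = False


def recommend(workload: Workload) -> str:
    """Return 'self-hosted' if any of the three triggers applies, else 'managed'."""
    if workload.requires_data_residency:            # trigger (a)
        return "self-hosted"
    if workload.requests_per_month >= SELF_HOSTED_VOLUME_THRESHOLD:  # trigger (b)
        return "self-hosted"
    if workload.needs_unsupported_customization:    # trigger (c)
        return "self-hosted"
    return "managed"


def recommend_portfolio(workloads: list[Workload]) -> str:
    """Return 'hybrid' when different workloads point to different approaches."""
    if not workloads:
        raise ValueError("at least one workload is required")
    choices = {recommend(w) for w in workloads}
    return "hybrid" if len(choices) > 1 else choices.pop()
```

A real decision would also weigh team expertise and the stage of the initiative, which this sketch deliberately leaves out.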

Side-by-side comparison

| Dimension | Managed LLMs (API-based) | Self-Hosted LLMs (Open-Source) |
| --- | --- | --- |
| Setup time | Hours | Weeks to months |
| Operational burden | Zero | Real (GPU mgmt, capacity, serving) |
| Access to frontier quality | Yes (Claude Opus, GPT-5, Gemini 2.5) | Open-source frontier (Llama 3.3, Qwen 2.5, DeepSeek-V3), often within 5-15% of managed frontier |
| Cost at low volume (10K req/month) | $10-100/month | $2K-5K/month minimum |
| Cost at high volume (10M req/month) | $5K-50K/month | $5K-15K/month |
| Data sovereignty | Data leaves your infrastructure | Full sovereignty |
| Compliance (HIPAA, FedRAMP, etc.) | Limited: depends on provider BAAs | Whatever you certify yourself |
| Customization | Limited (mostly prompting + light fine-tuning) | Full (fine-tuning, weight modification, custom models) |
| Scaling | Provider handles | Your responsibility |
| Engineering expertise required | Standard backend engineering | ML / inference infrastructure expertise |
| Best for | First 6-18 months, simplicity wins | Sovereignty, scale, customization needs |

Managed LLMs (API-based)

Anthropic, OpenAI, Bedrock, Vertex AI: zero infrastructure ownership.

Managed LLM APIs let you use frontier models without operating any infrastructure. Anthropic Claude (Sonnet, Opus, Haiku), OpenAI (GPT-4o, GPT-5, o-series), Google Gemini, AWS Bedrock (multi-vendor), and Azure OpenAI are the dominant managed options. The simplicity is dramatic: sign up, get an API key, ship code. You get frontier model quality without standing up GPU clusters, managing model serving, handling capacity planning, or maintaining inference infrastructure. Most production AI in 2026 starts on managed APIs, and most stays there permanently. Costs scale linearly with usage, which works well at low-to-medium scale and becomes increasingly painful at very high scale.

Pros

  • Zero infrastructure burden: start in hours, not weeks
  • Access to frontier model quality (GPT-5, Claude Opus, Gemini 2.5)
  • Predictable per-token economics at small-to-medium scale
  • Provider handles capacity planning, scaling, reliability
  • Faster access to new model capabilities (released to managed APIs first)
  • Compliance certifications managed by provider (SOC 2, HIPAA BAA, etc.)

Cons

  • Vendor lock-in (model-specific code patterns; switching providers requires rework)
  • Cost scales linearly: at very high volume, it becomes the dominant cost line
  • Data leaves your infrastructure (problematic for sovereignty / residency requirements)
  • Limited customization (can't modify model weights, limited fine-tuning options)
  • Rate limits and capacity constraints during demand spikes
  • Some regulated environments preclude managed deployment entirely

Best for

  • First 6-18 months of any AI initiative
  • Workloads where simplicity and frontier quality matter more than per-call cost
  • Teams without ML / inference infrastructure expertise

Worst for

  • Sovereignty / data residency requirements that prevent sending data to a third party
  • Workloads at very high volume where per-call cost dominates economics
  • Use cases requiring deep model customization beyond fine-tuning

Cost model

Per-token pricing: $0.15-$15 per 1M tokens depending on model. Prompt caching can reduce the cost of cached prefix tokens by 50-90%.
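To make the per-token arithmetic concrete, here is a minimal sketch of a monthly cost estimate. The function and parameter names are hypothetical, and the prices used in the usage note are illustrative values inside the range quoted above, not any provider's actual price list. The sketch applies the caching discount to input tokens only and ignores any separate cache-write charges a provider may apply.

```python
def managed_monthly_cost(
    requests_per_month: int,
    input_tokens_per_request: int,
    output_tokens_per_request: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
    cached_fraction: float = 0.0,
    cache_discount: float = 0.0,
) -> float:
    """Estimate monthly spend (in dollars) for a per-token-priced API.

    cached_fraction: share of input tokens served from a cached prefix (0..1).
    cache_discount:  price reduction on those cached tokens (0..1).
    """
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    if not 0.0 <= cache_discount <= 1.0:
        raise ValueError("cache_discount must be between 0 and 1")

    input_tokens = requests_per_month * input_tokens_per_request
    output_tokens = requests_per_month * output_tokens_per_request

    # Only the cached share of the input is discounted.
    effective_input_price = input_price_per_mtok * (1.0 - cached_fraction * cache_discount)

    total = input_tokens * effective_input_price + output_tokens * output_price_per_mtok
    return total / 1_000_000  # prices are quoted per 1M tokens
```

For example, 1M requests/month at 2,000 input and 500 output tokens each, priced at an assumed $3 and $15 per 1M tokens, comes to about $13,500/month. With 80% of input tokens cached at a 90% discount, the same workload comes to about $9,180/month.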

Time to value

Hours to days from sign-up to production-ready integration.

Self-Hosted LLMs (Open-Source)

Llama, Mistral, Qwen, DeepSeek: full control, real ops.

Self-hosted LLMs run open-source models (Llama 3.3, Mistral, Qwen 2.5, DeepSeek-V3) on your infrastructure using inference engines (vLLM, TGI, Triton Inference Server). The cost economics improve dramatically at scale: once you're past 1M requests/month, self-hosted Llama 3.3 70B on appropriate infrastructure can cost 5-20× less than equivalent managed API usage. The trade-off is real ops investment: capacity planning, GPU management, model versioning, serving optimization, and the engineering work that goes into running production LLM infrastructure. For sovereign deployment requirements, self-hosted is often the only viable architecture.

Pros

  • Full data sovereignty: data never leaves your infrastructure
  • Cost economics dominate at scale (1M+ requests/month)
  • Full control over model behavior (can fine-tune, modify, customize)
  • No rate limits: capacity is whatever you provision
  • Required for many regulated environments (sovereignty, air-gapped)
  • Open-source models are now competitive with frontier (Llama 3.3, Qwen 2.5, DeepSeek-V3)

Cons

  • Real operational burden (GPU clusters, capacity planning, model serving)
  • Slower access to the newest capabilities (open-source typically lags frontier 6-12 months)
  • Engineering investment required (inference engineers, MLOps capacity)
  • Higher upfront cost (capacity provisioning) before scale economics kick in
  • Compliance certifications are your responsibility (not the vendor's)
  • Self-hosted Claude / GPT not possible: open-source models only

Best for

  • Sovereign deployment requirements (healthcare, financial services, government)
  • Workloads at high volume (1M+ requests/month) where cost matters
  • Use cases requiring deep model customization

Worst for

  • Early-stage AI initiatives where simplicity matters more than cost optimization
  • Teams without ML / inference infrastructure expertise
  • Workloads requiring frontier model quality beyond what open-source provides

Cost model

Infrastructure cost: $2K-50K+ monthly for production deployments depending on scale. Because most of that cost is fixed capacity, the amortized per-request cost falls as volume rises and the provisioned GPUs stay busy.
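A small sketch of how a mostly fixed infrastructure bill amortizes over request volume. The replica capacity and per-replica cost are assumed inputs you would measure for your own deployment; all names here are hypothetical.

```python
import math


def self_hosted_monthly_cost(
    requests_per_month: int,
    replica_capacity_per_month: int,
    cost_per_replica: float,
    min_replicas: int = 1,
) -> float:
    """Monthly infrastructure cost when capacity is bought in whole replicas."""
    if replica_capacity_per_month <= 0:
        raise ValueError("replica_capacity_per_month must be positive")
    needed = math.ceil(requests_per_month / replica_capacity_per_month)
    replicas = max(min_replicas, needed)
    return replicas * cost_per_replica


def cost_per_request(
    requests_per_month: int,
    replica_capacity_per_month: int,
    cost_per_replica: float,
) -> float:
    """Amortized cost of one request at the given monthly volume."""
    if requests_per_month <= 0:
        raise ValueError("requests_per_month must be positive")
    monthly = self_hosted_monthly_cost(
        requests_per_month, replica_capacity_per_month, cost_per_replica
    )
    return monthly / requests_per_month
```

Assuming one replica costs $5,000/month and can serve 5M requests/month, the amortized cost is about $0.50 per request at 10K requests/month, $0.005 at 1M, and $0.001 at 10M (two replicas). The per-request cost falls steeply with volume but levels off at a floor set by replica cost divided by replica capacity.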

Time to value

Weeks to months for production-ready deployment.

Decision scenarios

Series B SaaS adding AI features for the first time, 100K requests/month

Managed LLMs (API-based)

Managed APIs (Anthropic Claude or OpenAI). Volume too low for self-hosted economics; team likely lacks inference infrastructure expertise; ship fast.

Healthcare client requiring all PHI processing to stay in their VPC

Self-Hosted LLMs (Open-Source)

Self-hosted Llama 3.3 or Qwen 2.5 in customer VPC. Managed APIs (even with HIPAA BAA) don't satisfy this sovereignty requirement.

Production system at 5M requests/month with cost pressure

Self-Hosted LLMs (Open-Source)

Self-hosted economics dominate at this scale. Self-hosted Llama 3.3 70B can serve this workload at 1/5 to 1/10 the cost of frontier API usage.

Government agency with FedRAMP High requirements and sovereignty constraints

Self-Hosted LLMs (Open-Source)

Self-hosted in GovCloud or on-prem. Most managed AI services don't have FedRAMP High authorization yet.

Mixed workload: customer support chatbot (high volume) + document analysis (low volume, high value)

Both

Hybrid: self-hosted fine-tuned 7B-13B model for high-volume chatbot deflection, frontier managed API (Claude or GPT) for high-value document analysis. Common production pattern.

Fast-moving AI startup needing to iterate on product-market fit

Managed LLMs (API-based)

Managed APIs. Optimize for iteration speed; switch to self-hosted only after PMF and economics justify the operational investment.

Enterprise with multiple AI initiatives wanting consistent governance and economics

Both

Build an internal AI platform supporting both managed (with BAA) and self-hosted models. Route requests to whichever fits the use case best. Common pattern for large enterprise AI platforms.

FAQ

Common questions

Increasingly, yes: open-source models such as Llama 3.3 70B, Qwen 2.5 72B, and DeepSeek-V3 are competitive with frontier managed models on most general tasks. The gap is most visible on the newest capabilities (extended thinking, very long context, brand-new features) where managed providers ship first. For most production tasks, open-source frontier is within 5-15% of managed frontier, sometimes effectively equivalent for the specific task.

Roughly 1M requests/month is the typical break-even point. Below that, the operational overhead of self-hosting outweighs cost savings. Above that, self-hosted starts to win. By 10M requests/month, self-hosted is dramatically cheaper. Specific break-even depends on model choice, prompt caching usage, and infrastructure sizing.
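The break-even point can be sketched as the volume at which the fixed monthly cost of self-hosting equals what the same requests would cost on a managed API. The function name and all inputs below are hypothetical; the fixed cost should include the operational overhead mentioned above (engineering time, not just GPUs).

```python
import math


def break_even_requests(
    self_hosted_fixed_monthly_cost: float,
    managed_cost_per_request: float,
    self_hosted_marginal_cost_per_request: float = 0.0,
) -> float:
    """Monthly request volume above which self-hosting is cheaper.

    Solves: fixed + marginal * N = managed * N  for N.
    Returns math.inf when self-hosting never catches up.
    """
    saving_per_request = managed_cost_per_request - self_hosted_marginal_cost_per_request
    if saving_per_request <= 0:
        return math.inf
    return self_hosted_fixed_monthly_cost / saving_per_request
```

With an assumed $5,000/month of fixed self-hosting cost and an assumed managed cost of $0.005 per request, the break-even lands at about 1M requests/month. Those inputs were picked to illustrate the figure in the text, not to validate it; cheaper managed models or heavy prompt caching push the break-even higher.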

No: Claude and GPT are both closed-source and managed-only, so they cannot be self-hosted. Self-hosting requires open-source models. The good news: open-source frontier models (Llama 3.3, Qwen 2.5, DeepSeek-V3) have reached production quality competitive with managed frontier on most tasks.

Bedrock and self-hosted vLLM make different trade-offs. Bedrock provides managed access to frontier models (Claude, Llama, Mistral) with HIPAA / FedRAMP options: useful when you want managed simplicity but need cloud-specific compliance. Self-hosted vLLM gives you full control and the lowest cost at scale but requires real ops investment. We use Bedrock for managed deployment with cloud-specific compliance, and self-hosted vLLM for full sovereignty or scale economics.

Minimum: 1× A100 80GB or H100 80GB GPU per inference replica (with INT4 quantization) for moderate throughput. For higher throughput: H100 cluster. For very high throughput: TensorRT-LLM optimization plus multi-GPU setup. We size infrastructure based on the customer's specific request volume and latency requirements.
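A back-of-the-envelope check on why a single 80GB GPU with INT4 quantization can be enough. The sketch assumes a 70B-parameter model (such as the Llama 3.3 70B mentioned elsewhere on this page) and counts only weight memory; the 20% headroom reserved for KV cache and activations is an illustrative assumption, not a measured figure.

```python
def weight_memory_gb(num_params_billion: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    # billions of params * bits per weight / 8 bits per byte = billions of bytes
    return num_params_billion * bits_per_weight / 8


def fits_on_gpu(
    weights_gb: float,
    gpu_memory_gb: float = 80.0,
    headroom_fraction: float = 0.2,
) -> bool:
    """True if the weights fit while leaving headroom for KV cache and activations."""
    usable_gb = gpu_memory_gb * (1.0 - headroom_fraction)
    return weights_gb <= usable_gb


int4_gb = weight_memory_gb(70, 4)    # 4-bit quantized weights
fp16_gb = weight_memory_gb(70, 16)   # unquantized 16-bit weights
```

Under these assumptions the 4-bit weights take about 35 GB and fit on one 80GB card, while the 16-bit weights take about 140 GB and need multiple GPUs. Actual throughput depends on how much memory is left for the KV cache, which is why sizing still comes down to request volume and latency targets.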

Often yes: combining the two is a common pattern. A smaller self-hosted fine-tuned model handles high-volume routine tasks (customer support classification, simple chatbot responses), a managed frontier API handles complex tasks where quality matters more than cost, and a centralized routing layer decides which to use per request. This gives the best of both: economics where they matter, quality where it matters.
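A minimal sketch of what such a routing layer's decision function might look like. The task labels, the `Request` fields, and the rule order are all hypothetical; a production router would typically also consider latency, fallback on errors, and measured quality per task.

```python
from dataclasses import dataclass
from typing import Literal

Backend = Literal["self_hosted", "managed"]

# Hypothetical labels for high-volume routine work.
ROUTINE_TASKS = {"support_classification", "simple_chat"}


@dataclass
class Request:
    """Illustrative request metadata the router inspects (not the prompt itself)."""
    task: str
    contains_restricted_data: bool = False


def route(request: Request) -> Backend:
    """Pick a backend per request."""
    # Data that must not leave your infrastructure always stays self-hosted.
    if request.contains_restricted_data:
        return "self_hosted"
    # High-volume routine tasks go to the cheaper fine-tuned model.
    if request.task in ROUTINE_TASKS:
        return "self_hosted"
    # Everything else goes to the frontier API, where quality matters more than cost.
    return "managed"
```

Keeping the rule in one function makes it straightforward to change the split later, for instance when volume on a task grows enough to justify moving it to the self-hosted model.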

We do build-vs-buy analysis as part of our Discovery Sprint engagements. We model TCO under multiple scenarios (managed-only, self-hosted-only, hybrid), benchmark task quality on the customer's specific workload, and recommend a path. The right answer is data-driven and depends on the customer's specific scale, sovereignty needs, and operational capacity.

Get a recommendation tailored to your situation

BearPlex builds production AI systems using both approaches. We'll tell you which fits your case in a 30-minute scoping call.