What is the current GPT-5 model in the API?

As of July 2026, the flagship is gpt-5.5 (snapshot gpt-5.5-2026-04-23), with a 1,050,000-token context window, 128K max output tokens, and a December 2025 knowledge cutoff. The workhorse tier is the gpt-5.4 family (standard, mini, nano). The original gpt-5 from August 2025 was deprecated in June 2026 and shuts down on December 11, 2026, so new builds should not target it.

How much does GPT-5.5 cost via the API?

As of July 2026, gpt-5.5 is $5.00 per million input tokens, $0.50 for cached input, and $30.00 per million output tokens. Batch processing halves that ($2.50/$15.00), priority processing raises it to $12.50/$75.00, and prompts beyond 272K input tokens are billed at 2x input and 1.5x output. gpt-5.4-mini ($0.75/$4.50) and gpt-5.4-nano ($0.20/$1.25) cover the volume tiers.

Does GPT-5 still use a router?

The router was a ChatGPT product feature, not an API feature. At launch, OpenAI's system card described a real-time router choosing between fast and reasoning models based on conversation type, complexity, and tool needs. In the API, that decision is yours: gpt-5.5 exposes a reasoning_effort parameter with five levels (none, low, medium, high, xhigh), which is the main cost, latency, and reliability lever in any production design.

How reliable is GPT-5 for tool calling in production agents?

gpt-5.5 supports structured outputs, MCP, hosted shell, computer use, and the rest of OpenAI's first-party tool surface, and in our engagements it is one of the strongest tool-calling platforms available. But OpenAI publishes no per-effort-level reliability numbers, and tool-call accuracy varies with reasoning effort and schema design. We treat it as an empirical question: run task-level evals at each effort level on your own tools before committing an architecture.

How big a risk is OpenAI's deprecation policy?

It is the defining operational characteristic of the platform. OpenAI guarantees at least 6 months of notice for GA models, and the record shows they use roughly that: the original GPT-5 goes from launch to shutdown in 16 months, and the 5.2 and 5.3 generations lived under a year. Plan for a forced migration every 12 to 18 months, pin snapshots, keep golden-set evals runnable on demand, and route all model access through a gateway so a migration is configuration, not code.

Can GPT-5 run on-premises or in our VPC?

No. GPT-5 is API-only; there are no weights to host, and that does not change with any pricing tier. If your constraint is data residency or infrastructure control, the decision is not GPT-5 versus another hosted frontier model, it is hosted-API versus open weights. That is the evaluation where models like DeepSeek V3 or Mistral Large 3 enter, and we run it as a routing decision rather than an either-or in most client architectures.

Does BearPlex build on GPT-5?

Yes, where the evaluation supports it. We treat OpenAI as one candidate lane in a model-agnostic gateway: effort-tiered routing across the 5.4 and 5.5 families, cached prefixes and batch pricing exploited by default, and a standing deprecation playbook so vendor velocity never becomes a production incident. Whether GPT-5 wins a given lane is decided by the client's own task evals, rerun quarterly.

GPT-5 in Production: The Engineering Decision Brief

Most GPT-5 coverage is a benchmark story. This brief is a procurement and architecture story, because that is what actually decides whether GPT-5 belongs in your stack. As of July 2026, "GPT-5" is not one model. It is a fast-moving platform: the original August 2025 release has already been deprecated, four point-releases have shipped and been retired behind it, and the current flagship is gpt-5.5. If you are evaluating OpenAI for a production system, the deprecation section of this brief matters at least as much as the capability section.

What it actually is

GPT-5 launched on August 7, 2025 (the retired API snapshot ID, gpt-5-2025-08-07, carries the date). The GPT-5 System Card described the consumer product as a unified system: a fast model for most questions, a deeper reasoning model for harder problems, and "a real-time router that quickly decides which model to use based on conversation type, complexity, tool needs, and explicit intent." The router was trained on live signals, including when users switched models and measured correctness, and OpenAI stated the plan was to eventually integrate everything into a single model.

That integration is essentially what the 5.x line delivered. The current flagship, gpt-5.5 (snapshot gpt-5.5-2026-04-23), exposes the routing decision to you as a parameter: reasoning_effort accepts none, low, medium (default), high, and xhigh. There is no hidden router in the API path. You choose, per request, how much thinking you buy. Specs that matter: a 1,050,000-token context window, 128,000 max output tokens, and a knowledge cutoff of December 1, 2025.

The engineering consequence of the router history is worth stating plainly: in the API, routing is your job. The consumer product made "which model, how much reasoning" an invisible platform decision; the API hands it back to you as the single biggest cost and latency lever you control.

Commercial terms

There is no license to negotiate in the open-weights sense; this is a hosted API under OpenAI's business terms. What you are actually agreeing to, structurally, is a dependency on OpenAI's model lifecycle. Their published policy commits to at least 6 months of notice before retiring generally available models and at least 3 months for specialized variants, per the deprecations page. Six months is the contractual floor you should plan around, because it is the number OpenAI plans around.

Real API cost

Verified against the official pricing page as of July 2026, per million tokens:

| Model | Input | Cached input | Output | |---|---|---|---| | gpt-5.5 | $5.00 | $0.50 | $30.00 | | gpt-5.5-pro | $30.00 | n/a | $180.00 | | gpt-5.4 | $2.50 | $0.25 | $15.00 | | gpt-5.4-mini | $0.75 | $0.075 | $4.50 | | gpt-5.4-nano | $0.20 | $0.02 | $1.25 |

Three modifiers change the real bill more than the headline rates:

Batch is a flat 50% off (gpt-5.5 at $2.50/$15.00). Any workload that tolerates asynchronous turnaround should be there by default.
Priority processing costs roughly 2.5x (gpt-5.5 at $12.50/$75.00). If your latency SLO forces the priority tier, your model is 2.5x more expensive than the number in your business case.
Long context is a premium product. Prompts beyond 272K input tokens are billed at 2x input and 1.5x output on gpt-5.5. The million-token window exists, but the economics push you toward retrieval and context discipline rather than stuffing the window.

The cached-input rate (10% of base) is the quiet workhorse: agent systems that keep a stable prompt prefix routinely see the majority of their input tokens billed at that rate.

Tool use and eval behavior that matters

gpt-5.5's tool surface is broad and first-party: web search, file search, code interpreter, hosted shell, apply-patch, computer use, MCP, and structured outputs are all supported per the model page. For production agents, the details we test for in every model engineering evaluation:

Reasoning effort is the reliability dial, not just a cost dial. Tool-selection and argument-construction errors drop as effort rises, and none turns the model into a fast, cheaper executor that is only appropriate for well-constrained tool schemas. Run your own task-level evals per effort level; the deltas are workload-specific and OpenAI publishes no per-effort reliability numbers.
Structured outputs are the contract. Schema-constrained generation is the single most effective control we know for keeping multi-step agents parseable at step forty.
The platform tools create soft lock-in. Hosted shell, file search, and computer use are excellent, and every one you adopt makes the eventual portability conversation harder. Wrap them behind your own interfaces from day one.

Deprecation history as platform risk

This is the section buyers skip and regret. The verified record from OpenAI's own deprecations page, as of July 2026:

The original GPT-5 (gpt-5-2025-08-07), plus its mini and nano variants, was deprecated on June 11, 2026 and shuts down December 11, 2026. Sixteen months from launch to shutdown.
gpt-5-chat-latest, gpt-5-codex, and the entire 5.1 codex family were deprecated April 22, 2026 and shut down July 23, 2026.
gpt-5.2-chat-latest and gpt-5.3-chat-latest shut down August 10, 2026. The 5.2 and 5.3 generations lived well under a year.
Even the GPT-4 era finally ends: gpt-4-0613, gpt-4-turbo, and the original gpt-4o snapshot shut down October 23, 2026.

The pattern is consistent: OpenAI ships fast and retires fast, honoring the 6-month floor and rarely much more. Practical implications: pin snapshots, budget a re-evaluation cycle roughly every two quarters, keep your eval suite runnable on demand so a forced migration is a regression test rather than a research project, and never hardcode a model ID deeper than one config file. Prompt behavior shifts between point releases; the deprecation notice is also a behavior-change notice.

When to use it, and when not

Use the GPT-5 platform when:

You want the broadest first-party tool ecosystem (hosted shell, computer use, MCP) with one vendor and one bill.
Your workload spans wildly different difficulty levels and you can exploit reasoning_effort plus the 5.4 mini/nano ladder to route cost.
Batch-eligible volume work dominates: $2.50/$15.00 for frontier-class capability is genuinely hard to beat.

Do not use it when:

Data cannot leave your infrastructure boundary. There are no weights; an open-weights model is the answer to that constraint, not a different API vendor.
Your organization cannot absorb a forced model migration every 12 to 18 months. The deprecation record above is the base rate, not a worst case.
You need multi-year behavioral stability for a regulated, validated workflow; pinned snapshots still retire.

How we would architect it for a client

The same routing-gateway discipline we apply to open-weights deployments applies here, with vendor risk added to the design inputs:

A model-agnostic gateway owns model IDs, so a deprecation is a config change plus an eval run, not a code change. Every prompt lives in version control with a golden-set eval attached.
Effort-tiered routing: gpt-5.4-nano or mini at low effort for extraction and classification volume, gpt-5.5 at medium for standard agent steps, high or xhigh reserved for the requests that measurably need it. The comparison with Anthropic's lineup is rerun quarterly, because both vendors move.
Caching and batch by default: stable prompt prefixes to exploit the $0.50 cached rate, and every non-interactive pipeline on the batch tier.
A deprecation playbook, not a deprecation panic: subscribe to the deprecations feed, hold the previous snapshot and the new one in A/B during migration windows, and treat the 6-month notice as the start of a scheduled project.

GPT-5 is a strong default choice in mid-2026. It stays a strong choice only for teams that engineer for the platform's velocity instead of pretending it is infrastructure.

GPT-5Frontier LLM

What it actually is

Commercial terms

Real API cost

Tool use and eval behavior that matters

Deprecation history as platform risk

When to use it, and when not

How we would architect it for a client

Frequently asked

Related work

Related reading

Shipping frontier llm in production?