What does Gemini 3.5 Flash cost?

Verified against the Gemini API pricing page as of July 2026: $1.50 per million input tokens and $9.00 per million output tokens on the paid tier, with batch processing at half price ($0.75/$4.50) and context-cache reads at $0.15 per million plus $1.00 per hour of storage. A free tier exists on the Developer API. Note the tier shift: 3.5 Flash costs about 5x the input rate of Gemini 2.5 Flash, so Flash no longer means the budget lane.

How big is Gemini 3.5 Flash's context window?

1,048,576 input tokens with a 65,536-token output limit, per the official model page. Unlike the Pro models (Gemini 3.1 Pro bills input at $4.00 instead of $2.00 beyond 200K tokens), 3.5 Flash has no published long-context price tier, which makes it the economical choice for genuinely long inputs. We still cap context per request and prefer retrieval where it works: a full-window call is roughly $1.57 of input before output costs.

Should we access Gemini through AI Studio or Vertex AI?

AI Studio and the Gemini Developer API give you API-key access and a free tier, which is ideal for prototyping and fine for many production products. Vertex AI serves the same model family with enterprise controls: IAM, VPC Service Controls, customer-managed encryption keys, Private Service Connect, audit logging, and Provisioned Throughput for reserved capacity. Our rule: regulated data, contractual capacity needs, or an existing GCP governance perimeter mean you build on Vertex from day one.

Is Gemini 3.5 Flash really better than Gemini 3.1 Pro?

That is Google's claim, and the numbers behind it (76.2% Terminal-Bench 2.1, 83.6% MCP Atlas, outperforming 3.1 Pro on agentic coding benchmarks) are Google's own launch figures, so we treat them as directional. Two facts are independently verifiable: 3.5 Flash is GA while 3.1 Pro remains preview, and 3.5 Flash is cheaper ($1.50/$9.00 versus $2.00/$12.00). For production work in July 2026, the GA flagship is the sensible default and your own task evals settle the rest.

How does Gemini 3.5 Flash pricing compare to GPT-5.5 and Claude?

On July 2026 list prices, aggressively: $1.50/$9.00 per million tokens versus $5.00/$30.00 for gpt-5.5 and $3/$15 for Sonnet-class Claude models. List price is not cost of ownership: token consumption per task, caching behavior, output verbosity, and failure-retry rates differ by model, so we compare cost per completed task on the client's own workload rather than per-token rates. But the headline gap is real and is exactly the pressure Google priced for.

Does BearPlex build on Gemini?

Yes, where evals support it, most often for long-input multimodal workloads (video, audio, large document sets) and for clients already inside the GCP perimeter where Vertex governance is the path of least resistance. We run it as one lane in a model-agnostic gateway with 3.1 Flash-Lite underneath for volume traffic, context budgets enforced at the gateway, and a quarterly cross-vendor bake-off, because Google's lineup changed three times in the year to July 2026 and stale assumptions are the real cost driver.

Gemini 3.5 Flash: The Engineering Decision Brief

Q: What is the current flagship Gemini model?

As of July 2026, the flagship stable model on ai.google.dev is Gemini 3.5 Flash (model ID gemini-3.5-flash), which Google describes as its most intelligent model for agentic and coding tasks. The strongest Pro-branded model, Gemini 3.1 Pro, is still in preview, and Gemini 3.5 Pro was announced as in progress at I/O 2026 but had not reached the API model list. The Gemini 2.5 line remains available as the previous stable generation.

The first thing to verify about Gemini in mid-2026 is the name, because Google's lineup inverted the usual hierarchy. The current flagship is not a Pro model. Per ai.google.dev, the top stable model is Gemini 3.5 Flash, which Google describes as its "most intelligent model for sustained frontier performance on agentic and coding tasks." The strongest Pro-branded model, Gemini 3.1 Pro, is still in preview, and the announced Gemini 3.5 Pro had not reached the API model list as of July 2026. If your evaluation matrix still says "Gemini 2.5 Pro" in the flagship slot, it is a year out of date; if it says "Flash means the cheap tier," it is two months out of date.

What it actually is

Gemini 3.5 Flash (announced May 19, 2026 at Google I/O) is a generally available multimodal model: text, image, video, audio, and PDF in, text out. The verified spec sheet from the official model page: model ID gemini-3.5-flash, 1,048,576-token input window, 65,536-token output limit, knowledge cutoff January 2025, with thinking, function calling, structured outputs, context caching, code execution, computer use (preview), search grounding, and a batch API all supported.

Google's own launch numbers position it above its Pro sibling: 76.2% on Terminal-Bench 2.1, 83.6% on MCP Atlas, and an "outperforms Gemini 3.1 Pro on challenging coding and agentic benchmarks" framing, plus a claim of roughly 4x the output speed of other frontier models. Those are vendor-reported; we cite them as Google's claims, not as findings. The credible core is the strategy: Google collapsed "frontier intelligence" and "high throughput" into one SKU and priced it accordingly.

Commercial terms

Hosted API, no weights, two distinct front doors, and choosing the door is a real architectural decision:

AI Studio / Gemini Developer API: API-key access, a genuinely useful free tier (verified on the pricing page), minimal setup. This is the prototyping path, and for many products it remains the production path.
Vertex AI: the same model family behind Google Cloud's enterprise controls. Per the Vertex model docs, this is where you get IAM-based access control, VPC Service Controls, customer-managed encryption keys, Private Service Connect, audit logging, and Provisioned Throughput for reserved capacity.

The rule we apply in client work: if the workload touches regulated data, needs contractual capacity, or must live inside an existing GCP governance perimeter, build on Vertex from day one. Migrating from an AI-Studio-keyed codebase to Vertex auth and quotas later is not hard, but it is never free, and it always lands in the week you least want it.

Real API cost

Verified against the Gemini API pricing page as of July 2026, paid tier, per million tokens:

| Model | Input | Output | Status | |---|---|---|---| | Gemini 3.5 Flash | $1.50 | $9.00 | GA | | Gemini 3.1 Pro (≤200K / >200K) | $2.00 / $4.00 | $12.00 / $18.00 | Preview | | Gemini 3.1 Flash-Lite | $0.25 | $1.50 | GA | | Gemini 2.5 Flash | $0.30 | $2.50 | GA | | Gemini 2.5 Pro (≤200K / >200K) | $1.25 / $2.50 | $10.00 / $15.00 | GA |

Batch runs at 50% off across the line (3.5 Flash at $0.75/$4.50), and context caching bills cached 3.5 Flash tokens at $0.15 with $1.00/hour storage.

Two readings matter. First, the flagship price point is aggressive: $1.50/$9.00 sits well under gpt-5.5's $5/$30 and under Sonnet-class $3/$15, which is exactly the pressure Google intends. Second, "Flash" no longer means cheap: 3.5 Flash costs 5x the input of 2.5 Flash. Teams that treated Flash as the bargain tier need to re-point that role at 3.1 Flash-Lite ($0.25/$1.50) and re-run the math.

The long-context story

The entire current Gemini line runs a 1,048,576-token input window, and Gemini has held the 1M position longer than any competitor, since early 2024. The 2026 nuance is in the billing: on the Pro models, tokens beyond 200K are billed at a higher tier (3.1 Pro doubles to $4.00 input above 200K), while 3.5 Flash has no published long-context surcharge, making it the economically interesting choice for genuinely long inputs.

Our standing engineering advice survives the big window: a million tokens of context is a capability, not an architecture. Retrieval that puts the right 30K tokens in front of the model beats 800K tokens of haystack on cost, latency, and usually accuracy. Where the 1M window genuinely earns its keep is whole-corpus reasoning you cannot pre-chunk well: full codebases, long video, discovery-style document sets, and as the fallback lane when RAG misses. Use context caching aggressively either way; at $0.15 per million cached tokens, a stable long prefix costs 10% of resending it.

When to use it, and when not

Use Gemini 3.5 Flash when:

Long-input multimodality is the workload: video, audio, PDFs, and mixed document sets at 1M-token scale with no long-context price cliff.
You want frontier-class agent capability at $1.50/$9.00, the lowest verified flagship price point of July 2026, and you will validate quality on your own evals.
You are already a GCP shop: Vertex governance, Provisioned Throughput, and existing IAM make it the path of least organizational resistance.

Do not use it when:

Weights-on-your-hardware is the constraint; that conversation belongs to DeepSeek V3 and Mistral Large 3.
You need the Pro-branded top end today: 3.1 Pro is still preview-stage, and preview models are not something we let clients build load-bearing systems on.
Your budget tier was actually 2.5-Flash-shaped. The new bargain lane is 3.1 Flash-Lite; benchmark it before assuming 3.5 Flash is the drop-in upgrade.

How we would architect it for a client

Decide the front door first. Regulated or capacity-sensitive workloads go Vertex (VPC-SC, CMEK, Provisioned Throughput); everything else can start on the Developer API with the free tier doing evaluation duty. Wrap the client SDK so the door can change without touching product code.
Two-lane routing inside the family: 3.1 Flash-Lite for extraction and classification volume, 3.5 Flash for agentic and long-context work. The 6x price gap between lanes dominates most optimization you could do inside either lane.
Long-context with a budget guard. Even without a surcharge, a full-window 3.5 Flash call is roughly $1.57 of input; we cap per-request context in the gateway, cache stable prefixes, and route to retrieval when the requested window exceeds the cap.
Quarterly cross-vendor bake-off. Google's lineup moved three times in the twelve months to July 2026 (2.5 to 3 to 3.1 to 3.5); the managed-versus-self-hosted question and the vendor question get rerun on your tasks, on schedule, in every model engineering engagement.

The naming will keep churning. The engineering posture that survives it: verify the lineup at the docs page, price the tiers from the pricing page, and let your own evals, not the brand suffix, assign each model its lane.

Gemini 3.5 FlashFrontier LLM

What it actually is

Commercial terms

Real API cost

The long-context story

When to use it, and when not

How we would architect it for a client

Frequently asked

Related work

Related reading

Shipping frontier llm in production?