What is a Context Window in LLMs?

A context window is the maximum number of tokens an LLM can process in a single inference call. It covers both the input prompt (system instructions, conversation history, retrieved documents) and the model's generated output. As a rule of thumb, 1 token ≈ 0.75 English words.

Last updated 2026-04-28 · BearPlex AI Engineering Team

Overview

Context window size is one of the most-discussed LLM specifications because it bounds what the model can 'see' at once. GPT-3.5 launched with 4K tokens; GPT-4 shipped in 8K and 32K variants; Claude 2.1 and then Claude 3 offered 200K; Gemini 1.5 Pro introduced 1M, later 2M; and 2025-2026 saw models cross the 10M-token threshold. Larger context windows enabled new application patterns (entire codebases in a prompt, full quarters of financial filings, hour-long meeting transcripts), but they also exposed new failure modes (context rot, lost-in-the-middle, attention dilution) that engineering teams now design around.

How context windows work

When you send a request to an LLM API, every token in the prompt (system message, prior conversation, retrieved documents, tool definitions, the user's current message) counts against the context window. The model's output draws on the same budget, and providers often cap it separately as 'max output tokens' for cost predictability. Internally, the transformer computes attention across every pair of tokens in the context, so standard attention is O(n²) in compute and memory; this is why context lengths grew slowly at first. Techniques such as sliding window attention, ring attention, FlashAttention, and Mamba-style state-space models changed the economics and helped enable the recent jump to multi-million-token windows. Despite the marketing, effective context (the range over which the model actually attends well) usually lags the advertised window size: a 1M-token model might lose recall on facts in the middle of a 500K-token prompt.
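
The budget arithmetic described above can be sketched in a few lines. This is a minimal illustration, not any provider's API: the `PromptParts` class, its field names, and the token counts are all assumptions made up for the example, and real counts would come from the provider's tokenizer.

```python
from dataclasses import dataclass


@dataclass
class PromptParts:
    """Token counts for each part of a request (hypothetical fields and values)."""
    system: int = 0
    history: int = 0
    retrieved_docs: int = 0
    tools: int = 0
    user_message: int = 0

    def input_tokens(self) -> int:
        # Every part of the prompt counts against the same window.
        return (self.system + self.history + self.retrieved_docs
                + self.tools + self.user_message)


def fits_in_window(parts: PromptParts, max_output_tokens: int,
                   context_window: int) -> bool:
    """True when the input plus the reserved output stays within the window."""
    return parts.input_tokens() + max_output_tokens <= context_window


def remaining_tokens(parts: PromptParts, max_output_tokens: int,
                     context_window: int) -> int:
    """Tokens left after the input and the reserved output (negative if over)."""
    return context_window - parts.input_tokens() - max_output_tokens


# Illustrative numbers only: 200K of input plus a 50K output reservation.
parts = PromptParts(system=1_500, history=40_000, retrieved_docs=150_000,
                    tools=3_500, user_message=5_000)
print(fits_in_window(parts, 50_000, 200_000))  # False
print(fits_in_window(parts, 50_000, 250_000))  # True
```

Reserving the output allowance up front, rather than checking only the input, is what keeps a long prompt from leaving the model no room to answer.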

Why context window size matters in production

Larger context lets you avoid two awkward engineering problems: (1) summarization loss: when you compress conversation history to fit, you lose details the model might have needed; and (2) RAG complexity: when the relevant context fits in the prompt, you can skip vector search and retrieval entirely. But larger context isn't always better. Long contexts are expensive (most providers charge per token, so a 500K-token input costs 500× as much as a 1K-token input), slow (latency grows with context size), and lower in quality once the prompt exceeds the model's effective context length. In production we usually right-size context per task: short context for chat replies, medium context with RAG for knowledge queries, long context only when the task genuinely requires it (full-document analysis, codebase reasoning).
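
To make the linear cost scaling concrete, here is a small sketch. The function name and the per-million-token price are hypothetical placeholders, not real provider pricing; only the proportionality matters.

```python
def input_cost_usd(input_tokens: int, usd_per_million_tokens: float) -> float:
    """Input cost under simple per-token pricing (no caching, no tiers)."""
    return input_tokens / 1_000_000 * usd_per_million_tokens


# Hypothetical price, chosen only for illustration.
PRICE_PER_MILLION = 3.00

small = input_cost_usd(1_000, PRICE_PER_MILLION)
large = input_cost_usd(500_000, PRICE_PER_MILLION)

print(f"1K-token input:   ${small:.4f}")
print(f"500K-token input: ${large:.4f}")
print(f"ratio: {large / small:.0f}x")
```

Because the cost is paid on every call, the same ratio applies to each turn of a conversation that resends a long history.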

Context window vs effective context

The advertised context window is the maximum the model will accept; the effective context is the range across which the model maintains good recall and reasoning. Benchmarks like 'needle in a haystack' (planting a fact deep in a long context and asking the model to retrieve it) and the more rigorous RULER and LongBench benchmarks measure effective context. As of 2026, top frontier models maintain near-perfect recall to roughly 200K-500K tokens, with gradual degradation past that. The practical implication: if you're building on a 1M-token model, don't assume you can stuff 950K tokens into the prompt and get GPT-4-quality reasoning; benchmark first.
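
A needle-in-a-haystack check of the kind described above might be sketched as follows. This is a simplified illustration, not the RULER or LongBench methodology: `ask_model` is a hypothetical callable standing in for whatever client you use, and the substring match is a deliberately crude scoring rule.

```python
from typing import Callable


def build_haystack(filler_sentences: list[str], needle: str, depth: float) -> str:
    """Insert `needle` at a relative depth (0.0 = start, 1.0 = end) of the filler."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = round(depth * len(filler_sentences))
    sentences = filler_sentences[:position] + [needle] + filler_sentences[position:]
    return " ".join(sentences)


def recall_by_depth(ask_model: Callable[[str], str],
                    filler_sentences: list[str],
                    needle: str,
                    question: str,
                    expected: str,
                    depths: list[float]) -> dict[float, bool]:
    """For each depth, plant the needle, ask the question, and record a hit or miss."""
    results = {}
    for depth in depths:
        prompt = build_haystack(filler_sentences, needle, depth) + "\n\n" + question
        answer = ask_model(prompt)
        # Crude scoring: count a hit if the expected string appears in the answer.
        results[depth] = expected.lower() in answer.lower()
    return results
```

In practice you would repeat this across several context lengths as well as depths, since recall usually depends on both; a model that passes at one length may still fail when the same needle sits in a much longer prompt.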

Use cases

  • Long document analysis (entire contracts, full financial filings, complete medical records in one prompt)
  • Codebase-wide reasoning (load the full repo, ask the model to refactor across files)
  • Meeting transcript summarization (hour-long calls without chunking)
  • Multi-document synthesis (compare 50 research papers in a single inference call)
  • Reduced-RAG architectures where retrieval is replaced by direct context loading for small corpora

Examples in production

Anthropic

Claude 3 launched in March 2024 with a 200K-token context window, among the largest then generally available in a frontier model, enabling whole-codebase analysis in a single call.

Google DeepMind

Gemini 1.5 Pro introduced a 1M-token context window in February 2024, later extended to 2M, demonstrating near-perfect recall on the needle-in-a-haystack benchmark.

Magic.dev

Released LTM-2-Mini in 2024 with a 100M-token context window, explicitly targeted at codebase-wide reasoning use cases.

Context Window compared to alternatives

Alternative: RAG (retrieval-augmented generation: search a knowledge base, inject only relevant chunks into the prompt)
  • Choose Context Window when: the corpus is small enough to fit in one prompt and you want zero retrieval failure modes
  • Choose alternative when: the corpus is large, growing, or requires permission-aware retrieval

Alternative: Fine-tuning (modifying model weights to bake knowledge or behavior into the model)
  • Choose Context Window when: the knowledge changes frequently or differs per user/tenant
  • Choose alternative when: you need stable behavior changes or domain-specific style
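
The long-context-versus-RAG criteria above could be encoded as a simple routing rule. This is one possible sketch, not a prescribed policy: the function name, its parameters, and the strategy labels are hypothetical, and `usable_context_tokens` is assumed to be the effective context you have measured minus whatever you reserve for output. Fine-tuning is left out because it is a build-time choice rather than a per-request one.

```python
def choose_strategy(corpus_tokens: int,
                    usable_context_tokens: int,
                    corpus_is_growing: bool = False,
                    needs_permission_filtering: bool = False) -> str:
    """Return "long_context" or "rag" using the criteria from the comparison."""
    # Growing corpora and permission-aware access favor retrieval.
    if corpus_is_growing or needs_permission_filtering:
        return "rag"
    # A small, stable corpus can be loaded directly into the prompt.
    if corpus_tokens <= usable_context_tokens:
        return "long_context"
    return "rag"


print(choose_strategy(80_000, 150_000))      # long_context
print(choose_strategy(5_000_000, 150_000))   # rag
```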

Common pitfalls

  • Assuming advertised context = usable context: most models degrade well before their nominal limit
  • Stuffing the prompt with everything 'just in case': hurts both cost and accuracy
  • Ignoring lost-in-the-middle effects: facts placed in the middle of long context are recalled less reliably than facts at the start or end
  • Forgetting that output tokens count against the budget: a 200K input + a 50K output requires a 250K context model
  • Trusting needle-in-a-haystack scores as evidence of full reasoning capability: recall ≠ reasoning
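
One common way to soften the lost-in-the-middle pitfall listed above is to order documents so the most relevant ones sit at the start and end of the prompt. The sketch below is an illustrative mitigation, not something the benchmarks above prescribe; it assumes the input list is already sorted from most to least relevant, and the function name is made up for the example.

```python
def reorder_for_long_context(docs_by_relevance: list[str]) -> list[str]:
    """Place the most relevant items at the edges and the least relevant in the middle.

    Assumes `docs_by_relevance` is sorted from most to least relevant.
    """
    front, back = [], []
    for index, doc in enumerate(docs_by_relevance):
        # Alternate: even ranks fill the front, odd ranks fill the back.
        if index % 2 == 0:
            front.append(doc)
        else:
            back.append(doc)
    # Reverse the back half so the second-most-relevant item ends up last.
    return front + back[::-1]


print(reorder_for_long_context(["a", "b", "c", "d", "e"]))
# ['a', 'c', 'e', 'd', 'b']
```

Whether this helps depends on the model, so it is worth measuring against a plain relevance-ordered prompt before adopting it.
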

FAQ

Questions about Context Window.

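How many words is one token?

Roughly 0.75 English words on average, which works out to about 1.33 tokens per word. Exact counts depend on the tokenizer: with common tokenizers 'Hello world' is 2 tokens, and a short brand name such as 'Anthropic' may be 1 token or several. Code, non-English text, and rare words tokenize less efficiently than English prose, so a Python file typically needs more tokens per word than that average; count with the provider's tokenizer rather than estimating.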
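
The 0.75 words-per-token rule of thumb can be turned into a quick estimator. This is only a rough planning heuristic for English prose, with a made-up function name; it does not replace a real tokenizer, and on very short strings it can be off (it estimates 3 tokens for 'Hello world').

```python
import math

# Rule of thumb from the text: 1 token is roughly 0.75 English words.
WORDS_PER_TOKEN = 0.75


def estimate_tokens(text: str) -> int:
    """Estimate the token count of English prose from its whitespace-split word count."""
    word_count = len(text.split())
    return math.ceil(word_count / WORDS_PER_TOKEN)


print(estimate_tokens("one two three"))  # 4
```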

Does a larger context cost more?

Yes. Almost all providers charge per token, so a 100K-token input costs 100× a 1K-token input on the same model. Some providers (Anthropic, OpenAI) offer prompt caching that cuts the cost of a cached prefix by up to about 90%, which makes long-context architectures far more affordable for repeated queries.
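
The effect of a cached prefix on repeated queries can be estimated with a short sketch. The function, the price, and the default discount are assumptions for illustration: the 90% figure is taken from the text above, actual discounts vary by provider and model, and this ignores any surcharge for writing to the cache or cache expiry.

```python
def request_cost_usd(prefix_tokens: int,
                     fresh_tokens: int,
                     usd_per_million_tokens: float,
                     prefix_cached: bool = False,
                     cached_discount: float = 0.90) -> float:
    """Input cost of one request, with an optional discount on the cached prefix."""
    prefix_rate = usd_per_million_tokens
    if prefix_cached:
        # Cached prefix tokens are billed at a reduced rate (assumed discount).
        prefix_rate = usd_per_million_tokens * (1 - cached_discount)
    total = prefix_tokens * prefix_rate + fresh_tokens * usd_per_million_tokens
    return total / 1_000_000


# Hypothetical price: a 100K-token shared prefix plus a 1K-token question.
PRICE_PER_MILLION = 3.00
uncached = request_cost_usd(100_000, 1_000, PRICE_PER_MILLION)
cached = request_cost_usd(100_000, 1_000, PRICE_PER_MILLION, prefix_cached=True)
print(f"uncached: ${uncached:.3f}, cached: ${cached:.3f}")
```

The saving only applies to the part of the prompt that is identical across requests, which is why stable material such as a large document or system prompt is usually placed first.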

Should I use long context or RAG?

Both have roles. We use long context when the corpus is small enough to fit and we want zero retrieval failures. We use RAG when the corpus is large, when retrieval permissions matter, or when cost-per-query needs to stay flat as the corpus grows. Many production systems use both: RAG for the bulk of queries, long-context fallback for hard cases.

What is the largest context window available?

Magic.dev's LTM-2-Mini reaches 100M tokens. Among the major frontier models, Gemini 1.5 Pro at 2M and Claude at 1M lead the pack as of Q1 2026. New entrants regularly extend the frontier, so check provider docs for current limits.
