The Context Window Arms Race — What 1 Million Tokens Actually Means

What is a context window?

A context window is the maximum amount of text an AI model can "see" at once: its working memory for a single conversation or task. Everything outside the context window is invisible to the model. This is why a model has no built-in memory between conversations: each new conversation starts with an empty context.

One token is roughly 0.75 English words, though the exact ratio depends on the tokenizer and the language. So a 128,000-token context window holds approximately 96,000 words, about the length of a novel. A 1-million-token window holds roughly 750,000 words, more than War and Peace.
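The arithmetic above can be sketched in a few lines of Python. The function names are made up for illustration, and the 0.75 ratio is only the rule of thumb from the text; a real tokenizer will give different counts for code, non-English text, or unusual formatting.

```python
import math

# Rule of thumb from the text; an approximation, not a tokenizer.
WORDS_PER_TOKEN = 0.75


def tokens_to_words(tokens: int) -> int:
    """Estimate how many English words fit in a given number of tokens."""
    return round(tokens * WORDS_PER_TOKEN)


def words_to_tokens(words: int) -> int:
    """Estimate how many tokens a text of the given word count will use."""
    return math.ceil(words / WORDS_PER_TOKEN)


print(tokens_to_words(128_000))    # 96000
print(tokens_to_words(1_000_000))  # 750000
print(words_to_tokens(96_000))     # 128000
```

For anything where the count matters (billing, hard limits), the provider's own token counter is the safer source.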

2023 — Tight limits

GPT-3.5: about 4,000 tokens (~3,000 words). Early GPT-4: about 8,000 tokens (~6,000 words). A standard business report often exceeded the limit. Documents had to be chunked, summarised, and processed in pieces. Complex retrieval pipelines were often essential.
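To make "chunked" concrete, here is a minimal sketch of the kind of splitting those small windows forced. It is an illustration under simple assumptions, not how any particular pipeline worked: it splits on whitespace, estimates tokens with the 0.75 words-per-token rule of thumb, and ignores sentence and paragraph boundaries. The function name `chunk_by_token_budget` is hypothetical.

```python
def chunk_by_token_budget(
    text: str,
    max_tokens: int = 4_000,
    words_per_token: float = 0.75,
) -> list[str]:
    """Split text into pieces that each fit an estimated token budget."""
    # Convert the token budget into a word budget using the rule of thumb.
    max_words = int(max_tokens * words_per_token)
    if max_words < 1:
        raise ValueError("max_tokens is too small to hold a single word")

    words = text.split()
    # Take consecutive slices of at most max_words words each.
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]


report = "word " * 7_000  # stand-in for a 7,000-word business report
chunks = chunk_by_token_budget(report)
print(len(chunks))  # 3
```

Each chunk then had to be summarised or queried separately, which is exactly where cross-references between distant sections got lost.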

2026 — Effectively unlimited for most tasks

GPT-5.5: 1M tokens (API). Claude Opus 4.8: 1M tokens, with no surcharge since Anthropic eliminated long-context pricing in March 2026. Gemini 3.1 Pro: 1M tokens. The "arms race" has effectively ended in a tie: the frontier labs have converged on roughly the same ceiling. An entire codebase, a year of emails, or a complete legal contract archive can often fit in a single prompt, regardless of which provider you use.

What larger context windows actually enable

Whole-document analysis. A 300-page contract, a complete annual report, or a lengthy research paper can now be fed to a model in its entirety and analysed holistically — rather than summarised in sections and reassembled. Subtle cross-references and contradictions that span large documents become detectable.

Large codebase comprehension. Developers can now give a model an entire repository and ask questions about architecture, dependencies, or bugs with full context. Previously, the model could only see the files you explicitly included.

Extended conversations. Long research sessions, complex multi-turn negotiations, or iterative creative projects can run for hours without the model losing track of what was established early in the conversation.

Reduced need for RAG. Retrieval-Augmented Generation (RAG), the technique of retrieving relevant document chunks before querying the model, is less essential when the entire document fits in context. For smaller knowledge bases, direct inclusion is now often simpler and more accurate than building a retrieval pipeline.
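Whether "the entire document fits in context" can be checked roughly before deciding on an architecture. The sketch below is one way to do it, assuming the 0.75 words-per-token rule of thumb and a hypothetical 8,000-token reserve for instructions and the model's reply; both numbers are placeholders to adjust for your own setup.

```python
import math


def fits_in_context(
    document_words: int,
    context_window_tokens: int,
    reserved_tokens: int = 8_000,   # assumption: room for instructions and the reply
    words_per_token: float = 0.75,  # rule of thumb, not a real tokenizer
) -> bool:
    """Return True if the document plus the reserve fits in the window."""
    estimated_tokens = math.ceil(document_words / words_per_token)
    return estimated_tokens + reserved_tokens <= context_window_tokens


# A 90,000-word manuscript against a 128,000-token window.
print(fits_in_context(90_000, 128_000))     # True
# A 600,000-word archive against a 1M-token window.
print(fits_in_context(600_000, 1_000_000))  # True
# The same archive against a 128,000-token window.
print(fits_in_context(600_000, 128_000))    # False
```

A "fits" result only says direct inclusion is possible. Cost, speed, and accuracy on long inputs still decide whether it is the better choice.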

Where the limitations still are

Large context windows don't eliminate all problems — they shift them.

The "lost in the middle" problem. Research on long-context models has repeatedly found that accuracy tends to drop for information placed in the middle of a very long context compared with the beginning or end, although the size of the effect varies by model. A model given a 500-page document may answer questions about page 1 and page 500 accurately while missing something critical on page 250.
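One common way to check how a given model behaves is to plant a known fact (a "needle") at different depths of a long filler document and ask about it. The sketch below only builds the test documents; it does not call any model, and the function name and the sample needle are hypothetical.

```python
def insert_needle(filler_paragraphs: list[str], needle: str, depth: float) -> str:
    """Place `needle` at a relative depth (0.0 = start, 1.0 = end) of the filler."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    # Translate the relative depth into a paragraph index.
    position = round(depth * len(filler_paragraphs))
    paragraphs = filler_paragraphs[:position] + [needle] + filler_paragraphs[position:]
    return "\n\n".join(paragraphs)


filler = [f"Filler paragraph {i}." for i in range(10)]
needle = "The access code for the archive is 7431."  # made-up fact to retrieve

# One test document per depth: start, middle, end.
test_documents = {
    depth: insert_needle(filler, needle, depth) for depth in (0.0, 0.5, 1.0)
}
```

Sending each document with the same question and comparing the answers by depth gives a rough picture of where a particular model starts to miss things.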

Cost. Processing large contexts is expensive. Input is typically billed per token, so a 1-million-token prompt costs roughly 500 times as much as a 2,000-token query at the same per-token rate, even at 2026 pricing. For high-volume applications, context window size and cost need to be balanced carefully.
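A back-of-the-envelope calculation shows how quickly this adds up. The sketch assumes flat per-token input pricing and uses a placeholder price of 2.00 USD per million tokens; that figure is hypothetical and not a quote from any provider, and output tokens are ignored.

```python
def input_cost_usd(input_tokens: int, price_per_million_usd: float) -> float:
    """Input-side cost of one request, assuming flat per-token pricing."""
    return input_tokens / 1_000_000 * price_per_million_usd


def monthly_input_cost_usd(
    input_tokens: int,
    requests_per_day: int,
    price_per_million_usd: float,
    days: int = 30,
) -> float:
    """Input-side cost of sending the same-sized request every day for a month."""
    return input_cost_usd(input_tokens, price_per_million_usd) * requests_per_day * days


HYPOTHETICAL_PRICE = 2.00  # USD per million input tokens; placeholder only

short_query = input_cost_usd(2_000, HYPOTHETICAL_PRICE)
full_window = input_cost_usd(1_000_000, HYPOTHETICAL_PRICE)
monthly = monthly_input_cost_usd(1_000_000, 100, HYPOTHETICAL_PRICE)

print(f"short query: ${short_query:.4f}")
print(f"full window: ${full_window:.2f}")
print(f"100 full-window requests a day for 30 days: ${monthly:,.2f}")
```

Swapping in your provider's real rates, and adding output tokens and any caching discounts, turns this into a usable budget estimate.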

Speed. Processing very long contexts takes longer. For real-time applications, there is often a practical limit well below the theoretical maximum.

Practical guidance

For documents under ~50 pages: include the full document in context. For larger corpora: RAG is still the better architectural choice. For anything critical in a very long context: ask specific targeted questions rather than open-ended ones, and verify answers against the source.
