A context window is the maximum amount of text, measured in tokens, that a transformer-based language model can attend to at once when producing an output. It defines the upper limit on how much prior conversation, documents, tool results, and the model's own generated tokens can be included in a single inference request.
What is a context window?
Large language models process text as tokens and use self-attention to relate each token to others in the current sequence. The context window is the model’s fixed (or configured) limit on that sequence length, often described as 8K, 32K, 128K tokens, etc. When the combined length of the prompt plus generated output exceeds this limit, older tokens must be truncated, summarized, or otherwise removed, because the model cannot “see” them anymore.
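This budgeting can be sketched in a few lines. The example below assumes a toy whitespace "tokenizer" (real models use subword tokenizers such as BPE) and a hypothetical `truncate_to_window` helper; it simply keeps the most recent tokens so that the prompt plus reserved output space fits the window:

```python
def truncate_to_window(tokens, max_context, reserve_for_output):
    """Keep the most recent tokens so prompt + generated output fits the window.

    max_context: total context window size in tokens
    reserve_for_output: tokens held back for the model's response
    """
    budget = max_context - reserve_for_output
    if len(tokens) <= budget:
        return tokens
    return tokens[-budget:]  # drop the oldest tokens first


# Toy whitespace tokenization; real token counts differ.
tokens = "the quick brown fox jumps over the lazy dog".split()
kept = truncate_to_window(tokens, max_context=8, reserve_for_output=3)
# budget = 5, so only the 5 most recent tokens survive
```

Dropping from the front is the simplest policy; production systems usually protect the system prompt and summarize the dropped middle instead.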
Context windows affect both capability and cost. Larger windows allow the model to reference long documents, follow complex multi-turn conversations, and keep more task state in-view. However, attention computation and memory, especially the KV cache, grow with sequence length, so long contexts increase latency and GPU memory usage. In production systems, the context window becomes an engineering constraint: you must decide what to include, what to drop, and how to compress information while preserving correctness.
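The memory side of this constraint is easy to estimate: the KV cache stores a key and a value vector per token, per layer, per KV head. A back-of-the-envelope calculator, using an illustrative (hypothetical) 7B-class configuration in FP16:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, dtype_bytes=2):
    """Estimate KV cache size: 2 (keys + values) per layer, head, and token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes


# Hypothetical config: 32 layers, 32 KV heads, head_dim 128, FP16 (2 bytes)
size_gib = kv_cache_bytes(32, 32, 128, seq_len=32_768) / 1024**3
# A 32K-token sequence needs roughly 16 GiB of KV cache for one request
```

This is why techniques like grouped-query attention (fewer KV heads) and KV cache quantization exist: the cache, not the weights, often dominates memory at long context lengths.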
Where it’s used and why it matters
Context window management matters in chat assistants, RAG applications, and agentic workflows that accumulate tool traces and documents. If relevant instructions or evidence fall outside the window, the model may ignore requirements, lose earlier decisions, or produce inconsistent answers. Teams use strategies like retrieval (bring back only relevant chunks), prompt compression, conversation summarization, and prefix caching to keep important information inside the window while controlling cost.
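One common strategy from the list above, conversation summarization, can be sketched as follows. The `summarize` callable is a stand-in for an LLM call (here a trivial placeholder); the function keeps recent turns verbatim and collapses everything older into a single summary entry:

```python
def compress_history(turns, keep_recent, summarize):
    """Keep the newest turns verbatim; replace older ones with a summary."""
    if len(turns) <= keep_recent:
        return turns
    old, recent = turns[:-keep_recent], turns[-keep_recent:]
    return [summarize(old)] + recent


# Placeholder summarizer; in practice this would be another model call.
def summarize(old_turns):
    return f"[summary of {len(old_turns)} earlier turns]"


history = [f"turn {i}" for i in range(10)]
compressed = compress_history(history, keep_recent=3, summarize=summarize)
# 10 turns collapse to 1 summary + 3 recent turns
```

The trade-off is lossy: anything the summarizer omits is gone, which is why critical constraints are often pinned in the system prompt rather than left in summarizable history.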
Examples
- Multi-turn chat: After many messages, an assistant may forget early constraints because the oldest turns are truncated.
- Long-document Q&A: A 200-page PDF may not fit; a RAG system retrieves only the most relevant sections to include.
- Agents with tool logs: Tool outputs can be verbose, so orchestration may store full traces externally and inject only the necessary parts.
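The retrieval pattern behind the long-document and agent examples can be illustrated with a deliberately naive keyword-overlap ranker. Real RAG systems use embedding similarity and vector search; this sketch only shows the shape of "score chunks, keep the top k":

```python
def top_k_chunks(chunks, query, k=2):
    """Rank chunks by word overlap with the query and keep the top k."""
    query_words = set(query.lower().split())

    def score(chunk):
        return len(query_words & set(chunk.lower().split()))

    return sorted(chunks, key=score, reverse=True)[:k]


chunks = [
    "the context window limits how many tokens fit",
    "bananas are yellow and sweet",
    "window cleaning tips for tall buildings",
]
relevant = top_k_chunks(chunks, "context window size", k=2)
# Only the two most relevant chunks are injected into the prompt
```

Only the selected chunks enter the prompt, keeping the rest of the corpus outside the window but still reachable on later queries.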
FAQs
Is a context window the same as “memory”? Not exactly. The context window is what the model can attend to in one request. “Memory” often refers to external storage (databases, vector stores) that can be retrieved into the context.
Does a larger context window always improve accuracy? It can help, but not always. Very long prompts can include noise, distract attention, and increase cost. Retrieval and good ranking still matter.
How do I handle prompts longer than the window? Common options are truncation, summarization, chunking + RAG retrieval, or using a model with a larger window.
Why does serving long contexts get expensive? Longer prompts increase prefill compute and KV cache memory, reducing concurrency and raising latency.