
Context Window Management

Intermediate · Context Engineering · Anthropic (2025)

Intent

Strategically curate what goes into the LLM's context window to maximize output quality within token limits.

Problem

Context windows are large but not infinite. In long conversations or complex tasks, you'll exceed the limit. But even before hitting the limit, cramming everything in degrades quality — the model's attention gets diluted. Not all context is equally important, and the model can't tell you what it needs.

Solution

Treat context as a precious, finite resource to be actively managed. Strategies include:

  • Summarization: compress old conversation turns into concise summaries
  • Sliding window: keep only the N most recent turns plus a summary of earlier ones
  • Priority injection: always include high-priority context (system prompt, key facts) first
  • Token budgeting: allocate token budgets to different context sections
  • Auto-compaction: automatically compress when approaching limits
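Token budgeting and priority injection can be combined into a simple allocator: fill the budget in priority order, trimming only the lowest-priority sections. This is a minimal sketch; the section names, priority scheme, and the ~4-characters-per-token estimate are illustrative assumptions, not part of any particular API.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    # A real tokenizer gives exact counts; this is only a sketch.
    return max(1, len(text) // 4)

def fit_to_budget(sections: list[tuple[str, str, int]], budget: int) -> dict[str, str]:
    """sections: (name, text, priority) — lower number = higher priority, kept first."""
    result: dict[str, str] = {}
    remaining = budget
    for name, text, _prio in sorted(sections, key=lambda s: s[2]):
        cost = estimate_tokens(text)
        if cost <= remaining:
            result[name] = text
            remaining -= cost
        else:
            # Crudely truncate to the leftover budget; a real system
            # would summarize this section instead of cutting it.
            result[name] = text[: remaining * 4]
            remaining = 0
    return result
```

High-priority sections (system prompt, key facts) are admitted whole; only the history section, visited last, gets trimmed when the budget runs out.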

Diagram

Available Context Window: [████████████████████████]

Allocation:
[System Prompt ███] [Key Facts ██] [Recent History ████████] [Tools ██] [Buffer ██]
  (always kept)     (always kept)  (sliding window)          (dynamic)  (for response)

As conversation grows → older history summarized → key info preserved

When to Use

  • Any agent with conversations or tasks that exceed context limits
  • Long-running agents that accumulate tool outputs over time
  • Multi-turn conversations where early context matters
  • Systems where context quality directly impacts output quality

When NOT to Use

  • Short, single-turn interactions that fit easily in context
  • When you can simply use a model with a larger context window

Pros & Cons

Pros

  • Maintains output quality in long interactions
  • Prevents context overflow errors
  • Forces intentional curation of what the model sees
  • Can reduce costs by using smaller context windows effectively

Cons

  • Summarization can lose important details
  • Complexity of managing context budgets
  • Different strategies work for different use cases — no one-size-fits-all
  • Context management itself consumes tokens

Implementation Steps

  1. Measure your context usage: how much do conversations typically consume?
  2. Define priority levels for different context types
  3. Implement a sliding window for conversation history
  4. Build summarization for old conversation turns
  5. Set token budgets per context section
  6. Monitor: are important details being lost? Is output quality stable?
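Step 1 can start with a rough per-role usage report before any budgeting is designed. This sketch assumes OpenAI-style message dicts and the same ~4-characters-per-token heuristic; a real tokenizer (e.g. tiktoken) gives exact counts.

```python
def context_usage(messages: list[dict]) -> dict:
    """Tally approximate token usage per role and in total."""
    per_role: dict[str, int] = {}
    total = 0
    for m in messages:
        tokens = max(1, len(m["content"]) // 4)  # rough heuristic, not exact
        per_role[m["role"]] = per_role.get(m["role"], 0) + tokens
        total += tokens
    return {"total": total, "per_role": per_role}
```

Running this over logged conversations shows which roles (tool outputs, assistant turns) dominate, which tells you where budgeting will pay off most.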

Real-World Example

Long-Running Coding Agent

After 50 tool calls, the context is full. The system: keeps the system prompt and current task description (always), summarizes the first 40 tool calls into key findings, keeps the last 10 tool calls in full detail, and preserves the working memory scratchpad. The agent continues working without quality degradation.
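The compaction policy above can be sketched as a single function: always keep the system prompt and task, fold the oldest tool results into a summary, and keep the newest in full. The `summarize` callable is a placeholder for an LLM call, an assumption for illustration.

```python
def compact(system: dict, task: dict, tool_results: list[dict],
            keep_full: int = 10,
            summarize=lambda msgs: "key findings from earlier calls") -> list[dict]:
    """Rebuild the context: system + task always kept, old tool results
    summarized into one message, the last `keep_full` kept verbatim."""
    old, recent = tool_results[:-keep_full], tool_results[-keep_full:]
    compacted = [system, task]
    if old:
        # In practice `summarize(old)` would be an LLM summarization call.
        compacted.append({"role": "system",
                          "content": f"Earlier findings: {summarize(old)}"})
    compacted.extend(recent)
    return compacted
```

With 50 tool calls and `keep_full=10`, the result is 13 messages: system, task, one summary of the first 40 calls, and the last 10 in full.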

Python: Sliding Window with Summarization
from openai import OpenAI

client = OpenAI()

class SlidingWindowContext:
    def __init__(self, max_recent: int = 10, keep_after_compress: int = 5):
        self.max_recent = max_recent
        self.keep_after_compress = keep_after_compress
        self.messages: list[dict] = []
        self.summary = ""

    def add(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        if len(self.messages) > self.max_recent:
            self._compress()

    def _compress(self):
        # Fold everything except the most recent turns into the summary.
        old_messages = self.messages[: -self.keep_after_compress]
        history = "\n".join(f"{m['role']}: {m['content']}" for m in old_messages)
        # Carry the existing summary forward so earlier context isn't lost
        # on repeated compressions.
        if self.summary:
            history = f"Earlier summary: {self.summary}\n{history}"

        self.summary = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": f"Summarize this conversation concisely:\n{history}"}],
        ).choices[0].message.content

        self.messages = self.messages[-self.keep_after_compress :]

    def get_messages(self) -> list[dict]:
        # Prepend the rolling summary, then the verbatim recent turns.
        msgs = []
        if self.summary:
            msgs.append({"role": "system", "content": f"Previous context: {self.summary}"})
        msgs.extend(self.messages)
        return msgs

References