
Context Window Management

Intermediate · Context Engineering · Anthropic (2025)

Intent

Strategically curate what goes into the LLM's context window to maximize output quality within token limits.

Problem

Context windows are large but not infinite. In long conversations or complex tasks, you'll exceed the limit. But even before hitting the limit, cramming everything in degrades quality — the model's attention gets diluted. Not all context is equally important, and the model can't tell you what it needs.

Solution

Treat context as a precious, finite resource to be actively managed. Strategies include:

  • Summarization: compress old conversation turns into concise summaries
  • Sliding window: keep only the N most recent turns plus a summary of earlier ones
  • Priority injection: always include high-priority context (system prompt, key facts) first
  • Token budgeting: allocate token budgets to different context sections
  • Auto-compaction: automatically compress when approaching limits
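Token budgeting and priority injection can be combined into a simple allocator: fill the budget in priority order, trimming only the lowest-priority sections. This is a minimal sketch; the section names, priority scheme, and the ~4-characters-per-token estimate are illustrative assumptions, not part of any particular API.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    # A real tokenizer gives exact counts; this is only a sketch.
    return max(1, len(text) // 4)

def fit_to_budget(sections: list[tuple[str, str, int]], budget: int) -> dict[str, str]:
    """sections: (name, text, priority) — lower number = higher priority, kept first."""
    result: dict[str, str] = {}
    remaining = budget
    for name, text, _prio in sorted(sections, key=lambda s: s[2]):
        cost = estimate_tokens(text)
        if cost <= remaining:
            result[name] = text
            remaining -= cost
        else:
            # Crudely truncate to the leftover budget; a real system
            # would summarize this section instead of cutting it.
            result[name] = text[: remaining * 4]
            remaining = 0
    return result
```

High-priority sections (system prompt, key facts) are admitted whole; only the history section, visited last, gets trimmed when the budget runs out.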

Diagram

Available Context Window: [████████████████████████]

Allocation:
[System Prompt ███] [Key Facts ██] [Recent History ████████] [Tools ██] [Buffer ██]
  (always kept)     (always kept)  (sliding window)          (dynamic)  (for response)

As conversation grows → older history summarized → key info preserved

When to Use

  • Any agent with conversations or tasks that exceed context limits
  • Long-running agents that accumulate tool outputs over time
  • Multi-turn conversations where early context matters
  • Systems where context quality directly impacts output quality

When NOT to Use

  • Short, single-turn interactions that fit easily in context
  • When you can simply use a model with a larger context window

Pros & Cons

Pros

  • Maintains output quality in long interactions
  • Prevents context overflow errors
  • Forces intentional curation of what the model sees
  • Can reduce costs by using smaller context windows effectively

Cons

  • Summarization can lose important details
  • Complexity of managing context budgets
  • Different strategies work for different use cases — no one-size-fits-all
  • Context management itself consumes tokens

Implementation Steps

  1. Measure your context usage: how much do conversations typically consume?
  2. Define priority levels for different context types
  3. Implement a sliding window for conversation history
  4. Build summarization for old conversation turns
  5. Set token budgets per context section
  6. Monitor: are important details being lost? Is output quality stable?
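Step 1 can start with a rough per-role usage report before any budgeting is designed. This sketch assumes OpenAI-style message dicts and the same ~4-characters-per-token heuristic; a real tokenizer (e.g. tiktoken) gives exact counts.

```python
def context_usage(messages: list[dict]) -> dict:
    """Tally approximate token usage per role and in total."""
    per_role: dict[str, int] = {}
    total = 0
    for m in messages:
        tokens = max(1, len(m["content"]) // 4)  # rough heuristic, not exact
        per_role[m["role"]] = per_role.get(m["role"], 0) + tokens
        total += tokens
    return {"total": total, "per_role": per_role}
```

Running this over logged conversations shows which roles (tool outputs, assistant turns) dominate, which tells you where budgeting will pay off most.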

Real-World Example

Long-Running Coding Agent

After 50 tool calls, the context is full. The system: keeps the system prompt and current task description (always), summarizes the first 40 tool calls into key findings, keeps the last 10 tool calls in full detail, and preserves the working memory scratchpad. The agent continues working without quality degradation.
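The compaction policy above can be sketched as a single function: always keep the system prompt and task, fold the oldest tool results into a summary, and keep the newest in full. The `summarize` callable is a placeholder for an LLM call, an assumption for illustration.

```python
def compact(system: dict, task: dict, tool_results: list[dict],
            keep_full: int = 10,
            summarize=lambda msgs: "key findings from earlier calls") -> list[dict]:
    """Rebuild the context: system + task always kept, old tool results
    summarized into one message, the last `keep_full` kept verbatim."""
    old, recent = tool_results[:-keep_full], tool_results[-keep_full:]
    compacted = [system, task]
    if old:
        # In practice `summarize(old)` would be an LLM summarization call.
        compacted.append({"role": "system",
                          "content": f"Earlier findings: {summarize(old)}"})
    compacted.extend(recent)
    return compacted
```

With 50 tool calls and `keep_full=10`, the result is 13 messages: system, task, one summary of the first 40 calls, and the last 10 in full.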

Python: Sliding Window with Summarization
from openai import OpenAI

client = OpenAI()

class SlidingWindowContext:
    def __init__(self, max_recent: int = 10, keep_after_compress: int = 5):
        self.max_recent = max_recent
        self.keep_after_compress = keep_after_compress
        self.messages: list[dict] = []
        self.summary = ""

    def add(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        if len(self.messages) > self.max_recent:
            self._compress()

    def _compress(self):
        # Fold everything except the most recent turns into the summary.
        old_messages = self.messages[: -self.keep_after_compress]
        history = "\n".join(f"{m['role']}: {m['content']}" for m in old_messages)
        # Carry the existing summary forward so earlier context isn't lost
        # on repeated compressions.
        if self.summary:
            history = f"Earlier summary: {self.summary}\n{history}"

        self.summary = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": f"Summarize this conversation concisely:\n{history}"}],
        ).choices[0].message.content

        self.messages = self.messages[-self.keep_after_compress :]

    def get_messages(self) -> list[dict]:
        # Prepend the rolling summary, then the verbatim recent turns.
        msgs = []
        if self.summary:
            msgs.append({"role": "system", "content": f"Previous context: {self.summary}"})
        msgs.extend(self.messages)
        return msgs

References