
Prompt Caching

Intermediate · Context Engineering · Anthropic / Industry practice

Intent

Reuse previously processed prompt prefixes across API calls to reduce latency and cost.

Problem

Agentic systems make many LLM calls, often with the same long system prompt, tool definitions, and context. Re-processing this shared prefix on every call wastes computation, increases latency, and drives up costs — especially when the shared prefix is thousands of tokens.

Solution

Structure your prompts so that the static parts (system prompt, tool definitions, few-shot examples, reference documents) come first, followed by dynamic parts (user input, conversation history). The LLM provider caches the processed prefix and reuses it on subsequent calls, only processing the new dynamic portion. This requires keeping the prefix exactly the same across calls — even minor changes invalidate the cache.
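As a sketch of this ordering, the request below puts the static system prompt and tool definitions first and appends dynamic turns last. It assumes the Anthropic Messages API shape with a `cache_control` breakpoint; the helper name and tool definitions are illustrative, not part of any library.

```python
# Hypothetical sketch: assemble a request with static content first,
# so the provider can cache everything up to the cache_control marker.

STATIC_SYSTEM = "You are a helpful coding agent. (long, unchanging instructions...)"

TOOLS = [  # stable tool definitions, identical across calls
    {"name": "read_file",
     "description": "Read a file from the workspace",
     "input_schema": {"type": "object", "properties": {}}},
]

def build_request(history: list, user_msg: str) -> dict:
    """Static prefix (system + tools) first; dynamic turns last."""
    return {
        "model": "claude-sonnet-4-20250514",
        "max_tokens": 1024,
        "tools": TOOLS,
        "system": [{
            "type": "text",
            "text": STATIC_SYSTEM,
            # Marks the end of the cacheable static prefix.
            "cache_control": {"type": "ephemeral"},
        }],
        "messages": history + [{"role": "user", "content": user_msg}],
    }
```

Because `history` and `user_msg` come last, earlier calls' cached prefix remains valid no matter how the conversation grows.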

Diagram

Call 1: [System Prompt + Tools + Examples | User Message 1]
         ←── cached prefix ──→  ← new content →
         (processed & cached)   (processed fresh)

Call 2: [System Prompt + Tools + Examples | User Message 2]
         ←── cache HIT ──→       ← new content →
         (reused, ~0 cost)       (processed fresh)

         Savings: 90%+ on prefix tokens across calls

When to Use

  • Agentic systems with long, repeated system prompts
  • When tool definitions are large and stable
  • Multi-turn conversations with growing history
  • Any high-volume system where the same prefix is used repeatedly

When NOT to Use

  • One-off calls where there's no prefix reuse
  • When the prompt changes significantly between calls
  • Low-volume systems where caching overhead exceeds savings

Pros & Cons

Pros

  • Major cost reduction (up to 90% on cached tokens)
  • Reduced latency for prefix processing
  • No quality impact — identical to uncached processing
  • Transparent: just restructure your prompts

Cons

  • Requires careful prompt structure (static before dynamic)
  • Any change to the prefix invalidates the cache
  • Cache expiration varies by provider
  • Not all providers support it

Implementation Steps

  1. Identify the static vs. dynamic parts of your prompts
  2. Restructure prompts: all static content first, dynamic content last
  3. Ensure exact prefix match across calls (watch for whitespace, ordering)
  4. Enable caching in your LLM provider's API settings
  5. Monitor cache hit rates to verify savings
  6. Design your system to maximize prefix reuse across calls
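Step 3 is the easiest to get wrong: a stray space or a reordered dict silently invalidates the cache. One defensive sketch (the function name is illustrative) is to fingerprint the canonicalized static prefix and alert if it drifts between calls:

```python
import hashlib
import json

def prefix_fingerprint(system_text: str, tools: list) -> str:
    """Hash the static prefix so accidental changes (whitespace edits,
    dict key ordering) that would invalidate the cache are caught early.

    Dict keys are sorted for a stable serialization; tool *list* order is
    preserved, since reordering tools genuinely changes the prefix."""
    canonical = json.dumps(
        {"system": system_text, "tools": tools},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Log the fingerprint alongside each call; if it changes mid-task, you have found the source of your cache misses.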

Real-World Example

Coding Agent with Large Context

A coding agent has a 4,000-token system prompt + 2,000 tokens of tool definitions + 10,000 tokens of codebase context. This 16K prefix is identical across the ~50 tool calls in a typical task. With prompt caching, the prefix is processed once and reused 49 times, saving ~784K tokens of processing.
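The arithmetic behind that savings figure, using the numbers from the example:

```python
# Token math for the coding-agent example above.
prefix_tokens = 4_000 + 2_000 + 10_000   # system + tools + codebase context = 16K
calls = 50

uncached = prefix_tokens * calls          # prefix reprocessed on every call
cached = prefix_tokens                    # prefix processed once, then reused
saved = uncached - cached                 # tokens of processing avoided

print(saved)  # 784_000
```

Note that cached tokens are typically not free: providers usually charge a reduced rate for cache reads (and some charge a premium for the initial cache write), so actual dollar savings are somewhat below the raw token count suggests.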

Python — Anthropic Prompt Caching
import anthropic

client = anthropic.Anthropic()

STYLE_GUIDE = """You are a code reviewer. Follow this style guide:
- Use TypeScript strict mode
- Prefer functional components over class components
- Maximum function length: 50 lines
- Always handle errors explicitly
- Use meaningful variable names
(... imagine 2000+ tokens of detailed guidelines ...)"""

def review_code(code: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system=[{
            "type": "text",
            "text": STYLE_GUIDE,
            "cache_control": {"type": "ephemeral"},  # Cache this prefix
        }],
        messages=[{"role": "user", "content": f"Review this code:\n\n{code}"}],
    )
    # First call: full processing (cache miss)
    # Subsequent calls: reuses cached prefix (~90% cost savings)
    return response.content[0].text
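To verify the cache is actually being hit (step 5 above), inspect the usage metadata on each response. The sketch below assumes Anthropic-style usage fields (`cache_read_input_tokens`, `cache_creation_input_tokens`, `input_tokens`); other providers expose different names.

```python
def cache_hit_ratio(usage) -> float:
    """Fraction of input tokens served from cache (0.0 on a full miss).

    Assumes Anthropic-style usage fields; `getattr` defaults keep this
    safe on responses that omit the cache counters."""
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    fresh = getattr(usage, "input_tokens", 0) or 0
    total = read + written + fresh
    return read / total if total else 0.0
```

A ratio near zero across repeated calls means your prefix is being invalidated somewhere; check for nondeterministic content (timestamps, random ordering) in the static portion.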
