Prompt Caching
Intent
Reuse previously processed prompt prefixes across API calls to reduce latency and cost.
Problem
Agentic systems make many LLM calls, often with the same long system prompt, tool definitions, and context. Re-processing this shared prefix on every call wastes computation, increases latency, and drives up costs — especially when the shared prefix is thousands of tokens.
Solution
Structure your prompts so that the static parts (system prompt, tool definitions, few-shot examples, reference documents) come first, followed by dynamic parts (user input, conversation history). The LLM provider caches the processed prefix and reuses it on subsequent calls, only processing the new dynamic portion. This requires keeping the prefix exactly the same across calls — even minor changes invalidate the cache.
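The static-first layout can be sketched in a few lines. The prompt parts below (SYSTEM_PROMPT, TOOL_DEFS, EXAMPLES) are hypothetical placeholders; the point is that the prefix is assembled once, in a fixed order, and only the suffix varies per call.

```python
# Cache-friendly prompt layout: static blocks first, dynamic input last.
SYSTEM_PROMPT = "You are a helpful assistant..."  # static, placeholder
TOOL_DEFS = "Tool: search(query) -> results..."   # static, placeholder
EXAMPLES = "Example 1: ..."                       # static, placeholder

# Built once, in a fixed order; any change here invalidates the cache.
STATIC_PREFIX = "\n\n".join([SYSTEM_PROMPT, TOOL_DEFS, EXAMPLES])

def build_prompt(user_message: str) -> str:
    # Only the suffix changes between calls, so the provider can
    # cache everything before it.
    return f"{STATIC_PREFIX}\n\n{user_message}"
```

Because the prefix is byte-identical across calls, every call after the first can hit the cache.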
Diagram
Call 1: [System Prompt + Tools + Examples | User Message 1]
←── cached prefix ──→ ← new content →
(processed & cached) (processed fresh)
Call 2: [System Prompt + Tools + Examples | User Message 2]
←── cache HIT ──→ ← new content →
(reused, ~0 cost) (processed fresh)
Savings: 90%+ on prefix tokens across calls
When to Use
- Agentic systems with long, repeated system prompts
- When tool definitions are large and stable
- Multi-turn conversations with growing history
- Any high-volume system where the same prefix is used repeatedly
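For the multi-turn case, the growing conversation history itself can become part of the cached prefix by moving a cache marker to the latest message before each call. A minimal sketch, assuming Anthropic-style "cache_control" content blocks (other providers mark cache boundaries differently):

```python
import copy

def mark_cache_breakpoint(messages: list[dict]) -> list[dict]:
    """Attach a cache marker to the last message so the whole
    conversation so far is treated as the cached prefix next call."""
    # Deep-copy so the caller's history is left untouched.
    marked = copy.deepcopy(messages)
    last = marked[-1]
    # Wrap plain-string content in a content block that can carry the marker.
    if isinstance(last["content"], str):
        last["content"] = [{"type": "text", "text": last["content"]}]
    last["content"][-1]["cache_control"] = {"type": "ephemeral"}
    return marked

history = [
    {"role": "user", "content": "Refactor utils.py"},
    {"role": "assistant", "content": "Done. Anything else?"},
]
cached = mark_cache_breakpoint(history)
```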
When NOT to Use
- One-off calls where there's no prefix reuse
- When the prompt changes significantly between calls
- Low-volume systems where caching overhead exceeds savings
Pros & Cons
Pros
- Major cost reduction (up to 90% on cached tokens)
- Reduced latency for prefix processing
- No quality impact — identical to uncached processing
- Transparent: just restructure your prompts
Cons
- Requires careful prompt structure (static before dynamic)
- Any change to the prefix invalidates the cache
- Cache expiration varies by provider
- Not all providers support it
Implementation Steps
1. Identify the static vs. dynamic parts of your prompts
2. Restructure prompts: all static content first, dynamic content last
3. Ensure an exact prefix match across calls (watch for whitespace, ordering)
4. Enable caching in your LLM provider's API settings
5. Monitor cache hit rates to verify savings
6. Design your system to maximize prefix reuse across calls
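Step 5 can be as simple as reading the usage metadata the provider returns with each response. A minimal sketch, assuming Anthropic-style usage fields (input_tokens, cache_creation_input_tokens, cache_read_input_tokens); field names vary by provider:

```python
def cache_hit_rate(usage: dict) -> float:
    """Fraction of input tokens served from the prompt cache.
    Field names follow Anthropic's usage object; adjust per provider."""
    cached = usage.get("cache_read_input_tokens", 0)
    fresh = usage.get("input_tokens", 0) + usage.get("cache_creation_input_tokens", 0)
    total = cached + fresh
    return cached / total if total else 0.0

# Call 1: the prefix is written to the cache, nothing is read from it.
first = {"input_tokens": 200, "cache_creation_input_tokens": 16_000}
# Call 2: the 16K-token prefix is read back from the cache.
second = {"input_tokens": 210, "cache_read_input_tokens": 16_000}
```

A hit rate stuck near zero usually means the prefix is not byte-identical across calls.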
Real-World Example
Coding Agent with Large Context
A coding agent has a 4,000-token system prompt + 2,000 tokens of tool definitions + 10,000 tokens of codebase context. This 16K prefix is identical across the ~50 tool calls in a typical task. With prompt caching, the prefix is processed once and reused 49 times, saving ~784K tokens of processing.
import anthropic

client = anthropic.Anthropic()

STYLE_GUIDE = """You are a code reviewer. Follow this style guide:
- Use TypeScript strict mode
- Prefer functional components over class components
- Maximum function length: 50 lines
- Always handle errors explicitly
- Use meaningful variable names
(... imagine 2000+ tokens of detailed guidelines ...)"""

def review_code(code: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system=[{
            "type": "text",
            "text": STYLE_GUIDE,
            "cache_control": {"type": "ephemeral"},  # Cache this prefix
        }],
        messages=[{"role": "user", "content": f"Review this code:\n\n{code}"}],
    )
    # First call: full processing (cache miss)
    # Subsequent calls: reuse the cached prefix (~90% cost savings on it)
    return response.content[0].text