Prompt Caching
Intent
Reuse previously processed prompt prefixes across API calls to reduce latency and cost.
Problem
Agentic systems make many LLM calls, often with the same long system prompt, tool definitions, and context. Re-processing this shared prefix on every call wastes computation, increases latency, and drives up costs — especially when the shared prefix is thousands of tokens.
Solution
Structure your prompts so that the static parts (system prompt, tool definitions, few-shot examples, reference documents) come first, followed by dynamic parts (user input, conversation history). The LLM provider caches the processed prefix and reuses it on subsequent calls, only processing the new dynamic portion. This requires keeping the prefix exactly the same across calls — even minor changes invalidate the cache.
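The static-first layout can be sketched in a few lines. The prompt parts below (SYSTEM_PROMPT, TOOL_DEFS, EXAMPLES) are hypothetical placeholders; the point is that the prefix is assembled once, in a fixed order, and only the suffix varies per call.

```python
# Cache-friendly prompt layout: static blocks first, dynamic input last.
SYSTEM_PROMPT = "You are a helpful assistant..."  # static, placeholder
TOOL_DEFS = "Tool: search(query) -> results..."   # static, placeholder
EXAMPLES = "Example 1: ..."                       # static, placeholder

# Built once, in a fixed order; any change here invalidates the cache.
STATIC_PREFIX = "\n\n".join([SYSTEM_PROMPT, TOOL_DEFS, EXAMPLES])

def build_prompt(user_message: str) -> str:
    # Only the suffix changes between calls, so the provider can
    # cache everything before it.
    return f"{STATIC_PREFIX}\n\n{user_message}"
```

Because the prefix is byte-identical across calls, every call after the first can hit the cache.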
Diagram
Call 1: [System Prompt + Tools + Examples | User Message 1]
←── cached prefix ──→ ← new content →
(processed & cached) (processed fresh)
Call 2: [System Prompt + Tools + Examples | User Message 2]
←── cache HIT ──→ ← new content →
(reused, ~0 cost) (processed fresh)
Savings: 90%+ on prefix tokens across calls
When to Use
- Agentic systems with long, repeated system prompts
- When tool definitions are large and stable
- Multi-turn conversations with growing history
- Any high-volume system where the same prefix is used repeatedly
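For the multi-turn case, the growing conversation history itself can become part of the cached prefix by moving a cache marker to the latest message before each call. A minimal sketch, assuming Anthropic-style "cache_control" content blocks (other providers mark cache boundaries differently):

```python
import copy

def mark_cache_breakpoint(messages: list[dict]) -> list[dict]:
    """Attach a cache marker to the last message so the whole
    conversation so far is treated as the cached prefix next call."""
    # Deep-copy so the caller's history is left untouched.
    marked = copy.deepcopy(messages)
    last = marked[-1]
    # Wrap plain-string content in a content block that can carry the marker.
    if isinstance(last["content"], str):
        last["content"] = [{"type": "text", "text": last["content"]}]
    last["content"][-1]["cache_control"] = {"type": "ephemeral"}
    return marked

history = [
    {"role": "user", "content": "Refactor utils.py"},
    {"role": "assistant", "content": "Done. Anything else?"},
]
cached = mark_cache_breakpoint(history)
```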
When NOT to Use
- One-off calls where there's no prefix reuse
- When the prompt changes significantly between calls
- Low-volume systems where caching overhead exceeds savings
Pros & Cons
Pros
- Major cost reduction (up to 90% on cached tokens)
- Reduced latency for prefix processing
- No quality impact — identical to uncached processing
- Transparent: just restructure your prompts
Cons
- Requires careful prompt structure (static before dynamic)
- Any change to the prefix invalidates the cache
- Cache expiration varies by provider
- Not all providers support it
Implementation Steps
1. Identify the static vs. dynamic parts of your prompts
2. Restructure prompts: all static content first, dynamic content last
3. Ensure an exact prefix match across calls (watch for whitespace, ordering)
4. Enable caching in your LLM provider's API settings
5. Monitor cache hit rates to verify savings
6. Design your system to maximize prefix reuse across calls
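Step 5 can be as simple as reading the usage metadata the provider returns with each response. A minimal sketch, assuming Anthropic-style usage fields (input_tokens, cache_creation_input_tokens, cache_read_input_tokens); field names vary by provider:

```python
def cache_hit_rate(usage: dict) -> float:
    """Fraction of input tokens served from the prompt cache.
    Field names follow Anthropic's usage object; adjust per provider."""
    cached = usage.get("cache_read_input_tokens", 0)
    fresh = usage.get("input_tokens", 0) + usage.get("cache_creation_input_tokens", 0)
    total = cached + fresh
    return cached / total if total else 0.0

# Call 1: the prefix is written to the cache, nothing is read from it.
first = {"input_tokens": 200, "cache_creation_input_tokens": 16_000}
# Call 2: the 16K-token prefix is read back from the cache.
second = {"input_tokens": 210, "cache_read_input_tokens": 16_000}
```

A hit rate stuck near zero usually means the prefix is not byte-identical across calls.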
Real-World Example
Coding Agent with Large Context
A coding agent has a 4,000-token system prompt + 2,000 tokens of tool definitions + 10,000 tokens of codebase context. This 16K prefix is identical across the ~50 tool calls in a typical task. With prompt caching, the prefix is processed once and reused 49 times, saving ~784K tokens of processing.
import anthropic

client = anthropic.Anthropic()

STYLE_GUIDE = """You are a code reviewer. Follow this style guide:
- Use TypeScript strict mode
- Prefer functional components over class components
- Maximum function length: 50 lines
- Always handle errors explicitly
- Use meaningful variable names
(... imagine 2000+ tokens of detailed guidelines ...)"""

def review_code(code: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system=[{
            "type": "text",
            "text": STYLE_GUIDE,
            "cache_control": {"type": "ephemeral"},  # Cache this prefix
        }],
        messages=[{"role": "user", "content": f"Review this code:\n\n{code}"}],
    )
    # First call: full processing (cache miss)
    # Subsequent calls: reuse the cached prefix (~90% cost savings on it)
    return response.content[0].text