Token Gluttony
The Anti-Pattern
Architectural choices that waste tokens systematically — verbose system prompts, full tool outputs dumped into context, unnecessary chain-of-thought on trivial tasks.
Why It Happens
Developers dump everything into context ‘just in case.’ Full API responses with 90% irrelevant fields, verbose system prompts repeated on every call, chain-of-thought reasoning forced on simple lookups. Token costs scale linearly, but value doesn’t — past a certain point, more tokens means more noise, not more signal. The worst part is that token gluttony often masquerades as thoroughness.
How to Fix It
Audit token usage per component and trim ruthlessly. Use prompt caching for static prefixes that don’t change between calls. Reserve chain-of-thought for tasks where reasoning genuinely improves accuracy. Apply structured output schemas to minimize response tokens. Trim tool outputs to only the fields the agent actually needs. The principle is simple: every token in context should earn its place. If you can’t explain why a token is there, it shouldn’t be.
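The trimming step above can be sketched as a simple whitelist projection applied to tool output before it enters context. The response shape and field names below are hypothetical:

```python
# Sketch: project a full tool/API response down to only the fields the
# agent actually needs before it enters the model's context.

def trim_response(response: dict, keep: list[str]) -> dict:
    """Keep only whitelisted top-level fields; drop everything else."""
    return {k: response[k] for k in keep if k in response}

# Hypothetical full API response, mostly irrelevant to the agent's task.
full = {
    "id": "ord_123",
    "status": "shipped",
    "total": 42.50,
    "internal_audit_log": ["..."] * 50,     # never used by the agent
    "raw_carrier_payload": {"...": "..."},  # never used by the agent
}

trimmed = trim_response(full, keep=["id", "status", "total"])
# Only the three whitelisted fields reach the context, not the whole payload.
```

The same idea generalizes to nested responses: define the projection once, next to the tool definition, so every call pays the trimming cost instead of the context paying the token cost.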
Diagram
Token Glutton:              Optimized:
┌──────────────────────┐    ┌──────────────────────┐
│██████████████████████│    │██░░░░░░░░░░░░░░░░░░░░│
│████ SYSTEM PROMPT ███│    │SP│                   │
│██████████████████████│    │░░│  Available for    │
│████ FULL API RESP ███│    │░░│  actual work      │
│██████████████████████│    │░░│                   │
│█ CoT on trivial task │    │░░░░░░░░░░░░░░░░░░░░░░│
│█████ task ███████████│    │░░░░░░░░ task ████████│
└──────────────────────┘    └──────────────────────┘
 80% waste, 20% task         15% overhead, 85% task
Symptoms
- Token costs are high relative to task complexity
- System prompts are thousands of tokens with instructions the agent rarely uses
- Full JSON API responses are dumped into context when only 2-3 fields matter
- Chain-of-thought is forced on every task regardless of difficulty
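One rough way to detect the full-JSON symptom is to measure what fraction of a tool response's fields the agent's later messages ever reference. This is a toy heuristic; the 50% threshold and the field names are illustrative:

```python
# Sketch: diagnostic for the "full JSON dumped into context" symptom.
# Counts what fraction of a tool response's top-level fields ever appear
# in the agent's subsequent transcript.

def field_usage_ratio(response: dict, transcript: str) -> float:
    fields = list(response.keys())
    used = [f for f in fields if f in transcript]
    return len(used) / len(fields) if fields else 1.0

response = {"id": 1, "status": "ok", "total": 9.99, "meta": {}, "debug": {}}
transcript = "The order status is ok and the total is 9.99."

ratio = field_usage_ratio(response, transcript)
if ratio < 0.5:  # illustrative threshold
    print(f"Only {ratio:.0%} of fields used; consider trimming the schema")
```

A persistently low ratio across production traces is strong evidence that the tool's output schema, not the agent, is where the tokens are going.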
False Positives
- Complex tasks that genuinely require rich context to perform well
- Research tasks where broad context demonstrably improves output quality
- Early prototyping where optimization is premature
Warning Signs & Consequences
Warning Signs
- Token costs growing faster than the value delivered by the agent
- Latency disproportionate to the difficulty of the task
- Budget alerts or unexpectedly high API bills
- Context window filling up and truncating actually important information
Consequences
- Unnecessary API cost that scales with every single request
- Slower response times from processing irrelevant tokens
- Reduced effective context window for the actual task at hand
- Masking real performance issues behind a wall of unnecessary processing
Remediation Steps
1. Audit token usage: measure tokens per component (system prompt, tools, history, task)
2. Trim tool output schemas to include only fields the agent actually uses
3. Implement prompt caching for static system prompt prefixes
4. Use chain-of-thought selectively — only for tasks where it measurably helps
5. Set token budgets per section and alert when thresholds are exceeded
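Steps 1 and 5 can be sketched together as a per-component audit with budget alerts. The token counter below is a crude word-count estimate assumed for illustration; a real audit would use the provider's tokenizer (e.g. tiktoken for OpenAI models):

```python
# Sketch: audit tokens per prompt component and flag budget overruns.
# Budgets and section contents are illustrative placeholders.

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~1.3 tokens per word. Swap in a real tokenizer.
    return int(len(text.split()) * 1.3)

budgets = {"system": 500, "tools": 1000, "history": 2000, "task": 1000}
sections = {
    "system": "You are a helpful assistant ...",
    "tools": '{"name": "search", "description": "..."}',
    "history": "user: hi\nassistant: hello",
    "task": "Summarize the Q3 report.",
}

for name, text in sections.items():
    used = estimate_tokens(text)
    status = "OVER BUDGET" if used > budgets[name] else "ok"
    print(f"{name:8s} {used:5d} / {budgets[name]:5d}  {status}")
```

Running this per request (or sampled) turns "token costs feel high" into a concrete breakdown you can trim against.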
Real-World Example
Expensive RAG Queries
A RAG application retrieves 10 full documents (50K tokens) to answer a simple factual question that only needed one paragraph. Each query costs $0.50 instead of $0.02. At 10K queries per day, the team is spending $5,000/day instead of $200/day — a 25x cost multiplier for identical answer quality. The fix was trimming retrieved chunks and limiting to the 2 most relevant passages.
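The fix described above can be sketched as a rerank-and-truncate step: score retrieved passages against the query and keep only the top two. The word-overlap scorer here is a stand-in for the retriever's real similarity scores or a proper reranker:

```python
# Sketch: limit RAG context to the k most relevant passages instead of
# dumping every retrieved document. Scoring is a toy word-overlap measure.

def _words(text: str) -> set[str]:
    return {w.strip(".,?!").lower() for w in text.split()}

def score(query: str, passage: str) -> float:
    q, p = _words(query), _words(passage)
    return len(q & p) / len(q) if q else 0.0

def top_passages(query: str, passages: list[str], k: int = 2) -> list[str]:
    return sorted(passages, key=lambda p: score(query, p), reverse=True)[:k]

passages = [
    "The Eiffel Tower is 330 metres tall.",
    "Quarterly revenue grew by 12 percent.",
    "Paris is the capital of France and home to the Eiffel Tower.",
]
context = top_passages("How tall is the Eiffel Tower?", passages, k=2)
# Only the 2 most relevant passages enter the prompt; the irrelevant
# revenue passage is dropped.
```

Even this naive cutoff captures the cost structure of the example: context size is bounded by k, not by how much the retriever happened to return.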