Evaluation Theater
The Anti-Pattern
Using LLM-as-judge without proper calibration, or relying on self-evaluation that is systematically overconfident.
Why It Happens
Teams use LLM evaluators as a shortcut for human evaluation, but LLMs have predictable biases: they favor verbose answers, agree with their own outputs, and can't reliably detect subtle factual errors. Self-evaluation is especially unreliable: a model that made an error in generation is unlikely to catch that same error when evaluating. The result is a quality metric that looks great on dashboards but doesn't correlate with actual user experience.
How to Fix It
Calibrate LLM judges against human ground truth before trusting them. Use multiple diverse evaluators through consensus voting, because different models catch different errors. Never rely solely on self-evaluation. Build evaluation rubrics with concrete, verifiable criteria rather than subjective quality judgments. The test: if your LLM evaluator agrees with human judges less than 80% of the time, it's not evaluating; it's rubber-stamping.
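The 80% test can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: it assumes the judge and the human reviewers each produce a pass/fail label per item, and the function names (`agreement_rate`, `false_pass_rate`, `is_calibrated`) are hypothetical.

```python
from typing import Sequence

AGREEMENT_THRESHOLD = 0.80  # the 80% bar stated in the text


def agreement_rate(judge_labels: Sequence[bool], human_labels: Sequence[bool]) -> float:
    """Fraction of items where the LLM judge and the human label match."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("judge and human label lists must be the same length")
    if not human_labels:
        raise ValueError("need at least one labeled item")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


def false_pass_rate(judge_labels: Sequence[bool], human_labels: Sequence[bool]) -> float:
    """Among items that humans failed, the fraction the judge passed anyway."""
    judged_on_human_fails = [j for j, h in zip(judge_labels, human_labels) if not h]
    if not judged_on_human_fails:
        return 0.0
    return sum(judged_on_human_fails) / len(judged_on_human_fails)


def is_calibrated(judge_labels: Sequence[bool], human_labels: Sequence[bool],
                  threshold: float = AGREEMENT_THRESHOLD) -> bool:
    """True when judge/human agreement meets the threshold."""
    return agreement_rate(judge_labels, human_labels) >= threshold


# Illustrative data only: a judge that passes everything, against mixed human labels.
human = [True, True, False, False, True, False, True, True, False, True]
rubber_stamp_judge = [True] * len(human)

print(agreement_rate(rubber_stamp_judge, human))   # agreement with humans
print(false_pass_rate(rubber_stamp_judge, human))  # bad outputs waved through
print(is_calibrated(rubber_stamp_judge, human))
```

Raw agreement can look acceptable when most outputs are genuinely good, so the sketch also reports how often the judge passes items that humans failed. That second number is where rubber-stamping tends to show up.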
Diagram
Evaluation Theater:                       Real Evaluation:
┌────────┐    ┌────────┐                  ┌────────┐    ┌─────────┐
│ Agent  │───▶│ Agent  │──▶ 'Great!'      │ Agent  │─┬─▶│ Judge A │──┐
│ output │    │ self-  │    (always)      │ output │ │  └─────────┘  │
└────────┘    │ eval   │                  └────────┘ │  ┌─────────┐  ├──▶ Consensus
              └────────┘                             └─▶│ Judge B │──┘
              ↑ same model,                             └─────────┘
                same blind spots                        Different models,
                                                        different blind spots
Symptoms
- Evaluation scores don't correlate with actual user satisfaction or task success
- Self-evaluation consistently rates output as high quality
- Known-bad test outputs receive passing scores from the evaluator
- Quality metrics look great but user complaints keep rising
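One way to probe for the third symptom is to feed the evaluator a small set of outputs already known to be bad and measure how many it passes. The sketch below assumes an evaluator can be wrapped as a function from output text to pass/fail; `known_bad_pass_rate`, the stand-in evaluators, and the sample outputs are all hypothetical.

```python
from typing import Callable, Iterable

Evaluator = Callable[[str], bool]  # assumed interface: output text -> pass/fail


def known_bad_pass_rate(evaluator: Evaluator, known_bad_outputs: Iterable[str]) -> float:
    """Fraction of known-bad outputs that the evaluator wrongly passes."""
    outputs = list(known_bad_outputs)
    if not outputs:
        raise ValueError("need at least one known-bad output")
    passed = sum(1 for output in outputs if evaluator(output))
    return passed / len(outputs)


# Stand-ins for real LLM judges, for illustration only.
def rubber_stamp(output: str) -> bool:
    """Passes everything, like an overconfident self-evaluation."""
    return True


def verbosity_biased(output: str) -> bool:
    """Passes anything long enough, mimicking a bias toward verbose answers."""
    return len(output.split()) >= 8


# Hand-picked outputs that a human has already marked as failures.
KNOWN_BAD = [
    "The capital of Australia is Sydney, which has hosted the federal parliament since 1901.",
    "",
    "I cannot answer that.",
]

print(known_bad_pass_rate(rubber_stamp, KNOWN_BAD))
print(known_bad_pass_rate(verbosity_biased, KNOWN_BAD))
```

Any pass rate above zero on this set is worth investigating, since every item in it is a failure by construction.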
False Positives
- Well-calibrated LLM judges that have been validated against human ground truth
- Preliminary automated screening that feeds into human-in-the-loop review
- Low-stakes evaluations where approximate quality signals are sufficient
Warning Signs & Consequences
Warning Signs
- Eval scores consistently above 90% on tasks where humans find 30%+ errors
- Self-reported confidence scores that are always high regardless of actual quality
- No ground truth comparison: eval metrics exist in isolation
- Quality dashboards that tell a different story than user feedback
Consequences
- False confidence in agent quality leading to premature deployment
- Shipping poor-quality outputs to users based on misleading metrics
- Missing critical errors that a calibrated evaluator would catch
- Wasted engineering time "optimizing" against a meaningless metric
Remediation Steps
1. Calibrate LLM judges against human ground truth on a representative test set
2. Use multiple diverse evaluators, since different models catch different failure modes
3. Never use self-evaluation as the sole quality signal
4. Build rubrics with concrete, verifiable criteria (not "is this good?")
5. Regularly audit eval accuracy: does the eval metric predict user satisfaction?
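Two of these steps, concrete rubrics and multiple evaluators, might be sketched as follows. This is an illustration under assumptions, not a reference design: the rubric criteria are invented examples, each judge is modeled as a plain function standing in for a call to a different model, and all names (`RUBRIC`, `score_with_rubric`, `consensus_verdict`) are hypothetical.

```python
from typing import Callable, Dict, List

Judge = Callable[[str], bool]  # assumed interface: output text -> pass/fail

# A rubric of concrete, checkable criteria instead of "is this good?".
# The criteria themselves are invented for illustration.
RUBRIC: Dict[str, Callable[[str], bool]] = {
    "cites_a_source": lambda text: "http://" in text or "https://" in text,
    "within_length_limit": lambda text: len(text.split()) <= 200,
    "no_placeholder_text": lambda text: "TODO" not in text
    and "lorem ipsum" not in text.lower(),
}


def score_with_rubric(text: str) -> Dict[str, bool]:
    """Evaluate each rubric criterion separately so failures are explainable."""
    return {name: check(text) for name, check in RUBRIC.items()}


def consensus_verdict(output: str, judges: List[Judge]) -> bool:
    """Pass only if a strict majority of judges pass the output."""
    if not judges:
        raise ValueError("need at least one judge")
    votes = [judge(output) for judge in judges]
    return sum(votes) > len(votes) / 2


# Stand-in judges; in practice each would wrap a different model.
def lenient_judge(text: str) -> bool:
    return True


def rubric_judge(text: str) -> bool:
    return all(score_with_rubric(text).values())


def non_empty_judge(text: str) -> bool:
    return bool(text.strip())


judges = [lenient_judge, rubric_judge, non_empty_judge]
print(score_with_rubric("TODO write this section"))
print(consensus_verdict("See https://example.com for the full data.", judges))
print(consensus_verdict("", judges))
```

Majority voting only helps when the judges fail in different ways, so three wrappers around the same model would reproduce the self-evaluation problem rather than solve it.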
Real-World Example
The 95% Quality Illusion
A content generation pipeline uses GPT-4 to evaluate its own GPT-4 outputs. The evaluator rates 95% of outputs as "high quality." The team celebrates and ships to production. User feedback surveys show 40% dissatisfaction. Investigation reveals the evaluator has the same blind spots as the generator: it can't detect the subtle factual errors and awkward phrasing that users notice immediately.