Evaluation Theater

Intermediate | 🚫 Anti-Pattern | 🤯 Anti-Patterns: Reasoning | Industry observation

🚫 Anti-Pattern: This describes a common mistake to avoid, not a pattern to follow.

The Anti-Pattern

Using an LLM as a judge without calibrating it against human judgments, or relying on self-evaluation, which is systematically overconfident.

Why It Happens

Teams use LLM evaluators as a shortcut for human evaluation, but LLMs have predictable biases: they favor verbose answers, agree with their own outputs, and can’t reliably detect subtle factual errors. Self-evaluation is especially unreliable — a model that made an error in generation is unlikely to catch that same error when evaluating. The result is a quality metric that looks great on dashboards but doesn’t correlate with actual user experience.

How to Fix It

Calibrate LLM judges against human ground truth before trusting them. Use multiple diverse evaluators through consensus voting, because different models catch different errors. Never rely solely on self-evaluation. Build evaluation rubrics with concrete, verifiable criteria rather than subjective quality judgments. The test: if your LLM evaluator agrees with human judges less than 80% of the time on a representative, human-labeled sample, it is not evaluating; it is rubber-stamping.
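
The agreement test can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the function names `agreement_rate` and `cohens_kappa` and the sample labels are made up for the example. It also adds Cohen's kappa, a chance-corrected statistic that the text above does not mention, because raw agreement can look healthy when most outputs are acceptable and the judge simply passes everything.

```python
from collections import Counter

def agreement_rate(judge, human):
    """Fraction of items where the LLM judge and the human gave the same label."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)

def cohens_kappa(judge, human):
    """Agreement corrected for the agreement expected by chance alone."""
    n = len(judge)
    observed = agreement_rate(judge, human)
    judge_counts, human_counts = Counter(judge), Counter(human)
    expected = sum(judge_counts[label] * human_counts[label]
                   for label in judge_counts) / (n * n)
    if expected == 1.0:
        return 0.0  # kappa is undefined here; 0.0 is a convention for this sketch
    return (observed - expected) / (1 - expected)

# Hypothetical sample: humans fail 1 output in 10, the judge passes all 10.
human = ["pass"] * 9 + ["fail"]
judge = ["pass"] * 10

raw = agreement_rate(judge, human)   # 0.9, which clears the 80% bar
kappa = cohens_kappa(judge, human)   # 0.0, no better than chance
print(f"raw agreement={raw:.2f}, kappa={kappa:.2f}")
```

In this made-up sample the rubber-stamping judge clears the 80% bar on raw agreement but scores zero once chance agreement is removed, so it may be worth reporting both numbers and including enough known failures in the calibration set.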

Diagram

  Evaluation Theater:                   Real Evaluation:
  ┌────────┐    ┌────────┐              ┌────────┐    ┌─────────┐
  │ Agent  │───▶│ Agent  │──▶ 'Great!'  │ Agent  │─┬─▶│ Judge A │──┐
  │ output │    │ self-  │    (always)  │ output │ │  └─────────┘  │
  └────────┘    │ eval   │              └────────┘ │  ┌─────────┐  ├──▶ Consensus
                └────────┘                         └─▶│ Judge B │──┘
                 ↑ same model,                        └─────────┘
                   same blind spots                   Different models,
                                                      different blind spots

Symptoms

Evaluation scores don’t correlate with actual user satisfaction or task success
Self-evaluation consistently rates output as high quality
Known-bad test outputs receive passing scores from the evaluator
Quality metrics look great but user complaints keep rising
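
The third symptom is the easiest to test directly: feed the evaluator outputs that are known to be bad and count how many it fails. The sketch below assumes a judge exposed as a function that returns True for a pass. `known_bad_catch_rate` and `verbose_biased_judge` are hypothetical names, and the stand-in judge only imitates a verbosity bias; a real one would call a model.

```python
from typing import Callable, Iterable

def known_bad_catch_rate(judge: Callable[[str], bool],
                         known_bad: Iterable[str]) -> float:
    """Fraction of deliberately bad outputs that the judge correctly fails.

    `judge` returns True when it passes an output.
    """
    items = list(known_bad)
    if not items:
        raise ValueError("need at least one known-bad example")
    caught = sum(1 for text in items if not judge(text))
    return caught / len(items)

# Stand-in judge for illustration only: passes anything longer than
# 20 characters, imitating a bias toward verbose answers.
def verbose_biased_judge(text: str) -> bool:
    return len(text) > 20

known_bad = [
    "The Eiffel Tower is located in Berlin, Germany.",  # factual error, fluent
    "Water boils at 50 degrees Celsius at sea level.",  # factual error, fluent
    "idk",                                              # unhelpful, short
]

rate = known_bad_catch_rate(verbose_biased_judge, known_bad)
print(f"caught {rate:.0%} of known-bad outputs")
```

Here the stand-in judge catches only the short answer and passes both fluent factual errors, which is the pattern this symptom describes.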

False Positives

Well-calibrated LLM judges that have been validated against human ground truth
Preliminary automated screening that feeds into human-in-the-loop review
Low-stakes evaluations where approximate quality signals are sufficient

Warning Signs & Consequences

Warning Signs

Eval scores consistently above 90% on tasks where humans find 30%+ errors
Self-reported confidence scores that are always high regardless of actual quality
No ground truth comparison — eval metrics exist in isolation
Quality dashboards that tell a different story than user feedback
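
Two of these warning signs can be checked mechanically once a sample of outputs has both an evaluator score and a human review. The following is a rough sketch under stated assumptions: scores are on a 0 to 1 scale, the 90% and 30% thresholds come from the first warning sign above, and the `min_spread` cutoff for "scores barely vary" is an arbitrary illustrative value. The function name `warning_signs` and the sample data are hypothetical.

```python
from statistics import mean, pstdev

def warning_signs(eval_scores, human_found_error, min_spread=0.05):
    """Return a list of warning messages for a reviewed sample.

    eval_scores: evaluator scores in [0, 1], one per output.
    human_found_error: parallel booleans, True where a human found an error.
    min_spread: assumed cutoff below which scores count as nearly constant.
    """
    signs = []
    avg_score = mean(eval_scores)
    error_rate = mean(1.0 if flagged else 0.0 for flagged in human_found_error)
    if avg_score > 0.90 and error_rate >= 0.30:
        signs.append("scores above 90% while humans find 30%+ errors")
    if pstdev(eval_scores) < min_spread:
        signs.append("scores barely vary across outputs")
    return signs

# Hypothetical sample: uniformly high scores, humans flag 4 of 10 outputs.
scores = [0.95, 0.96, 0.94, 0.95, 0.97, 0.95, 0.96, 0.94, 0.95, 0.96]
errors = [True, False, True, False, False, True, False, False, True, False]

signs = warning_signs(scores, errors)
for sign in signs:
    print("WARNING:", sign)
```

Both warnings fire on this sample. A check like this depends on having the human review column at all, which is the point of the third warning sign.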

Consequences

False confidence in agent quality leading to premature deployment
Shipping poor-quality outputs to users based on misleading metrics
Missing critical errors that a calibrated evaluator would catch
Wasted engineering time ‘optimizing’ against a meaningless metric

Remediation Steps

1. Calibrate LLM judges against human ground truth on a representative test set
2. Use multiple diverse evaluators; different models catch different failure modes
3. Never use self-evaluation as the sole quality signal
4. Build rubrics with concrete, verifiable criteria (not ‘is this good?’)
5. Regularly audit eval accuracy: does the eval metric predict user satisfaction?
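
Steps 2 and 4 can be sketched together. This is one possible shape, not a reference implementation: the rubric criteria, the `needs_human` fallback label, and the names `majority_verdict`, `RUBRIC`, and `rubric_report` are all assumptions made for illustration. The rubric checks are deliberately simple yes/no questions, and the vote escalates to a human when no strict majority exists.

```python
from collections import Counter

def majority_verdict(verdicts):
    """Return the label chosen by a strict majority of judges.

    Falls back to 'needs_human' when no label has a strict majority.
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    label, count = Counter(verdicts).most_common(1)[0]
    return label if count > len(verdicts) / 2 else "needs_human"

# Hypothetical rubric: each criterion is a concrete, checkable question
# rather than a subjective "is this good?" judgment.
RUBRIC = {
    "cites_a_source": lambda text: "http://" in text or "https://" in text,
    "within_length_limit": lambda text: len(text.split()) <= 150,
    "no_placeholder_text": lambda text: "TODO" not in text
                                        and "lorem ipsum" not in text.lower(),
}

def rubric_report(text):
    """Evaluate every rubric criterion and report each result separately."""
    return {name: check(text) for name, check in RUBRIC.items()}

report = rubric_report("See https://example.com for the changelog. TODO: add summary")
verdict_split = majority_verdict(["pass", "fail"])          # no majority
verdict_clear = majority_verdict(["pass", "fail", "fail"])  # two of three fail
print(report, verdict_split, verdict_clear)
```

Reporting each criterion separately, rather than collapsing to one score, makes it easier to see which check an output failed and to compare that against human review.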

Real-World Example

The 95% Quality Illusion

A content generation pipeline uses GPT-4 to evaluate its own GPT-4 outputs. The evaluator rates 95% of outputs as ‘high quality.’ The team celebrates and ships to production. User feedback surveys show 40% dissatisfaction. Investigation reveals the evaluator has the same blind spots as the generator — it can’t detect the subtle factual errors and awkward phrasing that users notice immediately.
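
The two figures in this example put a floor on how often the evaluator was wrong. As a back-of-the-envelope sketch, assume the 40% survey dissatisfaction applies per output (the example does not say so explicitly). If 95% of outputs were rated high quality and 40% left users dissatisfied, the two groups must overlap by at least 95 + 40 - 100 percentage points. The function name below is hypothetical.

```python
def min_overlap_percent(rated_high_pct: int, dissatisfied_pct: int) -> int:
    """Smallest possible share of outputs that are both rated high quality
    and dissatisfying, given the two marginal percentages."""
    return max(0, rated_high_pct + dissatisfied_pct - 100)

# Figures from the example above; per-output dissatisfaction is an assumption.
overlap = min_overlap_percent(95, 40)
print(f"at least {overlap}% of outputs were rated high quality yet dissatisfied users")
```

Under that assumption, at least 35% of all outputs received a passing score and still disappointed users, so the gap cannot be explained by the 5% the evaluator rejected.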

References

The Failure Modes of Agentic AI No One Warned You About