ptrnsai

Evaluation Theater

Intermediate | Anti-Pattern | Anti-Patterns: Reasoning | Industry observation
🚫 Anti-Pattern: This describes a common mistake to avoid, not a pattern to follow.

The Anti-Pattern

Using LLM-as-judge without proper calibration, or relying on self-evaluation that is systematically overconfident.

Why It Happens

Teams use LLM evaluators as a shortcut for human evaluation, but LLMs have predictable biases: they favor verbose answers, agree with their own outputs, and can’t reliably detect subtle factual errors. Self-evaluation is especially unreliable: a model that made an error in generation is unlikely to catch that same error when evaluating. The result is a quality metric that looks great on dashboards but doesn’t correlate with actual user experience.

How to Fix It

Calibrate LLM judges against human ground truth before trusting them. Use multiple diverse evaluators through consensus voting, since different models catch different errors. Never rely solely on self-evaluation. Build evaluation rubrics with concrete, verifiable criteria rather than subjective quality judgments. The test: if your LLM evaluator agrees with human judges less than 80% of the time, it’s not evaluating; it’s rubber-stamping.
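The 80% test can be sketched as a simple agreement check. This is a minimal illustration, assuming pass/fail labels from both the LLM judge and human reviewers on the same items; the function names and the sample data are hypothetical, and raw agreement is only the simplest possible measure.

```python
from typing import Sequence

AGREEMENT_THRESHOLD = 0.80  # the 80% bar stated in the text


def agreement_rate(judge_labels: Sequence[bool], human_labels: Sequence[bool]) -> float:
    """Fraction of items on which the LLM judge and the human label agree."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be the same length")
    if not judge_labels:
        raise ValueError("need at least one labeled item")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


def is_calibrated(judge_labels: Sequence[bool], human_labels: Sequence[bool],
                  threshold: float = AGREEMENT_THRESHOLD) -> bool:
    """True if the judge meets the agreement bar against human ground truth."""
    return agreement_rate(judge_labels, human_labels) >= threshold


# Hypothetical labels: True = "pass", False = "fail".
human = [True, False, False, True, False, True, False, False, True, False]
judge = [True, True, True, True, True, True, False, True, True, True]  # passes almost everything

rate = agreement_rate(judge, human)
print(f"agreement: {rate:.0%}, calibrated: {is_calibrated(judge, human)}")
```

A judge that passes nearly everything can still score well on raw agreement when most outputs really are good, so it helps to make sure the calibration set contains a meaningful share of known failures.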

Diagram

  Evaluation Theater:                   Real Evaluation:
  ┌────────┐    ┌────────┐              ┌────────┐    ┌─────────┐
  │ Agent  │───▶│ Agent  │──▶ 'Great!'  │ Agent  │───▶│ Judge A │──┐
  │ output │    │ self-  │    (always)  │ output │    └─────────┘  │
  └────────┘    │ eval   │              └────────┘    ┌─────────┐  ├──▶ Consensus
                └────────┘                        ───▶│ Judge B │──┘
                 ↑ same model,                        └─────────┘
                   same blind spots                   Different models,
                                                      different blind spots

Symptoms

  • Evaluation scores don’t correlate with actual user satisfaction or task success
  • Self-evaluation consistently rates output as high quality
  • Known-bad test outputs receive passing scores from the evaluator
  • Quality metrics look great but user complaints keep rising
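The third symptom is the easiest to test directly: feed the evaluator outputs that are known to be bad and count how many it passes. The sketch below assumes the judge can be wrapped as a function returning pass/fail; `rubber_stamp_judge` is a hypothetical stand-in for an uncalibrated LLM judge (it simply rewards length, mimicking verbosity bias), and the sample outputs are invented for illustration.

```python
from typing import Callable, Iterable


def false_pass_rate(judge: Callable[[str], bool], known_bad: Iterable[str]) -> float:
    """Share of known-bad outputs that the judge wrongly passes."""
    outputs = list(known_bad)
    if not outputs:
        raise ValueError("need at least one known-bad output")
    return sum(judge(o) for o in outputs) / len(outputs)


def rubber_stamp_judge(output: str) -> bool:
    # Hypothetical stand-in for an uncalibrated judge: passes anything
    # that is at least five words long, regardless of correctness.
    return len(output.split()) >= 5


# Invented known-bad outputs: two confident factual errors and one empty answer.
known_bad_outputs = [
    "The capital of Australia is Sydney, as everyone knows well.",
    "Water boils at 50 degrees Celsius at sea level.",
    "N/A",
]

bad_pass_rate = false_pass_rate(rubber_stamp_judge, known_bad_outputs)
print(f"known-bad outputs passed: {bad_pass_rate:.0%}")
```

Any rate meaningfully above zero here suggests the evaluator is missing the very failures it is supposed to catch.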

False Positives

  • Well-calibrated LLM judges that have been validated against human ground truth
  • Preliminary automated screening ahead of human-in-the-loop review
  • Low-stakes evaluations where approximate quality signals are sufficient

Warning Signs & Consequences

Warning Signs

  • Eval scores consistently above 90% on tasks where humans find 30%+ errors
  • Self-reported confidence scores that are always high regardless of actual quality
  • No ground truth comparison: eval metrics exist in isolation
  • Quality dashboards that tell a different story than user feedback
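The first two warning signs can be checked mechanically once a human-reviewed sample exists. The sketch below is one possible check, not a standard method: the 90% and 30% thresholds come from the list above, while `min_gap` (how much higher confidence should be on good outputs than on bad ones) and all sample data are assumptions made for illustration.

```python
from typing import List, Sequence


def find_warning_signs(eval_passed: Sequence[bool], human_ok: Sequence[bool],
                       confidence: Sequence[float], min_gap: float = 0.05) -> List[str]:
    """Flag the first two warning signs on a human-reviewed sample."""
    signs = []

    eval_pass_rate = sum(eval_passed) / len(eval_passed)
    human_error_rate = 1 - sum(human_ok) / len(human_ok)
    if eval_pass_rate > 0.90 and human_error_rate >= 0.30:
        signs.append("eval pass rate above 90% while humans find 30%+ errors")

    # Compare mean self-reported confidence on good vs. bad outputs.
    conf_good = [c for c, ok in zip(confidence, human_ok) if ok]
    conf_bad = [c for c, ok in zip(confidence, human_ok) if not ok]
    if conf_good and conf_bad:
        gap = sum(conf_good) / len(conf_good) - sum(conf_bad) / len(conf_bad)
        if gap < min_gap:  # min_gap is an assumed, tunable margin
            signs.append("confidence is high regardless of actual quality")

    return signs


# Invented sample: the evaluator passes everything, humans reject 4 of 10.
eval_passed = [True] * 10
human_ok = [True] * 6 + [False] * 4
confidence = [0.95, 0.97, 0.96, 0.94, 0.95, 0.96, 0.95, 0.96, 0.97, 0.94]

signs = find_warning_signs(eval_passed, human_ok, confidence)
for sign in signs:
    print("WARNING:", sign)
```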

Consequences

  • False confidence in agent quality leading to premature deployment
  • Shipping poor-quality outputs to users based on misleading metrics
  • Missing critical errors that a calibrated evaluator would catch
  • Wasted engineering time ‘optimizing’ against a meaningless metric

Remediation Steps

  1. Calibrate LLM judges against human ground truth on a representative test set
  2. Use multiple diverse evaluators, since different models catch different failure modes
  3. Never use self-evaluation as the sole quality signal
  4. Build rubrics with concrete, verifiable criteria (not ‘is this good?’)
  5. Regularly audit eval accuracy: does the eval metric predict user satisfaction?
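Steps 2 and 4 can be sketched together: a rubric built from checks that are either true or false, and a consensus step that escalates disagreements to a human. Everything here is illustrative. The rubric criteria, `rubric_judge`, and `length_judge` are hypothetical stand-ins; in practice each judge would wrap a different model.

```python
from typing import Callable, Dict

Judge = Callable[[str], bool]

# A rubric of concrete, verifiable checks instead of "is this good?".
# These three criteria are invented examples.
RUBRIC: Dict[str, Callable[[str], bool]] = {
    "cites_a_source": lambda text: "http" in text,
    "within_length_limit": lambda text: len(text.split()) <= 50,
    "no_placeholder_text": lambda text: "TODO" not in text,
}


def rubric_judge(text: str) -> bool:
    """Passes only if every rubric check passes."""
    return all(check(text) for check in RUBRIC.values())


def length_judge(text: str) -> bool:
    """Stand-in for a second, differently biased judge (rewards length)."""
    return len(text.split()) >= 3


def consensus(output: str, judges: Dict[str, Judge]) -> dict:
    """Strict-majority vote; any disagreement is flagged for human review."""
    votes = {name: judge(output) for name, judge in judges.items()}
    passes = sum(votes.values())
    return {
        "votes": votes,
        "passed": passes > len(votes) / 2,
        "needs_human_review": 0 < passes < len(votes),
    }


judges = {"rubric": rubric_judge, "length": length_judge}
good = consensus("See https://example.com for the full changelog.", judges)
bad = consensus("TODO fill in later please", judges)
print(good)
print(bad)
```

Routing split votes to a human rather than averaging them away keeps the disagreement visible, which is where differently biased judges add the most value.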

Real-World Example

The 95% Quality Illusion

A content generation pipeline uses GPT-4 to evaluate its own GPT-4 outputs. The evaluator rates 95% of outputs as ‘high quality.’ The team celebrates and ships to production. User feedback surveys show 40% dissatisfaction. Investigation reveals the evaluator has the same blind spots as the generator: it can’t detect the subtle factual errors and awkward phrasing that users notice immediately.
