Evaluator-Optimizer

Intermediate · Workflow Patterns · Anthropic

Intent

One LLM call generates a response while another evaluates it and provides feedback, repeating in a loop until quality criteria are met.

Problem

First-draft LLM outputs are often good but not great. Important nuances get missed, code has subtle bugs, translations lose tone. The gap between 'acceptable' and 'excellent' often requires iteration — the same kind of revision cycle a human would go through.

Solution

Set up two roles: a Generator that produces output and an Evaluator that assesses it against defined criteria. The evaluator provides specific, actionable feedback that the generator uses to improve. This loop continues until the evaluator is satisfied or a maximum iteration count is reached. This is the Reflection pattern (Andrew Ng) implemented as a two-agent workflow. The separation of generator and evaluator allows each to be optimized for its role.
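The loop described above can be sketched independently of any particular model API. This is a minimal sketch, not a definitive implementation: `refine`, `Verdict`, and the two toy callables are hypothetical names introduced for illustration, and in practice `generate` and `evaluate` would each wrap an LLM call.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    passed: bool    # True when the draft meets the criteria
    feedback: str   # actionable notes for the next revision


def refine(
    task: str,
    generate: Callable[[str, Optional[str], Optional[str]], str],
    evaluate: Callable[[str, str], Verdict],
    max_rounds: int = 3,
) -> tuple[str, int]:
    """Run the generate/evaluate loop; return (final draft, revisions made)."""
    draft = generate(task, None, None)  # first draft: no previous draft, no feedback
    for revisions in range(max_rounds):
        verdict = evaluate(task, draft)
        if verdict.passed:
            return draft, revisions
        draft = generate(task, draft, verdict.feedback)
    return draft, max_rounds  # cap reached; the last revision is returned unevaluated


# Stand-in callables (assumptions) so the sketch runs without a model.
def toy_generate(task, previous, feedback):
    return task if previous is None else previous + "!"


def toy_evaluate(task, draft):
    return Verdict(passed=draft.endswith("!!"), feedback="add emphasis")


final, revisions = refine("hello", toy_generate, toy_evaluate)
print(final, revisions)  # hello!! 2
```

Returning the revision count alongside the draft makes it easy to log how many rounds each task needed.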

Diagram

Input → [Generator LLM] → Output
              ↑                    ↓
              │              [Evaluator LLM]
              │                    ↓
              └── Feedback ←── Pass? ──→ Final Output
                                (No)        (Yes)

When to Use

Tasks where LLM output demonstrably improves with human-like feedback
When you have clear, measurable evaluation criteria
Literary translation, polished writing, complex code generation
Search tasks that may need multiple rounds of refinement

When NOT to Use

Simple tasks where the first output is good enough
When evaluation criteria are vague or subjective
Latency-critical applications — each loop adds round-trip time

Pros & Cons

Pros

Iteratively improves output quality
Evaluator catches issues the generator misses
Clear stopping criteria make it predictable
Each role (generator vs evaluator) can be independently tuned

Cons

Multiple LLM calls increase latency and cost
Risk of runaway loops if criteria are too strict and no iteration cap is enforced
Evaluator may not catch all issues or may hallucinate issues
Returns typically diminish after 2-3 iterations

Implementation Steps

1. Define clear evaluation criteria (rubric) for the evaluator
2. Build the generator prompt optimized for producing good first drafts
3. Build the evaluator prompt that produces specific, actionable feedback
4. Implement the feedback loop with a maximum iteration count
5. Parse evaluator output to determine pass/fail and extract feedback
6. Monitor iteration counts: if most tasks need max iterations, criteria may be too strict
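Steps 5 and 6 can be sketched as follows. The sketch assumes the evaluator prompt tells the model to put `PASS` or `FAIL` alone on the first line of its reply, followed by feedback. That reply format and the helper names `parse_verdict` and `cap_hit_rate` are assumptions made for this example, not part of any library.

```python
def parse_verdict(reply: str) -> tuple[bool, str]:
    """Split an evaluator reply into (passed, feedback).

    Assumes the first line is exactly PASS or FAIL. Anything else is
    treated as a failure so a malformed reply never ends the loop early.
    """
    lines = reply.strip().splitlines()
    if not lines:
        return False, ""
    head = lines[0].strip().upper()
    feedback = "\n".join(lines[1:]).strip()
    if head == "PASS":
        return True, feedback
    if head == "FAIL":
        return False, feedback
    return False, reply.strip()  # unexpected format: keep the whole reply as feedback


def cap_hit_rate(rounds_used: list[int], max_rounds: int) -> float:
    """Fraction of tasks that ran all the way to the iteration cap."""
    if not rounds_used:
        return 0.0
    return sum(r >= max_rounds for r in rounds_used) / len(rounds_used)


print(parse_verdict("FAIL\nTighten the introduction."))
print(cap_hit_rate([1, 3, 3, 2], max_rounds=3))  # 0.5
```

Matching the whole first line, rather than searching for the word anywhere in the reply, avoids treating feedback such as "This does not PASS yet" as a pass. A consistently high cap-hit rate suggests the rubric is stricter than the generator can satisfy.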

Real-World Example

Literary Translation

Translating a novel from English to Japanese: the Generator produces a translation. The Evaluator checks for natural phrasing, cultural nuances, tone preservation, and accuracy. Feedback like 'The formality level in paragraph 3 should be higher for this character' guides the next iteration.
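One way to turn the checks named in this example into an evaluator prompt is to keep the rubric as data and render it into the prompt. The rubric wording, the `build_evaluator_prompt` helper, and the PASS/FAIL reply convention below are all illustrative assumptions, not a prescribed format.

```python
# Rubric items paraphrase the checks named in the translation example.
TRANSLATION_RUBRIC = [
    "Natural phrasing in the target language",
    "Cultural nuances handled appropriately",
    "Tone and formality preserved for each character",
    "Accuracy against the source text",
]


def build_evaluator_prompt(source: str, translation: str, rubric: list[str]) -> str:
    """Render the rubric and both texts into a single evaluator prompt."""
    criteria = "\n".join(f"{i}. {item}" for i, item in enumerate(rubric, start=1))
    return (
        "You are reviewing an English-to-Japanese translation.\n"
        "Check it against each criterion:\n"
        f"{criteria}\n\n"
        "Reply with PASS on the first line if every criterion is met. "
        "Otherwise reply with FAIL on the first line, then list specific, "
        "actionable fixes that name the paragraph concerned.\n\n"
        f"Source:\n{source}\n\nTranslation:\n{translation}"
    )


prompt = build_evaluator_prompt("Good morning, sir.", "おはようございます。", TRANSLATION_RUBRIC)
print(prompt)
```

Keeping the rubric in a list means criteria can be added or loosened without rewriting the prompt text.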

Python: Essay Writing with Iterative Feedback Loop

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-sonnet-4-20250514"


def ask(prompt: str, max_tokens: int) -> str:
    """Send a single-turn prompt and return the text of the reply."""
    response = client.messages.create(
        model=MODEL,
        max_tokens=max_tokens,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text


def generate_and_optimize(topic: str, max_rounds: int = 3) -> str:
    # Generator: initial draft
    essay = ask(f"Write a short essay on: {topic}", max_tokens=1024)

    for _ in range(max_rounds):
        # Evaluator: score the draft; the verdict goes first so it is easy to parse
        evaluation = ask(
            "Rate this essay 1-10 on clarity, structure, and evidence. "
            "If every score is 8 or higher, reply with only the word PASS. "
            "Otherwise reply with FAIL on the first line, followed by specific feedback.\n\n"
            f"{essay}",
            max_tokens=256,
        )

        # Check only the start of the reply: a substring test would also
        # match feedback that merely mentions the word PASS.
        if evaluation.strip().upper().startswith("PASS"):
            break

        # Generator: revise based on feedback
        essay = ask(
            f"Revise this essay based on feedback.\n\nFeedback: {evaluation}\n\nEssay: {essay}",
            max_tokens=1024,
        )

    return essay

References

Building Effective Agents — Anthropic