ptrnsai

Evaluator-Optimizer

Intermediate · ⛓️ Workflow Patterns · Anthropic

Intent

One LLM generates a response while another evaluates it and provides feedback in an iterative loop until quality criteria are met.

Problem

First-draft LLM outputs are often good but not great. Important nuances get missed, code has subtle bugs, translations lose tone. The gap between 'acceptable' and 'excellent' often requires iteration — the same kind of revision cycle a human would go through.

Solution

Set up two roles: a Generator that produces output and an Evaluator that assesses it against defined criteria. The evaluator provides specific, actionable feedback that the generator uses to improve. This loop continues until the evaluator is satisfied or a maximum iteration count is reached. This is the Reflection pattern (Andrew Ng) implemented as a two-agent workflow. The separation of generator and evaluator allows each to be optimized for its role.
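The loop described above can be sketched independently of any particular model API. This is a minimal sketch, not a definitive implementation: the function `evaluator_optimizer`, the three callable types, and the toy stand-ins are hypothetical names introduced here for illustration. It assumes the evaluator returns a pass/fail flag together with feedback text.

```python
from typing import Callable, Tuple

# Hypothetical type aliases for this sketch; the pattern itself does not prescribe them.
Generate = Callable[[str], str]                     # task -> first draft
Evaluate = Callable[[str, str], Tuple[bool, str]]   # (task, draft) -> (passed, feedback)
Revise = Callable[[str, str, str], str]             # (task, draft, feedback) -> new draft


def evaluator_optimizer(
    task: str,
    generate: Generate,
    evaluate: Evaluate,
    revise: Revise,
    max_iterations: int = 3,
) -> Tuple[str, int, bool]:
    """Run the generate -> evaluate -> revise loop.

    Returns (final_draft, evaluations_run, passed).
    """
    draft = generate(task)
    for round_number in range(1, max_iterations + 1):
        passed, feedback = evaluate(task, draft)
        if passed:
            return draft, round_number, True
        # Skip the revision on the last round: it would never be evaluated.
        if round_number < max_iterations:
            draft = revise(task, draft, feedback)
    return draft, max_iterations, False


# Toy stand-ins (no model calls): the evaluator passes once the draft
# has been revised twice.
def toy_generate(task: str) -> str:
    return f"{task}: draft"


def toy_evaluate(task: str, draft: str) -> Tuple[bool, str]:
    return draft.count("+rev") >= 2, "needs more detail"


def toy_revise(task: str, draft: str, feedback: str) -> str:
    return draft + "+rev"


result = evaluator_optimizer("essay", toy_generate, toy_evaluate, toy_revise, max_iterations=5)
print(result)  # ('essay: draft+rev+rev', 3, True)
```

Passing the generator, evaluator, and reviser in as callables keeps the two roles separate, so each prompt or model can be swapped without touching the loop.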

Diagram

Input → [Generator LLM] → Output
              ↑                    ↓
              │              [Evaluator LLM]
              │                    ↓
              └── Feedback ←── Pass? ──→ Final Output
                                (No)        (Yes)

When to Use

  • Tasks where LLM output demonstrably improves when it receives the kind of feedback a human reviewer would give
  • When you have clear, measurable evaluation criteria
  • Literary translation, polished writing, complex code generation
  • Search tasks that may need multiple rounds of refinement

When NOT to Use

  • Simple tasks where the first output is good enough
  • When evaluation criteria are vague or subjective
  • Latency-critical applications — each loop adds round-trip time

Pros & Cons

Pros

  • Iteratively improves output quality
  • Evaluator catches issues the generator misses
  • Clear stopping criteria make it predictable
  • Each role (generator vs evaluator) can be independently tuned

Cons

  • Multiple LLM calls increase latency and cost
  • Risk of infinite loops if criteria are too strict and no iteration cap is enforced
  • Evaluator may not catch all issues or may hallucinate issues
  • Returns typically diminish after 2-3 iterations
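One way to guard against both runaway loops and diminishing returns is to stop when the evaluator's score stops improving, in addition to the hard iteration cap. The sketch below assumes the evaluator returns a numeric score per iteration; the function name `should_stop` and the default thresholds (`pass_score`, `min_gain`, `max_iterations`) are illustrative assumptions, not values from the pattern.

```python
from typing import List


def should_stop(
    scores: List[float],
    pass_score: float = 8.0,    # assumed pass threshold
    min_gain: float = 0.5,      # assumed minimum improvement worth another round
    max_iterations: int = 3,    # hard cap on evaluation rounds
) -> str:
    """Return a stop reason, or "" to keep iterating.

    `scores` holds one evaluator score per completed iteration, oldest first.
    """
    if not scores:
        return ""
    if scores[-1] >= pass_score:
        return "passed"
    if len(scores) >= max_iterations:
        return "max_iterations"
    # Plateau: the latest revision barely moved the score.
    if len(scores) >= 2 and scores[-1] - scores[-2] < min_gain:
        return "plateau"
    return ""


print(should_stop([6.0, 6.2]))  # plateau
```

Returning the reason rather than a bare boolean makes it easier to log why each task stopped.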

Implementation Steps

  1. Define clear evaluation criteria (rubric) for the evaluator
  2. Build the generator prompt optimized for producing good first drafts
  3. Build the evaluator prompt that produces specific, actionable feedback
  4. Implement the feedback loop with a maximum iteration count
  5. Parse evaluator output to determine pass/fail and extract feedback
  6. Monitor iteration counts — if most tasks need max iterations, criteria may be too strict
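Steps 5 and 6 can be sketched as follows. The sketch assumes the evaluator has been prompted to reply with a JSON object of the form `{"verdict": "PASS" | "FAIL", "feedback": "..."}`; that reply format, the `Verdict` class, and the helper names are assumptions made for illustration. Any reply that cannot be parsed is treated as a failure, so a malformed answer never ends the loop early.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Verdict:
    passed: bool
    feedback: str


def parse_verdict(raw: str) -> Verdict:
    """Parse the evaluator's reply; treat anything unparseable as a failure."""
    try:
        data = json.loads(raw)
        return Verdict(
            passed=data["verdict"] == "PASS",
            feedback=str(data.get("feedback", "")),
        )
    except (json.JSONDecodeError, KeyError, TypeError, AttributeError):
        # Not the expected JSON object: keep the raw text as feedback.
        return Verdict(passed=False, feedback=raw.strip())


def share_hitting_cap(iteration_counts: List[int], max_iterations: int) -> float:
    """Fraction of tasks that used every allowed iteration."""
    if not iteration_counts:
        return 0.0
    capped = sum(1 for n in iteration_counts if n >= max_iterations)
    return capped / len(iteration_counts)


print(parse_verdict('{"verdict": "FAIL", "feedback": "Tighten the intro."}'))
print(share_hitting_cap([3, 3, 1, 3], max_iterations=3))  # 0.75
```

A high value from `share_hitting_cap` is the signal step 6 describes: if most tasks exhaust the cap, the rubric may be stricter than the generator can satisfy.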

Real-World Example

Literary Translation

Translating a novel from English to Japanese: the Generator produces a translation. The Evaluator checks for natural phrasing, cultural nuances, tone preservation, and accuracy. Feedback like 'The formality level in paragraph 3 should be higher for this character' guides the next iteration.
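For the translation example, the rubric can be kept as data and rendered into the evaluator prompt. The four criterion names below come from the example above; their one-line descriptions, the prompt wording, and the names `TRANSLATION_RUBRIC` and `build_evaluator_prompt` are illustrative assumptions.

```python
from typing import Dict

# Criterion names come from the example; the descriptions are illustrative wording.
TRANSLATION_RUBRIC: Dict[str, str] = {
    "natural phrasing": "Reads as if originally written in the target language.",
    "cultural nuances": "Idioms and references are adapted for the target audience.",
    "tone preservation": "Formality and character voice match the source.",
    "accuracy": "No meaning is added, dropped, or distorted.",
}


def build_evaluator_prompt(
    source_text: str,
    translation: str,
    rubric: Dict[str, str] = TRANSLATION_RUBRIC,
) -> str:
    """Render the rubric and both texts into a single evaluator prompt."""
    criteria = "\n".join(f"- {name}: {description}" for name, description in rubric.items())
    return (
        "You are reviewing a literary translation from English to Japanese.\n"
        f"Assess it against each criterion:\n{criteria}\n\n"
        "If every criterion is met, reply with only the word PASS. "
        "Otherwise give specific, actionable feedback that names the paragraph "
        "and the change needed.\n\n"
        f"Source:\n{source_text}\n\nTranslation:\n{translation}"
    )


print(build_evaluator_prompt("Hello.", "こんにちは。"))
```

Keeping the rubric separate from the prompt text means criteria can be tightened or relaxed without rewriting the evaluator prompt.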

Python: Essay Writing with Iterative Feedback Loop
import anthropic

client = anthropic.Anthropic()

def generate_and_optimize(topic: str, max_rounds: int = 3) -> str:
    # Generator: initial draft
    essay = client.messages.create(
        model="claude-sonnet-4-20250514", max_tokens=1024,
        messages=[{"role": "user", "content": f"Write a short essay on: {topic}"}]
    ).content[0].text

    for _ in range(max_rounds):
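        # Evaluator: score and provide feedback. Asking for PASS as the entire
        # reply avoids matching the word inside feedback such as "does not PASS".
        evaluation = client.messages.create(
            model="claude-sonnet-4-20250514", max_tokens=256,
            messages=[{"role": "user", "content": f"Rate this essay 1-10 on clarity, structure, and evidence. If all scores are 8 or higher, reply with only the word PASS. Otherwise give specific feedback.\n\n{essay}"}]
        ).content[0].text

        # Stop only on an explicit verdict, not on any mention of "PASS"
        if evaluation.strip().upper().startswith("PASS"):
            break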

        # Generator: revise based on feedback
        essay = client.messages.create(
            model="claude-sonnet-4-20250514", max_tokens=1024,
            messages=[{"role": "user", "content": f"Revise this essay based on feedback.\n\nFeedback: {evaluation}\n\nEssay: {essay}"}]
        ).content[0].text

    return essay

References