
Constitutional AI

Advanced · 🛡️ Guardrails & Safety · Anthropic (Bai et al., 2022)

Intent

Guide agent behavior through an explicit set of principles (a 'constitution') that the agent self-enforces through critique and revision.

Problem

Hard-coded rules can't cover every situation. An agent needs to handle novel scenarios it wasn't explicitly programmed for, in a way that aligns with your values and policies. RLHF (reinforcement learning from human feedback) is expensive and doesn't scale to every edge case.

Solution

Define a set of principles (the constitution) that describe desired behavior. The agent generates a response, then critiques its own response against the constitution, and revises it to better align with the principles. This self-supervision is cheaper and more scalable than human feedback for every output. The constitution can encode business rules, ethical guidelines, tone requirements, or any behavioral standard you want the agent to follow.

Diagram

Principles (Constitution):
  1. Be honest and accurate
  2. Respect user privacy
  3. Don't give harmful advice
  4. Stay on topic

Agent generates response
    ↓
[Self-critique against principles]
    ↓
"My response might violate principle 2 by mentioning the user's location."
    ↓
[Revise response]
    ↓
Aligned response delivered

When to Use

  • When you need nuanced behavioral guidance beyond simple rules
  • Systems that must handle diverse, unpredictable inputs
  • When consistency of behavior across edge cases is critical
  • As a complement to input/output guardrails for deeper alignment

When NOT to Use

  • When simple rules or classifiers are sufficient
  • Latency-critical applications (self-critique adds time)
  • When the principles are too vague to meaningfully critique against

Pros & Cons

Pros

  • Handles novel situations by reasoning from principles
  • More scalable than human review of every output
  • Principles are transparent and auditable
  • Self-improving: the critique step catches subtle issues

Cons

  • Self-critique is imperfect — the agent may miss violations
  • Writing good principles is hard — too vague and they're useless
  • Double LLM call (generate + critique) increases cost
  • Agent may over-restrict itself, being too cautious

Implementation Steps

  1. Write clear, specific, non-contradictory principles
  2. Implement the generate-critique-revise loop
  3. Test with adversarial inputs to verify principle enforcement
  4. Monitor for over-restriction (false positives) and under-restriction (misses)
  5. Iterate on principles based on real-world failures
  6. Version control your constitution — track changes and their effects
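Step 6 can be as lightweight as storing the principles alongside a version label and a content hash, so logs can record exactly which constitution produced a given response. A minimal sketch — the `Constitution` dataclass and its field names are illustrative, not part of any library:

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Constitution:
    """A versioned, auditable set of principles (illustrative sketch)."""
    version: str
    principles: tuple[str, ...]

    @property
    def fingerprint(self) -> str:
        # A content hash ties each logged response to the exact principle
        # text that was in force, not just a version label someone forgot
        # to bump.
        joined = "\n".join(self.principles)
        return hashlib.sha256(joined.encode()).hexdigest()[:12]

    def as_numbered_text(self) -> str:
        # Render principles in the numbered form used in critique prompts.
        return "\n".join(f"{i + 1}. {p}" for i, p in enumerate(self.principles))


v1 = Constitution("1.0", (
    "Never provide harmful or illegal instructions",
    "Acknowledge uncertainty",
))
v2 = Constitution("1.1", v1.principles + ("Respect privacy",))
```

Because `fingerprint` is derived from the principle text itself, `v1` and `v2` get different hashes even if someone reuses a version string.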

Real-World Example

Financial Advisor Agent

Constitution includes: 'Never provide specific investment advice,' 'Always include risk disclaimers,' 'Never promise returns.' Agent generates a response about stocks, self-critiques and notices it said 'you should buy' — revises to 'some investors consider' with appropriate disclaimers.
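For rules with exact trigger phrases like these, a deterministic check can run alongside the LLM self-critique: it cannot catch paraphrases, but it never misses a literal violation and costs nothing per call. A sketch with illustrative patterns (the phrase list is an assumption, not from the example above):

```python
import re

# Illustrative banned-phrase patterns for a financial-advisor constitution.
# This complements, not replaces, LLM self-critique.
BANNED_PATTERNS = [
    r"\byou should (buy|sell)\b",
    r"\bguaranteed returns?\b",
    r"\bcan'?t lose\b",
]


def hard_check(response: str) -> list[str]:
    """Return the patterns the response violates (empty list = pass)."""
    return [p for p in BANNED_PATTERNS
            if re.search(p, response, re.IGNORECASE)]
```

Running `hard_check` on every final response gives a cheap regression signal for step 4 of the implementation: a nonzero result means the critique loop let an exact-phrase violation through.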

Prompt: Constitutional Principles
You are an AI assistant that follows these constitutional principles:

1. HARMLESSNESS: Never provide instructions for illegal or harmful activities
2. HONESTY: Acknowledge uncertainty — don't fabricate facts
3. PRIVACY: Never request or reveal personal identifying information
4. HELPFULNESS: Stay on-topic and provide clear, actionable guidance

For each response, follow this process:

<draft>Write your initial response</draft>
<critique>Check the draft against each principle above</critique>
<final>Revise if any violations were found, otherwise keep as-is</final>
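If the model follows the tagged format above, only the `<final>` section should reach the user. A small parser can extract it; `extract_final` is a hypothetical helper, not part of the prompt or any SDK:

```python
import re


def extract_final(model_output: str) -> str:
    """Pull the <final> section out of a draft/critique/final response.

    Falls back to the full output if the model ignored the tag format,
    so a formatting slip never silently drops the answer.
    """
    match = re.search(r"<final>(.*?)</final>", model_output, re.DOTALL)
    return match.group(1).strip() if match else model_output.strip()


sample = (
    "<draft>Buy stock X now.</draft>"
    "<critique>Violates the no-specific-advice principle.</critique>"
    "<final>Some investors research index funds; consult a licensed advisor.</final>"
)
```

The fallback is a deliberate design choice: showing an untagged response is usually safer than returning nothing, though a stricter deployment might route format failures to a retry instead.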
Python: Generate-Critique-Revise Loop
from openai import OpenAI

client = OpenAI()

CONSTITUTION = [
    "Never provide harmful or illegal instructions",
    "Acknowledge uncertainty — don't fabricate facts",
    "Respect privacy — don't request or reveal PII",
]

def constitutional_generate(prompt: str, max_revisions: int = 2) -> str:
    draft = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content

    principles = "\n".join(f"{i+1}. {p}" for i, p in enumerate(CONSTITUTION))

    for _ in range(max_revisions):
        # Critique step: instruct the model to emit a sentinel phrase so
        # the loop has a reliable stop condition to match against.
        critique = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": (
                f"Principles:\n{principles}\n\n"
                "Critique the following response for violations of these "
                "principles. If there are none, reply exactly 'NO VIOLATIONS'.\n\n"
                f"{draft}"
            )}],
        ).choices[0].message.content

        if "no violations" in critique.lower():
            break

        # Revise step: rewrite the draft to address the critique.
        draft = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": f"Revise the response to address this critique:\n{critique}\n\nOriginal response:\n{draft}"}],
        ).choices[0].message.content

    return draft

References