
Constitutional AI

Advanced · 🛡️ Guardrails & Safety · Anthropic (Bai et al., 2022)

Intent

Guide agent behavior through an explicit set of principles (a 'constitution') that the agent self-enforces through critique and revision.

Problem

Hard-coded rules can't cover every situation. An agent needs to handle novel scenarios it wasn't explicitly programmed for, in a way that aligns with your values and policies. RLHF (reinforcement learning from human feedback) is expensive and doesn't scale to every edge case.

Solution

Define a set of principles (the constitution) that describe desired behavior. The agent generates a response, then critiques its own response against the constitution, and revises it to better align with the principles. This self-supervision is cheaper and more scalable than human feedback for every output. The constitution can encode business rules, ethical guidelines, tone requirements, or any behavioral standard you want the agent to follow.

Diagram

Principles (Constitution):
  1. Be honest and accurate
  2. Respect user privacy
  3. Don't give harmful advice
  4. Stay on topic

Agent generates response
    ↓
[Self-critique against principles]
    ↓
"My response might violate principle 2 by mentioning the user's location."
    ↓
[Revise response]
    ↓
Aligned response delivered

When to Use

  • When you need nuanced behavioral guidance beyond simple rules
  • Systems that must handle diverse, unpredictable inputs
  • When consistency of behavior across edge cases is critical
  • As a complement to input/output guardrails for deeper alignment

When NOT to Use

  • When simple rules or classifiers are sufficient
  • Latency-critical applications (self-critique adds time)
  • When the principles are too vague to meaningfully critique against

Pros & Cons

Pros

  • Handles novel situations by reasoning from principles
  • More scalable than human review of every output
  • Principles are transparent and auditable
  • Self-improving: the critique step catches subtle issues

Cons

  • Self-critique is imperfect — the agent may miss violations
  • Writing good principles is hard — too vague and they're useless
  • Double LLM call (generate + critique) increases cost
  • Agent may over-restrict itself, being too cautious

Implementation Steps

  1. Write clear, specific, non-contradictory principles
  2. Implement the generate-critique-revise loop
  3. Test with adversarial inputs to verify principle enforcement
  4. Monitor for over-restriction (false positives) and under-restriction (misses)
  5. Iterate on principles based on real-world failures
  6. Version control your constitution — track changes and their effects
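Step 6 can be as lightweight as storing the principles alongside a version label and a content hash, so logs can record exactly which constitution produced a given response. A minimal sketch — the `Constitution` dataclass and its field names are illustrative, not part of any library:

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Constitution:
    """A versioned, auditable set of principles (illustrative sketch)."""
    version: str
    principles: tuple[str, ...]

    @property
    def fingerprint(self) -> str:
        # A content hash ties each logged response to the exact principle
        # text that was in force, not just a version label someone forgot
        # to bump.
        joined = "\n".join(self.principles)
        return hashlib.sha256(joined.encode()).hexdigest()[:12]

    def as_numbered_text(self) -> str:
        # Render principles in the numbered form used in critique prompts.
        return "\n".join(f"{i + 1}. {p}" for i, p in enumerate(self.principles))


v1 = Constitution("1.0", (
    "Never provide harmful or illegal instructions",
    "Acknowledge uncertainty",
))
v2 = Constitution("1.1", v1.principles + ("Respect privacy",))
```

Because `fingerprint` is derived from the principle text itself, `v1` and `v2` get different hashes even if someone reuses a version string.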

Real-World Example

Financial Advisor Agent

Constitution includes: 'Never provide specific investment advice,' 'Always include risk disclaimers,' 'Never promise returns.' Agent generates a response about stocks, self-critiques and notices it said 'you should buy' — revises to 'some investors consider' with appropriate disclaimers.
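For rules with exact trigger phrases like these, a deterministic check can run alongside the LLM self-critique: it cannot catch paraphrases, but it never misses a literal violation and costs nothing per call. A sketch with illustrative patterns (the phrase list is an assumption, not from the example above):

```python
import re

# Illustrative banned-phrase patterns for a financial-advisor constitution.
# This complements, not replaces, LLM self-critique.
BANNED_PATTERNS = [
    r"\byou should (buy|sell)\b",
    r"\bguaranteed returns?\b",
    r"\bcan'?t lose\b",
]


def hard_check(response: str) -> list[str]:
    """Return the patterns the response violates (empty list = pass)."""
    return [p for p in BANNED_PATTERNS
            if re.search(p, response, re.IGNORECASE)]
```

Running `hard_check` on every final response gives a cheap regression signal for step 4 of the implementation: a nonzero result means the critique loop let an exact-phrase violation through.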

Prompt: Constitutional Principles
You are an AI assistant that follows these constitutional principles:

1. HARMLESSNESS: Never provide instructions for illegal or harmful activities
2. HONESTY: Acknowledge uncertainty — don't fabricate facts
3. PRIVACY: Never request or reveal personal identifying information
4. HELPFULNESS: Stay on-topic and provide clear, actionable guidance

For each response, follow this process:

<draft>Write your initial response</draft>
<critique>Check the draft against each principle above</critique>
<final>Revise if any violations were found, otherwise keep as-is</final>
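If the model follows the tagged format above, only the `<final>` section should reach the user. A small parser can extract it; `extract_final` is a hypothetical helper, not part of the prompt or any SDK:

```python
import re


def extract_final(model_output: str) -> str:
    """Pull the <final> section out of a draft/critique/final response.

    Falls back to the full output if the model ignored the tag format,
    so a formatting slip never silently drops the answer.
    """
    match = re.search(r"<final>(.*?)</final>", model_output, re.DOTALL)
    return match.group(1).strip() if match else model_output.strip()


sample = (
    "<draft>Buy stock X now.</draft>"
    "<critique>Violates the no-specific-advice principle.</critique>"
    "<final>Some investors research index funds; consult a licensed advisor.</final>"
)
```

The fallback is a deliberate design choice: showing an untagged response is usually safer than returning nothing, though a stricter deployment might route format failures to a retry instead.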
Python: Generate-Critique-Revise Loop
from openai import OpenAI

client = OpenAI()

CONSTITUTION = [
    "Never provide harmful or illegal instructions",
    "Acknowledge uncertainty — don't fabricate facts",
    "Respect privacy — don't request or reveal PII",
]

def constitutional_generate(prompt: str, max_revisions: int = 2) -> str:
    draft = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content

    principles = "\n".join(f"{i+1}. {p}" for i, p in enumerate(CONSTITUTION))

    for _ in range(max_revisions):
        # Critique step: instruct the model to emit a sentinel phrase so
        # the loop has a reliable stop condition to match against.
        critique = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": (
                f"Principles:\n{principles}\n\n"
                "Critique the following response for violations of these "
                "principles. If there are none, reply exactly 'NO VIOLATIONS'.\n\n"
                f"{draft}"
            )}],
        ).choices[0].message.content

        if "no violations" in critique.lower():
            break

        # Revise step: rewrite the draft to address the critique.
        draft = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": f"Revise the response to address this critique:\n{critique}\n\nOriginal response:\n{draft}"}],
        ).choices[0].message.content

    return draft

References