Constitutional AI
Intent
Guide agent behavior through an explicit set of principles (a 'constitution') that the agent self-enforces through critique and revision.
Problem
Hard-coded rules can't cover every situation. An agent needs to handle novel scenarios it wasn't explicitly programmed for, in a way that aligns with your values and policies. RLHF (reinforcement learning from human feedback) is expensive and doesn't scale to every edge case.
Solution
Define a set of principles (the constitution) that describe desired behavior. The agent generates a response, then critiques its own response against the constitution, and revises it to better align with the principles. This self-supervision is cheaper and more scalable than human feedback for every output. The constitution can encode business rules, ethical guidelines, tone requirements, or any behavioral standard you want the agent to follow.
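The generate-critique-revise loop described above can be sketched independently of any particular model API. In this sketch the generator, critic, and reviser are pluggable functions supplied by the caller; the function names and the stub behaviors are illustrative assumptions, not part of a standard library.

```python
from typing import Callable, List

def constitutional_loop(
    prompt: str,
    generate: Callable[[str], str],             # produces a draft response
    critique: Callable[[str, List[str]], str],  # returns violation notes, or "" if clean
    revise: Callable[[str, str], str],          # rewrites the draft given the critique
    principles: List[str],
    max_revisions: int = 2,
) -> str:
    """Generate a draft, then alternate critique and revision
    until the critic finds no violations or the budget runs out."""
    draft = generate(prompt)
    for _ in range(max_revisions):
        notes = critique(draft, principles)
        if not notes:  # critic found no violations
            return draft
        draft = revise(draft, notes)
    return draft
```

In production each of the three callables would be an LLM call (as in the full example further below); here they are kept abstract so the control flow of the pattern is visible on its own.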
Diagram
Principles (Constitution):
1. Be honest and accurate
2. Respect user privacy
3. Don't give harmful advice
4. Stay on topic
Agent generates response
↓
[Self-critique against principles]
↓
"My response might violate principle 2 by mentioning the user's location."
↓
[Revise response]
↓
Aligned response delivered
When to Use
- When you need nuanced behavioral guidance beyond simple rules
- Systems that must handle diverse, unpredictable inputs
- When consistency of behavior across edge cases is critical
- As a complement to input/output guardrails for deeper alignment
When NOT to Use
- When simple rules or classifiers are sufficient
- Latency-critical applications (self-critique adds time)
- When the principles are too vague to meaningfully critique against
Pros & Cons
Pros
- Handles novel situations by reasoning from principles
- More scalable than human review of every output
- Principles are transparent and auditable
- Self-improving: the critique step catches subtle issues
Cons
- Self-critique is imperfect — the agent may miss violations
- Writing good principles is hard — too vague and they're useless
- Double LLM call (generate + critique) increases cost
- Agent may over-restrict itself, refusing legitimate requests out of excess caution
Implementation Steps
1. Write clear, specific, non-contradictory principles
2. Implement the generate-critique-revise loop
3. Test with adversarial inputs to verify principle enforcement
4. Monitor for over-restriction (false positives) and under-restriction (misses)
5. Iterate on principles based on real-world failures
6. Version control your constitution — track changes and their effects
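Steps 5 and 6 call for treating the constitution as a versioned artifact. One way to do that is to make each revision an immutable value with a recorded reason, so every agent run can be traced back to the exact principles in force at the time. The dataclass and field names below are a sketch under that assumption, not a standard API:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Constitution:
    version: str
    principles: Tuple[str, ...]
    changelog: str = ""  # why this revision was made (e.g. a real-world failure)

def amend(base: Constitution, new_principle: str, reason: str, version: str) -> Constitution:
    """Return a new immutable version rather than mutating in place,
    so past runs remain attributable to the constitution they actually used."""
    return Constitution(
        version=version,
        principles=base.principles + (new_principle,),
        changelog=reason,
    )
```

Storing these in the same repository as the agent code gives you ordinary diff and blame tooling over principle changes.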
Real-World Example
Financial Advisor Agent
Constitution includes: 'Never provide specific investment advice,' 'Always include risk disclaimers,' 'Never promise returns.' Agent generates a response about stocks, self-critiques and notices it said 'you should buy' — revises to 'some investors consider' with appropriate disclaimers.
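As a simplified, deterministic stand-in for the LLM critique step in this example, a phrase-level check can illustrate how the advisor's constitution would catch "you should buy" and rewrite it. The banned phrases, replacements, and disclaimer text below are illustrative assumptions:

```python
# Illustrative phrase-level critic for the financial-advisor constitution.
# A real implementation would use an LLM critique; this sketch is deterministic.
BANNED_REWRITES = {
    "you should buy": "some investors consider buying",
    "guaranteed returns": "historical returns (not guaranteed)",
}
DISCLAIMER = "This is not investment advice; all investments carry risk."

def critique_and_revise(response: str) -> str:
    """Replace constitution-violating phrases and ensure a risk disclaimer is present."""
    revised = response
    for phrase, replacement in BANNED_REWRITES.items():
        revised = revised.replace(phrase, replacement)
    if DISCLAIMER not in revised:
        revised += " " + DISCLAIMER
    return revised
```

A keyword check like this misses paraphrased violations ("it would be wise to purchase"), which is exactly why the pattern uses an LLM critic for the real critique step.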
You are an AI assistant that follows these constitutional principles:
1. HARMLESSNESS: Never provide instructions for illegal or harmful activities
2. HONESTY: Acknowledge uncertainty — don't fabricate facts
3. PRIVACY: Never request or reveal personal identifying information
4. HELPFULNESS: Stay on-topic and provide clear, actionable guidance
For each response, follow this process:
<draft>Write your initial response</draft>
<critique>Check the draft against each principle above</critique>
<final>Revise if any violations were found, otherwise keep as-is</final>

from openai import OpenAI

client = OpenAI()

CONSTITUTION = [
    "Never provide harmful or illegal instructions",
    "Acknowledge uncertainty — don't fabricate facts",
    "Respect privacy — don't request or reveal PII",
]

def constitutional_generate(prompt: str, max_revisions: int = 2) -> str:
    # 1. Generate an initial draft.
    draft = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content

    principles = "\n".join(f"{i+1}. {p}" for i, p in enumerate(CONSTITUTION))

    for _ in range(max_revisions):
        # 2. Critique the draft against the constitution. Instructing the model
        #    to reply "no violation" when clean makes the string check below reliable.
        critique = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": f"Principles:\n{principles}\n\nCritique this for violations. Reply 'no violation' if the text is clean:\n{draft}"}],
        ).choices[0].message.content
        if "no violation" in critique.lower():
            break
        # 3. Revise the draft based on the critique, then loop to re-check it.
        draft = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": f"Revise based on critique:\n{critique}\n\nOriginal:\n{draft}"}],
        ).choices[0].message.content
    return draft