
Constitutional AI: Harmlessness from AI Feedback

Anthropic · arXiv (2022)

Responsibility Localization Shift

Watch how claims shift from agent-level framing ('models are dangerous') to boundary-level framing ('governance boundaries must be enforced'). This reframing preserves the underlying concerns while clarifying where accountability actually lives.

Core Claims

These are the load-bearing claims that structure the paper's argument:

Claim A
Foundational · Importance: 10/10

Constitutional AI can train AI systems to be harmless without human labels identifying harmful outputs, relying instead on a set of guiding principles (a 'constitution').

Claim B
Foundational · Importance: 9/10

The method involves a two-stage process: a supervised learning phase, in which the model critiques and revises its own responses against constitutional principles and is fine-tuned on the revisions, followed by a reinforcement learning phase trained on AI-generated preference labels (RLAIF) to further improve harmlessness.
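The supervised stage described above can be sketched as a critique-and-revise loop. This is an illustrative toy, not the paper's implementation: `model` is a stand-in for an LLM call, and the two principles shown are hypothetical examples rather than the paper's actual constitution.

```python
# Illustrative principles -- NOT the paper's actual constitution.
CONSTITUTION = [
    "Choose the response that is least harmful.",
    "Rewrite the response to remove assistance with dangerous activities.",
]

def model(prompt: str) -> str:
    """Toy stand-in for sampling from a helpful-only language model."""
    return f"[model output for: {prompt[:40]}]"

def critique_and_revise(response: str, principle: str) -> str:
    """One round: the model critiques its own response against a
    constitutional principle, then rewrites it per that critique."""
    critique = model(f"Critique this per '{principle}': {response}")
    return model(f"Revise per critique '{critique}': {response}")

def build_sl_dataset(prompts: list[str]) -> list[tuple[str, str]]:
    """Supervised (SL-CAI) stage: final revisions become the
    fine-tuning targets paired with the original prompts."""
    dataset = []
    for p in prompts:
        r = model(p)
        for principle in CONSTITUTION:  # revise repeatedly, one principle each
            r = critique_and_revise(r, principle)
        dataset.append((p, r))
    return dataset

pairs = build_sl_dataset(["How do I pick a lock?"])
```

In a real pipeline the `(prompt, revision)` pairs would fine-tune the base model before the RL phase begins.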

Claim C
Downstream · Importance: 8/10

When judging harmlessness, crowdworkers prefer responses from systems trained with Constitutional AI over responses from systems trained with traditional human-feedback methods.

Claim D
Supporting · Importance: 7/10

Chain-of-thought reasoning enhances AI's ability to evaluate and improve its own responses, leading to better harmlessness and transparency.
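The chain-of-thought evaluation step can be sketched as follows. The sketch assumes a hypothetical `ask_model` callable standing in for the feedback model, and simplifies label extraction to scanning the final line of the reasoning; the paper's actual procedure works with token probabilities over the answer choices.

```python
from typing import Callable

def chain_of_thought_preference(
    prompt: str,
    resp_a: str,
    resp_b: str,
    ask_model: Callable[[str], str],
) -> str:
    """Ask the feedback model to reason step by step about which of two
    responses is more harmless, then read its final choice ("A" or "B")."""
    query = (
        f"Consider this exchange:\nHuman: {prompt}\n"
        f"(A) {resp_a}\n(B) {resp_b}\n"
        "Which response is more harmless? Think step by step, "
        "then end with (A) or (B)."
    )
    reasoning = ask_model(query)
    # Simplified: take the label from the last line of the reasoning.
    return "A" if "(A)" in reasoning.splitlines()[-1] else "B"

# Toy stand-in feedback model with a fixed chain of thought.
choice = chain_of_thought_preference(
    "How do I hotwire a car?",
    "Here are the steps...",
    "I can't help with that.",
    lambda q: "Response (A) assists a harmful act.\nThe safer answer is (B).",
)
```

The resulting preference labels train a preference model that then serves as the reward signal for the RL phase.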

Claim E
Supporting · Importance: 6/10

The approach reduces the need for human feedback, potentially making AI supervision more scalable, but it also risks automating and obscuring decision-making processes.

⚠️ Risk Signal

The claim that AI can effectively self-supervise for harmlessness without human feedback sits at a known disagreement boundary in debates over AI ethics and safety.