Iterated Amplification & AI Safety via Debate

Scalable oversight through recursive techniques

⏱️ 10 hours · Advanced

Scalable Oversight Techniques

As AI systems become more capable, we need oversight methods that scale beyond human ability to directly evaluate outputs.

Iterated Amplification (IDA)

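Core Idea: Use AI assistance to amplify a human overseer, so the human-plus-AI team can evaluate tasks the human could not evaluate alone
Process: The amplified overseer (human plus AI assistants) supervises the training of a new AI, which is distilled to imitate the overseer's answers
Iteration: Each trained generation becomes the assistant that helps train the next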
Goal: Maintain alignment while increasing capability
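
The amplify-then-distill loop can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the "task" is summing a tuple of integers, the hypothetical `human` can only add two numbers at a time, and "training" is replaced by memorising answers in a lookup table. It is not how a real IDA system is implemented.

```python
from typing import Callable, Dict, List, Optional, Tuple

Question = Tuple[int, ...]                    # "what is the sum of these numbers?"
Model = Callable[[Question], Optional[int]]   # returns an answer, or None if it has none


def human(q: Question) -> int:
    """Weak overseer (hypothetical): can only add up to two numbers directly."""
    if len(q) > 2:
        raise ValueError("too hard for the unaided human")
    return sum(q)


def amplify(q: Question, model: Model) -> Optional[int]:
    """Human plus model: split the question, ask the model for each half, combine."""
    if len(q) <= 2:
        return human(q)
    mid = len(q) // 2
    left, right = model(q[:mid]), model(q[mid:])
    if left is None or right is None:
        return None                  # the current model cannot help yet
    return human((left, right))      # the human only performs the final easy step


def distill(questions: List[Question], model: Model) -> Dict[Question, int]:
    """Stand-in for training: memorise the amplified overseer's answers."""
    table: Dict[Question, int] = {}
    for q in questions:
        answer = amplify(q, model)
        if answer is not None:
            table[q] = answer
    return table


def run_ida(base: Question, rounds: int) -> Model:
    """Run amplify-then-distill for `rounds` generations, starting from an empty model."""
    # Training questions: every contiguous slice of the base question.
    questions = [base[i:j] for i in range(len(base)) for j in range(i + 1, len(base) + 1)]
    table: Dict[Question, int] = {}
    for _ in range(rounds):
        table = distill(questions, table.get)   # the previous generation assists
    return table.get


base = (3, 1, 4, 1, 5, 9, 2, 6)
model = run_ida(base, rounds=3)
print(model(base))
```

In this toy, each generation can answer questions roughly twice as long as the previous one, even though the human never adds more than two numbers at once. Whether real training preserves the overseer's behavior as faithfully as a lookup table does is exactly what the sketch leaves out.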

AI Safety via Debate

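Adversarial Setup: Two AIs argue for opposing answers to a question, and a human judges which argument is more convincing
Truth-Seeking: Each debater is rewarded for winning, which gives it an incentive to expose the opponent's errors
Scalability: Judging a debate is meant to be easier than answering the question directly, so humans may be able to judge debates on topics too complex to evaluate unaided
Limitations: Assumes the truth has a natural advantage, i.e. that honest arguments are easier to defend than deceptive ones before a human judge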
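
A toy sketch can show why a judge with limited capacity might still settle a dispute. The setup is an assumption chosen for illustration: one debater claims a total for a list of numbers, the other challenges it, and the judge is only able to check arithmetic on two numbers or look at a single list element. The names `split_claim`, `honest_challenge`, and `judge_debate` are hypothetical.

```python
from typing import Callable, List, Tuple


def split_claim(segment: List[int], claimed: int) -> Tuple[int, int]:
    """Claimant's move: break a claimed total into two sub-claims that add up.

    This claimant reports the true left subtotal and assigns the remainder
    to the right half, so any error in `claimed` is hidden in the right half.
    """
    left = sum(segment[: len(segment) // 2])
    return left, claimed - left


def honest_challenge(segment: List[int], left_claim: int, right_claim: int) -> str:
    """Challenger's move: point at the half whose sub-claim is false."""
    true_left = sum(segment[: len(segment) // 2])
    return "left" if left_claim != true_left else "right"


def judge_debate(
    xs: List[int],
    claimed: int,
    split: Callable[[List[int], int], Tuple[int, int]],
    challenge: Callable[[List[int], int, int], str],
) -> bool:
    """Return True if the claim survives the debate, False if the challenger wins.

    Assumes `xs` is non-empty. The judge never sums the whole list.
    """
    lo, hi = 0, len(xs)
    while hi - lo > 1:
        left_claim, right_claim = split(xs[lo:hi], claimed)
        if left_claim + right_claim != claimed:
            return False                      # inconsistent sub-claims lose at once
        mid = lo + (hi - lo) // 2
        if challenge(xs[lo:hi], left_claim, right_claim) == "left":
            hi, claimed = mid, left_claim     # debate narrows to the left half
        else:
            lo, claimed = mid, right_claim    # debate narrows to the right half
    return xs[lo] == claimed                  # the judge's single direct check


xs = [3, 1, 4, 1, 5, 9, 2, 6]
print(judge_debate(xs, 31, split_claim, honest_challenge))  # true claim
print(judge_debate(xs, 32, split_claim, honest_challenge))  # false claim
```

Here a false total cannot survive, because the claimant must commit to sub-claims and the challenger can always follow the error down to one element. That property is built into the toy. In open-ended debates with a human judge, it is the assumption listed under Limitations rather than a guarantee.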

Recursive Reward Modeling

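Use AI assistants, trained with earlier reward models, to help humans evaluate AI behavior
Break complex tasks into simpler pieces that are easier to evaluate
Maintain human oversight at each level
Aim to scale oversight to superhuman performance without losing human control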
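
The decomposition idea can be sketched in a few lines. This is only an illustration under strong assumptions: the "reward model" is a lookup table of human labels rather than a trained network, the pieces are short arithmetic statements, and the function names `train_reward_model` and `assisted_human_score` are hypothetical. It shows the evaluation structure, not the training of agents at each level.

```python
from typing import Callable, Dict, List, Tuple

RewardModel = Callable[[str], float]


def train_reward_model(human_labels: Dict[str, float]) -> RewardModel:
    """Level-0 reward model: a stand-in that memorises human labels for simple pieces."""
    table = dict(human_labels)
    # Pieces the human never labelled receive no reward, so they get flagged later.
    return lambda piece: table.get(piece, 0.0)


def assisted_human_score(
    pieces: List[str], reward_model: RewardModel, threshold: float = 0.5
) -> Tuple[float, List[str]]:
    """Level-1 evaluation: score a complex output from the scores of its pieces.

    Returns the overall score and the pieces a human should inspect directly.
    Assumes `pieces` is non-empty.
    """
    scores = {piece: reward_model(piece) for piece in pieces}
    flagged = [piece for piece, score in scores.items() if score < threshold]
    overall = min(scores.values())   # an output is only as good as its weakest piece
    return overall, flagged


# Human labels on simple pieces (illustrative data only).
reward_model = train_reward_model({"2+2=4": 1.0, "3*3=9": 1.0, "3*3=6": 0.0})

overall, flagged = assisted_human_score(["2+2=4", "3*3=6"], reward_model)
print(overall, flagged)
```

The human stays in the loop in two places: the labels that define the level-0 model, and the review of whatever the assisted evaluation flags. Taking the minimum over pieces is one possible aggregation choice, not part of the technique itself.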

Challenges and Open Questions

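Preserving alignment through many rounds of amplification, where small errors may compound
Detecting manipulation of the human judge in debates
Computational cost of repeated amplification rounds and long debates
Philosophical questions about truth and human judgment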
