Iterated Amplification & AI Safety via Debate
Scalable oversight through recursive techniques
⏱️ 10 hours · Advanced
Scalable Oversight Techniques
As AI systems become more capable, we need oversight methods that scale beyond human ability to directly evaluate outputs.
Iterated Amplification (IDA)
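- Core Idea: Use AI assistance to amplify a human overseer's ability to evaluate and supervise
- Process: A human working together with the current AI system (amplification) supervises the training of a new AI, which learns to imitate that combined system (distillation)
- Iteration: Each generation of AI helps train the next
- Goal: Maintain alignment while increasing capability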
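The amplify-then-distill loop described above can be sketched with a toy task. This is a minimal illustration under stated assumptions, not an implementation of the published method: the task (summing a list), the names `human_combine`, `base_model`, `amplify`, `distill`, and `iterate` are all hypothetical, and `distill` is a stand-in that simply wraps the amplified system instead of training a new model on its outputs.

```python
from typing import Callable, Sequence

Model = Callable[[Sequence[int]], int]


def human_combine(a: int, b: int) -> int:
    """The overseer can only perform one simple step: adding two numbers."""
    return a + b


def base_model(xs: Sequence[int]) -> int:
    """Initial weak model: only answers for lists of length 0 or 1."""
    if len(xs) > 1:
        raise ValueError("input too long for this model")
    return xs[0] if xs else 0


def amplify(model: Model) -> Model:
    """Human + current model: split the task, delegate halves, combine by hand."""
    def amplified(xs: Sequence[int]) -> int:
        if len(xs) <= 1:
            return model(xs)
        mid = len(xs) // 2
        return human_combine(model(xs[:mid]), model(xs[mid:]))
    return amplified


def distill(amplified: Model) -> Model:
    """Assumption: stands in for training a fast model to imitate `amplified`."""
    return lambda xs: amplified(xs)


def iterate(rounds: int) -> Model:
    """Run `rounds` generations; each generation helps build the next."""
    model = base_model
    for _ in range(rounds):
        model = distill(amplify(model))
    return model
```

In this toy, each round doubles the list length the model can handle, so `iterate(3)` can sum eight numbers even though the overseer only ever adds two at a time. A real distillation step is lossy, which is where the question of preserving alignment through amplification arises.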
AI Safety via Debate
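- Adversarial Setup: Two AI systems argue opposing sides of a question, and a human judges which is more convincing
- Truth-Seeking: Each debater has an incentive to expose the opponent's errors
- Scalability: Humans may be able to judge debates on topics too complex for them to evaluate directly
- Limitations: Assumes the truth has a natural advantage, i.e. that a lie is harder to defend than to refute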
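A toy version of the debate setup can make the incentive structure concrete. The sketch below is an assumption-laden illustration, not the protocol from any paper: two debaters disagree about the sum of a list, the judge repeatedly narrows the dispute to the half where their claims differ, and finally checks a single element directly. The names `honest`, `make_liar`, and `debate` are hypothetical, and the truth wins here by construction because the judge can verify one element.

```python
from typing import Callable, List, Tuple

# A debater maps (xs, lo, hi) to its claimed value of sum(xs[lo:hi]).
Debater = Callable[[List[int], int, int], int]


def honest(xs: List[int], lo: int, hi: int) -> int:
    return sum(xs[lo:hi])


def make_liar(target: int) -> Debater:
    """Dishonest debater: inflates any range containing index `target` by 1."""
    def liar(xs: List[int], lo: int, hi: int) -> int:
        return sum(xs[lo:hi]) + (1 if lo <= target < hi else 0)
    return liar


def debate(xs: List[int], a: Debater, b: Debater) -> Tuple[str, int]:
    """Return (winner, index the judge inspected), with -1 if none was inspected."""
    if not xs:
        raise ValueError("nothing to debate")
    lo, hi = 0, len(xs)
    if a(xs, lo, hi) == b(xs, lo, hi):
        return ("agree", -1)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        # A debater whose sub-claims contradict its own total loses at once.
        for d, opponent in ((a, "B"), (b, "A")):
            if d(xs, lo, mid) + d(xs, mid, hi) != d(xs, lo, hi):
                return (opponent, -1)
        # Both are self-consistent, so they must disagree on at least one half.
        if a(xs, lo, mid) != b(xs, lo, mid):
            hi = mid
        else:
            lo = mid
    # The dispute is now about one element, which the judge can check directly.
    truth = xs[lo]
    if a(xs, lo, hi) == truth:
        return ("A", lo)
    if b(xs, lo, hi) == truth:
        return ("B", lo)
    return ("neither", lo)
```

For example, `debate(list(range(16)), honest, make_liar(5))` narrows the dispute to index 5, where the judge's single check exposes the inflated claim. The judge never sums the list, which is the sense in which debate is meant to scale; whether real questions decompose this cleanly is the open limitation.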
Recursive Reward Modeling
- Train AI assistants that help humans evaluate the behavior of more capable AI
- Break complex evaluation tasks into simpler pieces
- Maintain human oversight at each level
- Goal: Scale to superhuman performance safely
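The recursive structure above can be sketched as a stack of evaluators, each built from the one below plus a small human check. This is a hedged toy under loud assumptions: the task (rewarding an agent's output only if it is sorted) and the names `human_check_pair`, `level0`, and `lift` are hypothetical, and the helper evaluators are hand-written here, whereas in recursive reward modeling they would be trained agents.

```python
from typing import Callable, Sequence

Evaluator = Callable[[Sequence[int]], float]


def human_check_pair(a: int, b: int) -> bool:
    """The only judgment the human makes directly: is this pair in order?"""
    return a <= b


def level0(xs: Sequence[int]) -> float:
    """Direct human evaluation, feasible only for outputs of length <= 2."""
    if len(xs) > 2:
        raise ValueError("too long for direct human evaluation")
    ok = all(human_check_pair(a, b) for a, b in zip(xs, xs[1:]))
    return 1.0 if ok else 0.0


def lift(helper: Evaluator) -> Evaluator:
    """Build a higher-level evaluator from a helper plus one human check."""
    def evaluator(xs: Sequence[int]) -> float:
        if len(xs) <= 2:
            return helper(xs)
        mid = len(xs) // 2
        # Delegate each half to the helper from the level below.
        left, right = helper(xs[:mid]), helper(xs[mid:])
        # The human stays in the loop by checking the boundary between halves.
        boundary = 1.0 if human_check_pair(xs[mid - 1], xs[mid]) else 0.0
        return min(left, right, boundary)
    return evaluator
```

With two levels, `lift(lift(level0))` assigns a reward to outputs of up to eight elements, although the human only ever compares two numbers at a time. Errors in a learned helper would propagate upward in a real system, which this hand-written toy does not model.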
Challenges and Open Questions
- Preserving alignment through amplification
- Detecting manipulation in debates
- Computational cost of running these recursive procedures
- Philosophical questions about truth and judgment