Iterated Amplification & AI Safety via Debate

Scalable oversight through recursive techniques

⏱️ 10 hours · Advanced

Scalable Oversight Techniques

As AI systems become more capable, we need oversight methods that scale beyond human ability to directly evaluate outputs.

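Iterated Amplification (IDA, short for Iterated Distillation and Amplification)

  • Core Idea: Use AI assistance to amplify a human overseer, so the human plus assistants can evaluate more than the human could alone
  • Process: The amplified overseer (human + current AI system) supervises the training of a new AI, which is distilled to reproduce the overseer's behavior more cheaply
  • Iteration: Each generation becomes the assistant that helps train the next
  • Goal: Maintain alignment at every step while capability increases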
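The amplify/distill loop above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the "model" is a lookup table, the task is summing tuples of integers, the "human" only knows how to split a question in half and add two sub-answers, and the names `amplify`, `distill`, and `iterate` are hypothetical. It is not how a real IDA system is trained; it only shows how each round lets the overseer answer questions the previous model could not.

```python
from typing import Dict, List, Optional, Tuple

Question = Tuple[int, ...]      # "what is the sum of these integers?"
Model = Dict[Question, int]     # toy model: a table of questions it can answer


def amplify(model: Model, q: Question) -> Optional[int]:
    """Human + model: the human splits q, the model answers the halves,
    and the human combines the two sub-answers."""
    if q in model:
        return model[q]
    if len(q) < 2:
        return None             # nothing to decompose and the model cannot help
    mid = len(q) // 2
    left, right = model.get(q[:mid]), model.get(q[mid:])
    if left is None or right is None:
        return None             # the current model is too weak for this question
    return left + right


def distill(model: Model, questions: List[Question]) -> Model:
    """Train the next model to reproduce the amplified overseer's answers
    (here, 'training' is just recording them in a new table)."""
    new_model = dict(model)
    for q in questions:
        answer = amplify(model, q)
        if answer is not None:
            new_model[q] = answer
    return new_model


def iterate(base: Model, questions: List[Question], rounds: int) -> Model:
    """Repeat amplification followed by distillation for a number of rounds."""
    model = base
    for _ in range(rounds):
        model = distill(model, questions)
    return model


if __name__ == "__main__":
    xs = (3, 1, 4, 1, 5, 9, 2, 6)
    # Training questions: every contiguous slice of xs.
    questions = [xs[i:j] for i in range(len(xs)) for j in range(i + 1, len(xs) + 1)]
    base = {(x,): x for x in xs}          # the base model only knows single numbers
    print(xs in iterate(base, questions, 2))   # False: slices up to length 4 so far
    print(iterate(base, questions, 3)[xs])     # 31
```

In this toy the answerable question length doubles each round, because the overseer at round k combines two answers from the model of round k-1. The sketch leaves out the hard part named in the goal: a learned model imitates the overseer imperfectly, and those errors can compound across rounds.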

AI Safety via Debate

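  • Adversarial Setup: Two AIs argue for competing answers and a human judges which one argued better
  • Truth-Seeking: Each debater is rewarded for winning, which gives it an incentive to expose its opponent's errors
  • Scalability: The hope is that judging a debate is easier than evaluating an answer directly, so humans can judge debates on topics too complex for them to assess unaided
  • Limitations: Assumes truth has a natural advantage, i.e. that it is harder to lie convincingly than to refute a lie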
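A toy sketch may make the incentive structure concrete. The setup is an assumption chosen for illustration: two debaters disagree about the sum of a list, the judge can only add two numbers and look at a single list element, and the debate narrows by bisection to one element the judge can check. The names `honest_split`, `dishonest_split`, and `run_debate` are hypothetical, and real debate proposals use natural-language arguments rather than this protocol.

```python
from typing import Callable, List, Tuple

# Given the list, a span [lo, hi), and the total the debater is committed to
# for that span, a debater states its claimed sums for the two halves.
Splitter = Callable[[List[int], int, int, int], Tuple[int, int]]


def honest_split(xs: List[int], lo: int, hi: int, claim: int) -> Tuple[int, int]:
    """Report the true sums of both halves."""
    mid = (lo + hi) // 2
    return sum(xs[lo:mid]), sum(xs[mid:hi])


def dishonest_split(xs: List[int], lo: int, hi: int, claim: int) -> Tuple[int, int]:
    """Stay consistent with a false total by hiding the error in the right half."""
    mid = (lo + hi) // 2
    left = sum(xs[lo:mid])
    return left, claim - left


def run_debate(xs: List[int], claim_a: int, split_a: Splitter,
               claim_b: int, split_b: Splitter) -> str:
    """Judge a debate about sum(xs); returns "A", "B", or "neither"."""
    if not xs or claim_a == claim_b:
        raise ValueError("need a non-empty list and two different claims")
    lo, hi = 0, len(xs)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        left_a, right_a = split_a(xs, lo, hi, claim_a)
        left_b, right_b = split_b(xs, lo, hi, claim_b)
        # The judge can add two numbers: an inconsistent debater loses at once.
        if left_a + right_a != claim_a:
            return "B"
        if left_b + right_b != claim_b:
            return "A"
        # Follow whichever half the debaters still disagree about.
        if left_a != left_b:
            hi, claim_a, claim_b = mid, left_a, left_b
        else:
            lo, claim_a, claim_b = mid, right_a, right_b
    # One element left: the judge can finally check the claims directly.
    if claim_a == xs[lo]:
        return "A"
    if claim_b == xs[lo]:
        return "B"
    return "neither"


if __name__ == "__main__":
    xs = [3, 1, 4, 1, 5, 9, 2, 6]
    print(run_debate(xs, 31, honest_split, 40, dishonest_split))  # A
```

The judge never sums the whole list; it performs a few additions and one lookup, yet the debater with the true total wins because a false total has to hide its error somewhere the opponent can point to. That property is built into this toy. Whether it carries over to open-ended questions and fallible human judges is exactly the assumption listed under Limitations.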

Recursive Reward Modeling

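  • Use AI assistants to help humans evaluate AI behavior, and train a reward model on those evaluations
  • Break complex evaluation tasks into simpler pieces that an assistant or a human can judge
  • Maintain human oversight at each level of the recursion
  • Goal: scale to superhuman performance safely, with the reward signal still grounded in human judgment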
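The recursive structure of the evaluation can be sketched as follows. This is only an illustration of the decomposition, with no learning involved: the `Task` type, the `evaluate` function, the stand-in `human_check`, and the choice to aggregate helper scores with `min` are all assumptions, not part of any published method.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Task:
    name: str
    output: str = ""                                   # work product for a leaf task
    subtasks: List["Task"] = field(default_factory=list)


# Stand-in for a human scoring a simple task in [0, 1].
HumanCheck = Callable[[Task], float]


def evaluate(task: Task, human_check: HumanCheck) -> float:
    """Reward for a task: a direct human score on simple tasks, and a
    human-chosen aggregation of helper evaluations on composite ones."""
    if not task.subtasks:
        return human_check(task)
    # Helpers evaluate the simpler pieces using the same procedure.
    helper_scores = [evaluate(sub, human_check) for sub in task.subtasks]
    # Oversight at this level (assumed rule): the weakest part bounds the whole.
    return min(helper_scores)


if __name__ == "__main__":
    def human_check(task: Task) -> float:
        # Hypothetical check: a section is bad if it contains a flagged error.
        return 0.0 if "ERROR" in task.output else 1.0

    report = Task("report", subtasks=[
        Task("intro", output="clear summary"),
        Task("analysis", subtasks=[
            Task("data", output="ERROR: wrong units"),
            Task("method", output="sound"),
        ]),
    ])
    print(evaluate(report, human_check))  # 0.0
```

In a real system the helpers would themselves be agents trained with reward models from the level below, so mistakes in their evaluations could propagate upward. The sketch assumes perfect helpers in order to show only the shape of the recursion.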

Challenges and Open Questions

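  • Preserving alignment through amplification: small errors introduced at each step may compound across iterations
  • Detecting manipulation in debates: a persuasive debater may exploit the judge's biases instead of arguing for the truth
  • Computational cost: decomposition and debate multiply the number of model calls and human judgments needed per training signal
  • Philosophical questions about truth and judgment, such as what makes a human judge's verdict trustworthy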