Foundational AI Safety Papers
Essential papers that established the field of AI safety research
Foundational Papers
Essential papers that every AI safety researcher should understand deeply. These papers form the technical and conceptual foundation of the field.
1. "Attention Is All You Need" (Vaswani et al., 2017)
Why: Understanding transformers is non-negotiable for modern AI safety work
Key concepts: Attention mechanisms, model architecture, scaling properties
Safety relevance: Interpretability, alignment techniques, capability understanding
Visit the following resources to learn more:
- @article@Attention Is All You Need (Original Paper)
- @article@The Illustrated Transformer (Visual Guide)
- @video@Transformer Neural Networks Explained (YouTube)
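The core operation the paper introduces, scaled dot-product attention, fits in a few lines. The sketch below is a minimal illustrative NumPy version (not the paper's full multi-head implementation): each query is compared to every key, the scores are softmax-normalized, and the values are mixed accordingly.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, as in Vaswani et al. (2017)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # query-key similarities
    scores -= scores.max(axis=-1, keepdims=True)  # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V, weights

# Toy example: 3 tokens with d_k = 4
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 4): one output vector per token
```

The attention weights `w` are exactly what many interpretability methods inspect, which is one reason this mechanism matters for safety research.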
2. "Concrete Problems in AI Safety" (Amodei et al., 2016)
Why: Still the clearest articulation of the core technical safety challenges
Key concepts: Reward misspecification, safe exploration, robustness, interpretability, distributional shift
Safety relevance: Defines the problem space that most current work addresses
Visit the following resources to learn more:
- @article@Concrete Problems in AI Safety (Original Paper)
- @article@DeepMind's Summary of Concrete Problems
- @video@Concrete Problems in AI Safety Explained
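Reward misspecification, the first problem on the list, is easy to show concretely. The toy example below is an assumption of this note (not taken from the paper): the designer intends the agent to reach a goal quickly but rewards a proxy, "distance covered", so an agent that loops forever outscores one that goes straight to the goal.

```python
def proxy_reward(path):
    """Misspecified objective: rewards total steps taken (distance covered)."""
    return len(path)

def true_reward(path, goal=5):
    """Intended objective: reach the goal in as few steps as possible."""
    return 10 - len(path) if path and path[-1] == goal else 0

direct = [1, 2, 3, 4, 5]            # goes straight to the goal
looping = [1, 2, 1, 2, 1, 2, 1, 2]  # wanders and never arrives

print(proxy_reward(looping) > proxy_reward(direct))  # True: proxy prefers looping
print(true_reward(direct) > true_reward(looping))    # True: intent prefers direct
```

Optimizing the proxy hard enough reliably surfaces the gap between it and the designer's intent, which is the paper's central worry.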
3. "Training Language Models to Follow Instructions with Human Feedback" (Ouyang et al., 2022)
Why: RLHF is currently the dominant alignment technique in deployment
Key concepts: Human preference learning, reward modeling, policy optimization
Safety relevance: Practical alignment implementation, current best practices
Visit the following resources to learn more:
- @article@InstructGPT Paper (Original)
- @article@OpenAI's Blog Post on InstructGPT
- @article@Understanding RLHF (Hugging Face)
- @video@RLHF Explained Simply
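The reward-modeling step of RLHF trains on pairwise human preferences: given a chosen and a rejected response, the reward model is pushed to score the chosen one higher. A minimal sketch of that pairwise loss (the Bradley-Terry-style objective used by Ouyang et al., with scalar rewards standing in for full model outputs):

```python
import math

def reward_model_loss(r_chosen, r_rejected):
    """-log sigmoid(r_chosen - r_rejected): small when the reward model
    already prefers the human-chosen response by a wide margin."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# The loss shrinks as the margin for the preferred response grows.
print(reward_model_loss(2.0, 0.0))  # small: model agrees with the human
print(reward_model_loss(0.0, 2.0))  # large: model disagrees
```

The trained reward model then supplies the reward signal for the policy-optimization stage (PPO in the paper).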
Additional Essential Papers
Consider also reading: