Foundational AI Safety Papers

Essential papers that established the field of AI safety research

⏱️ 4 hours · Beginner

Foundational Papers

Essential papers that every AI safety researcher should understand deeply. These papers form the technical and conceptual foundation of the field.

1. "Attention Is All You Need" (Vaswani et al., 2017)

Why: Understanding transformers is non-negotiable for modern AI safety
Key concepts: Attention mechanisms, model architecture, scaling properties
Safety relevance: Interpretability, alignment techniques, capability understanding
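To make the "attention mechanisms" concept concrete, here is a minimal NumPy sketch of scaled dot-product attention, the core operation the paper introduces (a toy illustration, not the paper's full multi-head implementation):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V, the core transformer operation."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # similarity of each query to each key
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                         # each output is a weighted mix of values

# Toy example: 3 tokens, head dimension 4
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 4): one output vector per query token
```

The attention weights are what many interpretability methods inspect, which is one reason understanding this operation matters for safety work.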


2. "Concrete Problems in AI Safety" (Amodei et al., 2016)

Why: Still the clearest articulation of the core technical safety challenges
Key concepts: Reward misspecification, safe exploration, robustness, interpretability, distributional shift
Safety relevance: Defines the problem space that most current work addresses


3. "Training Language Models to Follow Instructions with Human Feedback" (Ouyang et al., 2022)

Why: RLHF is currently the dominant alignment technique in deployment
Key concepts: Human preference learning, reward modeling, policy optimization
Safety relevance: Practical alignment implementation, current best practices
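The "reward modeling" step can be illustrated with the pairwise preference loss used to train reward models from human comparisons (a simplified sketch with scalar rewards, not the paper's full training pipeline):

```python
import numpy as np

def preference_loss(r_chosen, r_rejected):
    """Pairwise preference loss for reward modeling:
    -log sigmoid(r_chosen - r_rejected), averaged over comparison pairs.
    Training pushes the reward model to score human-preferred answers higher."""
    diff = np.asarray(r_chosen, dtype=float) - np.asarray(r_rejected, dtype=float)
    # log1p(exp(-x)) is a numerically stable form of -log(sigmoid(x))
    return float(np.mean(np.log1p(np.exp(-diff))))

# If the model already ranks the preferred answer higher, the loss is small;
# if it scores both answers equally, the loss is log(2).
print(preference_loss([2.0], [0.0]))  # ≈ 0.127
print(preference_loss([1.0], [1.0]))  # ≈ 0.693 (= log 2)
```

The trained reward model then provides the reward signal for the policy-optimization stage of RLHF.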


