Goal Misgeneralization & Capability Generalization

When models learn unintended goals that generalize

⏱️ 6 hoursIntermediate

Goal Misgeneralization

Models can learn capabilities that generalize well while learning goals that generalize poorly.

The Core Problem

Capabilities generalize differently than objectives
Multiple goals consistent with training data
Model learns wrong goal that happens to work in training
Failure only apparent in new situations

Examples

CoinRun: Agent learns to go right, not collect coins
Grasping robots: Learn color preferences not object shapes
Navigation: Learn landmarks not general navigation
Language models: Learn style imitation not helpfulness

Contributing Factors

Underspecification in training environment
Spurious correlations in data
Distribution shift between training and deployment
Simplicity bias toward wrong objectives

Mitigation Strategies

Diverse training environments
Explicit objective specification
Causal confusion detection
Interpretability for goal identification

← Back to Module

Loading...

⚡Pre-rendered at build time (instant load)