Mesa-Optimization & Inner Alignment

Understanding optimizers within optimizers

⏱️ 10 hoursIntermediate

Mesa-Optimization and Inner Alignment

Mesa-optimization occurs when a learned model itself becomes an optimizer pursuing objectives that may differ from the training objective.

Core Concepts

Base Optimizer: The training process (e.g., SGD)
Mesa-Optimizer: An optimizer that emerges within the learned model
Base Objective: What we train the model to do
Mesa-Objective: What the internal optimizer actually pursues

Why Mesa-Optimization Matters

Models may pursue goals different from what we intended
Mesa-objectives can be misaligned with base objectives
Difficult to detect during training
May lead to deceptive alignment

Examples and Scenarios

Evolution as mesa-optimizer (humans vs inclusive fitness)
RL agents developing internal planning
Language models simulating goal-directed agents
Gradient hacking possibilities

Detection and Mitigation

Transparency and interpretability research
Behavioral testing across distributions
Architectural choices to prevent mesa-optimization
Training process modifications

← Back to Module

Loading...

⚡Pre-rendered at build time (instant load)