Mechanistic Interpretability Practice

Reverse engineer neural network behaviors

⏱️ 12 hoursAdvanced

Mechanistic Interpretability Practice

Learning Objectives

Understand what mechanistic interpretability is and why it matters for AI safety
Learn core techniques: activation analysis, circuit discovery, and feature visualization
Practice reverse engineering simple neural network behaviors
Apply sparse autoencoders (SAEs) to decompose model representations
Analyze transformer circuits and attention patterns in real models

Mechanistic interpretability is the art and science of reverse engineering neural networks - understanding not just what they do, but how they do it. Think of it as neuroscience for artificial minds: we open up the black box, trace the circuits, and map out the algorithms that emerge from billions of parameters.

This field, pioneered by researchers like Chris Olah at Anthropic and advanced by teams at Google DeepMind under Neel Nanda, represents one of our best hopes for understanding AI systems before they become too complex to comprehend. Unlike other interpretability methods that treat models as black boxes, mechanistic interpretability dives deep into the weights and activations to understand the actual computations being performed.

For AI safety, this matters enormously. If we can understand how models work internally, we can:

Detect deceptive behaviors before deployment
Identify when models are learning dangerous capabilities
Build better oversight tools for advanced AI systems
Design architectures that are inherently more interpretable

Core Concepts

1. Circuits and Features

Neural networks learn to implement algorithms through circuits - connected pathways of neurons that perform specific computations. These circuits encode features - meaningful patterns in the data that the model has learned to detect and use.

Key insight: Models don't just memorize; they develop sophisticated algorithms. A vision model doesn't store pictures of cats - it learns edge detectors, texture analyzers, and shape recognizers that combine to identify cats.

Practical approach:

Start with toy models where you can trace every connection
Use activation maximization to find what makes neurons fire
Look for motifs - repeated circuit patterns across different models
Map out information flow from input to output

Real example: InceptionV1 contains dedicated neurons for detecting curves, car wheels, and even specific dog breeds. These aren't programmed - they emerge from training.

2. Superposition and Polysemanticity

Models face a fundamental challenge: they need to represent more features than they have neurons. The solution? Superposition - encoding multiple features in overlapping ways across neurons.

This leads to polysemanticity - individual neurons responding to multiple, seemingly unrelated concepts. A single neuron might activate for both "academia" and "the color purple" because the model rarely needs both simultaneously.

Why this matters:

Makes interpretation much harder
Suggests models have vast hidden capabilities
Critical for understanding deception - features could hide in superposition
May explain sudden capability jumps during training

Current techniques:

Sparse autoencoders (SAEs) to decompose superposed representations
Dictionary learning to find the "true" features
Analyzing feature interactions and interference patterns

3. Transformer Mechanics

Modern language models are built on transformers, which have specific architectural properties we can exploit for interpretation:

Attention heads can be understood as:

Information routing systems
Pattern matchers (like "find the subject of this sentence")
Copy mechanisms ("move this token forward")
Induction heads that implement in-context learning

MLP layers typically:

Store factual knowledge
Implement more complex computations
Act as memory banks that attention can query

Residual streams create:

Information highways through the network
Spaces where different components can read/write
Compositional computation through vector addition

Practical investigation:

# Analyze attention patterns
attention_pattern = model.get_attention_weights(input_ids)
# Look for interpretable heads
copy_heads = find_copy_patterns(attention_pattern)
induction_heads = find_induction_patterns(attention_pattern)

4. Probing and Intervention

Understanding requires both observation and experimentation:

Probing techniques:

Linear probes to check if information is linearly decodable
Causal probes that intervene to test necessity
Logit lens to see predictions at each layer
Activation patching to trace information flow

Intervention methods:

Ablate components to test their function
Activation steering to control model behavior
Surgical edits to modify specific capabilities
Circuit breaking to disable unwanted behaviors

Example workflow:

Identify interesting behavior (e.g., "model can do modular arithmetic")
Find which layers/heads are active during this task
Ablate components systematically
Trace critical paths through the network
Build minimal circuit that reproduces behavior

Common Pitfalls

1. Confirmation Bias in Circuit Discovery

Problem: Finding circuits that match your hypothesis while missing the real mechanism. Solution: Always test with adversarial examples and edge cases. If you think you've found an "addition circuit," try weird number formats, different bases, and word problems.

2. Over-Interpreting Individual Neurons

Problem: Assigning human-understandable concepts to neurons that actually encode complex, alien features. Solution: Use multiple interpretation methods. If a neuron "detects dogs," verify with activation maximization, dataset examples, and causal interventions.

3. Ignoring Superposition

Problem: Assuming clean, monosemantic features when reality is messier. Solution: Always consider that observed behaviors might be side effects of superposed features. Use SAEs or other decomposition methods.

4. Scale Blindness

Problem: Techniques that work on small models failing catastrophically on larger ones. Solution: Test insights on progressively larger models. What's a clean circuit in a 1M parameter model might be distributed across hundreds of components in a 1B parameter model.

5. Missing Contextual Computation

Problem: Analyzing components in isolation when they work together. Solution: Study information flow between components. Attention heads often work in groups, with early heads setting up information for later ones.

Practical Exercise: Reverse Engineering a Simple Behavior

Let's reverse engineer how a small transformer performs the task: "Complete the pattern: A B A B A ?"

# Step 1: Set up your model and tools
import torch
from transformer_lens import HookedTransformer
import numpy as np

model = HookedTransformer.from_pretrained("gpt2-small")

# Step 2: Create test inputs
test_prompts = [
    "A B A B A",
    "X Y X Y X",
    "1 2 1 2 1",
    "cat dog cat dog cat"
]

# Step 3: Analyze attention patterns
def analyze_pattern_completion(model, prompt):
    # Tokenize and run
    tokens = model.to_tokens(prompt)
    logits, cache = model.run_with_cache(tokens)
    
    # Look for copy patterns in attention
    attention_patterns = cache["pattern", :]  # shape [layer, head, seq, seq]
    
    # Find heads that attend to positions 2 steps back
    copy_heads = []
    for layer in range(model.cfg.n_layers):
        for head in range(model.cfg.n_heads):
            attn = attention_patterns[layer, head, -1, :]  # last token's attention
            if attn[-3] > 0.5:  # attending to position 2 steps back
                copy_heads.append((layer, head))
    
    return copy_heads, cache

# Step 4: Test interventions
def test_ablation(model, prompt, layer, head):
    # Run with and without the head
    def ablate_head(pattern, hook):
        pattern[:, head, :, :] = 0
        return pattern
    
    clean_logits = model(prompt)
    ablated_logits = model.run_with_hooks(
        prompt,
        fwd_hooks=[(f"blocks.{layer}.attn.hook_pattern", ablate_head)]
    )
    
    # Compare predictions
    clean_pred = model.tokenizer.decode(clean_logits.argmax(-1))
    ablated_pred = model.tokenizer.decode(ablated_logits.argmax(-1))
    
    return clean_pred, ablated_pred

# Step 5: Build understanding
# - Which heads implement the copy pattern?
# - Do they work alone or together?
# - What happens in non-periodic patterns?
# - Can we find the "period detection" circuit?

Connections

basic-interpretability: Foundation concepts before diving into mechanistic work
debugging-tools: Practical frameworks for model analysis
agent-architectures: Understanding how interpretability applies to agent systems
safety-evaluation-101: Using interpretability for safety assessments

Key Figures

Chris Olah: Pioneer of neural network interpretability at Anthropic
Neel Nanda: Leading mechanistic interpretability research at DeepMind
Catherine Olsson: Anthropic researcher on interpretability and safety
Nelson Elhage: Core contributor to transformer circuits work

Organizations

Anthropic: Leading interpretability research with published circuits work
DeepMind: Neel Nanda's team advancing mechanistic understanding
Redwood Research: Applying interpretability to alignment research
Apollo Research: Investigating deceptive AI through interpretability

← Back to Module

⚡Pre-rendered at build time (instant load)

Mechanistic Interpretability Practice

Mechanistic Interpretability Practice

Table of Contents

Learning Objectives

Introduction

Core Concepts

1. Circuits and Features

2. Superposition and Polysemanticity

3. Transformer Mechanics

4. Probing and Intervention

Common Pitfalls

1. Confirmation Bias in Circuit Discovery

2. Over-Interpreting Individual Neurons

3. Ignoring Superposition

4. Scale Blindness

5. Missing Contextual Computation

Practical Exercise: Reverse Engineering a Simple Behavior

Further Reading

Essential Papers

Tools and Resources

Courses and Tutorials

Connections

Key Figures

Organizations