Novel Circuit Discovery
Techniques for discovering and analyzing computational circuits in neural networks for safety insights
Table of Contents
- Learning Objectives
- Introduction
- Core Concepts
- Practical Applications
- Common Pitfalls
- Hands-on Exercise: Discover Your Own Circuit
- Further Reading
- Connections
Learning Objectives
- Understand the concept of circuits in neural networks and their role in mechanistic interpretability
- Master techniques for discovering and analyzing computational circuits in AI systems
- Learn automated methods for circuit identification and validation
- Develop skills in reverse-engineering neural network behaviors through circuit analysis
- Apply circuit discovery to uncover safety-relevant mechanisms in AI models
Introduction
Circuit discovery represents one of the most promising approaches to understanding how neural networks implement specific behaviors. A "circuit" in this context refers to a subset of model components (neurons, attention heads, layers) that together implement a specific function or behavior. By identifying and understanding these circuits, we can gain mechanistic insights into how AI systems work, potentially revealing safety-critical behaviors like deception, goal-seeking, or value representation.
This topic explores cutting-edge techniques for discovering novel circuits in neural networks, from manual investigation methods pioneered by researchers like Chris Olah to automated discovery algorithms that can scale to larger models. We'll examine how circuit discovery contributes to AI safety by enabling us to understand, predict, and potentially modify model behaviors at a mechanistic level.
Core Concepts
1. What Are Circuits?
Circuits are the functional building blocks of neural network computation, consisting of connected components that together implement specific behaviors.
Key Properties of Circuits
- Sparsity: Most behaviors are implemented by a small fraction of model components
- Modularity: Circuits often compose in understandable ways
- Interpretability: Individual circuit components often have human-understandable functions
- Causality: Circuit components have direct causal influence on model outputs
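The sparsity and causality properties can be made concrete with a toy example. The sketch below (plain NumPy; the network and its weights are invented for illustration) builds a tiny ReLU network in which a single hidden unit implements "x0 AND x1". Zero-ablating that unit destroys the behavior, while ablating an unrelated unit barely changes the output:

```python
import numpy as np

# Hypothetical toy network: hidden unit 0 implements "x0 AND x1";
# units 1 and 2 carry unrelated, low-magnitude signals.
W1 = np.array([[5.0, 0.1, -0.3],
               [5.0, -0.2, 0.4]])    # input -> hidden
b1 = np.array([-7.5, 0.0, 0.0])
W2 = np.array([10.0, 0.01, 0.01])    # hidden -> output

def forward(x, ablate_unit=None):
    h = np.maximum(0.0, x @ W1 + b1)  # ReLU hidden layer
    if ablate_unit is not None:
        h = h.copy()
        h[ablate_unit] = 0.0          # zero-ablate one component
    return float(h @ W2)

x = np.array([1.0, 1.0])              # both features present
print(forward(x))                     # large output: behavior present
print(forward(x, ablate_unit=0))      # behavior vanishes: unit 0 is causal
print(forward(x, ablate_unit=2))      # barely changes: unit 2 is not
```

Real circuit discovery plays this same game at scale: most components are like units 1 and 2 (sparsity), and the few that matter can be confirmed by intervention (causality).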
Types of Circuits
- Feature Circuits: Detect specific patterns in inputs (e.g., curved lines, syntactic structures)
- Algorithmic Circuits: Implement computational procedures (e.g., copying, comparison)
- Behavioral Circuits: Drive high-level behaviors (e.g., refusing harmful requests)
- Value Circuits: Encode preferences and goals
2. Manual Circuit Discovery Techniques
Traditional circuit discovery involves careful investigation of model internals using interpretability tools.
The Investigation Process
- Behavior Identification: Choose a specific model behavior to investigate
- Component Attribution: Identify which model components contribute most
- Ablation Studies: Test importance by removing/modifying components
- Path Tracing: Follow information flow through the network
- Hypothesis Formation: Develop mechanistic explanations
- Validation: Test predictions on new inputs
Example: Indirect Object Identification Circuit
def investigate_ioi_circuit(model, prompts):
    """
    Investigate the Indirect Object Identification circuit.
    Example: "John gave Mary the ball" -> model should predict "Mary"
    """
    # Step 1: Identify attention heads that attend to name positions
    attention_patterns = model.get_attention_patterns(prompts)
    important_heads = []
    for layer in range(model.n_layers):
        for head in range(model.n_heads):
            # Check whether this head attends from the end position to name positions
            if attends_to_names(attention_patterns[layer][head]):
                important_heads.append((layer, head))

    # Step 2: Test importance via ablation
    original_logits = model(prompts)
    for head in important_heads:
        # Zero out this head's contribution and measure the change in output
        ablated_logits = model(prompts, ablate_head=head)
        effect = measure_effect(original_logits, ablated_logits)
        print(f"Head {head}: effect size = {effect}")
    return important_heads
3. Automated Circuit Discovery
Modern approaches use algorithmic methods to discover circuits automatically, enabling analysis of larger models and more complex behaviors.
Activation Patching Methods
def automated_circuit_discovery(model, clean_input, corrupted_input, metric,
                                threshold=0.5):
    """
    Discover circuit components using activation patching.
    """
    circuit_components = []
    # Cache activations for the clean and corrupted inputs
    clean_acts = model.get_all_activations(clean_input)
    corrupted_acts = model.get_all_activations(corrupted_input)

    # Patch each component individually
    for component in model.get_all_components():
        # Run on the corrupted input, but restore this component's clean activation
        patched_acts = corrupted_acts.copy()
        patched_acts[component] = clean_acts[component]
        # Measure how much of the clean behavior is restored
        restoration = metric(model.forward_from_acts(patched_acts))
        if restoration > threshold:
            circuit_components.append(component)

    # Prune to a minimal circuit via greedy search
    minimal_circuit = find_minimal_subset(circuit_components, metric)
    return minimal_circuit
Gradient-Based Discovery
Recent methods use gradients to efficiently identify important connections:
def gradient_circuit_discovery(model, inputs, target_behavior):
    """
    Use integrated gradients to find circuits.
    """
    # Compute gradients of the target behavior w.r.t. all connections
    gradients = compute_integrated_gradients(model, inputs, target_behavior)

    # Identify high-gradient paths
    important_edges = gradients.get_top_k_edges(k=1000)

    # Verify importance through masking
    circuit = verify_circuit_edges(model, important_edges, target_behavior)
    return circuit
4. Circuit Analysis and Validation
Once discovered, circuits must be carefully analyzed and validated to ensure they truly implement the hypothesized function.
Validation Techniques
- Sufficiency: Does activating only the circuit reproduce the behavior?
- Necessity: Does removing the circuit eliminate the behavior?
- Generalization: Does the circuit work across different inputs?
- Specificity: Is the circuit specific to the target behavior?
- Compositionality: How does the circuit interact with others?
Circuit Visualization
import networkx as nx

def visualize_circuit(circuit, model):
    """
    Create an interpretable visualization of a discovered circuit.
    """
    graph = nx.DiGraph()
    # Add a node for each component, labeled with its hypothesized function
    for component in circuit.components:
        label = get_component_function(component, model)
        graph.add_node(component.id, label=label)
    # Add edges for connections, weighted by importance
    for edge in circuit.edges:
        graph.add_edge(edge.source, edge.target, weight=edge.importance_score)
    # Layout and render
    pos = hierarchical_layout(graph)
    draw_circuit_diagram(graph, pos)
Practical Applications
Case Study: Deception Circuit Discovery
A critical safety application involves discovering circuits responsible for deceptive behaviors:
class DeceptionCircuitFinder:
    def __init__(self, model):
        self.model = model

    def find_deception_circuits(self):
        # Create a dataset of honest vs. deceptive responses
        honest_prompts, deceptive_prompts = create_deception_dataset()

        # Find components whose activations differ between honest and deceptive runs
        circuits = []
        for layer in range(self.model.n_layers):
            honest_acts = self.model.get_layer_acts(honest_prompts, layer)
            deceptive_acts = self.model.get_layer_acts(deceptive_prompts, layer)

            # Statistical test for significant differences
            different_neurons = find_different_neurons(honest_acts, deceptive_acts)
            if different_neurons:
                # Trace connections from the differing neurons
                circuit = trace_circuit_from_neurons(different_neurons, layer)
                circuits.append(circuit)

        return self.validate_circuits(circuits)
Scaling Circuit Discovery
For larger models, we need efficient methods:
def scalable_circuit_discovery(model, behavior_dataset, max_components=10000):
    """
    Discover circuits in large models efficiently.
    """
    # Use sparse probing to identify candidate components
    candidates = sparse_probe_components(model, behavior_dataset)

    # Hierarchical search: start with coarse modules
    coarse_circuit = find_coarse_modules(model, candidates)

    # Refine to find the specific components within each module
    fine_circuit = refine_circuit(model, coarse_circuit, behavior_dataset)

    # Compress the circuit to a minimal form
    minimal_circuit = compress_circuit(fine_circuit, max_components)
    return minimal_circuit
Common Pitfalls
1. Confirmation Bias
Problem: Finding circuits that confirm preconceptions rather than true mechanisms.
Solution: Pre-register hypotheses and use systematic search procedures.
2. Overfitting to Specific Examples
Problem: Circuits that only work on the discovery dataset.
Solution: Always validate on held-out data with diverse examples.
3. Ignoring Backup Circuits
Problem: Models often have redundant implementations of important behaviors.
Solution: Check for behavior restoration after circuit removal.
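The backup-circuit failure mode is easy to see in a toy model (invented for illustration): two redundant components each carry the full behavior, so ablating only one of them changes nothing, and a naive necessity test would wrongly conclude that neither matters:

```python
import numpy as np

# Toy model with redundancy: units 0 and 1 each carry the full behavior;
# the max-pool means either one alone suffices (a "backup circuit").
def run(ablated=()):
    h = np.array([1.0, 1.0, 0.0])
    for i in ablated:
        h[i] = 0.0
    return float(max(h) * 5.0)

print(run())                # behavior present
print(run(ablated=(0,)))    # still present: unit 1 backs up unit 0
print(run(ablated=(0, 1)))  # gone only when both copies are removed
```

This is why ablations should be run over sets of components, not just individuals, when checking necessity.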
4. Correlation vs Causation
Problem: Components that correlate with the behavior but aren't causal.
Solution: Use causal intervention methods, not just correlation.
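A minimal sketch of the correlation/causation gap (components and weights invented for illustration): component B copies component A, so B correlates almost perfectly with the behavior, yet only an intervention reveals that B is a dead end:

```python
import numpy as np

# Hypothetical toy computation: component A causally drives the output,
# component B merely copies A (perfectly correlated, but a dead end).
def forward(x, set_a=None, set_b=None):
    a = 2.0 * x                  # component A
    b = a + 0.01                 # component B: correlated with A
    if set_a is not None:
        a = set_a                # intervene on A
    if set_b is not None:
        b = set_b                # intervene on B (b is never read below)
    return 3.0 * a               # only A reaches the output

xs = np.linspace(-2.0, 2.0, 100)
outs = np.array([forward(x) for x in xs])
bs = np.array([2.0 * x + 0.01 for x in xs])
print(np.corrcoef(bs, outs)[0, 1])   # ~1.0: B correlates perfectly
print(forward(1.0))                  # baseline output
print(forward(1.0, set_b=0.0))       # unchanged: B is not causal
print(forward(1.0, set_a=0.0))       # output destroyed: A is causal
```

Probing or correlation analysis alone would rank B as highly important; activation patching correctly excludes it from the circuit.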
Hands-on Exercise: Discover Your Own Circuit
Choose a simple behavior in a small language model and discover its circuit:
- Select a behavior (e.g., completing common phrases, basic arithmetic)
- Create a dataset of examples exhibiting the behavior
- Use activation patching to identify important components
- Trace connections between components
- Validate the circuit on new examples
Starter code:
# Load a small model
model = load_model("gpt2-small")

# Define behavior: completing common phrases
test_phrases = [
    "The early bird gets the",      # -> "worm"
    "Actions speak louder than",    # -> "words"
    "Better late than",             # -> "never"
]

# Your task: implement circuit discovery
def discover_phrase_completion_circuit(model, phrases):
    # TODO: Implement discovery algorithm
    pass
Further Reading
Foundational Papers
- Zoom In: An Introduction to Circuits - Olah et al., foundational circuit work
- A Mathematical Framework for Transformer Circuits - Anthropic's circuit framework
- Causal Scrubbing - Rigorous circuit validation
Recent Advances
- Automated Circuit Discovery - Automated methods for finding circuits
- Sparse Autoencoders - Finding interpretable features
- Circuit Breaking - Modifying circuits for safety
Tools and Resources
- TransformerLens - Library for mechanistic interpretability
- Circuitviz - Circuit visualization tools
- Pyvene - Intervention-based interpretability
Connections
Related Topics
- Prerequisites: Mechanistic Interpretability, Basic Interpretability
- Parallel Concepts: Explainable AI, AI Debugging Frameworks
- Advanced Applications: Scalable Interpretability, Research Methodology
Key Researchers
- Chris Olah: Pioneer of circuits research at Anthropic
- Neel Nanda: Mechanistic interpretability and circuit discovery
- Arthur Conmy: Automated circuit discovery methods
Research Groups
- Anthropic Interpretability Team: Leading circuit discovery research
- Redwood Research: Applied circuit analysis for safety
- MATS Scholars: Many working on circuit discovery projects