Novel Circuit Discovery

Techniques for discovering and analyzing computational circuits in neural networks for safety insights

⏱️ 4-6 hours · Advanced

Learning Objectives

Understand the concept of circuits in neural networks and their role in mechanistic interpretability
Master techniques for discovering and analyzing computational circuits in AI systems
Learn automated methods for circuit identification and validation
Develop skills in reverse-engineering neural network behaviors through circuit analysis
Apply circuit discovery to uncover safety-relevant mechanisms in AI models

Introduction

Circuit discovery is one of the most promising approaches to understanding how neural networks implement specific behaviors. A "circuit" in this context is a subset of model components (neurons, attention heads, layers), together with the connections between them, that implements a specific function or behavior. By identifying and understanding these circuits, we can gain mechanistic insight into how AI systems work, potentially revealing safety-relevant mechanisms such as deception, goal-seeking, or value representation.

This topic explores cutting-edge techniques for discovering novel circuits in neural networks, from manual investigation methods pioneered by researchers like Chris Olah to automated discovery algorithms that can scale to larger models. We'll examine how circuit discovery contributes to AI safety by enabling us to understand, predict, and potentially modify model behaviors at a mechanistic level.

Core Concepts

1. What Are Circuits?

Circuits are the functional building blocks of neural network computation, consisting of connected components that together implement specific behaviors.

Key Properties of Circuits

Sparsity: Many studied behaviors appear to depend on a small fraction of model components
Modularity: Circuits often compose in understandable ways
Interpretability: Individual circuit components often have human-understandable functions
Causality: Circuit components causally influence model outputs, and interventions such as ablation and patching can measure that influence

Types of Circuits

Feature Circuits: Detect specific patterns in inputs (e.g., curved lines, syntactic structures)
Algorithmic Circuits: Implement computational procedures (e.g., copying, comparison)
Behavioral Circuits: Drive high-level behaviors (e.g., refusing harmful requests)
Value Circuits: Encode preferences and goals

2. Manual Circuit Discovery Techniques

Traditional circuit discovery involves careful investigation of model internals using interpretability tools.

The Investigation Process

1. Behavior Identification: Choose a specific model behavior to investigate
2. Component Attribution: Identify which model components contribute most
3. Ablation Studies: Test importance by removing/modifying components
4. Path Tracing: Follow information flow through the network
5. Hypothesis Formation: Develop mechanistic explanations
6. Validation: Test predictions on new inputs

Example: Indirect Object Identification Circuit

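def investigate_ioi_circuit(model, prompts):
    """
    Investigate the Indirect Object Identification (IOI) circuit.
    Example: "When Mary and John went to the store, John gave a drink to"
    -> the model should predict " Mary" (the indirect object).

    `model` is assumed to expose the illustrative methods used below;
    `attends_to_names` and `measure_effect` are helpers you supply.
    """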
    # Step 1: Identify important attention heads
    attention_patterns = model.get_attention_patterns(prompts)
    important_heads = []
    
    for layer in range(model.n_layers):
        for head in range(model.n_heads):
            # Check if head attends from end position to name positions
            if attends_to_names(attention_patterns[layer][head]):
                important_heads.append((layer, head))
    
    # Step 2: Test importance via ablation
    original_logits = model(prompts)
    for head in important_heads:
        # Zero out this head's contribution
        ablated_logits = model(prompts, ablate_head=head)
        effect = measure_effect(original_logits, ablated_logits)
        print(f"Head {head}: effect size = {effect}")
    
    return important_heads
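
The example above leaves `measure_effect` undefined. One common choice for IOI-style tasks is the logit difference between the correct name and the incorrect name at the final position. The sketch below is a minimal version on plain Python lists. The extra `correct_id` and `incorrect_id` arguments, the token ids, and the logit values are assumptions made for illustration and are not part of the original example.

```python
def logit_diff(final_logits, correct_id, incorrect_id):
    """Logit of the correct token minus logit of the incorrect token."""
    return final_logits[correct_id] - final_logits[incorrect_id]


def measure_effect(original_logits, ablated_logits, correct_id, incorrect_id):
    """Fraction of the original logit difference lost after ablation.

    0.0 means the ablation changed nothing, 1.0 means the preference for
    the correct token was fully removed, and values above 1.0 mean it flipped.
    """
    base = logit_diff(original_logits, correct_id, incorrect_id)
    ablated = logit_diff(ablated_logits, correct_id, incorrect_id)
    if base == 0:
        raise ValueError("original logit difference is zero; effect is undefined")
    return (base - ablated) / base


# Toy final-position logits over a 4-token vocabulary (made-up numbers).
# Hypothetical ids: token 1 = " Mary" (correct), token 2 = " John" (incorrect).
original = [0.1, 3.0, 1.0, -0.5]
ablated = [0.1, 1.5, 1.0, -0.5]  # after zeroing one hypothetical head
effect = measure_effect(original, ablated, correct_id=1, incorrect_id=2)
print(f"effect size = {effect:.2f}")
```

Normalizing by the original logit difference makes effect sizes comparable across prompts, at the cost of being undefined when the model has no initial preference.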

3. Automated Circuit Discovery

Modern approaches use algorithmic methods to discover circuits automatically, enabling analysis of larger models and more complex behaviors.

Activation Patching Methods

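def automated_circuit_discovery(model, clean_input, corrupted_input, metric, threshold=0.5):
    """
    Discover circuits using activation patching.

    Patches clean activations into a corrupted run, one component at a
    time, and keeps components whose patch restores `metric` above
    `threshold` (the default value is an arbitrary placeholder).
    """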
    circuit_components = []
    
    # Get activations for clean and corrupted inputs
    clean_acts = model.get_all_activations(clean_input)
    corrupted_acts = model.get_all_activations(corrupted_input)
    
    # Try patching each component
    for component in model.get_all_components():
        # Replace component activation with clean version
        patched_acts = corrupted_acts.copy()
        patched_acts[component] = clean_acts[component]
        
        # Measure restoration of original behavior
        restoration = metric(model.forward_from_acts(patched_acts))
        
        if restoration > threshold:
            circuit_components.append(component)
    
    # Find minimal circuit via greedy search
    minimal_circuit = find_minimal_subset(circuit_components, metric)
    return minimal_circuit
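
To make the patching loop concrete, here is a self-contained toy version that runs without any ML library. The "model" is just three named components whose activations are combined by a fixed readout. All component names, weights, and the normalized restoration score are assumptions chosen for illustration, not a real model API.

```python
# Toy "model": three components read the input, and the output is a fixed
# weighted sum of their activations. Names and weights are made up.
COMPONENTS = {
    "head_a": lambda x: x[0] * 2.0,  # depends on the first input feature
    "head_b": lambda x: x[1] * 0.5,  # depends on the second input feature
    "mlp_c": lambda x: 1.0,          # input-independent
}
READOUT = {"head_a": 1.0, "head_b": 1.0, "mlp_c": 0.3}


def get_all_activations(x):
    return {name: fn(x) for name, fn in COMPONENTS.items()}


def forward_from_acts(acts):
    return sum(READOUT[name] * value for name, value in acts.items())


def patch_components(clean_input, corrupted_input, threshold=0.5):
    """Score each component by how much patching it restores the clean output."""
    clean_acts = get_all_activations(clean_input)
    corrupted_acts = get_all_activations(corrupted_input)
    clean_out = forward_from_acts(clean_acts)
    corrupted_out = forward_from_acts(corrupted_acts)
    gap = clean_out - corrupted_out
    if gap == 0:
        raise ValueError("clean and corrupted outputs match; nothing to restore")

    scores = {}
    for name in COMPONENTS:
        patched = dict(corrupted_acts)   # start from the corrupted run
        patched[name] = clean_acts[name]  # patch in one clean activation
        # Normalized restoration: 0.0 = corrupted output, 1.0 = clean output.
        scores[name] = (forward_from_acts(patched) - corrupted_out) / gap
    selected = [name for name, score in scores.items() if score > threshold]
    return scores, selected


scores, selected = patch_components(clean_input=[1.0, 1.0], corrupted_input=[0.0, 1.0])
print(scores, selected)
```

Because the two inputs differ only in the first feature, only the component that reads that feature restores the output when patched. In a real network, components interact nonlinearly, so single-component patching scores are a starting point rather than a complete circuit.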

Gradient-Based Discovery

Recent methods use gradients to efficiently identify important connections:

def gradient_circuit_discovery(model, inputs, target_behavior):
    """
    Use integrated gradients to find circuits
    """
    # Compute gradients of target behavior w.r.t. all connections
    gradients = compute_integrated_gradients(
        model, inputs, target_behavior
    )
    
    # Identify high-gradient paths
    important_edges = gradients.get_top_k_edges(k=1000)
    
    # Verify importance through masking
    circuit = verify_circuit_edges(model, important_edges, target_behavior)
    return circuit
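
The function `compute_integrated_gradients` above is not defined in this lesson. As a rough sketch of the underlying idea, the code below computes integrated gradients for a toy scalar function of three features, using a midpoint Riemann sum along the straight path from a baseline to the input. The toy function, its hand-written gradient, and the step count are assumptions; in practice the gradient would come from an autodiff framework and the attributions would be over model edges or activations.

```python
def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Approximate integrated gradients with a midpoint Riemann sum.

    attribution_i = (x_i - baseline_i) * average over the path of df/dx_i
    """
    diffs = [xi - bi for xi, bi in zip(x, baseline)]
    avg_grad = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval
        point = [bi + alpha * d for bi, d in zip(baseline, diffs)]
        for i, g in enumerate(grad_fn(point)):
            avg_grad[i] += g / steps
    return [d * g for d, g in zip(diffs, avg_grad)]


def toy_behavior(x):
    # Made-up "target behavior": an interaction term plus a linear term.
    return x[0] * x[1] + 3.0 * x[2]


def toy_gradient(x):
    # Analytic gradient of toy_behavior.
    return [x[1], x[0], 3.0]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(toy_gradient, x, baseline)

# Completeness check: attributions should sum to f(x) - f(baseline).
total = sum(attributions)
expected = toy_behavior(x) - toy_behavior(baseline)
print(attributions, total, expected)
```

The completeness check is a useful sanity test for any integrated-gradients implementation: if the attributions do not sum to the change in output, the path integral is under-resolved or the gradient is wrong.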

4. Circuit Analysis and Validation

Once discovered, circuits must be carefully analyzed and validated to ensure they truly implement the hypothesized function.

Validation Techniques

Sufficiency: Does activating only the circuit reproduce the behavior?
Necessity: Does removing the circuit eliminate the behavior?
Generalization: Does the circuit work across different inputs?
Specificity: Is the circuit specific to the target behavior?
Compositionality: How does the circuit interact with others?
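
The first two checks can be expressed as scores. The sketch below is one possible formulation, not a standard definition: sufficiency is the share of the behavior recovered when everything outside the circuit is ablated, and necessity is the share lost when only the circuit is ablated. The toy model, component names, and ablation values (zeros here, where mean activations are often used) are assumptions for illustration.

```python
def ablate(acts, keep, ablation_values):
    """Replace every activation outside `keep` with its ablation value."""
    return {
        name: (value if name in keep else ablation_values[name])
        for name, value in acts.items()
    }


def sufficiency_and_necessity(model_fn, acts, circuit, ablation_values):
    circuit = set(circuit)
    everything = set(acts)
    full = model_fn(acts)
    floor = model_fn(ablate(acts, set(), ablation_values))  # all ablated
    span = full - floor
    if span == 0:
        raise ValueError("ablating everything does not change the output")
    only_circuit = model_fn(ablate(acts, circuit, ablation_values))
    without_circuit = model_fn(ablate(acts, everything - circuit, ablation_values))
    sufficiency = (only_circuit - floor) / span
    necessity = (full - without_circuit) / span
    return sufficiency, necessity


def toy_model(acts):
    # Made-up model with redundancy: "backup" can stand in for "primary".
    return max(acts["primary"], acts["backup"]) + 0.1 * acts["other"]


acts = {"primary": 2.0, "backup": 2.0, "other": 1.0}
zeros = {name: 0.0 for name in acts}
sufficiency, necessity = sufficiency_and_necessity(toy_model, acts, ["primary"], zeros)
print(f"sufficiency = {sufficiency:.3f}, necessity = {necessity:.3f}")
```

In this toy case the candidate circuit is almost fully sufficient yet not necessary at all, because a redundant component compensates when it is removed. This is why the two checks should be run separately rather than treated as interchangeable.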

Circuit Visualization

import networkx as nx


def visualize_circuit(circuit, model):
    """
    Create interpretable visualization of discovered circuit
    """
    graph = nx.DiGraph()
    
    # Add nodes for each component
    for component in circuit.components:
        label = get_component_function(component, model)
        graph.add_node(component.id, label=label)
    
    # Add edges for connections
    for edge in circuit.edges:
        weight = edge.importance_score
        graph.add_edge(edge.source, edge.target, weight=weight)
    
    # Layout and render
    pos = hierarchical_layout(graph)
    draw_circuit_diagram(graph, pos)

Practical Applications

Case Study: Deception Circuit Discovery

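A critical safety application involves discovering circuits responsible for deceptive behaviors. The sketch below is illustrative: activation differences between honest and deceptive prompts are correlational, so any candidate circuit still needs causal validation through ablation or patching.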

class DeceptionCircuitFinder:
    def __init__(self, model):
        self.model = model
        
    def find_deception_circuits(self):
        # Create dataset of honest vs deceptive responses
        honest_prompts, deceptive_prompts = create_deception_dataset()
        
        # Find components that differ between honest/deceptive
        circuits = []
        for layer in range(self.model.n_layers):
            # Compare activations
            honest_acts = self.model.get_layer_acts(honest_prompts, layer)
            deceptive_acts = self.model.get_layer_acts(deceptive_prompts, layer)
            
            # Statistical test for significant differences
            different_neurons = find_different_neurons(
                honest_acts, deceptive_acts
            )
            
            # Explicit length check works for both lists and arrays
            if len(different_neurons) > 0:
                # Trace connections
                circuit = trace_circuit_from_neurons(
                    different_neurons, layer
                )
                circuits.append(circuit)
        
        return self.validate_circuits(circuits)
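
The class above relies on an undefined `find_different_neurons`. A minimal sketch of one way to implement it is shown below, using Welch's t-statistic per neuron with only the standard library. The cutoff `min_abs_t`, the input layout (one activation vector per prompt), and the toy numbers are assumptions. With many neurons, a real analysis would also need a multiple-comparisons correction.

```python
import math
import statistics


def find_different_neurons(honest_acts, deceptive_acts, min_abs_t=4.0):
    """Return indices of neurons whose mean activation differs between groups.

    Each argument is a list of per-prompt activation vectors (lists of floats).
    Uses Welch's t-statistic per neuron; the cutoff is an arbitrary assumption.
    """
    n_neurons = len(honest_acts[0])
    flagged = []
    for j in range(n_neurons):
        a = [row[j] for row in honest_acts]
        b = [row[j] for row in deceptive_acts]
        diff = statistics.mean(a) - statistics.mean(b)
        se = math.sqrt(
            statistics.variance(a) / len(a) + statistics.variance(b) / len(b)
        )
        if se == 0:
            # No within-group variation: flag only if the means differ at all.
            if diff != 0:
                flagged.append(j)
            continue
        if abs(diff / se) >= min_abs_t:
            flagged.append(j)
    return flagged


# Toy activations for 2 neurons over 4 prompts per condition (made-up numbers).
honest = [[0.1, 1.0], [0.2, 1.1], [0.0, 0.9], [0.1, 1.0]]
deceptive = [[0.1, 3.0], [0.2, 3.1], [0.0, 2.9], [0.1, 3.0]]
different = find_different_neurons(honest, deceptive)
print(different)
```

A neuron flagged this way may simply track surface features that differ between the two prompt sets, so the prompts should be matched as closely as possible apart from the property under study.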

Scaling Circuit Discovery

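Patching or ablating every component individually costs at least one forward pass per component, which becomes expensive for larger models. A coarse-to-fine search reduces that cost: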

def scalable_circuit_discovery(model, behavior_dataset, max_components=10000):
    """
    Discover circuits in large models efficiently
    """
    # Use sparse probing to identify candidate components
    candidates = sparse_probe_components(model, behavior_dataset)
    
    # Hierarchical search: start with coarse modules
    coarse_circuit = find_coarse_modules(model, candidates)
    
    # Refine to find specific components
    fine_circuit = refine_circuit(model, coarse_circuit, behavior_dataset)
    
    # Compress circuit to minimal form
    minimal_circuit = compress_circuit(fine_circuit, max_components)
    
    return minimal_circuit
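
The coarse-to-fine step can be sketched as follows: ablate whole groups (for example, layers) first, then test individual components only inside groups that matter. The function name `coarse_to_fine_search`, the additive toy effect function, the component names, and the threshold are all assumptions for illustration.

```python
def coarse_to_fine_search(groups, effect_fn, threshold):
    """Find important components while skipping unimportant groups.

    groups: dict mapping a group name to a list of component names.
    effect_fn(components): effect of ablating those components together.
    Returns the selected components and the number of effect evaluations.
    """
    evaluations = 0
    circuit = []
    for members in groups.values():
        evaluations += 1
        if effect_fn(members) < threshold:
            continue  # the whole group has little effect; skip its members
        for component in members:
            evaluations += 1
            if effect_fn([component]) >= threshold:
                circuit.append(component)
    return circuit, evaluations


# Toy setup: 4 layers x 4 heads, with two important heads in layer 2.
groups = {
    f"layer{layer}": [f"L{layer}H{head}" for head in range(4)]
    for layer in range(4)
}
contributions = {"L2H1": 0.6, "L2H3": 0.3}  # made-up ablation effects


def toy_effect(components):
    # Assumes effects add up; real ablation effects are generally not additive.
    return sum(contributions.get(c, 0.0) for c in components)


circuit, evaluations = coarse_to_fine_search(groups, toy_effect, threshold=0.2)
print(circuit, evaluations)
```

Here the search needs 8 evaluations instead of the 16 required to test every head. The saving comes with a risk: a group-level ablation can hide components whose effects cancel out or are masked by redundancy, so skipped groups are worth spot-checking.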

Hands-on Exercise: Discover Your Own Circuit

1. Select a behavior (e.g., completing common phrases, basic arithmetic)
2. Create a dataset of examples exhibiting the behavior
3. Use activation patching to identify important components
4. Trace connections between components
5. Validate the circuit on new examples

Starter code:

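# Load a small model. `load_model` is a placeholder for your framework's loader;
# with TransformerLens, one option is HookedTransformer.from_pretrained("gpt2-small").
model = load_model("gpt2-small")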

# Define behavior: completing common phrases
test_phrases = [
    "The early bird gets the",  # -> "worm"
    "Actions speak louder than",  # -> "words"
    "Better late than",  # -> "never"
]

# Your task: implement circuit discovery
def discover_phrase_completion_circuit(model, phrases):
    # TODO: Implement discovery algorithm
    pass

Connections

Prerequisites: Mechanistic Interpretability, Basic Interpretability
Parallel Concepts: Explainable AI, AI Debugging Frameworks
Advanced Applications: Scalable Interpretability, Research Methodology

Key Researchers

Chris Olah: Pioneer of circuits research, first at OpenAI and later at Anthropic
Neel Nanda: Mechanistic interpretability and circuit discovery
Arthur Conmy: Automated circuit discovery methods

Research Groups

Anthropic Interpretability Team: Leading circuit discovery research
Redwood Research: Applied circuit analysis for safety
MATS Scholars: Many working on circuit discovery projects
