Novel Circuit Discovery

Techniques for discovering and analyzing computational circuits in neural networks for safety insights

⏱️ 4-6 hours · Advanced

Learning Objectives

  • Understand the concept of circuits in neural networks and their role in mechanistic interpretability
  • Master techniques for discovering and analyzing computational circuits in AI systems
  • Learn automated methods for circuit identification and validation
  • Develop skills in reverse-engineering neural network behaviors through circuit analysis
  • Apply circuit discovery to uncover safety-relevant mechanisms in AI models

Introduction

Circuit discovery is one of the most promising approaches to understanding how neural networks implement specific behaviors. A "circuit" in this context is a subset of model components (neurons, attention heads, layers), together with the connections between them, that implements a specific function or behavior. Identifying and understanding these circuits gives mechanistic insight into how AI systems work, and may reveal safety-critical mechanisms such as those underlying deception, goal-seeking, or value representation.

This topic explores cutting-edge techniques for discovering novel circuits in neural networks, from manual investigation methods pioneered by researchers like Chris Olah to automated discovery algorithms that can scale to larger models. We'll examine how circuit discovery contributes to AI safety by enabling us to understand, predict, and potentially modify model behaviors at a mechanistic level.

Core Concepts

1. What Are Circuits?

Circuits are the functional building blocks of neural network computation, consisting of connected components that together implement specific behaviors.

Key Properties of Circuits

  • Sparsity: Most behaviors are implemented by a small fraction of model components
  • Modularity: Circuits often compose in understandable ways
  • Interpretability: Individual circuit components often have human-understandable functions
  • Causality: Intervening on circuit components (e.g., ablating or patching them) changes model outputs, either directly or through later components

Types of Circuits

  1. Feature Circuits: Detect specific patterns in inputs (e.g., curved lines, syntactic structures)
  2. Algorithmic Circuits: Implement computational procedures (e.g., copying, comparison)
  3. Behavioral Circuits: Drive high-level behaviors (e.g., refusing harmful requests)
  4. Value Circuits: Encode preferences and goals

2. Manual Circuit Discovery Techniques

Traditional circuit discovery involves careful investigation of model internals using interpretability tools.

The Investigation Process

  1. Behavior Identification: Choose a specific model behavior to investigate
  2. Component Attribution: Identify which model components contribute most
  3. Ablation Studies: Test importance by removing/modifying components
  4. Path Tracing: Follow information flow through the network
  5. Hypothesis Formation: Develop mechanistic explanations
  6. Validation: Test predictions on new inputs
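
Steps 2 and 3 of this process can be sketched on a toy model. The block below is a minimal illustration, assuming a hypothetical two-layer ReLU network written in NumPy; the weights and the names `forward` and `ablation_effects` are made up for this example and are not any library's API. It zero-ablates each hidden unit in turn and ranks units by how much the output changes.

```python
import numpy as np

def forward(x, W1, W2, ablate=None):
    """Toy 2-layer ReLU MLP; optionally zero-ablate the listed hidden units."""
    hidden = np.maximum(0.0, x @ W1)
    if ablate is not None:
        hidden = hidden.copy()
        hidden[:, list(ablate)] = 0.0
    return hidden @ W2

def ablation_effects(x, W1, W2):
    """Mean absolute change in output when each hidden unit is zeroed."""
    baseline = forward(x, W1, W2)
    n_hidden = W1.shape[1]
    effects = np.zeros(n_hidden)
    for unit in range(n_hidden):
        ablated = forward(x, W1, W2, ablate=[unit])
        effects[unit] = np.abs(baseline - ablated).mean()
    return effects

# Hypothetical weights: unit 0 has a large output weight, unit 1 a small one,
# and unit 2 is disconnected from the output.
W1 = np.array([[1.0, 0.5, 1.0],
               [1.0, 0.5, -1.0]])
W2 = np.array([[2.0], [0.1], [0.0]])
x = np.array([[1.0, 2.0], [3.0, 1.0]])

effects = ablation_effects(x, W1, W2)
ranking = np.argsort(-effects)  # most important unit first
print(ranking, effects)
```

Zero ablation is only one choice of intervention; replacing an activation with its dataset mean is a common alternative when zero lies far outside the activation's usual range.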

Example: Indirect Object Identification Circuit

def investigate_ioi_circuit(model, prompts):
    """
    Investigate the Indirect Object Identification (IOI) circuit.
    Example: "When Mary and John went to the store, John gave a drink to"
    -> model should predict " Mary" (the indirect object, not the repeated subject)
    """
    # Step 1: Identify important attention heads
    attention_patterns = model.get_attention_patterns(prompts)
    important_heads = []
    
    for layer in range(model.n_layers):
        for head in range(model.n_heads):
            # Check if head attends from end position to name positions
            if attends_to_names(attention_patterns[layer][head]):
                important_heads.append((layer, head))
    
    # Step 2: Test importance via ablation
    original_logits = model(prompts)
    for head in important_heads:
        # Zero out this head's contribution
        ablated_logits = model(prompts, ablate_head=head)
        effect = measure_effect(original_logits, ablated_logits)
        print(f"Head {head}: effect size = {effect}")
    
    return important_heads
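
The `measure_effect` helper above is left undefined. One common choice for IOI is the logit difference between the indirect-object token and the subject token. The sketch below assumes logits have already been reduced to the final position, with shape `[batch, vocab]`, and that token ids for the correct and distractor names are supplied by the caller. The extra arguments are an assumption of this example, so the signature differs from the two-argument call used above.

```python
import numpy as np

def logit_diff(final_logits, correct_ids, distractor_ids):
    """Mean of logit(correct) - logit(distractor) over the batch.

    final_logits: array of shape [batch, vocab], logits at the final position.
    """
    rows = np.arange(final_logits.shape[0])
    diffs = final_logits[rows, correct_ids] - final_logits[rows, distractor_ids]
    return float(diffs.mean())

def measure_effect(original_logits, ablated_logits, correct_ids, distractor_ids):
    """Fraction of the original logit difference lost after an ablation.

    Assumes the original logit difference is non-zero.
    """
    base = logit_diff(original_logits, correct_ids, distractor_ids)
    ablated = logit_diff(ablated_logits, correct_ids, distractor_ids)
    return (base - ablated) / base

# Made-up logits over a 3-token vocabulary for two prompts.
original = np.array([[0.0, 4.0, 1.0],
                     [0.0, 1.0, 5.0]])
ablated = np.array([[0.0, 2.0, 1.0],
                    [0.0, 1.0, 2.0]])
correct_ids = np.array([1, 2])
distractor_ids = np.array([2, 1])

effect = measure_effect(original, ablated, correct_ids, distractor_ids)
print(f"effect size = {effect:.3f}")
```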

3. Automated Circuit Discovery

Modern approaches use algorithmic methods to discover circuits automatically, enabling analysis of larger models and more complex behaviors.

Activation Patching Methods

def automated_circuit_discovery(model, clean_input, corrupted_input, metric, threshold):
    """
    Discover circuits using activation patching.
    `threshold` is the minimum restoration score for a component to be kept.
    """
    circuit_components = []
    
    # Get activations for clean and corrupted inputs
    clean_acts = model.get_all_activations(clean_input)
    corrupted_acts = model.get_all_activations(corrupted_input)
    
    # Try patching each component
    for component in model.get_all_components():
        # Replace component activation with clean version
        patched_acts = corrupted_acts.copy()
        patched_acts[component] = clean_acts[component]
        
        # Measure restoration of original behavior
        restoration = metric(model.forward_from_acts(patched_acts))
        
        if restoration > threshold:
            circuit_components.append(component)
    
    # Find minimal circuit via greedy search
    minimal_circuit = find_minimal_subset(circuit_components, metric)
    return minimal_circuit
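
To make the patching loop concrete, here is a minimal, self-contained sketch on a toy NumPy network. Everything in it (the `run` and `patching_scores` functions, the weights, and the inputs) is hypothetical and chosen for illustration. Each hidden unit of the corrupted run is overwritten with its clean activation, and the score is the fraction of the clean-versus-corrupted output gap that the patch restores.

```python
import numpy as np

def run(x, W1, W2, patch=None):
    """Toy ReLU MLP. `patch` maps hidden-unit index -> replacement activations."""
    hidden = np.maximum(0.0, x @ W1)
    if patch:
        hidden = hidden.copy()
        for unit, values in patch.items():
            hidden[:, unit] = values
    return hidden, (hidden @ W2).ravel()

def patching_scores(clean_x, corrupted_x, W1, W2):
    """Fraction of the clean-vs-corrupted output gap restored per hidden unit.

    Assumes the clean and corrupted outputs differ, so the gap is non-zero.
    """
    clean_hidden, clean_out = run(clean_x, W1, W2)
    _, corrupted_out = run(corrupted_x, W1, W2)
    gap = clean_out.mean() - corrupted_out.mean()
    scores = {}
    for unit in range(W1.shape[1]):
        _, patched_out = run(corrupted_x, W1, W2,
                             patch={unit: clean_hidden[:, unit]})
        scores[unit] = (patched_out.mean() - corrupted_out.mean()) / gap
    return scores

# Hypothetical weights and inputs.
W1 = np.eye(2)
W2 = np.array([[3.0], [1.0]])
clean_x = np.array([[1.0, 1.0]])
corrupted_x = np.array([[0.0, 0.0]])

scores = patching_scores(clean_x, corrupted_x, W1, W2)
print(scores)
```

Normalizing by the gap puts scores on a common scale: 0 means the patch restored nothing and 1 means it fully restored the clean output. In this linear toy the per-unit scores sum to 1; in a real network they generally do not.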

Gradient-Based Discovery

Recent methods use gradients to efficiently identify important connections:

def gradient_circuit_discovery(model, inputs, target_behavior):
    """
    Use integrated gradients to find circuits
    """
    # Compute gradients of target behavior w.r.t. all connections
    gradients = compute_integrated_gradients(
        model, inputs, target_behavior
    )
    
    # Identify high-gradient paths
    important_edges = gradients.get_top_k_edges(k=1000)
    
    # Verify importance through masking
    circuit = verify_circuit_edges(model, important_edges, target_behavior)
    return circuit
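
The `compute_integrated_gradients` helper above is not defined in this section. As a minimal sketch of the underlying idea, the block below approximates integrated gradients for a toy function whose gradient is known analytically; in a real model the gradient would come from automatic differentiation. The function `f`, its weights, and the all-zeros baseline are assumptions made for this example.

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Midpoint Riemann-sum approximation of integrated gradients."""
    alphas = (np.arange(steps) + 0.5) / steps
    total = np.zeros_like(x, dtype=float)
    for alpha in alphas:
        # Gradient evaluated on the straight path from baseline to x.
        total += grad_fn(baseline + alpha * (x - baseline))
    return (x - baseline) * total / steps

# Toy target: f(x) = sum_i w_i * x_i^2, with gradient 2 * w_i * x_i.
w = np.array([1.0, -2.0, 0.5])

def f(x):
    return float(np.sum(w * x ** 2))

def grad_f(x):
    return 2.0 * w * x

x = np.array([1.0, 2.0, -2.0])
baseline = np.zeros(3)
attributions = integrated_gradients(grad_f, x, baseline)

# Completeness check: attributions should sum to f(x) - f(baseline).
print(attributions, attributions.sum(), f(x) - f(baseline))
```

The completeness check is a useful sanity test for any implementation: if the attributions do not sum to the change in output, the number of steps is too low or the gradient is wrong.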

4. Circuit Analysis and Validation

Once discovered, circuits must be carefully analyzed and validated to ensure they truly implement the hypothesized function.

Validation Techniques

  1. Sufficiency: Does the circuit alone, with all other components ablated, reproduce the behavior?
  2. Necessity: Does removing the circuit eliminate the behavior?
  3. Generalization: Does the circuit work across different inputs?
  4. Specificity: Is the circuit specific to the target behavior?
  5. Compositionality: How does the circuit interact with others?
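
The first two checks can be turned into numbers. The sketch below is one possible formulation, assuming a toy model whose output is a linear readout of hidden activations and using zero ablation; the function names and the data are hypothetical. Sufficiency is the share of the full output kept when only the circuit is active, and necessity is the share lost when the circuit is removed.

```python
import numpy as np

def output_with_only(hidden, W2, keep):
    """Mean model output when every hidden unit outside `keep` is zero-ablated."""
    mask = np.zeros(hidden.shape[1])
    mask[list(keep)] = 1.0
    return float(((hidden * mask) @ W2).mean())

def validate_circuit(hidden, W2, circuit):
    """Sufficiency and necessity scores for a candidate circuit (set of units)."""
    all_units = set(range(hidden.shape[1]))
    full = output_with_only(hidden, W2, all_units)
    circuit_only = output_with_only(hidden, W2, circuit)
    without_circuit = output_with_only(hidden, W2, all_units - set(circuit))
    return {
        # Near 1: the circuit alone reproduces the behavior.
        "sufficiency": circuit_only / full,
        # Near 1: removing the circuit removes the behavior.
        "necessity": 1.0 - without_circuit / full,
    }

# Made-up hidden activations (2 inputs x 3 units) and readout weights.
hidden = np.array([[2.0, 1.0, 1.0],
                   [4.0, 1.0, 1.0]])
W2 = np.array([3.0, 0.5, 0.25])

result = validate_circuit(hidden, W2, circuit={0})
print(result)
```

With a purely linear readout the two scores coincide. In real networks they can diverge, for example when redundant components compensate for an ablated circuit, which is why both checks are reported separately.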

Circuit Visualization

import networkx as nx

def visualize_circuit(circuit, model):
    """
    Create interpretable visualization of discovered circuit
    """
    graph = nx.DiGraph()
    
    # Add nodes for each component
    for component in circuit.components:
        label = get_component_function(component, model)
        graph.add_node(component.id, label=label)
    
    # Add edges for connections
    for edge in circuit.edges:
        weight = edge.importance_score
        graph.add_edge(edge.source, edge.target, weight=weight)
    
    # Layout and render
    pos = hierarchical_layout(graph)
    draw_circuit_diagram(graph, pos)

Practical Applications

Case Study: Deception Circuit Discovery

A critical safety application involves discovering circuits responsible for deceptive behaviors:

class DeceptionCircuitFinder:
    def __init__(self, model):
        self.model = model
        
    def find_deception_circuits(self):
        # Create dataset of honest vs deceptive responses
        honest_prompts, deceptive_prompts = create_deception_dataset()
        
        # Find components that differ between honest/deceptive
        circuits = []
        for layer in range(self.model.n_layers):
            # Compare activations
            honest_acts = self.model.get_layer_acts(honest_prompts, layer)
            deceptive_acts = self.model.get_layer_acts(deceptive_prompts, layer)
            
            # Statistical test for significant differences
            different_neurons = find_different_neurons(
                honest_acts, deceptive_acts
            )
            
            if different_neurons:
                # Trace connections
                circuit = trace_circuit_from_neurons(
                    different_neurons, layer
                )
                circuits.append(circuit)
        
        return self.validate_circuits(circuits)
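
The `find_different_neurons` step is not specified above. One simple possibility, sketched here under stated assumptions, is to rank neurons by a standardized mean difference (Cohen's d) between the two activation sets. The synthetic data, the `min_effect` cutoff, and the two-value return are choices made for this example. A large effect size marks a neuron as a candidate only; it shows correlation with the honest/deceptive split, not a causal role.

```python
import numpy as np

def find_different_neurons(honest_acts, deceptive_acts, min_effect=1.0):
    """Return indices of neurons with |Cohen's d| > min_effect, plus all d values.

    Both inputs have shape [n_examples, n_neurons] with equal n_examples.
    """
    mean_gap = deceptive_acts.mean(axis=0) - honest_acts.mean(axis=0)
    pooled_sd = np.sqrt((honest_acts.var(axis=0, ddof=1)
                         + deceptive_acts.var(axis=0, ddof=1)) / 2.0)
    cohens_d = mean_gap / (pooled_sd + 1e-8)  # epsilon guards constant neurons
    return np.flatnonzero(np.abs(cohens_d) > min_effect), cohens_d

# Synthetic activations: only neuron 3 is shifted in the "deceptive" set.
rng = np.random.default_rng(0)
honest = rng.normal(size=(200, 5))
deceptive = rng.normal(size=(200, 5))
deceptive[:, 3] += 3.0

selected, d_values = find_different_neurons(honest, deceptive)
print(selected, np.round(d_values, 2))
```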

Scaling Circuit Discovery

For larger models, we need efficient methods:

def scalable_circuit_discovery(model, behavior_dataset, max_components=10000):
    """
    Discover circuits in large models efficiently
    """
    # Use sparse probing to identify candidate components
    candidates = sparse_probe_components(model, behavior_dataset)
    
    # Hierarchical search: start with coarse modules
    coarse_circuit = find_coarse_modules(model, candidates)
    
    # Refine to find specific components
    fine_circuit = refine_circuit(model, coarse_circuit, behavior_dataset)
    
    # Compress circuit to minimal form
    minimal_circuit = compress_circuit(fine_circuit, max_components)
    
    return minimal_circuit
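
The coarse-to-fine idea can be illustrated without a real model. In the sketch below, `effect_fn` stands in for any intervention-based score (for example, a patching effect) and is assumed to accept a list of components; the component names and the additive toy effects are hypothetical. Whole layers are tested first, and individual components are tested only inside layers whose group effect exceeds the threshold.

```python
def coarse_to_fine_search(layers, effect_fn, threshold):
    """Two-stage search over components.

    layers: dict mapping layer name -> list of component names.
    effect_fn: callable taking a list of components and returning a float.
    Returns (selected components, number of effect_fn evaluations).
    """
    selected, n_evals = [], 0
    for layer_name, components in layers.items():
        n_evals += 1
        if effect_fn(components) <= threshold:
            continue  # coarse stage: skip the whole layer
        for component in components:
            n_evals += 1
            if effect_fn([component]) > threshold:
                selected.append(component)
    return selected, n_evals

# Toy setup: 3 layers x 4 heads, with a single important head.
true_effect = {f"L{l}H{h}": 0.0 for l in range(3) for h in range(4)}
true_effect["L1H2"] = 0.6
layers = {f"L{l}": [f"L{l}H{h}" for h in range(4)] for l in range(3)}

def toy_effect(components):
    # Assumes effects add up, which real networks need not satisfy.
    return sum(true_effect[c] for c in components)

selected, n_evals = coarse_to_fine_search(layers, toy_effect, threshold=0.1)
print(selected, n_evals)
```

Here the search needs 7 evaluations instead of the 12 an exhaustive pass would use. The saving comes at a cost: a layer whose components have cancelling effects can be skipped at the coarse stage, so the threshold and grouping deserve care.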

Common Pitfalls

1. Confirmation Bias

Problem: Finding circuits that confirm preconceptions rather than true mechanisms

Solution: Pre-register hypotheses and use systematic search procedures

2. Overfitting to Specific Examples

Problem: Circuits that only work on the discovery dataset

Solution: Always validate on held-out data with diverse examples

3. Ignoring Backup Circuits

Problem: Models often have redundant implementations of important behaviors

Solution: Check for behavior restoration after circuit removal
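
A toy example shows why single-component ablations can miss redundancy. The model below is entirely hypothetical: head "A" is the primary mechanism, head "B" is a slightly weaker backup, and head "C" adds a small unrelated contribution. Ablating components one at a time makes both A and B look unimportant, while ablating them together removes almost all of the behavior.

```python
from itertools import combinations

def toy_model(active):
    """Hypothetical model: the behavior persists if either head A or B is active."""
    primary = 1.0 if "A" in active else 0.0
    backup = 0.8 if "B" in active else 0.0
    return max(primary, backup) + (0.1 if "C" in active else 0.0)

def ablation_drops(model_fn, components, group_size):
    """Drop in output when each group of `group_size` components is ablated."""
    full = model_fn(set(components))
    return {
        group: full - model_fn(set(components) - set(group))
        for group in combinations(components, group_size)
    }

components = ["A", "B", "C"]
singles = ablation_drops(toy_model, components, group_size=1)
pairs = ablation_drops(toy_model, components, group_size=2)
print(singles)
print(pairs)
```

Exhaustive pair ablation grows quadratically with the number of components, so in practice it is usually restricted to components already flagged by a cheaper method.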

4. Correlation vs Causation

Problem: Components that correlate with behavior but aren't causal

Solution: Use causal intervention methods, not just correlation

Hands-on Exercise: Discover Your Own Circuit

Choose a simple behavior in a small language model and discover its circuit:

  1. Select a behavior (e.g., completing common phrases, basic arithmetic)
  2. Create a dataset of examples exhibiting the behavior
  3. Use activation patching to identify important components
  4. Trace connections between components
  5. Validate the circuit on new examples

Starter code:

# Load a small model
model = load_model("gpt2-small")

# Define behavior: completing common phrases
test_phrases = [
    "The early bird gets the",  # -> "worm"
    "Actions speak louder than",  # -> "words"
    "Better late than",  # -> "never"
]

# Your task: implement circuit discovery
def discover_phrase_completion_circuit(model, phrases):
    # TODO: Implement discovery algorithm
    pass

Connections

Key Researchers

  • Chris Olah: Pioneer of circuits research (the Distill "Circuits" thread); co-founder of Anthropic
  • Neel Nanda: Mechanistic interpretability and circuit discovery
  • Arthur Conmy: Automated circuit discovery methods

Research Groups

  • Anthropic Interpretability Team: Leading circuit discovery research
  • Redwood Research: Applied circuit analysis for safety
  • MATS Scholars: Many working on circuit discovery projects