Advanced Git for Research

Version control best practices for collaborative AI safety research

⏱️ Intermediate

Advanced Git for AI Safety Research

Learning Objectives
Introduction
Core Concepts
Practical Applications
- Case Study: Multi-team Safety Evaluation Project
- Collaborative Research Patterns
Common Pitfalls and How to Avoid Them
Hands-on Exercise
Further Reading
Related Topics

Learning Objectives

By the end of this topic, you will be able to:

Implement specialized Git workflows designed for AI/ML research projects
Manage large model files and datasets efficiently using Git LFS and DVC
Design branching strategies that support parallel experimentation
Ensure reproducibility of AI experiments through proper version control
Secure sensitive data and API keys in collaborative research environments

Version control is fundamental to rigorous AI safety research, but standard Git workflows often fall short when dealing with the unique challenges of machine learning projects. This topic explores advanced Git techniques specifically tailored for AI safety researchers who need to manage large models, track experiments, ensure reproducibility, and maintain security while collaborating on sensitive projects.

The complexity of AI research—with its iterative experimentation, large binary files, and critical security requirements—demands a sophisticated approach to version control that goes beyond basic Git usage.

Core Concepts

1. ML-Specific Git Workflows

Traditional software development workflows like GitFlow need adaptation for machine learning projects. The ML Git workflow extends standard practices with specialized branch types:

main/
├── develop/
├── feature/new-safety-metric/
│   ├── experiment/baseline-evaluation/
│   ├── experiment/adversarial-testing/
│   └── experiment/robustness-checks/
├── data/dataset-v2/
├── model/transformer-variant/
└── training/distributed-setup/

Key Branch Types:

Feature branches: Implement new functionality or research directions
Experiment branches: Isolate specific experimental configurations
Data branches: Track dataset versions and preprocessing changes
Model branches: Manage architectural modifications
Training branches: Version training pipeline changes

This structure enables parallel experimentation without conflicts and maintains clear separation between different aspects of ML development.

2. Managing Large Files with Git LFS

AI models and datasets can quickly exceed Git's practical file size limits. Git Large File Storage (LFS) provides a solution:

Setup and Configuration:

# Install Git LFS
git lfs install

# Track ML file types
git lfs track "*.h5" "*.pkl" "*.pth" "*.onnx"
git lfs track "*.parquet" "*.tfrecord"

# Track specific large files
git lfs track "models/safety-classifier-v2.pth"

Best Practices:

Only track files that need versioning (not intermediate outputs)
Use .gitattributes to maintain consistent LFS tracking
Monitor storage usage: git lfs ls-files --size
Consider cloud storage integration for files >5GB

3. Experiment Tracking and Reproducibility

Reproducibility is crucial for AI safety research. Integrate Git with specialized ML tools:

Data Version Control (DVC):

# dvc.yaml
stages:
  prepare:
    cmd: python src/prepare_data.py
    deps:
      - src/prepare_data.py
      - data/raw/
    outs:
      - data/processed/
  
  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/processed/
    params:
      - train.learning_rate
      - train.batch_size
    outs:
      - models/model.pth
    metrics:
      - metrics.json

Integration with MLflow:

import mlflow

mlflow.set_tracking_uri("./mlruns")
mlflow.set_experiment("safety-evaluation")

with mlflow.start_run():
    # Log Git commit hash
    mlflow.set_tag("git_commit", 
                   subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip())
    
    # Log parameters and metrics
    mlflow.log_params(config)
    mlflow.log_metrics(evaluation_results)
    
    # Log model with Git context
    mlflow.pytorch.log_model(model, "model")

4. Jupyter Notebook Version Control

Notebooks present unique challenges for version control. Implement these solutions:

Using nbdime for meaningful diffs:

# Install nbdime
pip install nbdime

# Configure Git to use nbdime
nbdime config-git --enable

# View notebook diffs
nbdiff notebook1.ipynb notebook2.ipynb

Jupytext for plain-text synchronization:

# jupytext.toml
formats = "ipynb,py:percent"
notebook_metadata_filter = "-all"
cell_metadata_filter = "-all"

This maintains a .py version alongside each notebook for better version control.

5. Security and Sensitive Data Management

AI safety research often involves sensitive data and expensive API keys:

Git-secrets configuration:

# Install git-secrets
brew install git-secrets  # macOS
# or compile from source on Linux

# Initialize in repository
git secrets --install
git secrets --register-aws  # AWS credentials
git secrets --add "sk-[a-zA-Z0-9]{48}"  # OpenAI API keys
git secrets --add "anthropic_api_key"

Environment variable management:

# config.py
import os
from dotenv import load_dotenv

load_dotenv()

class Config:
    OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
    ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY")
    MODEL_ENDPOINT = os.getenv("MODEL_ENDPOINT")
    
    @classmethod
    def validate(cls):
        """Ensure all required secrets are present"""
        required = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY"]
        missing = [k for k in required if not getattr(cls, k)]
        if missing:
            raise ValueError(f"Missing required secrets: {missing}")

Practical Applications

Case Study: Multi-team Safety Evaluation Project

Consider a project evaluating LLM safety across multiple dimensions:

Repository Structure:

ai-safety-eval/
├── .github/
│   ├── workflows/          # CI/CD pipelines
│   └── CODEOWNERS         # Automatic reviewers
├── experiments/
│   ├── red-teaming/       # Attack experiments
│   ├── robustness/        # Robustness testing
│   └── interpretability/  # Model analysis
├── models/                # Model checkpoints (LFS)
├── data/                  # Datasets (DVC)
├── src/                   # Source code
└── notebooks/             # Research notebooks

Workflow Implementation:

Main branch protected with required reviews
Feature branches for new evaluation methods
Experiment branches for parameter tuning
Automated testing on pull requests
Security scanning for exposed keys

Collaborative Research Patterns

Fork and Pull Model:

Researchers fork the main repository
Develop features in isolation
Submit pull requests for review
Automated checks ensure quality

Shared Repository Model:

Direct collaboration on branches
Protected main branch
Required status checks
Automated deployment to test environments

Common Pitfalls and How to Avoid Them

Committing Large Files Directly
- Solution: Configure LFS before adding files
- Use pre-commit hooks to check file sizes
Exposing API Keys
- Solution: Implement git-secrets scanning
- Regular key rotation policies
- Use environment variables consistently
Notebook Merge Conflicts
- Solution: Clear outputs before committing
- Use nbdime for conflict resolution
- Consider jupytext for critical notebooks
Lost Experiments
- Solution: Integrate MLflow/W&B from the start
- Tag experiments with Git commits
- Document experiment purposes in commit messages
Inconsistent Environments
- Solution: Pin all dependencies
- Use Docker for reproducibility
- Version control environment files

Hands-on Exercise

Create a Git repository for an AI safety project with the following requirements:

Set up Git LFS for model files
Configure git-secrets to prevent API key exposure
Create a branching structure for experiments
Integrate DVC for dataset versioning
Set up pre-commit hooks for code quality
Configure nbdime for notebook diffs
Create a GitHub Action for automated testing

This exercise reinforces the key concepts while building a practical foundation for AI safety research.