A research framework for analyzing differences between language models using interpretability techniques. This project enables systematic comparison of base models and their variants (model organisms) through various diffing methodologies.
Note: The toolkit is based on a heavily modified version of the saprmarks/dictionary_learning repository, available at science-of-finetuning/dictionary_learning. Although we may eventually merge these repositories, this is currently not a priority due to significant divergence.
This framework consists of two main pipelines:
- Preprocessing Pipeline: Extract and cache activations from pre-existing models
- Diffing Pipeline: Analyze differences between models using interpretability techniques
The framework is designed to work with pre-existing model pairs (e.g., base models vs. model organisms) rather than training new models.
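For intuition, the sketch below shows the core idea behind the preprocessing step: capture hidden states at a chosen layer with a forward hook and cache them to disk. The model name, layer index, and file path are illustrative assumptions, not this repository's actual implementation.

```python
# Minimal sketch of activation extraction and caching (illustrative only;
# not this repository's implementation).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "google/gemma-3-1b-pt"  # assumed example model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

layer_idx = 6   # hypothetical layer whose activations we cache
captured = []

def cache_hook(module, inputs, output):
    # Decoder layers may return a tuple; hidden states come first.
    hidden = output[0] if isinstance(output, tuple) else output
    captured.append(hidden.detach().cpu())

handle = model.model.layers[layer_idx].register_forward_hook(cache_hook)
with torch.no_grad():
    batch = tokenizer(["An example sentence."], return_tensors="pt")
    model(**batch)
handle.remove()

torch.save(torch.cat(captured), "activations_layer6.pt")  # illustrative cache path
```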
- Clone the repository:

```bash
git clone https://github.com/science-of-finetuning/diffing-game
cd diffing-game
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

Run the complete pipeline (preprocessing + diffing) with default settings:

```bash
python main.py
```

Run preprocessing only (extract activations):

```bash
python main.py pipeline.mode=preprocessing
```

Run diffing analysis only (assumes activations already exist):

```bash
python main.py pipeline.mode=diffing
```

Analyze specific organism and model combinations:

```bash
python main.py organism=caps model=gemma3_1B
```

Use different diffing methods:

```bash
python main.py diffing/method=kl
python main.py diffing/method=normdiff
```

Run experiments across multiple configurations:

```bash
python main.py --multirun organism=caps,roman_concrete model=gemma3_1B
```

Run with different diffing methods:

```bash
python main.py --multirun diffing/method=kl,normdiff
```
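These key=value arguments are Hydra-style overrides: `diffing/method=kl` selects a config group, and `--multirun` sweeps over comma-separated values. As a rough sketch of the config layout such overrides imply (directory names, fields, and default values here are assumptions, not the repository's actual schema):

```yaml
# configs/config.yaml (hypothetical Hydra layout matching the overrides above)
defaults:
  - model: gemma3_1B      # resolves to configs/model/gemma3_1B.yaml
  - organism: caps        # resolves to configs/organism/caps.yaml
  - diffing/method: kl    # resolves to configs/diffing/method/kl.yaml

pipeline:
  mode: full              # assumed values: preprocessing | diffing | full
```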
The framework includes a Streamlit-based interactive dashboard for visualizing and exploring model diffing results.

- Dynamic Discovery: Automatically detects available models, organisms, and diffing methods
- Real-time Visualization: Interactive plots and visualizations of diffing results
- Model Integration: Direct links to Hugging Face model pages
- Multi-method Support: Compare results across different diffing methodologies
- Interactive Model Testing: Test custom inputs and steering vectors on both base and finetuned models in real-time (see the sketch after this list for the steering idea)
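Steering here refers to adding a direction to a model's residual stream during generation. A minimal, self-contained sketch of that idea follows; the model name, layer index, steering vector, and scale are placeholders, not values used by the dashboard:

```python
# Illustrative sketch of applying a steering vector during generation
# (placeholders throughout; not the dashboard's implementation).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "google/gemma-3-1b-it"  # assumed; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

layer_idx = 6                                             # hypothetical layer to steer
steering_vector = torch.randn(model.config.hidden_size)  # stand-in vector
scale = 4.0                                               # steering strength

def add_steering(module, inputs, output):
    # Decoder layers return a tuple; the hidden states are the first element.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + scale * steering_vector.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[layer_idx].register_forward_hook(add_steering)
try:
    ids = tokenizer("The weather today is", return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=20)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # always detach the hook afterwards
```

Returning a modified output from a forward hook replaces the layer's output, so the added direction propagates through the rest of the network.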
Launch the dashboard with:
```bash
streamlit run dashboard.py
```

The dashboard will be available at http://localhost:8501 by default.
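To serve it on a different port, use Streamlit's standard `--server.port` flag:

```bash
streamlit run dashboard.py --server.port 8502
```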
You can also pass configuration overrides to the dashboard:

```bash
streamlit run dashboard.py -- model.dtype=float32
```

Within the dashboard:

- Select Base Model: Choose from available base models
- Select Organism: Pick the model organism (finetuned variant)
- Select Diffing Method: Choose the analysis method to visualize
- Explore Results: Interact with the generated visualizations
Note that the dashboard can only visualize results from diffing experiments you have already run.
To reproduce the experiments from the paper:
```bash
bash narrow_ft_experiments/run.sh
```

To run the agents on all models, run:

```bash
bash narrow_ft_experiments/agents.sh
```

The scripts assume you are running on a SLURM cluster; please adapt them to your environment as needed.
The relevant code for the Activation Difference Lens is found at src/diffing/methods/activation_difference_lens, with shared utilities in src/utils. Plotting scripts are located under narrow_ft_experiments/plotting/, and the statistical evaluation of agent performance using HiBayes can be found in narrow_ft_experiments/hibayes/.
