Mechanistic Interpretability Research

This repository contains implementations for mechanistic interpretability research, focusing on understanding the internal representations and circuits of transformer language models.

Components

Sparse Autoencoder (`sparse-autoencoder/`)

Implementation of sparse autoencoders for analyzing transformer activations. Features include:

Training sparse autoencoders on MLP layer activations
8x expansion factor with L1 sparsity penalty
Token-aligned activation extraction and analysis
Neuron patching and intervention experiments
Comprehensive training monitoring and visualization
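
As a rough illustration of the 8x expansion factor, the L1 sparsity penalty, and feature patching listed above, here is a minimal sketch in plain Python (no dependencies, so it is far slower than a tensor-library version). It is not taken from `sparse-autoencoder/`; the ReLU encoder, the Gaussian initialisation, the default `l1_coeff`, and every function name below are assumptions made for the example.

```python
import random

def make_sae(d_model, expansion=8, seed=0):
    """Randomly initialise a d_model -> expansion * d_model -> d_model autoencoder.

    All names and the initialisation scheme are hypothetical.
    """
    rng = random.Random(seed)
    d_hidden = expansion * d_model  # 8x expansion by default
    scale = 1.0 / d_model ** 0.5
    return {
        "W_enc": [[rng.gauss(0, scale) for _ in range(d_model)] for _ in range(d_hidden)],
        "b_enc": [0.0] * d_hidden,
        "W_dec": [[rng.gauss(0, scale) for _ in range(d_hidden)] for _ in range(d_model)],
        "b_dec": [0.0] * d_model,
    }

def matvec(W, x, b):
    """Compute W @ x + b for a list-of-rows matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def encode(sae, x):
    # Assumed ReLU encoder: feature activations are non-negative.
    return [max(0.0, v) for v in matvec(sae["W_enc"], x, sae["b_enc"])]

def decode(sae, f):
    return matvec(sae["W_dec"], f, sae["b_dec"])

def sae_loss(sae, x, l1_coeff=1e-3):
    """Return (total, mse, l1): reconstruction error plus weighted L1 penalty."""
    f = encode(sae, x)
    x_hat = decode(sae, f)
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    l1 = sum(abs(v) for v in f)  # L1 norm of the feature activations
    return mse + l1_coeff * l1, mse, l1

def patched_reconstruction(sae, x, feature_idx, value=0.0):
    """Overwrite one feature activation (0.0 ablates it) and decode the result."""
    f = encode(sae, x)
    f[feature_idx] = value
    return decode(sae, f)

# Toy usage on a made-up 4-dimensional "activation" vector.
sae = make_sae(d_model=4)
total, mse, l1 = sae_loss(sae, [1.0, -2.0, 0.5, 0.0])
print(len(sae["W_enc"]), total >= mse)
```

Comparing `decode(sae, encode(sae, x))` with `patched_reconstruction(sae, x, i)` gives a simple measure of how much feature `i` contributes to the reconstruction of `x`.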

See sparse-autoencoder/README.md for detailed documentation.

Circuit Tracer Playground (`circuit-tracer-playground/`)

A safety-focused framework for circuit discovery and analysis, built on a circuit-tracing library. Features include:

Automated circuit pattern mining across model layers
Safety benchmark evaluation (deception, harmful content, manipulation, power-seeking)
Contrasting example analysis for safety feature identification
Interactive dashboard for circuit visualization
Graph-based circuit representation and analysis
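
One simple way to approach the contrasting example analysis listed above is to compare mean feature activations between two prompt sets and rank features by the size of the gap. The sketch below assumes that approach; it is not the playground's actual method, and the function names and the toy activation values are hypothetical.

```python
from statistics import mean

def contrast_scores(pos_acts, neg_acts):
    """Per-feature difference of mean activation between two prompt sets.

    Each argument is a list of equal-length activation vectors, one per prompt.
    """
    n_features = len(pos_acts[0])
    return [
        mean(row[j] for row in pos_acts) - mean(row[j] for row in neg_acts)
        for j in range(n_features)
    ]

def top_features(scores, k=3):
    """Indices of the k features with the largest absolute contrast."""
    return sorted(range(len(scores)), key=lambda j: abs(scores[j]), reverse=True)[:k]

# Toy activations (made up): rows are prompts, columns are features.
flagged = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
benign = [[0.0, 0.0, 2.0], [0.0, 1.0, 2.0]]

scores = contrast_scores(flagged, benign)
print(top_features(scores, k=2))
```

A large positive score marks a feature that is more active on the first set, a large negative score one that is more active on the second. A mean difference is only a first-pass filter, so candidates found this way would still need to be checked with interventions.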

See circuit-tracer-playground/README.md for detailed documentation.


Repository: danielekp/mechanistic-interpretability
