Mechanistic Interpretability
Latent mechanistic interpretability
Mechanistic interpretability Benchmarking SAE
This project seeks to provide a benchmark for evaluating the extent to which a model is mechanistically interpretable.
Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting Large Language Models
ITDA LLMs SAEs Mechanistic Interpretability Representations Model Diffing
Patrick Leask, Noura Al Moubayed, and Neel Nanda debut ITDA, a new mechanistic interpretability approach that is 100x faster to train than SAEs, at ICML'25.
Sparse Autoencoders Do Not Find Canonical Units of Analysis
SAEs Mechanistic Interpretability Representations
Patrick Leask and Noura Al Moubayed present a paper and poster on SAEs at ICLR'25.
Stitching Sparse Autoencoders of Different Sizes
SAE Stitching Stitching SAE Sparsity Autoencoders Latents Mechanistic Interpretability
Patrick Leask and Noura Al Moubayed introduce SAE stitching, a new method for mechanistic interpretability, in a poster at NeurIPS 2024.
BatchTopK Sparse Autoencoders
BatchTopK SAE Sparsity Autoencoders Mechanistic Interpretability Architecture
Patrick Leask contributes to BatchTopK, a new SAE architecture introduced in a NeurIPS'24 poster.
Worldbuilding
Worldbuilding Technology Design Intelligence Mechanistic interpretability
Eleanor Dare creates an intra-active arts installation for the 4th International Conference on Possibility Studies.