Representations

Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting Large Language Models

Jul 16, 2025 ITDA LLMs SAEs Mechanistic Interpretability Representations Model Diffing

Patrick Leask, Noura Al Moubayed, and Neel Nanda debut ITDA, a new mechanistic interpretability approach that is 100x faster to train than SAEs, at ICML'25

Sparse Autoencoders Do Not Find Canonical Units of Analysis

Apr 24, 2025 SAEs Mechanistic Interpretability Representations

Patrick Leask and Noura Al Moubayed present a paper and poster on SAEs at ICLR'25