ITDA
Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting Large Language Models
ITDA, LLMs, SAEs, Mechanistic Interpretability, Representations, Model Diffing
At ICML'25, Patrick Leask, Noura Al Moubayed, and Neel Nanda debut ITDA, a new mechanistic interpretability approach that is 100x faster to train than SAEs.