Publication (Conference poster)
2025-07-16
Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting Large Language Models
Patrick Leask, Neel Nanda, and Noura Al Moubayed debut ITDA, a new mechanistic interpretability approach that is 100x faster to train than SAEs, at ICML 2025
Leask, P., Nanda, N., & Al Moubayed, N. (2025, July). Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting Large Language Models. International Conference on Machine Learning. https://icml.cc/virtual/2025/poster/46477
Abstract
Sparse Autoencoders (SAEs) are a popular method for decomposing Large Language Model (LLM) activations into interpretable latents; however, they have a substantial training cost, and SAEs learned on different models are not directly comparable. Motivated by relative representation similarity measures, we introduce Inference-Time Decomposition of Activations (ITDA) models. ITDAs are constructed by greedily sampling activations into a dictionary: an activation is added whenever its matching pursuit reconstruction error exceeds a threshold. ITDAs can be trained in 1% of the time required for SAEs, allowing us to cheaply train them on Llama 3.1 70B and 405B. ITDA dictionaries also enable cross-model comparisons, outperforming existing methods such as CKA, SVCCA, and a relative representation method on a benchmark of representation similarity.
Code available at https://github.com/pleask/itda.
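The linked repository contains the authors' implementation; the sketch below is only a minimal NumPy illustration of the greedy construction described in the abstract. The function names (`matching_pursuit`, `build_itda_dictionary`) and the defaults (`n_steps`, `error_threshold`) are illustrative assumptions, not taken from the paper or its released code.

```python
import numpy as np


def matching_pursuit(x, dictionary, n_steps=8):
    """Greedily reconstruct x from dictionary atoms (rows, unit-normalized);
    return the residual the dictionary cannot explain."""
    residual = x.astype(float).copy()
    for _ in range(n_steps):
        if dictionary.shape[0] == 0:
            break
        scores = dictionary @ residual          # correlation with each atom
        best = int(np.argmax(np.abs(scores)))   # most correlated atom
        residual = residual - scores[best] * dictionary[best]
    return residual


def build_itda_dictionary(activations, error_threshold=0.1, n_steps=8):
    """Greedily sample activations into a dictionary: an activation is kept
    as a new atom whenever its relative matching pursuit reconstruction
    error exceeds the threshold."""
    dim = activations.shape[1]
    dictionary = np.empty((0, dim))
    for x in activations:
        residual = matching_pursuit(x, dictionary, n_steps)
        if np.linalg.norm(residual) / np.linalg.norm(x) > error_threshold:
            atom = x / np.linalg.norm(x)        # store the unit-normalized activation
            dictionary = np.vstack([dictionary, atom])
    return dictionary


# Toy usage on synthetic data; real inputs would be LLM activations
# collected during inference.
rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 64))
print(build_itda_dictionary(acts, error_threshold=0.5).shape)
```

Because construction is a single greedy pass with no gradient updates, its cost scales with the number of activations sampled rather than with optimization epochs, which is consistent with the reported 100x training speedup over SAEs.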