SAEs
Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting Large Language Models
ITDA LLMs SAEs Mechanistic Interpretability Representations Model Diffing
Patrick Leask, Noura Al Moubayed, and Neel Nanda debut ITDA, a new mechanistic interpretability approach that is 100x faster to train than SAEs, at ICML'25
Sparse Autoencoders Do Not Find Canonical Units of Analysis
SAEs Mechanistic Interpretability Representations
Patrick Leask and Noura Al Moubayed present a paper and poster on SAEs at ICLR'25
Calendar feature geometry in GPT-2 layer 8 residual stream SAEs
LLMs SAEs GPT2 Geometry
Patrick Leask, Bart Bussmann, and Neel Nanda take a close look at GPT-2’s SAE feature geometry on the AI Alignment Forum