Mechanistic Interpretability

This project seeks to provide a benchmark for evaluating how mechanistically interpretable a model is.

Patrick Leask, Noura Al Moubayed, and Neel Nanda debut ITDA—a new mechanistic interpretability approach 100x faster to train than SAEs—at ICML'25

Patrick Leask and Noura Al Moubayed present a paper and poster on SAEs at ICLR'25

Patrick Leask and Noura Al Moubayed introduce SAE stitching, a new method for mechanistic interpretability, in a poster at NeurIPS 2024.

Patrick Leask contributes to BatchTopK, a new SAE architecture introduced in a NeurIPS'24 poster.

Eleanor Dare creates an intra-active arts installation for the 4th International Conference on Possibility Studies