Mechanistic Interpretability

Latent mechanistic interpretability

Mechanistic interpretability Benchmarking SAE

This project seeks to provide a benchmark for evaluating the extent to which a model is mechanistically interpretable.

Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting Large Language Models

ITDA LLMs SAEs Mechanistic Interpretability Representations Model Diffing

Patrick Leask, Noura Al Moubayed, and Neel Nanda debut ITDA—a new mechanistic interpretability approach 100x faster to train than SAEs—at ICML'25

Sparse Autoencoders Do Not Find Canonical Units of Analysis

SAEs Mechanistic Interpretability Representations

Patrick Leask and Noura Al Moubayed present a paper and poster on SAEs at ICLR'25

Stitching Sparse Autoencoders of Different Sizes

SAE Stitching Stitching SAE Sparsity Autoencoders Latents Mechanistic Interpretability

Patrick Leask and Noura Al Moubayed introduce SAE stitching, a new method for mechanistic interpretability, in a poster at NeurIPS 2024.

BatchTopK Sparse Autoencoders

BatchTopK SAE Sparsity Autoencoders Mechanistic Interpretability Architecture

Patrick Leask contributes to BatchTopK, a new SAE architecture introduced in a NeurIPS'24 poster.

Worldbuilding

Worldbuilding Technology Design Intelligence Mechanistic interpretability

Eleanor Dare creates an intra-active arts installation for the 4th International Conference on Possibility Studies