Mechanistic Interpretability

Sparse Autoencoders Do Not Find Canonical Units of Analysis

SAEs Mechanistic Interpretability Representations

Patrick Leask and Noura Al Moubayed present a paper and poster on SAEs at ICLR'25.

Latent mechanistic interpretability

Mechanistic interpretability Benchmarking SAE

This project seeks to provide a benchmark for evaluating the extent to which a model is mechanistically interpretable.

Stitching Sparse Autoencoders of Different Sizes

SAE Stitching Stitching SAE Sparsity Autoencoders Latents Mechanistic Interpretability

Patrick Leask and Noura Al Moubayed introduce SAE stitching, a new method for mechanistic interpretability, in a poster at NeurIPS 2024.

BatchTopK Sparse Autoencoders

BatchTopK SAE Sparsity Autoencoders Mechanistic Interpretability Architecture

Patrick Leask contributes to BatchTopK, a new SAE architecture introduced in a NeurIPS'24 poster.

Worldbuilding

Worldbuilding Technology Design Intelligence Mechanistic interpretability

Eleanor Dare creates an intra-active arts installation for the 4th International Conference on Possibility Studies.