Calendar feature geometry in GPT-2 layer 8 residual stream SAEs

Patrick Leask, Bart Bussmann, and Neel Nanda take a close look at GPT-2’s SAE feature geometry on the AI Alignment Forum

AI Alignment Forum / MATS Program / LessWrong

Patrick Leask

Latent Mechanistic Interpretability LLMs SAEs GPT2 Geometry

TL;DR: We demonstrate that the decoder directions of GPT-2 SAEs are highly structured. We find a historical date direction in decoder space; projecting features that are not date-related onto this direction lets us read off their historical time period by comparing their projections with those of year features.
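
To make the projection idea concrete, here is a minimal sketch on synthetic data. It is not the authors' code. The TL;DR does not say how the date direction is found, so the covariance-based fit below is an assumption, and the helper names (`fit_date_direction`, `estimate_period`, `synthetic_demo`) are hypothetical. In practice the rows of `year_vecs` would be SAE decoder directions for features that fire on specific years.

```python
import numpy as np


def fit_date_direction(year_vecs, years):
    """Return a unit vector along which the year features are ordered.

    Assumption: the direction is estimated as the covariance between the
    decoder vectors and their year labels. The post may use another method.
    """
    years = np.asarray(years, dtype=float)
    centered_vecs = year_vecs - year_vecs.mean(axis=0)
    centered_years = years - years.mean()
    direction = centered_vecs.T @ centered_years
    return direction / np.linalg.norm(direction)


def estimate_period(feature_vec, direction, year_vecs, years):
    """Read off a year for a feature by comparing projections.

    The feature's projection is interpolated between the projections of the
    year features, which act as a ruler along the date direction.
    """
    years = np.asarray(years, dtype=float)
    year_proj = year_vecs @ direction
    order = np.argsort(year_proj)  # np.interp needs increasing x values
    return float(np.interp(feature_vec @ direction, year_proj[order], years[order]))


def synthetic_demo(seed=0, d_model=16):
    """Build toy 'decoder directions' with a planted date axis and query it."""
    rng = np.random.default_rng(seed)
    true_axis = rng.normal(size=d_model)
    true_axis /= np.linalg.norm(true_axis)

    years = np.arange(1700, 2001, 50)
    # Year features lie along the planted axis, plus small noise.
    year_vecs = ((years - 1850) / 150.0)[:, None] * true_axis
    year_vecs = year_vecs + 0.01 * rng.normal(size=year_vecs.shape)

    # A feature that is not a year feature: mostly orthogonal content,
    # with a date component planted at the position corresponding to 1900.
    other = rng.normal(size=d_model)
    other -= (other @ true_axis) * true_axis
    other = 0.5 * other / np.linalg.norm(other)
    feature_vec = ((1900 - 1850) / 150.0) * true_axis + other

    direction = fit_date_direction(year_vecs, years)
    return estimate_period(feature_vec, direction, year_vecs, years)


if __name__ == "__main__":
    print(synthetic_demo())
```

The interpolation step mirrors the "comparison to year features" in the TL;DR: the projection alone is a unitless number, and it only becomes a time period once it is placed between the projections of features with known years.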