Latent mechanistic interpretability
This project seeks to provide a benchmark for evaluating the extent to which a model is mechanistically interpretable
Mechanistic interpretability is an increasingly popular approach to evaluating neural systems through direct study of a model's parameters. So far, the field has relied on resource-intensive manual investigation of neural networks, in which the algorithms behind individual subtasks, rather than entire networks, have been characterised. A key open question is therefore how mechanistic interpretability can be used to solve real-world interpretability tasks at the level of whole models.
The latent mechanistic interpretability project seeks to provide a benchmark for evaluating the extent to which a particular model is mechanistically interpretable, by training models to perform interpretability tasks on models that solve real-world problems. The benchmark can then be used in the evaluation and design of models to ensure that they are internally explainable.
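How such a benchmark could be set up is easiest to see with a toy example. The sketch below is only an illustration under assumed details (the target architecture, the two toy tasks, and all names are hypothetical, not the project's actual setup): small "target" MLPs are trained on one of two toy functions, and an "interpreter" network is then trained to recover which function each target computes from its flattened weights alone.

```python
# Illustrative sketch only (assumed architectures and tasks): train an
# "interpreter" network to read the flattened weights of small target MLPs
# and predict which of two toy functions each target was trained to compute.
import torch
import torch.nn as nn

def train_target(seed: int) -> tuple[nn.Module, int]:
    """Train a small target MLP on one of two toy regression tasks.
    The task identity (0 or 1) is the label the interpreter must recover."""
    torch.manual_seed(seed)
    label = seed % 2
    target = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
    opt = torch.optim.Adam(target.parameters(), lr=1e-2)
    for _ in range(200):
        x = torch.randn(64, 4)
        y = x.sum(dim=1, keepdim=True) * (1 if label == 0 else -1)
        loss = nn.functional.mse_loss(target(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return target, label

def flatten_params(model: nn.Module) -> torch.Tensor:
    """Concatenate all parameters of a model into a single vector."""
    return torch.cat([p.detach().flatten() for p in model.parameters()])

# Build a dataset of (flattened target weights, task label) pairs.
targets = [train_target(s) for s in range(128)]
X = torch.stack([flatten_params(m) for m, _ in targets])
y = torch.tensor([lbl for _, lbl in targets], dtype=torch.float32)

# The interpreter maps a target's parameters to the predicted task label.
interpreter = nn.Sequential(nn.Linear(X.shape[1], 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(interpreter.parameters(), lr=1e-3)
for epoch in range(50):
    loss = nn.functional.binary_cross_entropy_with_logits(
        interpreter(X).squeeze(-1), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

with torch.no_grad():
    accuracy = ((interpreter(X).squeeze(-1) > 0).float() == y).float().mean()
print(f"interpreter training accuracy: {accuracy.item():.2f}")
```

In the benchmark proper, the targets would be models trained on real-world problems and the interpretability tasks would go beyond recovering a single label; the sketch only shows the overall shape of training one model to interpret another.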
So far, experiments have been conducted on toy datasets, with encouraging results on tabular problems. Current research focuses on:
- Transferring methods to the vision domain;
- Scaling the maximum size of networks that can be interpreted.
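Scaling matters because an interpreter that reads raw flattened weights has an input dimension that grows with the target's parameter count. As a hedged illustration of one generic workaround (an assumption for exposition, not the project's documented approach), the sketch below summarises each parameter tensor with a handful of fixed-size statistics, so targets of different sizes map to feature vectors of the same length.

```python
# Illustrative sketch only: replace naive weight flattening with fixed-size
# per-layer statistics, so the interpreter's input size does not grow with
# the number of parameters in the target network.
import torch
import torch.nn as nn

def layer_summary(w: torch.Tensor) -> torch.Tensor:
    """Fixed-size summary of one parameter tensor: mean, mean |w|, min, max, norm."""
    w = w.detach().flatten()
    return torch.stack([w.mean(), w.abs().mean(), w.min(), w.max(), w.norm()])

def featurise(model: nn.Module, max_layers: int = 16) -> torch.Tensor:
    """Summarise up to `max_layers` parameter tensors; pad with zeros so the
    feature vector has the same length for small and large targets."""
    feats = [layer_summary(p) for p in model.parameters()][:max_layers]
    while len(feats) < max_layers:
        feats.append(torch.zeros(5))
    return torch.cat(feats)

small = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
large = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 10))
print(featurise(small).shape, featurise(large).shape)  # both torch.Size([80])
```

The final line shows that a small and a much larger target produce feature vectors of identical length, which is what would allow a single interpreter to be applied across target sizes; richer, structure-aware featurisations are of course possible.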