Mechanistic interpretability

Interested in working on this research direction? Apply for our coaching

This post is likely to be most relevant for students studying maths and computer science, although other STEM subjects may also offer scope for you to do related research. If you want to read about AI safety research more broadly, start with our research direction ‘human-aligned artificial intelligence’.

What is mechanistic interpretability?

AI interpretability is the study of what goes on inside artificial neural networks: what behaviours they’ve learned, and why they do what they do. This is a difficult problem because these networks are trained by being given a bunch of data and a task, which they gradually learn to perform well. By default, we have no idea how they’ve learned to do their task well!

Mechanistic interpretability is a subfield of AI interpretability that focuses on reverse-engineering neural networks to figure out what algorithms they’ve learned in order to perform well on a task. This seems particularly important for alignment, because it could let us interpret what goals a model may have learned. Further, it lets us distinguish between multiple possible ways of implementing a behaviour, which is essential for alignment: looking aligned could be implemented either by actually being aligned, or by learning to deceive your supervisors and show them what they want to see!

What research areas are there?

A good starting point for understanding real models is taking tiny language models (e.g. with only one layer) and trying to fully reverse-engineer them. An example of this work is Anthropic’s A Mathematical Framework for Transformer Circuits, which discovered important algorithms such as induction heads, which also appear in much larger models.
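The behaviour an induction head implements can be sketched in a few lines of ordinary code. This is a hypothetical illustration of the algorithm, not the network itself: on seeing the pattern [A][B] … [A], predict [B] by finding the previous occurrence of the current token and copying what followed it.

```python
# Hypothetical sketch of the *behaviour* an induction head learns,
# written as plain code rather than as attention-head weights:
# on [A][B] ... [A], predict [B].

def induction_predict(tokens):
    """Predict the next token by copying what followed the most recent
    previous occurrence of the current token; None if there is none."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):  # scan backwards
        if tokens[i] == current:
            return tokens[i + 1]  # copy the token that followed it
    return None

print(induction_predict(["the", "cat", "sat", "on", "the"]))  # -> cat
```

In a real two-layer transformer, this lookup is implemented by a previous-token head in the first layer composing with the induction head’s attention pattern in the second.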

Another approach is taking non-toy language models like GPT-2, isolating some behaviour of the network, and trying to find and interpret the subgraph of the network that performs this task. An example of this work is Redwood Research’s Interpretability in the Wild (and see the walkthrough here), which discovered a 25-head circuit in GPT-2 Small by which it performs the grammatical task of indirect object identification.

A phenomenon that makes interpreting language models harder is superposition, where a network simulates a much larger network by compressing more features than it has neurons into its lower-dimensional space. A crucial open problem is understanding superposition better, and figuring out how to resolve it or deal with it. An example of this work is Anthropic’s Toy Models of Superposition, where they exhibit a toy network that engages in superposition and use it to study the phenomenon.
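A minimal numerical sketch of the idea (my own toy example, much simpler than the setups in the paper): three features stored as directions 120° apart in a two-dimensional space necessarily interfere with each other, yet as long as the features are sparse, a ReLU readout still recovers the active feature.

```python
import numpy as np

# Toy illustration of superposition: store 3 features in a 2-dimensional
# space using three unit vectors 120 degrees apart. (A simplified sketch,
# not Anthropic's exact setup.)
angles = np.array([0, 2 * np.pi / 3, 4 * np.pi / 3])
W = np.stack([np.cos(angles), np.sin(angles)])  # shape (2, 3)

# Every pair of feature directions overlaps (interference)...
overlaps = W.T @ W
print(np.round(overlaps, 2))  # off-diagonal entries are -0.5, not 0

# ...but if features are sparse (only one active at a time), a ReLU
# readout still recovers the active feature.
x = np.array([1.0, 0.0, 0.0])            # only feature 0 active
hidden = W @ x                           # compress into 2 dims
recovered = np.maximum(W.T @ hidden, 0)  # ReLU readout
print(np.round(recovered, 2))            # feature 0 recovered, others suppressed
```

The interference terms only become a problem when many features fire at once, which is why superposition is a good strategy precisely when features are sparse.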

Another angle is to practice reverse-engineering simpler algorithmic problems, and to use this to refine our techniques and understanding of model internals in a clean setting where we know the ground-truth solution. An example of this work is Neel Nanda’s Progress Measures for Grokking via Mechanistic Interpretability, where they reverse-engineer a network trained to perform modular addition, and find that it learns an algorithm based on Fourier transforms and trigonometric identities.
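The core trigonometric trick can be checked numerically. This is a simplified sketch of the learned algorithm (with arbitrarily chosen frequencies), not the trained network: score each candidate answer c by cos(w(a + b − c)) summed over a few frequencies w, expanded via the identity cos(x − y) = cos x cos y + sin x sin y; the score peaks exactly at c = (a + b) mod p.

```python
import numpy as np

# Numerical check of the trig-identity trick behind the learned modular
# addition algorithm (simplified sketch, not the actual network).
p = 113
a, b = 47, 92
freqs = 2 * np.pi * np.array([1, 5, 17]) / p  # a few arbitrary frequencies

c = np.arange(p)
# cos(w(a+b-c)) expanded via cos(x-y) = cos(x)cos(y) + sin(x)sin(y),
# which is the kind of product term the network computes.
scores = sum(
    np.cos(w * (a + b)) * np.cos(w * c) + np.sin(w * (a + b)) * np.sin(w * c)
    for w in freqs
)
print(int(c[np.argmax(scores)]), (a + b) % p)  # both 26
```

Each frequency’s score is maximised when c ≡ a + b (mod p), so summing over several frequencies makes the correct answer stand out sharply from the rest.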

An early focus of the field was on understanding image classification models, and how they learned to extract features from images and to process them. An example of this work is OpenAI’s Curve Circuits, where they study a set of neurons that detect curves, understand the algorithm well enough to write the neuron weights by hand, and replace these neurons with their hand-crafted set.
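As a toy analogue of writing neuron weights by hand (my own illustration, far simpler than the Curve Circuits neurons), a “detector” can be a hand-written template whose activation is its correlation with an image patch; here a diagonal stroke stands in for a curve.

```python
import numpy as np

# Hand-crafted "neuron": the weights are a template written by hand, and
# the activation is the correlation with an image patch. (A toy analogue
# of the idea; the Curve Circuits neurons are far more sophisticated.)
curve = np.array([
    [0, 0, 1],
    [0, 1, 0],
    [1, 0, 0],
], dtype=float)                 # hand-written diagonal stroke as the template
weights = curve - curve.mean()  # zero-mean, so flat patches score 0

patch_with_curve = curve
patch_flat = np.ones((3, 3))

print(np.sum(weights * patch_with_curve))  # strong positive activation
print(np.sum(weights * patch_flat))        # 0 on a featureless patch
```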

How to tell if you might be a good fit

Mechanistic interpretability is a fundamentally interdisciplinary field that can benefit from a range of backgrounds. There’s a diverse range of projects: reasoning through the mathematical structure of a network, for mathematicians; working out the algorithms it may have learned, for computer scientists; performing careful empirical work to discover what is going on inside a model while avoiding self-deception, for scientists; and engaging with a network and disentangling its internal complexity, for those with a background in complex systems.

Most projects will involve coding, but there’s room for some more qualitative projects, or projects analysing theoretical and mathematical questions. However, it will likely take more creativity to find a way of working on this area within the boundaries of academic disciplines outside of computer science.

How to get involved

Neel Nanda’s 200 Concrete Open Problems in Mechanistic Interpretability is a literature review that lists many open problems in the field that could be good theses! The first post outlines how to get started in the field, the key skills to learn and good resources for learning these. Neel Nanda’s Mechanistic Interpretability Explainer is a good resource to look up unfamiliar terms as you explore the field.


This profile was published 21/02/2023. Thanks to Neel Nanda for writing this profile.

The image used in this article is available under CC-BY 4.0 and can be found here:
