Mechanistic interpretability

This post is likely to be most relevant for students studying maths and computer science, although other STEM subjects may also offer scope for you to do related research. If you want to read about AI safety research more broadly, start with our profile on Human-aligned artificial intelligence.

What is mechanistic interpretability?

AI interpretability is the study of what goes on inside of artificial neural networks, what behaviours they’ve learned, and why they do what they do. This is a difficult problem because these networks are trained by being given a bunch of data and a task, which they learn to perform well over time – by default we have no idea how they’ve learned to do their task well! 

Mechanistic interpretability is a subfield of AI interpretability that focuses on reverse-engineering neural networks to work out what algorithms they have learned in order to perform well on a task. This seems particularly important for alignment, because it could let us interpret what goals a model may have learned. It also lets us distinguish between multiple possible implementations of the same behaviour. That matters for alignment because a model that looks aligned could either actually be aligned, or have learned to deceive its supervisors and show them what they want to see!

What research areas are there?

A good starting point for understanding real models is to take tiny language models (e.g. with only one or two layers) and try to fully reverse-engineer them. An example of this work is Anthropic’s A Mathematical Framework for Transformer Circuits, which discovered important algorithms such as induction heads, which also appear in much larger models.
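To make the idea of an induction head concrete, here is a minimal sketch of the behaviour usually attributed to one: when the current token has appeared earlier in the context, predict the token that followed it last time. This only mimics the input-output behaviour in plain Python. It is not how a transformer implements it with attention, and the function name is hypothetical.

```python
def induction_prediction(tokens):
    """Guess the next token by copying whatever followed the most recent
    earlier occurrence of the current (last) token.

    Returns None when the current token has not been seen before.
    Illustrative helper only, not taken from any paper or library.
    """
    if not tokens:
        return None
    current = tokens[-1]
    # Scan earlier positions, from most recent to oldest.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            # Copy the token that came right after the earlier match.
            return tokens[i + 1]
    return None


if __name__ == "__main__":
    context = ["Mr", "Dursley", "said", "that", "Mr"]
    print(induction_prediction(context))
```

In a real model this behaviour is thought to arise from attention heads in different layers working together, which is one reason a single-layer model is not enough to exhibit it.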

Another approach is to take a non-toy language model like GPT-2, isolate some behaviour of the network, and try to find and interpret the subgraph of the network that performs this task. An example of this work is Redwood Research’s Interpretability in the Wild (see also the walkthrough here), which discovered a 25-head circuit in GPT-2 Small that performs the grammatical task of indirect object identification.

A phenomenon that makes interpreting language models harder is superposition, where a network simulates a much larger network by compressing more features than it has neurons into its lower-dimensional space. A crucial open problem is understanding this better, and working out how to resolve it or deal with it. An example of this work is Anthropic’s Toy Models of Superposition, which exhibits a toy network that engages in superposition and uses it to study the phenomenon.
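The following is a small sketch of what superposition can look like, under simplifying assumptions of my own: three features are stored in a two-dimensional space along unit vectors 120 degrees apart, and read back with a ReLU. The directions, function names and the absence of a bias term are illustrative choices, not the setup from any particular paper.

```python
import math

N_FEATURES = 3  # more features than hidden dimensions (2)

# Assumed feature directions: three unit vectors in 2D, 120 degrees apart.
DIRECTIONS = [
    (math.cos(2 * math.pi * k / N_FEATURES), math.sin(2 * math.pi * k / N_FEATURES))
    for k in range(N_FEATURES)
]


def compress(features):
    """Project a 3-feature vector down to a 2D hidden vector."""
    x = sum(f * d[0] for f, d in zip(features, DIRECTIONS))
    y = sum(f * d[1] for f, d in zip(features, DIRECTIONS))
    return (x, y)


def reconstruct(hidden):
    """Read each feature back out along its direction, then apply a ReLU."""
    return [max(0.0, hidden[0] * d[0] + hidden[1] * d[1]) for d in DIRECTIONS]


if __name__ == "__main__":
    # One active feature: the ReLU removes the negative interference terms.
    print(reconstruct(compress([1.0, 0.0, 0.0])))
    # Two active features: they interfere and are only partly recovered.
    print(reconstruct(compress([1.0, 1.0, 0.0])))
```

When only one feature is active, the interference with the other two directions is negative and the ReLU clips it to zero, so the feature is recovered cleanly. When two features are active at once they interfere, which illustrates why this kind of compression works best when features are sparse.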

Another angle is to practise reverse-engineering simpler algorithmic problems, and use this to refine our techniques and understanding of model internals, in a clean setting where we know the ground-truth solution. An example of this work is Neel Nanda’s Progress Measures for Grokking via Mechanistic Interpretability, which reverse-engineers a network trained to perform modular addition and finds that it learns an algorithm based on Fourier transforms and trigonometric identities.
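As a rough illustration of how trigonometric identities can implement modular addition, the sketch below represents each input as sines and cosines at a few frequencies, combines them with the angle-addition identities, and scores every candidate answer c by cos(w(a + b - c)), which peaks at c = (a + b) mod p. The modulus, the frequencies and the function name are assumptions chosen for illustration, not the values learned by the network in the paper.

```python
import math

P = 113            # illustrative prime modulus
FREQS = (1, 2, 5)  # arbitrary illustrative frequencies, not learned ones


def mod_add_via_trig(a, b, p=P, freqs=FREQS):
    """Compute (a + b) mod p using only trig identities and an argmax."""
    scores = [0.0] * p
    for k in freqs:
        w = 2 * math.pi * k / p
        # Angle-addition identities give cos(w(a+b)) and sin(w(a+b))
        # from the separate representations of a and b.
        cos_ab = math.cos(w * a) * math.cos(w * b) - math.sin(w * a) * math.sin(w * b)
        sin_ab = math.sin(w * a) * math.cos(w * b) + math.cos(w * a) * math.sin(w * b)
        for c in range(p):
            # cos(w(a+b-c)) = cos(w(a+b))cos(wc) + sin(w(a+b))sin(wc)
            scores[c] += cos_ab * math.cos(w * c) + sin_ab * math.sin(w * c)
    # The score is largest where a + b - c is a multiple of p.
    return max(range(p), key=scores.__getitem__)


if __name__ == "__main__":
    print(mod_add_via_trig(100, 50), (100 + 50) % P)
```

Summing over several frequencies widens the gap between the correct answer and its neighbours, since each cosine term equals 1 only when a + b - c is a multiple of p.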

An early focus of the field was on understanding image classification models, and how they learned to extract features from images and to process them. An example of this work is OpenAI’s Curve Circuits, where they study a set of neurons that detect curves, understand the algorithm well enough to write the neuron weights by hand, and replace these neurons with their hand-crafted set.

How to tell if you might be a good fit

Mechanistic interpretability is a fundamentally interdisciplinary field that can benefit from a range of backgrounds. There’s a diverse range of projects: mathematicians can reason through the mathematical structure of a network; computer scientists can work out the algorithms it may have learned; scientists can do the careful empirical work needed to discover what is going on inside the model while avoiding self-deception; and those with a background in complex systems can engage with a network and disentangle its internal complexity.

Most projects will involve coding, but there’s room for some more qualitative projects, or projects analysing theoretical and mathematical questions. However, it will likely take more creativity to find a way of working on this area within the boundaries of academic disciplines outside of computer science.

How to get involved

Neel Nanda’s 200 Concrete Open Problems in Mechanistic Interpretability is a literature review that lists many open problems in the field that could make good thesis topics! The first post in the sequence outlines how to get started in the field, the key skills to learn, and good resources for learning them. Neel Nanda’s Mechanistic Interpretability Explainer is a good resource for looking up unfamiliar terms as you explore the field.


This profile was published 21/02/2023. Thanks to Neel Nanda for writing this profile. Learn more about how we create our profiles.

The image used in this article is available under CC-BY 4.0 and can be found here:
