Mechanistic interpretability and artificial psycholinguistics are transforming our understanding of large language models. In this episode, Arshavir Blackwell explores how probing neural circuits, behavioral tests, and new tools are unraveling the mysteries of AI reasoning.
Chapter 1
Arshavir Blackwell, PhD
Welcome to the first episode of Inside the Black Box. I'm your host, Arshavir Blackwell. Today, I want to crack open a major shift in how we peer inside large language models. If you've seen those saliency maps or heatmaps people used to show where the model "looked" in a sentence, you know they're fine for getting a rough idea. But that's like reading MRI scans through frosted glass. You get the general outline, but not the function, not the logic, and definitely not the why.
Arshavir Blackwell, PhD
For years, these kinds of tools were all we had. But as these models started writing essays and debugging code, people realized that we were outmatched. These tools just painted over the surface, telling us which words were hot, but not how the model made its decision in the first place.
Arshavir Blackwell, PhD
This is where mechanistic interpretability comes into play. This line of work takes off around 2020, when Chris Olah and his collaborators, first at OpenAI and later at Anthropic, decided: let's take transformers apart like laptops, circuit by circuit. Their hunch was that if the network shows structure on the outside, there's got to be engineering on the inside. That's kind of radical, actually, treating these models less like black-box crystal balls and more like really complicated clockwork.
Arshavir Blackwell, PhD
What did that approach unlock? One classic example is the "induction head." This is a special attention head that learns to spot repeated patterns and continue them: if the model has seen "A B" earlier in the text and then sees "A" again, the induction head pushes the prediction toward "B." Now we weren't just pointing at colors on a heatmap; we could say: here's the mechanism that copies and extends patterns in text, step by step.
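For listeners who want to poke at this themselves, here is a rough behavioral check for induction-style copying. It assumes the Hugging Face transformers library and GPT-2 small; the details (sequence length, token ranges) are arbitrary choices, not anything from the episode.

```python
# A rough behavioral check for induction-style copying, assuming the Hugging Face
# `transformers` library and GPT-2 small. If the model has working induction heads,
# its loss on the *second* copy of a random token sequence should be much lower
# than on the first copy, because it can look back and continue the pattern.
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

# Random token IDs, repeated once: [seq, seq]
torch.manual_seed(0)
seq = torch.randint(1000, 20000, (50,))
input_ids = torch.cat([seq, seq]).unsqueeze(0)

with torch.no_grad():
    logits = model(input_ids).logits

# Per-token cross-entropy of predicting token t+1 from position t
log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
targets = input_ids[0, 1:]
nll = -log_probs[torch.arange(targets.numel()), targets]

first_half = nll[: len(seq) - 1].mean()   # model is guessing unseen random tokens
second_half = nll[len(seq):].mean()       # model can copy from the first pass
print(f"loss on first copy:  {first_half:.2f}")
print(f"loss on second copy: {second_half:.2f}")
```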
Arshavir Blackwell, PhD
One of the most fascinating discoveries has been the emergence of coreference circuits: tiny subnetworks inside large language models that figure out who's being talked about. Take a sentence like "Alice went to the market because she was hungry." The model somehow knows that "she" refers to Alice. What's remarkable is that no one programmed this explicitly; these circuits just appeared during training, echoing the same strategies humans use to keep track of people and things in conversation. Remember, all the model is trained to do is to predict the next word. It's as if the model has independently rediscovered part of the machinery of language understanding.
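Here is a quick sketch of how one might check this behaviorally, without touching any circuits. It assumes GPT-2 via the transformers library, and the two-name prompt is my own illustrative stimulus, not one from the episode.

```python
# A quick behavioral probe of coreference, assuming GPT-2 via `transformers`.
# We don't inspect circuits here; we just check whether the model's next-token
# probabilities track the pronoun "she" back to the right name.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompt = ("Alice went to the market with Bob because she was hungry. "
          "The one who was hungry was")
input_ids = tok(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(input_ids).logits[0, -1]
probs = torch.softmax(logits, dim=-1)

for name in [" Alice", " Bob"]:
    # Leading space matters for GPT-2's BPE; if a name splits into several
    # tokens, this only scores its first piece.
    tid = tok.encode(name)[0]
    print(f"P({name!r}) = {probs[tid].item():.4f}")
```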
Arshavir Blackwell, PhD
And that's just one example. Researchers are now mapping out attention heads that specialize in negation, or that track topic shifts as a discussion moves from one idea to another. Tools like sparse autoencoders help disentangle the overlapping representations inside these networks, so individual neurons and features become cleaner, more interpretable. And with techniques such as causal scrubbing, developed at Redwood Research, we can finally go beyond just observing correlations. We can intervene, testing whether a particular circuit actually causes a behavior, or if it's just riding along. Step by step, the inner logic of these models is starting to come into focus.
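Causal scrubbing itself is a careful, hypothesis-driven procedure; the sketch below is only its crudest cousin, a single-head ablation. It assumes GPT-2 and the transformers library's head_mask argument, and the layer/head indices are placeholders, since finding the right head is exactly what circuit analysis is for.

```python
# A drastically simplified stand-in for causal intervention, assuming GPT-2 via
# `transformers`: knock out one attention head with the model's `head_mask`
# argument and see whether the copying behavior from the induction-head sketch
# above degrades. (Causal scrubbing proper is far more careful than this.)
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
n_layers, n_heads = model.config.n_layer, model.config.n_head

torch.manual_seed(0)
seq = torch.randint(1000, 20000, (50,))
input_ids = torch.cat([seq, seq]).unsqueeze(0)
labels = input_ids.clone()

def repeated_copy_loss(head_mask=None):
    """Mean next-token loss on the repeated sequence, with optional head ablation."""
    with torch.no_grad():
        out = model(input_ids, labels=labels, head_mask=head_mask)
    return out.loss.item()

baseline = repeated_copy_loss()

# Hypothetical target: ablate head 5 in layer 5 (placeholder indices).
mask = torch.ones(n_layers, n_heads)
mask[5, 5] = 0.0
ablated = repeated_copy_loss(mask)

print(f"baseline loss: {baseline:.3f}   with head (5,5) ablated: {ablated:.3f}")
```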
Arshavir Blackwell, PhD
The field is feeling less like peering into a void and more like charting the circuitry of some alien brain, except it's a brain we built ourselves.
Arshavir Blackwell, PhD
But interpreting circuits is really just one angle. There's another revolution happening in parallel: what Sean Trott dubbed LLM-ology, or artificial psycholinguistics. This is close to home for me, because my PhD work was basically a prototype of this, just with smaller models and made-up languages instead of giant LLMs.
Arshavir Blackwell, PhD
LLM-ology says: let's treat language models as cognitive subjects, the way you might treat people in a linguistics lab. You give them a carefully controlled set of prompts, change one detail at a time, and run minimal-pair studies. There's the classic "banana phone" test: if you ask the model to describe making a banana phone, does it get the joke? Or does it give you a straight-faced engineering schematic for banana-shaped electronics? (It usually does the latter.)
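A minimal-pair study is easy to sketch in code. The example below swaps in a classic subject-verb agreement pair rather than the banana phone, and assumes GPT-2 via the transformers library; the sentences are just illustrative stimuli.

```python
# A minimal-pair study in the LLM-ology style, assuming GPT-2 via `transformers`:
# hold everything constant except one detail and compare which variant the model
# finds more probable. Here the detail is subject-verb agreement across an
# intervening noun.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def sentence_logprob(text):
    """Total log-probability the model assigns to a sentence."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = ids[0, 1:]
    return log_probs[torch.arange(targets.numel()), targets].sum().item()

pair = ("The keys to the cabinet are on the table.",
        "The keys to the cabinet is on the table.")
for sent in pair:
    print(f"{sentence_logprob(sent):8.2f}  {sent}")
```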
Arshavir Blackwell, PhD
What comes out of these studies is that models consistently get the syntax before they get anywhere close to handling meaning. They're great at the form, putting words in legal order, but understanding the deeper concept, like the difference between a joke and an instruction, is still a work in progress. Instruction tuning helps with appearances, but sometimes that only encourages surface mimicry, not true understanding underneath.
Arshavir Blackwell, PhD
The field's split into two sub-areas here. There's the psycholinguistic side: playing endless games of prompt tweaking and measuring whether outputs change reliably. And then there's the neurolinguistic side, where you crack open the model's activations and use diagnostic classifiers: small probes trained to read out what information is encoded in those activations.
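A diagnostic classifier can be very small. Here is a bare-bones sketch assuming GPT-2 hidden states and scikit-learn; the four sentences and the food-versus-weather labels are toy stand-ins for a real labeled dataset.

```python
# A bare-bones diagnostic classifier: freeze the model, pull out one layer's
# activations, and train a linear probe to predict a property of the input
# (here, whether the sentence is about food or weather - toy labels).
import torch
from transformers import GPT2Model, GPT2TokenizerFast
from sklearn.linear_model import LogisticRegression

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2", output_hidden_states=True).eval()

sentences = ["She ordered soup and bread.", "The storm dropped a foot of snow.",
             "He grilled vegetables for dinner.", "It was sunny and warm all week."]
labels = [1, 0, 1, 0]  # 1 = food, 0 = weather

def layer_features(text, layer=6):
    """Mean-pooled hidden state from one layer for a single sentence."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        hidden = model(ids).hidden_states[layer]   # [1, seq_len, hidden_dim]
    return hidden.mean(dim=1).squeeze(0).numpy()

X = [layer_features(s) for s in sentences]
probe = LogisticRegression(max_iter=1000).fit(X, labels)
print("training accuracy of the probe:", probe.score(X, labels))
```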
Arshavir Blackwell, PhD
These two traditions, mechanistic interpretability and LLM-ology, are less rivals and more like dance partners. Think of it this way: when LLM-ology runs an experiment and stumbles on some weird output, a model hallucinating, say, Theresa May becoming the French President, mechanistic researchers can trace the circuits or neurons that might be "entangling" France, the UK, and Theresa May, and start to untangle it. Conversely, if MI finds a suspicious module that mixes concepts, behavioral people can devise new prompts to see where things break in action. It's an iterative cycle, an endless ping-pong between what the model does and what it really knows.
Arshavir Blackwell, PhD
There are huge challenges here. The parameter counts on these models are big: seventy billion parameters, thousands of attention heads. It's like trying to map every synapse in a blue whale's brain with tweezers. The solution may lie in automating many of these analyses, using methods such as sparse autoencoders, activation clustering, and whole feature atlases. Going through the model neuron by neuron by hand simply does not scale.
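To give a flavor of what automating a first pass can look like, here is a sketch using scikit-learn's k-means on cached activation vectors; the random matrix stands in for real activations pulled from a model, and the cluster count is an arbitrary choice.

```python
# One way to automate a first pass: cluster cached activation vectors so that,
# instead of eyeballing thousands of neurons one by one, you start from a few
# dozen groups that tend to fire together.
import numpy as np
from sklearn.cluster import KMeans

activations = np.random.default_rng(0).normal(size=(5_000, 768))  # stand-in data
kmeans = KMeans(n_clusters=32, n_init=10, random_state=0).fit(activations)

# Each cluster is a candidate entry in a feature atlas, to be inspected by hand.
print("activation vectors per cluster:", np.bincount(kmeans.labels_))
```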
Arshavir Blackwell, PhD
Another major challenge is polysemanticity, usually attributed to superposition: a single neuron or attention head ends up encoding multiple, often unrelated features. It's like receiving overlapping signals through the same channel, where distinct concepts interfere with one another. Mechanistic interpretability tools such as sparse autoencoders can help separate these signals by disentangling the underlying feature space, but the process remains complex. Determining which features are active, and in what contexts, requires careful analysis. And when it comes to establishing causal relationships, the bar is even higher: demonstrating that editing or ablating a component truly causes a behavioral change demands rigorous experimental controls and becomes increasingly difficult as model scale grows.
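For the curious, here is a minimal sketch of the sparse autoencoder recipe in PyTorch, stripped of every refinement. The random activation matrix is a stand-in for activations cached from a real model layer, and the sizes and penalty weight are arbitrary.

```python
# A minimal sparse autoencoder for disentangling features, assuming PyTorch and
# a matrix of cached activations from some model layer. The L1 penalty pushes
# the code to be sparse, so each learned feature fires for only a narrow slice
# of inputs.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=768, d_features=4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x):
        codes = torch.relu(self.encoder(x))   # sparse, overcomplete features
        recon = self.decoder(codes)           # reconstruction of the activation
        return recon, codes

activations = torch.randn(10_000, 768)        # stand-in for real cached activations
sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_weight = 1e-3

for step in range(200):
    batch = activations[torch.randint(0, len(activations), (256,))]
    recon, codes = sae(batch)
    loss = ((recon - batch) ** 2).mean() + l1_weight * codes.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```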
Arshavir Blackwell, PhD
Let's walk through that Theresa May hallucination case. Suppose a model says, "Theresa May won the French election," which, you know, is not true. An LLM-ology prompt variation might catch that the model's confusing office titles or nationalities. That signals to the MI researchers: go hunt down the tangled circuits or neurons blending political geography with names. You bring in tools like causal scrubbing to verify: is this really where things break? And with editing approaches like ROME or MEMIT, you don't have to retrain the model; you can surgically correct just that factual association, say, adjusting the link between "Theresa May" and "France," leaving everything else intact. Suddenly, a hallucination flagged by behavior can be traced, verified, and edited at the root, all with tools from both fields.
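To make "surgically correct" a little more concrete, here is the bare mathematical idea behind a rank-one edit, in plain NumPy. This is not the actual ROME or MEMIT implementation (those choose the layer carefully and use covariance statistics of the keys); it only shows the core move: nudge a weight matrix so one specific key now maps to a new value while orthogonal directions are untouched.

```python
# The conceptual core of a rank-one model edit, NOT the real ROME/MEMIT code.
import numpy as np

rng = np.random.default_rng(0)
d = 8
W = rng.normal(size=(d, d))     # stand-in for an MLP projection matrix
k = rng.normal(size=d)          # "key" vector for the fact we want to change
v_new = rng.normal(size=d)      # the value we want W @ k to produce

# Rank-one update: W' = W + (v_new - W k) k^T / (k^T k)
W_edited = W + np.outer(v_new - W @ k, k) / (k @ k)

print("edited mapping hits the target:", np.allclose(W_edited @ k, v_new))
x_orth = rng.normal(size=d)
x_orth -= (x_orth @ k) / (k @ k) * k   # project out the k direction
print("orthogonal inputs unchanged:   ", np.allclose(W_edited @ x_orth, W @ x_orth))
```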
Arshavir Blackwell, PhD
That synergy might be the best shot we have at real transparency and control. MI can alert us to hidden biases or risky internal goals before they show up in public, while LLM-ology can spot when a model has learned misleading patterns or is being overly literal. Both are essential for making these models trustworthy.
Arshavir Blackwell, PhD
The black box is still opaque, but these tools are making it less so, turning inscrutable matrices into something we can reason about, something almost familiar. We're not all the way there, by any means, but every year these tools get sharper. Because we appear to be heading toward giving these systems more and more control over our lives, in areas such as medical diagnosis, scientific research, and even law and government, we must have a much better understanding of what these models are doing and how we can fix them when they go astray. This has been Inside the Black Box. I'm Arshavir Blackwell.
About the podcast
How do Large Language Models like ChatGPT work, anyway?