Arshavir Blackwell takes you on a journey inside the black box of large language models, showing how cutting-edge methods help researchers identify, understand, and even fix the inner quirks of AI. Through concrete case studies, he demonstrates how interpretability is evolving from an arcane art to a collaborative science, while revealing the daunting puzzles that remain. This episode unpacks the step-by-step workflow and surprising realities of mechanistically mapping model cognition.
Chapter 1
Arshavir Blackwell, PhD
Welcome to Inside the Black Box, where we peer into the hidden machinery of modern AI and try to understand how meaning and reasoning emerge from math. I'm Arshavir Blackwell. Today, we're taking things in a slightly different direction. Up to now, we've been looking inside large language models: mapping their circuits, tracing attention heads, and asking what really goes on inside. But there's a tougher question underneath all that curiosity: once you can see what's happening under the hood, can you actually fix it? Or are we just standing by, watching strange behaviors march past? Today's episode is about that next step: debugging a model from the inside out.
Arshavir Blackwell, PhD
This is where interpretability turns into something closer to engineering. It's not just about observing what the model thinks. It's about intervening when it stumbles, and learning how to set it right.
Arshavir Blackwell, PhD
Take a few quick examples. A model that takes everything too literally: you say, "Lend me a hand," and it starts imagining limb-sharing. Say "kick the bucket," and it confidently explains a leg exercise routine. It's funny, until it happens in medicine or law. These aren't random slips. They reveal how the model organizes meaning. With idioms, internal detectors light up for "kick" and "bucket" separately, but nothing binds them into "to die." That's weak multi-token binding. Sparse autoencoders or induction-head analysis can pinpoint where that connection fails.
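To make that concrete, here is a minimal sketch of what checking for a shared idiom feature might look like. The SparseAutoencoder class below is a toy with random weights, just to show the standard encoder shape; in practice you would load a trained SAE and real residual-stream activations cached from the model, and the "figurative death" feature mentioned in the comments is hypothetical.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy SAE: maps a d_model activation vector to a wide, sparse feature vector."""
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_features) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_features))
        self.W_dec = nn.Parameter(torch.randn(d_features, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # ReLU keeps only a handful of features active per token.
        return torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)

def active_features(resid: torch.Tensor, sae: SparseAutoencoder, top_k: int = 5):
    """List the most active SAE features at each token position.

    `resid` is a (seq_len, d_model) slice of the residual stream at one layer,
    e.g. cached while the model reads "He kicked the bucket."
    """
    feats = sae.encode(resid)                     # (seq_len, d_features)
    vals, idxs = feats.topk(top_k, dim=-1)
    return [list(zip(i.tolist(), v.tolist())) for i, v in zip(idxs, vals)]

# If a "figurative death" feature exists, it should fire on both "kicked" and
# "bucket" in the idiomatic sentence but not in "He kicked the red bucket."
# If no shared feature shows up, that is the weak multi-token binding at work.
```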
Arshavir Blackwell, PhD
Metaphors show the same weakness. "The stock market caught a cold." Instead of mapping that to "the market's weak," some models report, "Influenza detected among traders." The bridge between health and economics is gone. And the classics: "Can you pass the salt?" "Yes, I can pass the salt." Syntax understood; social intent missed. Or: "The trophy doesn't fit into the suitcase because it's too small." Flip "small" to "big," and many models still attach "it" to the wrong noun. Once you start looking, these misfires are everywhere. But they're not noise; they're clues. They show us how the internal machinery is wired, and where it breaks down.
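Turning those anecdotes into something you can debug means writing them down as contrast pairs, which is step one of the workflow described next. Here is a minimal sketch; `answer_probability` is a placeholder for however you score a candidate answer with your model, such as comparing the logits of the two completions, not a real library call.

```python
# Minimal failure-elicitation harness: prompt pairs where a single detail
# should flip the correct answer. `answer_probability(prompt, answer)` is a
# placeholder for however you score an answer with your model.

CONTRAST_PAIRS = [
    {   # idiom vs. literal reading
        "prompt": "After the long illness, grandpa finally kicked the bucket. What happened to grandpa?",
        "good": "He died.",
        "bad": "He kicked a bucket with his foot.",
    },
    {   # Winograd-style referent flip
        "prompt": "The trophy doesn't fit into the suitcase because it's too small. What is too small?",
        "good": "the suitcase",
        "bad": "the trophy",
    },
]

def elicit_failures(answer_probability, pairs=CONTRAST_PAIRS):
    """Keep every pair where the model prefers the wrong reading."""
    failures = []
    for pair in pairs:
        if answer_probability(pair["prompt"], pair["bad"]) >= answer_probability(pair["prompt"], pair["good"]):
            failures.append(pair)
    return failures
```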
Arshavir Blackwell, PhD
So how do you debug a model? That's where mechanistic interpretability shows its power. Step one: elicit the failure. Collect examples. Find the pattern. Step two: locate the relevant features. Sparse autoencoders act like microscopes for activations, revealing clusters that light up on idioms or metaphors. Then comes activation patching. Tools like ACDC, short for Automatic Circuit DisCovery, automate that search: transplant activations from successful cases into failed ones and see which components matter. If swapping in the activations of layer 8, head 3 makes the model interpret "kick the bucket" correctly, you've likely found a circuit for figurative language.
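Here is roughly what a single patching experiment looks like in code, written against the open-source TransformerLens conventions (HookedTransformer, run_with_cache, run_with_hooks). The model, the prompts, and the layer-8, head-3 target are placeholders for illustration, not a result; a real experiment would use the failures collected earlier and align prompt lengths carefully.

```python
import torch
from transformer_lens import HookedTransformer, utils

LAYER, HEAD = 8, 3  # hypothetical target head

model = HookedTransformer.from_pretrained("gpt2")  # stand-in model

# A prompt the model (hypothetically) gets right, and one it gets wrong.
clean_prompt = "The old man finally kicked the bucket; in other words, he"
corrupt_prompt = "The old man finally kicked the bucket, so he"

clean_tokens = model.to_tokens(clean_prompt)
corrupt_tokens = model.to_tokens(corrupt_prompt)

# Cache every activation from the successful run.
_, clean_cache = model.run_with_cache(clean_tokens)

hook_name = utils.get_act_name("z", LAYER)  # per-head attention outputs

def patch_head(z, hook):
    # z: (batch, seq, n_heads, d_head). Overwrite one head with the clean run.
    # (A real experiment would align token positions instead of truncating.)
    seq = min(z.shape[1], clean_cache[hook_name].shape[1])
    z[:, :seq, HEAD, :] = clean_cache[hook_name][:, :seq, HEAD, :]
    return z

patched_logits = model.run_with_hooks(
    corrupt_tokens, fwd_hooks=[(hook_name, patch_head)]
)

# Did the patch push the failing prompt toward the figurative completion?
died = model.to_single_token(" died")  # assumes " died" is one token here
print("logit for ' died' after patching:", patched_logits[0, -1, died].item())
```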
Arshavir Blackwell, PhD
But correlation isn't causation. That's where Redwood Research's Causal Scrubbing comes in. You deliberately disrupt that circuit and see whether the behavior breaks while everything else stays stable. If it does, you've found the cause. Once you know the culprit, you can edit directly. Model-editing techniques such as ROME let you adjust weights surgically, repairing specific associations without retraining the entire model.
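Redwood's full procedure replaces activations according to an explicit hypothesis and checks that behavior is preserved; the sketch below shows only its simplest ingredient, a resample ablation, again written in TransformerLens style with a hypothetical layer, head, and prompts. Scrub the suspected head with activations from unrelated text and see whether the figurative reading collapses while nothing else does.

```python
import torch
from transformer_lens import HookedTransformer, utils

LAYER, HEAD = 8, 3  # the hypothetical figurative-language head from before

model = HookedTransformer.from_pretrained("gpt2")  # stand-in model
hook_name = utils.get_act_name("z", LAYER)

idiom = model.to_tokens("The old man finally kicked the bucket; in other words, he")
neutral = model.to_tokens("The committee will meet again on Thursday to review the")

# Resample ablation: replace the head's output on the idiom prompt with its
# output on an unrelated prompt, destroying any idiom-specific signal it carries.
_, neutral_cache = model.run_with_cache(neutral)

def scrub_head(z, hook):
    seq = min(z.shape[1], neutral_cache[hook_name].shape[1])
    z[:, :seq, HEAD, :] = neutral_cache[hook_name][:, :seq, HEAD, :]
    return z

died = model.to_single_token(" died")  # assumes " died" is one token here

baseline = model(idiom)[0, -1, died].item()
scrubbed = model.run_with_hooks(idiom, fwd_hooks=[(hook_name, scrub_head)])[0, -1, died].item()

print(f"logit(' died'): baseline={baseline:.2f}  scrubbed={scrubbed:.2f}")
# If the figurative completion collapses when, and only when, this head is
# scrubbed, the head is causally involved rather than merely correlated.
```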
Arshavir Blackwell, PhD
Of course, real debugging isn't tidy. Circuits overlap. Causal effects blur. Fixes that work in one domain collapse in another. Interpretability takes rigor, but it also takes intuition. And not every odd output is an error. Sometimes it's a sign of creativity. When a model rephrases a garden-path sentence to make it clearer, that's not failure; it's linguistic insight.
Arshavir Blackwell, PhD
Here's where it gets fascinating. A transformer isn't one big mind. It's a network of subnetworks. Each subnetwork contributes its own computation: syntax parsing, factual recall, metaphor resolution, pronoun tracking. They're distinct but interconnected, like specialized teams in a vast organization. Each can be studied, visualized, even conceptualized on its own terms. Seen this way, a transformer isn't a single black box; it's a federation of smaller intelligences working together.
Arshavir Blackwell, PhD
Anthropic's Transformer Circuits project made this vividly clear. Grammar heads, recall heads, reasoning heads: they form coherent internal communities. The model becomes less of a monolith and more of an ecosystem of cooperating modules.
Arshavir Blackwell, PhD
And the ecosystem studying those modules is expanding fast. Sparse autoencoders now extract stable, interpretable features. Neuronpedia and SAEBench let researchers annotate and share discoveries, a public atlas of model internals. DeepMind's Gemma Scope puts billion-parameter interpretability on a laptop. Debugging is no longer the privilege of a few elite labs. It's becoming a discipline open to anyone with curiosity and persistence.
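For a sense of what "on a laptop" looks like, here is a sketch of pulling one Gemma Scope SAE through the open-source SAELens library and asking which features fire on a metaphor. The release string, SAE id, return values, and hook name follow the SAELens and TransformerLens documentation as I recall them, and the Gemma weights themselves are license-gated, so treat every identifier here as an assumption to check against the current docs rather than a guaranteed API.

```python
import torch
from transformer_lens import HookedTransformer
from sae_lens import SAE

# NOTE: release/id strings and the tuple return follow SAELens docs as of this
# writing; verify against the current SAELens and Gemma Scope documentation.
sae, cfg, _ = SAE.from_pretrained(
    release="gemma-scope-2b-pt-res-canonical",
    sae_id="layer_20/width_16k/canonical",
    device="cpu",
)

model = HookedTransformer.from_pretrained("gemma-2-2b", device="cpu")

prompt = "The stock market caught a cold."
_, cache = model.run_with_cache(model.to_tokens(prompt))

# Encode the layer-20 residual stream into the SAE's sparse feature basis.
acts = cache["blocks.20.hook_resid_post"]      # (1, seq, d_model)
features = sae.encode(acts)                    # (1, seq, d_sae)

# Top features on the final token; their indices can be looked up on
# Neuronpedia to see the human-readable labels other researchers have attached.
vals, idxs = features[0, -1].topk(10)
print(list(zip(idxs.tolist(), [round(v, 2) for v in vals.tolist()])))
```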
Arshavir Blackwell, PhD
Still, the big puzzles remain. Superposition and drift: features overlap and shift during training. Translation: each model speaks its own internal dialect. And large-scale editing: how to tweak trillion-parameter systems without setting off unpredictable ripples elsewhere. These issues persist because neural networks weren't designed to be legible. We're reverse-engineering emergent intelligence, a structure that evolved its own logic. Until architectures are transparent by design, that sense of digital archaeology isn't going away.
Arshavir Blackwell, PhD
But that's exactly what makes the work exciting. Mechanistic interpretability is where curiosity becomes control, where we move from "let's see what happens" to "let's make it happen." And beyond the lab, this matters for everyone. The more transparent these systems become, the more we can trust them, and the more they'll teach us about intelligence itself. Artificial or human, the patterns are starting to rhyme.
Arshavir Blackwell, PhD
If you're experimenting with any of these techniques, or others, please share your findings. This field advances through collaboration. Every circuit mapped, every subnetwork clarified, brings us one step closer to understanding, and shaping, the minds we've built.
Arshavir Blackwell, PhD
Thanks for listening to Inside the Black Box. If you enjoyed this episode, consider subscribing on Substack or your favorite podcast app. You'll find show notes, research links, and visual diagrams at arshavirblackwell.substack.com. I'm Arshavir Blackwell. See you next time, when we'll open another corner of the machine mind and see what's thinking inside.
About the podcast
How do Large Language Models like ChatGPT work, anyway?