Delve into the mysterious world of neural circuits within large language models. We’ll dismantle the jargon, connect these abstract ideas to real examples, and discuss how circuits help bridge the gap between machine learning and human cognition.
Chapter 1
Arshavir Blackwell, PhD
Every journey inside a neural model begins with its smallest unit of meaning — the token. Each token in the input string is translated via a lookup table into a high-dimensional vector. In this continuous space, meaning appears as geometry, learned through training — not hand-coded. Words about royalty cluster together. Verbs of motion align along shared directions. The model learns, roughly, that king minus man plus woman equals queen.
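To make the lookup table and the vector arithmetic concrete, here is a toy sketch in Python. The three hand-built dimensions, the tiny vocabulary, and the `embed` and `cosine` helpers are all invented for illustration; real embedding tables are learned, far wider, and satisfy the analogy only approximately.

```python
import numpy as np

# Toy embedding table with three hand-built dimensions
# (royalty, maleness, femaleness). Real tables are learned and much wider.
vocab = {"king": 0, "queen": 1, "man": 2, "woman": 3}
E = np.array([
    [1.0, 1.0, 0.0],   # king
    [1.0, 0.0, 1.0],   # queen
    [0.0, 1.0, 0.0],   # man
    [0.0, 0.0, 1.0],   # woman
])

def embed(token):
    """The lookup: a token index selects one row of the embedding matrix."""
    return E[vocab[token]]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# king minus man plus woman...
target = embed("king") - embed("man") + embed("woman")

# ...lands closest to queen in this toy space.
print(max(vocab, key=lambda t: cosine(target, embed(t))))  # queen
```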
These vectors encode features such as “is animate,” “is plural,” or “is concrete,” much like symbolic features in classical AI. But those features are rarely confined to a single dimension. They’re distributed — spread across many dimensions.
Large models rely on superposition — where multiple features share the same neurons, overlapping like radio stations on a crowded frequency. To untangle those overlapping signals — to understand what’s going on inside the black box — we turn to sparse autoencoders.
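A rough sketch of what that crowding looks like, using a made-up toy model with more feature directions than dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_dims = 8, 3                     # more "concepts" than neurons

# Give every feature its own (random) unit direction in the small space.
directions = rng.normal(size=(n_features, n_dims))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# An activation that contains features 0 and 5, and nothing else.
x = directions[0] + directions[5]

# Reading each feature back with a dot product shows the crosstalk:
# the unused features do not read back as exactly zero, because eight
# directions cannot be mutually orthogonal in three dimensions.
print(np.round(directions @ x, 2))
```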
Sparse autoencoders are neural networks designed to uncover hidden structure in high-dimensional data — by learning efficient, interpretable, sparse representations. Like all autoencoders, they consist of an encoder that compresses input data into a latent code… and a decoder that reconstructs the original input.
At first, this might sound circular — why reconstruct what you already have? (beat) The answer lies in the bottleneck. In a classic autoencoder, a narrow band of neurons in the middle forces the network to capture the essential structure of the data. Sparse autoencoders get their bottleneck a different way: the latent layer can even be wider than the input, but only a small fraction of its neurons can activate at once.
That sparsity forces the model to discover a minimal, disentangled set of features that still explain the data. Instead of dense, overlapping activations, it learns distinct concepts — clean axes in activation space.
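Here is a minimal sketch of such a sparse autoencoder in PyTorch, assuming activations are collected from one layer of a transformer. The layer widths, the ReLU latent, and the L1 coefficient are illustrative placeholders rather than a recipe from the episode.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Encoder, wide latent layer, decoder, trained to reconstruct
    activations while an L1 penalty keeps most latent units silent."""

    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def forward(self, x):
        z = torch.relu(self.encoder(x))   # non-negative, mostly-zero code
        return self.decoder(z), z         # reconstruction and latent code

# One illustrative training step on stand-in data.
d_model, d_latent, l1_coeff = 512, 4096, 1e-3   # sizes are placeholders
sae = SparseAutoencoder(d_model, d_latent)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

acts = torch.randn(64, d_model)          # would be real transformer activations
x_hat, z = sae(acts)
loss = ((x_hat - acts) ** 2).mean() + l1_coeff * z.abs().mean()
loss.backward()
opt.step()
```

The reconstruction term keeps the code faithful to the original activation; the L1 term is what drives most latent units to zero on any given input.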
This sparsity isn’t perfect. Some representations remain only partially interpretable. But researchers have already isolated features that respond to specific ideas — the Golden Gate Bridge… neuroscience topics… even tourist attractions. These interpretable units bridge the gap between opaque weights and human abstractions — revealing how distributed representations in LLMs map onto discrete conceptual directions.
A sparse dictionary is the learned codebook from which any internal activation can be reconstructed. Its basis elements — or atoms — act like conceptual features that combine to express ideas.
Some align neatly with linguistic roles like gender agreement, negation, or factual recall. Others reflect compositional motifs reused across contexts. (short pause) Sparse autoencoders don’t just compress — they translate information into more human-readable features.
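In code, the dictionary view falls straight out of the decoder: its weight matrix is the codebook, and each column is one atom. The standalone sketch below uses random, untrained weights, so the code is not actually sparse here; the point is only the identity between the reconstruction and a weighted sum of whichever atoms are active.

```python
import torch
import torch.nn as nn

# Random placeholder weights; a trained autoencoder would activate
# only a handful of atoms for any given input.
encoder = nn.Linear(512, 4096)
decoder = nn.Linear(4096, 512)

x = torch.randn(512)                        # one internal activation vector
z = torch.relu(encoder(x))                  # the latent code
x_hat = decoder(z)                          # the usual reconstruction

dictionary = decoder.weight                 # shape (512, 4096): columns are atoms
active = torch.nonzero(z > 0).squeeze(-1)   # indices of the atoms in use

# The reconstruction is literally a weighted sum of the active atoms,
# plus the decoder bias.
recon = dictionary[:, active] @ z[active] + decoder.bias
print(torch.allclose(recon, x_hat, atol=1e-4))  # True, up to float error
```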
What do these features actually mean? To answer that, we can turn to an older paradigm — symbolic AI.
In symbolic systems, knowledge is stored in structured data objects with slots — variables that hold specific roles or values. For example, the verb chase might be represented as: chase(agent: X, patient: Y, manner: Z, location: L).
Each slot defines a semantic role. The agent slot holds who is doing the chasing — the pursuer. The patient slot holds what’s being chased — the pursued. These explicit structures make reasoning transparent: we can query, substitute, or combine predicates to form more complex statements.
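A frame like this is nothing more exotic than a typed record. The `Chase` dataclass below is an illustrative stand-in rather than any particular symbolic-AI formalism:

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class Chase:
    """chase(agent: X, patient: Y, manner: Z, location: L) as an explicit record."""
    agent: str                       # who is doing the chasing (the pursuer)
    patient: str                     # what is being chased (the pursued)
    manner: Optional[str] = None
    location: Optional[str] = None

# Every slot is a named field; the structure itself is queryable.
print([f.name for f in fields(Chase)])   # ['agent', 'patient', 'manner', 'location']
```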
Distributed vector representations, in contrast, store meaning implicitly across many dimensions. The information isn’t in any single coordinate, but in the pattern as a whole. Similar sentences cluster together — not by explicit symbols, but by geometric proximity in high-dimensional space.
Sparse autoencoders bring these worlds closer. Each learned feature dimension behaves like a symbolic slot — activating when a particular conceptual property, such as animate agent, is present. Sparsity converts the continuous geometry of deep networks into something that begins to resemble discrete, interpretable structure. This is a key tool in mechanistic interpretability.
The analogy extends directly to attention — the transformer’s defining innovation. (beat) In symbolic AI, reasoning happens by binding values to variables — filling the slots of frames or predicates. When a system reads “the dog happily chased the ball,” it might represent: chase(agent: dog, patient: ball, manner: happily).
Each argument is a role-filler: “dog” fills the agent slot, “ball” the patient, “happily” the manner. This is structured, explicit reasoning. We can query who chased whom, swap entities, or compose predicates into higher-order statements.
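As a sketch, that binding step is just filling named slots and reading them back. The `who_did_what` helper and the nested `believe` predicate are hypothetical, included only to mirror the three operations just mentioned: query, swap, compose.

```python
# "The dog happily chased the ball" as explicit role bindings.
event = {"predicate": "chase", "agent": "dog", "patient": "ball", "manner": "happily"}

def who_did_what(e):
    """Query the structure directly: the answer is read off named slots."""
    return f'{e["agent"]} {e["predicate"]}d {e["patient"]}'

print(who_did_what(event))                              # dog chased ball
swapped = {**event, "agent": "ball", "patient": "dog"}  # swap entities
nested = {"predicate": "believe", "agent": "cat", "patient": event}  # compose predicates
```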
Transformers perform a similar process — but in continuous vector space instead of symbolic code. Each token’s embedding is projected into three roles: Query — what information do I need? Key — what am I about? Value — what can I provide?
During each attention step, the model compares queries to keys. High similarity means one token is a plausible slot-filler for another. The corresponding value vectors are then blended into the querying token’s representation — a form of variable binding.
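A bare-bones, single-head version of that computation, written in NumPy with random weights. The pattern it produces is meaningless; the sketch is only meant to show the shape of the query-key comparison and the value blending.

```python
import numpy as np

def softmax(s, axis=-1):
    e = np.exp(s - s.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product attention over token embeddings X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v        # project into query / key / value
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # compare every query to every key
    weights = softmax(scores)                  # soft choice of slot-fillers
    return weights @ V, weights                # blend the chosen values in

# Six tokens ("the dog happily chased the ball"), toy width of 8.
# The weights are random, so the attention pattern here is meaningless;
# in a trained model, the row for "chased" would concentrate on "dog" and "ball".
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out, weights = attention(X, W_q, W_k, W_v)
print(weights.shape)   # (6, 6): one distribution over keys per querying token
```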
When processing “the dog happily chased the ball,” certain attention heads assign high weights from chased to dog and ball. Rather than explicitly writing agent equals dog, the model shifts geometry — the vector for “chased” now encodes who did what to whom. Attention recreates the slot-filling logic of symbolic AI — but through continuous mathematics instead of discrete logic.
The attention-as-slot-filling analogy is pedagogically useful and conceptually sound. But transformers don’t perform discrete slot-filling — they approximate it. (pause) Attention patterns are soft, distributed, and often overlapping. It’s an interpretive lens — not a literal description of how models “think.”
Multi-head attention performs these bindings in parallel: one head might track subject–verb roles, another adjective–noun links, another coreference, and so forth. Early layers tend to capture syntax; later layers tend to compose semantics, causality, and discourse.
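The parallelism comes from slicing the projected vectors into per-head subspaces. A minimal sketch, with illustrative sizes:

```python
import numpy as np

def split_heads(X, n_heads):
    """Slice (tokens, d_model) into (n_heads, tokens, d_head) so every head
    attends over its own low-dimensional subspace, in parallel."""
    tokens, d_model = X.shape
    d_head = d_model // n_heads
    return X.reshape(tokens, n_heads, d_head).transpose(1, 0, 2)

# Illustrative sizes: 6 tokens, model width 8, 2 heads of width 4.
# Each head would run the single-head attention sketched above on its own
# slice, which is what lets different heads track different relations.
Q = np.random.default_rng(1).normal(size=(6, 8))
print(split_heads(Q, n_heads=2).shape)   # (2, 6, 4)
```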
Attention is symbolic reasoning re-imagined as geometry — a soft, continuous mechanism for binding roles to fillers in vector space. The beauty of the vector approach is that it learns structure from data — reflecting the probabilistic, messy nature of real language.
Neural networks, once opaque, are revealing their inner grammar of meaning — a place where language, logic, and learning converge.
Symbolic AI taught machines to reason by rule. Deep learning taught them to grok patterns in space. What’s striking is how these paths circle back toward each other. The transformer doesn’t abandon structure — it rediscovers it. Sculpting syntax and sense out of gradients and geometry. Every query and key is a hint of logic. Every attention map, a faint grammar of thought.
Mechanistic interpretability is how we make that grammar visible again — translating vectors into concepts… showing that even in the fog of numbers… reason still leaves a trace. This has been Inside the Black Box. I'm Arshavir Blackwell.
About the podcast
How do Large Language Models like ChatGPT work, anyway?