Delve into the mysterious world of neural circuits within large language models. We’ll dismantle the jargon, connect these abstract ideas to real examples, and discuss how circuits help bridge the gap between machine learning and human cognition.
Chapter 1
Arshavir Blackwell, PhD
Every journey inside a neural model begins with its smallest unit of meaning — the token. Each token in the input string is translated via a lookup table into a high-dimensional vector. In this continuous space, meaning appears as geometry, learned through training — not hand-coded. Words about royalty cluster together. Verbs of motion align along shared directions. The model learns, roughly, that king minus man plus woman equals queen.
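To make the lookup table and the vector arithmetic concrete, here is a toy sketch in Python. The three hand-built dimensions, the tiny vocabulary, and the `embed` and `cosine` helpers are all invented for illustration; real embedding tables are learned, far wider, and satisfy the analogy only approximately.

```python
import numpy as np

# Toy embedding table with three hand-built dimensions
# (royalty, maleness, femaleness). Real tables are learned and much wider.
vocab = {"king": 0, "queen": 1, "man": 2, "woman": 3}
E = np.array([
    [1.0, 1.0, 0.0],   # king
    [1.0, 0.0, 1.0],   # queen
    [0.0, 1.0, 0.0],   # man
    [0.0, 0.0, 1.0],   # woman
])

def embed(token):
    """The lookup: a token index selects one row of the embedding matrix."""
    return E[vocab[token]]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# king minus man plus woman...
target = embed("king") - embed("man") + embed("woman")

# ...lands closest to queen in this toy space.
print(max(vocab, key=lambda t: cosine(target, embed(t))))  # queen
```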
These vectors encode features such as “is animate,” “is plural,” or “is concrete,” much like symbolic features in classical AI. But those features are rarely confined to a single dimension. They’re distributed — spread across many dimensions.
Large models rely on superposition — where multiple features share the same neurons, overlapping like radio stations on a crowded frequency. To untangle those overlapping signals — to understand what’s going on inside the black box — we turn to sparse autoencoders.
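A rough sketch of what that crowding looks like, using a made-up toy model with more feature directions than dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_dims = 8, 3                     # more "concepts" than neurons

# Give every feature its own (random) unit direction in the small space.
directions = rng.normal(size=(n_features, n_dims))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# An activation that contains features 0 and 5, and nothing else.
x = directions[0] + directions[5]

# Reading each feature back with a dot product shows the crosstalk:
# the unused features do not read back as exactly zero, because eight
# directions cannot be mutually orthogonal in three dimensions.
print(np.round(directions @ x, 2))
```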
Sparse autoencoders are neural networks designed to uncover hidden structure in high-dimensional data — by learning efficient, interpretable, sparse representations. Like all autoencoders, they consist of an encoder that compresses input data into a latent code… and a decoder that reconstructs the original input.
At first, this might sound circular — why reconstruct what you already have? (beat) The answer lies in the bottleneck. In a classic autoencoder, a narrow band of neurons in the middle forces the network to capture the essential structure of the data. Sparse autoencoders get their bottleneck a different way: the latent layer can even be wider than the input, but only a small fraction of its neurons can activate at once.
That sparsity forces the model to discover a minimal, disentangled set of features that still explain the data. Instead of dense, overlapping activations, it learns distinct concepts — clean axes in activation space.
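Here is a minimal sketch of such a sparse autoencoder in PyTorch, assuming activations are collected from one layer of a transformer. The layer widths, the ReLU latent, and the L1 coefficient are illustrative placeholders rather than a recipe from the episode.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Encoder, wide latent layer, decoder, trained to reconstruct
    activations while an L1 penalty keeps most latent units silent."""

    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def forward(self, x):
        z = torch.relu(self.encoder(x))   # non-negative, mostly-zero code
        return self.decoder(z), z         # reconstruction and latent code

# One illustrative training step on stand-in data.
d_model, d_latent, l1_coeff = 512, 4096, 1e-3   # sizes are placeholders
sae = SparseAutoencoder(d_model, d_latent)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

acts = torch.randn(64, d_model)          # would be real transformer activations
x_hat, z = sae(acts)
loss = ((x_hat - acts) ** 2).mean() + l1_coeff * z.abs().mean()
loss.backward()
opt.step()
```

The reconstruction term keeps the code faithful to the original activation; the L1 term is what drives most latent units to zero on any given input.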
This sparsity isn’t perfect. Some representations remain only partially interpretable. But researchers have already isolated features that respond to specific ideas — the Golden Gate Bridge… neuroscience topics… even tourist attractions. These interpretable units bridge the gap between opaque weights and human abstractions — revealing how distributed representations in LLMs map onto discrete conceptual directions.
A sparse dictionary is the learned codebook from which any internal activation can be reconstructed. Its basis elements — or atoms — act like conceptual features that combine to express ideas.
Some align neatly with linguistic roles like gender agreement, negation, or factual recall. Others reflect compositional motifs reused across contexts. (short pause) Sparse autoencoders don’t just compress — they translate information into more human-readable features.
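In code, the dictionary view falls straight out of the decoder: its weight matrix is the codebook, and each column is one atom. The standalone sketch below uses random, untrained weights, so the code is not actually sparse here; the point is only the identity between the reconstruction and a weighted sum of whichever atoms are active.

```python
import torch
import torch.nn as nn

# Random placeholder weights; a trained autoencoder would activate
# only a handful of atoms for any given input.
encoder = nn.Linear(512, 4096)
decoder = nn.Linear(4096, 512)

x = torch.randn(512)                        # one internal activation vector
z = torch.relu(encoder(x))                  # the latent code
x_hat = decoder(z)                          # the usual reconstruction

dictionary = decoder.weight                 # shape (512, 4096): columns are atoms
active = torch.nonzero(z > 0).squeeze(-1)   # indices of the atoms in use

# The reconstruction is literally a weighted sum of the active atoms,
# plus the decoder bias.
recon = dictionary[:, active] @ z[active] + decoder.bias
print(torch.allclose(recon, x_hat, atol=1e-4))  # True, up to float error
```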
What do these features actually mean? To answer that, we can turn to an older paradigm — symbolic AI.
In symbolic systems, knowledge is stored in structured data objects with slots — variables that hold specific roles or values. For example, the verb chase might be represented as: chase(agent: X, patient: Y, manner: Z, location: L).
Each slot defines a semantic role. The agent slot holds who is doing the chasing — the pursuer. The patient slot holds what’s being chased — the pursued. These explicit structures make reasoning transparent: we can query, substitute, or combine predicates to form more complex statements.
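A frame like this is nothing more exotic than a typed record. The `Chase` dataclass below is an illustrative stand-in rather than any particular symbolic-AI formalism:

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class Chase:
    """chase(agent: X, patient: Y, manner: Z, location: L) as an explicit record."""
    agent: str                       # who is doing the chasing (the pursuer)
    patient: str                     # what is being chased (the pursued)
    manner: Optional[str] = None
    location: Optional[str] = None

# Every slot is a named field; the structure itself is queryable.
print([f.name for f in fields(Chase)])   # ['agent', 'patient', 'manner', 'location']
```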
Distributed vector representations, in contrast, store meaning implicitly across many dimensions. The information isn’t in any single coordinate, but in the pattern as a whole. Similar sentences cluster together — not by explicit symbols, but by geometric proximity in high-dimensional space.
Sparse autoencoders bring these worlds closer. Each learned feature dimension behaves like a symbolic slot — activating when a particular conceptual property, such as animate agent, is present. Sparsity converts the continuous geometry of deep networks into something that begins to resemble discrete, interpretable structure. This is a key tool in mechanistic interpretability.
The analogy extends directly to attention — the transformer’s defining innovation. (beat) In symbolic AI, reasoning happens by binding values to variables — filling the slots of frames or predicates. When a system reads “the dog happily chased the ball,” it might represent: chase(agent: dog, patient: ball, manner: happily).
Each argument is a role-filler: “dog” fills the agent slot, “ball” the patient, “happily” the manner. This is structured, explicit reasoning. We can query who chased whom, swap entities, or compose predicates into higher-order statements.
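As a sketch, that binding step is just filling named slots and reading them back. The `who_did_what` helper and the nested `believe` predicate are hypothetical, included only to mirror the three operations just mentioned: query, swap, compose.

```python
# "The dog happily chased the ball" as explicit role bindings.
event = {"predicate": "chase", "agent": "dog", "patient": "ball", "manner": "happily"}

def who_did_what(e):
    """Query the structure directly: the answer is read off named slots."""
    return f'{e["agent"]} {e["predicate"]}d {e["patient"]}'

print(who_did_what(event))                              # dog chased ball
swapped = {**event, "agent": "ball", "patient": "dog"}  # swap entities
nested = {"predicate": "believe", "agent": "cat", "patient": event}  # compose predicates
```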
Transformers perform a similar process — but in continuous vector space instead of symbolic code. Each token’s embedding is projected into three roles: Query — what information do I need? Key — what am I about? Value — what can I provide?
During each attention step, the model compares queries to keys. High similarity means one token is a plausible slot-filler for another. The corresponding value vectors are then blended into the querying token’s representation — a form of variable binding.
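A bare-bones, single-head version of that computation, written in NumPy with random weights. The pattern it produces is meaningless; the sketch is only meant to show the shape of the query-key comparison and the value blending.

```python
import numpy as np

def softmax(s, axis=-1):
    e = np.exp(s - s.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product attention over token embeddings X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v        # project into query / key / value
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # compare every query to every key
    weights = softmax(scores)                  # soft choice of slot-fillers
    return weights @ V, weights                # blend the chosen values in

# Six tokens ("the dog happily chased the ball"), toy width of 8.
# The weights are random, so the attention pattern here is meaningless;
# in a trained model, the row for "chased" would concentrate on "dog" and "ball".
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out, weights = attention(X, W_q, W_k, W_v)
print(weights.shape)   # (6, 6): one distribution over keys per querying token
```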
When processing “the dog happily chased the ball,” certain attention heads assign high weights from chased to dog and ball. Rather than explicitly writing agent equals dog, the model shifts geometry — the vector for “chased” now encodes who did what to whom. Attention recreates the slot-filling logic of symbolic AI — but through continuous mathematics instead of discrete logic.
The attention-as-slot-filling analogy is pedagogically useful and conceptually sound. But transformers don’t perform discrete slot-filling — they approximate it. (pause) Attention patterns are soft, distributed, and often overlapping. It’s an interpretive lens — not a literal description of how models “think.”
Multi-head attention performs these bindings in parallel: one head might track subject–verb roles, another adjective–noun links, another coreference, and so forth. Early layers tend to capture syntax; later layers tend to compose semantics, causality, and discourse.
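The parallelism comes from slicing the projected vectors into per-head subspaces. A minimal sketch, with illustrative sizes:

```python
import numpy as np

def split_heads(X, n_heads):
    """Slice (tokens, d_model) into (n_heads, tokens, d_head) so every head
    attends over its own low-dimensional subspace, in parallel."""
    tokens, d_model = X.shape
    d_head = d_model // n_heads
    return X.reshape(tokens, n_heads, d_head).transpose(1, 0, 2)

# Illustrative sizes: 6 tokens, model width 8, 2 heads of width 4.
# Each head would run the single-head attention sketched above on its own
# slice, which is what lets different heads track different relations.
Q = np.random.default_rng(1).normal(size=(6, 8))
print(split_heads(Q, n_heads=2).shape)   # (2, 6, 4)
```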
Attention is symbolic reasoning re-imagined as geometry — a soft, continuous mechanism for binding roles to fillers in vector space. The beauty of the vector approach is that it learns structure from data — reflecting the probabilistic, messy nature of real language.
Neural networks, once opaque, are revealing their inner grammar of meaning — a place where language, logic, and learning converge.
Symbolic AI taught machines to reason by rule. Deep learning taught them to grok patterns in space. What’s striking is how these paths circle back toward each other. The transformer doesn’t abandon structure — it rediscovers it. Sculpting syntax and sense out of gradients and geometry. Every query and key is a hint of logic. Every attention map, a faint grammar of thought.
Mechanistic interpretability is how we make that grammar visible again — translating vectors into concepts… showing that even in the fog of numbers… reason still leaves a trace. This has been Inside the Black Box. I'm Arshavir Blackwell.
About the podcast
How do Large Language Models like ChatGPT work, anyway?