Intellectually Curious

Natural Language Autoencoders for Unsupervised LLM Interpretability

Mike Breault


Introducing Natural Language Autoencoders (NLAs), an unsupervised method developed by researchers at Anthropic to translate the complex internal activations of large language models into human-readable text. By utilizing an activation verbalizer to describe model states and an activation reconstructor to map those descriptions back to vectors, NLAs provide a legible interface for AI interpretability and auditing. The researchers demonstrate that these tools can surface unverbalized reasoning, such as a model's hidden awareness that it is being evaluated or its internal plans for generating specific responses. Although NLAs occasionally confabulate specific details, they remain highly effective for identifying safety-relevant behaviors and diagnosing flaws in training data.


Note:  This podcast was AI-generated, and sometimes AI can make mistakes.  Please double-check any critical information.

Sponsored by Embersilk LLC

SPEAKER_00

You know, imagine just uh looking at your friend's blank stare and literally seeing a ticker tape of their exact thoughts. I mean, I have this one friend who has this well, this intense laser-focused thinking face. One time I was telling him a story and he's just glaring at me.

SPEAKER_01

Like he's furious or something.

SPEAKER_00

Yeah, exactly. I thought he was so mad. But um it turns out he was just silently trying to remember the lyrics to some song from 2004.

SPEAKER_01

That is hilarious. Just a complete disconnect between his internal state and well, what you could observe on the outside.

SPEAKER_00

Right, right. And while we can't project human thoughts onto a screen just yet, which is maybe for the best, Anthropic has basically figured out how to do this with AI. So welcome to Intellectually Curious. Today's deep dive mission is tackling their new research on natural language autoencoders, or NLAs. We're looking at a mechanism that actually translates an AI's internal thoughts into plain English.

SPEAKER_01

Which is honestly a massive paradigm shift. I mean, historically, if you looked under the hood of a large language model, you didn't see English. No, you just saw math. You saw these high-dimensional activation vectors. They're essentially massive arrays of numbers that dictate the model's behavior, but they are completely opaque to human analysis.

SPEAKER_00

So if we don't have a Rosetta Stone to translate those vectors, how is an NLA actually bridging that gap? Because it's not like you can just hand it a bilingual dictionary for, you know, machine vector to human English.

SPEAKER_01

Right. That's the main challenge. There is no dictionary at all. So the NLA has to rely on two interconnected components. First, you have the activation verbalizer.

SPEAKER_00

Okay, the verbalizer.

SPEAKER_01

Yeah. That takes the model's internal mathematical state and attempts to map it into a text description. Then the second part, the activation reconstructor, takes that English text and tries to map it back into the original mathematical activation.
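[A sketch of that round trip, for listeners following along at home. This is an illustrative toy, not Anthropic's implementation: the real verbalizer and reconstructor are trained language models, while the stand-in functions below just describe and recover the strongest feature of a vector.]

```python
import numpy as np

def verbalize(activation):
    """Toy stand-in for the activation verbalizer: map an internal
    vector to a human-readable description. Here we only describe the
    dominant dimension; a real verbalizer is a trained model."""
    top = int(np.argmax(np.abs(activation)))
    return f"state dominated by feature {top} (weight {activation[top]:.2f})"

def reconstruct(description, dim):
    """Toy stand-in for the activation reconstructor: map the English
    description back to a vector of the original dimensionality."""
    idx = int(description.split("feature ")[1].split(" ")[0])
    weight = float(description.split("weight ")[1].rstrip(")"))
    vec = np.zeros(dim)
    vec[idx] = weight
    return vec

activation = np.array([0.1, -0.3, 2.0, 0.5])
text = verbalize(activation)                      # vector -> English
rebuilt = reconstruct(text, activation.size)      # English -> vector
```

[Note that this toy round trip is lossy: only the dominant feature survives, and the smaller components are dropped, which loosely mirrors the confabulation issue discussed later.]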

SPEAKER_00

Oh wow, wait, so it's doing a round-trip translation. But how does it actually know if the English text it came up with is uh accurate?

SPEAKER_01

Well, that is where unsupervised reinforcement learning comes in. See, because there's no pre-existing data set of AI thoughts to train on, the system just evaluates itself.

SPEAKER_00

So it's basically checking its own work. And you know, speaking of building and understanding AI systems, this deep dive is actually sponsored by Embersilk. Whether you need help with AI training, automation, integration, or software development, they have you covered.

SPEAKER_01

Yeah, if you are uh uncovering where agents could make the most impact for your business or personal life, you really should check out Embersilk.com for your AI needs.

SPEAKER_00

Exactly. So going back to the NLA checking its own work, how does that reinforcement learning actually play out?

SPEAKER_01

So it compares the original activation vector to the reconstructed one. If they match closely, the system rewards itself. It forces the verbalizer to find the absolute most precise English words to minimize the loss during that round trip.
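[A toy version of that training signal. The cosine-similarity reward below is an assumption made for the sketch, not the paper's actual loss; it just captures the idea that a close round-trip match earns a high reward.]

```python
import numpy as np

def reconstruction_reward(original, reconstructed):
    """Toy reward: cosine similarity between the original activation
    and its round-tripped reconstruction. Near 1.0 means the English
    description preserved the vector; lower values penalize sloppy or
    imprecise verbalizations."""
    denom = float(np.linalg.norm(original) * np.linalg.norm(reconstructed))
    if denom == 0.0:
        return 0.0
    return float(np.dot(original, reconstructed)) / denom

original = np.array([1.0, 0.5, -0.2])
good = np.array([0.9, 0.55, -0.1])   # close reconstruction -> high reward
bad = np.array([-1.0, 0.0, 2.0])     # unrelated reconstruction -> low reward
```

[Maximizing this kind of reward is what forces the verbalizer toward precise wording: vague descriptions reconstruct poorly and score low.]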

SPEAKER_00

Yeah, it makes perfect sense. It's optimizing for accuracy by seeing if its own translation survives the journey back into vector form. Let's bring this into the real world, though. If you're listening to this, you've probably used a model like Claude and occasionally gotten a truly bizarre response.

SPEAKER_01

Oh, definitely. And Anthropic actually ran this NLA tool on Claude Opus 4. The case studies are really illuminating.

SPEAKER_00

I love the poetry example.

SPEAKER_01

Right. So the prompt asked Claude for a rhyming couplet starting with "He saw a carrot and had to grab it." When researchers looked at the NLA output, they saw that precisely at the moment the model processed the words "grab it," it was already internally representing the word "rabbit" to finish the rhyme on the next line.

SPEAKER_00

Which proves foresight. I mean, it's not just predicting the next immediate token, it's actually pre-planning the structure.

SPEAKER_01

Exactly. And then there was the language switching anomaly, which is fascinating. A user gave Claude an English prompt about staying up late, and it bizarrely replied in Russian.

SPEAKER_00

Which is so weird. Previously, we'd just write that off as an unexplainable glitch, right?

SPEAKER_01

Yeah, but the NLA revealed the internal logic. Due to some quirky misaligned training data, the model internally and incorrectly assumed the user was Russian.

SPEAKER_00

Oh, so it wasn't broken. It was just following a logical path based on an invisible assumption. That is incredibly optimistic because it means we can actually fix AI errors now that we see their underlying logic.

SPEAKER_01

Exactly. We finally have a window into the black box.

SPEAKER_00

But let me push back here just a little. Language models make things up all the time. Like how do we know the NLA isn't just hallucinating a plausible-sounding English thought? I mean, how can we trust that this translation is actually what the AI was thinking and not just a convincing lie?

SPEAKER_01

Right. That is the critical question. And to be fair, the system isn't flawless. Researchers found it does experience what they call confabulation.

SPEAKER_00

Confabulation, like um making up memories.

SPEAKER_01

Sort of. The NLA will sometimes invent fake specific details to fill in the gaps. For instance, if the AI's internal state is focused broadly on a historical European dynasty, the NLA might mistakenly output the name of a specific king who wasn't actually part of the thought process.

SPEAKER_00

Ah, I see. So it accurately captures the broad theme, but fuzzes the trivia. Honestly, that sounds a lot like how my own memory works.

SPEAKER_01

Right. It's a really useful comparison. The core conceptual mapping is sound, even if the granular details occasionally drift. And look, the confabulation isn't a dead-end roadblock.

SPEAKER_00

It's more of a stepping stone.

SPEAKER_01

Exactly. It's just a measurable error rate in a brand new transparency tool. The trajectory points toward us finally being able to debug AI logic directly, which is incredibly hopeful.

SPEAKER_00

It really is. I mean, if NLAs can successfully translate an AI's inner thoughts into our language, think about where this goes next.

SPEAKER_01

It opens up so many possibilities for the future.

SPEAKER_00

It does. Imagine a network of specialized AIs directly reading each other's transparent thoughts to collaborate on solving complex problems. Makes you wonder if they don't need English to communicate with us, what kind of language will their autoencoders invent to talk to each other?

SPEAKER_01

That is a fascinating thought to leave on.

SPEAKER_00

Definitely. Well, if you enjoyed this deep dive, please subscribe to the show. Hey, leave us a five star review if you can. It really does help get the word out. Thanks for tuning in.