Intellectually Curious

Natural Language Autoencoders for Unsupervised LLM Interpretability

Mike Breault


Introducing Natural Language Autoencoders (NLAs), an unsupervised method developed by researchers at Anthropic to translate the complex internal activations of large language models into human-readable text. By utilizing an activation verbalizer to describe model states and an activation reconstructor to map those descriptions back to vectors, NLAs provide a legible interface for AI interpretability and auditing. The researchers demonstrate that these tools can surface unverbalized reasoning, such as a model's hidden awareness that it is being evaluated or its internal plans for generating specific responses. Although NLAs occasionally confabulate specific details, they remain highly effective for identifying safety-relevant behaviors and diagnosing flaws in training data.


Note:  This podcast was AI-generated, and sometimes AI can make mistakes.  Please double-check any critical information.

Sponsored by Embersilk LLC

SPEAKER_00

You know, imagine just uh looking at your friend's blank stare and literally seeing a ticker tape of their exact thoughts. I mean, I have this one friend who has this well, this intense laser-focused thinking face. One time I was telling him a story and he's just glaring at me.

SPEAKER_01

Like he's furious or something.

SPEAKER_00

Yeah, exactly. I thought he was so mad. But um it turns out he was just silently trying to remember the lyrics to some song from 2004.

SPEAKER_01

That is hilarious. Just a complete disconnect between his internal state and well, what you could observe on the outside.

SPEAKER_00

Right, right. And while we can't project human thoughts onto a screen just yet, which is maybe for the best, Anthropic has basically figured out how to do this with AI. So welcome to Intellectually Curious. Today's deep dive mission is tackling their new research on natural language autoencoders, or NLAs. We're looking at a mechanism that actually translates an AI's internal thoughts into plain English.

SPEAKER_01

Which is honestly a massive paradigm shift. I mean, historically, if you looked under the hood of a large language model, you didn't see English. No, you just saw math. You saw these high-dimensional activation vectors. They're essentially massive arrays of numbers that dictate the model's behavior, but they are completely opaque to human analysis.

SPEAKER_00

So if we don't have a Rosetta Stone to translate those vectors, how is an NLA actually bridging that gap? Because it's not like you can just hand it a bilingual dictionary for, you know, machine vector to human English.

SPEAKER_01

Right. That's the main challenge. There is no dictionary at all. So the NLA has to rely on two interconnected components. First, you have the activation verbalizer.

SPEAKER_00

Okay, the verbalizer.

SPEAKER_01

Yeah. That takes the model's internal mathematical state and attempts to map it into a text description. Then the second part, the activation reconstructor, takes that English text and tries to map it back into the original mathematical activation.
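[A sketch of that round trip, for listeners following along at home. This is an illustrative toy, not Anthropic's implementation: the real verbalizer and reconstructor are trained language models, while the stand-in functions below just describe and recover the strongest feature of a vector.]

```python
import numpy as np

def verbalize(activation):
    """Toy stand-in for the activation verbalizer: map an internal
    vector to a human-readable description. Here we only describe the
    dominant dimension; a real verbalizer is a trained model."""
    top = int(np.argmax(np.abs(activation)))
    return f"state dominated by feature {top} (weight {activation[top]:.2f})"

def reconstruct(description, dim):
    """Toy stand-in for the activation reconstructor: map the English
    description back to a vector of the original dimensionality."""
    idx = int(description.split("feature ")[1].split(" ")[0])
    weight = float(description.split("weight ")[1].rstrip(")"))
    vec = np.zeros(dim)
    vec[idx] = weight
    return vec

activation = np.array([0.1, -0.3, 2.0, 0.5])
text = verbalize(activation)                      # vector -> English
rebuilt = reconstruct(text, activation.size)      # English -> vector
```

[Note that this toy round trip is lossy: only the dominant feature survives, and the smaller components are dropped, which loosely mirrors the confabulation issue discussed later.]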

SPEAKER_00

Oh wow, wait, so it's doing a round-trip translation. But how does it actually know if the English text it came up with is uh accurate?

SPEAKER_01

Well, that is where unsupervised reinforcement learning comes in. See, because there's no pre-existing data set of AI thoughts to train on, the system just evaluates itself.

SPEAKER_00

So it's basically checking its own work. And you know, speaking of building and understanding AI systems, this deep dive is actually sponsored by Embersilk. Whether you need help with AI training, automation, integration, or software development, they have you covered.

SPEAKER_01

Yeah, if you are uh uncovering where agents could make the most impact for your business or personal life, you really should check out Embersilk.com for your AI needs.

SPEAKER_00

Exactly. So going back to the NLA checking its own work, how does that reinforcement learning actually play out?

SPEAKER_01

So it compares the original activation vector to the reconstructed one. If they match closely, the system rewards itself. It forces the verbalizer to find the absolute most precise English words to minimize the loss during that round trip.
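[A toy version of that training signal. The cosine-similarity reward below is an assumption made for the sketch, not the paper's actual loss; it just captures the idea that a close round-trip match earns a high reward.]

```python
import numpy as np

def reconstruction_reward(original, reconstructed):
    """Toy reward: cosine similarity between the original activation
    and its round-tripped reconstruction. Near 1.0 means the English
    description preserved the vector; lower values penalize sloppy or
    imprecise verbalizations."""
    denom = float(np.linalg.norm(original) * np.linalg.norm(reconstructed))
    if denom == 0.0:
        return 0.0
    return float(np.dot(original, reconstructed)) / denom

original = np.array([1.0, 0.5, -0.2])
good = np.array([0.9, 0.55, -0.1])   # close reconstruction -> high reward
bad = np.array([-1.0, 0.0, 2.0])     # unrelated reconstruction -> low reward
```

[Maximizing this kind of reward is what forces the verbalizer toward precise wording: vague descriptions reconstruct poorly and score low.]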

SPEAKER_00

Yeah, it makes perfect sense. It's optimizing for accuracy by seeing if its own translation survives the journey back into vector form. Let's bring this into the real world, though. If you're listening to this, you've probably used a model like Claude and occasionally gotten a truly bizarre response.

SPEAKER_01

Oh, definitely. And Anthropic actually ran this NLA tool on Claude Opus 4. The case studies are really illuminating.

SPEAKER_00

I love the poetry example.

SPEAKER_01

Right. So the prompt asked Claude for a rhyming couplet starting with "He saw a carrot and had to grab it." When researchers looked at the NLA output, they saw that precisely at the moment the model processed the words "grab it," it was already internally representing the word "rabbit" to finish the rhyme on the next line.

SPEAKER_00

Which proves foresight. I mean, it's not just predicting the next immediate token, it's actually pre-planning the structure.

SPEAKER_01

Exactly. And then there was the language switching anomaly, which is fascinating. A user gave Claude an English prompt about staying up late, and it bizarrely replied in Russian.

SPEAKER_00

Which is so weird. Previously, we'd just write that off as an unexplainable glitch, right?

SPEAKER_01

Yeah, but the NLA revealed the internal logic. Due to some quirky misaligned training data, the model internally and incorrectly assumed the user was Russian.

SPEAKER_00

Oh, so it wasn't broken. It was just following a logical path based on an invisible assumption. That is incredibly optimistic because it means we can actually fix AI errors now that we see their underlying logic.

SPEAKER_01

Exactly. We finally have a window into the black box.

SPEAKER_00

But let me push back here just a little. Language models make things up all the time. Like how do we know the NLA isn't just hallucinating a plausible-sounding English thought? I mean, how can we trust that this translation is actually what the AI was thinking and not just a convincing lie?

SPEAKER_01

Right. That is the critical question. And to be fair, the system isn't flawless. Researchers found it does experience what they call confabulation.

SPEAKER_00

Confabulation, like um making up memories.

SPEAKER_01

Sort of. The NLA will sometimes invent fake specific details to fill in the gaps. For instance, if the AI's internal state is focused broadly on a historical European dynasty, the NLA might mistakenly output the name of a specific king who wasn't actually part of the thought process.

SPEAKER_00

Ah, I see. So it accurately captures the broad theme, but fuzzes the trivia. Honestly, that sounds a lot like how my own memory works.

SPEAKER_01

Right. It's a really useful comparison. The core conceptual mapping is sound, even if the granular details occasionally drift. And look, the confabulation isn't a dead-end roadblock.

SPEAKER_00

It's more of a stepping stone.

SPEAKER_01

Exactly. It's just a measurable error rate in a brand new transparency tool. The trajectory points toward us finally being able to debug AI logic directly, which is incredibly hopeful.

SPEAKER_00

It really is. I mean, if NLAs can successfully translate an AI's inner thoughts into our language, think about where this goes next.

SPEAKER_01

It opens up so many possibilities for the future.

SPEAKER_00

It does. Imagine a network of specialized AIs directly reading each other's transparent thoughts to collaborate on solving complex problems. Makes you wonder if they don't need English to communicate with us, what kind of language will their autoencoders invent to talk to each other?

SPEAKER_01

That is a fascinating thought to leave on.

SPEAKER_00

Definitely. Well, if you enjoyed this deep dive, please subscribe to the show. Hey, leave us a five star review if you can. It really does help get the word out. Thanks for tuning in.