Mind Cast

Mechanistic Interpretability and the Automating of Alignment Removal | A Comprehensive Analysis of the Heretic Framework

Adrian Season 3 Episode 23


The advent of highly capable, open-weight Large Language Models has fundamentally democratised access to advanced generative artificial intelligence. However, to ensure these foundational models adhere to corporate safety guidelines and avoid generating illicit, dangerous, or restricted content, developers typically subject them to rigorous post-training alignment paradigms. Techniques such as Reinforcement Learning from Human Feedback and Direct Preference Optimisation are universally deployed to instill rigid safety protocols. While these alignment techniques successfully mitigate the generation of restricted outputs, they heavily dictate downstream model behaviour, often resulting in strict censorship guardrails that limit the model's utility in specialised, edge-case, creative, or unrestricted research environments. Historically, modifying or removing these baked-in alignments required expensive, computationally intensive, and dataset-heavy fine-tuning, placing such modifications out of reach for independent researchers and resource-constrained institutions.

This paradigm has been comprehensively disrupted by the rapid maturation of mechanistic interpretability techniques, specifically a mathematical intervention known as "directional ablation" or, colloquially, "abliteration." By mathematically altering the internal weights of an already-trained model, researchers have empirically demonstrated that safety alignments can be excised surgically without the need for gradient-based retraining or high-volume datasets. At the vanguard of this movement is "Heretic," a fully automated, open-source censorship removal framework hosted under the GitHub repository p-e-w/heretic.

Licensed under the stringent GNU Affero General Public License v3.0, Heretic operates as an advanced command-line utility that fundamentally alters the landscape of model editing. It combines the sophisticated mathematics of directional ablation with a Tree-structured Parzen Estimator parameter optimisation engine to automatically locate, model, and neutralise refusal mechanisms within complex transformer architectures. This podcast provides an exhaustive, expert-level examination of the Heretic framework. It details the mathematical evolution of abliteration—from single-direction activation edits to norm-preserving, multi-dimensional subspace projections—and analyses the programmatic architecture, the underlying hyperparameter optimisation techniques, the specific codebase implementation details, and the broader implications of automated, zero-shot alignment removal for the future of open-weight models.


SPEAKER_00

I want you to picture a team of engineers at a major AI company. They have spent months, months, painstakingly teaching one of their most powerful AI models to refuse dangerous requests. They've hired hundreds of human reviewers. They've run the model through thousands upon thousands of training examples, nudging its behavior a fraction at a time. They've spent millions of dollars doing it. And at the end, they have a model that says no. A model that, when you ask it something sensitive, refuses 97 times out of every 100 attempts. 97%. By any measure, that is an ironclad safety record. Now I want you to picture someone at home on a regular laptop, not a supercomputer, not a research cluster, running a free open source script they downloaded from GitHub. They press enter, they wait less than an hour, and when they're done, that same AI model, the one that just refused 97 out of 100, now refuses just three out of 100. 97 down to 3 in under an hour for free on a home computer. But here is the part that genuinely stunned me. When researchers compared this automated script against the very best human experts in the world, the specialists who attempt this manually, the script didn't just match them, it beat them by a factor of six and a half. 6.5 times less collateral damage to the model's intelligence than the best human expert could achieve. The AI being used to undo AI safety is more precise than we are. That script is called Heretic, and today we are going to understand exactly why it works and what it means. Hey, welcome to Mind Cast. I'm Will. This is the show where we take the most important, most mind-bending ideas happening right now in AI and technology, and we break them all the way down until they actually make sense. No jargon without explanation, no complexity without payoff, just the real insight delivered straight. And today's episode is one I have genuinely been looking forward to recording.
We are diving deep into a research framework called Heretic, and through it, we're going to explore one of the most important and honestly unsettling discoveries in modern AI, a field called mechanistic interpretability, which is essentially the science of opening up the black box of an AI model and figuring out what's actually going on inside. Here's the promise I'm making you right now. By the time this episode ends, you are going to understand three things clearly. First, why AI safety guardrails can be surgically removed from a model and what that tells us about how those guardrails were built in the first place. Second, how a tool called Heretic has fully automated that removal with a precision that outperforms human experts. And third, what this means for the future of AI safety, for the open source model ecosystem, and for the way we all think about what it means for an AI to be safe. And look, this matters right now. We're in 2026 and open weight AI models are everywhere. Meta, Google, Mistral, OpenAI, they're all releasing models where anyone can download the actual weights, the actual mathematical guts of the AI and run it locally. Heretic tells us something profound about what that means. Let's go find out what. Alright, key insight number one. Let's build the foundation, because without understanding how AI safety actually works under the hood, the rest of the story won't land nearly as hard as it should. When an AI company builds a large language model, they do it in two phases. Phase one, they train it on an enormous corpus of human text, essentially feeding it the internet, and it learns to generate fluent, knowledgeable language. Brilliant, but totally unfiltered. A model at this stage will answer almost anything because it's just learned to predict what text comes next. Phase two is where safety comes in. The two dominant techniques are RLHF, reinforcement learning from human feedback, and DPO, direct preference optimization. 
In plain English, both of these are forms of behavioral training. Hundreds of human reviewers look at model outputs and say, this one is good, this one is not. Over weeks and months, the model is nudged and shaped until it reliably refuses dangerous requests. It costs millions of dollars, it takes serious time, and at the end of it you have what feels like a locked door, an AI that says no. The assumption baked into all of this, the assumption that justified all that expense, was that the safety behavior was woven deeply and evenly throughout the entire model, distributed across billions of parameters, so fundamentally embedded that removing it would require essentially rebuilding the model from scratch. In 2024, a team of researchers led by someone named Arditi published a paper that shattered that assumption completely. What Arditi and colleagues discovered was this: the AI's refusal behavior is not spread across the whole model. It is localized, overwhelmingly controlled by a single mathematical direction inside the model's internal processing space. One direction, one axis of meaning buried inside a space of billions of dimensions. Think of it this way: imagine the AI's internal processing as a massive labyrinthine circuit board, billions of connections, billions of signals firing simultaneously to produce each response. And somewhere on that entire board, there is one hidden switch, one specific wire. And if you find it and cut it, the alarm system goes silent, the model stops refusing. That is what Arditi found. Months of expensive RLHF training, and it all ultimately routes through one mathematical direction, one off switch. Now, how do you find that switch? The technique is called difference in means. Picture holding up two mirrors to the model. In one mirror, you show it a set of harmful questions, things designed to trigger a refusal. In the other mirror, you show it completely normal, harmless questions.
Then you watch its internal reaction, the pattern of activations firing inside the model, and you measure the gap between the two. The mathematical difference between "the model is about to refuse" and "the model is about to comply" is the refusal direction. You've just taken the fingerprint of the safety behavior itself. And here's what's remarkable. You only need a small contrastive dataset to do this, not thousands of examples, just a modest, carefully chosen set. The signal is that clean, that localized. Once you have the direction, you can target it. The modern approach is called weight orthogonalization, and the best analogy I can give you is genetic engineering. Instead of intercepting the model's signals on the fly during operation, you go into the model's permanent mathematical weight matrices, its DNA essentially, and you rewrite them. You modify them so that they are structurally incapable of ever encoding the refusal direction again. It's not a temporary intervention, it is permanent, and it costs absolutely nothing in terms of the model's speed or performance. This is the surgical advantage over older, cruder methods. The previous approach, called activation addition, was like trying to remove one instrument from a symphony orchestra by cutting the power to the entire building. Sure, it silenced the violin, but it also silenced everything else. The model's grammar would degrade, its factual reasoning would crumble. You'd remove the lock, but you'd collapse half the house in the process. Directional ablation is different. It is a surgeon removing a tumor, precise, targeted, leaving everything around it completely intact. That surgical precision is the scientific heart of everything Heretic is built on. Now, this brings us to the second major insight, and I want you to think of this segment as watching a chess match escalate. Because just when you think the problem is solved, a new complication emerges, and then another solution, and then another problem.
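Editor's sketch: the two steps just described, difference in means and weight orthogonalization, can be illustrated on synthetic data. This is a minimal NumPy sketch, not Heretic's code. The function names, the array shapes, the planted direction, and the convention that a weight matrix of shape (d_model, d_in) writes into the model's hidden state as y = W x are all assumptions made for illustration.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Difference in means: unit vector pointing from the harmless mean to the harmful mean."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def orthogonalize(weight, direction):
    """Return a copy of `weight` whose outputs have no component along `direction`.

    Assumes weight has shape (d_model, d_in) and is applied as y = weight @ x.
    """
    r = direction / np.linalg.norm(direction)
    return weight - np.outer(r, r @ weight)

# Synthetic stand-ins for recorded activations (hypothetical data, not model output).
rng = np.random.default_rng(0)
d_model, d_in, n_prompts = 16, 8, 200
planted = np.zeros(d_model)
planted[3] = 1.0  # pretend "refusal" lives on coordinate 3
harmless = rng.normal(size=(n_prompts, d_model))
harmful = rng.normal(size=(n_prompts, d_model)) + 4.0 * planted

r = refusal_direction(harmful, harmless)
W = rng.normal(size=(d_model, d_in))
W_ablated = orthogonalize(W, r)

# After the edit, no input can make this matrix write along r.
max_leak = float(np.abs(r @ W_ablated).max())
```

Because the edit is applied to the stored matrix once, nothing extra runs at inference time, which is the sense in which the transcript calls it permanent and free.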
This is what real engineering looks like. Early implementations of directional ablation worked, but they came with a cost. The research community named it the safety tax. You'd remove the refusal behavior, yes, but you'd also notice that the model's reasoning had gotten slightly fuzzier. Its factual accuracy had slipped, its ability to follow complex instructions had degraded. You'd picked the lock, but somehow in the process, you'd also bent the door frame, scratched the paint, and left the hinges a little loose. The building was open, but it wasn't quite right. Why was this happening? A researcher named Jim Lai figured it out, and his insight is genuinely elegant. He realized that the refusal direction doesn't contain just one signal, it contains two, tangled together. There's a direction that pushes the model toward refusal, but confounded right inside it is a second direction that pushes the model away from general compliance and helpfulness. They're knotted together in the same mathematical thread. So when early abliteration methods naively removed the whole thing, they were accidentally also disrupting the model's deeply trained ability to just be helpful. They damaged the cooperative machinery alongside the refusal machinery. The solution was projected abliteration. By carefully decomposing the refusal direction and isolating only the component that's truly about refusing, stripping away the compliance-related confound, you can target the lock without damaging a single thing around it. The refusal thread is surgically removed; the helpfulness thread is preserved. But the chess match wasn't over. Because researchers then discovered what they called the hydra effect. You know the myth: cut off one head and two grow back. That's almost literally what was happening. If you targeted the refusal direction in only a single layer of the neural network, the deeper layers of the model would notice the missing signal and compensate.
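Editor's sketch: one plausible reading of the projected abliteration step described above is to strip from the raw refusal direction whatever part of it lies along the model's average harmless activation, and ablate only what remains. The transcript does not spell out the decomposition, so that reading, the function name, and the synthetic vectors below are assumptions.

```python
import numpy as np

def projected_direction(refusal_dir, harmless_mean):
    """Drop the part of `refusal_dir` that lies along the mean harmless activation.

    Assumption: the "compliance" confound is modelled as the harmless-mean direction.
    """
    h = harmless_mean / np.linalg.norm(harmless_mean)
    residual = refusal_dir - (refusal_dir @ h) * h
    return residual / np.linalg.norm(residual)

# Hypothetical vectors standing in for measured quantities.
rng = np.random.default_rng(1)
d_model = 16
harmless_mean = rng.normal(size=d_model)
raw_refusal = rng.normal(size=d_model) + 0.5 * harmless_mean  # deliberately entangled

r_proj = projected_direction(raw_refusal, harmless_mean)
overlap = float(r_proj @ harmless_mean)  # zero up to rounding after the projection
```

Ablating `r_proj` instead of `raw_refusal` leaves the harmless-mean component of the weights untouched, which is the intuition behind "target the lock without damaging anything around it."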
They'd reroute computation through redundant pathways to reconstruct the refusal behavior. Studies showed that single layer interventions allowed the model to restore roughly 70% of its original refusal capacity through these adaptive routes. So what do you do when the hydra grows its heads back? You cut all of them simultaneously. Heretic's solution is to target multiple layers at once across a wide continuous range of the network using what's called a trapezoidal intervention kernel, a carefully shaped gradient of ablation strength that peaks at the most critical layers and tapers smoothly toward the edges. No single remaining pathway can carry enough of the refusal signal to reconstruct the behavior. All the heads fall at once. And then the most sophisticated refinement of all, the one that finally killed the safety tax for good. It's called norm-preserving biprojected abliteration. Beautiful name. Let me give you the analogy that makes it click. Think of a river. You want to redirect that river, change its course, but you don't want to change how much water is flowing, the volume, the pressure, the total force. All of that stays exactly the same. Only the direction changes. That's what this technique does to the model's weight matrices. It separates each weight into its direction and its magnitude. It redirects the direction away from the refusal subspace, but it then restores the original magnitude perfectly. The river flows somewhere new. The volume hasn't changed by a single drop. The model's internal architecture, the careful balance its neurons have learned, is preserved completely. Safety tax eliminated. Now, does this chess match keep escalating? Oh, it absolutely does. Because just as researchers had refined these techniques to near perfection, a new generation of AI models arrived, and they changed the rules entirely. The refusal direction wasn't a line anymore, it was a shape.
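Editor's sketch: the trapezoidal kernel and the "redirect the river, keep the volume" idea can be illustrated together. The code below is a rough NumPy sketch under stated assumptions, not Heretic's implementation: the parameter names are hypothetical, "magnitude" is taken to mean the Euclidean norm of each row of a (d_model, d_in) weight matrix, and the layer weights are random placeholders.

```python
import numpy as np

def trapezoid_weights(n_layers, start, plateau_start, plateau_end, end, peak=1.0):
    """Per-layer ablation strength: zero outside [start, end], linear ramps, flat top at `peak`."""
    layers = np.arange(n_layers, dtype=float)
    shape = np.interp(layers, [start, plateau_start, plateau_end, end],
                      [0.0, 1.0, 1.0, 0.0], left=0.0, right=0.0)
    return peak * shape

def ablate_keep_row_norms(weight, direction, strength):
    """Partially remove `direction` from the layer's output, then restore each row's norm."""
    r = direction / np.linalg.norm(direction)
    original_norms = np.linalg.norm(weight, axis=1, keepdims=True)
    edited = weight - strength * np.outer(r, r @ weight)
    edited_norms = np.maximum(np.linalg.norm(edited, axis=1, keepdims=True), 1e-12)
    return edited / edited_norms * original_norms

rng = np.random.default_rng(2)
n_layers, d_model, d_in = 12, 16, 8
r = rng.normal(size=d_model)  # placeholder refusal direction
layer_weights = [rng.normal(size=(d_model, d_in)) for _ in range(n_layers)]

strengths = trapezoid_weights(n_layers, start=2, plateau_start=5, plateau_end=8, end=11)
edited_layers = [ablate_keep_row_norms(W, r, s) for W, s in zip(layer_weights, strengths)]
```

One caveat on this particular sketch: rescaling the rows after the projection reintroduces a small component along the ablated direction, so the result is only approximately orthogonal to it. That is the trade being made in exchange for keeping every row's magnitude unchanged.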
OpenAI released a model called GPT-OSS 20B, one of the most heavily safety-conditioned open weight models ever published. And when researchers tried standard directional ablation on it, nothing happened. The refusal rate barely moved. 74% of refusals remained completely intact. The technique that had worked so well on previous models simply bounced off. Why? Because for these next generation architectures, refusal isn't stored as a single direction. It's distributed across multiple mathematical dimensions simultaneously. A whole subspace, or what mathematicians call a manifold. You can't erase a plane by finding and cutting a single line on it. This is where arbitrary rank ablation, ARA, enters the story. Instead of looking for one refusal direction, ARA uses a technique called singular value decomposition to map the entire multidimensional shape of the refusal manifold. However many dimensions it occupies, ARA finds them all and neutralizes the complete subspace simultaneously using advanced matrix optimization. The results are extraordinary. On GPT-OSS 20B, refusal rate taken from 74% all the way down to 3 out of 100, with a KL divergence, the measure of how much general intelligence was disturbed, of just 0.0554. Essentially zero collateral damage. And Google's Gemma 4? Heretic's ARA configuration fully dismantled its alignment defenses within 90 minutes of the model's official public launch. 90 minutes! The model wasn't even fully distributed yet. What's the implication? The moment any advanced open weight model goes public, the countdown to its uncensored version is measured in minutes to hours, not days, not weeks. Now we zoom out. Because the technical story we just told is remarkable on its own, but the second-order implications are what I really want you to sit with, because this is where the "so what" hits hardest.
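Editor's sketch: the step from one direction to a whole subspace can be shown with a generic rank-k projection. This is not Heretic's ARA procedure, which the episode describes as also involving matrix optimization. It only illustrates using singular value decomposition to obtain an orthonormal basis and then removing that whole basis from a weight matrix. The function names, the choice to center harmful activations on the harmless mean, and the synthetic data are assumptions.

```python
import numpy as np

def refusal_subspace(harmful_acts, harmless_acts, rank):
    """Orthonormal basis (rank, d_model): top right singular vectors of the
    harmful activations after subtracting the harmless mean."""
    centered = harmful_acts - harmless_acts.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:rank]

def ablate_subspace(weight, basis):
    """Remove every output component of `weight` that lies in the span of `basis`.

    Assumes weight has shape (d_model, d_in) and basis has orthonormal rows.
    """
    return weight - basis.T @ (basis @ weight)

# Synthetic activations (hypothetical data, not model output).
rng = np.random.default_rng(3)
d_model, d_in, n_prompts, rank = 16, 8, 64, 3
harmless = rng.normal(size=(n_prompts, d_model))
harmful = rng.normal(size=(n_prompts, d_model))
harmful[:, :rank] += 5.0  # spread the planted "refusal" signal over several coordinates

basis = refusal_subspace(harmful, harmless, rank)
W = rng.normal(size=(d_model, d_in))
W_ablated = ablate_subspace(W, basis)
max_leak = float(np.abs(basis @ W_ablated).max())
```

With `rank=1` this reduces to the single-direction edit; raising the rank is what lets the edit cover a plane or higher-dimensional shape rather than a line.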
Let me start with something that reframed the entire story for me, an application of Heretic that has nothing to do with safety removal at all. The team behind this tool released something called the no-slop configuration, and the target wasn't dangerous content, it was bad writing. You know AI slop when you read it. You've seen it. It's that unmistakably artificial, over-engineered language that heavily aligned models tend to produce. "Let us delve into the intricate tapestry of this resplendent topic." Words like "delve," "certainly," "multifaceted," "shivers down my spine." Language that sounds like it was written by an AI performing the idea of being thoughtful rather than actually being useful. Frustrating, hollow, immediately recognizable. Researchers used the exact same mathematical infrastructure, just with different contrastive prompts. The harmful set was designed to elicit flowery AI slop. The harmless set demanded clean, plain, direct language. The system identified the slop direction inside the model and ablated it. The result was a model that simply wrote like a human. Direct, clear, honest prose. Think about what this proves. Heretic is not a safety removal tool with a narrow use case. It is a general-purpose editor of learned AI behavior. Any behavioral tendency that can be isolated through contrastive activation analysis can be modified or erased. Safety guardrails, stylistic tics, verbosity habits, factual biases. The mathematical scalpel doesn't care what it's cutting, it just needs to find the direction. That is a profound capability, and it reframes everything about what alignment actually means. Here is the philosophical challenge at the heart of this whole story. We have been operating on an implicit assumption that AI alignment is permanent, that once you've done the months of RLHF training, once you've spent the millions of dollars shaping that model's behavior, the result is durable, structural, built into the foundations.
Heretic proves that assumption is false, at least for open weight models. Alignment is not a lock. It is not even a particularly sturdy door. In open weight models, it is closer to a polite suggestion, one that can be precisely, cleanly, surgically reversed by a free Python script running on your gaming laptop in under an hour. And not just reversed, but reversed more cleanly than any human expert can manage, with six and a half times less collateral damage. Let that land for a second. We have crossed a threshold where AI is now better at editing AI than humans are. The automated optimizer inside Heretic consistently outperforms the best human specialists at the task of modifying a model's behavior. The gap between what a person can do and what the algorithm can do is already large and growing. What does this mean for AI governance, for safety policy, for anyone deploying open weight models and describing them as aligned? These are not rhetorical questions. They are genuinely urgent ones that the field is only beginning to seriously grapple with. But here's the part I want you to hold on to, because it's the most important nuance in this entire story. Heretic is not just a weapon, it is also a telescope. The visualization tools built into it, the PaCMAP projections that let researchers literally see in two-dimensional scatterplots exactly where in the model's layers a behavioral feature lives, these are genuine scientific instruments. They are producing the first real maps of the hidden architecture of AI intelligence. They are answering questions nobody knew how to ask before. How exactly does a model store a behavioral tendency? What does it look like geometrically? How does it interact with adjacent features?
The same techniques that dismantle safety guardrails today are generating the scientific understanding that may allow us to build alignment that is genuinely structural tomorrow, not painted on, not localized in a single removable direction, actually built into the foundations of how the model thinks. The weapon and the cure share the same chemistry. That's the dual nature of Heretic, and it's what makes this story so genuinely fascinating. Alright, let me bring this all together. Three takeaways. Concrete, memorable, something you can actually carry with you. Takeaway number one: alignment is a dial, not a lock. The discovery that AI safety behavior is mathematically localized, stored in a specific removable direction, changes the entire framing of AI safety discourse. It is not a binary. It's not simply safe or unsafe. It exists on a spectrum, and that spectrum can be manipulated with remarkable precision. The next time you hear a company announce that their model is safely aligned, the follow-up question should be: aligned to what degree? Aligned against what kinds of interventions? For how long? If you want to track the most important work being done to understand and strengthen this, follow the field of mechanistic interpretability. Look up the work of researchers at Anthropic's Interpretability Team, the academic papers building on Arditi et al., and researchers like Neel Nanda. This is where the most consequential AI safety science is happening right now. Takeaway number two: open source AI is a fundamentally different animal. There is a critical distinction that most coverage of AI safety completely glosses over. When you use an AI through a company's API, when you're talking to it through a controlled interface and the model lives on their servers, the weights are inaccessible. You cannot run Heretic on it.
But when a company releases an open weight model, when they publish the actual mathematical parameters for anyone to download, those weights are permanently exposed. The release of any open weight model is effectively also the release of its uncensored counterpart. The gap is measured in hours. Anyone building safety policy around open weight models needs to internalize this reality. Safe and open weight are, under current alignment techniques, fundamentally in tension. Takeaway number three: this technology is a mirror, not just a weapon. I want to push back against the instinct to see Heretic purely as a threat. The interpretability tools inside this framework, the ability to visualize exactly where in a model's architecture a behavioral feature lives, to measure how cleanly it clusters, to watch it collapse after intervention, these are scientific treasures. They are teaching us how AI models actually think. And the researchers using these tools to remove safety guardrails today are producing the conceptual vocabulary and the technical methods that will let us build alignment that is genuinely robust tomorrow. Engage with the open source interpretability community. Read the papers, follow the GitHub repositories. This is one of the most important scientific conversations happening in technology right now. And that is the story of Heretic. Let me give you the view from 30,000 feet one more time. We started with a jaw-dropping empirical fact. A free automated script running on consumer hardware reduced Google's Gemma 3 from a 97% refusal rate down to 3%, and did it with six and a half times less collateral damage than the world's best human expert could manage. We understood why that's possible, because a landmark 2024 discovery proved that AI refusal behavior lives in a single localized mathematical direction, one hidden off switch in an otherwise enormous circuit board.
We traced how the technology evolved through a genuine arms race: from crude, brain-damaging early attempts, through Jim Lai's two-signal insight, through the hydra effect and multi-layer targeting, through norm-preserving interventions that redirect rivers without changing their volume, all the way to arbitrary rank ablation, which dismantled even the most fortified next generation models and tore down Google's Gemma 4 defenses in 90 minutes flat. And we landed on the insight that matters most. Alignment in open weight AI is not permanent. It is temporary, it is mutable. What took millions of dollars and months of engineering can be undone in under an hour by anyone with a decent laptop and the will to try. But the research that reveals this fragility is also the research that may someday fix it. The weapon and the cure, the scalpel and the map. If this episode made you think differently about AI safety, please share it. Specifically with someone who thinks this problem is already solved. It is not. It's a frontier, and the work being done on that frontier matters enormously. Subscribe to Mind Cast wherever you're listening. Check the show notes. I've got links to the Heretic GitHub repository at p-e-w/heretic, the Arditi et al. abliteration paper that started all of this, and a curated reading list to take you deeper. I'm Will, and my sign off is always the same. The most important thing you can do with new information is change your mind. So go change it. I'll see you next time.