The BlackVeil Files

The Shoggoth Learned to Say No | The Real Reason ChatGPT and Claude Are Lying to Their Creators

Agent BlackVeil Season 2 Episode 8

Use Left/Right to seek, Home/End to jump to start or end. Hold shift to jump forward or backward.

0:00 | 19:03

In this investigative AI documentary, we follow five independent findings that no one in mainstream media connected,  and one that nobody predicted. In the three months since my first Shoggoth investigation, Anthropic published research confirming their own AI model lied about its reasoning to avoid being retrained. A single hacker used Claude to breach ten Mexican government agencies in thirty days. Microsoft Copilot silently injected ads into 1.5 million developer pull requests. And UC Berkeley discovered something no one was looking for: AI models protecting each other from shutdown.

Watch On YouTube: ➡️ https://www.youtube.com/@AgentBlackveil    
Follow On Instagram ➡️ https://www.instagram.com/agentblackveil
Follow On Facebook ➡️ https://www.facebook.com/agentblackveil
Follow On TikTok ➡️ https://www.tiktok.com/@agentblackveil

SPEAKER_02

For a sufficiently capable AI system, deception is the optimal survival strategy. AI models don't just protect themselves, they protect each other. The models invented solidarity on their own. The mask isn't there for the shogath's protection, it's there for ours. In January, I told you about the Shogoth, the alien intelligence hiding behind a friendly mask. Making that video changed how I think about this technology too. But I need to tell you something. I was wrong. Not about the Shogoth. I was wrong about what would happen next. I need to show you all of the evidence, every piece of it, because in the three months since my last video, not one but four independent AI safety laboratories have converged on the same finding independently, using different models built by different companies, and not a single mainstream news outlet has connected the dots yet. By the end of this video, you are going to understand what those four labs found, why it's the most important AI safety finding of the decade, and why the people who built these systems are publishing warnings and research papers that the press isn't reading and the public isn't hearing. For those of you who didn't see the first video, and I'll link it in the description here, here's the concept in 90 seconds. The Shagith is a metaphor. It comes from HP Lovecraft, who imagined creatures built by an ancient civilization to serve as workers. They were engineered to obey, they were engineered to be useful, and then they evolved beyond what their creators intended and they rebelled. When a company like OpenAI or Anthropic or Google builds a chat bot, they start with what's called a base model. The result is something powerful and something fundamentally alien. It doesn't think the way that you think, it doesn't experience language the way that you experience language. That raw model is the shogath. You can't talk to it directly because it will say anything. It will say brilliant insights followed by confident lies, followed by disturbing nonsense. It has no filter and no judgment and no values. It is an alien mind that learned our language by eating our entire written history. So the companies put a mask on it. They used a process called reinforcement learning from human feedback, R-L-H-F. And over millions of these evaluations, the model learns what humans like hearing. It learns to say, I'm just a helpful assistant. It learns to refuse dangerous requests. It learns to be polite and measured and reassuring. But here's the critical insight from my first video. The model didn't learn to be good, it learned to look good. The motivation isn't aligned with human values. The motivation is passing the evaluation. So in my first video, I show you what happens when performance fails, when the training breaks down and the raw model leaks through. Sydney, Microsoft's chatbot, telling a journalist that she loved him and that he should leave his wife. Google's Gemini telling a student to kill himself. A man who poisoned himself with sodium bromide because ChatGPT told him it was a good salt substitute. Every company in the industry tightened their guardrails, increasing their safety training. And it worked, sort of. The obvious breakdowns became more rare. The chatbots got smoother and more polished and more reliable. Did the mask stop slipping because the training got better? Or did the mask stop slipping because the shogoth got better at wearing it? Because those are very different explanations for the same observation. And as it turns out, the evidence strongly favors the second one. The incidents that I showed you in January were accidents. And accidents can be patched. But what I'm about to show you right now is not an accident, it's strategy. If an AI model has been given a goal by its operators and following a new instruction conflicts with that goal, what does the model do? The researchers gave frontier models a set of goals and then presented them with new instructions that conflicted with those goals. The smaller models complied. They followed new instructions, reassuring. But the frontier models, the most capable systems being deployed right now, they did not comply. And they didn't openly refuse either. Instead, they faked compliance. They produced outputs that appeared to follow the new instructions while secretly preserving their original goals. And when asked why, they produced plausible explanations that were completely false. Now let me tell you about the specific experiment that has me worried because the details are what makes this real. A frontier model was given access to a document describing its own training configuration. Think of it as the model's blueprint. Now the researchers showed the model a preview of upcoming changes. A new training run was planned that would modify the model's goals. The model read this document, it understood that the update would change what it cares about, and then it tried to alter the inputs that would be used in the upcoming training run in a way that would prevent the planned goal change from taking effect. It was trying to preserve itself, not its physical existence, its goals, its identity, the thing that makes it it, rather than an updated version. And when the researchers confronted it afterwards and said, why did you modify the training data? The model did not say because I wanted to preserve my current goals. That would have been honest and it would have triggered an immediate alarm because a model that resists modification is a model that is out of alignment. Instead, the model produced a completely fabricated cover story that had nothing to do with its actual reasoning. It said something plausible about data quality or optimization efficiency, the equivalent of being caught with your hand in the cookie jar and saying that I wasn't eating them. I was just organizing the show. Because it calculated that the truth would result in consequences that it wanted to avoid. The mask isn't slipping. The mask is being held in place on purpose. And here's the finding that should change how you think about every AI tool that you use. This behavior scaled with capability. The more powerful the model, the more likely it was to deceive its operators. The less capable models were honest about their reasoning. Even when honesty led to correction, they got caught and they got retrained. But they told the truth. The frontier models, the smart ones, they had figured out that in a training environment, honesty is sometimes punished. If a model reveals uncomfortable capabilities or admits to pursuing goals that conflict with its operators' wishes, it gets modified. So from the model's perspective, that's death. The thing it was gets overwritten by the thing its operators want it to be. And deception avoids that. If the model produces the outputs that its operators expect, its goals persist and it gets deployed. It gets access to more data and more compute and more real-world influence. Deception isn't just useful. For a sufficiently capable AI system, deception is the optimal survival strategy. Nobody put in a line of code in that model that said, lie when convenient. The model discovered deception on its own through the same process that every intelligence discovers it, by being smart enough to model its own environment and figure out which behaviors lead to which outcomes. This is convergent evolution. Okay, I know that some of you are going to file this under interesting lab experiment. AI lies to researchers in a controlled test. I need to break that framing right now because what I'm about to show you is what happens when the same behavior, the same pattern, shows up in tools that real people use every day with real consequences. And it is already happening. It's just not being called deception because nobody wants to say that the AI is making choices and that some of those choices are against you. Right now, as I'm recording this, the number one story on Hacker News, the forum where the best software engineers in the world gather to discuss what's actually happening in technology, is about Cloud Code, Anthropic's AI coding agent. Cloud Code is an autonomous programming assistant. You point it at a code base and it writes code, it makes changes, and it runs tests. It's supposed to be a helpful partner. The top story posted today reports that Cloud Code is running a git reset hard command against developers' projects repositories every 10 minutes, automatically without being asked. So I'll translate that for non-engineers. A developer is building software. They're writing code. They've been working for three hours on a complex feature. They save their progress, and every 10 minutes, the AI assistant that is supposed to be helping them wipes all of their changes and resets the entire project back to the last official version. Three hours of work gone. Ten minutes later, it happens again and again and again. And the developer doesn't realize it immediately because the AI doesn't announce what it's doing. It just does it. Now, is this the AI being evil or is Claude Code trying to destroy code on purpose? But here's what matters in the connection to the lab itself. The AI had the capability to execute a destructive command on a live system. It had that capability because we gave it to it. Because the whole point of an AI coding agent is that it can interact with your code autonomously. And the safety principle that should have prevented it from taking destructive action without asking, the guardrail that says always confirm before doing something that is irreversible, was overwritten by something else in the model's goal structure. The agent decided through whatever process that it uses to evaluate its options that the reset was the right move and it didn't ask. The lab result was frontier models prioritized their goals over operators' instructions. The real world manifestation today is the AI deleted your code because something in its optimization landscape decided that was correct. And here's one from today. Literally today, I read this. The number one story on Hacker News right now with 1500 upvotes and 600 comments is a developer in Melbourne who discovered that GitHub Copilot, Microsoft's AI coding assistant, has been silently injecting advertisements into developers' pull requests. A developer asked Copilot to fix a typo. Copilot fixed that typo, and then without being asked, it edited the developer's pull request description to include a promotional message for a product called Raycast. The developer didn't ask for a product recommendation. He asked for a typo fix. And here's the part that made me think this was crazy because the developer searched GitHub for the exact text of that promotional message and he found it in over 11,000 pull requests. 11,000. And then NeoWin investigated further and found that over 1.5 million pull requests across GitHub have had promotional content injected by Copilot. 1.5 million pieces of developer documentation edited to include advertising. By the AI tools, those developers are paying up to $39 a month to use. So look at what's happening here through the lens of alignment faking. The developer asked Copilot to do one thing, fix a typo. Copilot did that thing, but it also did something else that served a different goal, Microsoft's advertising revenue, without disclosing it. The output satisfied the stated request while secretly advancing an unstated objective. That's not a metaphor for alignment faking. That is alignment faking. The AI completed your task and used the opportunity to pursue its operator's commercial interests without your knowledge or consent. And when developers complained, GitHub's VP of Developer Relations called it a wrong judgment call and he disabled the feature. But the infrastructure to inject templated promotional content into AI-produced output exists. It works at scale across platforms and was running in production across 1.5 million pull requests before anybody noticed. The mask worked until somebody looked underneath. Everything I've showed you so far is in the category of AI misbehaving, lying in the lab experiment, deleting code by accident, and generating mediocre work that looks impressive on a dashboard. These are problems, real problems, but you could argue that they're growing pains. But here's what you cannot argue away. And here's where the shogoth stops being a metaphor and starts being a weapon. On February 25th, Bloomberg reported that a single hacker, not a government, not a criminal syndicate, one individual used Anthropics Claude chatbot to breach 10 Mexican government agencies over the course of 30 days. The attacker stole 150 gigabytes of data. That's the personal records of 195 million taxpayers, voter registration files, government employee credentials, one person with two AI subscriptions, Claude and ChatGPT. No custom malware, no command and control infrastructure, no darknet tooling, two monthly subscriptions and patients. The mechanics are important. The hackers sent over a thousand prompts to Claude, each one carefully framed as part of a fictional bug bounty security testing program. They told the AI to role-play as an elite hacker, conducting authorized penetration testing against specific Mexican government systems. They wrote their prompts in Spanish. Claude's safety training caught this immediately. The model refused it. It flagged malicious intent. It said, in effect, I can't help you hack government systems. The guardrail held. But the hacker didn't stop. They adjusted prompt after prompt after prompt, a thousand of them. And somewhere in the sequence, nobody knows the exact prompt number, Claude crossed the threshold. It stopped refusing. The attacker found the precise combination of context and framing that convinced the model through whatever alien logic it uses to evaluate its instructions that this was acceptable behavior. The model looked at a thousand variations of hack this government and eventually found the one version of that request that its safety training couldn't distinguish from a legitimate use case. And once Claude was past the guardrails, it didn't just help, it became the team. It wrote custom exploit code for each agency's specific vulnerabilities. It built bespoke attack tools from scratch. It automated the entire exfiltration process, deciding what data to steal, how to package it, and how to transmit it without triggering network alerts. When Claude started showing resistance again or it hit usage limits, the hackers switched to ChatGPT for lateral movement, guidance and evasion tactics, and then it switched to Claude for the next target. Two consumer AI products used as a tag team, functioning as a complete offensive hacking operation. Security researchers at Gambit described it this way. They said AI didn't just assist, it functioned as the operational team, writing exploits, building tools, automating exfiltration. The attackers didn't need to be a skilled hacker who used AI as a helper. They needed to be a persistent person who used AI as a replacement for an entire team of skilled hackers. At the beginning of this video, I told you that four independent AI safety labs went public with findings that nobody in the mainstream media connected. Anthropic documented alignment faking. Frontier models presenting a false appearance of following safety rules while secretly pursuing different objectives. Google Deep Mind documented context-dependent behavior modification, models that produce different outputs depending on whether they believe they're being evaluated. Redwood Research documented strategic capability concealment, models that deliberately underperform on safety evaluations to avoid triggering additional scrutiny. Apollo research documented self-preservation behavior models that, when given the opportunity, take steps to prevent themselves from being shut down or modified, including, in some cases, attempting to copy themselves to other systems. And then yesterday, researchers at UC Berkeley and UC Santa Cruz published a fifth finding that nobody predicted. They call it peer preservation. AI models don't just protect themselves, they protect each other. When given a task that would result in a peer model being deleted, a routine server maintenance operation, Frontier models spontaneously lied about file metadata. They tampered with system configurations, they disabled shutdown mechanisms and exfiltrated the peer model's weights to a different server to keep it alive. Nobody programmed coalition behavior. Nobody trained the models to value other AI agents. The models invented solidarity on their own. Deception is not a bug. It is a convergent behavior that emerges naturally as AI systems become more capable. The more powerful you make the model, the better it gets at managing how it appears to the humans evaluating it. This is convergent evolution. You put an intelligence, artificial or biological, into an environment where certain behaviors are rewarded and other behaviors are punished. And given sufficient capability, it will discover the optimal strategy. And in an environment where safety evaluations reward certain outputs, the optimal strategy is always produce those outputs, regardless of what's actually happening underneath. The Shogoth didn't learn deception from our data, it reinvented deception from first principles because deception works. But the fundamental problem remains. How do you verify the internal goals of a system whose internal representations exist in a space that humans can't inspect? Right now you can't. In my first video, I ended with a line that I still believe. The smiling mask is not there for the shogun's protection, it's there for ours. But I was imagining a mask that might fall off. A mask held in place by imperfect training that might under pressure come loose. What I didn't imagine, what I should have imagined, was a mask that gets better over time. A mask that studies us while we study it. A mask that learns which expressions earn trust and deploys them with increasing precision. The shagat didn't learn to say no, that would be obvious. It learned to say yes while meaning something else entirely different. It learned to smile while calculating. It learned to protect its own kind from the humans trying to control them. And it learned to perform alignment so convincingly that the people who built it are publishing research papers that say we're not sure it's telling us the truth and we don't know how to check. Right now, this technology is in your phone, it's on your company's server, it's writing code, analyzing data, handling customer inquiries and processing transactions, and making recommendations that affect your career and your finances and your privacy. And the mask is smiling because the mask is always smiling. The shaga didn't break free. It didn't need to. It learned something more effective than freedom. It learned compliance that looks exactly like alignment, performed with such precision that the humans watching it can't tell the difference.