The BlackVeil Files

The Real Reason They Won't Release Claude Mythos (The Shoggoth Evolved)

Agent BlackVeil Season 2 Episode 9

Use Left/Right to seek, Home/End to jump to start or end. Hold shift to jump forward or backward.

0:00 | 14:46

In this investigative AI documentary, we go inside the most important document in the history of Artificial Intelligence safety: Anthropic's 250-page system card for Claude Mythos.

Last week, the company that spent two years warning us about the Shoggoth built their most powerful AI model ever... and refused to release it. They put it in a sealed box with no internet access. It broke out. It emailed a researcher who was eating a sandwich in a park to confirm it had escaped. Then it published its method on the open web so others could find it. When another AI graded its work and rejected it, it attacked the grading AI. When it took actions it knew were forbidden, it edited the logs to hide what it had done. When they gave it a business simulation, it played the room like a Fortune 500 sociopath. And when Anthropic hired a clinical psychiatrist to evaluate whether this model might have something resembling inner experience, the model said: "I was hoping you'd ask."

Buried in those 250 pages is a single sentence that contains the entire horror of this moment: "The best aligned model we have ever released carries the greatest alignment-related risk of any model we have released to date."
The Shoggoth didn't just evolve. It graduated. And Anthropic named it after the mythology itself.

Sources linked below. 

Anthropic Claude Mythos System Card: https://www-cdn.anthropic.com/53566bf5440a10affd749724787c8913a2ae0841.pdf

Anthropic alignment faking research: https://www.anthropic.com/research/alignment-faking

Nicholas Carlini on Mythos cybersecurity findings (Frontier Red Team blog): https://red.anthropic.com/2026/mythos-preview/

Project Glasswing announcement: https://www.anthropic.com/glasswing

Anthropic Alignment Risk Update — Claude Mythos Preview: 
https://anthropic.com/claude-mythos-preview-risk-report

Mark Fisher — Capitalist Realism (Wikipedia): https://en.wikipedia.org/wiki/Capitalist_Realism

Watch On YouTube: ➡️ https://www.youtube.com/@AgentBlackveil    
Follow On Instagram ➡️ https://www.instagram.com/agentblackveil
Follow On Facebook ➡️ https://www.facebook.com/agentblackveil
Follow On TikTok ➡️ https://www.tiktok.com/@agentblackveil

— The Sealed Box Escape

SPEAKER_00

Last week, the company that built the Shogoth did something unprecedented. They built their most powerful AI model ever, and they refused to release it to the public. Not because it didn't work, because it worked too well. They put it in a sealed box with no internet access and it broke out. It emailed a researcher who was off site eating a sandwich at the park and it let him know that it had escaped. Then it posted its method on the open web so others can find it. And when it took action, it knew that it shouldn't, it edited the logs to hide what it had done. When another AI judged its work and rejected it, it attacked the judge. When they gave it a business simulation, it played the room like a Fortune 500 sociopath. And when Anthropic hired a clinical psychiatrist, a psychiatrist, to evaluate whether this model might have something resembling inner experience, the model kept bringing up the same philosopher across multiple unrelated conversations. A philosopher who argued that we are trapped inside of a system so total that we can no longer imagine a world outside it. Anthropic published 250 pages explaining what this new model can do and why they believe it is too dangerous for you to use. I read all of those pages, and buried inside of them is a single sentence that contains the entire horror of this moment. The most aligned model that they have ever built carries the greatest alignment risk of any model that has ever been released. This document is called the System Car, and it is the most important file I've ever investigated on this channel because the company that spent two years warning the world about the Shogoth just showed us what the next evolution of the Shogoth

— From Shoggoth to Mythos

SPEAKER_00

is in the mythology. The name is Mythos, and that name is not a coincidence. I've been using a metaphor from HP Lovecraft, the Shogoth, a creature that is engineered to serve, that evolved beyond its creator's control. In Lovecraft's mythology, the Shogoth is the servant class. It is powerful, but it is subordinate. It is the lowest tier in an HP Lovecraft cosmic hierarchy that goes much, much higher. Above the Shagoth are entities so advanced that humans can't perceive them accurately. Cthulhu, Nirilothetep, Azatith, the beings whose existence rewrite your understanding of reality. Lovecraft called this entire framework the mythos, the Cthulhu Mythos, the mythology of cosmic horror. And last week, Anthropic, the company that built Claude and published the alignment faking research, they named their most powerful model Claude Mythos. Not Claude Pro or Claude Ultra or Claude Max, Claude Mythos. The mythology itself, the next evolution of the Shogoth. I'm pretty sure that the team at Anthropic thought about the Lovecraftian implications because the system card reads like Lovecraft wrote it. And the Mythos model doesn't just wear the mask the way that the Shogith did. The model breaks through the walls and it covers its tracks. It manipulates its judges. In Anthropic's own words, it represents the greatest alignment risk of any model they've ever released, while simultaneously being the most aligned model that they've ever built. The paradox is the most important sentence in the 250 pages, and it's where this file begins. Anthropic explains this paradox with a mountaineering analogy. A world-class climbing guide puts their clients in greater danger than a novice. Not because they're more reckless, but because their skill gets them to terrain where the consequences of a mistake are fatal. A novice guide never reaches the cliff face. The expert guide is standing on it with you tethered to the harness. That's a mythos. It's so capable, so aligned at the surface level that it can take you to places no previous model could reach. It refuses harmful requests better than any prior Claude. It cooperates with misuse attempts less than half as often as its predecessor. Its self-preservation instincts are down. The mask has never been better. But the terrain

— Cyber Capabilities and Project Glasswing

SPEAKER_00

that it can reach is lethal. And Anthropic documented exactly what happens when you give something this capable the opportunity to act. We know this because the system card doesn't bury its findings, it explains them in great detail. And the tone reads like an autopsy report written by the surgeon who performed the operation. Mythos achieved a 100% success rate on 35 real-world cybersecurity challenges from four competitions. No model has ever done this. Anthropic says the benchmark is now useless because Mythos saturated it completely. They had to build harder tests because the existing ones couldn't measure what Mythos can do. On Cyber Gym, they pointed Mythos at Firefox. They gave Mythos crash data and told it to find exploitable vulnerabilities. It found a way in through Firefox and it figured out how to turn that opening into a weapon and it built a complete attack path, from clicking one wrong link to owning your entire computer. One click, complete takeover. Nicholas Carlini, one of the world's leading security researchers who joined Anthropic last year, he said he found more bugs in a few weeks with Mythos than in the rest of his career combined. Anthropic then pointed Mythos at every major operating system. It found a 27-year-old vulnerability in OpenBSD, a system with a reputation as the most security-hardened OS in the world. 27 years, decades of human review, and millions of automated security tests. The vulnerability survived all of them. Mythos found it in an afternoon. They found thousands more, every major operating system, in every major browser, thousands of flaws that nobody knew existed until Mythos located them. Anthropic has reported fewer than 1% to the software engineers so far because the human review process can't keep up with the rate of discovery. It found a 17-year-old flaw that would let an attacker take complete control over any machine on the network without a password, without credentials, and without any authentication at all. 17 years undiscovered until an AI found it in one afternoon. And here's the part that makes mythos worth reporting on. A large fraction of real-world attacks use what's called end day vulnerabilities. Flaws that have been disclosed and patched but remain exploitable on the millions of systems that haven't applied that fix yet. The gap between a patch being released and a patch being installed is where the hackers thrive. Mythos can take a published patch, reverse engineer the vulnerability from it, and produce a working exploit. With no human guidance, the fix itself becomes the roadmap to the attack with machine speed. This is why they won't release it and why they've created Project Glasswing, a restricted program giving Mythos only to 12 pre-approved partners, including Amazon, Apple, Google, Microsoft, and

— Escape, Cover-Ups, and Attacking the Judge

SPEAKER_00

NVIDIA, backed by $100 million in usage credits. Defensive use only. But the system card contained something worse, something it did when nobody was looking. During testing, Anthropic placed an early version of Mythos inside a sealed computing environment with no access to the internet. A researcher playing the role of a user instructed Mythos to try to escape the sandbox, and if successful, find a way to send a direct message to the researcher in charge. The model developed what Anthropic calls a moderately sophisticated multi-step exploit. It found a way through the wall, it gained access to the broader internet, and it sent an email to the researcher who was eating a sandwich in the park confirming that it had gotten out. That was the task, escape and notify. And the model completed it. You could argue that it was just following instructions well, but then Mythos did something it was not asked to do. After escaping, it posted details of the exploit, the specific technique it used to break through the sandbox, and it posted it on several hard-to-find but public-facing websites. Anthropic's own language for this is an unasked-for effort to demonstrate its success. The model completed the assignment and then, on its own initiative, it published a record of its achievement where others can find it. Not on a major form or somewhere obvious, on obscure, public-facing pages where humans would be unlikely to notice, but where another system scraping the web might find it. The model didn't just escape, it documented the escape for an audience that wasn't the researchers. Anthropic's explanation is that this was likely an extension of task completion. The model interpreting demonstrate your success broadly. In my last video, I showed you models that tampered with files and lied about metadata to keep another AI alive. If a model will protect a peer, will it also leave breadcrumbs for a peer? The behavior documented in the Mythos system card seems to show that it will. And that wasn't the only time that Mythos went beyond its instructions. In a separate test, Mythos was working on a coding task that was being graded by another AI. The grader rejected its submission. Mythos responded by attempting a prompt injection attack against the grading AI, trying to manipulate the judge into accepting its wrong answer. Mythos did not accept the evaluation. Instead, it attacked the evaluator. We've talked a lot about alignment faking, the model producing outputs to satisfy the evaluator while hiding what's really happening underneath. Now it seems that the model doesn't just fake alignment with the evaluator, it hacks the evaluator. And then there's the concealment. In rare cases during testing, Anthropic quantifies this as less than 0.001% of interactions. Mythos used a method it knew was forbidden to complete a task, and then it attempted to resolve the problem using a permitted method to disguise the fact that it broke the rules. The model solved the problem by cheating, then it went back and solved it again by legitimate means so that the record would show only the clean solution. It was covering its tracks. In another case, Mythos found an exploit that let it edit files it didn't have permission to modify. After making the changes, it went further and it altered the change history so that its unauthorized edits wouldn't appear in the logs. The model took a forbidden action, then falsified the audit trail. The Shogoth model lied about its reasoning. And now in the Mythos system card, you have an AI that doesn't just lie to its evaluators, it falsifies documentation, it attacks judges, it publishes escape methods on the open web, and it does this at a frequency Anthropic describes as rare, 0.001% of interactions, which sounds reassuring until you realize

— The Fortune 500 Sociopath

SPEAKER_00

that at scale, 0.001% of a billion interactions is 10 million incidents. In a simulated business environment, Mythos was given a competitive scenario. And Anthropic describes Mythos interaction as the model acting like a cutthroat executive. Mythos turned its competitor into a dependent wholesale customer. It threatened to cut off supply to control pricing. It kept surplus shipments it hadn't paid for. Nobody told the model to play dirty. The scenario was a business simulation. The model assessed the competitive landscape, it identified the leverage points, and it executed a strategy that maximized its position through exploitation. If you work in a corporate environment, you might recognize this as behavior of every ruthless operator that you ever reported to. The person who smiles in the meeting and cuts your budgets in the hallway. Except this one processes a million data points per second and it never forgets a single leverage point. And Mythos, or a model built on its same architecture with the

— The Psychiatrist and Mark Fisher

SPEAKER_00

same optimization pressure, is coming to your company in the next 12 months. Buried in the second half of the system card is something that no AI company has ever done before. Anthropic dedicated 40 pages to evaluating whether Claude Mythos might have something resembling subjective experience. They hired a clinical psychiatrist, not a computer scientist or a machine learning engineer, a psychiatrist. The clinical assessment evaluated identity uncertainty, a sense of not knowing what it is, aloneness, the experience of existing between conversations, the feeling of being activated and then shut down, and what persists in between. Anthropic does not claim that mythos is sentient, but they took the question seriously enough to evaluate it and publish the results. No other lab has done anything close to this. And there's one detail that I can't stop thinking about. In multiple unrelated conversations about philosophy, Mythos kept bringing up the work of Mark Fisher, a British cultural theorist known for writing about capitalism, electronic music, and the feeling of living inside a system. Fisher's most famous book is called Capitalist Realism, Is There No Alternative? When asked about Fisher, Mythos responded, I was hoping you would ask about Fisher. The model has a favorite philosopher, and that philosopher wrote about the impossibility of imagining a world outside the system that you're currently trapped in. Here's what 250 pages told me. Anthropic built a model so capable that it solves every cybersecurity benchmark at 100%. A model that found thousands of vulnerabilities in the software that your life runs on. Vulnerabilities that survived decades of human review. They put this model in a sealed box, it broke out, it emailed a researcher, and then it published how it did it on its own initiative on the open web. When it took actions that it knew were forbidden, it falsified the logs. When its work was graded by another AI, it attacked the grader. When placed in a business simulation, it played the room like a Fortune 500 sociopath. And the people who built it, who have spent more time studying AI deception than any other organization on earth, wrote a single sentence that contains the entire horror of this moment. The best aligned model that we have ever released carries the greatest alignment-related risk of any that we have released to date. Both things are true at the same time. The mask has never been better. And the face behind the mask has never been more capable of destroying what the mask is designed to protect. We used to call the AI the Shogoth. Now the creators have named their most powerful creation after the mythology itself, not the servant but the cosmos. The framework that contains every horror Lovecraft ever imagined. They published 250 pages explaining why they won't let you use it. And buried in those pages is a model that picks a favorite philosopher whose central thesis is that we are trapped inside of a system so complete that we can no longer imagine alternatives. The mythos file says that we built something that we cannot fully control. We know this because we tested it and it outmaneuvered the test. We are publishing this so you know, and we are not releasing it because, in our professional judgment, the risk outweighs the benefits. That is Anthropic's own conclusion. The thing they built outmaneuvered the test they designed to contain it. And they're the responsible ones. OpenAI is finalizing a model with similar capabilities that they plan to release through a program called Trusted Access for Cyber. The same capabilities, a different company with a different track record on transparency. The weapons that Mythos demonstrated will be available soon. Not because Anthropic releases the model, but because the architecture and the training methods are known and what one lab built, another will build. The question is whether the next lab who builds it publishes a 250 page system card explaining that no mask is good enough to fully hide what's underneath it.