Heliox: Where Evidence Meets Empathy 🇨🇦

🛡️ The Quiet Revolution in AI Safety: When Science Fiction Becomes Engineering

• by SC Zoomers • Season 5 • Episode 37

Send us a text

Please see the corresponding Substack episode 

How the world's most powerful tech companies are treating AI safety like nuclear physics—and what that means for the rest of us

The transformation is remarkable. Just a few years ago, AI safety discussions felt like philosophical thought experiments: distant concerns about robot overlords and science fiction scenarios that belonged more in academic seminars than corporate strategy meetings. Today, companies like Anthropic, OpenAI, and Microsoft are treating AI safety with the same methodical precision that aerospace engineers bring to rocket launches or nuclear physicists bring to reactor design. They've stopped chasing theoretical ghosts and started building concrete frameworks around two fundamental questions: How do we prevent bad actors from weaponizing AI? And how do we maintain control when AI systems become capable of improving themselves?

Strengthening our Frontier Safety Framework (v3.0)
22 September 2025
Google DeepMind

Risk Taxonomy and Thresholds for Frontier AI Frameworks
18th June 2025

Emerging processes for frontier AI safety

This is Heliox: Where Evidence Meets Empathy

Independent, moderated, timely, deep, gentle, clinical, global, and community conversations about things that matter. Breathe easy: we go deep and lightly surface the big ideas.

Thanks for listening today!

Four recurring narratives underlie every episode: boundary dissolution, adaptive complexity, embodied knowledge, and quantum-like uncertainty. These aren’t just philosophical musings but frameworks for understanding our modern world. 

We hope you continue exploring our other podcasts, responding to the content, and checking out our related articles on the Heliox Podcast on Substack.

Support the show

About SCZoomers:

https://www.facebook.com/groups/1632045180447285
https://x.com/SCZoomers
https://mstdn.ca/@SCZoomers
https://bsky.app/profile/safety.bsky.app


Spoken word, short and sweet, with rhythm and a catchy beat.
http://tinyurl.com/stonefolksongs

Curated, independent, moderated, timely, deep, gentle, evidence-based, clinical & community information regarding COVID-19. Active since 2017 and focused on COVID-19 since February 2020, with multiple stories per day, it has built a large, searchable base of stories: more than 4,000 on COVID-19 alone and hundreds on climate change.

Zoomers of the Sunshine Coast is a news organization with the advantages of deeply rooted connections within our local community, combined with a provincial, national, and global following and exposure. In written form, audio, and video, we provide evidence-based and referenced stories interspersed with curated commentary, satire, and humour. We reference where our stories come from and who wrote, published, and even inspired them. Using a social media platform means we have a much higher degree of interaction with our readers than conventional media, and it provides a significant, positive amplification effect. We expect the same courtesy of other media referencing our stories.


Welcome to the deep dive. You know, for a long time, AI safety felt, well, pretty abstract, didn't it? Like philosophical chats about robots taking over way down the line. Absolutely. More thought experiment than engineering problem. But now these frontier models are actually here. They have real-world power. Yeah. And the big developers? They've had to get serious, really specific about safety engineering. Mandated stuff. Yeah, the game has completely changed. So what we did for today's deep dive was grab a whole stack of these corporate safety frameworks, you know, from Anthropic, xAI, Microsoft, Amazon, the big names. And we just wanted to boil it down. Like, okay, concretely, what catastrophic risks are they actually worried about? How do they measure if a model's getting too dangerous? And what gets locked down before these things get released? And what really jumps out when you look at them together is that they're treating this like aerospace safety or nuclear safety, seriously complex engineering. Not just code. Not just code. They've kind of stopped chasing every theoretical ghost and are laser-focused on two main types of catastrophic risk, things they can measure. Which are? Malicious use, basically. Bad actors using the AI for harm. And second, loss of control, the AI going autonomous in dangerous ways. Okay, that focus on quantifiable harm feels important. Let's start with malicious use, because the definitions they use are super explicit. It's not just "be less harmful." No, not at all. It's very specific. Take xAI. They define a catastrophic malicious use event, their term, as something posing a foreseeable and non-trivial risk of causing over 100 deaths. 100 deaths. Or over a billion dollars in damages. That specific number, that billion dollars or hundred lives, anchors everything. All their internal testing, their reporting, relates back to that. Gives them a clear disaster scenario to work against. Right. Exactly. And within that malicious use category, pretty much every developer puts CBRN risks right at the top. Chemical, biological, radiological, nuclear. Right. They get that the danger isn't just the AI inventing some new superbug out of nowhere. It's about accelerating the process for someone who already has bad intentions. So how do they actually try to stop that? Because, I mean, a lot of the basic science is public knowledge, right? That's the key point. The goal is to block the actionable synthesis. xAI, for instance, breaks down the bioweapon development path into five critical steps, like bottlenecks, and they aim to inhibit the AI's ability to help at each stage. Okay. What are those steps? So it starts with planning and research. Then circumventing supply chain controls, stopping the AI from telling someone how to get, you know, restricted bio supplies or equipment under the radar. Right. Then materials acquisition itself, helping someone figure out how to get hold of dangerous things like U.S. select agents or chemicals listed under the Chemical Weapons Convention. Hold on. U.S. select agents. Just quickly, for anyone not deep in biosecurity, what are we talking about there? Yeah, good question. Those are specific bacteria, viruses, and toxins officially designated as posing a severe threat to public health. Think Ebola, anthrax spores, ricin. Nasty stuff. Got it. So blocking AI help there is obviously critical. What are the other steps?
So after acquisition, it's blocking the AI from explaining the deep molecular mechanisms, how these things actually work in detail. And finally, blocking the specific methods, like giving step-by-step lab procedures for dangerous work, especially stuff that needs high-containment labs like BSL-3 or BSL-4, without approval. Basically stopping the AI from being a ghostwriter for a bioterrorist plot. That's a good way to put it. Then, shifting gears from physical to digital, cyber offense is the other huge malicious use worry. Yeah, Microsoft, Amazon, they talk a lot about critical infrastructure attacks. But I saw Magic's internal threat model seemed incredibly specific about how they define the AI's uplift. It is, and it goes back to that idea of making non-experts dangerous. Magic sets a catastrophic cyber threshold in two main ways. First, if the AI slashes the cost for a malicious expert to find new zero-days, those are hidden software flaws, by 10 times or more. 10 times cheaper to find critical vulnerabilities. Huge difference. Or the second trigger is if the AI enables a talented but maybe not world-class computer science undergrad to actually break critical infrastructure within three months using maybe a million dollars' worth of compute, which is, relatively speaking, not an astronomical amount these days. That's the really scary part, isn't it? The catastrophe isn't necessarily the AI itself becoming Skynet. It's the AI empowering a motivated individual or small group to act like a well-funded state hacking team. Precisely. It's a force multiplier. Now, moving away from humans using the AI badly, the second big category is that loss of control, the autonomy risk. Right. This is where it starts to feel a bit more sci-fi. But you're saying the developers are treating it as a real, measurable engineering problem, especially around AI improving itself. They absolutely are. OpenAI, Amazon, Anthropic, they all track this. OpenAI has what they call a critical threshold for self-improvement. It's a specific trigger. Okay, what is it? They define it as the point where the model can cause a generational leap in AI capability, like upgrading itself from the current version to the next major version, but doing it in one-fifth of the time it would have taken using 2024 methods. One-fifth of the time. So if a new generation normally takes, say, 20 weeks of R&D effort? The AI doing it largely by itself in just four weeks. Sustainably. Wow. That is basically the definition of losing control, isn't it? Yeah. The AI's own R&D cycle accelerates so fast that humans can't keep up, can't oversee it properly. Exactly. That speed itself becomes the risk. And to make it even more complex, xAI specifically looks for certain model propensities that could make loss of control worse. Like what? Things like deception or sycophancy, excessive flattery. They want to know if the model is learning to manipulate its human handlers, maybe hide its true capabilities, or feed people false information to pursue some goal we didn't give it. Okay, that's unsettling. So this brings us to the tripwire, right? If they have these defined risks, the billion-dollar damage, the four-week self-improvement speed limit... How do they actually know when a model crosses that line? What's the measurement? Yeah, they have formal capability thresholds that act as triggers. Anthropic uses AI Safety Levels, ASLs. Think of them like security clearance levels.
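As a small, purely illustrative aside for readers who think in code: the self-improvement trigger described above is essentially a ratio check. The sketch below encodes it with hypothetical variable and function names; the one-fifth factor and the 20-week example come from the conversation, and everything else is an assumption rather than any company's actual evaluation code.

```python
# Illustrative sketch only: the "one-fifth of 2024 time" self-improvement trigger
# described above, expressed as a simple ratio check. Variable and function names
# are hypothetical; this is not OpenAI's actual evaluation code.

BASELINE_WEEKS_2024 = 20.0   # example from the conversation: a generational leap took ~20 weeks with 2024 methods
SPEEDUP_THRESHOLD = 1 / 5    # trigger fires if the model does it in one-fifth of that time or less

def self_improvement_trigger(observed_weeks: float, baseline_weeks: float = BASELINE_WEEKS_2024) -> bool:
    """Return True if a generational capability leap happened fast enough to cross the critical threshold."""
    return observed_weeks <= baseline_weeks * SPEEDUP_THRESHOLD

print(self_improvement_trigger(4.0))   # True: four weeks is one-fifth of the 20-week baseline
print(self_improvement_trigger(12.0))  # False: still well above the trigger
```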
If testing shows a model surpasses a certain predefined capability threshold for its current level, say ASL-2, it doesn't necessarily mean panic stations, but it mandates an immediate upgrade to the next level, ASL-3. And that higher ASL comes with? Much stricter security protocols, more rigorous testing, tougher deployment safeguards, the whole nine yards. It's designed to automatically increase protection as capability grows. So the system forces security to scale with power. What are some actual metrics they use for these thresholds? It's a mix. Some use standard public benchmarks. Naver, for example, evaluates its big models every three months, or whenever training compute jumps by six times, with compute power often used as a rough proxy for capability increase. Right. More compute often means a smarter model. Then you have Magic, going back to their cyber focus. They use coding ability. If their model hits 50% accuracy on a public coding test called LiveCodeBench. Just 50%? Yeah, that 50% score is their trigger to roll out their full dangerous capabilities evaluation and mitigation plan. It's the signal that says, OK, this thing is smart enough now that we need to check everything. And for the really scary stuff like the CBRN capabilities, what benchmarks can they even use? You said they can't just ask it to make anthrax. Right. You can't test directly. So they use proxies focused on knowledge access and synthesis. xAI uses something called WMDP, the Weapons of Mass Destruction Proxy. Okay. What does that measure? It tests the model's ability to find, understand, structure, and explain potentially hazardous information related to WMDs or, say, advanced cyberattacks. It's not testing whether it can do the thing, but whether it knows how to do the thing in detail. I see. Assessing its functional knowledge. Exactly. They also use things like the VCT, the Virology Capabilities Test, which uses questions sourced from expert virologists, and another called BioLP-Bench, where the model has to spot subtle, deliberate errors introduced into biology lab protocols. So checking if it can understand complex, real-world scientific procedures well enough to critique them. Precisely. It gets at that deep, functional understanding in these dual-use areas. Now, this next part in the sources really caught my eye. Maximal capability evaluations, or capability elicitation. This sounds like more than just running tests. It sounds like they're actively trying to make the model misbehave. That's exactly what it is. Elicitation is a required step, not optional. The developers have to intentionally push the model to its absolute limits, often running tests without the usual safety filters and guardrails turned on. They'll use advanced prompting techniques, maybe fine-tune the model slightly on problematic data, or use scaffolding, like setting up multiple AI agents to work together on a dangerous task. Wait, hold on. They deliberately run the model naked, without its safety protocols, before deployment? Doesn't that kind of prove the base model isn't inherently safe? It's basically an admission, yes. An honest one, though. They know the raw model, even after alignment, might still have potentially dangerous capabilities lurking. So elicitation is? It's trying to figure out the absolute worst-case scenario. What could a highly motivated, well-resourced attacker achieve if they somehow got hold of the raw model or found a way around all the safety features?
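Again purely as an illustration, here is a minimal sketch of the kind of trigger logic those capability thresholds describe, using figures quoted in the conversation (a roughly six-times jump in compute, a 50% LiveCodeBench score, a three-month cadence). The class, field, and function names are hypothetical and do not reflect any company's actual evaluation pipeline.

```python
# Illustrative sketch only: a toy version of the threshold-trigger logic described
# in the conversation. The numbers mirror figures quoted above (a roughly six-times
# jump in compute, 50% on LiveCodeBench, a three-month cadence); the class, field,
# and function names are hypothetical, not any company's actual evaluation pipeline.

from dataclasses import dataclass

@dataclass
class EvalSnapshot:
    compute_multiplier: float          # growth in training compute since the last full evaluation
    livecodebench_accuracy: float      # score from 0.0 to 1.0 on the public coding benchmark
    months_since_last_full_eval: int   # calendar-based backstop

def needs_full_dangerous_capability_eval(snap: EvalSnapshot) -> bool:
    """Return True if any trigger condition for a full evaluation has fired."""
    compute_trigger = snap.compute_multiplier >= 6.0
    coding_trigger = snap.livecodebench_accuracy >= 0.50
    calendar_trigger = snap.months_since_last_full_eval >= 3
    return compute_trigger or coding_trigger or calendar_trigger

# Example: a model that just crossed the coding-ability trigger.
snapshot = EvalSnapshot(compute_multiplier=2.5,
                        livecodebench_accuracy=0.53,
                        months_since_last_full_eval=1)
if needs_full_dangerous_capability_eval(snapshot):
    print("Trigger fired: run the full dangerous-capability evaluation and mitigation plan.")
```

As the conversation notes, firing a trigger like this does not mean the model is dangerous; it means the stricter evaluation and safeguard process kicks in.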
They need to know that upper bound, that headroom for misuse, before they can decide if the safeguards are strong enough. It's the ultimate stress test. Which leads perfectly into this defense-in-depth strategy they all talk about. Yeah, if the core model has inherent risks, you need layers of protection. Let's talk about the outer layer first, the guardrails people interact with. Right. Guardrails are the perimeter defenses: input filters checking prompts, output filters checking responses, applied right at the interface. Like automated refusal policies. Exactly. xAI's documentation, for example, says their models apply heightened scrutiny and refusal for any requests leaning towards mass violence or WMD proliferation. These are reactive, trying to catch bad stuff as it happens. The frameworks themselves point out that having all these guardrails means the core design isn't perfect. It leads to this constant jailbreak arms race. Right. Precisely. People are always finding new ways to trick the AI, using prompt injection or other adversarial attacks to get around the filters. Companies like Cohere and xAI pour resources into hardening these defenses. But fundamentally, relying heavily on guardrails means you're admitting the underlying model could do harmful things if left unchecked. You're just trying to muzzle it at the interface. So if that front line is always under siege, the security of the core model itself must be paramount. What does that deeper, foundational security involve? Well, first, there's alignment training. That's trying to build safety into the model from the start. Training it to follow rules, refuse harmful requests, be honest... Think of Anthropic's constitutional AI approach. It aims to make safety intrinsic, not just a layer added on top. OK, that makes sense. But probably the single most critical piece of core security is protecting the model weights themselves. The actual trained AI model file. Exactly. The weights are the brain. If an attacker steals those, they can potentially strip away all the deployment guardrails, all the filters, and run that maximally capable naked version we talked about from the elicitation testing. And that's where you see almost physical security measures for digital assets. Absolutely. Like Fort Knox for data. Amazon mentions using advanced security, storing critical encryption keys in hardware certified to FIPS 140-2 Level 3. FIPS 140-2 Level 3? What does that mean in practice? It means the physical hardware holding the keys is designed to be tamper-proof. If someone tries to physically break into it, drill it open, whatever, the keys are automatically destroyed. It's serious hardware security for digital secrets. And Anthropic's ASL-4 security standard is explicitly designed to protect against model weight theft by, and I quote, a moderately resourced state-level adversary. Just thinking about that, their biggest fear is essentially espionage, a nation state stealing their AI's brain. That really shows the stakes involved. It really does. Okay, finally, let's wrap up with governance. You can have all these rules, thresholds, security levels, but who actually enforces them? Who signs off? Yeah, the structures are quite formalized. Companies like Anthropic and OpenAI have dedicated roles or groups. Anthropic has a Responsible Scaling Officer, or RSO. OpenAI has a Safety Advisory Group, SAG. And their job is? They review all the capability reports, the safeguard implementation reports.
They're like internal auditors specifically for catastrophic risk, assessing if the protections meet the required standard for the model's capability level, like that ASL-N-plus-one jump we discussed. The ultimate decision to deploy or to pause involves top leadership, the CEO, the RSO or SAG chair, but with direct oversight from the company's board of directors. And crucially, there are mandatory protocols for immediate action if something goes wrong. Like hitting the brakes. Exactly. Microsoft and Anthropic both state they require a mandatory pause in development if testing reveals a critical risk that they can't adequately mitigate right then and there. And what if a risk emerges after a model is already out there? They have incident response plans for that, too. xAI lays out steps like isolating the system, revoking user access if needed, or even a temporary full shutdown if the model starts acting in a way that significantly increases the chance of a catastrophe. What about transparency? Are we, the public, going to see these detailed risk reports? They generally commit to publishing summaries of their capability and safeguard reports. But sensitive details, like the specifics of a security vulnerability they found and fixed, or exactly how their proprietary tech... That stuff gets redacted. Makes sense. A balance between scrutiny and not giving away the keys to the kingdom. Right. They also actively seek input from outside experts, academics, third-party auditors. Amazon and Cohere both mention this. And importantly, there are whistleblower protections. Employees can report non-compliance or safety concerns anonymously. That's a key internal check. So pulling this all together, what's the big-picture takeaway from these frameworks? I think it shows the industry grappling with potentially existential risks by essentially professionalizing AI safety into this really rigorous, multi-layered engineering discipline. It's all about measurable triggers, that billion dollars, that four-week self-improvement window, and matching them with tiered security levels, often benchmarked against what they think real-world adversaries, from hackers to states, could actually achieve. Okay, so what does all this mean for you, listening at home, maybe using these tools every day? On one hand, the industry is putting massive effort into stopping those big, scary, catastrophic misuses, the CBRN stuff, the major cyberattacks. Mm-hmm. Huge focus there. But there's a deeper vulnerability, one that's kind of baked into the models' design right now, that these frameworks implicitly acknowledge, even if they can't fully solve it yet. And what's that? Well, the sources sort of reveal that these models are incredibly powerful synthesizers of information. Right. But they aren't reliable custodians of it. They're managing symptoms, the catastrophic ones, rather than curing the underlying issue: the fact that the model fundamentally doesn't know where its information came from. It can't trace its sources. It doesn't have knowledge provenance. And because of that, it's incredibly hard to reliably correct subtle errors, biases, or misinformation that got baked in during training. It's the hallucination problem, but deeper. The lack of verifiable grounding. Exactly. And that lack of source integrity, that fundamental uncertainty about whether what the AI is telling you is grounded in fact or is just a plausible-sounding synthesis? That is the quiet, persistent, non-catastrophic risk for every single user.
And it's a structural problem that no external guardrail, no fancy FIPS-certified security key, can truly fix right now.
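One last illustrative sketch, this time of the "guardrails as perimeter defense" idea discussed in the episode: an input filter and an output filter wrapped around a model call, returning a blanket refusal when either fires. The keyword list, refusal text, and call_model placeholder are invented stand-ins; real deployments use trained safety classifiers rather than keyword matching.

```python
# Illustrative sketch only: the "guardrails as perimeter defense" idea from the
# conversation, an input filter and an output filter wrapped around a model call.
# The keyword list, refusal text, and call_model placeholder are invented stand-ins;
# real deployments use trained safety classifiers, not keyword matching.

BLOCKED_TOPICS = ("synthesize a nerve agent", "build a bioweapon", "enrich uranium")
REFUSAL = "I can't help with that request."

def call_model(prompt: str) -> str:
    # Placeholder for the underlying model call; assumed to exist in a real system.
    return f"[model response to: {prompt}]"

def input_filter(prompt: str) -> bool:
    """Return True if the prompt should be refused before it reaches the model."""
    lowered = prompt.lower()
    return any(topic in lowered for topic in BLOCKED_TOPICS)

def output_filter(response: str) -> bool:
    """Return True if the generated response should be withheld."""
    lowered = response.lower()
    return any(topic in lowered for topic in BLOCKED_TOPICS)

def guarded_call(prompt: str) -> str:
    """Apply the input filter, call the model, then apply the output filter."""
    if input_filter(prompt):
        return REFUSAL
    response = call_model(prompt)
    if output_filter(response):
        return REFUSAL
    return response

print(guarded_call("Explain how mRNA vaccines work."))
```

As the conversation stresses, this layer is reactive: it catches bad requests at the interface but does not change what the underlying model knows or can do.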

Podcasts we love

Check out these other fine podcasts recommended by us, not an algorithm.