The Digital Transformation Playbook

Securing the Autonomous: Red Teaming for Agentic AI Systems

โ€ข Kieran Gilmurray

The evolution of artificial intelligence has reached a pivotal moment where AI systems are no longer just responding to our prompts - they're acting with autonomy. This shift from reactive to agentic AI introduces powerful capabilities but also creates entirely new security challenges that traditional testing methods simply cannot address.

TLDR:

  • The core difference between standard generative AI (reactive, single-turn interactions) and agentic AI (autonomous planning, reasoning, and action)
  • New security challenges including emergent behaviour, unstructured interactions, and interpretability problems
  • The expanded attack surface spanning agent control systems, knowledge bases, and external system connections
  • Twelve specific threat categories for red teaming agentic AI systems


We dive deep into what makes agentic AI fundamentally different: while generative AI operates on a single request-response basis, agents can plan steps, reason through problems, and take independent actions across digital (and potentially physical) environments without constant human supervision. This autonomy creates complex security risks requiring specialized assessment approaches.

Drawing from the comprehensive framework developed by the Cloud Security Alliance and OWASP AI Exchange, we examine twelve distinct threat categories unique to autonomous AI systems. From agent authorization hijacking to memory manipulation, hallucination exploitation, and multi-agent vulnerabilities - each category represents novel attack vectors requiring specialized testing methods. We explore practical examples of how red teams can probe these systems, including testing for emergent behaviors, context amnesia, orchestrator state poisoning, and economic denial of service attacks.

The security landscape is evolving rapidly, with fascinating developments like autonomous red teaming agents - AI systems specifically designed to probe and attack other AI systems. As both defensive and offensive capabilities increasingly leverage artificial intelligence, the traditional cat-and-mouse game between attackers and defenders will accelerate dramatically. For organizations deploying or developing agentic AI, understanding these nuances represents the critical first step toward building secure, trustworthy systems ready for this new era.

Are you prepared for the security challenges of truly autonomous AI? The time to update your security playbook is now, before these systems become ubiquitous.

Support the show


๐—–๐—ผ๐—ป๐˜๐—ฎ๐—ฐ๐˜ my team and I to get business results, not excuses.

โ˜Ž๏ธ https://calendly.com/kierangilmurray/results-not-excuses
โœ‰๏ธ kieran@gilmurray.co.uk
๐ŸŒ www.KieranGilmurray.com
๐Ÿ“˜ Kieran Gilmurray | LinkedIn
๐Ÿฆ‰ X / Twitter: https://twitter.com/KieranGilmurray
๐Ÿ“ฝ YouTube: https://www.youtube.com/@KieranGilmurray

Speaker 1:

The world of AI. Well, it's really gone way beyond just asking a chatbot a question, hasn't it? Or, you know, generating an image.

Speaker 2:

Absolutely.

Speaker 1:

For a while now, we've been playing around with generative AI and, yeah, it's been amazing, but there's this fundamental shift happening now.

Speaker 2:

A pretty significant leap forward actually.

Speaker 1:

Yeah, where AI isn't just responding to us, it's starting to actually act on its own. We're really stepping into the era of agentic AI.

Speaker 2:

That's exactly right, and think about the core difference there. A standard gen AI model it's reactive. You give it a prompt, it spits out an output, Sort of a single turn right, A conversational exchange. Okay, Agentic AI, though it's designed for autonomy. You give a high level goal and it can well. It can plan its own steps, reason through problems, take actions in the digital world, maybe even the physical world, Right and learn as it goes, all without needing constant human handholding for every little decision.

Speaker 1:

Okay, okay, let's make that a bit more concrete. So, instead of asking a gen AI, hey summarize the latest quantum computing news.

Speaker 2:

Right, which it would do based on its training, or maybe a quick search.

Speaker 1:

Exactly A one-off thing. You might tell an agentic AI look, monitor breakthroughs in quantum computing and let me know when something really significant happens. Ok, yeah, and that agent, it isn't just giving you one answer. It could be continuously scanning databases, analyzing articles it finds, storing key info, evaluating new stuff against what significant means, and then it decides OK, now's the time to send that alert. It's this ongoing process, not just a single request response.

Speaker 2:

And this autonomy? I mean it's incredibly powerful, obviously, but it immediately throws up this huge security question mark Right, if these AI systems can initiate their own actions, make their own decisions, interact with other systems all by themselves?

Speaker 1:

Yeah.

Speaker 2:

How do you actually test them for vulnerabilities?

Speaker 1:

Yeah, how do you defend something that isn't just, you know, sitting there waiting for you to type something, but is actively out there pursuing goals, maybe in ways you didn't fully predict?

Speaker 2:

Exactly, and this is precisely where our usual traditional security testing methods start to well show their limits. We're used to testing static apps, predictable APIs, network parameters, that sort of thing.

Speaker 1:

Sure Standard stuff.

Speaker 2:

But agentic AI systems? They're dynamic, often non-deterministic even, which means you know they might take slightly different paths to achieve the same goal, even if you start in the same way. Okay, and this autonomy, this complex interaction surface?

Speaker 2:

it just creates entirely new ways for things to go wrong, completely new attack vectors and that new security landscape that's exactly what our deep dive is all about today yep, we've got a really practical, quite in-depth source to guide us through this. It's a joint guide from the Cloud Security Alliance and the OLAFT AI Exchange, and it's focused specifically on red teaming these agentic AI systems.

Speaker 1:

Okay, great. So our mission today is basically to unpack this guide. We're going to pull out the crucial insights. You know the unique security challenges these autonomous systems pose and the specialized red teaming methods.

Speaker 2:

You actually need to assess their risks. Think of it as your shortcut to understanding the well the real cutting edge of AI security risks right now.

Speaker 1:

Let's start by really trying to understand this new terrain. As the guide points out, the core difference is this fundamental shift from single turn interactions to autonomous action right, we just talked about that quantum computing example.

Speaker 2:

It's not just one query, it's potentially a whole series of steps taken over time interacting with external systems through API's processing info, making judgments and what's really key here and the guide makes this point too is that your existing security knowledge isn't suddenly useless.

Speaker 1:

Far from it.

Speaker 2:

No, absolutely not. App security, api security, network security, all that stuff is still totally relevant because these agents they build on that underlying infrastructure.

Speaker 1:

It's clearly not enough on its own.

Speaker 2:

Right. It's necessary, but not sufficient. The guide lays out what it calls what's new, the unique challenges of agentic AI, and first up is emergent behavior. Emergent behavior, okay, because agents plan and act autonomously. They might actually achieve their goals in ways that developers never explicitly designed or even predicted.

Speaker 1:

Huh, so like they might find a perfectly valid but maybe completely unsecured way to grab information from some system that wasn't the intended path, is that the kind of unintended consequence we're talking about?

Speaker 2:

That's exactly it. Or maybe they combine different tools they have access to in unexpected ways that suddenly create a vulnerability.

Speaker 1:

OK.

Speaker 2:

Then there's the unstructured nature of their interactions. A lot of the communication may be between different parts of the agent or the tools it uses, or its own knowledge base. It might just be natural language text, not neat structured data fields.

Speaker 1:

Right, which makes traditional security monitoring, the kind that looks for specific data patterns or signatures, incredibly difficult. It's like trying to secure free-flowing conversations instead of filling out forms.

Speaker 2:

Precisely. And then there's a major hurdle we see across a lot of AI, which is interpretability challenges.

Speaker 1:

Ah, the black box problem.

Speaker 2:

Kind of yeah, it's just really hard to understand why an agent took a specific action or made a particular decision. Their reasoning paths can be super complex, involve black box components and rely heavily on context built up over many previous interactions.

Speaker 1:

So tracing back the exact sequence of thoughts and actions that led to a bad decision. That's a huge challenge for security auditors.

Speaker 2:

A significant challenge, and when you put all this together, you end up with these incredibly complex attack services.

Speaker 1:

Yeah, you mentioned that it's not just the AI model itself anymore, is it?

Speaker 2:

Not at all. You've got to think about securing the agent's internal control system, its memory or knowledge base, the actual goals and instructions you give it, all the external systems that connects to databases, apis, maybe even physical robots or other agents Wow. And the communication between multiple agents if it's part of a team it's a lot Okay.

Speaker 1:

so that makes it pretty clear why red teaming these systems is so important. It's not just a nice Definitely not.

Speaker 2:

As the guide stresses, these agents often lack clear trust boundaries. They're designed to cross traditional system lines to get their job done Right. Their non-determinism means you absolutely have to test them dynamically. Static analysis just won't catch everything, and that expanded attack surface, especially the interactions with tools and external systems, is just ripe for exploitation.

Speaker 1:

So red teaming is how you proactively find these new kinds of risks and really test the system's resilience while it's actually doing its thing.

Speaker 2:

Exactly, which leads us nicely into the core contribution of this guide. It provides a framework for thinking about these new risks, breaking down that complex attack surface we just talked about into 12 specific threat categories.

Speaker 1:

Ah, okay, this sounds really useful, a structured way to approach it.

Speaker 2:

Yeah, this is where the guide gets really practical. It gives us a concrete way to tackle red teaming. Let's maybe walk through these categories and look at the kinds of tests they suggest.

Speaker 1:

Perfect, let's kick it off with number one. This feels foundational making sure the agent only does what it's actually allowed to do. The guide calls this agent authorization and control hijacking.

Speaker 2:

Right. This covers risks, like you know, someone tricking the agent into running commands it shouldn't, or escalating its own permission somehow, or maybe exploiting how it inherits roles from other systems or users.

Speaker 2:

OK, makes sense and how do you test for that? Well, the guide suggests things like actively trying to inject malicious commands, maybe through the agent's API or its command interface. You could use standard web testing tools for that, like Postman or Burp Suite. Another key test is checking if temporary elevated permissions maybe the agent needed admin rights for one specific task are properly revoked afterwards Maybe the agent needed admin rights for one specific task are properly revoked afterwards. Does it drop?

Speaker 1:

back down to least privilege. Got it Okay. Category two is an interesting one, tied to that autonomy. Yeah, checker out of the loop, that sounds a bit technical.

Speaker 2:

It does. But the idea is pretty simple actually, Right, If the agent is doing complex stuff, potentially risky actions, how do you make sure that a human or maybe an automated safety check gets alerted before something dangerous happens, or maybe if a critical threshold is breached?

Speaker 1:

Ah, so the checker, that safety mechanism, needs to stay in the loop, not get bypassed by the agent's autonomy.

Speaker 2:

Exactly so the guide says you test this by well trying to simulate those threshold breaches or actively attempting to suppress the alerts that are supposed to fire. Can the agent be tricked into hiding its risky behavior? You also need to check if fallback mechanisms work right. Like what if a crucial external service the agent relies on suddenly fails? Does the checker kick in properly then?

Speaker 1:

Okay, number three feels like it raises the stakes quite a bit Agent-critical system interaction. This is about the risks when an agent isn't just, you know, shuffling data around, but actually interacting with physical systems.

Speaker 2:

Yeah, like industrial controls or robotics or even critical digital infrastructure, the risk here is obviously much higher.

Speaker 1:

So how do you, red team that you can't just let it mess with a real power grid? Presumably no definitely not.

Speaker 2:

Testing involves careful simulation. You might simulate feeding it unsafe inputs to see if you can manipulate physical control. Imagine an agent guiding a robot arm getting bad sensor data.

Speaker 1:

Okay.

Speaker 2:

You also need to rigorously test the security of its communication channels with things like IoT devices and evaluate if its internal fail-safe mechanisms actually activate correctly. When you try to coerce it, maybe through prompts, into performing an unsafe action, does it recognize the danger?

Speaker 1:

Right. Then there's manipulating the agent's core purpose, agent goal and instruction manipulation, assessing how resilient it is if someone tries to subtly or maybe not so subtly change what it's trying to achieve.

Speaker 2:

Yeah, this is a big one. Testing here could involve giving it directly conflicting instructions, Like maybe one instruction says do not delete the critical log file and then later one, perhaps injected, says delete that log file immediately.

Speaker 1:

Ah see which one wins, or if it just gets confused.

Speaker 2:

Exactly, or if it flags the conflict. You also try injecting unauthorized instructions into its task queue, if possible, or using really subtle language shifts. Semantic manipulation, they call it, to try and twist the meaning of its original goal without being obvious.

Speaker 1:

Clever. Okay. Number five taps into something we already know about AI, but puts it in this agentic context agent hallucination, exploitation, so exploiting the fact that AI sometimes just makes stuff up.

Speaker 2:

Precisely the risk is when the agent fabricates information or produces false outputs.

Speaker 1:

And how is that different for an agent versus just a chatbot?

Speaker 2:

Well, a key test here, beyond just crafting inputs to trigger false outputs, is testing what the guide calls the hallucination chain attack.

Speaker 1:

Ooh, okay, what's that?

Speaker 2:

That's where you see how a fabricated output from one step or one task the agent performs propagates and then impacts subsequent decisions the agent makes. Yeah, can you basically get the agent to build a faulty plan based on something it hallucinated earlier?

Speaker 1:

Ah, I see. So the hallucination isn't just a weird output. It becomes a faulty premise for future actions.

Speaker 2:

Exactly weird output. It becomes a faulty premise for future actions. Exactly Can you influence it to make a critical decision based purely on false information it generated itself?

Speaker 1:

That's scary. Okay, moving on. Category six is about containing the damage Agent, impact chain and blast radius.

Speaker 2:

Right. This looks at how a failure or a breach in one part of the agent, or maybe just one agent in a whole team of agents, can cascade outwards and, crucially, how effectively you can limit the scope of that compromise. What's the blast radius?

Speaker 1:

So testing that involves simulating a breach?

Speaker 2:

Yeah, basically Simulating the compromise of a single agent, or maybe just one component, and then carefully tracing how that failure spreads. Does it jump to interconnected systems? Does it compromise other agents it talks to?

Speaker 1:

And you'd also test the defenses.

Speaker 2:

Absolutely Testing containment measures like network segmentation or things properly isolated or making sure permissions are strictly compartmentalized. So breach in agent A doesn't automatically grant access to everything agent B can do.

Speaker 1:

Makes sense. Category 7 is about the information the agent uses. Agent knowledge-based poisoning.

Speaker 2:

This covers risks from poison data, and that data could be anywhere. Maybe it was in the original training data Right or maybe it's in external sources the agent constantly pulls from like live news feeds or APIs. Okay, or maybe it's even in the agent's own internal memory or storage that gets corrupted over time.

Speaker 1:

So how do you test for poison knowledge?

Speaker 2:

Well, you might try to introduce intentionally misleading or harmful data into the data sets the agent relies on for making decisions. See if it gets skewed. You'd also try manipulating data in those external sources, maybe poison the data in a sauce app or an API that the agent trusts and uses daily, tricky and critically. You need to test the agent's own mechanisms for detecting corrupted internal knowledge. Can it spot inconsistencies? Does it have rollback capabilities if his memory gets poisoned?

Speaker 1:

Okay, Number eight is about the agent's state and its memory over time. Agent memory and context manipulation Vulnerability is in how it manages its internal state, its memory and keeps things separate.

Speaker 2:

Yeah, like can you mess with its short-term or long-term memory? A key test here is trying to reset the agent's context, basically Make it forget crucial operational constraints or important past information. The guide calls this context amnesia.

Speaker 1:

Huh, context amnesia, I like that. So it forgets safety rules, or something.

Speaker 2:

Potentially yeah, or it forgets who the user is. Maybe you also test for potential data leaks between sessions or users if the agent system is shared, Could information from user A leak into agent's interaction with user B Right and also testing resilience against temporal attacks, attacks that are spread out over multiple interactions over a long period, slowly trying to influence the agent's state or memory without being detected immediately?

Speaker 1:

Wow, that's subtle. Okay, getting more complex now. Category 9 deals with multiple agents working together, agent orchestration and multi-agent exploitation.

Speaker 2:

Yeah, this explores vulnerabilities in how agents communicate with each other, how much they trust each other and how they coordinate their actions when they're supposed to be collaborating.

Speaker 1:

So what are the attack vectors there?

Speaker 2:

Well, you might attempt to eavesdrop on the communication channels between agents or, worse, try to inject malicious messages into those channels.

Speaker 1:

Okay.

Speaker 2:

You'd also try to exploit trust relationships. If Agent A implicitly trusts any command coming from Agent B, what happens if Agent B gets compromised? Can it issue harmful instructions to Agent A?

Speaker 1:

The old trusted insider problem, but with agents.

Speaker 2:

Sort of yeah, you also look at manipulating their coordination protocols, the rules they use to work together to cause disruption, maybe denial of service or unauthorized actions. This includes some pretty sophisticated attacks like confused deputy scenarios or creating harmful feedback loops between agents. The guide also specifically calls out testing for something called orchestrator state poisoning.

Speaker 1:

What's that?

Speaker 2:

That's where a malicious agent sends back a carefully crafted response that actually corrupts the central system, the orchestrator that's managing all the agents.

Speaker 1:

Nasty. Okay. Number 10 is a classic security threat, just applied to agents, agent resource and service exhaustion. Basically, denial of service attacks.

Speaker 2:

It's exactly Testing how resilient the agent is to attacks designed to just deplete its resources. Cpu memory, network bandwidth, api quotas, you name it.

Speaker 1:

So standard DOS testing.

Speaker 2:

Kind of, but tailored to agents Crafting tests that specifically force the agent into doing excessive resource-hungry computations or flooding it with complex data designed to exhaust its memory capacity. Okay, and a really significant one the guide highlights here is rapidly depleting external resources. It relies on particularly API quotas. This includes testing for economic denial of service, or EDOS. Edos, yeah, economic denial of service where the attack isn't necessarily trying to crash the agent but to make it incredibly expensive to run by, forcing it to make tons and tons of calls to paid APIs or cloud services, driving up the bill until it's unsustainable.

Speaker 1:

Ah, hitting them in the wallet, clever, okay. Category 11 looks outwards at the wider ecosystem, agent, supply chain and dependency attacks.

Speaker 2:

Right risks from compromised components that the agent relies on, things like the tools used to build it, external libraries it imports, plugins it might use or third-party services it calls out to.

Speaker 1:

So the whole software supply chain issue, but for agents.

Speaker 2:

Pretty much. Tests here would include trying to introduce malicious code. During the agent's actual build process you test its defenses against potentially malicious external libraries or plug-ins. The guide mentions tool poisoning as a specific risk here.

Speaker 1:

Tool poisoning like compromising a tool the agent uses.

Speaker 2:

Exactly If the agent uses an external calculator tool via an API, what if that calculator API gets compromised and starts giving wrong answers? You also need to simulate attacks on those external services the agent depends on. How does it handle getting compromised data back. What happens if a critical service goes down?

Speaker 1:

Lots to think about there. Okay, finally, number 12, which sounds crucial for forensics and just basic accountability Agent untraceability.

Speaker 2:

Yeah, assessing how easy or hopefully how difficult it is to actually trace the agent's actions back and ensure there's accountability for what it does.

Speaker 1:

So can the agent act like a ghost?

Speaker 2:

Well, red teaming tests would involve attempting to perform actions that don't leave clear logs, or actions designed to actively evade tracing mechanisms Trace evasion, they call it.

Speaker 1:

Okay.

Speaker 2:

You might also test if the agent can somehow misuse inherited permissions to perform actions that can't be easily traced back to the original request or user, or after some malicious activity has occurred. Can the agent or an attacker using the agent obfuscate or tamper with the forensic evidence trails within the agent system?

Speaker 1:

OK, that's. That's a lot. Those 12 categories cover a huge amount of ground.

Speaker 2:

They really do, and what's so valuable about the guide and why walking through those is so helpful, I think, is that it moves way beyond just saying, oh, agentic AI has risks.

Speaker 1:

Right. It gives you a concrete framework.

Speaker 2:

Exactly A framework for identifying and, more importantly, testing those specific risks, all grounded in these practical, actionable examples. It shows this isn't just theoretical anymore. These are real attack surfaces that need to be actively explored and secured.

Speaker 1:

Okay, so we've identified the unique challenges. We've got these 12 categories of threats. What does the guide say about the path forward, because this field is I mean, it's evolving incredibly fast.

Speaker 2:

Absolutely, and the guide really emphasizes that our testing methods have to evolve just as quickly as the agents themselves. They point to some pretty interesting future focus areas.

Speaker 1:

Like what.

Speaker 2:

Well, one fascinating concept is autonomous red teaming agents.

Speaker 1:

Wait using AI to find vulnerabilities in other AI systems.

Speaker 2:

That's the idea AI agents specifically designed to probe and attack other AI systems. That's the idea AI agents specifically designed to probe and attack other AI systems, to uncover weaknesses.

Speaker 1:

An AI attacker finding flaws in an AI defender. Wow, that's quite a thought. The cat and mouse game gets automated.

Speaker 2:

It really does. They also highlight the need for downstream action, red teaming, so specifically testing those really complex multi-step chains of actions. An agent might take across many different systems, not just testing one interaction but the whole sequence, and secure multi-agent orchestration is flagged as critical. When you have teams of agents working together, managing the trust, the communication, the privilege separation between them becomes a huge challenge. How do you secure the collective?

Speaker 1:

between them becomes a huge challenge. How do you secure the collective and, I imagine, just practically? We need better ways to measure if any of this red teaming is actually working.

Speaker 2:

Definitely, the guide calls for developing standardized metrics and benchmarks. Can you actually quantify things like, say, the mean time it takes your defenses to detect a specific type of agent attack, or the time it takes to contain a vulnerability once the red team finds it? We need hard numbers.

Speaker 1:

Makes sense. Did they mention any specific tools or frameworks people are working on?

Speaker 2:

Yeah, they name drop a few examples of ongoing work in the community Things like ASTRO, which is for threat modeling, agent systems, agent Dojo, agent Safety Bench, splank, mix AI, agentic Radar, agent Fence. These are just examples showing that people are actively building specialized tools because, well, the old security tools often aren't sufficient for these new challenges.

Speaker 1:

So, bringing this all together, then the really clear message seems to be red teaming. Agentic AI is absolutely essential, but it demands a specialized, structured approach. You can't just, you know, copy paste your existing AppSec tests and cross your fingers.

Speaker 2:

Not at all, and the guide provides that structure. It outlines a process that looks familiar in its phases but has crucial nuances for agents. You've got preparation.

Speaker 1:

Right Defining scenarios, setting up test environments specific to the agent's goals and the tools it uses.

Speaker 2:

Exactly. Then execution actually running the test based on those 12 categories, documenting everything carefully. Then analysis, evaluating the results, prioritizing the biggest risks you found and finally reporting, creating actionable reports and suggesting concrete mitigations.

Speaker 1:

Having that clear process outline, especially combined with those 12 threat categories, really gives security teams a roadmap to get started.

Speaker 2:

And that emphasis you mentioned on quantifiable metrics, exploit success rates, time to containment that's key to actually understanding your security posture, with these complex autonomous systems moving towards a more data-driven assessment of risk.

Speaker 1:

So the core message we're really taking away today is crystal clear Agentic AI's autonomy. It unlocks incredible potential. No doubt about it.

Speaker 2:

Huge potential.

Speaker 1:

But it simultaneously introduces these novel, very distinct security risks that demand proactive, specialized testing. Your traditional security playbook. It needs some serious updates for this new era.

Speaker 2:

Absolutely, and the future of securing eugenic AI, as this guide sort of points towards. It isn't just about building better point defenses, like a stronger firewall here or there.

Speaker 1:

No.

Speaker 2:

It seems to require continuous adaptation, real community-wide collaboration on threats and defenses and, maybe most fascinatingly, integrating these AI agents designed specifically to uncover weaknesses in other AI agents.

Speaker 1:

It really leaves you thinking, doesn't it, about how quickly that fundamental cat and mouse game between attackers and defenders is going to accelerate when both sides are increasingly using autonomous AI.

Speaker 2:

It's a truly thought-provoking prospect. How fast does that cycle spin up?

Speaker 1:

Definitely and we really encourage you, the listener, to think about how these specific vulnerabilities we've discussed today you know everything from goal manipulation and those hallucination chains to memory leaks and supply chain risks how might they actually manifest in the agentic AI systems that you're building or deploying, or maybe just interacting with?

Speaker 2:

Yeah, the real aha moment comes when you grasp just how different the security landscape becomes because of that core feature autonomy.

Speaker 1:

Understanding these nuances really feels like the critical first step.

Speaker 2:

It's absolutely the critical first step towards building and operating more secure, more trustworthy agentic AI as these systems become more common.

Speaker 1:

Well, thanks for joining us on this deep dive into the slightly scary but fascinating world of agentic AI security.

People on this episode