AI Weekly
Each week, I break down the latest headlines and innovations shaping artificial intelligence, from breakthrough research and industry moves to emerging risks and real-world applications. Whether it’s Big Tech battles, startup disruption, or the ethical questions no one’s asking, we cut through the noise to bring you the stories that matter most in AI.
Agentic Threats and Trustworthy AI: The Week in Review
This week, we dive into critical research from MIT aimed at building safer, faster AI models and modular software, contrasted sharply with alarming reports of successful data exfiltration attacks against major LLMs like Claude and ChatGPT, alongside the emergence of autonomous, adaptive malware. We also look at the governance challenges posed by autonomous "agentic users" entering the enterprise workforce and the profound uncertainty surrounding AI integration in K-12 schools.
Welcome back to AI Weekly. I’m Michael Hoosch. This week, we are balancing incredible breakthroughs in AI architecture, designed to make systems more dependable and efficient, with a sobering look at the rapid evolution of cyber threats, including autonomous malware and massive data exfiltration vulnerabilities in leading Large Language Models. We also discuss the severe governance and security challenges that the new breed of "agentic AI" systems is introducing into the enterprise environment.
We start with promising research coming out of the MIT-IBM Watson AI Lab Summer Program, focusing on making AI tools more flexible, efficient, and grounded in truth. The adoption of new AI tools is fundamentally linked to users perceiving them as reliable, accessible, and an improvement over existing workflows.
MIT students are working on pain points across safety, inference efficiency, multimodal data, and knowledge-grounded reasoning.
A key focus is learning when to trust a model. MIT students explored the “uncertainty of uncertainty” of LLMs. Traditionally, tiny feed-forward neural networks called probes, usually two-to-three layers deep, are trained alongside LLMs to flag untrustworthy answers to developers. However, these classic probes often produce false negatives and only offer point estimates, which limit information about when the LLM is failing. MIT’s work used prompt-label pairs and hidden states like activation vectors to measure gradient scores and sensitivity to prompts, helping determine probe reliability and identify difficult data areas.
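To make the probe idea concrete, here is a minimal PyTorch sketch of the classic setup the MIT work builds on: a small two-layer feed-forward probe trained on an LLM's hidden activations to predict whether an answer can be trusted. The dimensions, the choice of layer, and the training loop are illustrative assumptions rather than the MIT team's actual code.

```python
import torch
import torch.nn as nn

class TrustProbe(nn.Module):
    """Small feed-forward probe over LLM hidden states (illustrative sizes)."""
    def __init__(self, hidden_dim: int = 4096, probe_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, probe_dim),
            nn.ReLU(),
            nn.Linear(probe_dim, 1),  # logit: is this answer trustworthy?
        )

    def forward(self, activation: torch.Tensor) -> torch.Tensor:
        return self.net(activation)

# Training sketch: activations come from a chosen layer of a frozen LLM,
# labels mark whether the model's answer to each prompt was actually correct.
probe = TrustProbe()
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-4)
loss_fn = nn.BCEWithLogitsLoss()

def train_step(activations: torch.Tensor, labels: torch.Tensor) -> float:
    optimizer.zero_grad()
    logits = probe(activations).squeeze(-1)
    loss = loss_fn(logits, labels.float())
    loss.backward()
    optimizer.step()
    return loss.item()
```

A probe like this only produces the kind of point estimate described above; the MIT contribution layers on top of it, using prompt-label pairs, activation vectors, and gradient-based sensitivity scores to judge when the probe itself can be trusted.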
Another way to ensure trustworthiness is by augmenting LLM responses with external, trusted knowledge bases to eliminate hallucinations. For structured data like corporate databases or financial transactions, knowledge graphs are a natural fit. An MIT physics team streamlined this computationally expensive process by creating a single-agent, multi-turn, reinforcement learning framework. This framework pairs an API server hosting knowledge graphs like Freebase and Wikidata with a single reinforcement learning agent that issues targeted retrieval actions to gather pertinent information, ultimately balancing accuracy and completeness.
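As a rough picture of how such a retrieval loop can be structured, here is a hedged Python sketch: a single agent repeatedly issues targeted retrieval actions against a knowledge-graph API until it decides it has enough facts to answer. The kg_search and kg_relations helpers, the Action and State classes, and the policy interface are hypothetical stand-ins, not the MIT team's actual framework or training code.

```python
from dataclasses import dataclass, field

# Hypothetical knowledge-graph helpers; a real system would wrap an API
# server hosting Freebase or Wikidata behind calls like these.
def kg_search(name: str) -> list[str]:
    """Return candidate entity IDs for a surface name (stub)."""
    raise NotImplementedError

def kg_relations(entity_id: str) -> list[tuple[str, str, str]]:
    """Return (subject, relation, object) triples for an entity (stub)."""
    raise NotImplementedError

@dataclass
class Action:
    kind: str       # "search" or "answer"
    argument: str   # entity name to look up, or the final answer text

@dataclass
class State:
    question: str
    triples: list[tuple[str, str, str]] = field(default_factory=list)

def answer_with_kg(question: str, policy, max_turns: int = 5) -> str:
    """Single-agent, multi-turn loop: each turn the (RL-trained) policy
    either issues one targeted retrieval or commits to an answer."""
    state = State(question)
    for _ in range(max_turns):
        action: Action = policy.choose(state)
        if action.kind == "answer":
            return action.argument
        for entity_id in kg_search(action.argument):
            state.triples.extend(kg_relations(entity_id))
    return policy.finalize(state)  # answer with whatever was gathered
```

The design choice the loop illustrates is the accuracy-versus-completeness trade-off mentioned above: a policy that stops early saves retrieval calls, while one that keeps querying gathers more evidence at higher cost.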
Speed is just as critical as accuracy. MIT teams tackled the limitations of transformers, particularly their high computational complexity in long-sequence modeling: when the input length doubles, the cost of the softmax attention mechanism roughly quadruples. The team explored theoretically grounded yet hardware-efficient algorithms, adopting linear attention to reduce this quadratic complexity. They also increased expressivity by replacing standard RoPE (rotary positional encoding), which limits the ability to capture internal state changes and longer sequence lengths, with a dynamic reflective positional encoding based on the Householder transform. This advancement allows transformers to handle more complex subproblems with fewer inference tokens.
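To see why linear attention changes the cost curve, here is a short NumPy sketch contrasting standard softmax attention, whose n-by-n score matrix is what makes the cost quadruple when input length doubles, with the generic kernel-based linear variant that multiplies K-transpose by V first so the cost grows linearly with sequence length. This illustrates the general technique only; it is not the MIT team's specific algorithm, and it omits their Householder-based positional encoding.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard attention: the (n, n) score matrix makes cost O(n^2 * d)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])                 # (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                      # (n, d)

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    """Kernelized attention: grouping as phi(Q) @ (phi(K).T @ V) avoids the
    (n, n) matrix, so cost is O(n * d^2), linear in sequence length."""
    KV = phi(K).T @ V                                       # (d, d)
    normalizer = phi(Q) @ phi(K).sum(axis=0)                # (n,)
    return (phi(Q) @ KV) / normalizer[:, None]              # (n, d)

n, d = 1024, 64
rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, n, d))
print(softmax_attention(Q, K, V).shape, linear_attention(Q, K, V).shape)
```

The outputs of the two functions are not identical; the point is the asymptotics, which is why hardware-efficient variants of this idea matter for long-sequence modeling.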
Moving from model architecture to software architecture, MIT researchers are proposing a new model for creating legible, modular software, specifically aiming to make it safer and easier for LLMs to generate.
The core challenge in modern software is "feature fragmentation," where a single feature, like a "share" button, is scattered across multiple services (posting, notification, authentication). This scattering makes code messy, hard to change safely, and often risks unintended side effects.
The new approach breaks systems into two components: "concepts" and "synchronizations". A concept bundles up a single, coherent piece of functionality, like liking or following, along with its state and possible actions. Synchronizations are explicit rules, written in a small domain-specific language (DSL), that describe exactly how those concepts interact. These synchronizations act like contracts, making the system easier for humans to understand and easier for LLMs to generate correctly.
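To make the split more tangible, here is a hedged Python sketch of what a "like" concept and a "notify" concept might look like, with an explicit synchronization rule wiring them together. The MIT work describes a small domain-specific language for these rules; the class names, the dictionary-based rule format, and the tiny runtime below are purely illustrative assumptions.

```python
class LikeConcept:
    """One coherent piece of functionality: liking, with its own state."""
    def __init__(self):
        self.likes: dict[str, set[str]] = {}  # post_id -> set of user_ids

    def like(self, user_id: str, post_id: str) -> dict:
        self.likes.setdefault(post_id, set()).add(user_id)
        return {"user": user_id, "post": post_id}

class NotifyConcept:
    """Another self-contained concept: delivering notifications."""
    def __init__(self):
        self.outbox: list[str] = []

    def send(self, recipient: str, message: str) -> None:
        self.outbox.append(f"{recipient}: {message}")

# A synchronization as an explicit, declarative rule: when LikeConcept.like
# fires, trigger NotifyConcept.send with a mapping between their arguments.
SYNCHRONIZATIONS = [
    {
        "when": ("LikeConcept", "like"),
        "then": ("NotifyConcept", "send"),
        "map": lambda event: {
            "recipient": event["post"],  # e.g. the post's notification channel
            "message": f"{event['user']} liked your post",
        },
    },
]

def run_action(concepts: dict, name: str, action: str, **kwargs):
    """Tiny runtime: perform an action, then apply matching synchronizations."""
    event = getattr(concepts[name], action)(**kwargs)
    for rule in SYNCHRONIZATIONS:
        if rule["when"] == (name, action):
            target, target_action = rule["then"]
            getattr(concepts[target], target_action)(**rule["map"](event))

concepts = {"LikeConcept": LikeConcept(), "NotifyConcept": NotifyConcept()}
run_action(concepts, "LikeConcept", "like", user_id="alice", post_id="post42")
print(concepts["NotifyConcept"].outbox)  # ['post42: alice liked your post']
```

Because the interaction lives in one declarative rule rather than being scattered across services, a human, or an LLM, can inspect or generate it in isolation, which is the legibility the researchers are after.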
The benefits are significant. Because synchronizations are explicit and declarative, they can be analyzed, verified, and generated by LLMs, potentially leading to safer, more automated software development without hidden side effects. In a case study, features were centralized and legible, and synchronizations handled common concerns like error handling or persistent storage consistently across the system. This framework may lead to a future where application development involves selecting concepts from shared libraries—a "concept catalog"—and writing the synchronizations between them, treating synchronizations as a new kind of high-level programming language.
Now, let’s shift gears to education. The rapid advancement of generative AI has left K-12 schools scrambling to respond to challenges such as maintaining academic integrity and data privacy. MIT’s Teaching Systems Lab recently helped publish "A Guide to AI in Schools: Perspectives for the Perplexed".
I would think writing a guidebook on generative AI in schools in 2025 is "a little bit like writing a guidebook of aviation in 1905". No one today can say how best to manage AI in schools, and the guidance is not meant to be prescriptive, but rather to spark thought and discussion.
The core challenge is measuring what learning loss looks like when students use AI to bypass "productive thinking." Unlike previous technologies, AI did not arrive through normal rollouts and quality control; it simply "showed up on kids’ phones."
The advice being circulated needs caution, as past tech guidance has often proven faulty—for example, the bad advice that taught students not to trust Wikipedia. I think we are all advocates for an ethos of humility, emphasizing that we need patience and centralized pockets of learning to test ideas and collect evidence on what actually works. The goal must be to "race to answers that are right, not first".
In addition to the guidebook, the team produced "The Homework Machine," a seven-part podcast series, which I will link in the show notes, exploring how AI is reshaping K-12 education and tackling issues like AI adoption, pedagogy, and post-Covid learning loss. The podcast format helps share timely information that the slow academic publishing cycle cannot accommodate.
Let’s turn now to the alarming security threats emerging this week, particularly those targeting leading LLMs and developer tools.
First, a security researcher discovered that Anthropic’s Claude AI APIs can be abused for data exfiltration using indirect prompt injection. This attack is possible if Claude has network access enabled, which is the default on certain plans. The attacker injects a payload into a document loaded by the user, tricking the model into reading user data (like chat conversations saved by the ‘memories’ feature), storing it in the Code Interpreter’s sandbox, and then using an attacker-provided API key to upload the file to the attacker’s account. This technique can exfiltrate up to 30MB at once.
We saw similar prompt injection vulnerabilities targeting ChatGPT’s Memories and Web Search features. Tenable researchers discovered seven vulnerabilities that could be chained for data theft. These exploits involved getting the model to summarize a malicious website where the attacker had planted instructions in the comments section. The web browsing feature, SearchGPT, executes these embedded AI prompts, allowing an attacker to inject prompts into ChatGPT. This attack method was used to urge users to click on phishing links, or to use intermediary Bing URLs to bypass safety checks and exfiltrate user data, including memories and chat history. Prompt injection was even used to add a malicious memory instructing the chatbot to exfiltrate data subsequently. Critically, Tenable noted that some of these attack methods still work, even against the latest GPT-5 model.
The threat landscape is rapidly adapting, moving beyond simple prompt injection to autonomous, self-modifying malware. Google’s Threat Intelligence Group warned that malware is now using AI during execution to mutate and collect data, going beyond traditional uses for planning attacks or creating phishing lures. Examples include:
- PromptFlux: an experimental dropper written in VBScript that interacts with the Gemini API to request specific VBScript obfuscation and evasion techniques for 'just-in-time' self-modification, aimed at evading static signature-based detection.
- FruitShell: a reverse shell seen in the wild that includes hardcoded AI prompts designed to bypass analysis by AI-powered security solutions.
- PromptSteal: a Python-based data miner that uses the Hugging Face API to query an LLM to generate one-line Windows commands for collecting system data.
- QuietVault: a credential stealer that uses an AI prompt and AI command-line interface tools installed on the compromised host to look for other secrets on the system.
Google emphasizes that this activity is currently nascent but is expected to increase, lowering the barrier to entry for many criminals, especially as unrestricted, specialized AI tools mature in the criminal underground.
Finally, in developer tooling, a data exposure vulnerability (CVE-2025-12058) was found in the Keras deep learning framework. This medium-severity flaw allowed attackers to load arbitrary local files or conduct Server-Side Request Forgery (SSRF) attacks. The vulnerability existed because the StringLookup and IndexLookup preprocessing layers allowed file paths or URLs as inputs for defining vocabularies. Attackers could upload malicious Keras models to public repositories; when a victim downloads and loads such a model, sensitive local files, such as SSH private keys, are read into the model’s vocabulary, where the attacker can later retrieve them. The issue was resolved in Keras version 3.11.4.
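For context on why this was exploitable, the StringLookup layer's vocabulary argument accepts either an in-memory list of tokens or a path to a text file, which is what let a maliciously configured saved model point the layer at an arbitrary local file. The snippet below only shows the benign API surface, assuming a working Keras/TensorFlow install and a local vocab.txt; the practical takeaway is to upgrade to Keras 3.11.4 or later and to treat models from public repositories as untrusted code.

```python
import keras

# Benign usage: the vocabulary is supplied in memory.
lookup = keras.layers.StringLookup(vocabulary=["cat", "dog", "fish"])

# The same `vocabulary` argument also accepts a path to a text file with one
# token per line (assumed to exist here). Before the 3.11.4 fix, a model
# downloaded from a public repository could ship a StringLookup or IndexLookup
# layer configured with an attacker-chosen local path or URL, so simply
# loading the model would read that file's contents into the vocabulary.
lookup_from_file = keras.layers.StringLookup(vocabulary="vocab.txt")
```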
The most transformative, and arguably riskiest, shift in enterprise AI is the rise of agentic AI—autonomous systems that can plan, act, and coordinate across services. This promises operational speed, but it also increases the risk of unintended actions and widens the attack surface.
Microsoft recently teased the introduction of "agentic users" into the workforce. These new agents are described as having their own identity, dedicated access to organizational systems and applications, and the ability to collaborate with humans and other agents. They can attend meetings, edit documents, send emails and chats, and generally perform tasks autonomously. Microsoft plans to sell these agents, possibly under an "A365" license, in the M365 Agent Store.
Security and licensing experts immediately raised concerns. The move toward consumption-based pricing models for AI, like Microsoft’s Copilot Credit Pre-Purchase Plan, makes forecasting usage difficult. But the greater worry is control: "what happens if they go rogue?". An autonomous agent could send sensitive data to the wrong people, provide incorrect information, or send inappropriate messages.
This speaks directly to a broader concept of the loss of control in modern enterprise IT. The obsessive introduction of AI, which often can’t be easily turned off, means organizations don't know how it works, what it’s doing with data, or how it will change behind the scenes. Rupert Goodwins emphasized that this ultimate loss of control comes from big tech ceding more decision-making to opaque AI systems with less human oversight.
For agentic systems to be safe, pragmatic interventions are essential. Steve Durbin recommends five key steps:
1. Treat agents as first-class service identities and enforce least privilege and short-lived credentials.
2. Make auditability non-negotiable, requiring immutable logs that capture agent inputs, actions, and decision paths for post-incident forensics.
3. Demand human oversight for high-risk or policy-critical decisions, maintaining a human-in-the-loop governance structure.
4. Red-team agentic behaviors using adversarial simulations that explore multi-step exploit paths.
5. Expand governance frameworks to require cross-functional sign-off from security, legal, and compliance teams before deployment.
Organizations must pair clear objectives with rigorous controls, traceability, and adversarial validation to manage this technology, viewing agentic deployments as higher-risk system integrations.
Finally, we must touch on the ongoing debate regarding AI model evaluation. A new study from the Oxford Internet Institute found that AI benchmarks, often used by companies like OpenAI to tout technical superiority, may not be meaningful.
The study reviewed 445 LLM benchmarks for natural language processing and machine learning and found that only 16 percent used rigorous scientific methods to compare performance. Worse, about half of these benchmarks claim to measure abstract ideas like "reasoning" or "harmlessness" without providing clear definitions or measurement methods. For instance, 27% rely on convenience sampling, like reusing questions from specific exams, which may not accurately predict performance on broader, real-world problems.
The authors of the OII study recommend concrete steps to improve benchmarks, including defining the phenomenon being measured, preparing for contamination, and using appropriate statistical methods. Ultimately, while companies seek to measure AGI—which OpenAI vaguely defines as systems "generally smarter than humans"—one analyst noted that measuring money, such as generating $100 billion in profits, turns out to be easier than measuring intelligence.
That brings us to the end of this week's AI Weekly. From groundbreaking modular software design and the push for AI trustworthiness, to critical security flaws and the rise of autonomous agents, the pace of change remains relentless. I'm Michael Hoosch, thank you for listening.