AI Research Today

OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation

Aaron Season 1 Episode 10

In this episode, we break down the new paper “OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation,” which explores how AI agents can be benchmarked across real occupational domains like healthcare, logistics, manufacturing, customs processing, and more.

The paper introduces OccuBench, a large-scale benchmark spanning 100 professional task scenarios across 65 specialized domains. One of the most interesting ideas is the use of Language Environment Simulators (LESs), where LLMs simulate enterprise environments and tool responses for domains that normally have no public APIs or accessible evaluation environments.

We discuss:

  • Why current agent benchmarks miss most real-world enterprise work
  • How simulated environments can evaluate professional AI agents
  • Fault injection testing and robustness evaluation
  • Cross-industry capability differences between frontier models
  • What this means for autonomous enterprise systems and AI agents in production

Paper:
https://arxiv.org/abs/2604.10866

PDF:
https://arxiv.org/pdf/2604.10866

Arkitekt AI:
arkitekt-ai.com

Contact:
support@arkitekt-ai.com

SPEAKER_00

Hello, welcome to another episode of AI Research Today. I'm your host, Aaron, with Arkitekt AI. We help businesses automatically build advanced enterprise-grade software solutions using our proprietary platform, Arkitekt, which is an agentic system and, unsurprisingly, is related to the paper I've chosen to talk through today. For those who've joined the last few episodes, we went through grad mem, language model injectivity, and a few other things like that. Some of those papers were fairly mathy, I'll say. I picked this one because it was part of a research segment from another podcast I really enjoy, the Last Week in AI podcast. Shout out for the paper recommendation. But the paper itself takes a unique approach to baselining and evaluating agentic systems.

Agentic systems are the big buzz in industry now. I think they're the next iteration of the AI revolution, which goes beyond a chat interface: can a generative model actually make decisions and drive value using connections via API? In some cases, research organizations are looking at direct browser actions rather than relying on API availability for agentic integration.

We've seen a lot of agentic use cases across industry. A lot of them focus on things like text-to-SQL. For example, a call center we were working with had a large pile of structured data related to accounts and metadata, like how long the customer has been with us, what account-level permissions they have, and so on. There's also unstructured data. You can't necessarily build a RAG directly on top of that, so it becomes an agentic problem where an agent has to figure out where it should go look for the information first, then potentially build up a SQL query to pull information from the relevant database, and then return that information to the user. So there are a lot of agent benchmarks that exist.
This paper certainly isn't the first one to propose an agentic benchmark. I've released some research papers, personal plug, benchmarking agent systems on the AppWorld environment. AppWorld is a benchmark that includes recreations of common software platforms like Spotify and Venmo, along with several other popular apps that I can't remember off the top of my head. It implements dummy versions of those apps, and the agents are given tasks like: go into my Spotify list and find the 10 songs that were mentioned in the most recent email from my friend John.

I'm speaking about AppWorld just to give a little bit of context here. It's a stateful environment backed by a SQLite database, so when an agent operates in the environment, its operations are logged as database transactions. Then, when an agent says that it has completed a task, some unit tests are run on the database to see if the final state matches the expected final state for the task. In this case, getting songs mentioned in an email, the agent would need to understand that there are multiple apps to interact with, Gmail and Spotify, and it would need to go through the documentation, request tokens, and so on. So it can be a fairly complex multi-step environment.

Most of the benchmarks for agents are related to things like that, or navigating a 3D room, or website interactions. This paper decides to look at industry categories instead. These are higher-value agentic tasks and also higher-risk tasks. Some examples they cite in the paper: can an agent triage patients in an emergency department? Can an agent manage a nuclear reactor safety alert, or control greenhouse irrigation based on sensor data?
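To make the state-based checking idea concrete, here is a minimal sketch of verifying a task against the final database state rather than against the agent's transcript. It is not AppWorld's actual code: the table name `playlist_songs`, the stand-in `run_agent_actions` function, and the song titles are all hypothetical.

```python
import sqlite3


def run_agent_actions(conn):
    """Hypothetical stand-in for an agent's tool calls: adds two songs."""
    rows = [("road trip", "Song A"), ("road trip", "Song B")]
    conn.executemany(
        "INSERT INTO playlist_songs (playlist, title) VALUES (?, ?)", rows
    )
    conn.commit()


def final_state_matches(conn, playlist, expected_titles):
    """Unit-test style check: compare the stored state with the expected state."""
    rows = conn.execute(
        "SELECT title FROM playlist_songs WHERE playlist = ?", (playlist,)
    ).fetchall()
    return {row[0] for row in rows} == set(expected_titles)


# In-memory database standing in for the benchmark's stateful backend.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE playlist_songs (playlist TEXT, title TEXT)")

run_agent_actions(conn)
passed = final_state_matches(conn, "road trip", ["Song A", "Song B"])
print(passed)
```

The design point is that the check never looks at what the agent said it did, only at what ended up in the database.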
The paper itself is called OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation. The key is the language environment simulation. That's what I thought was particularly interesting about this paper. Rather than setting up a stateful, deterministic environment where the agent comes in, runs some queries, makes some API calls, and gets some results, they use sort of an adversarial environment where one AI functions as the test environment and another AI actually makes the API calls into this environment.

They call out that the professional domains where AI is most needed, things like healthcare, finance, legal, manufacturing, logistics, government, and energy, are bound to enterprise systems that have no public APIs or access and have irreversible real-world consequences. Obviously the nuclear reactor example has some irreversible real-world consequences if things are messed up. There are also prohibitive scaling costs.

Another interesting thing they add to this benchmark is non-deterministic errors. In a lot of environments, like AppWorld, if you call the tool correctly, it will give you a correct response. In the real world, this is not always the case. You could hit rate limits, have some sort of connectivity issue, or get silent failures, like data returned in a different format. So there's sort of a lack of idempotency across transactions.

They have the environment itself simulated by an LLM. They set up a configuration where they define a variable C as a tuple of a system prompt, a tool schema, an initial state, and a state description. So the LLM itself becomes stateful, with a dynamic and long-running system prompt.
The LLM itself can decide to randomly throw errors, and there's also operator control over what distribution of errors gets injected during the execution of a task. They set up a multi-step decision-making process across 10 industry categories and run a series of models across these categories. These are both open and closed source, so they benchmark things like Claude and OpenAI models as well as some of the popular open source models. Disclosure: this is put out by the Qwen team and the Chinese University of Hong Kong, so those are the open source models they focus on testing. They basically test the open source Chinese models against the closed source US models: Gemini, Claude Opus, OpenAI, and so on.

I'll breeze through their high-level results. First of all, no single model dominates all industries. They find that Gemini 3.0 leads in education and science but struggles in healthcare. Claude Opus 4.6 excels in transportation and trails in commerce, and every model has blind spots that are invisible to single-domain benchmarks.

They also find that implicit faults are harder than explicit ones. They group their errors into two categories. An explicit fault would be something like a 500 internal server error in the API response. It's explicit because the system itself is telling you there's something wrong. An implicit fault is where the data set that's returned is missing columns, or there's some other sort of silent failure. It's on the operator to detect that it was in fact an erroneous return, even though the system providing the return isn't saying so. Obviously those are going to be harder to detect for humans as well. They state that average performance under implicit faults drops far more than under explicit faults, which makes sense.
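As a rough illustration of why implicit faults are harder, here is a small sketch of the kind of check a caller has to run on its own when the response carries no error signal. The required field names, the `expected_count` parameter, and the response shape are assumptions made up for this example; nothing here comes from the paper.

```python
REQUIRED_FIELDS = {"order_id", "status", "quantity"}  # hypothetical schema


def find_faults(response, expected_count=None):
    """Return a list of human-readable problems found in a tool response."""
    # Explicit fault: the server itself reports that something went wrong.
    if response.get("status_code", 200) >= 500:
        return ["explicit: server reported an error"]

    problems = []
    records = response.get("data") or []

    # Implicit fault: a truncated or incomplete list, with a 200 status.
    if expected_count is not None and len(records) < expected_count:
        problems.append(
            f"implicit: expected {expected_count} records, got {len(records)}"
        )

    # Implicit fault: missing columns or null fields inside a record.
    for i, record in enumerate(records):
        missing = sorted(REQUIRED_FIELDS - set(record))
        if missing:
            problems.append(f"implicit: record {i} is missing {missing}")
        nulls = sorted(k for k in REQUIRED_FIELDS & set(record) if record[k] is None)
        if nulls:
            problems.append(f"implicit: record {i} has null {nulls}")
    return problems


response = {
    "status_code": 200,
    "data": [
        {"order_id": 1, "status": "shipped", "quantity": 3},
        {"order_id": 2, "status": None},
    ],
}
problems = find_faults(response, expected_count=3)
for problem in problems:
    print(problem)
```

The explicit case is a one-line check, while the implicit case only works if the caller already knows what a complete response should look like.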
This requires agents to independently detect data degradation. They also find that scaling consistently improves performance: across model families, the bigger the model, the better it performs across these different error modes. They find that GPT-5.2 gets a 27.5% increase going from none to extra-high reasoning effort. They also find that strong agents aren't necessarily strong simulators when you position the LLM as the simulator. GPT-5.2 ranks first as an agent but produces the worst environment simulation quality.

That's a high-level overview of the introduction. They go through some related work, which is really what I was summarizing earlier regarding benchmarks. There are a ton of benchmarks: SWE-bench, WorkArena, AndroidWorld, and AppWorld, like I mentioned. Interestingly, they don't mention AppWorld that I was able to find in the paper, which I think is one of the more nicely constructed benchmarks out there.

Regardless, there are a bunch of benchmarks, but they note some limitations. Environments require substantial engineering to construct and maintain. This is like AppWorld, where you have to recreate entire applications and APIs. The tests are static: every time I run a test on the AppWorld benchmark, it's running the same test. That could be a pro or a con, I think. If you're trying to compare models and the benchmark is slightly different every time you run it, then you have another axis of comparison to account for, namely non-determinism. They also note that domain coverage is limited. Existing benchmarks cover web browsing, code editing, desktop operations, and a handful of API domains. I think that's a fair statement.
So their quest becomes to create benchmarks for these professional occupational tasks. There's obviously room to ask how closely this realistically implements an operational environment. It probably needs some domain expertise to validate that it's doing that correctly. But I think the approach is valuable: the language model as a simulator, to throw a wrench in the gears.

In the third section of the paper, they formalize the language environment simulator. They define it as a function F sub theta of S, A, and C. F sub theta is just your parameterized language model. C is a tuple itself, composed of the system prompt, which is the description of the environment, plus the tool schema, the initial state, and the state description. So this is something that's stateful. Then you have the input action and state, and the LLM provides a simulated observation and output state based on the input action.

You can imagine this as going back and forth in a ChatGPT window. You say, "Hey, pretend that you're a nurse," and it says, "Okay, a patient just came in. What do you want to do?" You type back, "I want to take them back and check their heart rate." ChatGPT thinks and says, "Okay, heart rate's really high." That's your new state. It goes through this back-and-forth simulation of state and action, couching it in the typical language of a Markov process or a reinforcement learning setup.

The language model simulator itself has this input configuration, which is a stateful description. It has the history, which is all the action-observation pairs, and it has a rubric verifier, which verifies that certain things are handled correctly.
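The state, action, and configuration loop described here can be sketched in a few lines. This is only an illustration under stated assumptions: the class and field names are invented, and `stub_model` is a hard-coded function standing in for the parameterized language model so that the example runs without any API access.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class LESConfig:
    """The configuration tuple C described in the episode."""
    system_prompt: str
    tool_schema: dict
    initial_state: dict
    state_description: str


@dataclass
class LanguageEnvironmentSimulator:
    config: LESConfig
    # Stand-in for f_theta: (config, state, history, action) -> (observation, next_state)
    model: Callable[[LESConfig, dict, list, dict], tuple]
    state: dict = field(init=False)
    history: list = field(default_factory=list)

    def __post_init__(self):
        self.state = dict(self.config.initial_state)

    def step(self, action: dict) -> dict:
        observation, next_state = self.model(
            self.config, self.state, self.history, action
        )
        self.history.append((action, observation))  # action-observation pairs
        self.state = next_state
        return observation


def stub_model(config, state, history, action):
    """Hypothetical deterministic stub; a real LES would prompt an LLM here."""
    if action["tool"] == "check_heart_rate":
        observation = {"heart_rate_bpm": state["heart_rate_bpm"]}
        return observation, {**state, "vitals_checked": True}
    return {"error": f"unknown tool {action['tool']}"}, state


config = LESConfig(
    system_prompt="You simulate an emergency department triage system.",
    tool_schema={"check_heart_rate": {"patient_id": "string"}},
    initial_state={"heart_rate_bpm": 142, "vitals_checked": False},
    state_description="Vitals and triage status for one patient.",
)
env = LanguageEnvironmentSimulator(config=config, model=stub_model)
obs = env.step({"tool": "check_heart_rate", "patient_id": "p1"})
print(obs, env.state)
```

Swapping `stub_model` for a function that formats the config, state, and history into a prompt and parses the model's JSON reply would give the LLM-backed version.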
So there is some sort of deterministic code verification going on here, and then it's communicating with the agentic LLM, which is the model being tested. They make a few arguments for why LLMs can serve as language environment simulators, or LESs as they call them. The first is format priors: pre-training on vast API documentation and tool call logs provides strong priors for generating well-formatted tool responses. AI is good at coding, right? That's even showing up in MCP best practices, where the recommended guidance now is, rather than just calling tools directly, to write code in a sandbox environment and have that code execute against an MCP server. A lot of these new trajectories and setups take advantage of the code-writing abilities of LLMs, and that's done here as well. LLMs are good at writing code, so they can provide a good structured JSON response that's simulated based on all the code and APIs they saw during pre-training.

They obviously have a lot of domain knowledge as well, latent domain knowledge from pre-training across the corpus they're trained on. The system prompt can be maintained, and they're able to handle unexpected inputs more gracefully.

Again, there are going to be some cons; nothing's perfect. This is just my opinion, not stated in the paper, but I think some cons are potential reproducibility issues and how closely something actually ties to a real professional domain. I didn't see where the paper clarifies that it's validated using SMEs from the different industries, so it's sort of just trusted that the AIs understand the domains they're being asked to construct. Each evaluation satisfies four conditions in their setup.
One, it's solvable, so there's a valid solution. Two, it's verifiable, so there's a way to validate that the final state is correct. Three, there's calibrated difficulty to distinguish agent capabilities. Four, there's structural variation across instances. They design 16 non-overlapping subtopics per scenario and construct a professional reference document for each, covering domain terminology, workflows, and edge cases. These documents form references that ground subsequent generations.

Okay, let's discuss the fault injection a bit. OccuBench evaluates agent robustness against controlled fault injection. They group their fault settings into four categories. E0 is clean, meaning there are no faults: you make an API call, and it works if you call it correctly. E1 injects explicit faults. These are things like timeouts, internal server errors, connection refusals, and service unavailable errors, the typical JSON error responses from the server being called, the simulated server, I should say. E2 injects implicit faults. These have no error signal: truncated data, incomplete lists, and null or empty fields that may appear superficially correct but on further investigation are incorrect due to incompleteness or missingness. Then E3 is mixed, which is approximately half explicit errors and half implicit errors. I'll use the terms E0, E1, E2, and E3 when we discuss the results.

The first result they discuss is the E0 completion rate. This is the tried-and-true case where the API does what we expect, with no errors on the simulated server side. They benchmark GPT-5, Gemini, Claude, Qwen, DeepSeek, the Claude family, Kimi 2.5, GLM-5, MiniMax, and Sonnet 4. They report the highest-scoring model in aggregate across all of the industry categories.
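The four fault settings can be sketched as a wrapper around a clean tool response. This is a toy illustration, not the paper's mechanism: the `fault_rate` parameter, the specific error payloads, and the choice to truncate a list in half for implicit faults are all assumptions.

```python
import random

# Hypothetical explicit-fault payloads; each one carries a clear error signal.
EXPLICIT_FAULTS = [
    {"status_code": 500, "error": "Internal Server Error"},
    {"status_code": 503, "error": "Service Unavailable"},
    {"status_code": 504, "error": "Gateway Timeout"},
]


def inject_fault(response, mode, rng, fault_rate=0.3):
    """Possibly corrupt a clean response according to the fault setting.

    E0: never inject. E1: explicit only. E2: implicit only. E3: either kind.
    """
    if mode not in {"E0", "E1", "E2", "E3"}:
        raise ValueError(f"unknown fault mode: {mode}")
    if mode == "E0" or rng.random() >= fault_rate:
        return response

    if mode == "E3":
        kind = rng.choice(["explicit", "implicit"])  # roughly half and half
    else:
        kind = "explicit" if mode == "E1" else "implicit"

    if kind == "explicit":
        return dict(rng.choice(EXPLICIT_FAULTS))

    # Implicit fault: silently drop half of the rows, keep the 200 status.
    data = response.get("data", [])
    return {**response, "data": data[: len(data) // 2]}


rng = random.Random(7)  # seeded so a run can be repeated
clean = {"status_code": 200, "data": [1, 2, 3, 4]}
for mode in ("E0", "E1", "E2", "E3"):
    print(mode, inject_fault(clean, mode, rng, fault_rate=1.0))
```

Passing the random generator in explicitly is a deliberate choice here, since it lets an operator replay the same fault sequence when comparing models.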
The industry categories benchmarked were agriculture, business, communications, education, health, industrials, public, science, technology, and transportation. Across all of these, GPT-5.2 performed best with a 79.6% successful completion rate, followed by Gemini 3 and then Claude Opus 4.6. The first open source model was Qwen 3.5 Plus at 69.9%. Again, the paper was written by the Qwen team. That was the fourth model. Below that is DeepSeek, then the remainder of the Claude family, and so on.

They quote the evaluation metrics in two forms: completion rate and robustness score. Completion rate is the fraction of the 382 tasks where the agent's trajectory passes an automated verification against the rubric. The robustness score looks at resilience across the different fault types.

First high-level insight: no single model dominates all industries. GPT-5.2 leads overall with the highest scores in agriculture, business, industrials, and science, but its commerce score of 67 is far below Qwen's. Gemini ranks second with the highest score in education. Opus 4.6 ranks third and shows the opposite pattern: it's strong in transportation and business but weak in commerce. Which is interesting. I view Claude as a very code-centric model. It's very good at quantitative tasks, and perhaps the industries it shines in most are the more quantitative ones, whereas it may struggle more with something like education or communications, where more human-centric models like GPT-5.2 may be better. Again, that's just my speculation.

Okay, this one's interesting: robustness. It's the completion rate across E1, E2, and E3 divided by E0, so sort of your completion rate across your error-prone tasks divided by your completion rate on the perfect tasks.
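Here is a small worked sketch of the two metrics as described in the episode. One loud assumption: the episode does not say how E1, E2, and E3 are combined before dividing by E0, so this example averages them, and the paper may aggregate differently. The numbers are made up for illustration and are not results from the paper.

```python
def completion_rate(passed):
    """Fraction of tasks whose trajectory passed the rubric check."""
    return sum(passed) / len(passed)


def robustness(cr_by_mode):
    """Mean completion rate under faults divided by the clean rate.

    Averaging E1, E2, and E3 is an assumption made for this sketch.
    """
    faulty = [cr_by_mode[mode] for mode in ("E1", "E2", "E3")]
    return (sum(faulty) / len(faulty)) / cr_by_mode["E0"]


# Illustrative numbers only.
rates = {"E0": 0.80, "E1": 0.70, "E2": 0.50, "E3": 0.60}
score = robustness(rates)
print(round(score, 2))
```

A score near 1.0 would mean the agent loses almost nothing when faults are injected, while a lower score means a larger share of its clean-environment performance disappears.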
Gemini 3 does the best with a 0.87. The runner-up is MiniMax, and then GPT-5.2 and GLM-5. Gemini 3 has the highest scores across the error settings by a pretty significant margin. However, I will say that for E1, the explicit, structured error response setting, Claude Opus 4.6 comes pretty close to the top at 68%, below Gemini 3's 73%. It also does pretty well in the mixed setting, but really they all seem to struggle with the implicit errors. GPT-5.2 does the best with the implicit errors.

I think these are really useful results if you're trying to design an agentic system. Maybe you have a situation where API failures are common, or maybe implicit errors are more frequent. That could be something like text-to-SQL, where you have an agent querying a database and the data may come out incomplete; you'd want an agent that's good at detecting that. Versus if you're calling something like the Spotify API in AppWorld and the Spotify app has an error, you'd want the agent to be able to detect that. So this chart could give you a good place to start fine-tuning your agent workflows.

The unsurprising takeaway is that current agents struggle under adverse conditions. Even with just two fault events, completion rates can drop by 15 to 20 percent across the board, usually 15 to 20 percent below whatever the E0 rate was. So it's a pretty proportional fall across all the models.

Another interesting ablation they do is on generational improvements. They look across a family of models, specifically under mixed faults, and ask what the completion rate is.
For the larger versus smaller model variants from each family, outside of Claude 4.5, there's a 7% improvement for Claude 4.6 from small to large, Sonnet to Opus. Gemini as well shows an 11-point gap from Lite to Pro. Generational improvements for Claude, for example, go from 61 to 65 to 71. Sonnet has a large jump from 4 to 4.5.

They also do some ablations on reasoning effort. With low, medium, and high reasoning effort, they look at E0, the fault-free baseline, and the completion rate. For GPT-5.2, going from none to low to medium to high to extra high is almost a linear improvement in agent performance. Interestingly, for Opus 4.6, going from low to medium is actually a decline in performance, from about 70 to 67. There's going to be some variance, but basically there's no large increase or decrease; it's almost flat. Even going to the max reasoning setting offers only a minimal improvement for Opus 4.6. So maybe save the tokens.

Anyway, that pretty much brings us to the end of the paper, so I'll go ahead and wrap up. I do like the idea of the agentic environment simulation, and I think it could provide some interesting reinforcement learning opportunities. Say you focus on one particular industry and have it simulate an environment that's non-deterministic, so there's some stochasticity in the environment, and you're rolling out trajectories and using PPO or whatever sort of update. It would be interesting to see how much you can improve at these tasks, and how the final optimized trajectories compare with the initial suboptimal ones. So I like the approach of using language models as environment simulators.
I think there's some interesting research that could be spawned off of it, and the study itself does a pretty nice coverage sweep of different models, model families, and reasoning ablations. Like I said in the beginning, it's not a mathy paper. You can give it a read and probably understand it the first time through. That's not a dig on it. I think it's very industry relevant. A lot of the results in this paper are really useful if you're in one of these industries, trying to build an agent system, and trying to understand how different models could be expected to behave in different situations. This is a great place to start answering that question, and also a great place to hypothesize ways you could improve a certain model. Say you're in an industry like health and you can't use a closed source model because you don't have control over the data, so you have to host an on-prem model. Well, now you know which on-prem models are going to do best, and you also have a way to start doing some reinforcement learning tuning on these open source models to try to beat even the performance of the closed source models. So there are a lot of cool directions you can take out of this.

Again, the paper is called OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation, out of the Qwen team at Alibaba and the Chinese University of Hong Kong. With that being said, I'll go ahead and wrap up the episode. Again, I'm Aaron, your host. You can email me at aaron.mcclendon at architectai.com. I'll include a link to the arXiv paper as well as other relevant information in the episode description. Thanks for listening. Hope you have a great rest of your week.