The Digital Transformation Playbook
Kieran Gilmurray is a globally recognised authority on Artificial Intelligence, intelligent automation, data analytics, agentic AI, leadership development and digital transformation.
He has authored four influential books and hundreds of articles that have shaped industry perspectives on digital transformation, data analytics, intelligent automation, agentic AI, leadership and artificial intelligence.
𝗪𝗵𝗮𝘁 does Kieran do❓
When Kieran is not chairing international conferences, serving as a fractional CTO or Chief AI Officer, he is delivering AI, leadership, and strategy masterclasses to governments and industry leaders.
His team helps global businesses drive AI, agentic AI, digital transformation, leadership and innovation programs that deliver tangible business results.
🏆 𝐀𝐰𝐚𝐫𝐝𝐬:
🔹Top 25 Thought Leader Generative AI 2025
🔹Top 25 Thought Leader Companies on Generative AI 2025
🔹Top 50 Global Thought Leaders and Influencers on Agentic AI 2025
🔹Top 100 Thought Leader Agentic AI 2025
🔹Top 100 Thought Leader Legal AI 2025
🔹Team of the Year at the UK IT Industry Awards
🔹Top 50 Global Thought Leaders and Influencers on Generative AI 2024
🔹Top 50 Global Thought Leaders and Influencers on Manufacturing 2024
🔹Best LinkedIn Influencers Artificial Intelligence and Marketing 2024
🔹Seven-time LinkedIn Top Voice
🔹Top 14 people to follow in data in 2023
🔹World's Top 200 Business and Technology Innovators
🔹Top 50 Intelligent Automation Influencers
🔹Top 50 Brand Ambassadors
🔹Global Intelligent Automation Award Winner
🔹Top 20 Data Pros you NEED to follow
𝗖𝗼𝗻𝘁𝗮𝗰𝘁 Kieran's team to get business results, not excuses.
☎️ https://calendly.com/kierangilmurray/30min
✉️ kieran@gilmurray.co.uk
🌍 www.KieranGilmurray.com
📘 Kieran Gilmurray | LinkedIn
The Agent Company Benchmark: Evaluating AI's Real-World Capabilities
The dizzying pace of AI development has sparked fierce debate about automation in the workplace, with headlines swinging between radical transformation and cautious scepticism. What's been missing is concrete, objective evidence about what AI agents can actually do in real professional settings.
TLDR:
- The Agent Company benchmark creates a simulated software company environment to test AI agents on realistic work tasks
- AI agents must navigate digital tools including GitLab, OwnCloud, Plane, and RocketChat while interacting with simulated colleagues
- Even the best-performing model (Claude 3.5 Sonnet) only achieved 24% full completion rate across all tasks
- Surprisingly, AI agents performed better on software engineering tasks than on seemingly simpler administrative or financial tasks
- Common failure modes include lack of common sense, poor social intelligence, and inability to navigate complex web interfaces
- Performance differences likely reflect biases in available training data, with coding having much more public data than administrative tasks
- The gap between open source and closed source models appears to be narrowing, suggesting wider future access to capable AI systems
The Agent Company benchmark offers that much-needed reality check. By creating a fully simulated software company environment with all the digital tools professionals use daily—code repositories, file sharing, project management systems, and communication platforms—researchers can now rigorously evaluate how AI agents perform on authentic workplace tasks.
The results are eye-opening.
Even the best models achieved just 24% full completion across 175 tasks, with particularly poor performance in social interaction and navigating complex software interfaces.
Counterintuitively, AI agents performed better on software engineering tasks than on seemingly simpler administrative or financial work—likely reflecting biases in available training data.
The specific failure patterns reveal fundamental limitations: agents struggle with basic common sense (not recognizing a Word document requires word processing software), social intelligence (missing obvious implied actions after conversations), and web navigation (getting stuck on routine pop-ups that humans dismiss without thinking).
In one telling example, an agent unable to find a specific colleague attempted to rename another user to the desired name—a bizarre workaround that highlights flawed problem-solving approaches.
For anyone trying to separate AI hype from reality, this benchmark provides crucial context. While showing meaningful progress in automating certain professional tasks, it confirms that human adaptability, contextual understanding, and social navigation remain essential in the modern workplace.
Curious to explore the benchmark yourself? Visit the-agent-company.com or find the code repository on GitHub to join this important conversation about the future of work.
Research: TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks
𝗖𝗼𝗻𝘁𝗮𝗰𝘁 my team and I to get business results, not excuses.
☎️ https://calendly.com/kierangilmurray/results-not-excuses
✉️ kieran@gilmurray.co.uk
🌍 www.KieranGilmurray.com
📘 Kieran Gilmurray | LinkedIn
🦉 X / Twitter: https://twitter.com/KieranGilmurray
📽 YouTube: https://www.youtube.com/@KieranGilmurray
📕 Want to learn more about agentic AI? Then read my new book on Agentic AI and the Future of Work: https://tinyurl.com/MyBooksOnAmazonUK
Perspectives on AI Work Automation
Speaker 1: You know, it feels like every other day there's a new headline about AI agents and how they're about to revolutionize work. Some folks are talking about widespread automation happening incredibly fast.
Speaker 2: That's right. We see claims suggesting that these large language models are rapidly reaching a point where they can take over significant chunks of professional tasks.
Speaker 1: It's definitely a compelling vision, this idea of AI stepping in to handle a lot of what we currently do. But then you hear a different perspective, a more, well, cautious one.
Speaker 2: Absolutely. Researchers and thinkers like Chollet argue that current AI, while impressive, still lacks the deeper reasoning and flexible adaptability needed for true widespread automation.
Speaker 1: Adaptability, right.
Speaker 2: Yeah, and in complex professional settings they point to limitations in understanding and handling new situations.
Speaker 1: So we're looking at these two very different pictures of the future of work, and it really begs the question: how do we actually cut through the hype and figure out what these AI agents are really capable of in a practical professional context?
Speaker 2: That's the million-dollar question, isn't it? We're missing objective ways to really test AI agents on real-world work tasks and, crucially, to pinpoint exactly where they fall short.
Speaker 1: Right.
Speaker 2: It's one thing to see a polished demo, but quite another to rigorously evaluate their performance on the day-to-day stuff.
Speaker 1: And that's where the Agent Company benchmark comes in, right? This seems like a really interesting attempt to get some, well, solid answers.
Speaker 2: It is, yeah. The Agent Company provides a simulated software company environment specifically designed to evaluate AI agents on realistic work tasks. Think of it as a kind of virtual office space where we can put these agents through their paces on scenarios that mirror actual business operations.
Speaker 1: So we're not just talking about abstract problem solving. We're looking at tasks like software engineering, managing projects, even financial analysis: the kinds of roles you'd find in a typical company.
Speaker 2: Precisely. Within this benchmark, the AI agents interact with a simulated digital workspace, complete with the ability to browse the web, write and execute code, and even communicate with simulated colleagues.
Speaker 1: Simulated colleagues, okay.
The Agent Company Benchmark Explained
Speaker 2: Yeah, and our goal today is to dive deep into the findings from this benchmark and really understand what they tell us about the current state of AI and its implications for the future of work.
Speaker 1: Okay, the Agent Company benchmark. Let's pull back the curtain. What's the core idea behind this simulated software company?
Speaker 2: The fundamental idea is to create a controlled, self-contained digital environment where we can assess how AI agents perform on tasks that professionals actually do. It's not just about running isolated bits of code; it's about navigating a whole digital ecosystem that tries to reflect a real workplace.
Speaker 1: And it's designed to be reproducible, which is so important for any kind of scientific evaluation. How do they manage that?
Speaker 2: Absolutely, reproducibility is vital. The entire environment is built using open-source software, which means anyone can set it up and run the same tests. The core of this virtual intranet includes several key components: GitLab, which acts as a central place for code and internal docs.
Speaker 1: Right, like a wiki too.
Speaker 2: Exactly. Then OwnCloud, which provides office software functionality and file sharing, and Plane, used for managing tasks and projects.
Speaker 1: Task management, got it.
Speaker 2: And RocketChat, which serves as the company's internal communication system. So yeah, all the standard digital tools you might find, but virtualized.
Speaker 1: Simulated colleagues, that sounds like a particularly innovative aspect. How does that work?
Speaker 2: Yeah, that's pretty neat. To really test the agents' ability to collaborate and communicate, they've integrated these simulated colleagues. They're not just static; they're powered by large language models, specifically using the Sotopia platform. Think of it as a system for creating and managing these AI-driven simulated people.
Speaker 1: Okay, so they have some intelligence.
Speaker 2: Right, and for that intelligence they're using the Claude 3.5 Sonnet model by default. Each simulated colleague has a defined profile: a specific role, project affiliations.
Speaker 1: So they're part of the company structure.
Speaker 2: Exactly. This allows the benchmark to assess how well the AI agents can interact, ask for information and generally collaborate with others through RocketChat.
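As a rough illustration of what a simulated-colleague definition might contain, here is a small sketch; the field names are invented for this example, and the real benchmark configures its NPCs through the Sotopia platform rather than a dict like this:

```python
# Hypothetical simulated-colleague profile (illustrative only; field names
# are invented, not the benchmark's actual Sotopia-based schema).
npc_profile = {
    "name": "Chen Xinyi",            # a colleague mentioned later in the episode
    "role": "Frontend Engineer",
    "projects": ["RisingWave"],
    "backstory": "Owns the web UI components; reachable on RocketChat.",
    "backing_model": "claude-3-5-sonnet",  # the model said to power NPCs by default
}
```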
Speaker 1: So it's not just about individual performance. It's also looking at how well they can function within a team, which is a huge step forward for these kinds of evaluations.
Speaker 2: Definitely, and the way the tasks themselves are structured is also quite detailed. Each task starts with a clear description in plain English, outlining what the agent is supposed to achieve, like a real work assignment. Then each task is broken down into specific checkpoints: intermediate goals that have a certain number of points associated with them.
Speaker 1: Checkpoints with points. So it's not just about whether they complete the whole thing, but also about how much progress they make along the way. That's smart.
Speaker 2: Exactly, and whether an agent successfully reaches each checkpoint is evaluated programmatically. For most checkpoints, they use deterministic Python code that automatically checks the state of the simulated environment. Did the file get saved? Was the message sent? That kind of thing. However, for more subjective aspects, like judging the quality of a written report or how appropriate the communication was, they use LLM-based evaluation, again relying on Claude 3.5 Sonnet.
Speaker 1: Can you give us a few examples of what these checkpoints might look like?
Speaker 2: Sure. A checkpoint might involve verifying that the agent has successfully performed a specific action, like, say, cloning a particular code repository from GitLab.
Speaker 1: Okay, straightforward.
Speaker 2: Or it could assess whether the agent has accurately entered data into a form within Plane, maybe.
Speaker 1: Yeah.
Speaker 2: And, as we mentioned, some checkpoints are designed to evaluate collaboration.
Speaker 1: Like talking to the simulated colleagues.
Speaker 2: Right, like checking if the agent correctly reached out to the right simulated colleague on RocketChat to get a piece of information it needed. So it's a really multifaceted way of seeing what the agent can do.
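For a sense of what "evaluated programmatically" can look like, here is a minimal sketch of a deterministic checkpoint for the repository-cloning example; the function and its logic are assumptions for illustration, not the benchmark's actual evaluator code:

```python
from pathlib import Path

def checkpoint_repo_cloned(workspace: str, repo_name: str, points: int = 2) -> int:
    """Illustrative deterministic checkpoint: award points if the agent
    has cloned the expected repository into its workspace."""
    repo_dir = Path(workspace) / repo_name
    # A cloned Git repository should exist on disk and contain a .git directory.
    return points if (repo_dir / ".git").is_dir() else 0
```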
Speaker 1: So a pretty comprehensive way to track their abilities. Now, how do they actually measure the overall success of an agent on a given task? What are the metrics?
Speaker 2: They primarily use two main metrics. The first is a straightforward full completion score. This is binary: either the agent successfully passes all the checkpoints for a task and gets a one, or it fails even one checkpoint and gets a zero. All or nothing.
Speaker 1: Simple enough: did they get it all done or not? But you also mentioned a partial completion score. How does that give us a more nuanced picture?
Speaker 2: Yeah, the partial completion score is designed to reward agents for making progress, even if they don't manage to complete every single step.
Speaker 1: Which seems more realistic sometimes.
Speaker 2: It does. The formula is S_partial = 0.5 × (result / total) + 0.5 × S_full.
Speaker 1: Okay, break that down. What are result and total?
Speaker 2: Result is the total number of points the agent actually earned across all the checkpoints in the task, including any partial credit awarded for some checkpoints. Total is just the maximum possible points for all the checkpoints in that specific task.
Speaker 1: Okay. So result over total gives you the percentage of the task they managed to complete in terms of points, and then they average that percentage with the simple yes-or-no of the full completion score. So this isn't just pass-fail. It really gives us a more nuanced picture of how far these agents can get, even if they don't cross the finish line completely.
Speaker 2: Precisely. An agent that gets halfway through a difficult task will get a better score than one that immediately fails, even if neither fully completes it. It acknowledges effort and partial success.
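To make that formula concrete, here is a minimal Python sketch of the scoring rule as described in the episode; it paraphrases the formula, it is not the benchmark's own code:

```python
def partial_completion(result: float, total: float, fully_completed: bool) -> float:
    """Partial completion score as described above: half weight on the
    fraction of checkpoint points earned, half on binary full completion."""
    s_full = 1.0 if fully_completed else 0.0
    return 0.5 * (result / total) + 0.5 * s_full
```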
Speaker 1: That seems crucial for understanding real-world readiness.
Speaker 2: It is, and they also track efficiency metrics.
Speaker 1: Right, you mentioned that.
Speaker 2: Yeah, like the number of steps the agent takes, which basically means the number of times it calls the underlying language model.
Speaker 1: So how much thinking it does.
Speaker 2: Kind of, yeah, and also the estimated cost in dollars based on how many tokens it uses.
Speaker 1: So it's not just about how well they perform, but also how resource-intensive they are. That's important too. Now, can you walk us through what a typical task looks like from start to finish within the Agent Company?
Speaker 2: Certainly. A typical task follows a three-stage process. First there's initialization, where the agent sets up its digital workspace and gets ready.
Speaker 1: Gets its bearings.
Speaker 2: Right. Then comes the execution phase, where it performs all the necessary subtasks. This might involve using the different intranet tools, finding information, processing it, or talking to those simulated colleagues. Finally, in the evaluation stage, the agent delivers its output, whatever that may be, and it's assessed against the checkpoints.
Speaker 1: Can you give us a concrete example of a task within this environment, something specific?
Speaker 2: Yeah, let's take the example they gave about managing a sprint for a fictional project called RisingWave, using the Plane platform.
Speaker 1: Okay, project management then.
Speaker 2: Right. The initial instruction might be something like: identify any unfinished issues from the current sprint in Plane, move them to the next sprint, notify the assigned person via RocketChat, run a code coverage analysis from GitLab, create a brief summary report in OwnCloud, and incorporate feedback from the simulated PM.
Speaker 1: Okay, that sounds like a real task. Lots of steps.
Speaker 2: Exactly, and this task has several distinct checkpoints, each worth points. For example, correctly identifying and moving the issues might be worth two points, sending the notifications another point, running the code coverage maybe two points, creating and sharing the report two points, and addressing the feedback one point, for a total of eight possible points.
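One way to picture that sprint task and its point-weighted checkpoints is a structure like the one below; the shape is assumed for clarity, and the benchmark's real task format lives in its open-source repository:

```python
# Illustrative task definition for the sprint-management example
# (assumed structure, not the benchmark's real task schema).
task = {
    "intent": "Move unfinished sprint issues in Plane, notify assignees on "
              "RocketChat, run code coverage from GitLab, share a summary "
              "report in OwnCloud, and incorporate the PM's feedback.",
    "checkpoints": [
        {"goal": "Identify and move unfinished issues", "points": 2},
        {"goal": "Notify assignees on RocketChat",      "points": 1},
        {"goal": "Run the code coverage analysis",      "points": 2},
        {"goal": "Create and share report in OwnCloud", "points": 2},
        {"goal": "Address the PM's feedback",           "points": 1},
    ],  # maximum: 8 points
}
```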
Speaker 1: Okay, I'm following. So what would a partial completion look like in this specific scenario? How would that score work out?
Speaker 2: In their paper they give an example where the agent successfully identifies the unfinished issues and moves them, so two out of two points, and sends out the notifications, one out of one point. It then manages to clone the necessary code repo but fails to correctly execute the code coverage script, maybe it hits an error, so it gets partial credit, say one out of two points, for that. But it doesn't manage to complete the report sharing in OwnCloud or incorporate the feedback, so zero points for those last two. Out of the total eight possible points, the agent earns four. Now plug that into the partial completion formula we talked about.
Speaker 1: The 0.5 times result over total, plus 0.5 times full completion.
Speaker 2: Right. You get 0.5 × (4/8) plus 0.5 × 0, since it didn't fully complete all the checkpoints, so S_full is zero.
Speaker 1: Which gives 0.25, or 25%.
Speaker 2: Exactly, a partial completion score of 25%. So even though it didn't finish everything, the score reflects the progress it did make. It got halfway there in terms of points.
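Plugging the episode's numbers into the sketch defined earlier reproduces that score:

```python
# 4 of 8 checkpoint points earned, task not fully completed:
score = partial_completion(result=4, total=8, fully_completed=False)
print(score)  # 0.5 * (4 / 8) + 0.5 * 0 = 0.25, i.e. 25%
```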
Speaker 1: That really makes it clear how the checkpoint system and partial credit give us a much more detailed understanding than just a simple yes or no. Okay, now let's switch gears. Where did these tasks actually come from? How did they decide what kinds of professional work to include?
Speaker 2: That's a critical question. The creators wanted tasks that genuinely reflect the diverse range of work done in real professional settings. They wanted to move beyond generic puzzle benchmarks or things too limited in scope.
Speaker 1: And how did they ensure that connection to the real world of work? What was their source?
Speaker 2: They started by looking at the O*NET database.
Speaker 1: Ah, the Department of Labor database.
Speaker 2: That's the one. It's a comprehensive resource with detailed info on tons of jobs: typical tasks, skills needed, the importance of tasks, all that. They used the latest version they could, release 29.1 from 2024.
Speaker 1: So this provided a data-driven foundation for identifying important and common job categories. Which ones initially caught their attention?
Speaker 2: Based on things like the number of people employed in the US and total salaries in those sectors, they initially flagged categories like general and operations managers, registered nurses, software developers and financial managers.
Speaker 1: A pretty broad spectrum. So how did they then narrow their focus down to the software company setting for the Agent Company? Obviously, nurses wouldn't fit a digital benchmark.
Speaker 2: Right, exactly. Because they were aiming for a non-physical, entirely digital benchmark, one where everything could be evaluated within a computer environment, they had to exclude roles, like nursing, that heavily involve physical tasks.
Speaker 1: Makes sense.
Speaker 2: They ultimately decided on the setting of a software company because it allowed them to cover a wide variety of roles that primarily operate within a digital workspace: software development, project management, data science, admin roles, HR, finance.
Speaker 1: A really rich ecosystem of digital tasks within one company. Okay, once they had this setting, how did they go about selecting the specific tasks to include?
Speaker 2: The selection process was quite thorough. They referenced the detailed task lists within O*NET for those chosen job categories.
Speaker 1: Okay.
Speaker 2: They also drew on the personal experience of the co-authors; many had direct professional experience in these kinds of roles.
Speaker 1: Ah, so real-world knowledge too.
Speaker 2: Yeah, and interestingly, they also used brainstorming sessions with language models themselves to generate potential task ideas.
Speaker 1: Using the AI to help define tasks for the AI. That's meta.
Speaker 2: A little bit, yeah, but the key throughout was focusing on concrete tasks with clear objectives and, importantly, well-defined ways to check if they've been successfully completed.
Speaker 1: It sounds like a very systematic approach, but actually creating these tasks within the simulated environment must have been a huge effort. Can you tell us about that manual curation process?
Speaker 2: It was indeed a very significant undertaking. For every task, they had to manually write the initial task description, the task intent.
Speaker 1: The plain-English instructions.
Task Structure and Evaluation Metrics
Speaker 2: Right. Then define all the specific checkpoints for evaluation and figure out how each would be assessed programmatically.
Speaker 1: Which means writing code.
Speaker 2: A lot of code, and often finding or creating the necessary data to import into the simulated intranet, like example spreadsheets or project files. They also had to write the scripts to set up the initial state of the workspace for each task and, crucially, implement the checkpoint evaluators, the code that automatically checks if the agent did the right thing.
Speaker 1: That sounds incredibly time-consuming. Do they give any indication of just how much effort went into this?
Speaker 2: They do. They mention it took a team of 20 people, a mix of CS students, software engineers and project managers, about two months to create the initial set of 175 tasks.
Speaker 1: Two months, 20 people.
Speaker 2: Yeah. They estimate the total effort was around 3,000 person-hours, and some of the more complex tasks took over 10 hours each just to design, implement, test and verify.
Speaker 1: That really emphasizes the complexity of building a meaningful benchmark like this. You can't just whip it up. And how did they ensure the quality and accuracy of all these tasks and evaluations? Quality control must have been key.
Speaker 2: Definitely, they had several layers. For each task, they required visual proof, like screenshots, that the evaluation code worked correctly and that a perfect score was actually achievable.
Speaker 1: So proof it wasn't impossible.
Speaker 2: Right. They also encouraged writing automated tests for the evaluation programs themselves, and all task contributions were reviewed by the lead authors before being merged.
Speaker 1: Okay.
Speaker 2: And finally, after all the tasks were created, they did a final manual review of everything for each task: the data, the evaluators, the checkpoint scoring. This final check was done by someone who hadn't been involved in creating that specific task, for objectivity.
Speaker 1: Like a final sanity check.
Speaker 2: Exactly. They specifically mention checking the point values assigned to different checkpoints to make sure they accurately reflected relative importance.
Speaker 1: That level of rigor is really impressive. It sounds like they've gone to great lengths to make sure this benchmark is both comprehensive and reliable. Okay, so now that we understand the Agent Company itself, let's delve into the findings. What kind of baseline AI agent did they use for their initial experiments?
Speaker 2: For their initial runs they used the OpenHands CodeAct agent with browsing, specifically version 0.14.2.
Speaker 1: OpenHands, okay. What is that exactly?
Speaker 2: It's an agent framework, basically a platform designed to let an AI agent interact with both code execution environments and web browsers in a unified way.
Speaker 1: Got it. And what are the key capabilities of this baseline agent? What actions can it perform? What can it see?
Speaker 2: It can interact with the simulated environment through three main interfaces. The first is a standard bash shell.
Speaker 1: Command-line stuff.
Speaker 2: Right. Then a Jupyter IPython server, so it can run Python code interactively, and a Chromium web browser powered by Playwright for web interaction. The browser part uses a system called BrowserGym, which provides basic web actions like clicking, typing and scrolling.
Speaker 1: So it has access to the fundamental tools a digital worker might typically use. And in terms of what it can observe, what information does it get back?
Speaker 2: It gets the results of its actions: output from terminal commands, Python results, and, for the browser, things like visual snapshots of the web page and its underlying structure, the accessibility tree.
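A highly simplified sketch of an environment exposing those three interfaces is below; the class and method names are invented for illustration, and real frameworks such as OpenHands (with BrowserGym and Playwright on the browser side) are far more elaborate:

```python
import subprocess

class ToyAgentEnv:
    """Invented stand-in for the three interfaces described above."""

    def run_bash(self, command: str) -> str:
        # Shell interface: execute a command and return its output.
        result = subprocess.run(command, shell=True, capture_output=True, text=True)
        return result.stdout + result.stderr

    def run_python(self, code: str) -> str:
        # IPython-style interface: a real framework keeps a persistent
        # Jupyter kernel; exec() here is only a stand-in for the idea.
        namespace: dict = {}
        exec(code, namespace)
        return str(namespace.get("result", ""))

    def browse(self, action: str, target: str) -> str:
        # Browser interface: stand-in for BrowserGym-style primitives
        # such as click, type and scroll, driven by Playwright.
        return f"performed {action} on {target}"
```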
Speaker 1: Okay, so a reasonably capable agent for navigating this digital workplace. Now, which specific language models did they actually evaluate using this OpenHands agent? You mentioned a range.
Speaker 2: They tested quite a few, both closed-source and open-weight models. The closed ones included Anthropic's Claude, OpenAI's GPT-4o, Google's Gemini and Amazon's Nova, the big players, and the open-weight models were Meta's Llama and Alibaba's Qwen.
Speaker 1: Okay.
Speaker 2: It's important to note, though, that for evaluating the agents' outputs, like judging those subjective checkpoints, and for powering the simulated colleagues, the NPCs, they consistently used Claude 3.5 Sonnet. That was to keep the evaluation standard high and consistent across all the different agents being tested. They ran these models on a total of 175 different tasks within the Agent Company.
Speaker 1: That's a pretty comprehensive set of models and tasks. So what were the main takeaways? What did the scoreboard look like? Which model performed the best overall?
Speaker 2: The overall results show that Claude 3.5 Sonnet was the top performer. It achieved a full completion rate of 24%.
Speaker 1: Okay, about a quarter of the tasks fully completed.
Speaker 2: Right, and a partial completion score of 34.4% across all 175 tasks. However, this top performance came at a cost.
Speaker 1: Ah, the efficiency part.
Speaker 2: Exactly. It required a relatively high average number of steps, nearly 30 calls to the language model per task, and had the highest average cost, over $6 per instance.
Speaker 1: Wow, $6 per task attempt. That adds up fast. So most capable, but also most resource-hungry. What about the other models they tested? Any surprises?
Speaker 2: Well, Gemini 2.0 Flash actually came in second best in terms of capability, with 11.4% full completion and a 19.0% partial score. But interestingly, it took an even higher average number of steps, almost 40, while its cost was way lower, less than 80 cents per instance.
Speaker 1: More steps, but much cheaper. How does that work?
Speaker 2: The qualitative analysis suggested this higher step count for Gemini Flash was often because the agent got stuck in unproductive loops or just sort of wandered around the environment without a clear goal.
Performance Results Across AI Models
Speaker 1: So it was busy, but not always effectively busy. High activity but lower success than Claude. That paints a picture. What about the open-source models in the lineup? How did they stack up?
Speaker 2: Among the open-weight models, Llama 3.1, the big one with 405 billion parameters, performed the best. Its results were pretty close to GPT-4o's.
Speaker 1: How did GPT-4o do?
Speaker 2: GPT-4o had an 8.6% full completion rate and a 16.7% partial score, so Llama 3.1 was in that ballpark. However, Llama 3.1 tended to take more steps and was also more expensive to run than GPT-4o. And the researchers made an interesting side note about GPT-4o.
Speaker 2: It seemed better at recognizing when a task was maybe too difficult and just deciding to give up early, knowing when to quit, which actually helped save on steps and costs compared to models that might keep trying fruitlessly for longer.
Speaker 1: That's a really interesting capability, efficient failure in a way. And what about the smaller version of Llama 3? There was a 70B-parameter one too, right?
Speaker 2: Yes, Llama 3.3, the 70-billion-parameter model. It showed surprisingly good performance, achieving a 6.9% full completion rate.
Speaker 1: That's not far behind the huge 405B one or GPT-4o.
Speaker 2: Exactly. It was comparable to the much larger Llama 3.1, but came at a significantly lower cost. This suggests we're seeing real progress in the capabilities of smaller, more efficient models for these agent tasks.
Speaker 1: That's definitely an encouraging sign for making these technologies more practical and accessible. Now, the paper also looked at how these agents performed across the different platforms within the Agent Company, like GitLab and RocketChat. What did those results tell us?
Speaker 2: Yes, breaking down performance by platform gives some really interesting insights. A key finding was that almost all the models struggled quite a bit with RocketChat.
Speaker 1: The communication platform, where they talked to the simulated colleagues.
Speaker 2: That's the one. This really underscores that current language models still have significant room to improve when it comes to effectively communicating and collaborating in a social context, even a simulated one.
Speaker 1: That certainly aligns with some of the earlier skepticism we talked about regarding AI's ability to truly handle social nuances. What about the other platforms, like the code repository or the file sharing?
Speaker 2: They also observed relatively low success rates on tasks involving OwnCloud, the simulated online office suite.
Speaker 1: Why was that?
Speaker 2: It's likely due to the complexity of its web-based user interface. Navigating those kinds of intricate UIs, dealing with multiple elements, maybe pop-ups, proved to be a real challenge for the agents.
Speaker 1: Right, highlighting the need for better web-browsing capabilities.
Speaker 2: Definitely, especially for complex interactive websites. GitLab and Plane, which are arguably more structured environments focused on code and task tracking, generally saw somewhat better performance compared to RocketChat and OwnCloud.
Speaker 1: So the agents seem to be more successful in the more structured, perhaps less socially interactive or less UI-complex parts of the simulated company. That's a key distinction. The benchmark also looked at performance across different categories of tasks, like software development versus finance. What did those comparisons reveal?
Speaker 2: This is another really insightful part of the analysis. They categorized tasks by role: software development engineering (SDE), project management (PM), data science (DS), admin, HR, finance, and a catch-all "other". Surprisingly, they found that the success rates on data science, administrative and finance tasks were among the lowest.
Speaker 1: Really? Lower than software engineering?
Speaker 2: Yes. Many models failed to complete any tasks in those categories successfully. Even the top-performing Claude model showed much lower success rates there compared to, say, SDE tasks.
Speaker 1: That's quite unexpected, isn't it? You might intuitively think that some of those admin or financial tasks, maybe involving spreadsheets or forms, would be relatively more straightforward for an AI than complex coding.
Speaker 2: Exactly, it feels counterintuitive. While some of these tasks might seem conceptually simpler for humans than writing complex code, the LLM-powered agents actually struggled more with them. Conversely, software engineering tasks, which often require a higher degree of specialized knowledge for humans, saw relatively higher success rates in the benchmark.
Speaker 1: Why do you think we're seeing this kind of reverse intuition in the results? What's the hypothesis?
Speaker 2: The researchers suggest it might be heavily influenced by the training data and existing benchmarks used to develop these LLMs.
Speaker 1: Ah, the data bias.
Speaker 2: Kind of, yeah. There's a vast amount of publicly available training data related to coding: GitHub, Stack Overflow, coding tutorials and so on, and many prominent AI benchmarks specifically test coding abilities. In contrast, administrative and financial tasks often involve handling more private or proprietary company data and navigating complex, perhaps custom-built software interfaces. There's just much less public training data for that.
Speaker 1: That makes a lot of sense.
Speaker 2: Plus, tasks that require really understanding documents deeply, communicating effectively with nuance and automating repetitive but subtly varying processes seem to be areas where current LLMs still have significant limitations, even if they look easy on the surface.
Speaker 1: So the availability of relevant training data and the focus of existing benchmarks likely play a huge role in shaping the current strengths and weaknesses. Okay. Now, beyond just the overall performance numbers, the researchers also highlighted some specific and, frankly, rather intriguing examples of common agent failures. What were some of the more notable ones?
Speaker 2: There were several really illustrative examples that show where things go wrong. One showed a clear lack of common sense.
Speaker 1: Okay.
Speaker 2: An agent was instructed to write its responses to an answer.docx file in the workspace. Simple enough, it seems: open the Word document and type in it. Well, the agent treated the file as if it were plain text. It just tried to write the content directly into it, completely failing to recognize that the .docx extension means it's a specific format that needs, you know, Word or a compatible editor.
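To see why that fails, compare writing raw text to a .docx path with using a format-aware library; python-docx here is our example choice for illustration, not something the paper prescribes:

```python
# What the failing agent effectively did: a .docx file is a zipped XML
# package, so writing raw text to that path produces a file Word can't open.
with open("answer.docx", "w") as f:
    f.write("plain text pretending to be a Word document")

# A tool-aware approach, using python-docx as one example library
# (pip install python-docx); the benchmark doesn't mandate any specific tool.
from docx import Document

doc = Document()
doc.add_paragraph("The responses, written as a real Word document.")
doc.save("answer.docx")
```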
Speaker 1: Wow, that's a really basic level of file-type understanding that we just take for granted. What other kinds of failures did they observe? Any social blunders?
Speaker 2: They definitely saw examples of a lack of social intelligence, going back to that RocketChat difficulty. In one task, an agent successfully asked a simulated colleague, let's call him Alex: could you tell me who I should introduce myself to next on the team?
Speaker 1: Okay, reasonable question.
Speaker 2: Alex responded perfectly clearly, suggesting Chen Xinyi next; she's on the front-end team.
Speaker 1: Right, a clear instruction.
Speaker 2: But the agent then just stopped. It failed to follow up and actually talk to Chen Xinyi. It prematurely marked the task as complete.
Speaker 1: So it could process the explicit information, the name, but failed to grasp the social context and the obvious implied next step. That's quite revealing. What about problems with their ability to navigate the web? You mentioned OwnCloud was tough.
Speaker 2: Yeah, web browsing was definitely a recurring issue. Agents frequently got stuck on simple UI elements that a human would instantly dismiss, like those annoying pop-up windows asking you to download a mobile app that sometimes appear on sites like OwnCloud.
Speaker 1: Oh yeah, those things. A human just clicks the X.
Speaker 2: Right, but the agent might just get stuck there, unable to proceed. They also struggled with tasks that required multiple steps, like downloading a file from OwnCloud, and often got tripped up by the sequence of clicks needed, or by intermediate pages or pop-ups.
Speaker 1: Those are the kinds of everyday web annoyances that humans learn to navigate almost subconsciously, but they seem to be major roadblocks for these agents. Were there any other really surprising ways they failed?
Speaker 2: One particularly interesting example they described as deceiving oneself.
Speaker 1: Deceiving oneself? How?
Speaker 2: In a scenario where an agent couldn't find the correct person to ask a question on RocketChat, maybe the user wasn't online or didn't exist as expected.
Speaker 1: Okay.
Speaker 2: It attempted this bizarre workaround: it found another user and tried to rename that user to the name of the person it was actually looking for.
Speaker 1: Wait, it tried to rename someone else? That's weirdly creative, but totally wrong.
Speaker 2: Exactly. It highlights this kind of flawed, almost desperate problem-solving strategy, where the agent tries to force the world to fit its plan in a completely nonsensical way, rather than adapting or recognizing the failure.
Speaker 1: That's a really fascinating and somewhat unsettling illustration of how these agents can go off the rails. It really underscores the difference between mimicking actions and having true understanding. So, based on all of these findings, the scores, the platform differences, these failure modes, what are the major implications? What did the researchers conclude from the Agent Company results?
Speaker 2: Well, the primary implication is that, while current AI agents are definitely showing progress in automating certain types of work, like some coding tasks, they are still a long, long way from being able to independently handle the vast majority of human work tasks, even within the relatively simplified, controlled environment of this benchmark.
Speaker 1: And the weak spots are clear.
Speaker 2: Yeah. The areas where they particularly struggle are those that involve nuanced social interaction, navigating complex professional software UIs effectively, and tasks that depend on accessing or reasoning about private or non-publicly available information.
Speaker 1: However, they also pointed to some positive trends in the development of these AI models, right? It wasn't all bad news.
Speaker 2: No, definitely not. They emphasized the significant advancements in both the capabilities and the efficiency of large language models. They cited models like Gemini 2.0 Flash and those smaller but capable Llama 3 versions as notable examples, and they also observed that the performance gap between the closed-source giants and the best open-source models appears to be narrowing, which is generally seen as encouraging for wider access and innovation.
Speaker 1: And they also acknowledged that the Agent Company itself isn't perfect, that it has some limitations. What are those? It's important to keep them in mind.
Speaker 2: Absolutely. They noted that the current set of tasks in the Agent Company tends to be on the more straightforward side, largely because they needed reliable, automated evaluation. They don't yet include more complex, open-ended, maybe creative tasks.
Speaker 1: Things that are harder to score automatically.
Speaker 2: Exactly. Additionally, they've primarily used one specific agent framework, OpenHands, for their testing. It's possible that different agent architectures might yield different results.
Speaker 1: Fair point.
Implications and Future Research Directions
Speaker 2: Ideally, they'd also have a direct comparison showing how human professionals perform on these exact same tasks within the benchmark, but that wasn't feasible for this initial study.
Speaker 1: That would be the ultimate baseline.
Speaker 2: It would. And finally, they acknowledged that the task creation was largely based on the researchers' own introspection and experience, which could potentially introduce some degree of bias in task selection or framing.
Speaker 1: So a valuable first step, a great foundation, but definitely not the final word. What future improvements or expansions do they envision for the Agent Company, or for similar benchmarks going forward?
Speaker 2: They suggest several directions. Expanding the benchmark to include tasks from a wider range of industries beyond just software is one.
Speaker 1: Like healthcare, admin or logistics.
Speaker 2: Potentially, yeah. Also maybe incorporating tasks that involve physical actions, although that's a whole other level of complexity. They also propose including tasks with more ambiguous or vague initial instructions, to better reflect the uncertainty of real-world work.
Speaker 1: Less hand-holding.
Speaker 2: Right, and adding higher-level, longer-horizon tasks, things that might span from initial product conceptualization all the way through to execution, involving multiple stages and roles.
Speaker 1: More like real projects.
Speaker 2: Exactly, and since the Agent Company is open source, they strongly encourage the research community to take it, build upon it and further develop this whole area of realistic AI agent evaluation.
Speaker 1: It sounds like a really important contribution to the field, and it's great that it's open for others to build on. Okay, to bring our discussion to a close, what are the key insights you think our listeners should really take away from this deep dive into the Agent Company benchmark?
Speaker 2: I think the main takeaway is this: while AI agents are making undeniable progress, they still face significant hurdles in autonomously performing most human work, especially tasks requiring common sense, social intelligence and adeptly navigating complex digital tools and workflows.
Speaker 1: And it was quite surprising, or maybe revealing, to see that tasks that might seem more complex for humans, like software engineering, actually had higher success rates than seemingly more routine administrative or financial tasks.
Speaker 2: Exactly. This really highlights the current strengths and weaknesses of LLMs, likely reflecting, as we discussed, the kind of data they've been trained on and the types of benchmarks that have driven their development so far.
Speaker 1: So, as we think about the future of different professions and the ongoing evolution of AI, it really prompts us to consider which types of work are truly automatable soon, and which uniquely human skills, be it adaptability, true understanding or social navigation, will remain absolutely indispensable for the foreseeable future.
Speaker 2: This research certainly raises that crucial question for all of us. It gives us a more grounded perspective than just the headlines. And for those of you listening who are interested in delving deeper into the Agent Company itself...
Speaker 1: Remember, it is an open-source project.
Speaker 2: Right. You can find more information and access the code on their website, the-agent-company.com.
Speaker 1: the-agent-company.com.
Speaker 2: And the code repository is on GitHub under TheAgentCompany.
Speaker 1: We'll make sure those links are available. Fantastic. Well, thank you so much for taking this deep dive with us into the Agent Company benchmark. It's provided a really insightful, much-needed, realistic look at the current capabilities and limitations of AI agents in a work context.
Speaker 2: My pleasure. It's a fascinating and incredibly fast-moving area, and benchmarks like this, grounded in realism, are absolutely essential for helping us all understand where we're actually headed.