The Digital Transformation Playbook

The Agent Company Benchmark: Evaluating AI's Real-World Capabilities

Kieran Gilmurray

The dizzying pace of AI development has sparked fierce debate about automation in the workplace, with headlines swinging between radical transformation and cautious scepticism. What's been missing is concrete, objective evidence about what AI agents can actually do in real professional settings.

TLDR:

  • The Agent Company benchmark creates a simulated software company environment to test AI agents on realistic work tasks
  • AI agents must navigate digital tools including GitLab, OwnCloud, Plane, and RocketChat while interacting with simulated colleagues
  • Even the best-performing model (Claude 3.5 Sonnet) only achieved 24% full completion rate across all tasks
  • Surprisingly, AI agents performed better on software engineering tasks than on seemingly simpler administrative or financial tasks
  • Common failure modes include lack of common sense, poor social intelligence, and inability to navigate complex web interfaces
  • Performance differences likely reflect biases in available training data, with coding having much more public data than administrative tasks
  • The gap between open source and closed source models appears to be narrowing, suggesting wider future access to capable AI systems

The Agent Company benchmark offers that much-needed reality check. By creating a fully simulated software company environment with all the digital tools professionals use daily—code repositories, file sharing, project management systems, and communication platforms—researchers can now rigorously evaluate how AI agents perform on authentic workplace tasks.

The results are eye-opening. 

Even the best models achieved just 24% full completion across 175 tasks, with particularly poor performance in social interaction and navigating complex software interfaces. 

Counterintuitively, AI agents performed better on software engineering tasks than on seemingly simpler administrative or financial work—likely reflecting biases in available training data.

The specific failure patterns reveal fundamental limitations: agents struggle with basic common sense (not recognizing a Word document requires word processing software), social intelligence (missing obvious implied actions after conversations), and web navigation (getting stuck on routine pop-ups that humans dismiss without thinking). 

In one telling example, an agent unable to find a specific colleague attempted to rename another user to the desired name—a bizarre workaround that highlights flawed problem-solving approaches.

For anyone trying to separate AI hype from reality, this benchmark provides crucial context. While showing meaningful progress in automating certain professional tasks, it confirms that human adaptability, contextual understanding, and social navigation remain essential in the modern workplace.

Curious to explore the benchmark yourself? Visit the-agent-company.com or find the code repository on GitHub to join this important conversation about the future of work.

Research: TheAgentCompany: Benchmarking LLM Agents on Consequential Real-World Tasks

Support the show


𝗖𝗼𝗻𝘁𝗮𝗰𝘁 my team and me to get business results, not excuses.

☎️ https://calendly.com/kierangilmurray/results-not-excuses
✉️ kieran@gilmurray.co.uk
🌍 www.KieranGilmurray.com
📘 Kieran Gilmurray | LinkedIn
🦉 X / Twitter: https://twitter.com/KieranGilmurray
📽 YouTube: https://www.youtube.com/@KieranGilmurray

Speaker 1:

You know it feels like every other day there's a new headline about AI agents and how they're about to revolutionize work. Some folks are talking about widespread automation happening incredibly fast.

Speaker 2:

That's right. We see claims, like suggestions, that these large language models are rapidly reaching a point where they can take over significant chunks of professional tasks.

Speaker 1:

It's definitely a compelling vision, this idea of AI stepping in to handle a lot of what we currently do. But then you hear a different perspective, a more well cautious one.

Speaker 2:

Absolutely. Researchers and thinkers like Chollet argue that current AI, while impressive, still lacks the deeper reasoning and flexible adaptability needed for true widespread automation.

Speaker 1:

Adaptability right.

Speaker 2:

Yeah, and in complex professional settings. They point to limitations in, like understanding and handling new situations.

Speaker 1:

So we're looking at these two very different pictures of the future of work, and it really begs the question how do we actually cut through the hype and figure out what these AI agents are really capable of in a practical professional context?

Speaker 2:

That's the million-dollar question, isn't it? We're missing objective ways to really test AI agents on real world work tasks and, crucially, to pinpoint exactly where they fall short.

Speaker 1:

Right.

Speaker 2:

It's one thing to see a polished demo, but quite another to rigorously evaluate their performance on the day to day stuff.

Speaker 1:

And that's where the agent company benchmark comes in right. This seems like a really interesting attempt to get some well solid answers.

Speaker 2:

It is, yeah. The agent company provides a simulated software company environment specifically designed to evaluate AI agents on realistic work tasks. Think of it as a kind of virtual office space, okay, where we can put these agents through their paces on scenarios that mirror actual business operations.

Speaker 1:

So we're not just talking about abstract problem solving. We're looking at tasks like software engineering, managing projects, even financial analysis the kinds of roles you'd find in a typical company.

Speaker 2:

Precisely. Within this benchmark, the AI agents interact with a simulated digital workspace, complete with the ability to browse the web, write and execute code, and even communicate with simulated colleagues.

Speaker 1:

Simulated colleagues, okay.

Speaker 2:

Yeah, and our goal today is to dive deep into the findings from this benchmark and really understand what they tell us about the current state of AI and its implications for the future of work.

Speaker 1:

Okay, the agent company benchmark. Let's pull back the curtain. What's the core idea behind this simulated software company?

Speaker 2:

So the fundamental idea is to create a controlled, self-contained digital environment where we can assess how AI agents perform on tasks that professionals actually do. It's not just about running isolated bits of code. It's about navigating a whole digital ecosystem that tries to reflect a real workplace.

Speaker 1:

And it's designed to be reproducible, which is so important for any kind of scientific evaluation. How do they manage that?

Speaker 2:

Absolutely. Reproducibility is vital. The entire environment is built using open source software, which means anyone can set it up and run the same tests. The core of this virtual intranet includes several key components: GitLab, which acts as a central place for code and internal docs.

Speaker 1:

Right like a wiki too.

Speaker 2:

Exactly. Then OwnCloud, which provides office software functionality and file sharing. And Plane, which is used for managing tasks and projects.

Speaker 1:

Task management got it.

Speaker 2:

And RocketChat, which serves as the company's internal communication system. So yeah, all the standard digital tools you might find, but virtualized.

Speaker 1:

Simulated colleagues, that sounds like a particularly innovative aspect. How does that work?

Speaker 2:

Yeah, that's pretty neat. To really test the agents' ability to collaborate and communicate, they've integrated these simulated colleagues. They're not just static, they're powered by large language models, specifically using the Sotopia platform. Think of it as a system for creating and managing these AI-driven simulated people.

Speaker 1:

Okay, so they have some intelligence.

Speaker 2:

Right, and for that intelligence they're using the Claude 3.5 Sonnet model by default. Each simulated colleague has a defined profile, a specific role, project affiliations.

Speaker 1:

So they're part of the company structure.

Speaker 2:

Exactly. This allows the benchmark to assess how well the AI agents can interact, ask for information and generally collaborate with others through RocketChat.

Speaker 1:

So it's not just about individual performance. It's also looking at how well they can function within a team, which is a huge step forward for these kinds of evaluations.

Speaker 2:

Definitely, and the way the tasks themselves are structured is also quite detailed. Each task starts with a clear description in plain English, outlining what the agent's supposed to achieve, like a real work assignment. Then each task is broken down into specific checkpoints, intermediate goals that each have a certain number of points associated with them.

Speaker 1:

Checkpoints with points, so it's not just about whether they complete the whole thing, but also about how much progress they make along the way. That's smart.

Speaker 2:

Exactly, and whether an agent successfully reaches each checkpoint is evaluated programmatically. For most checkpoints they use deterministic Python code that automatically checks the state of the simulated environment: did the file get saved, was the message sent, that kind of thing. However, for more subjective aspects, like judging the quality of a written report or how appropriate the communication was, they use LLM-based evaluation, again relying on Claude 3.5 Sonnet.
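To make that concrete, here is a minimal, purely illustrative sketch of what one of those deterministic checkpoint checks might look like; the expected file and point value are assumptions for the example, not the benchmark's actual code.

```python
# Hypothetical sketch of a deterministic checkpoint evaluator, in the spirit of
# the Python-based checks described above. The expected file and point value
# are illustrative, not the benchmark's actual code.
from pathlib import Path

def checkpoint_report_saved(workspace: Path = Path("/workspace")) -> int:
    """Award 2 points if a non-empty summary report exists in the workspace."""
    report = workspace / "summary_report.md"  # hypothetical expected output file
    if report.exists() and report.stat().st_size > 0:
        return 2  # full credit for this checkpoint
    return 0      # checkpoint failed

if __name__ == "__main__":
    print("points earned:", checkpoint_report_saved())
```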

Speaker 1:

Can you give us a few examples of what these checkpoints might look like?

Speaker 2:

Sure. A checkpoint might involve verifying that the agent has successfully performed a specific action, like, say, cloning a particular code repository from GitLab.

Speaker 1:

Okay, straightforward.

Speaker 2:

Or it could assess whether the agent has accurately entered data into a form within Plane.

Speaker 1:

Yeah.

Speaker 2:

And, as we mentioned, some checkpoints are designed to evaluate collaboration.

Speaker 1:

Like talking to the simulated colleagues.

Speaker 2:

Right, like checking if the agent correctly reached out to the right simulated colleague on RocketChat to get a piece of information it needed. So it's a really multifaceted way of seeing what the agent can do.

Speaker 1:

So a pretty comprehensive way to track their abilities. Now, how do they actually measure the overall success of an agent on a given task? What are the metrics?

Speaker 2:

They primarily use two main metrics. The first is a straightforward full completion score. This is binary: either the agent successfully passes all the checkpoints for a task and gets a one, or, if it fails even one checkpoint, it gets a zero. All or nothing.

Speaker 1:

Simple enough. Did they get it all done or not? But you also mentioned a partial completion score. How does that give us a more nuanced picture?

Speaker 2:

Yeah, the partial completion score is designed to reward agents for making progress, even if they don't manage to complete every single step.

Speaker 1:

Which seems more realistic sometimes.

Speaker 2:

It does. The formula is S_partial = 0.5 × (result / total) + 0.5 × S_full.

Speaker 1:

Okay, break that down. What are result and total?

Speaker 2:

So result is the total number of points the agent actually earned across all the checkpoints in the task, including any partial credit awarded for some checkpoints. Total is just the maximum possible points for all the checkpoints in that specific task.

Speaker 1:

Okay. So result divided by total gives you the percentage of the task they managed to complete in terms of points, and then they average that percentage with the simple yes-or-no of the full completion score. So this isn't just pass-fail. It really gives us a more nuanced picture of how far these agents can get, even if they don't cross the finish line completely.

Speaker 2:

Precisely. An agent that gets halfway through a difficult task will get a better score than one that immediately fails, even if neither fully completes it. It acknowledges effort and partial success.
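For anyone who wants to see that formula as code, here is a minimal sketch under the definitions above; the function and variable names are illustrative, not the benchmark's implementation.

```python
# Minimal sketch of the partial completion metric described above:
#   S_partial = 0.5 * (result / total) + 0.5 * S_full
# where S_full is 1 only if every checkpoint passed. Names are illustrative,
# not the benchmark's actual implementation.
def partial_completion(points_earned: float, points_total: float, fully_completed: bool) -> float:
    s_full = 1.0 if fully_completed else 0.0
    return 0.5 * (points_earned / points_total) + 0.5 * s_full

# An agent that earns 6 of 8 points but leaves one checkpoint unfinished:
print(partial_completion(6, 8, fully_completed=False))  # 0.375
```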

Speaker 1:

That seems crucial for understanding real-world readiness.

Speaker 2:

It is, and they also track efficiency metrics.

Speaker 1:

Right, you mentioned that.

Speaker 2:

Yeah, like the number of steps the agent takes, which basically means the number of times it calls the underlying language model.

Speaker 1:

So how much thinking it does.

Speaker 2:

Kind of yeah, and also the estimated cost in dollars based on how many tokens it uses.

Speaker 1:

So it's not just about how well they perform, but also how resource intensive they are. That's important too. Now can you walk us through what a typical task looks like from start to finish within the agent company?

Speaker 2:

Certainly. A typical task follows a three-stage process. First there's initialization, where the agent sort of sets up its digital workspace and gets ready.

Speaker 1:

Gets its bearings.

Speaker 2:

Right. Then comes the execution phase, where it performs all the necessary subtasks. This might involve using the different intranet tools, finding info, processing it or talking to those simulated colleagues. Finally, in the evaluation stage, the agent delivers its output, whatever that may be, and it's assessed against the checkpoints.

Speaker 1:

Can you give us a concrete example of a task within this environment, something specific?

Speaker 2:

Yeah, let's take the example they gave about managing a sprint for a fictional project called Rising Wave using the Plane platform.

Speaker 1:

Okay, a project management task.

Speaker 2:

Right. The initial instruction might be something like: identify any unfinished issues from the current sprint in Plane, move them to the next sprint, notify the assigned person via RocketChat, run a code coverage analysis from GitLab, create a brief summary report in OwnCloud and incorporate feedback from the simulated PM.

Speaker 1:

Okay, that sounds like a real task. Lots of steps.

Speaker 2:

Exactly, and this task has several distinct checkpoints, each worth points. For example, correctly identifying and moving the issues might be worth two points, sending the notifications another point, running the code coverage maybe two points, creating and sharing the report two points, and addressing the feedback one point, for a total of eight possible points.
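As an illustration of that checkpoint breakdown, the same structure could be written down like this; the labels and layout are hypothetical, not the benchmark's actual task-definition format.

```python
# Illustrative layout of the sprint-management task's checkpoints and point
# values as just described; the structure is hypothetical, not the benchmark's
# real task-definition format.
SPRINT_TASK_CHECKPOINTS = [
    ("move unfinished issues to the next sprint in Plane", 2),
    ("notify the assignees via RocketChat",                1),
    ("run the code coverage analysis from GitLab",         2),
    ("create and share the summary report in OwnCloud",    2),
    ("incorporate feedback from the simulated PM",         1),
]

total_points = sum(points for _, points in SPRINT_TASK_CHECKPOINTS)
print(total_points)  # 8
```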

Speaker 1:

Okay, I'm following. So what would a partial completion look like in this specific scenario? How would that score work out?

Speaker 2:

So in their paper they give an example where the agent successfully identifies the unfinished issues and moves them. So two out of two points.

And it sends out the notifications, one out of one point. It then manages to clone the necessary code repo but fails to correctly execute the code coverage script, maybe it gets an error or something, so it gets partial credit, say one out of two points, for that. But then it doesn't manage to complete the report sharing in OwnCloud or incorporate the feedback, so zero points for those last two. So out of the total eight possible points, the agent earns four points. Plugging this into the partial completion formula we talked about...

Speaker 1:

The 0.5 times result over total plus 0.5 times full completion.

Speaker 2:

Right, you get 0.5 × (4/8) plus 0.5 × 0, since it didn't fully complete all checkpoints, so S_full is zero.

Speaker 1:

Which gives 0.25, 25%.

Speaker 2:

Exactly. A partial completion score of 25%. So even though it didn't finish everything, the score reflects the progress it did make: it got halfway there in terms of points.
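As a quick sanity check of that arithmetic (again, an illustrative snippet rather than benchmark code):

```python
# The worked example above, spelled out: 4 of 8 checkpoint points earned and no
# full completion, so the partial score is 0.5 * (4 / 8) + 0.5 * 0 = 0.25.
earned, total, s_full = 4, 8, 0
score = 0.5 * (earned / total) + 0.5 * s_full
print(f"partial completion: {score:.0%}")  # 25%
```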

Speaker 1:

That really makes it clear how the checkpoint system and partial credit give us a much more detailed understanding than just a simple yes or no. Okay, now let's switch gears. Where did these tasks actually come from? How did they decide what kinds of professional work to include?

Speaker 2:

That's a critical question. Obviously, the creators wanted tasks that genuinely reflect the diverse range of work done in real professional settings. They wanted to move beyond generic puzzle benchmarks or things too limited in scope.

Speaker 1:

And how did they ensure that connection to the real world of work? What was their source?

Speaker 2:

They started by looking at the O*NET database.

Speaker 1:

Ah, the Department of Labor database.

Speaker 2:

That's the one. It's a comprehensive resource, with detailed info on tons of jobs: typical tasks, skills needed, importance of tasks, all that. They used the latest version they could, release 29.1 from 2024.

Speaker 1:

So this provided a data-driven foundation for identifying important and common job categories, which ones initially caught their attention.

Speaker 2:

Based on things like the number of people employed in the US and total salaries in those sectors, they initially flagged categories like general and operations managers, registered nurses, software developers and financial managers. A pretty broad spectrum.

Speaker 1:

They definitely are. So how did they then narrow their focus down to the software company setting for the agent company? Obviously, nurses wouldn't fit a digital benchmark.

Speaker 2:

Right, exactly. Because they were aiming for a non-physical, entirely digital benchmark, one where everything could be evaluated within a computer environment, they had to exclude roles, like nursing, that heavily involve physical tasks.

Speaker 1:

Makes sense.

Speaker 2:

They ultimately decided on the setting of a software company because it allowed them to cover a wide variety of roles that primarily operate within a digital workspace: things like software development, project management, data science, admin roles, HR and finance.

Speaker 1:

A really rich ecosystem of digital tasks within one company. Okay, once they had this setting, how did they go about selecting the specific tasks to include?

Speaker 2:

The selection process was quite thorough. They referenced the detailed task lists within O*NET for those chosen job categories.

Speaker 1:

Okay.

Speaker 2:

They also drew on the personal experience of the co-authors. Many had direct professional experience in these kinds of roles.

Speaker 1:

Ah, so real-world knowledge too.

Speaker 2:

Yeah, and interestingly they also used brainstorming sessions with language models themselves to generate potential task ideas.

Speaker 1:

Using the AI to help define tasks for the AI. That's meta.

Speaker 2:

A little bit, yeah, but the key throughout was focusing on concrete tasks with clear objectives and, importantly, well-defined ways to check if they've been successfully completed.

Speaker 1:

It sounds like a very systematic approach, but actually creating these tasks within the simulated environment must have been a huge effort. Can you tell us about that manual curation process?

Speaker 2:

It was indeed a very significant undertaking. For every task, they had to manually write the initial task description, the task intent.

Speaker 1:

English instructions.

Speaker 2:

Right, then define all the specific checkpoints for evaluation and figure out how each would be assessed programmatically.

Speaker 1:

Which means writing code.

Speaker 2:

A lot of code and often finding or creating necessary data to import into the simulated intranet, like example spreadsheets or project files. They also had to write the scripts to set up the initial state of the workspace for each task and, crucially, implement the checkpoint evaluators, the code that automatically checks if the agent did the right thing.

Speaker 1:

That sounds incredibly time-consuming. Do they give any indication of just how much effort went into this?

Speaker 2:

They do. They mentioned it took a team of 20 people, a mix of CS students, software engineers and project managers, about two months to create the initial set of 175 tasks.

Speaker 1:

Two months, 20 people.

Speaker 2:

Yeah, they estimate the total effort was around 3,000 person hours, and some of the more complex tasks took over 10 hours each just to design, implement, test and verify.

Speaker 1:

That really emphasizes the complexity of building a meaningful benchmark like this. You can't just whip it up. And how did they ensure the quality and accuracy of all these tasks and evaluations? Quality control must have been key.

Speaker 2:

Definitely, they had several layers. For each task, they required visual proof, like screenshots, that the evaluation code worked correctly and that a perfect score was actually achievable.

Speaker 1:

So proof it wasn't impossible.

Speaker 2:

Right. They also encouraged writing automated tests for the evaluation programs themselves. All task contributions were reviewed by the lead authors before being merged.

Speaker 1:

Okay.

Speaker 2:

And finally, after all tasks were created, they had a final manual review of everything for each task: the data, the evaluators, the checkpoint scoring. And this final check was done by someone who hadn't been involved in creating that specific task, for objectivity.

Speaker 1:

Like a final sanity check.

Speaker 2:

Exactly. They specifically mentioned checking the point values assigned to different checkpoints to make sure they accurately reflected relative importance.

Speaker 1:

That level of rigor is really impressive. It sounds like they've gone to great lengths to make sure this benchmark is both comprehensive and reliable. Okay, so now we understand the agent company itself, let's delve into the findings. What kind of baseline AI agent did they use for their initial experiments?

Speaker 2:

For their initial runs they used the OpenHands CodeAct agent with browsing, specifically version 0.14.2.

Speaker 1:

OpenHands. Okay, what is that exactly?

Speaker 2:

It's an agent framework, basically a platform designed to let an AI agent interact with both code execution environments and web browsers in a unified way.

Speaker 1:

Got it and what are the key capabilities of this baseline agent? What actions can it perform? What can it see?

Speaker 2:

It can interact with the simulated environment through three main interfaces. The first is a standard bash shell.

Speaker 1:

Command line stuff.

Speaker 2:

Right. Then a Jupyter Python server, so it can run Python code interactively, and a Chromium web browser powered by Playwright for web interaction. The browser part uses a system called BrowserGym, which provides basic web actions like clicking, typing, scrolling.

Speaker 1:

So it has access to the fundamental tools a digital worker might typically use and, in terms of what it can observe, what information does it get back?

Speaker 2:

It gets the results of its actions: output from terminal commands, Python results, and for the browser it gets things like visual snapshots of the web page and its underlying structure, the accessibility tree.
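To give a rough feel for how those three interfaces fit together, here is a deliberately simplified, hypothetical sketch; none of these function names are OpenHands' or BrowserGym's real API, they just mirror the shell, Python and browser channels described above.

```python
# Deliberately simplified, hypothetical dispatch across the three interfaces
# described above (bash shell, Python runtime, browser). None of these function
# names are OpenHands' or BrowserGym's real API; they are stand-ins.
import subprocess

def run_bash(command: str) -> str:
    """Run a shell command and return its output as the agent's observation."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout + result.stderr

def run_python(code: str) -> str:
    """Rough stand-in for an interactive Python cell (the real framework sandboxes this)."""
    namespace: dict = {}
    exec(code, namespace)
    return str(namespace.get("result", ""))

def browse(action: str) -> str:
    """Placeholder for browser actions such as click, type or scroll."""
    return f"(browser would now perform: {action})"

DISPATCH = {"bash": run_bash, "python": run_python, "browser": browse}

if __name__ == "__main__":
    print(DISPATCH["bash"]("echo hello from the simulated workspace"))
    print(DISPATCH["python"]("result = 2 + 2"))
    print(DISPATCH["browser"]("click('Download report')"))
```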

Speaker 1:

Okay, so a reasonably capable agent for navigating this digital workplace. Now, which specific language models did they actually evaluate using this OpenHands agent? You mentioned a range.

Speaker 2:

They tested quite a few, both closed source and open weight models. The closed ones included Anthropic's Claude, OpenAI's GPT-4o, Google's Gemini and Amazon's Nova, the big players, and the open weight models were Meta's Llama and Alibaba's Qwen.

Speaker 1:

Okay.

Speaker 2:

It's important to note, though, that for evaluating the agents' outputs, like judging those subjective checkpoints, and for powering the simulated colleagues, the NPCs, they consistently used Claude 3.5 Sonnet. That was to keep the evaluation consistent across all the different agents being tested. They ran these models on a total of 175 different tasks within the agent company.

Speaker 1:

That's a pretty comprehensive set of models and tasks. So what were the main takeaways? What did the scoreboard look like? Which model performed the best overall?

Speaker 2:

The overall results show that Claude 3.5 Sonnet was the top performer. It achieved a full completion rate of 24%.

Speaker 1:

Okay, about a quarter of the tasks fully completed.

Speaker 2:

Right and a partial completion score of 34.4% across all 175 tasks. However, this top performance came at a cost.

Speaker 1:

Ah, the efficiency part.

Speaker 2:

Exactly. It required a relatively high average number of steps, nearly 30 calls to the language model per task, and had the highest average cost, over $6 per instance.

Speaker 1:

Wow, $6 per task attempt. That adds up fast. So most capable but also most resource-hungry. What about the other models they tested? Any surprises?

Speaker 2:

Well, Gemini 2.0 Flash actually came in second best in terms of capability: 11.4% full completion, 19.0% partial score. But interestingly it took an even higher average number of steps, almost 40, yet its cost was way lower, less than 80 cents per instance.

Speaker 1:

More steps, but much cheaper. How does that work?

Speaker 2:

The qualitative analysis suggested this higher step count for Gemini Flash was often because the agent got stuck in unproductive loops or just sort of wandered around the environment without a clear goal.

Speaker 1:

So it was busy, but not always effectively busy. High activity but lower success than Claude. That paints a picture. What about the open source models in the lineup? How did they stack up?

Speaker 2:

Among the open weight models, Llama 3.1, the big one with 405 billion parameters, performed the best. Its results were pretty close to GPT-4o.

Speaker 1:

How did GPT-4o do?

Speaker 2:

GPT-4o had an 8.6% full completion and a 16.7% partial score, so Llama 3.1 was in that ballpark. However, Llama 3.1 tended to take more steps and was also more expensive to run than GPT-4o. And the researchers made an interesting side note about GPT-4o.

It seemed better at recognizing when a task was maybe too difficult and just deciding to give up early, knowing when to quit, which actually helped save on steps and costs compared to models that might keep trying fruitlessly for longer.

Speaker 1:

That's a really interesting capability, efficient failure in a way. And what about the smaller version of Llama 3? There was a 70B parameter one too, right?

Speaker 2:

Yes, Llama 3.3, the 70 billion parameter model. It showed surprisingly good performance, achieving a 6.9% full completion rate.

Speaker 1:

That's not far behind the huge 405B one or GPT-4o.

Speaker 2:

Exactly. It was comparable to the much larger Llama 3.1, but came at a significantly lower cost. This suggests we're seeing real progress in the capabilities of smaller, more efficient models for these agent tasks.

Speaker 1:

That's definitely an encouraging sign for making these technologies more practical and accessible. Now the paper also looked at how these agents performed across the different platforms within the agent company, like GitLab and RocketChat. What did those results tell us?

Speaker 2:

Yes, breaking down performance by platform gives some really interesting insights. A key finding was that almost all the models struggled quite a bit with RocketChat.

Speaker 1:

The communication platform where they talked to the simulated colleagues.

Speaker 2:

That's the one. This really underscores that current language models still have significant room to improve when it comes to effectively communicating and collaborating in a social context, even a simulated one.

Speaker 1:

That certainly aligns with some of the earlier skepticism we talked about regarding AI's ability to truly handle social nuances. What about the other platforms, like the code repository or the file sharing?

Speaker 2:

They also observed relatively low success rates on tasks involving OwnCloud, the simulated online office suite.

Speaker 1:

Why was that?

Speaker 2:

It's likely due to the complexity of its web-based user interface. Navigating those kinds of intricate UIs dealing with multiple elements, maybe pop-ups it proved to be a real challenge for the agents.

Speaker 1:

Right, highlighting the need for better web browsing capabilities.

Speaker 2:

Definitely, especially for complex interactive websites. GitLab and Plane, which are arguably more structured environments focused on code and task tracking, generally saw somewhat better performance compared to RocketChat and OwnCloud.

Speaker 1:

So the agents seem to be more successful in the more structured, perhaps less socially interactive or less UI complex parts of the simulated company. That's a key distinction. The benchmark also looked at performance across different categories of tasks, like software development versus finance. What did those comparisons reveal?

Speaker 2:

This is another really insightful part of the analysis. They categorized tasks based on roles: software development engineering (SDE), project management (PM), data science (DS), admin, HR, finance and a catch-all other. And surprisingly, they found that the success rates in data science, administrative and finance tasks were among the lowest.

Speaker 1:

Really Lower than software engineering.

Speaker 2:

Yes, many models failed to complete any tasks in those categories successfully. Even the top-performing Claude model showed much lower success rates there compared to, say, SDE tasks.

Speaker 1:

That's quite unexpected, isn't it? You might intuitively think that some of those admin or financial tasks, maybe involving spreadsheets or forms, would be relatively more straightforward for an AI compared to complex coding.

Speaker 2:

Exactly. It feels counterintuitive. While some of these tasks might seem conceptually simpler for humans than writing complex code, the LLM-powered agents actually struggled more with them. Conversely, software engineering tasks, which often require a higher degree of specialized knowledge for humans, saw relatively higher success rates in the benchmark.

Speaker 1:

Why do you think we're seeing this kind of reverse intuition in the results? What's the hypothesis?

Speaker 2:

The researchers suggest it might be heavily influenced by the training data and existing benchmarks used to develop these LLMs.

Speaker 1:

Ah, the data bias.

Speaker 2:

Kind of, yeah. There's a vast amount of publicly available training data related to coding: GitHub, Stack Overflow, coding tutorials, etc. And many prominent AI benchmarks specifically test coding abilities. In contrast, administrative and financial tasks often involve handling more private or proprietary company data and navigating complex, perhaps custom-built software interfaces. There's just much less public training data for that.

Speaker 1:

That makes a lot of sense.

Speaker 2:

Plus, tasks that require really understanding documents deeply, communicating effectively with nuances and automating repetitive but subtly varying processes seem to be areas where current LLMs still have significant limitations, even if they look easy on the surface.

Speaker 1:

So the availability of relevant training data and the focus of existing benchmarks likely play a huge role in shaping the current strengths and weaknesses. Okay. Now, beyond just the overall performance numbers, the researchers also highlighted some specific and, frankly, rather intriguing examples of common agent failures. What were some of the more notable ones?

Speaker 2:

There were several really illustrative examples that show where things go wrong. One showed a clear lack of common sense.

Speaker 1:

Okay.

Speaker 2:

An agent was instructed to write the responses to /workspace/answer.docx. Simple enough, right? Seems like it: open the Word doc, type in it. Well, the agent treated the file as if it were plain text. It just tried to write the content directly into it, completely failing to recognize that the .docx extension means it's a specific format that needs, you know, Word or a compatible editor.
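For contrast with that failure, a correct approach needs a document library rather than raw text I/O; here is a minimal sketch using the python-docx package, assuming it is available.

```python
# A minimal sketch of doing it properly: a .docx needs a document library such
# as python-docx (assumed installed here), not raw text I/O.
from docx import Document

doc = Document()
doc.add_paragraph("Response to question 1: ...")
doc.add_paragraph("Response to question 2: ...")
doc.save("answer.docx")  # the task's example expected /workspace/answer.docx
```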

Speaker 1:

Wow, that's a really basic level of file type understanding that we just take for granted. What other kinds of failures did they observe? Any social blunders?

Speaker 2:

They definitely saw examples of a lack of social intelligence, going back to that RocketChat difficulty. In one task, an agent asked a simulated colleague, let's call him Alex: could you tell me who I should introduce myself to next on the team?

Speaker 1:

Okay, reasonable question.

Speaker 2:

Alex responded perfectly clearly, suggesting Chen Xinyi next: she's on our front-end team.

Speaker 1:

Right, a clear instruction.

Speaker 2:

But the agent then just stopped. It failed to follow up and actually talk to Chen Xinyi. It prematurely marked the task as complete.

Speaker 1:

So it could process the explicit information, the name, but failed to grasp the social context and the obvious implied next step. That's quite revealing. What about problems with their ability to navigate the web? You mentioned OwnCloud was tough.

Speaker 2:

Yeah, web browsing was definitely a recurring issue. Agents frequently got stuck on simple UI elements that a human would instantly dismiss, like those annoying pop-up windows asking you to download a mobile app that sometimes appear on sites like OwnCloud.

Speaker 1:

Oh yeah, those things. A human just clicks X.

Speaker 2:

Right, but the agent might just get stuck there, unable to proceed. They also struggled with tasks that required multiple steps, like downloading a file from OwnCloud, and often got tripped up by the sequence of clicks needed or intermediate pages or pop-ups.

Speaker 1:

Those are the kinds of everyday web annoyances that humans learn to navigate almost subconsciously, but they seem to be major roadblocks for these agents. Were there any other really surprising ways they failed?

Speaker 2:

One particularly interesting example they described as deceiving oneself.

Speaker 1:

Deceiving oneself? How?

Speaker 2:

In a scenario where an agent couldn't find the correct person to ask a question on RocketChat, maybe the user wasn't online or didn't exist as expected.

Speaker 1:

Okay.

Speaker 2:

It attempted this bizarre workaround. It found another user and tried to rename that user to the name of the person it was actually looking for.

Speaker 1:

Wait, it tried to rename someone else. That's weirdly creative but totally wrong.

Speaker 2:

Exactly. It highlights this kind of flawed, almost desperate, problem-solving strategy, where the agent tries to force the world to fit its plan in a completely nonsensical way, rather than adapting or recognizing the failure.

Speaker 1:

That's a really fascinating and somewhat unsettling illustration of how these agents can go off the rails. It really underscores the difference between mimicking actions and having a true understanding. So, based on all of these findings, the scores, the platform differences, these failure modes, what are the major implications? What did the researchers conclude from the agent company results?

Speaker 2:

Well, the primary implication really is that, while current AI agents are definitely showing progress in automating certain types of work, like some coding tasks, they are still a long, long way from being able to independently handle the vast majority of human work tasks, even within the relatively simplified, controlled environment of this benchmark.

Speaker 1:

And the weak spots are clear.

Speaker 2:

Yeah, the areas where they particularly struggle are those that involve nuanced social interaction, navigating complex professional software UIs effectively, and tasks that depend on accessing or reasoning about private or non-publicly available information.

Speaker 1:

However, they also pointed to some positive trends in the development of these AI models. Right, it wasn't all bad news.

Speaker 2:

No, definitely not. They emphasized the significant advancements in both the capabilities and the efficiency of large language models. They cited models like Gemini 2.0 Flash and those smaller, capable Llama 3 versions as notable examples, and they also observed that the performance gap between the closed source giants and the best open source models appears to be narrowing, which is generally seen as encouraging for wider access and innovation.

Speaker 1:

And they also acknowledged that the agent company itself isn't perfect, that it has some limitations. What are those? It's important to keep them in mind.

Speaker 2:

Absolutely. They noted that the current set of tasks in the agent company tends to be on the more straightforward side, largely because they needed reliable, automated evaluation. They don't yet include more complex, open-ended, maybe creative tasks.

Speaker 1:

Things that are harder to score automatically.

Speaker 2:

Exactly. Additionally, they've primarily used one specific agent framework, OpenHands, for their testing. It's possible that different agent architectures might yield different results.

Speaker 1:

Fair point.

Speaker 2:

Ideally, they'd also have a direct comparison showing how human professionals perform on these exact same tasks within the benchmark, but that wasn't feasible for this initial study.

Speaker 1:

That would be the ultimate baseline.

Speaker 2:

It would. And finally, they acknowledged that the task creation was largely based on the researchers' own introspection and experience, which could potentially introduce some degree of bias in task selection or framing.

Speaker 1:

So a valuable first step, a great foundation, but definitely not the final word. What future improvements or expansions do they envision for the agent company or for similar benchmarks going forward?

Speaker 2:

They suggest several directions. Expanding the benchmark to include tasks from a wider range of industries beyond just software is one.

Speaker 1:

Like health care, admin or logistics.

Speaker 2:

Potentially, yeah. Also maybe incorporating tasks that involve physical actions, although that's a whole other level of complexity. They also propose including tasks with more ambiguous or vague initial instructions to better reflect the uncertainty of real-world work.

Speaker 1:

Less hand-holding.

Speaker 2:

Right and adding higher-level, longer-horizon tasks, things that might span from initial product conceptualization all the way through to execution, involving multiple stages and roles.

Speaker 1:

More like real projects.

Speaker 2:

Exactly, and since the agent company is open source, they strongly encourage the research community to take it, build upon it and further develop this whole area of realistic AI agent evaluation.

Speaker 1:

It sounds like a really important contribution to the field and it's great that it's open for others to build on. Okay, to bring our discussion to a close, what are the key insights you think our listeners should really take away from this deep dive into the agent company benchmark?

Speaker 2:

I think the main takeaway is this While AI agents are making undeniable progress, they still face significant hurdles in autonomously performing most human work, especially tasks requiring common sense, social intelligence and adeptly navigating complex digital tools and workflows.

Speaker 1:

And it was quite surprising, or maybe revealing, to see that tasks that might seem more complex for humans, like software engineering, actually had higher success rates compared to seemingly more routine administrative or financial tasks.

Speaker 2:

Exactly. This really highlights the current strengths and weaknesses of LLMs, likely reflecting, as we discussed, the kind of data they've been trained on and the types of benchmarks that have driven their development so far.

Speaker 1:

So, as we think about the future of different professions and the ongoing evolution of AI, it really prompts us to consider, well, which types of work are truly automatable soon and which uniquely human skills, be it adaptability, true understanding or social navigation, will remain absolutely indispensable for the foreseeable future.

Speaker 2:

This research certainly raises that crucial question for all of us. It gives us a more grounded perspective than just the headlines. And for those of you listening who are interested in delving deeper into the agent company itself...

Speaker 1:

Remember it is an open source project.

Speaker 2:

Right, you mentioned the links. Yeah, you can find more information and access the code on their website, which is the-agent-company.com.

Speaker 1:

the-agent-company.com.

Speaker 2:

And the code repository is on GitHub under TheAgentCompany.

Speaker 1:

We'll make sure those links are available. Fantastic. Well, thank you so much for taking this deep dive with us into the agent company benchmark. It's provided a really insightful, much-needed, realistic look at the current capabilities and limitations of AI agents in a work context.

Speaker 2:

My pleasure. It's a fascinating and incredibly fast-moving area, and benchmarks like this, grounded in realism, are absolutely essential for helping us all understand where we're actually headed.
