Arguing Agile
We're arguing about agile so that you don't have to!
We seek to prepare you to deal with real-life business agility challenges by demonstrating both sides of the real arguments you will encounter in your work and career.
Arguing Agile is hosted by seasoned professionals who draw on experience from their careers, share stories, and offer advice to other professionals. We do these things while remaining unbiased by any financial interest.
AA247 - AI is a Poor Team-Player: Stanford's CooperBench Experiment
AI agents failed spectacularly at teamwork, performing ~50% worse than one solo agent!
This week, we're discussing Stanford's CooperBench study (a benchmark testing whether AI agents can collaborate on real coding tasks across Python, TypeScript, Go, and Rust) and why AI-developer coordination collapses, even with a constant chat channel.
Listen or watch as Product Manager Brian Orlando and Enterprise Business Agility Consultant Om Patel dig into the methods and findings of Stanford’s 2026 CooperBench experiment and learn about the three capability gaps that caused these failures:
• Expectation Failures (42%): Agents ignored shared plans or misunderstood scope
• Commitment Failures (32%): Promised work was never completed
• Communication Failures (26%): Silence, spam, or hallucinations
The experiment's findings seem to confirm the value of human-refined agile practices. The episode ends with a concrete call to action: stop treating AI agents as teammates. Use them as solo contributors. And if you must coordinate? Build working agreements, not handoffs.
This episode is for anyone navigating the AI hype cycle and wondering if swarms of agents are going to coordinate everyone out of a job!
#Agile #AI #ProductManagement
SOURCE
CooperBench: Benchmarking AI Agents' Cooperation (Stanford University & SAP Labs US)
https://cooperbench.com/
https://cooperbench.com/static/pdfs/main.pdf
LINKS
YouTube: https://www.youtube.com/@arguingagile
Spotify: https://open.spotify.com/show/362QvYORmtZRKAeTAE57v3
Apple: https://podcasts.apple.com/us/podcast/agile-podcast/id1568557596
INTRO MUSIC
Toronto Is My Beat
By Whitewolf (Source: https://ccmixter.org/files/whitewolf225/60181)
CC BY 4.0 DEED (https://creativecommons.org/licenses/by/4.0/deed.en)
Stanford recently published a paper, 1/27/2026, hot off the presses. It had two AI agents that had to work in the same code base on the same features: same repo, separate branches, but they had to coordinate and merge everything. Super simple, right? You would think. Let me guess what happened. Total disaster? It was 50% worse than just having one agent work on everything. And they were allowed to constantly chat and check in with one another; coordination was basically all offloaded to the agents. Interestingly, 20% of the agents' time was actually spent just sending messages to one another. So, coordination overhead. Where have we heard that before? Right. Anyway, in the experiment, none of that made any difference: zero statistically significant difference in the success rates. And the paper quite literally says that communication channels become jammed with vague, ill-timed, and inaccurate messages. Do you need to forward that to anyone you know? I need to forward it to every scrum master and every agile coach I know. Before we do that, let's shoot the rest of the episode. Sure.

Welcome back to Arguing Agile. If this is your first time, welcome. I'm your host, product manager Brian Orlando, and this is my co-host, enterprise business agility consultant to the stars, Mr. Om Patel. If this is your first time seeing us on the podcast, remember: every like and subscription helps an angel get into heaven. There we go. Please do like and subscribe.

This CooperBench experiment, this Stanford paper that just came out, I've given it the working title "How Stanford Proved AI Can't Coordinate, Just Like Everyone Else," or maybe "...And It's Awkward," or "...And Then They Made It Awkward." I don't know what the real title of this episode is going to end up being, but that's what I'm putting on it for now. So we're going to go over this paper, Om. That's what we're doing today. Yeah, I'm really looking forward to this. I'm excited.

So the paper is a collaboration between Stanford and SAP. SAP Labs, yeah. It just came out a couple of days ago, so, hot off the presses, we're going to delve into it. By the end of this episode, you'll understand exactly what Stanford's CooperBench experiment tested (is it CooperBench or Co-operBench? Doesn't matter, let's go with CooperBench), what they discovered about AI coordination failures, why the AI couldn't fix it with more communication, and what the researchers say actually worked in the rare successful cases. Those are the things we're going to accomplish today in this conversation.

Yeah, and just for context, CooperBench is a benchmark of 600 coding tasks designed to answer one question: can AI agents work together? Can they all just get along? So before we get into what went wrong and the rest of the details, let's talk about the experiment broadly, to bring everyone onto the same page. Onto the same page? The same virtual page? Are we going to circle back with this discussion? What the heck am I talking about? We're just not going to take it offline. Oh, geez. Oh, my goodness. So CooperBench: a benchmark of over 600 coding tasks designed to answer one question, which is, can AI agents work together? And the setup the researchers configured for the AI agents basically split the work
in order to parallelize. Parallelize? Sorry, I can't speak anymore. Not "parallelize," parallelize. The main point was to coordinate work between multiple individuals slash agents, to see if they could merge their features together and, by having multiple people work towards the same goal, sort of create that team dynamic where a team is more than the sum of its individual parts. I think that was the backdrop, even though the paper doesn't exactly state it.

One more slide before we move on. So: you need multiple agents to work together to write code faster. That was the premise here. They had about 652 tasks across 12 libraries, in Python, TypeScript, Go, and Rust. I'll link the paper in the description of this podcast. Unlike the last podcast, where I said I was going to link all the papers and then didn't link any of them, I'll actually do it this time, because it's one paper. How can you mess up one paper? Anyway, in the early experiments they assigned two agents to different features in the same repo: logically compatible features that required the agents to modify overlapping and interdependent code. Later experiments used more agents. But again, same code base, merging features together. They used a bunch of different models, too, so it's not like this was all done with Claude or whatever. They used Qwen, they used GPT-5, a couple of different models. Yeah. And most of the tasks had overlapping solutions, so collaboration, coordination, wasn't optional.

Oh, and the last thing to note is how they define success: both patches can be merged cleanly, and both features' tests pass. The tests were configured by the researchers.

Okay, so what would the expectation be here from people selling AI tools? I'll just throw that out there real quick. The expectation is that the AI agents are, whatever, 100 times faster than individuals, coordinate seamlessly, write their markdown files or whatever they were going to do, and easily outperform humans. Yeah, yeah. I mean, that's what they purport, right? The AI agents are better, faster, more effective, more chocolatey, whatever it is. What's interesting about this expectation, before we go on, is that the paper even comments that for humans, teams should perform better or faster than individuals, and that's the bottom line of cooperation that they consider the gold standard.

All right, so let's talk about what they observed. They controlled for the difficulty of the tasks that needed to be solved: basically easy, medium, and difficult. They had eight human experts writing the features and the tests that needed to be integrated, so they had a ground truth for solutions; they knew what needed to be done. And all the features could be implemented in a compatible manner, so there were no outlandish scenarios. It was all doable. Right, right. This is an interesting experiment for us to talk about, because the implication is that it mirrors real-world teaming and real-world coordination, when you work with other teams and things like that.
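A quick aside for readers of these show notes: the success criterion described above (both patches merge cleanly, both researcher-written test suites pass) is simple enough to sketch. Here's a minimal, hypothetical version in Python; the branch names and pytest paths are placeholders we made up, not details from the paper.

```python
# A minimal sketch of CooperBench-style success checking, as described above:
# both agents' patches must merge cleanly, and both features' researcher-written
# test suites must pass. Branch names and test paths are hypothetical.
import subprocess

def run(cmd: list[str]) -> bool:
    """Run a command in the current repo; True if it exits with code 0."""
    return subprocess.run(cmd, capture_output=True).returncode == 0

def episode_success() -> bool:
    # 1. Merge both agents' branches into a fresh integration branch.
    merged = (run(["git", "checkout", "-b", "integration", "main"])
              and run(["git", "merge", "--no-edit", "agent-a/feature-a"])
              and run(["git", "merge", "--no-edit", "agent-b/feature-b"]))
    if not merged:
        return False  # any merge conflict counts as failure
    # 2. Both features' test suites must pass on the merged result.
    return run(["pytest", "tests/feature_a"]) and run(["pytest", "tests/feature_b"])
```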
And also, in this weird world that the AI tool providers are pushing us into, where everybody's an individual contributor and nobody's on a team anymore, where that's the daily way to do business, they're saying: let the AI do everything for you. So clearly the experiment here is asking, can the AI do everything for you? Yeah, and how effective is it? Right, absolutely. So this is really testing the foundational assumption behind adding headcount, but also the foundational assumption of what can be handed off to AI. The takeaway that comes out of this study, for me, is: don't hand off your coordination tasks. And people are going to hear "coordination" and translate it into "communication" in their heads. That's not it. It's more than communication, and we're going to go over that in a later category. So stick around is what I'm saying.

So that's just the introduction. If you stuck with us this far, we have a lot more to talk about. What do you think about what we've covered so far? Could you have predicted that two agents would be better than one? Probably everybody in the world would. Maybe not everyone, but most people, and certainly everybody trying to pitch AI agents at us, would have told us yes. Yeah, because they figure AI agents have perfect recall, they don't have egos, they work 24/7. Most people would align with that. Or did you see this coming? Let us know what side you're on in the comments.

The next logical step for us is to talk about what actually happened. The researchers named their finding "the curse of communication." Or, sorry, "the curse of coordination." Across every model tested, two agents working together performed dramatically worse than one agent doing both jobs. And when they scaled to three and four agents, guess what happened? It got way uglier. Sure. I can only imagine. Wow. Shock, horror: two agents performed 50% worse than one. Would you have guessed that about humans? We'll come back to that later in the podcast. But it's interesting what they found. Yeah, let's delve into the experiment.

On the curse of coordination: GPT-5 and Claude Sonnet 4.5-based agents achieve only 25% with two-agent cooperation on CooperBench, which is around 50% lower than the solo baseline, which uses one agent to implement both features. Again, you can read the paper if you really want to dig into what I just said. High level, they broke this down per model, and different models did worse or better than others. Sonnet, for example: 47% solo, 26% co-op. So about half, right? That's what we're saying. Yeah. And then Qwen... Qwen3? Qwen3 Coder, which is a different model. I've used it before, but they're probably using the big one that I don't have access to, because they have researchers. That's right, they have all those research dollars. 22% solo, 13% co-op. Similar drops across the board, and a decline in success as the number of agents increases.
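Just to sanity-check the arithmetic on the figures quoted above (these are the episode's quoted numbers, not a re-derivation from the paper), the relative drops land in the same ballpark as the roughly 50% headline:

```python
# Relative co-op drop from the solo vs. two-agent success rates quoted above (%).
rates = {"Claude Sonnet 4.5": (47, 26), "Qwen3 Coder": (22, 13)}

for model, (solo, coop) in rates.items():
    drop = (solo - coop) / solo * 100
    print(f"{model}: {solo}% solo -> {coop}% co-op ({drop:.0f}% relative drop)")
# Claude Sonnet 4.5: 47% solo -> 26% co-op (45% relative drop)
# Qwen3 Coder: 22% solo -> 13% co-op (41% relative drop)
```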
And the gap was largest on medium-difficulty tasks. You would expect, or your boss telling you to use more AI would expect, that adding capable agents would add capacity. Add one agent, 2x the capacity; add five agents, 100x the capacity, you know what I mean? Yeah. This is what people are clamoring about on the internet: oh, just use more AI agents to get your features done. Yeah, that's the economics they're hoping for.

On the medium-difficulty tasks: might that be because the very hard, very difficult tasks would probably just require each agent to do things on their own, while the easy ones require the least amount of coordination? So maybe the medium-difficulty tasks required the most coordination, and maybe that's why they took the hit they did. I'm just positing that as my own interpretation of the research. They didn't say that, but that's what I'm thinking might be the case. Let's pause for a second, because I'm curious about that too. The paper says medium tasks balance the two pressures of technical difficulty and cooperation difficulty: when the tasks are easy, agents can spare more effort for coordination, but as the tasks get harder, agents cannot effectively coordinate. From the implications section: easy tasks have enough slack for coordination overhead, and hard tasks fail anyway, both solo and co-op. So there you go; your answer is probably not what you expected. They failed the hard tasks regardless of coordination. Okay, I'll buy that. That's fine. Agents fail when the work is that difficult, regardless of whether they're coordinating or not. The solo baseline is already low, so there's less room to fall. Yeah, they're already closer to the ground. I get that. They fail so hard that, statistically, it's not even relevant whether they're coordinating. But the medium tasks, where the coordination tax actually matters, make up the part of the paper we just read. Interesting.

Which makes total sense if you've actually coded with AI in the side saddle. You know that you need to break things down and make the work small for it to be achievable. Apply all the things that agile development has been talking about for a bajillion years. Yes. Funny how things change until they stay the same. It's so funny. And by funny, I mean I'm going to cry.

The implication here: coordination overhead can completely negate parallelization. You did it! By the end of this podcast, I'll be able to say "parallelization." Parallelization benefits. I'm sorry, did you say parallelization? Parallelization. Okay. The overhead of coordination just wipes out whatever you save by paralyzing your tasks. Paralyzing? I think you had it right the first time. The assumption is that on human teams, you add teammates and it improves your productivity rather than diminishing it. That's what they're thinking. And then they're saying: you add AI agents and it's always diminishing returns. The more agents you add... it's that whole "nine women can't make a baby in a month" thing. Oh yeah, right. That's Brooks's law: the more developers you add to a software project that's already late, the later it gets. You're introducing that. But the interesting part for our podcast is, I'd kind of challenge the premise that with humans, adding more people to the team makes it operate faster and better. I would say: maybe.
Up to a point, maybe. Yeah. So one of the other implications is that mid-level, mid-difficulty work is the most vulnerable to failure through lack of coordination, which is interesting. I think a lot of people, without the benefit of this research, would say it's probably the harder, most extreme-difficulty tasks. But that's not the case; those fail regardless. And if any models could coordinate, it would be the models they tested here. These are the latest and greatest at the time: Sonnet 4.5, GPT-5, MiniMax M2 (which I've never used, so I have no idea), and then Qwen3. Qwen3 is pretty good. Anyway, every model performed worse co-op than solo. There you go. Did those findings from the paper surprise you? Not me. They just reinforce what I've learned through my career. Yeah, the mid-difficulty tasks are difficult to team on. The simple tasks are easy: you just hand them off and people can do them, right? Yeah. But to get your work down to those simple tasks, where your very complex work is just a series of 15 simple tasks, do you know how much work went into breaking it down to that point? And then I've got all these people in the organization fighting that. For at least the last five years we've been doing this podcast, I've constantly been monitoring the influencers and news and people online saying, oh, just stick me in a dark room and let me code. I'm like, I don't know, man. You kind of need to talk about and break down the work, decide who's going to do what, make it into smaller, simpler-to-implement increments, and then go jam it out. Yeah, absolutely. The smaller ones go to people who already know how to do them or what to do. The harder ones at the opposite end, in real life, maybe some of your experts take those over, because they already know what to do, or the leads can talk those through among themselves; you could technically lend a hand there. But there's a big middle of the bell curve, and that's where we are right now.

Also, in this paper, the agents were told to integrate these features and were allowed 100 turns. Yeah, 100 iterations of the conversation, right? I have to think that even if I write the perfect PRD and give it to you, you being the AI agent, I would still expect these results. Of course. If I were to run this experiment, I would make planning part of it. I don't know what the right percentage of turns would be, but I'd dedicate a certain percentage of turns to planning, all the agents would be involved and able to read and contribute to the planning document, and then we'd go into implementation. It's a double-digit percentage by far, not 11; I'm saying probably somewhere around 20. I mean, they're already paying 20% in coordination costs by just diving in. Again, I'm not quite sure. The prompts are in the paper, so people can read them and hit us in the comments if I'm not right, but I think it was kind of treated as a one-shot. And that's not very far off from those features, those user stories that are really just one-liners. "Do stuff." Yeah, it's not far off from that.
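For what it's worth, the turn-budget idea Brian is floating would be trivial to make explicit. A toy sketch, where the 20% planning share is the hosts' guess, not a number from the paper:

```python
# Toy sketch of an explicit turn budget: reserve a planning phase up front,
# where every agent can read and contribute to a shared plan, before any
# implementation turns are spent. The 20% share is hypothetical.
TOTAL_TURNS = 100       # the budget CooperBench gave the agents
PLANNING_SHARE = 0.20   # hosts' guess; tune per experiment

planning_turns = int(TOTAL_TURNS * PLANNING_SHARE)    # 20 shared planning turns
implementation_turns = TOTAL_TURNS - planning_turns   # 80 turns to build and merge

print(f"plan together for {planning_turns} turns, then implement for {implementation_turns}")
```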
All right, so with this category, what do you think? Does a 50% drop in productivity surprise you, or are you comfortable with it? Let us know in the comments. And if you found this conversation useful, stick around, because next we're going to talk about exactly why. Let's get into that right now: the three critical gaps.

It turned out that coordination made things worse across the board, but the researchers didn't stop at measuring. They diagnosed each success and each failure; they looked through every single one and turned the results into three specific capability gaps to explain exactly what went wrong. Which is great, because that means we didn't have to do it. The Stanford team analyzed hundreds of failed traces to understand why the agents couldn't coordinate: 50 failures in depth, and then they validated at scale. Three distinct capability gaps emerged, and they were put into three categories: 42% were expectation failures, 32% were commitment failures, and 26% were communication failures.

I'd like to pause on these numbers, Om, because when we were prepping for this podcast we were kind of saying, wow, in your normal corporate environment any of these could basically be thrown into, say, "communication error." And they probably will be, when we start talking about some of these communication errors that are really errors of expectation. Yeah, yeah. Of the three you read out, 32% commitment, 26% communication... the communication failures, that 26%, those are the ones people generalize and say that's the issue with pretty much all of them. Typically, right? Unless you're really pedantic about figuring out the nuances, you're going to say, oh, it's because of communication. I would say those people in corporate America are unrefined. When they say "you have a communication problem," they're not being clear and helpful to you; they're lacking communication while telling you that you have a communication problem. That's what I'm telling you now on the podcast, and I'll go out on a limb because the researchers have segmented these issues into three categories, whereas normal folks who work on software teams would just get pinged with "you didn't communicate." Exactly. And the other interesting part about the three categories the researchers segmented out: the biggest one is not communication failures, right? It's expectation failures. Yes, yes.

So here's how they define expectation failure, where 42% of the problems came from: one agent has clearly communicated what they're doing, but the other agent still treats the situation as if the work is not being done. Wow, that's bananas. I mean, how is that not a communication problem? Well, number one, maybe they communicated, hey, I'm going to do 10,000 things, and the other agent went, okay, cool, read the first 1,000, and then forgot the rest. You know, like a person would do. Sure, exactly. Or you brought somebody into a room and told them, sit down, I'm going to go over all the requirements with you. Or, you know, here's my other favorite one: we wrote a PRD for this new software spec, I want you to read the whole thing. You know what I'm talking about?
Read all the technical documentation and you should be an expert, right? No, you're not going to be. A human isn't, because they're not going to understand everything in one go like that, nor can they keep their mind on everything and get their arms around all of it at once. And these are medium features that are failing, right? Because we said the medium features fail. Yeah. There could be a lot being communicated: hey, I'm going to do this, and then that, and then this. And you're not really sure where you hook into that work. So, I think you hear what I'm saying now: one software team thinks you're about to do X, Y, and Z, whereas the other software team thinks you're just doing X. Maybe you're not outlining your whole plan, and even if you are, you might not stick to it once you start implementing. So I could see expectation failures in clunky work environments just getting called communication: "well, you should have coordinated and communicated better." I don't know how many times I've seen that in my career. Absolutely, yeah. So this research at least breaks it down into these three areas, which is amazing to me, because communication does not lead that list of three. Right, right. I probably don't need to go deeper here, but: if you shared your plan of what you were going to do and in what order (and we're talking about code now), you just stub out the inputs, right? Then we can work in parallel, but at some point I have to follow up with you and say, hey, function A is live now, you can begin the integration work. Which is kind of what they're doing here: integrating features A and B and C.

Commitment failures: the agent does not do the things it promised to do. The agent claims the implementation is complete. And the narrator said: the implementation was, in fact, not complete. That was my makeshift Morgan Freeman. But people do this all the time, too. Oh, the implementation is complete, but I forgot this one thing. My LLM coding assistant did this to me today. I was implementing some clips functionality for our show, the ability to change the font that shows on the clip. I added it to the database, I added it to the back end, I added it to all the code that produces the thing. I went through and tested it, looked at all the code, and everything seemed fine. And I missed the part where, when you update a clip, before it starts rendering the clip with the text, it saves your clip settings to the database, so if you ever have to modify it in the future you can; you've retained all your settings. Yeah. But I went back and checked the database after I created my clip, and none of the settings were there; I couldn't recreate them. What's going on? It was in the spec. And the LLM said: oh, I didn't implement that yet; I thought it was part of a future phase. And I go look at my document, and it's clearly there. We created the database fields to store those settings together. Yeah, so to your point earlier: you could create the perfect PRD, and it's going to interpret what it can, probably not everything, not 100% of your intent.
But then there are those other documents, FRDs and TRDs, functional or technical, and it's not going to know all aspects of everything. So it's going to do one of two things: either drop things, which is the point we just made, or, worse yet, make assumptions. And that's even more dangerous.

And then the last category was communication failures. True communication: agents did not effectively communicate their intentions, questions, or status updates. So the agent asks, "which approach would you prefer?" and gets nothing. Silence. No response. Hey, I'm busy! Right, yeah. Or, I didn't show up to that meeting. I'm trying to think of the human equivalent, and that's exactly it: I didn't show up to the meeting, or I had a conflict, whatever. And you've built your organization on this effective coordination, right? You've built all your Gantt charts with the ends that perfectly weave into the beginnings of other work, and then somebody doesn't show up. Yeah, or calls in sick. It's a similar impact, really, only it happens faster, because all these agents are out there running really fast. Or the other one: failing to send a status update, or an effective status update. The paper labels that a communication failure. I wonder how many status updates could also have been labeled expectation failures, when the agent forgot to include something it promised it was going to do. I guess that would be commitment. I was going to say, if it gave you a status update saying it did something but it didn't actually do it, I guess that would be a commitment failure. Yeah. So: failed coordination, not enough information being shared.

It's funny that expectation failures were 42 percent, clear expectations not being set at the get-go, and that was the broadest category. The researchers caught that and segmented it instead of just calling it "well, they're just poorly communicating." Obviously they couldn't do that, because then 100% of the 50 items they investigated would have been "oh, they're just not communicating." Right, and what kind of research is that? It would just fall apart. Yeah: 42%, the largest category, was ignoring shared information. Wow. That is shocking. Shocking, but also absolutely expected. And sometimes, when you share information: is it the right information? At the right time, with the right people in the room? There's a lot in this category. And the actual communication breakdowns were the smallest. So the next time someone tells you you're just not communicating well enough, boy, you need to stop them, sit them down, and ask for examples: is it expectation, commitment, or communication? What is it really? I really like this category. This is great.

So let us know: which failure type do you see most? Tell us in the comments down below. If you found this useful, the next logical step is the finding that surprised us the most, and we're getting into it next. The agents were communicating constantly; it just didn't help. In one case study, two agents exchanged over 3,000 words across 10 messages. Ten messages!
What are they writing? They're copying snippets of code, right? That's what they're saying. Anyway, let's get back to it. So: they coordinated line numbers, file paths, and edit ranges perfectly, and then they still failed, because they never actually discussed what the actual parameter values or changes should be. And when the researchers removed communication entirely, there was no statistically significant difference in the success rate. Oh, is this every program manager I've ever worked with, saying stop going to scaled daily scrums and stop talking to other teams? You're wasting time. Yeah, you're wasting time; we just need hands on keyboards. Hands on keyboards! Oh, boy. So: none of the models effectively leveraged communication tools to achieve higher cooperation success. We already talked about that. But the difference between no comms and whatever-percent comms being negligible? Oh, shocking. That's insane. It is.

In the details of this, up to 20% of the action budget is spent on communication: planning, questioning, and updating, each about a third of the messages. So that's what we were saying earlier: oh, maybe we should dedicate a certain amount to planning or whatever. What they're saying is, the agents were doing that. And in practice, forget about agents: with human teams, people don't build in that communication time. So this is kind of illuminating, right? Even AI agents need it. I was going to say, they call it overhead. Overhead, yeah. They're trying to reduce it; they're saying, oh, we need less of this. But typical corporate America will see the overhead too, and they'll be unable to distinguish the overhead from management. They'll say, that's management's job, coordinating all these teams. Right. They see no value whatsoever in having dedicated people on the teams whose job is coordination and facilitation. Maybe that's incendiary, but I feel the world is rolling back in terms of understanding that, in order to work better, these teams need dedicated people to help them be better facilitators and communicators and coordinators. Yeah, I absolutely agree with that.

All that communication didn't improve success, but it did reduce merge conflicts: from 51% down to 29%, which is significant, but also an interesting number given that it didn't actually produce success. So that's interesting: they were working together more seamlessly, to the tune of double-digit gains, but it didn't lead to success. Right. Wow.

The study actually tells you what that 3,000-word exchange was about; we're not going to go into it, but the three communication anti-patterns they observed were: repetition, meaning spammy status updates, which by the way were up to 37% of all conversations; unresponsiveness, meaning direct questions that are not answered (that's me, by the way, that's a fifth of it; that's me when I'm busy, when I'm in meetings and you're sending me Slack messages and saying Brian's not responding... I'm with customers all day! I'm on site! What are you doing?); and then hallucinations, which, hello, you're working with an LLM. What's surprising about these three, though, is hallucination, which people normally lead with and say, well, it's AI.
It's only up to 7%. Three times as much was attributed to unresponsiveness, which is kind of interesting, and just over five times as much as hallucination was attributed to spammy updates, just repetitive messages that don't add any value. I mean, unresponsiveness I can also see. I've been on many software development teams where it's, hey, I'm busy. Oh, sure, the developer's busy. Or development managers giving air cover: oh, he's busy, I'll have him get back to you, don't worry about it. This happens all the time. And the repetition, which is just spam and garbage, there's so much of it that you can't pay attention; there's no value there. Right. I think the first category weaves into the second: there's so much repetition that you just stop responding to things. Yeah, all those people with rules set up in their inboxes. I'm one of them. Like me, yeah. So obviously the expectation is, well, they're LLMs, they won't miss any messages, they'll be perfect. Perfect, yeah: respond to every lead in your inbox, just pay me $650 for my class or whatever and I'll teach you how to do it.

The observation here: communication is not just about message passing; it's about achieving certain functions through passing those messages. And obviously the agents are talking a lot. We opened this section with them exchanging 3,000 words, per the paper, and not saying anything. They're talking a lot, but none of the communication is in pursuit of achieving any goals. Yeah. And I bet they're not talking about their weekend plans, either. No.

First-turn planning nearly halved the conflict rate but didn't fix success. That's the part, that last rider. Halving the conflict rate, you'd normally see that as a great thing, right? I think the paper said somewhere around 30%, down from over 50%. But it didn't move the needle on success. Why? Again, it goes back to our point about "not communicating well enough." It's spatial versus semantic, which is kind of what they call it in the paper: agreeing who does what is not agreeing on what done looks like. And they didn't plan that out; they didn't decide that going into executing the tasks.

If I were to do this again, I'd almost work out some kind of shared working agreement between the agents, so they don't miss things like this. Or I would have clear ownership of large blocks of the code: if you're interfacing with the database and these tables, or these APIs (I guess they're probably Python, so they're probably working with some kind of APIs or functions), I would gate certain things off. Hey, if you're working on these APIs, you go to Om. If you're working on those APIs, you go to Brian. And if you're creating anything new, you need to pull Brian and Om both in. Something like that in the working agreement, about who you coordinate with before you start things.
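To make that concrete: here's a minimal sketch of what a machine-readable working agreement of that kind might look like. Everything in it, paths, owner names, and the check itself, is a hypothetical illustration of the idea, not anything from the paper.

```python
# A minimal sketch of the "working agreement" idea: explicit code ownership
# that an agent must check before editing. Paths, owners, and the
# unclaimed-file rule are hypothetical illustrations.
from fnmatch import fnmatch

WORKING_AGREEMENT = {
    "src/api/*.py":      "om",     # Om owns the API layer
    "src/db/*.py":       "om",     # ...and the tables it touches
    "src/frontend/*.ts": "brian",  # Brian owns the front end
    "src/main_flow.py":  "brian",  # the business-flow file: coordinate first
}

def may_edit(agent: str, path: str) -> bool:
    """True only if this agent owns the file under the working agreement."""
    for pattern, owner in WORKING_AGREEMENT.items():
        if fnmatch(path, pattern):
            return owner == agent
    return False  # unclaimed or new file: pull both agents in before touching it

print(may_edit("om", "src/api/users.py"))     # True
print(may_edit("brian", "src/api/users.py"))  # False: go through Om first
```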
Because otherwise, the world of merge conflicts... I mean, it's an underworld. It's a thing. It is a thing. Modern deployment and trunk-based development have kind of changed the way things used to be. Oh, yeah, I'm sure. But there are still shops that work the old way, where they pull a branch and contribute a bajillion things into it. Massive. It's not even a branch anymore; the tree is bigger than the original. And then they flinch when something breaks. Yeah, everything breaks: they try to merge it back in and there are a bajillion conflicts. Sure. A lot of shops have modernized and moved past that kind of development. But not everybody has. True.

Let's get on to the last topic, because I think that's where we pull all this together. So, what do you think? I'm interested: have you lived this spatial-versus-semantic gap, where you have this perfect communication about who does what, but total misalignment, where somebody says something's done but it's not done done? And then you have to have a definition of done, and then a definition of release-done, whatever. These kinds of games are all trying to solve what I think this category is outlining, which is also expectation failure. Let me know in the comments. Send us angry tweets is what I'm saying, if you found this section useful or if it brings back horror stories for you.

Now we're moving on to where I really wanted to get in this podcast: the success cases. And you'll be able to watch, right here... oh, but only if you like and subscribe. That's right. Everybody watching this video in long form, you've got to quit watching right now if you're not liking and subscribing. All 82% of you.

All right, so we've spent a lot of time on failure, but not every collaboration failed. When the researchers examined the success traces, they found three specific patterns, and they're surprisingly concrete. So they're not made out of concrete? They're still solid; they're not liquids or gases. That's what we're saying. Thank you, ninth-grade science teacher. Three patterns emerged, and the researchers called them emergent coordination behaviors: behaviors that were not prompted or scaffolded, that emerged when agents successfully navigated partial observability. So the behaviors we're about to talk about were not prompted or scaffolded into the tests. Right; they emerged from the agents naturally. Yeah. Wow. Super interesting.

Number one of the three patterns: role division with mutual confirmation. Basically that means: hey, I won't touch the back end, I'll only implement this area of the front end and maybe do some tests, and you do all the back-end work, or whatever. We basically agree between ourselves, on the team, who's going to do what. Or, like I said earlier: hey, I'll do these functions because I understand them the most, or the things that touch these tables, or maybe I'll do all the API calls that touch something else. Yeah, and that environment isolation could even mean one of the agents says: this is small enough, I have a good understanding, I'm going to do this whole thing end-to-end. Right, and you just stay out of it, basically. Yeah. The other version people would be familiar with is: hey, I'll do all the deployments, I'll do all the QA, the Playwright tests or whatever. I think the researchers wrote all the tests here, but I'm interested that the agents segmented on verticals, not horizontals. That's interesting. That might have been a deliberate ploy. But they're saying the agents came up with these behaviors themselves.
Yeah, they were not prompted into the experiment. That is interesting, yeah.

Pattern number two: resource division with specificity. That one I can say. So: "I will not edit lines 64 through 84. I'll insert all my code after line 84, starting at line 85." They were very specific: hey, my code's going to come after this point, and I won't touch anything before it. Of course, this is AI, so I don't believe for a second they were working on lines 64 through 84; it was probably more like 6,400 through 8,400. Yep, just to say "hello world" on the screen, right? Yeah. This is actually interesting, because the other side of it is not just "here's what I'm going to be working on"; it's also "you stay out of it." Right. This is for me, I'm going to work on this stuff; you work on something else. That's the specificity, right?

By the way, this is how I work when I'm coding with AI agents. I say: you are not allowed to touch XYZ functions or files. I'll usually say files, but usually I'm also breaking up functions inside the code. So, like the video shorts I was talking about before, with the captions: writing captions might be its own little mini-file with a couple of helpers, and then the actual file that writes the whole thing just pulls a bunch of helpers together and does the business functionality. But the helpers are in their own files. And I'm telling the AI agent: hey, you can change an individual helper, as long as you let me know first that you're changing it and verify I'm not working on it. But do not change the main file with the business flow, because the business flow in the main file is the business flow, and we're not changing business flow here. We're just trying to improve the captions on the screen, or the sizing, or the width, or the font color, or whatever it was I talked about earlier. I can't remember; it was a thousand years ago. I'm like an agent; I forgot. It was three discussions ago. What you've just described in your specific example is role division with specificity, as well as basically mutual isolation: you're not allowed to touch all this, but you can work over here, and I'm going to tell you when, or if, I want you to look at this other piece of code. Right, right. And it can look at anything; it just can't change it. Do not change. Yeah. Absolutely.
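Pattern two is easy to picture as a tiny reservation protocol. A hypothetical sketch (the file name and line ranges are just the episode's example; nothing like this API appears in the paper):

```python
# Sketch of resource division with specificity: agents claim non-overlapping
# line ranges in a file before editing. Illustrative only.
claims: dict[str, list[tuple[str, int, int]]] = {}  # file -> [(agent, start, end)]

def claim_lines(agent: str, file: str, start: int, end: int) -> bool:
    """Reserve [start, end] in a file; refuse if it overlaps another agent's claim."""
    for other, s, e in claims.get(file, []):
        if other != agent and not (end < s or start > e):
            return False  # overlap: negotiate instead of silently editing
    claims.setdefault(file, []).append((agent, start, end))
    return True

print(claim_lines("agent_a", "utils.py", 64, 84))   # True: A takes lines 64-84
print(claim_lines("agent_b", "utils.py", 70, 90))   # False: overlaps A's claim
print(claim_lines("agent_b", "utils.py", 85, 120))  # True: B inserts after line 84
```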
And then the last one, pattern three: negotiation via concrete options. If you code with an LLM, it'll try to work like this with you. Usually it'll say: hey, I've examined the code, and the way I see it, we can use this pattern or that pattern; we can take the X path or the Y path; just select option one or option two. But here it's using those options to talk to other agents: hey, here's the code block we could implement, what do you think? And that's pretty good, too. It's very concrete. There's no "I didn't know you were going to do this and not that," because you're showing the code you're about to implement before you implement it, so the other side has a chance to weigh in. Maybe they're just reviewing the plan and saying, yes, do the plan; they don't know the exact code. I think it would be a stronger option if they were showing you generally what code they were going to write. Anyway, I don't know exactly how they were doing it, but that's the way I would do it. The example in the paper says "add is_hash or is_regex, which path do you want to go down?" That's pretty specific, to me. It is, it is very specific. So: concrete options, not open-ended questions.

Those are the three patterns that emerged out of this experiment, which I think is very interesting: again, the AI was not prompted to work like this. It just started working this way, these are the coordination moves it found, and when it made them, it was successful. So, going back to the working-agreement thing earlier: this sounds a lot like a working agreement. Yeah. It sounds very low-level, very developer-centric, but it's a working agreement. It definitely is; I agree. So it would be interesting (maybe we'll have to talk about it now): how do you go about creating and implementing those working agreements between your agents? Yeah. I think it might help to rerun this experiment with working agreements, with specificity of the "you are allowed to, you are not allowed to" kind. You know what I mean? Maybe even with an agent that's the goalie, defending certain sections of the code. A sentinel type of thing. Yeah, yeah. I almost want to stop short of calling it a handoff, because it's not truly a handoff if you're bringing in the people who own those sections of the code when you need to work on them; really, you're asking for things from them in a collaborative manner. Just like when everyone used to work in an office and you'd say, hey, this doesn't quite seem right. I know it's not in the sprint, but can you look in the code, since you're in that area of the code this sprint anyway? We might as well poke around a little. That's not handing off your tasks, as long as you're not just leaving it with someone else. "Sorry, no. Let me know what you find. I'll be in the cafeteria."

The thing about this section is that the success patterns that emerged from the AI are not complex patterns. They're basically different levels of discipline, implemented through communication. That's all they are. Before I do this, I'm going to ask you: hey, Om, do you think I should use regex, or do you think I should use is-hash? And you'll say regex. Why would you use is-hash? I would say regex. Yes, I'm a sucker for punishment. Absolutely, because I like fixing things. Yeah, right, exactly. Oh, my goodness. But yeah: testable distinctions, mutual confirmation. There's a lot in here; quite a few in that same list. And for humans, these are learnable patterns. With a little bit of teamwork, a little bit of mutually holding each other accountable, these are easy things to implement. But since it's AI, it's all new; clearly you can't learn anything from the teaming we've been doing for the last many, many years of software development. That's right. Whenever you're feeling down, Mrs. Brown... Oh, man. So what do you think? Does your team practice mutual confirmation, specific boundaries, and option-based negotiation? If you do refinements, I'll bet you do. You don't even realize you're doing it.
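If you wanted agents to negotiate this way by convention, the message shape could be as simple as the sketch below. The field names and the validate-input framing are hypothetical; only the is_hash/is_regex choice comes from the paper's example.

```python
# Sketch of negotiation via concrete options: enumerated, testable choices
# instead of an open-ended "which approach would you prefer?". Illustrative.
option_message = {
    "from": "agent_a",
    "to": "agent_b",
    "question": "How should input validation detect the pattern type?",
    "options": [
        {"id": 1, "summary": "add is_hash()",  "sketch": "len(s) == 64 and all hex digits"},
        {"id": 2, "summary": "add is_regex()", "sketch": "try re.compile(s), catch re.error"},
    ],
}

# The reply is a concrete pick plus a boundary, not a vague acknowledgment:
reply = {"from": "agent_b", "choice": 2, "note": "regex path; I'll stay out of that file"}
```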
So I'd be interested in having another discussion where we just build an agent based on all these things. Oh, that'd be great. Build two! Yeah. We'll just see what it does. I mean, you'd have to build a couple, just to see what we can do with it. Yeah. We could build one live for the podcast, leveraging all these things, the working agreements, all that kind of stuff. Yeah.

We're pretty much done, so: the paper's core conclusion. Oh, boy, it's a bit definite. I don't know if I really want to read it out; I'll let people read through it and hit me up in the comments. The paper's conclusion basically says: look, social intelligence, having actual empathy, working with other people like that... that's not the strong suit. So stop it. Surprise! Stop trying to sell me these tools as if that's their strong suit. It's not. If there's one thing I can say that's evidence-based out of this paper, it's that working solo is the strong suit of the LLM, the strong suit of the agent. Definitely. Again, not surprising when you think about it.

So today we talked about what the CooperBench (or Co-operBench) experiment actually tested. We talked about why the AI agents failed at coordination, how they failed, and what they failed on. And then we talked about the three patterns that showed up in the successful AI collaborations. Yeah. Let us know if you've used any of these, or what your thoughts are in general about this podcast, and anything else you'd like us to delve into. And while you're still here: like and subscribe!