Frontier Systems

Frontier Systems is where the people building the future of AI explain what they're working on. Our flagship show, Office Hours, airs live on YouTube every Friday from 12–1pm PT, with Anjney Midha and Mike Abbott taking real-time questions from students and the broader community alongside a weekly special guest - researchers, operators, investors, policymakers, and founders shaping the frontier. Beyond Office Hours, we publish lectures, interviews, and deep dives on the systems, ideas, and people defining AI's frontier. Recordings are available wherever you get your podcasts.

Prithvi Rajasekaran of Anthropic - Office Hours, Episode 5

May 22, 2026 • Anjney Midha • Season 1 • Episode 5

0:00 | 1:00:59

This week on Office Hours, Anjney sits down with Prithvi Rajasekaran, a member of Anthropic's labs team and author of a recent post on harness design for long-running coding agents. The conversation ranges across the substance of his post — the generator-evaluator loop borrowed loosely from GANs, the third "planner" agent that complicates the metaphor, the Ralph Wiggum loop, and why context resets unlock Sonnet 4.5 while plain compaction suffices for Opus 4.5 — and into the looser texture of the craft itself: the hours spent reading overnight agent transcripts, the "elicitation overhang" that makes a single prompt feel like a generation of capability, the temptation to over-engineer evals when a senior engineer's intuition would do, and the difficulty of designing async agent UX when waiting thirty minutes raises user expectations to a punishing pitch. Asked what to build, Prithvi's advice to students is to pick the Claude Agent SDK, find a hard problem you actually have, and spend your time in the dev loop where intuition is forged, and his predictions for the rest of the year are of a piece with that posture: models will keep surprising you, long-running agents will become a meaningfully larger share of how we interact with Claude, and the "Claude Code moment" will arrive for domains far outside software, his own recent one having come, somewhat sheepishly, from using Claude as a fitness coach.

SPEAKER_00 0:02

All right, welcome everybody to CS 153 Office Hours. We are super lucky to have with us today someone who is, I think it would be an understatement to say, truly at the frontier. Thank you for joining us, Prithvi. Prithvi is at Anthropic. A few weeks ago, when we started teaching the class, we were trying to figure out who would be some practitioners to bring in to talk to you guys as students about what, literally, the state of the art is in real time on systems design. That morning, a blog post came out from the Anthropic team on how to do harness design for long-running agents, which is a really big, meaty, and unsolved problem, and Prithvi was the author on that post. So thanks for joining us, Prithvi. Thanks so much for having me. Glad to be here. Could you just start with what you do at Anthropic and how you got here? We'll start with intros and then we'll go into the questions that are starting to pile up from the kids.

SPEAKER_01 1:00

Yeah, happy to. So I basically started my career in about 2018, and my first job out of college was at a research lab. I was working at MIT at the time. So I like to say that I had the opportunity to do maybe not agent harnesses, but sort of the predecessor to that, in the era before gen AI, before agents. Over the course of my career, when models like ChatGPT and the Claude models came out, it naturally parlayed into what's the infrastructure and everything that's directly around the model. So that's my background. I started at Anthropic last April. I worked on the applied AI team for about a year, and that's a really cool and unique team. We get to work very closely with customers who are at the state of the art, who are building really sophisticated things with agents. That was a great learning opportunity for me to see what's at the frontier and how people are building agentic products. Then more recently, with some of the work that I've done, I've transitioned over to the labs team, which is our internal team at Anthropic that's doing all of our state-of-the-art products, for example Claude Design, which came out recently.

SPEAKER_00 2:03

Sweet. Can you talk a little bit about the scope of the labs team versus the previous team that you were working on?

SPEAKER_01 2:12

Yeah, totally. You can think of the labs team as maybe our in-house incubator; a lot of the cool zero-to-one product ideas are coming out of labs. It's a dedicated space where people on the team have the purview to go try new things and figure out what would resonate with our users and customers. My previous team, applied AI, like I mentioned, had the mandate to go work directly in the field, work closely with folks that are building products around Claude, and help them make those products as good as possible with our expertise.

SPEAKER_00 2:45

Before we jump into the labs team: from an org systems design standpoint, do you think this pattern, applied versus labs, is one that is going to repeat itself across the industry? Or do you think this is somewhat idiosyncratic to Anthropic?

SPEAKER_01 3:04

Yeah, it's a good question. One really interesting bifurcation that I've been noticing is this. On one side of building agentic products, the skill set is understanding agent harnesses, maybe understanding things like design and UI, things that directly influence the customer. Then there's the other side, which is more traditional software engineering: how do you take this thing and scale it? A big problem people have been asking a lot about lately (for us it's Claude Managed Agents, but it comes up for other people in the field too) is how do you put an agent in a box, deploy it, and let it run for a long period of time. So you see this interesting split where there's some stuff that's pure software engineering, there's some stuff that's all the new stuff, and there's a big Venn diagram in the middle.

SPEAKER_00 3:49

I think we should spend some time on definitions, because as you were describing the differences between the two, you dropped in a couple of things that I still don't think have canonical definitions across the industry. So maybe we could start with, as you define it, what is an agentic system versus not? Actually, you know what? Let's just start there.

SPEAKER_01 4:20

Yeah, it's a good question. The agentic system, for me, at the simplest level, is the core of your product. It's the agentic loop: how is the agent behaving? What are the tools? What's the system prompt? What can it and can it not do? All that good stuff is the agentic system. Then I think about the layer around that as the product experience. I'm more of a front-end guy in my full-stack career, so this is a bit of my bias, but: how do you take that and build an application around it? What's the UI? And then there's a layer around that, which is how do we deploy it, how do we serve it to millions of people? That's more infra. That's my rough mental model, with a little bit of my own bias in there, but that's how I think about it.

SPEAKER_00 5:01

Cool. Awesome. That's helpful context setting. Why don't we just start running through the questions? The first one is: your harness for long-running coding agents uses a generator-evaluator loop inspired by GANs. What made you reach for that analogy specifically?

SPEAKER_01 5:18

Yeah, so it's an idea I'd had for quite a while. I had first learned about GANs in my first job, in 2019, and I was just fascinated with the concept. So when the agents came out, I had just been messing around, and it was simple prompting. I had a GAN that was a startup idea generator. This was in 2023 or something. At a certain point, I was like, okay, this maps very cleanly to the software development lifecycle, right? There's a coder, there's QA, all that good stuff. So I tried it out, saw signs of life, got very excited about it, and just kept going deeper.

SPEAKER_00 5:54

I know that GANs was maybe a design metaphor, but is there a limitation to that metaphor you've discovered?

SPEAKER_01 6:07

Yeah, that's a really good question. Obviously it's more of an "inspired by." I'm sure in the nuts and bolts there are many differences, but that's one easy way to think about it. The metaphor breaks down a little bit because there's actually a third agent that I talk about, which is the planner agent. So it's more of a three-agent system as opposed to a two-agent system. But that planner agent in particular has a very clearly scoped role in the whole multi-agent system. Once you get into it, it's really the interplay between the coder and the QA.
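
A minimal sketch of the three-agent shape described here may help: a planner that scopes the work once, then a coder and a QA agent that iterate on each task. This is an illustration under stated assumptions, not the harness from the post. The names `run_harness`, `plan`, `generate`, `evaluate`, and `Review` are hypothetical, and the toy functions stand in for real model calls.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Review:
    passed: bool
    feedback: str


def run_harness(
    goal: str,
    plan: Callable[[str], List[str]],        # hypothetical planner agent
    generate: Callable[[str, str], str],     # hypothetical coder agent
    evaluate: Callable[[str, str], Review],  # hypothetical QA agent
    max_rounds: int = 3,
) -> List[str]:
    """Planner runs once; coder and QA iterate on each planned task."""
    results = []
    for task in plan(goal):
        feedback = ""
        artifact = ""
        for _ in range(max_rounds):
            artifact = generate(task, feedback)
            review = evaluate(task, artifact)
            if review.passed:
                break
            # Feed the QA critique into the next coding attempt.
            feedback = review.feedback
        results.append(artifact)
    return results


# Toy stand-ins so the sketch runs without any model.
def toy_plan(goal: str) -> List[str]:
    return [f"{goal}: step {i}" for i in (1, 2)]


def toy_generate(task: str, feedback: str) -> str:
    return f"{task} [fixed]" if feedback else f"{task} [draft]"


def toy_evaluate(task: str, artifact: str) -> Review:
    ok = artifact.endswith("[fixed]")
    return Review(ok, "" if ok else "draft is incomplete")


print(run_harness("build app", toy_plan, toy_generate, toy_evaluate))
```

In this toy run each task fails QA once and passes on the second attempt. The planner's narrow role shows up as a single call outside the inner loop, which is one way to read the remark that the real interplay is between the coder and the QA.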

SPEAKER_00 6:45

Okay. I have a bunch of fractal questions about the planner agent, but we'll keep moving on and come back to it if we have time. You've said context resets beat compaction for long-running agents. What was the experiment or failure that convinced you of that?

SPEAKER_01 7:01

Yeah, so this is a great question, and I think it's an important point of clarification. What I talked about in the post was a body of work that went on over a few months, and there were a few models in between. I think there was a little bit of a miscommunication there, so what I'm trying to clarify is this: the first experiments we did were specifically with Sonnet 4.5, and there was a predecessor post that my colleague Justin Young had written. That was the one where context resets were the key unlock for the harness, the reason being that that model in particular had a lot of the context anxiety behavior that we talked about. To give a little more color behind the scenes: I had done a very early version of the Ralph Wiggum loop. We put that model in a simple loop and asked it to just keep going. And, it's funny, it would literally find ways to get out of the loop. Reading the transcript, you could almost sense that it didn't want to go on for too long, or was uncomfortable going on for too long. I even ended up doing some other tricks on top of that, where I put in an instruction like, hey, do not exit the loop until you've said this specific phrase. And then it would say, "I cannot in good faith say" the specific phrase, and since it was a Python loop, it would then exit out. So that's how we observed that behavior with that particular model. But what was interesting is that as the models changed, and I talk about this in the post, the characteristics and capabilities of the model changed, and so the harness around it evolved.
So with Opus 4.5, which is where I first discovered and described the pattern that I talked about in the post, compaction worked just fine. We just let it rip. We followed the structure that I laid out, and it was able to work coherently. So from model to model we switched from context resets to just using plain compaction.
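
To make the distinction concrete, here is a rough sketch of the two strategies, assuming that "compaction" means collapsing older turns into a summary while the same session continues, and that a "context reset" means starting a fresh session seeded only with the task and a written handoff. Both definitions are assumptions for illustration; `compact`, `reset`, and `toy_summarize` are hypothetical helpers, not any real API.

```python
from typing import Callable, List

Message = str


def compact(
    history: List[Message],
    summarize: Callable[[List[Message]], str],  # hypothetical summarizer
    keep_recent: int = 2,
) -> List[Message]:
    """Compaction: the session continues, older turns become one summary."""
    cut = len(history) - keep_recent
    if cut <= 0:
        return list(history)  # nothing old enough to summarize
    old, recent = history[:cut], history[cut:]
    return [f"[summary] {summarize(old)}"] + recent


def reset(task: str, handoff_notes: str) -> List[Message]:
    """Context reset: a fresh session sees only the task and a handoff."""
    return [f"[task] {task}", f"[handoff] {handoff_notes}"]


def toy_summarize(msgs: List[Message]) -> str:
    return f"{len(msgs)} earlier turns"


history = [f"turn {i}" for i in range(1, 6)]
print(compact(history, toy_summarize))
print(reset("build the app", "auth done; next: billing page"))
```

The practical difference in this sketch is what survives: compaction keeps recent turns verbatim plus a lossy summary, while a reset discards the conversation entirely and relies on whatever the previous session wrote down.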

SPEAKER_00 8:56

So that was a very dense answer. Not all the students in the class are familiar with machine learning systems at the level of, let's say, implementation details. So let's break down a couple of things that you talked about there. The first is the Ralph Wiggum loop. Could you describe what that is? And why the Ralph?

SPEAKER_01 9:18

For sure. So the Ralph loop became very, very famous on Twitter and elsewhere, especially earlier this year. I believe Geoffrey Huntley was the creator. As far as I understand it, it's the concept of putting the model in a naive loop and telling it to retry a task. With certain models, you can see that just putting it in that loop leads it to continue making progress on your goal for a long period of time.
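
As a rough illustration of the naive loop described here, combined with the exit-phrase trick from the earlier answer, consider the sketch below. It assumes the loop simply re-issues the same prompt and checks the output for a sentinel string. `ralph_loop`, `run_agent`, and `DONE_PHRASE` are hypothetical names, and the toy agent replaces a real model call.

```python
from typing import Callable, Tuple

DONE_PHRASE = "ALL TASKS COMPLETE"  # hypothetical sentinel phrase


def ralph_loop(
    prompt: str,
    run_agent: Callable[[str], str],  # hypothetical single agent run
    max_iters: int = 10,
) -> Tuple[int, bool]:
    """Re-issue the same prompt until the output contains the sentinel.

    Returns (iterations used, whether the sentinel was seen).
    """
    for i in range(1, max_iters + 1):
        output = run_agent(prompt)
        if DONE_PHRASE in output:  # naive substring check
            return i, True
    return max_iters, False


def make_toy_agent(finish_on: int) -> Callable[[str], str]:
    """Toy agent that reports completion on its Nth call."""
    calls = {"n": 0}

    def agent(prompt: str) -> str:
        calls["n"] += 1
        return DONE_PHRASE if calls["n"] >= finish_on else "made some progress"

    return agent


print(ralph_loop("Keep working on the task.", make_toy_agent(3)))
```

A substring check like this also fires if the model merely quotes the phrase while declining to say it, for example "I cannot in good faith say ALL TASKS COMPLETE". That may be the kind of early exit described in the earlier answer, though the actual loop used there is not shown in this conversation.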

SPEAKER_00 9:43

I'm going to try to connect the dots a little bit to past speakers. Last year we had Ben Mann show up, and one of the things Ben talked about (the lecture's up on YouTube for folks who want to see it from last year's class) is this idea of elicitation overhang: the latent space of the model has lots and lots of capabilities that people don't really understand yet. Sometimes when someone says, oh, the model is dumb or it's not good at this, it's really the way you elicit the model's capabilities that matters. The idea of an overhang is that there are a lot of capabilities in the model that people aren't eliciting properly. So do you think the Ralph Wiggum loop is an example of how to elicit models to be more capable without actually needing fundamentally new weights, where the capability was just there? Or does it represent something else, or am I misinterpreting what's notable about the Wiggum loop?

SPEAKER_01 10:44

Yeah, no, I think that's a great example. I definitely agree with this concept of the elicitation overhang. We've seen it time and time again; it's kind of magical. You find the right prompt, the right incantation of words, and all of a sudden it gets you, some people say, a model generation ahead on particular capabilities. So it is this really magical thing that we have to experiment with in this era of technology.

SPEAKER_00 11:10

Cool. The next question is: applied AI or AI engineering as a job title barely existed three years ago. How would you define the actual craft? And what separates someone good at it from someone who just knows the API?

SPEAKER_01 11:30

Oh, this is a really good question. To me, at this particular point in time, the craft is: can you build agents? The words "harnesses" or "scaffolds" are the common terms, right? Which is, what do you put around the base model? What it really comes down to in practice, if we're talking about the core harness itself, is the system prompt and the tools that the agent is given. And then a lot of it is very much looking at the trajectory. That's the really fascinating thing about this era of technology. In software engineering, there are first principles of a sort. But because with AI models we're still working on interpretability, a lot of it is about observing the behavior, having smart ideas, and iterating accordingly. I would say a lot of the craft is the experimentation and the willingness to experiment. And I think the folks that are really skilled at this, first, have a great understanding of the base principles, and I think that can come from studying really good agentic products. I myself, for example, had the opportunity to learn a lot from just observing the Claude Code team, which was a really fortunate thing. I joined at a great time, when I could see the experiments they were doing, what was working, and what wasn't. And then the second layer above that is having a good intuition, which I know sounds kind of fuzzy. But with these models, folks that have really been in them for a while can see something and understand: oh, okay, I've observed this pattern; what can I do to rectify that, that's within my control and doesn't need me to go down to the layer of the weights? That is a skill in and of itself.

SPEAKER_00 13:15

You know, it's interesting. I'm going to write down a follow-up question and we'll come back to it, because my main question for you is: is that new? Is that fundamentally a new skill set, having a good intuition for the shape of these systems and what they can and can't do? Hasn't that just been a part of good software engineering for a long time? Or am I missing something? There are many ways to do software engineering correctly, right? But fundamentally it's having a good intuition for what to delegate to what process in your architecture, and then how to do A/B testing in the wild, essentially. You often come up with a thesis for how your software should work, then you deploy it, and then you experiment. That's literally the process of A/B testing. Is there something fundamentally new about building applied machine learning systems along those three axes?

SPEAKER_01 14:26

Yeah, that's a really good point. My argument would be that at the highest level, there are a lot of similarities for sure. To me, what stands out is this: if I went into a very large, complex code base, it's probably complex to the point that I as a person couldn't understand the whole thing without spending a lot of time and effort in that code base. But there's almost an epistemic nature to it, where I know that if I really wanted to, I could understand it. I could find all the lines of code; I could trace through the entire flow. With AI systems, you don't necessarily have that fundamental underpinning. So I think that's the difference.

SPEAKER_00 15:11

Okay. Let me try to push back a little bit. Maybe one salient difference between working on traditional software and working on AI systems is that in traditional software design, you often start off with a user story, right? You say, this is the user's goal and I want to accomplish it. So you break down the user journey into, like, 10 steps, then build the software to accomplish those steps, and then write unit tests to make sure the software is meeting the requirements, which is a somewhat deterministic way to design the system. In the case of ML, a salient difference is that you don't start with a user journey or user goal; you start with an eval. You say there's an eval required, and that might just be a metric. And then you're not super prescriptive about how to get there. You tell the agent, go figure it out. That's a very different mindset, because what you're actually trying to do is give the agent a prompt, see how it performs, and then essentially reprompt it based on its current performance. Would you agree with roughly that delineation between applied AI and traditional software engineering? Or am I conflating the two things?

SPEAKER_01 16:36

No, I think that makes sense at a high level. Evals are a huge topic, and at the risk of opening a can of worms here: at the highest level, particularly with traditional machine learning systems, you have your classifier, your precision, your recall, and so on and so forth. The thing that gets tough with agentic systems is that because they're so much more capable, measuring the capability is so much fuzzier, right? It's a little bit of a tangent from your question, but I do see a lot of discourse around evals. I think they're very useful. They're definitely part of my toolkit. But it's similar to what I think happened in certain areas of software engineering, where junior engineers think they have to have 20 unit tests to ship something, whereas a senior engineer would ask, okay, what's the mechanism for me to get the signal that I need? I think evals can be the same way. Sometimes people are like, oh, I can't ship this agentic product without an eval. That's not necessarily the case. There are many great things that are shipped without an eval undergirding them. But of course, if you construct the eval in a good way and it gives you the signal that you need for your dev loop, then it's super, super valuable. All that to say, I think it still is a lot fuzzier.
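
For readers less familiar with the classifier metrics mentioned here, this small sketch computes precision and recall from binary labels using the standard definitions: precision = TP / (TP + FP) and recall = TP / (TP + FN). The function name and the sample labels are made up for illustration.

```python
from typing import Sequence, Tuple


def precision_recall(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float]:
    """Precision and recall for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)      # predicted positive, is positive
    fp = sum(1 for t, p in pairs if not t and p)  # predicted positive, is negative
    fn = sum(1 for t, p in pairs if t and not p)  # missed positive
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1]
print(precision_recall(y_true, y_pred))
```

Metrics like these have a crisp definition because each prediction is simply right or wrong. An agent's multi-hour trajectory rarely reduces to a single label, which is one way to read the point about agentic evals being fuzzier.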

SPEAKER_00 17:53

If we had to concretize this a little bit: in the class we've spent a lot of time on the production of models, and we broke down in the early lectures how the pipeline often looks like pre-training, mid-training, post-training, and then, as part of post-training, you have RL and SFT in this sort of continuous deployment loop. On the labs side, part of the mission is to take the capabilities that are the output of that pipeline and expose them in a way that's usable to users. Is there a canonical pipeline you guys have developed yet for what happens once you have a capability and you want to expose it as a product? When you guys approach a new product launch, what would be the equivalent of pre-train, mid-train, post-train, but for the labs team?

SPEAKER_01 18:46

Yeah, I actually don't think that there's a standard process. In fact, I think one of the things that makes Anthropic so great is that we have a lot of bottom-up innovation here. People just try things. To your point, we obviously have a pipeline for producing models, but oftentimes, until someone has an idea and puts in the elbow grease to try it, no one necessarily knows for sure that the model can do this thing if it's in this particular harness. Even with the stuff that I described in my blog post, I shared those results internally, and I think it was a big weight update for a lot of folks in the organization: oh wow, okay, now we have proof that the models can do this thing. This is super cool. How does that influence some of the other initiatives and the other products that we're developing? So it's actually a lot more organic than I think people realize.

SPEAKER_00 19:39

And is that itself maybe an explicit design principle of the labs team: we don't do PRDs over here, we don't do product requirement docs, we don't do roadmaps; it's much more, go forth and pursue whatever experiment you want? Because that in itself is an opinionated systems principle, right?

SPEAKER_01 20:00

Yeah. For one, Anthropic as an organization overall is just a great environment for that, which I think is a great org design choice. I don't know how much of that is a natural extension of our culture, but we're all pretty open with each other, which is amazing. And obviously on the labs team we do have PRDs; we do have these processes. They say that if you're an expert in a rule, you know when to break it. I think everyone knows when to reach for writing a document. I've definitely written a lot of docs to convey my ideas. But we also have this awesome culture of, hey, we all have Claude Code, so if you can, show a demo first and get the idea across. From there we can figure out how to make it a little more process-oriented and structured.

SPEAKER_00 20:48

From a cultural traditions perspective, do you guys have any regular demo days internally, or show and tell? What are the traditions in the group every week where you guys come together and share ideas, insights, and/or prototypes?

SPEAKER_01 21:02

Yeah, yeah. There are definitely a couple of meetings where it's very encouraged to just demo what you're working on. In the labs team in particular, we very much have that culture where you can expect that everyone is tinkering on something on the side. And there are regular forums where it's like, hey, I'm working on this, let me just show you guys and solicit your input. That can even be async over Slack and so on. Overall we just have that culture where tinkering is encouraged, and if you make something cool, share it with the people on the team. It's cool, too, because you can get signal very fast: oh, okay, my teammates think this is cool. Maybe I should spend more time on this.

SPEAKER_00 21:40

How big is the labs team?

SPEAKER_01 21:43

I want to say it's around 20 engineers or so, but I've actually got to check that.

SPEAKER_00 21:49

Got it. But essentially small enough that you guys could all fit in a conference room with, like, 10 pizzas or whatever. Yeah, for sure. Which is the AI version of Jeff Bezos' two-pizza rule. We've scaled it up a little bit, but conceptually it's the same idea. Cool. We should keep going. What does your day-to-day debugging loop look like when an agent fails on a long-running task?

SPEAKER_01 22:16

Yeah, this is a great question. When I was working on, for example, the harness I spoke about in the blog post, we had some infrastructure that I could use to kick off runs overnight. I actually became a little bit of a night owl working on this project. Oftentimes I would shift my development toward the evening, do some stuff, kick it off overnight, wake up, and read the transcripts. The tricky thing about long-running agents is that these transcripts can go on for a couple of hours, so there's a lot of material to read. Over time I developed an intuition about which parts of the transcript I cared about. In the earlier parts of the dev loop, for example, I would just run it, and within a few minutes I could see if it was going off the rails. But as the system became better, it would take more time to actually see what was happening. So it sounds very straightforward, but it's literally just reading the transcript, seeing what happened, and seeing if it's what you expected. Take the QA agent, for example. If I saw that it identified a bug, but the reasoning that is exposed in that trace looks a little bit off or doesn't match my own judgment, that's a signal. So I take notes. I come up with 10 or 12 bullets of, okay, these are all the things that I saw go wrong in this particular trace. And then you can use that to say, okay, how do I tune the harness? How do I tune the prompt accordingly?

SPEAKER_00 23:48

What's a result you got that surprised you, where the model did something better or worse than you expected for non-obvious reasons?

SPEAKER_02 24:00

Hmm.

SPEAKER_01 24:02

That's a good question. For me, I gave this example of a digital audio workstation, a music program, and a lot of the underlying web libraries are relatively involved and in some cases can be a little more esoteric. Even if I'm a great JavaScript or TypeScript programmer, it takes me a while to go learn the particularities. So there were a couple of cases where I saw the model just chomping through this, no problem, getting it done, building the application around it. The other thing that really surprised me is that I was able to design a harness that made Claude good at the Claude Agent SDK, which I talked about. That took some particular tuning, because the Claude Agent SDK is relatively new. It's not a library that's been around for 10 years; it's very much within, you know, the knowledge cutoff. So the fact that we can design something where we give it the scaffolding and the principles and the documentation, and it's able to go make something very, very coherent, even though this isn't fully within its knowledge cutoff, was really surprising and really cool for me.

SPEAKER_00 25:20

Cool. Do you think by the end of the year more developers are going to be writing software on their phone or computer? Are we ripe for new software development form factors? That was a question, I guess. Which do we expect to see more of? Actually, let's start with the first one. Yeah. By the end of the year, do you think more people are going to be writing software on their phones or on laptops?

SPEAKER_01 25:47

I think the proliferation of phone usage for coding is definitely going to go up. This is my bias, but I actually use, for example, Claude Code Remote a good amount at this point, so I'm definitely writing a lot of code on my phone. That being said, I find that I still need to sit down at the computer to really see it, look at the PR, and review it more carefully. So there's this gradient: how much you can just trust the model to do a great job, how much you need to be fully in the loop, and this in-between of how much attention I need to give to this particular task to make sure it's done correctly. As that gradient moves toward the model just handling it, I think we're going to see more and more lines of code coming out of the phone, where it's just, hey, go do the thing, and then it comes back and you monitor it lightly. But for the really mission-critical things, I'm still making sure that I'm looking very carefully at what's happening and what's going in. So the high-value human work is, I think, going to stay on the computer, but a larger percentage of work is definitely going to move to the phone.

SPEAKER_00 26:58

And uh okay, so then there's a next question, which is like, are we ripe for new software development form factors?

SPEAKER_01 27:11

I think the final software development form factor is like almost like there's no form factor at all, right? Like you're not really in the loop, you just kick it off and you come back. Um I do think that there is like an intermediary where it's just like, like I said, it's almost like your attention needs to be more like focused. Uh so like I'm definitely excited for like form factors like that, or just like easier ways to like review agentic outputs. Um but it's a little hard to say like how long like that paradigm is gonna be like the dominant paradigm before it's a paradigm where like, hey, anything that's like relatively grokkable, like the agent is just gonna do by itself.

SPEAKER_00 27:51

Yep. What do you suggest for how we balance between learning and iteration speed slash depth? Sorry, was that debt or depth? Depth.

SPEAKER_01 28:06

Depth. Like depth versus breadth. Um it's a good question. I so uh before I joined Anthropic, I spent a lot of time just like reading and like talking to folks. Um I ran like a couple of events in New York where I got to like meet researchers. So I kind of did like a breadth first search personally. Um, and I think that like when you especially when you start, like it is good to like just get a lay of the land, understand what's going on. Even like I feel like AI is kind of like sports, right? Where you can just it's a very active like news cycle. You can be kind of reading what's the latest. But I do think at a certain point, once like if you're doing a good job, it's kind of easy to saturate breadth. And at that point, I would uh I would start going aggressively towards depth. Like that's kind of how I felt when I joined Anthropic. I found a few things that I got particularly a lot of state on. I mean, AI in and of itself is such a like big problem that you know you can apply AI to like particular domains. Like for me, like front-end coding, long-running uh agents and stuff became sort of focus areas for me. And then I you know gained a lot more state there. Um, advice that I give to a lot of like the newer folks that join that I end up mentoring is sort of like optimize first for recall and then optimize for precision. So you end up somewhere where your time is like high precision, high recall.

SPEAKER_00 29:22

Cool. For students who want to build agents, what's the most overrated technique right now and what's underrated?

SPEAKER_01 29:34

I think overrated is like I definitely see uh I'm not like super online, but I definitely like feel that when I go on Twitter, there's always like a lot of like, oh, like this new shiny thing, and it's so great because XYZ and it's like 30% better and stuff. I like obviously there's like a lot of value in those things. I think some of them do stick. I think there's others that are like sort of like here today, gone tomorrow. Like one example for me is like the arc of like uh certain retrieval techniques uh is really interesting because I feel like Claude Code showed that like memory can be just like accessing the file system and writing files. It's so simple, but it's so powerful. Um and so like I try to pay attention to the things that are like um yeah, just the things that seem like the simplest solution to the problem, basically. Um it can be hard sometimes to tell like what's durable, what's not durable. But um sometimes like I kind of see it, I have kind of a smell myself where it's like, oh, okay, this seems like very hard to understand and a lot of promises are being made, but is it kind of like elegant? Does it have the elegance? It's the same thing as like software libraries. You know, there's like certain libraries, like not to hate on like Redux, but as a front end guy, I always thought like Redux was like super overcomplicated. And like these days it's like Zustand and like other libraries. And it's like, okay, like is the solution clean and elegant? If so, okay, like double tap on that, bring that into my uh into my arsenal or my toolkit. Uh, if it feels like, yeah, it's really hard to get into and it's sort of like a buzzword, that I try to pay like less attention to. And I give it a little bit of time to see if it's actually gonna stick.
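
To make the "memory can be just accessing the file system and writing files" remark concrete, here is a minimal sketch in Python. It is not how Claude Code implements memory; the `remember` and `recall` helpers, the one-file-per-topic layout, and the `.md` extension are all assumptions made up for illustration.

```python
import tempfile
from pathlib import Path


def remember(memory_dir: Path, topic: str, note: str) -> Path:
    """Append a note to a plain-text file named after the topic (hypothetical helper)."""
    memory_dir.mkdir(parents=True, exist_ok=True)
    path = memory_dir / f"{topic}.md"
    with path.open("a", encoding="utf-8") as f:
        f.write(note.rstrip("\n") + "\n")
    return path


def recall(memory_dir: Path, keyword: str) -> list[str]:
    """Return every stored line that mentions the keyword, case-insensitively."""
    hits = []
    for path in sorted(memory_dir.glob("*.md")):
        for line in path.read_text(encoding="utf-8").splitlines():
            if keyword.lower() in line.lower():
                hits.append(f"{path.name}: {line}")
    return hits


# Usage: notes live on disk, so a later session can read them back.
demo_dir = Path(tempfile.mkdtemp()) / "memory"
remember(demo_dir, "build", "Run tests with `make test` before committing.")
remember(demo_dir, "style", "Prefer small, focused pull requests.")
demo_hits = recall(demo_dir, "tests")
print(demo_hits)
```

The point of the sketch is only that there is no index, embedding store, or retrieval service involved: plain files plus a search over them already behave like a small memory.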

SPEAKER_00 31:08

Cool. Your background is mechanical engineering at UCLA, then C3 AI, then NerdWallet, then Anthropic. That's not the typical PhD to frontier lab path. What did each step actually teach you that ended up mattering?

SPEAKER_01 31:27

Yeah, this is a great question. Um, so a little bit of a backstory is like I was doing mechanical engineering at UCLA. Um, and I I wouldn't say I rate very highly for like hand-eye coordination. Um, I was just one of those people that really enjoyed physics. And so I was like, oh yeah, like let me get into mechanical engineering. Uh, I went to college and then uh realized that like, yeah, like the the people who were doing really well there were like the folks that were like hands-on tinkering with stuff, which was never really me. I just like wanted to do math and physics, and then you know, hopefully someone pays me for that. Um, so I I took a like a MATLAB programming class. It's like uh a language that only like I think like mechanical engineers and folks that don't actually program use. Uh, but I kind of fell in love with it. I thought programming was awesome, which is really cool. So my my senior year, I just kind of went all in. I taught myself Python, and that's kind of was like the origin story of my path into tech. Um, so from that experience, I learned that uh you really can just teach yourself things, which is like really really powerful. And I think it's like it's not unique to tech, but tech has a big advantage where you can just go self-study and just learn something.

SPEAKER_00 32:32

Um, how much are you dictating versus typing inputs lately?

SPEAKER_01 32:37

I uh I like dictating a lot for sure. Um so I try to dictate wherever possible. I mean, practically speaking, sometimes I actually need it helps me to type because I can like formalize my thoughts a little bit more. Uh so I I sort of and then also, you know, I'm sitting in the office, so it's not always polite to just be yapping at Claude. Uh so I I kind of switch between both where appropriate, but huge fan of dictation.

SPEAKER_00 33:02

If you had to redesign undergraduate computer science education, how would you do it to best prepare students for this new paradigm? Should it impact introductory courses or higher level courses?

SPEAKER_01 33:15

Oh man. Um, unfortunately, I actually don't have a CS undergrad, so I'm uh flying blind on this one a little bit. Um, but I do think that there are like CS fundamentals that ended up being like really valuable for me. Like, for example, just like understanding like big O notation and like time and space complexity. I think I didn't necessarily have to grind like six months of LeetCode to understand that per se, but like having that intuition ended up being really helpful in my career. So I definitely am not like a oh throw out all fundamentals type of person. Um, that being said, I think like, yeah, that because of the nature of computer science, it's really easy to get hands-on. And so folks that are getting that hands-on experience that are really staying at the state of the art, and like I said, just building some of the intuitions. Um, and also like just to be honest, it's a uniquely challenging time in that things are changing so fast. And so I think that the folks that learn to be like nimble and to kind of just like stay on top of the curve, stay ahead of the curve, don't get attached to like the kind of like meta guidance of like one particular paradigm of you know, like Silicon Valley is pretty valuable. Like I joined the industry in like 2020, and that was like a very particular time, right? It's like, oh, okay, you know, COVID era, uh, everyone wanted to go to like, you know, Google or Meta, and those are still like amazing companies, but it was like very much like that era had its own rules, and just understanding like every era has its own rules and just how to adapt as those rules shift rapidly, I think is probably the most important skill set.

SPEAKER_00 34:49

In the future, which spec domain company would you like to see Anthropic partner with and create value with? And why?

SPEAKER_01 34:59

Uh spec domain company?

SPEAKER_00 35:02

Yeah. Oh, which I guess specialized domain company.

SPEAKER_01 35:06

Ah. Hmm. That's a great question. I uh personally, just being a little bit of like a physics nerd, I think um, you know, a lot of people have said that like when Claude can do like uh like hard science, like that's gonna be like a really, really big deal, or just you know, LLMs in general. So yeah, I would love to see us partner more on like you know, physics, space, like things like that, really like pushing the frontier of like how we understand our physical world. I think when we get to that point, that'll be super exciting.

SPEAKER_00 35:40

Agreed. Um, what kinds of research questions are basically only answerable inside of a frontier lab right now? And what's still better done in academia?

SPEAKER_02 35:52

Hmm.

SPEAKER_01 35:56

That's a good question. I think um I can't really speak for like uh the research research side of the house because I'm not there, I guess from my POV. Like um the cool thing about models is like it's fairly egalitarian, right? Like, you know, every like in most cases, like everyone has access to the same models. I think the you know, maybe the advantage that we have is like we're all in this environment where we're all playing with the models all the time, versus like if I'm thinking about some of my prior roles, I might have been doing this in my off time, for example, uh, because I just you know had more like day-to-day responsibilities. And so it's I think part of the hard dynamic is just like uh because it's so experimentally based, uh, time using the models, time experimenting and stuff, uh is kind of like how you build the intuition and how you build alpha. And so, like just yeah, wherever you can get uh an opportunity to just be spending the most time at the frontier, I think it is super valuable. And being at a lab, like that's kind of like naturally part of the job. I don't think you necessarily need to be at a lab to do that, but it is like a very like unique kind of characteristic that uh I was looking for in my role, and that's been really helpful.

SPEAKER_00 37:03

If you could um so how how do you abstract between training, TPU, and NV GPUs, like Nvidia GPUs?

SPEAKER_01 37:14

Uh honestly, like anything chip related is very not my area. So because you're you're largely consuming tokens, right? Yeah, I'm I'm a token consumer. I you know I build agents, I hit the API just like everyone else.

SPEAKER_00 37:27

So how much did you spend on tokens this month?

SPEAKER_01 37:31

Oh man. Uh I'm actually I'm not sure, but if I looked it up, it'd probably be a pretty large number.

SPEAKER_00 37:37

Uh like order of new things. And from a token consumption perspective, the bulk of course must be Claude, but what are the other tokens that you often use in your day other than Claude tokens?

SPEAKER_01 37:50

I think I'm like pretty much exclusively using Claude tokens, to be honest.

SPEAKER_00 37:54

Wow. Okay. Um, what problem in agentic coding do you think is solved within 12 months? And what do you think is still hard in five years?

SPEAKER_01 38:08

I think like um if I'm thinking about like the coding landscape, right? There's I think the the crux of like you know machine learning and stuff is like Python. So it's not that surprising that like the models are like very, very good at Python. That being said, there's like a long tail of like different systems and like the problems that arise from them. Uh, that I think like models are just gonna kind of like like they're good at the core, where like most people are doing, like, for example, like web dev, I've noticed that the models are like ridiculously good at React, but it also kind of makes sense that that's the case. A lot of people do React. I think they're gonna get better and better at just like being highly competent uh across like you know wider swaths of the coding landscape. Um so that's I think in in 12 months, like I would imagine, yeah. I would imagine just I guess long story short, that they're just generally going to continue to improve. Um in five years, uh, it's hard to say. Five years is a very long time in AI time, and I find myself like consistently surprised by kind of like how fast the models are improving. Um, so I have kind of personally stopped thinking in in like multiple years. I've started just thinking in multiple months. And if I can even do something that is true, like stays true or stays relevant, like six to 12 months from now, I'm very, very happy. Uh, five years is quite a long time. So I think uh again, all bets are off and just stay nimble is kind of how I think about it.

SPEAKER_00 39:27

If you could hand a CS153 student one project to work on this quarter that would teach them the most about applied AI, what would it be?

SPEAKER_01 39:40

Um, sort of a generic answer, but like just build an agent that solves a hard problem for you. Um, I think when you start building agents, like I recommend the Claude Agent SDK, like I love it. Um that's what I always go to. Um, and when you use that, you know, you get the base harness. Uh from there, you might have to make custom tools. You learn a little bit more like the plumbing of making an agent. But once you solve for all that, you really just get to the point of like, hey, at the baseline, the model cannot do this. And going back to like kind of the elicitation overhang that we spoke about, is it possible for me to like encode some specific knowledge to make the model do this reliably? That in and of itself, it sounds so simple, but it's like a deceptively hard problem. And you can spend a lot of time on that dev loop. Uh, and I think you know, whether it ends up working or not, like spending time on that core dev loop really gives you like the intuition and the insights that I was talking about earlier.

SPEAKER_00 40:34

The uh for context, you know, there's about 500 students taking CS153 on campus and a few thousand following along online. And the final project is the one-person frontier lab. And uh they now have three weeks to work on their project. So that's roughly why you're getting questions about the final project because the, you know, the submission's coming up. And yeah, we should try out the agent SDK, um, the Claude Agent SDK. So, what is something that current AI coding agents are good at that's surprising? And one thing they still consistently fail at in a way that reveals the lack of good understanding of large code bases.

SPEAKER_01 41:15

I think um, you know, a lot of people talk about like verification in particular as like a big topic where agents could get better. Like that was obviously a focus for me and like the work that I shared. Um, I kind of mentioned in the post, I had to do a lot of work to make the agent better at verification. And even there, there's still like headroom to be achieved. Um but just like the way that you know, like a model might go about uh verifying something naively. Um, it also gets to the heart of like uh what is even verification and like what's the level of confidence someone needs to see. Because like the canonical case is like Claude will like write something and then it'll say, like, oh, I wrote 65 unit tests and the unit tests all pass. Um and it kind of gets to the core of like, you know, what folks say about like taste and judgment. What's like the minimal amount of tests or like infrastructure that you need to feel confident in the final deliverable, especially when like the agent is writing all the code, so you don't really have too much context on the code yourself. Um, I think that is like a very interesting open problem. Um in terms of things that the model is very, very good at, I think I still continue to be amazed at how well it can like go into a code base and like naively kind of just read through it, search it, and build like a strong understanding. Um, I'm continuing to find that I can, you know, just go into code bases or projects that I have no business being involved in. And like I did a PR recently in Go. I have zero knowledge in Go. Um, but the fact that I was able to like confidently like ship a PR and at least have like the meta understanding of like what a good PR looks like, what I should be prompting for and stuff, I think is just amazing.

SPEAKER_00 42:55

For long-running agents, do you think continual learning on the model weights slash a LoRA on them would be better than context window stuffing slash routine compaction?

SPEAKER_01 43:08

Yeah, I think um I mean continual learning obviously is just like a big topic in general. Uh LoRA in particular, like I'm not 100% sure. Like I just even like for practical reasons, like you know, if that makes sense, I'd imagine to get that to work uh would just be like very complicated unless you like kind of own the whole model stack, and then that's you know, like a whole other topic. Um context window is just elegant because it's so simple to interact with. Um, I do think that you know there is a good point though, which is like things like uh context segregation and stuff like that. Um I'd written a post like late last year about like context engineering, and the way that we described it is like how do you create like the most potent signal with the smallest amount of tokens? Um, and that art in and of itself is very, very tricky. Um so again, it's like one of those things that's deceptively simple, but you could spend a lot of time on. Um, but for me, like my bias is like that continues to be kind of like the right way to think about things, at least for right now.

SPEAKER_00 44:08

How do you abstract between different industries and their needs? It feels like UX should become much more polymorphic.

SPEAKER_02 44:16

Hmm.

SPEAKER_01 44:20

I think like with it almost gets to like the nature of like what is knowledge and like what does it mean to learn, right? Because like there is so much stuff where like structurally, like it can be like surprisingly similar, even though it's like very, very different. Like, I still find myself, it's been a while since I've done mechanical engineering, but I still kind of like almost find myself making like interesting connections between like stuff that I learned in college versus like stuff that I learned in software engineering. Um, there's some things that are just like very, very different just in terms of like the nature of the information. And so then when you think about like uh building AI systems across that, like in some cases, I was able to again like go into industries or or speak to customers that I had very little knowledge, very little state on, um, and still be able to talk to them about something that was useful for them because I understand sort of like the broader picture. Um, but there are cases where it's like, oh, like the domain knowledge is so important that unless I like really, really pair with the subject matter expert, uh, it's gonna be hard to build the AI system uh for for this particular thing. Um, so the unsatisfying answer, I guess, is like it depends.

SPEAKER_00 45:23

Yep. Folks like Sakana have built AI scientists. Do you think current models are enough to achieve an AI scientist? And has the labs team looked into building and testing?

SPEAKER_02 45:38

Hmm.

SPEAKER_01 45:39

I think like to some extent, like, you know, if the role of a scientist is to like kind of discover net new knowledge, call it, um, I think you could argue that like models are like kind of somewhat there. Like, for example, even like you know, Mythos with like the cyber capabilities, like I think you could argue that like, you know, it's developing kind of like new knowledge for the species in its own way. Um, but if you talk about like science and like other more particular domains, there might even just be like relatively straightforward things, like, okay, like how do you give the agent access to the appropriate tools for it to actually like do the science and stuff like that? Um, that are like open questions. So I definitely think it's getting there. Uh, whether it is there concretely, I think just like kind of depends on the domain.

SPEAKER_00 46:32

Has the labs team looked into building an AI scientist for X? Do you think that scaffolding around coding agents will suffice to make progress in AI scientists?

SPEAKER_01 46:42

Um yeah, we haven't necessarily thought about like an AI scientist, but um I do think one of the cool things about coding is like in a way, it generalizes in the way that like you can see how it like reasoned about the system. I guess that's kind of what I was touching on before, like connections. I mean, they're both engineering disciplines, but like mechanical engineering, which is more hardware versus like software. I was talking to my buddy recently, he works on like you know, like airplanes and stuff. And it just comes down to like that systems understanding, understanding the system, how all the things piece together. So you can see kind of like from someone who's good at coding, they have that systems thinking knowledge. And so maybe the model can reason in that fashion as well. Um, then it comes down to like, I guess there's like a few layers. One is like the procedural knowledge, like, does it know in the case of coding the language? Two is like, can it think in systems? And then three is like, can it kind of put all that together to come up with like novel insights? I think um, in a lot of cases, like the software piece kind of validates that it can to some extent think in systems, um, debatably. The procedural knowledge I think depends from domain to domain. And then we're starting to see evidence that hey, it can pull these things together and like come up with like novel findings and stuff.

SPEAKER_00 47:55

What were you doing in the year before Anthropic that made the difference? Were you building things publicly, contributing to OSS, writing?

SPEAKER_01 48:05

I uh the year before Anthropic, I had started like an AI organization in New York. Um, so I was actually doing a lot of like events and just trying to bring people together, like meet researchers and other folks that were building. Um, part of this is like, you know, because uh at that time in New York, like a lot of people thought that New York was basically just like fintech and that's about it. Like nothing kind of interesting technical was happening in New York. And uh maybe you could say I'm originally from California. Uh so I kind of had some act, I was seeing what was happening in the ecosystem out in the Bay Area, and I kind of had this chip on my shoulder of like, oh no, I know that some cool stuff is happening in New York. How do we bring that to the surface? Um, so through that, I got the chance to meet a lot of really cool people. Uh, I got to meet like the SWE-bench team pretty early on, for example, in doing that, um, and other folks that were just building interesting things. And so that kind of gave me the grounding to uh just see like talk to people, talk to practitioners, see what was happening. I think importantly as well, um, because I was at NerdWallet at the time, I'd been out of like the direct line of sight on AI for a little bit. And so it was almost like a forcing function for me to like get back in the game. Um, like I said, like AI is kind of like sports, right? Like if you're kind of out of it for like say even like six months, you might miss like what's the latest happening. So it kind of forced me to really focus and just get back into it, read every day, be part of the conversation. Um, and I think from there, yeah, I honestly did a lot of tinkering in my own time. Um, for example, like one thing I'm really proud of is like I published like the front-end design skill, which is like, you know, been widely used.
Um, that actually came because before joining Anthropic, I got really interested in like all the agentic coding tools, including the vibe coding tools. Um, and I was spending a lot of time with them, seeing that they were actually really good at like these front-end libraries and they were getting particular like effects and stuff like that. And then that kind of led to the insight of oh, okay, maybe this is promptable, which led to that skill. So it's hard to say, but sometimes just like spending that time exploring, talking to people, learning can like lay the foundation for like fruitful ground that you build on later.

SPEAKER_00 50:09

Have you thought about async agents? Example, I'm happy for a task to take longer with smarter models for cheaper.

SPEAKER_01 50:17

Definitely have been thinking about async agents. I think it's actually like a pretty tricky problem, though, because um, if you tell someone that um, hey, this agent is gonna go cook for like 30 minutes and come back to you, um, it makes the expectation like very, very high, right? It's like, oh wow, like, you know, I'm waiting for 30 minutes. What's gonna happen when I come back? Um, and so in a lot of cases, you can do stuff async and that enables you to like, you know, just naive example, okay. Like the model like wrote like 10 times more tests, so we feel more confident that the software is gonna be good the first time you looked at it. But some people might prefer, like, no, but I just want to be in Claude Code and develop, and I'm gonna tell it, like, hey, do this, do that. Don't write a bunch of tests when I'm not there, right? And so uh from a product POV, it's almost like creating the expectation correctly around async agents, I think, is like a huge open problem that's really interesting. Because like you can say that the output is gonna be like 50% better, but if it took like more than 50% longer, um, you know, the user might still be dissatisfied, which is kind of interesting to think about.

SPEAKER_00 51:22

Um can you name some of the people at Anthropic who you found really impressive? And what makes them great to work with?

SPEAKER_01 51:31

Oh, this is a great question. Um I'm very inspired, obviously. Uh, one, I'm very inspired by, of course, like all the Claude Code team um goes without saying folks like Boris and stuff, uh had the chance to like talk to directly and like even just I like I think like everyone else, like I follow these folks on like Twitter and stuff like that. So that's been cool, like directly and indirectly. Um more directly, uh Nate Parrott, who's also on the labs team, uh, he is kind of like the genius behind Claude Design, um, extremely impressive prototyper, uh, and just has like just amazing like product intuition. Um, so I've seen like a few of his prototypes. So it just like when you know Claude Design took off, that was like no surprise to me. He just has like incredible design taste and taste in like uh eliciting models. Um, Jeremy Hadfield, uh, who I worked with closely in applied AI, now he's on the research PM side, very, very intelligent. Like he um he actually uh it was a conversation with him that made me realize that like, oh, like long-running agents are something like worth exploring more, and I think are going to be a big thing. So just he is one of those folks that has just like incredible intuition about what's important, what's coming, super impressive. Uh my colleague Justin Young, who I worked with pretty closely, uh for like some of the earlier long-running agent stuff. Um I was super impressed because like uh basically like I'd mentioned the story earlier of how I put like Sonnet 4.5 in like the loop, and like it just, you know, it just kept exiting the loop. I wasn't able to figure it out. Uh the story is actually I went on PTO right after that. Uh and I kind of just mentioned to a few of my teammates, like, hey, I've been working on this thing, it's pretty cool. Like, if any of you guys want to like check it out while I'm out.
Uh and Justin actually, like then I think the next day just like forked my code base, looked at it, uh, worked on it hard enough to figure out this insight of like the context resets, um, which now, like, I'm telling you, and it's pretty simple, but like at the time, it was like very non-intuitive. Oh, let's just like divide up the work, reset the context window. It's like a very like, like when he told me I came back from PTO and we had that conversation, and I was like, oh, that's so smart. And so it's like that's one of the beautiful things about AI engineering is you kind of get to see the ingenuity and like kind of like the cleverness of people and the solutions that they come up with. I think that's been like super inspiring. And I think like Anthropic as a whole is like really awesome because you know you're around so many like great people and we kind of all get to talk and inspire each other. So like wherever you can be in that sort of environment where it's just easy to bounce ideas around and everyone is kind of like on the same wavelength, super, super valuable.
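
The "divide up the work, reset the context window" idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the harness from the post or Justin's code: `run_with_context_resets`, the `ModelFn` type, the handoff-note format, and the fake model are all hypothetical, and a real harness would call an actual model where the stub is.

```python
from typing import Callable

# Hypothetical stand-in for a model call: it receives the messages in the
# current context window and returns a short result string.
ModelFn = Callable[[list[str]], str]


def run_with_context_resets(subtasks: list[str], model: ModelFn) -> list[str]:
    """Run each subtask in a fresh context, carrying over only a handoff note."""
    handoff = "Nothing done yet."
    results = []
    for task in subtasks:
        # Reset: every subtask starts from a new, small context instead of
        # accumulating the full history of the earlier subtasks.
        context = [f"Progress so far: {handoff}", f"Current task: {task}"]
        result = model(context)
        results.append(result)
        # Only this compact summary crosses into the next context window.
        handoff = (
            f"Completed {len(results)} of {len(subtasks)} subtasks; "
            f"last result: {result}"
        )
    return results


# Usage with a fake model that records how much context it was given.
seen_context_sizes = []


def fake_model(context: list[str]) -> str:
    seen_context_sizes.append(len(context))
    return "done: " + context[-1].removeprefix("Current task: ")


outputs = run_with_context_resets(
    ["write parser", "add tests", "fix lint"], fake_model
)
print(outputs, seen_context_sizes)
```

The design choice the sketch tries to show is that the context handed to the model stays the same size no matter how many subtasks have already run, in contrast with compaction, which keeps one growing conversation and periodically summarizes it.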

SPEAKER_00 54:07

Um, I guess the corollary of this is, can you name some of the people outside of Anthropic who you found really impressive? And what about their thought, you know, blogs or writing or tweeting or systems design you find inspiring in the industry today?

SPEAKER_01 54:25

Hmm, that's a good question. Um, I used to read a lot of uh Eugene Yan's work, uh but he actually joined Anthropic. So somewhat cheating. Um, but I thought some of his work was really um really good. Um there's a gentleman I'm blanking on his name. It started with an H, but he did a lot of like great blogs on like evals. Uh I think it might have been like Hummus or something. Uh I'll find it and link it. Um but I really enjoyed his work because he had like very detailed explainers on like how to do good evals and LLM-as-judge systems. I found that really valuable to read. Um I also found um uh swyx, like the AI engineer. Um he's super useful. Actually, um one of the ways that I kind of got back into kind of like into the game, like I was mentioning, is like he had this newsletter where he would collate like all of the AI news every day and just send you an email. And I would just start reading that daily. And that was like my one source of truth, my one source of information. Um, so I think like those folks I found really valuable.

SPEAKER_00 55:31

Yeah, I love swyx. He's been in this room, we've hung out many times, and I was just on with him recently at Latent Space. He came by the office, he's fantastic. Um, for long-running agent harnesses, people are building domain-specific skills, guidelines, et cetera, to keep the model on track. This seems counterintuitive to the bitter lesson. What's the right balance?

SPEAKER_01 55:55

I think the right balance is like do the thing, but like stay very nimble about it. Um, I think like many people have reported, like when you're building harnesses and like uh, you know, agentic products and stuff, like in three to six months, you might, you know, a new model might come out, you might realize that some of your scaffolding is completely irrelevant. Um, but I'd spoken about this a little bit in the blog too, where like I'd spent a lot of care in terms of like how I decomposed the work to make it more tractable for um for the agent as it was working. And then the new model came out, and then it was just much more coherent on long con uh long horizon tasks. Uh, and so I just had to remove that entirely. And so there's there's something to be said about just like not being super attached to the work you're doing, even when you find these incredible insights. Um, it's really yeah, just about that nimbleness.

SPEAKER_00 56:42

Cool. We'll take two more questions and then we'll wrap. Top three predictions for the rest of this year in this space?

SPEAKER_01 56:53

I think for one, I expect models to just continue getting better. I'm not sure how much better per se, but my mental model is just: don't be surprised. If something comes out and it just wows you in a particular domain, adapt to that and make sure that, again, you're staying nimble, so that you can ride that wave and it doesn't just break some foundational assumptions of yours. Two is that long-running agents, agentic infra, stuff like that, those are big problems that people are talking about now. I definitely think by end of year, kind of to the earlier question, this might be the primary way we interface with agents. If not the primary, at least it's going to become a significantly higher percentage. And the third one is, hmm, people talk about the Claude Code moment. And I'm hopeful that we'll see more Claude Code moments for more people in domains that aren't software engineering. I recently started using Claude as a fitness coach, and I hadn't focused on my fitness in a little while, which, I guess as a side note: focus on your health, take care of yourself. It's really important. But the last time I focused on fitness this intensely was in 2024, and just having Claude to manage this with me, I was like, oh wow, I kind of had that moment again. So I'm hopeful and excited to see more people having that moment in different aspects of their life.

SPEAKER_00 58:26

And, you know, in the state of the art, very few people use multi-agent systems today, right? Almost everything we talked about today was an interaction with Claude as the singular. What is different? What is something surprising that you wish more people knew about how to use a team of agents, or multi-agent systems versus single-agent systems?

SPEAKER_01 58:54

Yeah, I think a really simple one is that what it comes down to is the interplay between different personas. To give an example, I mentioned the planner agent, right? Which is part of my multi-agent setup. And what was cool about it is that its entire role was just to write a spec, like a PRD, for what was to be built. If you naively gave the model a simple task, like "build X app," it would go build it, but maybe it wouldn't have the same level of depth as if you had a whole document. So the fact that we could automate that, create this really nice spec, and then give that to the model, again, it was sort of an elicitation overhang trick, right? We got a lot more performance just from doing that simple thing. So I would encourage people, when they think about multi-agent systems, to be careful not to overcomplicate it. If your system requires, you know, 12 agents in a very specific sequence, that can be a little risky. Does that stay persistent as an advantage over time? But if you can come up with something that's relatively lean, and then you see this interplay of models playing off of each other in a positive feedback loop, I think that can be really valuable.
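
The planner-to-builder handoff described here can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the setup from the blog post: the `Model` type, the function names, the prompt wording, and the `echo_model` stub are all hypothetical, and a real system would swap in an actual model client.

```python
from typing import Callable

# Hypothetical type: any function that maps a prompt string to a reply string.
# A real setup would wrap an actual model client here.
Model = Callable[[str], str]


def write_spec(model: Model, task: str) -> str:
    """Planner persona: expand a terse task into a fuller spec (a PRD)."""
    prompt = (
        "You are a planner. Write a detailed product spec (PRD) "
        f"for the following task:\n{task}"
    )
    return model(prompt)


def build_from_spec(model: Model, task: str, spec: str) -> str:
    """Builder persona: receives the original task plus the planner's spec."""
    prompt = f"Task: {task}\n\nSpec:\n{spec}\n\nImplement what the spec describes."
    return model(prompt)


def run_pipeline(model: Model, task: str) -> str:
    """Two lean steps: plan first, then build against the plan."""
    spec = write_spec(model, task)
    return build_from_spec(model, task, spec)


def echo_model(prompt: str) -> str:
    """Stand-in model for local experimentation (an assumption, not a real API)."""
    return f"[model reply to a {len(prompt)}-character prompt]"


if __name__ == "__main__":
    print(run_pipeline(echo_model, "build X app"))
```

The sketch is kept to two personas on purpose, in line with the advice to stay lean: the only moving part is that the builder sees a richer document than the one-line task it would otherwise have received.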

SPEAKER_00 1:00:02

And what is a product that you think people should check out, other than ones at Anthropic, to see a glimpse of the future of multi-agent systems?

SPEAKER_01 1:00:16

Oh man, this is a good question. I'll be honest, I've been very much in the Anthropic bubble lately, so I haven't gotten a chance to check out products outside. But I know there are a lot of great people building a lot of great stuff out there.

SPEAKER_00 1:00:28

So cool. We'll have you back next year and you can do an assessment of whether any products have gotten through the noise, or through the Anthropic curtain.

unknown 1:00:38

Cool.

SPEAKER_00 1:00:38

Yeah, that sounds great. Awesome. Well, thank you so much, Prithvi, and have a great weekend. Thanks so much, appreciate it.

SPEAKER_02 1:00:44

Okay, cheers, and thanks for the same.