Rate Limited

Discussion about the latest news in the world of AI assisted coding.

Gemini 3.5 Flash, Composer 2.5 is a Beast, Google IO, We Live in Exciting Times | Ep 16

May 21, 2026 • Adam/Eric/Ray

0:00 | 1:01:30

Google I/O brought major updates, but the developer community is furious. Is Gemini 3.5 Flash a massive step backward for production apps? In Episode 16 of Rate Limited, Ray, Eric, and Adam break down the massive backlash surrounding Google’s new pricing model and token consumption, why the Gemini CLI is getting aggressively sunset, and whether Cursor’s new Composer 2.5 is the absolute pinnacle of AI coding workhorses.

We also react to Andrej Karpathy’s massive jump to Anthropic, detail a wild success story of modding Zelda into VR using Codex's Goal Mode, and discuss how to transition your mindset into the era of probabilistic, agentic engineering.

If you are an engineer or builder navigating the frontier of AI, hit that subscribe button for deep, unfiltered technical breakdowns twice a month.

Links:
Ray: https://www.youtube.com/@RayFernando1337
Eric: https://www.youtube.com/@pvncher
Adam: https://www.youtube.com/@GosuCoder

00:00 - Google I/O, OpenAI, & Composer 2.5
00:45 - The Truth About Gemini 3.5 Flash & Token Guzzling
02:08 - Is Gemini Rebranding What "Flash" Means?
03:45 - Prompt Performance & Instruction Following in Antigravity
05:47 - Google's Profit Margins and Capacity Constraints
06:15 - RIP Gemini CLI: Google Consolidating Compute
07:38 - The Hidden Cost of Small Model Reasoning
09:38 - Google's Distillation Strategy & Compute Allocation
11:41 - Knowledge Cutoffs & Eval Degradation in Gemini 3.5
13:40 - Antigravity 2.0 Drama: Shifting Away From Consumers
16:40 - Composer 2.5 First Impressions: A Pure Coding Workhorse
19:15 - Building Parallel QA Agents with Browser Use
21:32 - How Cursor Pulls Off Trillion-Parameter Speeds
23:18 - Is Speed the Moat? Continuous Improvement RL Loops
25:25 - The Messiness of Real-World Context vs. Static Evals
27:13 - Tool Search vs. Preloading Context Bloat
28:14 - Minimal Context Setups vs. Skill Maximalism
32:00 - The Lack of Portability Between AI Harnesses
34:25 - GPT-5.5 Low/Medium vs. Anthropic 4.7 Latency
38:10 - Tool Tinkering vs. Grabbing Off-The-Shelf Tech
39:11 - The Team's Current Model Stack (Vibe Check)
42:11 - Codex App: Mind-Blowing Background Computer Use
44:39 - Goal Mode vs. High-Level Orchestration Loops
48:23 - Success Story: Modding Zelda into VR Using Goal Mode
50:33 - Andrej Karpathy Joins Anthropic: Why Now?
52:30 - The Value of Co-location & The Bay Area Sparkle
55:00 - Impact Over Upside: Technologists Shaping Green Spaces
56:09 - The Mindset Shift: Transitioning to Agentic Engineering
58:30 - Auditing Your Threads & Finding Inefficiencies
01:00:27 - AI Automation as "Mana" for Your Life

SPEAKER_02 0:00

Ladies and gentlemen, boys and girls, sit down, because this is going to be a wild ride with your hosts Ray Fernando, Eric Provencher, and Adam Larson of the Rate Limited podcast. Make sure you turn the dial all the way up for this entire episode, because we'll be covering many topics across what's happened this week with Google I/O, Composer 2.5, OpenAI, and a whole bunch more. I can't believe all the stuff that's been happening this week, guys. What do you guys think? This has been an incredible last couple of weeks: Karpathy is now at Anthropic, Google I/O just happened, and I'm using Composer 2.5 a lot. I'm curious to hear what you guys are thinking and what's going on. Eric, what's up?

SPEAKER_01 0:45

Well, the news never stops. It doesn't sleep. In the world of AI, every day is a new day. You gotta sit up and pay attention. This week with Google I/O there was a lot of news coming out, and a lot of it is probably vapor for now. We saw this new Spark OpenClaw competitor, but it's only for testers, and when it does come out, it's US-only. So we'll see where that ends up. But the thing you can actually use from this week's I/O announcements is the new Gemini 3.5 Flash. And man, people are not pleased with it, because it is expensive. From what I've been seeing running benchmarks with it, it's actually more expensive than GPT-5.5 in many cases, because it just burns tokens like nothing else. It's a token guzzler. So I think we're starting to see a change with models where they have a sticker price, which is already 3x more than the old Flash model, which is a huge shame. And then there's the question: okay, you have a task you want to complete, so how many tokens does it burn to get through that task? And 3.5 will burn tokens like nothing else. So if you've seen benchmarks on that, just keep in mind that maybe GPT-5.5 will get it done faster and cheaper, even though the price is higher. Have any of you tried the model? Have you played with it at all, Adam?

SPEAKER_00 2:09

Oh, yeah. Actually, I'm a huge fan of the Gemini Flash series, like 2.0 and 2.5. Those were great. Even 3 to a certain degree, their preview model. When I saw the 3.5 one, I was actually pretty excited about it, because these models are typically really good at agentic sub-agent workflows, things where you need really fast turnaround time and don't need a ton of reasoning. I feel like they have rebranded Flash to mean something different now. They burn a lot more tokens. They are fast, I will give it that. It does have fast TPS, but it's burning a lot more tokens, and it is substantially more expensive. If you think about it, it's, what is it, 66% of the price of a top-tier model. It is insane how costly they are. Even their caching price is actually pretty high: it's a dollar per million tokens per hour, so it's got an hour limit on it too, which is crazy. I'm very disappointed, honestly, that they've really changed the meaning of Flash. Now, the Flash-Lite model, did you guys try that one? The Gemini 3.1 Flash-Lite?

SPEAKER_01 3:19

3.1 Flash I've tried, and I was pretty impressed with it. But it's just not quite what I need. It's just under the bar, and you need really niche workloads for it to be worth its time. That's the thing with these tiny models.

SPEAKER_00 3:33

I'm hoping there is a Gemini 3.5 Flash-Lite model in the works that is substantially cheaper but better than 3.1 Lite. I don't know. Ray, what are you thinking?

SPEAKER_02 3:45

I was just using it inside of Antigravity, and I don't know why, but it performs pretty well for me in there. But I had a really well-specced-out plan. By breaking the plans down into much, much smaller pieces, even though it has a million-token context window, I'm able to pull quite a bit out of it. It doesn't have that playful intelligence that 5.5 just has out of the box. That's where 5.5 comes in: I can transition without thinking too much. Whenever I have to offload thinking about work that needs to be done, I'm thinking about this more in these agentic engineering phases, and I feel like I can see the hints of it there. It just needs a smarter model to drive this model forward. How do you say this? Okay, so if I have a very succinct plan that has a lot of details that are very well structured, Flash actually follows it extremely well. One of my prompts is like 300 lines, but it's literally to build out a web page that has a home page, a blog page, and a bunch of design notes in it. And even Gemini 3.1 Flash does extremely well with this. I found that Gemini 3 Pro, or the Pro versions in general, didn't really do too well with it. But the Flash versions always perform really well with instruction following. I did have to have Opus design this prompt, though, so that it can work with a smaller model like this. And those work really, really well. When I just vibe, speaking into the microphone and explaining how I want things, that's where the frustration really comes out. And Antigravity seems to, I don't know if it aids in it.
I don't know what they're doing under the hood. I do select the model, and I don't think it's changing the model underneath, or maybe it is, but it seems to guide the model a little bit more, or shepherd it along: oh, you should be doing this thing at this phase now. So I don't know.

SPEAKER_00 5:48

So here's a random question for both of you. Do you think they're making money on this model now? At the price they're charging, are they gonna be able to do that?

SPEAKER_01 5:54

I think they're trying. The thing with Google that I've understood is that they value margins really intensely when they're charging money for something. So if they're charging the price they're charging, it's because they want to have margins on it. I think, just structurally, they're not in a place to try and undercut people for AI. And I think they're very capacity constrained as well. Each of their products, when they release them, gets a very tiny allocation. I think the Antigravity side gets a slice, and that's why they're trying to consolidate things. Part of this news as well is that they're killing off Gemini CLI, which had been around for a while and was open source. It supported ACP, which meant you could use Gemini models in other IDEs and harnesses, and that was just fully removed. Antigravity 2.0 doesn't have it. TBD on whether they'll bring it back, because they were actually one of the stewards of the standard, but it seems like that's no longer the case. So that's unfortunate news. But yeah, I think they are trying to make a profit off of this, and I think that's probably why 3.5 is so expensive. So we'll see. But yeah, I'm a little disappointed with that.

SPEAKER_02 7:04

I think, yeah, if I'm looking at their website, they have it all organized, project-management style: standard, batch, flex, priority, and whether they train on the data or not. In terms of profitability, they probably worked it out from the bottom up. And they just said, okay, if we're gonna give you this much more, you're just gonna pay that much more. Because check it out: if you get 3.5 Flash on priority, the input price is $2.70 per million tokens, and then $16.20 per million output tokens.

SPEAKER_01 7:36

Yo, that's the price of Sonnet, honestly. That's where you're getting at. Yeah, the input's a little cheaper, but you're paying Sonnet prices for a model that is probably much smaller than Sonnet, to be honest. I don't know, it's weird. I think Google's very efficient with how they make their models, but man, this is not a whopper model like you're gonna get with Opus or 5.5. This is a small model that is competing at prices that are not small. And one of the things that is a bit unfortunate with this is that reasoning models burn tokens to get intelligent answers, right? And smaller models need to burn more tokens to get to the same intelligence. The problem is, if you're burning more tokens, you're burning more of your context window on reasoning. And the deeper into the context window you get, the dumber the model. So if you're burning more tokens to get to that place, you're just degrading intelligence further and further as you get to an answer, and you can do less in a context window, which is a problem. Google tries to pad it with the 1 million context window, but small models are worse at large context than bigger models. So you have this problem that arises. That's why, with GPT-5.5, the breakthrough is that it's so efficient at reasoning. You can take GPT-5.5 low, and it will reason for a fraction of the time of 3.5 Flash and get to the same outcome, even with max reasoning on 3.5 Flash. So it ends up being much cheaper as a result, faster and cheaper, even though you're getting lower tokens per second.
These things are getting hard to judge and compare because, yeah, on the sticker price it looks like you're getting a better value, but the amount of tokens you burn matters, and you're burning a lot with this. It's a real token guzzler.
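
To make the sticker-price versus tokens-burned point concrete, here is a minimal Python sketch. The $2.70 and $16.20 figures are the priority-tier prices quoted in the conversation for 3.5 Flash; the second price sheet and every token count are made-up illustrative values, not measurements, and the sketch assumes reasoning tokens are billed as output tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per million tokens."""
    input_per_mtok: float
    output_per_mtok: float  # assumption: reasoning tokens are billed as output


def task_cost(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one task, given how many tokens it actually consumed."""
    total = (input_tokens * pricing.input_per_mtok
             + output_tokens * pricing.output_per_mtok)
    return total / 1_000_000


# Priority-tier prices quoted in the episode for 3.5 Flash.
flash_priority = Pricing(input_per_mtok=2.70, output_per_mtok=16.20)

# Hypothetical pricier model; these prices are invented for illustration.
pricier_model = Pricing(input_per_mtok=5.00, output_per_mtok=30.00)

# Hypothetical usage: same prompt, but the cheaper model reasons 5x longer.
flash_cost = task_cost(flash_priority, input_tokens=50_000, output_tokens=40_000)
pricier_cost = task_cost(pricier_model, input_tokens=50_000, output_tokens=8_000)

print(f"3.5 Flash (priority): ${flash_cost:.3f} per task")
print(f"Pricier model:        ${pricier_cost:.3f} per task")
```

Under these invented token counts, the model with the higher sticker price finishes the task for less, which is the comparison the hosts are describing. Real numbers would have to come from your own benchmark runs.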

SPEAKER_02 9:27

Do you think it's a slingshot play, in terms of where they need to just collect data and get people to use it across all these different use cases, and then eventually that'll get RL'd in?

SPEAKER_01 9:39

I think they're trying. It seems like Google's thing is that they build a behemoth model and then they distill it down, and they're getting really good at this distilling game. I think Google's probably the best at distilling, but I think there's diminishing returns on that, and at some point you need, I don't know. We'll see where it goes. Obviously they need usage, but they're also selling a lot of usage to Anthropic with their big TPU play, giving all that compute away. So I think internally they're underallocated on compute, and I think that's part of why things are so expensive: they have to load-balance demand with availability, and charging more is how you get to that equilibrium, and hopefully it makes sense for them. But I'm talking to people building on Flash, and they were building good businesses on Gemini 3 Flash, and the business case stops making sense, and the model performs worse in certain benchmarks for them. And the thing that is really unfortunate with this move is that 3.5 Flash is a GA model, but the previous Gemini 3 Flash was a preview model, which means that it gets sunset way earlier. So they're sunsetting all the cheaper ones and putting the new price up, and I just think people are gonna move away from Gemini models, because it just doesn't make sense at this point.

SPEAKER_00 10:54

I mean, just to add on to that, there's a couple of points here. You nailed it, Eric. There are use cases that I've built in my company, and that I know other companies are building on, where they just need a very inexpensive workhorse of a model to do some lightweight work, and there is kind of a gap in the market right now. I know there's Flash-Lite, but again, that performs worse in my experience than the Flash models of prior generations. So there's a big gap. The other thing is this model, the 3.5 Flash version, seems to have a lot less information in it. I'm not really sure how better to put that. But some of the jobs that we typically run require some knowledge outside of code. It's good at coding, but it's not a general-purpose workhorse anymore. And I actually was checking, and I think its knowledge base hasn't been updated. I still think it's dated to January 2025, which is kind of crazy, honestly. It's brutal.

SPEAKER_01 11:57

Yeah, honestly, the Flash models always had less information in them than the Pro ones, and I think that's what you're getting with the Pro model, the bigger model: bigger retention and world knowledge. But I think that was always the benefit of the Gemini models, that you get Google's world understanding in the model. With a Flash model, you're not getting that as much, and you rely on searching.

SPEAKER_00 12:17

And yeah, I see a variation in some of the evals that we run between 2.5 Flash and 3.5 Flash. I think they've actually degraded it overall, or they fine-tuned it more for coding versus general-purpose use cases.

SPEAKER_01 12:34

I think they have, yeah. I think that's definitely the case. They wanted to prioritize agentic use. If you look at what Google's announcing, they're probably putting agents everywhere, and the big thing with Spark is their general-purpose agent, where it can access your email, it can access all your Google services. I think that's a really promising thing, and I've been looking for that from Google, but they needed a model that is cost-effective for them to run and serve in that kind of harness. I think that's what they're trying to build there. As a general agent, it's probably fine. It's just unfortunate that it's so expensive to use as a consumer, or in enterprise paying API prices for this. It's just an unfortunate turn of events. But yeah, as part of this, Antigravity 2 is around, and that's definitely their new thing. Like we were mentioning, compute allocation is a big thing for Google, and I think they're trying to consolidate where they allocate compute for people subscribing. So TBD on that. If you're not familiar, Antigravity was the product that came out of the Windsurf acquisition last year. So they've actually gutted out the whole IDE. And there is a crazy thing that happened this week with this. I was shocked, and I've seen a lot of people really upset about it. For people who were using Antigravity 1, one day this week they updated the app and it became Antigravity 2, and the IDE is gone, their settings are gone, those all broke. And then all of a sudden they're using a kind of rip-off of the Cursor, sorry, of the Codex app, which now everyone else is ripping off, except just not as good and not as complete.
And sure, it works fine, I guess, but if you were using it a certain way one day and now you're using it this way, all of it's gone. And if you want to get that old experience back, you have to download a separate app and set up all your settings again, because it's two separate settings files. So then you have all this clutter: you have Antigravity 1 and 2 config folders in your repos, and it's a complete mess. So I think they did not handle this well at all. Curious to see how that goes for them. But if I was a user of Antigravity, I would have churned from this. I just can't imagine continuing to trust them.

SPEAKER_00 14:48

And ah, anyway, start with the rant there. Yeah, real question on this: how many people do you think actively use Antigravity? Like, do you think it's actually... I don't know.

SPEAKER_01 14:58

Um, probably less this week because of this, but I love it.

SPEAKER_02 15:02

I think in different countries, because I know a lot of people, like in South America, who are using it, mostly because of access to compute, because for them it costs a lot of money compared to us.

SPEAKER_01 15:11

Yeah. I think one thing they have going for them, and there's actually someone in my Discord who's a user because of this, is that they have the Google One plan for the family, with the mega plan or whatever. If you have that plan, then you have Antigravity usage. A lot of people are subscribing to Google services anyway. I subscribe to YouTube Premium and different things, and Google One, Google Drive. So if you're in that ecosystem, well, you have this coding product available to you as well on top of it, and I think for that it could be good value. But if you're not in that ecosystem, there's not really an appeal to go in just for the coding side. So, TBD.

SPEAKER_00 15:53

Yeah, I just feel like it's such a weird decision. And I know they have all the data. I'm on the outside, I can only see so much, and I totally understand I could be wrong. But from my bubble of the world, which again is very limited, I don't really know anyone who actually uses Antigravity. I feel like they're gonna have a hard time breaking into that market with the stranglehold that VS Code and Cursor have. Not saying it's not possible, but it is gonna take a lot of investment. And then to outright stop development on Gemini CLI, which I think actually had a better chance of being a long-term play, just feels like very interesting decision-making to me.

SPEAKER_01 16:33

Yeah, the CLI thing is weird. One thing that is interesting is that they're actually not sunsetting it for enterprise customers, but I think it's gonna effectively be on life support. They just don't want to disrupt those customers, but the consumers, those they don't care about. So you have 30 days to move away, and then your sub stops working with it, which is kind of weird. We'll see. There's a lot of people that built on it because of ACP, and all those people are getting screwed over a little bit because of it. So a lot of workflows are going away. They really want to funnel people to Antigravity, and we'll see how it goes. Anyway, I think that's enough about Google. We'll move on from here. Another thing that came out this week was Composer 2.5, and Ray, you've been a super huge fan of it. Before you dive into your thoughts, I just want to come back to one thing. Adam, you were saying you were finding 3.5 Flash is kind of losing world knowledge in favor of coding, and I think Composer is in the same boat: the way they train this model as a coding model, they don't care about world knowledge as much. When I was using Composer 2, I definitely felt that way. I'm curious, Ray, how are you feeling about Composer 2.5? How's it affected your workflows, and are you using it as your daily driver?

SPEAKER_02 17:44

So before 2.5 came out, I was using Codex a lot as a primary driver for a lot of my coding: Codex high, medium, low, and even in fast mode, because I like the speed. And I just started using Composer. I liked Composer 2 for certain tasks especially. I was really impressed with the front-end stuff. And then 2.5 came out, and it's literally taken over everything, from planning to, like, I've unleashed a QA agent. To give it skills, it builds the skill files. I did everything, and I just keep pushing this model, and I'm extremely impressed at its ability to get stuff done. It feels like a cross between Opus and GPT-5.5. It's a workhorse. And the speed is probably the other factor that's really influencing a lot of my bias here, because I didn't realize how much of a difference that would make in me wanting to keep going and get the information I need. Also, when I generate plans, they're very succinct. They're not overly wordy like Opus, but they're not overly GPT-ified, if you know what I mean. GPT-ified likes to put in a lot of headers, likes to kind of gaslight you back a little bit. This is just, here's the lay of the land. And I was like, okay, cool. Can you then take this and run with it? Or can you do this and that? I was like, okay, how good is it at building a novel system? I'm literally building an entire SWE workflow inside of my app. I just built a new app from scratch, and then I said, okay, let's go through a whole bug review process, and you're gonna be the person who writes me bug reports; help me set that up.
So it set up its own folders, it set up its own way of writing bug reports and templates and everything like that, and then it did a test pass, and it launched multiple browsers and was clicking away in the browser super fast. And it wrote really good bugs; it found bugs that I would normally find testing by myself. I was like, damn, and the speed was crazy. I don't think we've even understood the impact of this yet. Not a lot of people have this general SWE background. If you've worked really deep in QA engineering, or worked with project managers and all these different workflows, you can reproduce those workflows with this model, send it off, and do stuff in parallel. That's another thing their harness has improved on tremendously. I know I sound like a Cursor shill, but I've been giving the team tons of feedback, and they've been incorporating it as fast as possible. It sounds like they've rewritten their entire engine behind the scenes, and everything is extremely performant and really, really fast. In fact, I don't think it's tied to the VS Code stuff, and I think that's what's also driving their new CLI. So the same underpinnings are driving the new UI and also the CLI. So technically you can do everything I'm doing in the CLI, with multiple agents and parallelizing and testing and stuff.

SPEAKER_01 20:57

It's really interesting. I hadn't realized that the browser use was such a step up in Composer 2.5, because I think that's really a bottleneck with big models: how slow they are at clicking around and doing things. If you can have that at really high TPS, and the vision side is good enough to drive it, that's impressive. If they can get that latency down for those kinds of workloads, there's a huge demand for that. That opens up new use cases for sure in your loops. So, curious to see how that evolves. Have you tried it at all, Adam? I know you're a heavy Cursor user.

SPEAKER_00 21:29

Oh, yeah. I feel like somehow I've become a Cursor shill as well, and I was very negative on Cursor early on, so I also feel kind of odd about that. But I will say I agree with everything you said, Ray. I think Composer 2.5 is just phenomenal. I still wish I understood how they pull off the speed they do, because if I understand correctly, this is built off of, what is it, Kimi's K2.5 model, which is notoriously hard to run. It's like a trillion parameters, a ridiculous model, and I typically get 15 TPS or 12 TPS, something ridiculous, when I hit it. And this thing runs so fast, and it's very smart. It isn't at the level of GPT-5.5 for me, but when I'm on a call and we're brainstorming something, I can literally prototype something in real time. While the conversation's happening, I'll be like, oh, are you thinking like this? And I can't do that with any other model. It's phenomenal. It's honestly just a ton of fun. And to your point earlier, Eric, I do agree it has sacrificed world knowledge, but maybe I'm unfairly rating Gemini based on this, because Gemini Flash I have typically used as a workhorse for API-driven, behind-the-scenes things, not so much for coding. If I were to put Composer 2.5 in that same position, I'd probably have the same complaint. So, just to be totally fair. But for coding, I love this model. It is incredible. Kudos to the team. Massive kudos to the team.

SPEAKER_01 23:04

Well, I'm excited to see what they do with Composer 3 and the Colossus partnership with SpaceX. It could be a crazy turn of events for them. They might be a leader. We'll see how it goes.

SPEAKER_02 23:15

What do you guys think? I wrote this on X, and I kind of believe it, because I saw there was a bug report where the model was preferring to write motion instead of div or something in some of the code it was writing. This is Composer 2.5. Basically, within a couple of hours of getting that report, the Cursor team turned around a new checkpoint for the model and shipped it. So for me, if they're able to keep up this speed, I feel like the speed is the moat. If they have this type of pipeline to turn stuff around super fast, I don't know.

SPEAKER_01 23:48

I'm a little skeptical of this continuous-improvement RL, because if you're just vibe-shipping checkpoints for these models, it's really hard to prevent regressions. Yes, you have benchmarks that you check, and if your benchmarks are good, you can retain quality, but I just think in real-world use the benchmarks don't tell the whole story. When you test models, you want some consistency, especially if you're serving on the API. I guess for Cursor, they're not doing that, so they don't care. But I'm skeptical, because you can end up with these spiraling degradation loops of trying to overcompensate on one side versus another. But I don't know, we'll see. It seems to be working okay for them. And I do think at some point some continuous iteration is gonna be the standard way, but I'm just not sure we're quite there at the moment in terms of the reliability of it. So I guess we'll see.

SPEAKER_00 24:50

You know what's funny? I was working with a different team this past week, and they went through and said, okay, we updated all of our prompts, and our evals are now at a higher percentage. Then you go to actually test it out, and you're playing around with it, and you're like, well, congratulations, you figured out that our evals suck, because this thing is way worse than it was yesterday. That's the reality of where we are right now. It's useful to a point, but there's still that subjective human judgment you have to apply to it.

SPEAKER_01 25:22

But there's also like, you know, when you're um when you're working in uh as a user of these models, you have like all kinds of complexity in your in your setup. You've got skills that are installed that you want the model to auto-invoke, you've got like your your current prompts, your system prompts, you've got your agent files, agents MD, uh, you've got like just how you type it, your code base is a prompt too, like the code that the model runs into and engages with, like how is that laid out and set up. All of these things are kind of divorced from the evals that are kind of static and and and don't change. And so, yes, if you you know tune on that, maybe you get better scores. I think cursor to their credit, like they're trying to consider all of this in their setup, and they haven't shared too much about what their cursor bench does, but like I'm sure consider some of this. But you know, the real world is messy, and if you're continuously optimizing for a kind of score, you kind of lose out on some of these other things and and the messiness and you know the context clutter of of like how a model works day to day. And there's just a risk that like you know, you you're changing too much. And the thing is when your user gets used to a model, they try to like lay things out, change their prompt style, and kind of tune to themselves to the model if they're if they're a good power user. Um, but if the model's changing under the hood every day, like your prompt that worked yesterday might not work today. And uh I just think it's a risky thing to just constantly pull the model out from under people every day. Just like a lot to do. Um so so on that note, I wanted to dive a little bit more on this kind of idea that like a lot of people right now are are kind of really building up a lot of skills, you know, plugins and different tools connecting MCPs. 
And you know, Droid actually just released that tool to clean up sessions so that they don't preload everything; they dynamically pull in tools as they come. This is a notion called tool search: Codex is implementing some form of it, and Claude is implementing some form of it. My personal take is that tool search is a crutch for people loading in too much, because if the model has to find the tool, and it's gonna use that tool every time, then you're actually burning more tokens finding the tool than just preloading it. But if you're not gonna use every tool, then you save context by deferring the loading. But if you're doing that, then you're losing efficiency, and we measure context usage, but we don't necessarily measure outcomes and quality. So for me, I think we're in this awkward phase where tools kind of suck, they're inefficient, people are loading too many of them, and we have to have these kinds of dynamic loading to make them work. But besides tools, there's skills, and skills kind of rely on certain tools, and people are loading in way too many skills. Some people have hundreds of skills. So, just going around the table: Adam, do you have a lot of skills in your setup? Do you use dozens of skills, or very few? What's your skill setup?
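The preload-versus-tool-search trade-off described in this turn can be put into rough numbers. The sketch below is only an illustration: the function names and the token figures (300 tokens per tool definition, 500 tokens of search overhead per lookup) are assumptions made up for the example, not measurements from any harness.

```python
def preload_cost(tool_def_tokens: int, num_tools: int) -> int:
    """Tokens spent loading every tool definition into context up front."""
    return tool_def_tokens * num_tools


def deferred_cost(search_overhead_tokens: int, tool_def_tokens: int, tools_used: int) -> int:
    """Tokens spent when each tool that gets used is first searched for, then loaded."""
    return tools_used * (search_overhead_tokens + tool_def_tokens)


def cheaper_strategy(
    num_tools: int,
    tools_used: int,
    tool_def_tokens: int = 300,        # assumed size of one tool definition
    search_overhead_tokens: int = 500,  # assumed cost of one tool-search round trip
) -> str:
    """Return which strategy spends fewer context tokens under these assumptions."""
    pre = preload_cost(tool_def_tokens, num_tools)
    deferred = deferred_cost(search_overhead_tokens, tool_def_tokens, tools_used)
    return "preload" if pre <= deferred else "defer"


# A small setup where every tool gets used: searching only adds overhead.
print(cheaper_strategy(num_tools=5, tools_used=5))
# A bloated setup where only a few tools get used: deferring saves context.
print(cheaper_strategy(num_tools=60, tools_used=3))
```

This only counts context tokens. As noted in the turn above, it says nothing about outcome quality, which is the part that usually goes unmeasured.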

SPEAKER_00 28:12

I'm very much a minimal-context type of person. I'm even this way when I'm building agents myself; I actually like to lean in that direction. But I will say it is a real problem, because when you're trying to build this agentic workflow that has hundreds of tasks it could possibly do, and you don't know what a user is going to type in or ask for, how do you actually build that? I've got a lot of ideas around that. The one that comes to mind is I really like the way the Codex CLI approaches this, which is: we let the AI do what it's good at. It leans more into terminal commands, it leans more into writing code when it needs to. I really like that approach, and I think if we can start thinking about building our apps from that perspective, like what are the end things that need to happen and how do we actually give the context so the tools know how to go get access to those things, versus distinct tools for every particular action. You know, I've seen some agents where they'll have 50 or 60 tools loaded in, and that is never gonna be good, man. That's just not gonna be good. So I'm a vanilla setup type of person for the most part, other than maybe a CLAUDE.md file. That's about my limit at this point.

SPEAKER_01 29:28

Yeah, well, if you have a CLAUDE.md, the thing that's funny is that the model doesn't actually care about it that much.

SPEAKER_00 29:34

It will search it periodically. I'll be like, go check the CLAUDE.md file.

SPEAKER_01 29:38

I do think actually this this has changed a little bit with 4.7, but I still find tool instruction following is just quite lacking on uh Claude versus GPT. What what about you, Ray? Are you are you like a skill maximalist at this point?

SPEAKER_02 29:51

Bro, I just raw dog skills all the time. All the time. Just skills, skills. No, I'm just kidding. I kind of want to have an opposing opinion here just for contrast's sake, but no, I agree. I think a lot of these harnesses, like the Codex app, the web app on Claude, these different places, do a lot of that work for us under the hood. And I guess that's my question to you guys. Because I notice when I do conversations, it doesn't pull in the skills. And Cursor shows what you're actually pulling into the context window a lot better now, so you can tell it's not even loading the skills and other things. They just have them sitting there, and then you can either call them directly or mention them by a keyword. And so I don't feel like I'm polluting things, but I only have a handful of them to begin with. I really hate having to think about what's in there. Usually they're around a lot of business needs. And I'm spending less time thinking about the skills. I only turn stuff into skills that are repeatable workflows, and it's kind of the same with OpenClaw: it will call it whenever I need something, but it's not necessarily always a thing I call. I don't work with the API directly right now, like writing my own agentic harness, so that would probably be a pain in the butt. But yeah, I'm curious to know, Eric, how does that actually happen nowadays? Have they accounted for that issue in these modern tool sets, or what is your experience?

SPEAKER_01 31:24

Well, I mean, the way skills are done right now is that the descriptions of them are loaded to the model up front. So if you have a lot of skills, that description bloat can be real, and you can actually drain a lot of your context usage. So too many skills does hurt. But then the models also have a tool in the harnesses that lets them invoke these skills. And so the model can choose to go, since it knows the description, "oh, you know what, that sounds like a good fit for that skill based on the description," and invoke it. The thing is, most models don't do this proactively; only Codex seems to do it reliably. So you have to be careful: what model are you using? Are your skills kind of generic? If you're using the skills for Claude and you switch to Codex, you have to be careful, because the behavior around skills changes. And then if you start having the model dynamically call skills, which technically Claude is able to do, but just doesn't do well, you have to think about what's in those skills. Are there skills that conflict with each other? Are there skills that actually shouldn't be there at all and don't belong in this specific instance? I think the optimal way to think about this is that you should preload, in a given chat, which skills the model should be able to use. But that's a lot of overhead for the user, which is why it doesn't exist. So there's all this balance of how much upfront work do you want the user to do versus what the model's gonna do, and you pay costs for everything that you defer from the user. So that's kind of where I'm at.
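To make the "description bloat" point concrete, here is a minimal sketch of estimating how much context skill descriptions take up, and of the per-chat preloading idea mentioned above. Everything here is hypothetical: the `Skill` class, the four-characters-per-token heuristic, and the allowlist helper are illustrations, not how any particular harness stores or loads skills.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    name: str
    description: str  # the text that is assumed to sit in context up front


def estimate_tokens(text: str) -> int:
    """Crude heuristic (assumption): roughly four characters per token."""
    return max(1, len(text) // 4)


def description_overhead(skills: list[Skill]) -> int:
    """Approximate tokens spent just on skill descriptions."""
    return sum(estimate_tokens(s.description) for s in skills)


def preload_for_chat(skills: list[Skill], allowed_names: list[str]) -> list[Skill]:
    """Keep only the skills the user has opted into for this chat."""
    allowed = set(allowed_names)
    return [s for s in skills if s.name in allowed]


all_skills = [
    Skill("deploy", "x" * 400),
    Skill("review", "y" * 200),
    Skill("notes", "z" * 40),
]
print(description_overhead(all_skills))
print(description_overhead(preload_for_chat(all_skills, ["review"])))
```

The allowlist trades user effort for context: the user does the selection up front instead of paying for every description in every chat.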

SPEAKER_00 32:57

And you touched on two things there. The first being, when the model changes or the harness changes, there's very little portability between them. So you end up in the situation where your skills kind of become fixed to a model, and for some that might not be the biggest issue. But the other thing that I've been thinking about is there is a latency issue here too, because now you have to go through this whole reasoning step to determine what the intent is before you can actually begin acting on the job. And that may just be a necessary evil right now, but I feel like there's got to be a better way in the future. I've also played around with this idea of bundles, so basically a skill being paired with unique tools for that skill. How can you actually bundle these things together and then pull them all in? Anyway, I'm sure there's folks that are probably doing that already, but that's one of the things I've been thinking about experimenting with. Because I do think there is some value to having that intent-to-context building step, but I do worry that having it as part of the large model loop, trying to figure it out, is added overhead. And a lot of times you want the agent to be faster than you can do it. When I start seeing, hey, I want to do this particular thing, and it takes longer in some agentic system than if I had just done it myself, it starts to lose its value.

SPEAKER_01 34:23

What model are you using where you're feeling this the most? I would say typically the Anthropic models. Yeah. I find, honestly, I was using 4.7 the other day, and I was like, wow, this is extremely slow. This model is just thinking and thinking and thinking, and it just overthinks every tool call. I just don't enjoy it. And the reason it's thinking so much is because I set it to X high, but that's what Anthropic says you should use Opus 4.7 on. So I'm using the recommendation. I'm using GPT-5.5 a lot, and sometimes I switch down to low and it just blazes through things. I just don't find this issue at all with that model, and that's what I really like about it. So even though it's a big model with slow tokens, it's fast, which is quite nice. But yeah, just to circle back a little bit on your thing about the skills, because you were saying maybe there's not that much of an issue between models. The thing I noticed as well: I had skills that worked well for GPT-5.4 initially, and then Opus 4.7 came out, and it just screwed up the skill completely, didn't do it well at all. And I found that certain words, the way that they were said, the models hold on to them in different ways. So I had a thing where the model had to go get some information and then prompt another model with some work to do. Because of the way it was worded, 5.4 was like, yeah, that's fine, it's just a quick check, I'm gonna do a couple things and go. But 4.7 was like, no, no, I need a complete understanding, and so it would just go off for minutes at a time before it did step two. And it was just broken, the whole flow was just broken. So I had to change the prompt to lighten the requirement for 4.7's more anal-retentive behavior, and then it was able to work.
So you have keywords that are, and 4.7 loves this word, "load-bearing": they can be very load-bearing for the model and really change how the model behaves. And you have to really be careful, because some models will be way more sticky on certain words than others, and some models will just ignore all the words and then just not do any of it. So, yeah, I think that's very tricky.

SPEAKER_02 36:36

Yeah, I heard you need a skill for your skills so that you can run your skill files with other skill files for your models.

SPEAKER_01 36:42

Yeah, actually, you need a skill that primes each model on how to use different skills. I guess we're getting too silly there, but I think that's why it's tough to change models too much. And if you listen to AMP's philosophy around tuning the prompts and the tools for each model, and taking away that choice from the user, I think it makes sense. A year ago, I would have said, well, the models are converging, so none of this matters. But over time, I think part of that's true, but part of it is really not, and the models really are diverging in how they listen to instructions and behave. And I think there's a lot of alpha in just really tuning your workflows to be as lean as possible and optimizing for specific models. I think it's kind of the best way forward right now, anyway.

SPEAKER_00 37:30

Yeah, and I think there's two two two major schools of thought there. It's like, are we gonna let engineers have the ability to tune and manage things for themselves, or do we want the harness to do it? And like I'm much more of a tinkerer, so I like having a bit more flexibility myself, but I do I do I have come around a bit more on AMP strategy, even though they've they've come the other way and started adding additional models and things to their stuff.

SPEAKER_01 37:55

So but their additional models are like in different workflows, right? So they'll have like a rush mode, they have a deep mode, a smart mode, um, and each one is a specific model that it uses and is tuned for. So like depending on how interactive or in a hurry you are, you want to use certain ones, and if you're if you want to just let it go deep, you use the different one. Like, I mean, I I still prefer my my own setup, but like, you know, I I I think for someone who doesn't want to think about like the setup, I think it's a good way to go. You know, we're getting to a point where like, you know, if you really want to be on the bleeding edge, you have to tinker. Um, but then there's the diminishing returns. Like, if if your job is not to do this tinkering, then how much of your work is tinkering versus doing the work that you're supposed to do? And it's like a huge challenge, like because you have to stay up to date with the model changes, like when we're making a podcast that like every episode we have like new models coming out, which is crazy. And and it's just hard to stay up to date and and you have to tinker, but like how much tinkering is actually worth doing and you know, versus what what you can just grab off the shelf and you have to really ask yourself that question.

SPEAKER_00 38:59

Yeah, and at what point is it like, do I even care to try a new model? Because what I'm using right now is fine and uh and it's doing what I need it to do. Like it is it is tricky for sure.

SPEAKER_02 39:07

I think that's tough though because I mean what models do you guys gravitate to? Because I feel like we're trying a lot of them and we're definitely at the buffet table. I would just love to hear everyone's like, you know, I used X, but now I'm using Y, or what's your kind of like state of this week?

SPEAKER_01 39:24

I mean, I've I've been on 5.5, you know, since launch, and I've kind of tried I've been actually escalating my use of it more and more, and and using five five low in certain cases because it's just so fast and efficient. Um and and I've just been trusting it, it's reliable, it does what I need. Um, even you know, UX-wise, I found like it does a decent job in many cases, it lays out information really well. Like you might need a styling pass, but like the way it designs layouts is is really good. Honestly, like it's hard for me to use anything else at this point in just terms of intelligence, reliability, and and efficiency, like, yeah, that's where I'm at. What about you, Adam? Oh, you're on mute there.

SPEAKER_00 40:04

I would say that the majority of my usage has been on GPT-5.5 medium. My Anthropic usage over the last two weeks has probably been near zero, just to be totally honest. I haven't felt much of a need to go back to it. And then Composer 2.5, honestly, is just my dream model for prototyping things out really quickly. You can't get better than that. I will say I have been giving GPT-5.5 low a bit of a chance, and I've had good results with that. I do wonder if that could be a workhorse model at some point, if that could take the place of some of the flash model stuff that I've built in the past.

SPEAKER_01 40:38

Yeah. Honestly, it's really fast. It's crazy. I've been using it as an explore agent, and it just kind of narrows in. You give it a short, specific task, and it executes way faster than smaller models that you would think are faster. I've tried comparing it to GPT Spark or GPT-5.4 low or Sonnet or Opus as well, using it in those cases, and pretty much all those other models overthink the task and read way too much context and don't actually get to the solution fast. The fastest one is typically always 5.5 low, which is surprising, honestly. What about you, Ray? You seem very Composer-pilled at this point.

SPEAKER_02 41:19

Yeah, I'm Composer-pilled inside of Cursor for coding, but that was also typically done with the Codex app as well. So I'll do Codex medium or low. And then if I needed to really cross-check a problem with Composer 2.5, or just say, hey, can you take a look at this thing and give me a second opinion, I'll go high or X high just for that one pass. GPT-5.5 is like my expert that I pull in. And then I have other workflows, so for OpenClaw, I have GPT-5.5 with low or medium in fast mode. I just talk to it in OpenClaw using my subscription, 200 bucks a month, and I don't even burn through half of it, and it's literally helping me with my health and fitness goals, and it doesn't feel any different than what Opus was doing, and it's actually very useful. And then I use Codex in the app to drive my computer and do computer workflows, and that is usually medium, and that computer use control, all that stuff is just absolutely incredible. I think a lot of people are actually sleeping on that new workflow with Codex, where I think Codex may even be the wrong name for the app because it can do way more. I actually want to give it to grandparents and parents and just say, talk to this little box here, and this will get you all the information you need for helping you navigate around the computer. It's that good at controlling the computer, which is kind of my workflow there.

SPEAKER_00 42:52

So, Ray, are you doing a lot of web browsing or things like that with it? Are you doing work on the web with it?

SPEAKER_02 42:58

Everything. So, like, literally going to a text message: it opens up and it's writing AppleScript to scroll the text messages and read context for business. It goes back into Mail, my actual Mail app, because you can only connect one Gmail account in the plugin, but if you tell it to launch the Mail app, it starts reading your mail. The computer use is actually insane.

SPEAKER_01 43:18

The way they did it, like too good.

SPEAKER_02 43:20

And yeah, I can control I can do other stuff on my computer because it's it just launches the app in the background and it has a separate mouse that it's moving, and that's why you need to get a Mac, uh Adam, yeah, like ASAP.

SPEAKER_01 43:32

I actually don't think that stuff will work quite as reliably on that's insane.

SPEAKER_02 43:36

That's insane. Yeah, yeah, because with Claude's computer use thing, it takes over the whole machine, and you can't do anything else. Yeah, yeah, it's painful. It's so painful. But here it's actually driving the app under the hood. In the background, yeah. In the background, and you see the little cursor moving around, a separate cursor that's driving that.

SPEAKER_01 43:55

The UX on it is insane. I think they really killed it on that, and now you can even use all of that from your phone, which is just insane. You can just tell it from your phone: hey, check my email app on my computer.

SPEAKER_02 44:07

I was like, can I I have this thought? Let me just punch it into my machine at home and just drive.

SPEAKER_01 44:12

You know, yeah, I want more Mac now. Yeah, if you haven't tried it on Mac, like you really gotta go back. The definitive experience on Codex app is is definitely on Mac. Like you can't use it's just not the same on Windows, unfortunately. But yeah. Have you guys been using goal mode at all in in Codex? How about you, Ray?

SPEAKER_02 44:32

Not as much as I would want to. I don't know why. I just feel like I want to drive more things right now. Um, you know, I I feel like Yeah. I don't know. I'll let you guys speak on more on your experience, but as of right now, for whatever reason, I want more hands-on control. Have you tried it, Adam?

SPEAKER_00 44:48

I've only ran a couple things through it, and I was like, um, I like where it's heading. And I I like the idea because I it's like you're going to bed, I'll kick off something, come back and see where it got to. So I I like that. And I do think there's something to it, but I haven't done anything meaningful with it yet. Like I've modified some stuff in my Shopify store, uh, which is really nice actually, because it would it it did a really good job at coming through. And of course, I still have to bring it back in and then iterate on it from that point, but it kept me from having to like baby step it across. Like it got a high-level goal, and I was able to go away for a few hours, come back, and it was further along than I would have been. And it also helped, you know how like when you get stuck in that dopamine rush of like, oh, this is so cool. I'm working, it's 2 a.m. It's like, okay, you know what? I'm gonna kick off a goal, and then I'm just gonna let it run for an hour or two, and I'll come back and now I can go sleep because I know something's working.

SPEAKER_01 45:42

So if you're not familiar, because we just jumped into it: goal is a new feature in the Codex app where, basically, there's an outer loop where you have a goal, and then when the model's done working, it gets injected with something like, hey, did you complete this goal? And the model's like, yeah, I completed it, and stops, or it's like, no, actually, there's more to do to hit that goal, and it keeps going. So it doesn't let it stop. It's a thing that injects another turn. If you used Codex in the past, when they had queuing as the main default, if you were to type 10 continues in the queue, every time it would finish a turn it would get another continue. It's basically the same thing, but a little more sophisticated. So it's double-edged, right? On one end, if you have a well-structured goal and you take time to really craft it and have good, clear end conditions, you can get a lot done with it. The thing is that it leans a lot on compaction. So if you have a lot of context and it's necessary and it's kind of hard to get the whole picture, you might struggle a little bit with the goal setup, because the model has to continuously compact and continue. If you have a very clear "iterate on this specific thing," it's really optimal for that, because then there's not that much context to load in, there are clear conditions on what to do, it can try different approaches, it remembers what it tried through compaction, and then it can keep iterating. I think that's kind of the ideal way. If you have a big plan and you give a goal, it'll keep plugging along until it completes it. So I think that's really cool.
I do think though, like, you know, I've been you know hammering on orchestration as a concept for a while, and I think orchestration is still like a better approach than doing goal because you have like a higher level executor that kind of understands the problem, breaks down tasks, and issues it, and you know, the way that you handle um context efficiency with like a master context and subcontext is kind of a better setup, in my opinion, uh, than goal, which is just like one train that just keeps compacting and continuing. Um, you can combine compaction, sorry, um goal plus orchestration, but in many cases, that's like overkill. Like if you have like a crazy huge project um and you need it to run for like days at a time, then that's kind of a good way to go. But you know, I I think it's really hard to set up a task that that's like well-defined enough to kind of run that long in a good way that's productive, because oftentimes it'll degrade and kind of fall off.
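The goal loop described in this turn, an outer loop that keeps injecting another turn until the model reports the goal as complete, can be sketched roughly as below. This is not Codex's actual implementation: `run_turn` and `is_complete` are hypothetical callables standing in for a model turn and a completion check, and the `max_turns` cap is an added assumption to keep a badly specified goal from looping forever.

```python
from typing import Callable


def run_goal(
    goal: str,
    run_turn: Callable[[str], str],           # hypothetical: runs one agent turn, returns its result
    is_complete: Callable[[str, str], bool],  # hypothetical: checks the result against the goal
    max_turns: int = 20,                      # assumed safety cap, not a documented setting
) -> tuple[bool, int]:
    """Keep injecting turns until the goal is met or the cap is reached.

    Returns (completed, turns_used).
    """
    prompt = goal
    for turn in range(1, max_turns + 1):
        result = run_turn(prompt)
        if is_complete(goal, result):
            return True, turn
        # Same idea as queuing another "continue", with the goal restated.
        prompt = f"The goal is not complete yet: {goal}. Continue."
    return False, max_turns


# Stub usage: a fake agent that finishes on its third turn.
calls: list[str] = []


def fake_turn(prompt: str) -> str:
    calls.append(prompt)
    return "done" if len(calls) == 3 else "still working"


print(run_goal("ship the feature", fake_turn, lambda goal, result: result == "done"))
```

Compaction and orchestration are left out on purpose; the sketch only shows the single-train "check, then continue" loop, plus the cap that guards against burning tokens on a goal with a bad end condition.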

SPEAKER_00 48:00

But um so I actually yeah, go ahead. It would be a fun test to do the same thing in an orchestration and a goal. Yeah. To just see the di the variance between like something big, like and then like run it, see what would happen. I I would love that experiment. Yeah, anyway, keep going. I was thinking about that.

SPEAKER_01 48:17

Yeah, yeah. No, I mean, I was running some experiments with it last week. And one of the things I actually did, this is a little nerdy aside: I saw this new release came out. If you've ever played Zelda: Twilight Princess, there was a decompilation effort. Someone spent five years reverse engineering the code for this game, which is a GameCube and Wii game, and then they made it run in a way that compiles everywhere. So you can run it on an iPhone, you can run it on any device, really, because it uses WebGPU to render. It's quite cool. So I was like, okay, we've got this nice decomp, it can run really well, it has high frame rate support. What if I can get this to run in VR? Because there was an old mod that came out a while back where that was possible. So I used Repo Prompt, I set up a plan with it, I had this plan. And then since VR is kind of only for Windows, I hopped to my Windows computer, gave it the plan in goal mode, and then I just let it plug away. It took a couple of iterations, where it executed, went back, did the plan again, and came back. But with that clear plan, it got it done. We shipped it, it worked in VR, and I was like, wow, we got it. It did need a little steering, but the goal got through huge chunks of multi-hour work. I think around four hours was kind of the peak of where I was trusting it. One time my goal was bad, and then the model got stuck in a loop and it just kept saying "can't complete goal, stuck," and then it would keep going and looping, and I think that's a known issue they're fixing. But yeah, you can get a lot done.
But if it fails, if your goal's end case is bad, you will burn tokens for no reason on stupid shit where it just keeps iterating on nothing. So you gotta watch out for that. But yeah, I think these little primitives are coming together, and orchestration is the thing that will come with it too. I think you probably want both; they're both individual pieces that make sense. But I think orchestration is the more important primitive, in my opinion, and we'll see where we go from there. That's so cool. So I think we're getting to the end here. I just wanted to end on one quick note. It seems that Karpathy, if you're not familiar, one of the legendary co-founders of OpenAI, who went to Tesla and worked on self-driving there, then went off to teaching for a while, is now at Anthropic, which is crazy. So I'm curious to see what he does there. Seems like it's gonna be some work on self-improvement. Anthropic really seems to be the hot lab everyone wants to join at this point. So it's kind of interesting to see him hop over there. Any thoughts on why he's making the jump now? What do you think, Ray?

SPEAKER_02 51:08

I'll steal what was said earlier in the green room, which is that he probably went to Anthropic just so he can get access to Mythos, right?

SPEAKER_01 51:17

Right on. I mean, as models are getting closed off, if you want to stay on the frontier, the only place to do it is at a lab, it seems like, right now.

SPEAKER_02 51:25

So I think there are a couple of things that, as an engineer, you don't see as the systems keep evolving. For me, I don't have access to Apple's internal source code, so I can't see what's going on there. You just get the result and you use the thing, and it's literally a black box, and you kind of have to figure out what's going on. And in that podcast he did with Sarah Guo, they had a great discussion, and he kind of revealed that he misses being in the lab and talking to these people and figuring out these new techniques. And she casually laughed and said, haha, it looks like he's looking for a job, folks, if you guys are hiring. And now it's like, maybe he was already talking to Anthropic over a month ago about that.

SPEAKER_01 52:09

And I mean it takes time to do in these labs, sounds sounds like that's a plausible case, yeah.

SPEAKER_02 52:14

And so uh to me, it makes a lot of sense because he has all this knowledge. He drops like the wiki thing, he drops all this knowledge in terms of like, why don't we just try this? Just like as a kid who's playing with this stuff, and it's pretty obvious to him, but it's not obvious to the rest of the people. And I don't I think he just forgets how much experience he has because he just drops them like everyone should kind of know this stuff, and it's like and everyone just goes bonkers or like, oh my god, I built my auto research loop, I built my you know, wiki brain for my whatever thing, and it's like that's only scratching the surface when you have uh other researchers who are very deep in all these different topics sitting in a room with you, the iteration just to being co-located is actually uh underrated, and I think that's what a lot of people really forget slash miss out. Uh and it's sort of like the sparkle of what's been happening in the Bay Area in general is that people are uh are co-located and they're they're able to move a lot faster because it's just that at a coffee, you're thinking about it, you're on a run, you come back, and then you you you're sitting down with everyone late at night, and then just you know, something magical happens and uh or something gets revealed from a competitor and you start playing with it, and then you get inspired, and you're like, Oh, since that can happen, now this can happen.

SPEAKER_01 53:26

But you know, there is a double-edged sword to that though, which is it leads to a lot of groupthink, with everyone trying the same ideas and just doing the same things. So sometimes that does really push and refine those ideas to the natural limit, but oftentimes, if everyone's convinced this is the right idea, they don't think about other ideas which might be better. So it's a bit of a risk there. I'm not saying that's all that's happening there, but I've seen it, being able to try my own ideas working on coding harnesses: people are just not trying certain things, and I'm like, it works pretty well, and I'm curious why not. So anyway. Any thoughts to add there, Adam?

SPEAKER_00 54:04

Yeah, real quick, I just say like he strikes me as a technologist. Like it you there comes a point in your life where you're like, yeah, uh he's gonna get the upside of the IPO that's inevitable. Yeah, he's gonna get access to stuff, but really at the end of the day, like we all just want to work on things that matter and are gonna make an impact to human civilization. Like we you know, there's only so many hours that we have in our life to actually put on things, and do we really want to spend it uh in things that maybe we think we could be spending and having a higher impact somewhere? So honestly, I think that's probably where he's at in his life. He's like, because he he mentioned in that tweet that I saw where he was like, I'll get back to teaching. I think he's seeing this as like, okay, there's only a short window of time where this stuff is actually gonna be like a green space. Like uh you know, basically, like I have an opportunity to help shape it, to help drive it, to be a part of it. Like, why would I not do that now? Teaching will be there when when that's over. So that's where my head would be. And I honestly I'm in a very similar boat, uh not on his level, but I want to work on stuff that's impactful and meaningful, and you know, I and and so whatever company or thing that I join, it's going to be it's gonna be that's gonna be a thing that I care a lot about because I I wanna I wanna build things that people use.

SPEAKER_01 55:19

Yeah, yeah, I definitely feel the same way. It's an interesting time. There's not many moments in history where change is at the kind of velocity that it is right now. And if you're curious and you're working with AI, there's just so much to learn; the depth to which you can learn is really never-ending. We're discovering whole new workflows that are changing and being deprecated so quickly. If you're okay with throwing out what you were doing yesterday, it's the perfect time to be building, and I think it's hard to find anything else, as a technologist, that can spark the same kind of curiosity. All right.

SPEAKER_02 55:57

So, how would you guide people right now who are entering this new era of, quote unquote, agentic engineering? You guys have spent a lot of your recent careers shifting in this direction of playing with agents, using them, actually trying to make products with them. Where do you think people should start thinking? Because I feel like it's a different way of thinking than before. Before, you'd go learn the code, go learn these systems, build them yourself, and then maybe you could scale your own system or something like that. I feel like these systems behave very differently now.

SPEAKER_00 56:31

So this is gonna sound very dumb, but I've been a technologist for a lot of years. There was a transition from synchronous to asynchronous, and you would not believe how hard it was to get engineers who weren't familiar with the asynchronous way of thinking to understand how asynchronous technology works. And I know it's a very loose correlation or example here, but agentic is like that. It is so different in thinking. It is not that you put in something and you get the exact response back. You have to design your systems around these probabilistic behemoths, these large language models, that may or may not do what you ask them to do. And it is actually one of the most exciting times to be in technology. So if I was just coming into it today, I'd be spending all my time on understanding that. I'd be trying to understand what it means to actually work with probabilistic software, how we can get the outcomes that we want from it, and how we can start chaining these things together to turn systems that are point solutions all over the place into end-to-end complete solutions. That's where I would spend my time.
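One way to picture designing around a model that "may or may not do what you ask" is to wrap each call in a check and a bounded retry, so the rest of the system only ever sees validated output. The sketch below is an illustration under stated assumptions, not anything described on the show: `run_until_valid` and `make_flaky_model` are hypothetical names, and the "model" is a random stand-in rather than a real API.

```python
import random
from typing import Callable, Optional, Tuple


def run_until_valid(
    step: Callable[[], str],
    is_valid: Callable[[str], bool],
    max_attempts: int = 3,
) -> Tuple[Optional[str], int]:
    """Call a non-deterministic step until its output passes a check.

    Returns (output, attempts_used); output is None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        output = step()
        if is_valid(output):
            return output, attempt
    return None, max_attempts


def make_flaky_model(seed: int) -> Callable[[], str]:
    """Hypothetical stand-in for a model call that sometimes ignores the instruction."""
    rng = random.Random(seed)

    def step() -> str:
        # Roughly half the time, return prose instead of the requested number.
        return "42" if rng.random() < 0.5 else "Sure, happy to help!"

    return step


if __name__ == "__main__":
    # Ask for "a number" and accept only output made entirely of digits.
    result, attempts = run_until_valid(make_flaky_model(seed=0), str.isdigit, max_attempts=5)
    print(result, attempts)
```

Chaining then becomes a matter of feeding one validated output into the next step, and deciding what the system should do when a step returns `None` instead of assuming it always succeeds.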

SPEAKER_01 57:41

Yeah. I think for me, one thing I'll say is, you know, you could be building crazy systems, but I think just using them is already a whole thing when you're trying to do your work with them. My advice is to try and spend maybe an hour or two every week just looking back on how you've been working with the tools this week and seeing, okay, what could be different next week? What can we automate this week? What did I do several times? And there's tools. One of them in particular is called CASS. It's made by this guy, I think his GitHub is something something Dickleworth, Jeffrey Emanuel on X, if you want to follow him. He goes a little hard on the orchestration, and I don't fully agree with his approaches, but he makes some cool tools, and one of them is CASS. Basically, it allows you to search through all of your old threads really easily, allows your agents to work through what you did in Codex, what you did in Claude, all these things, and helps you find inefficiencies and things that you could probably do better. You don't have to use that tool, but just the idea of looking back on your threads, looking back on how you worked, reading your threads and seeing what could be better, what could be automated into a skill, and trying to reshape a little bit the way you're prompting and doing things, trying to find small points of leverage. I think that's an important thing to do. You've just gotta constantly be reevaluating how you're working with these tools every week, and taking the time to do it is important. I love it. What about you, Ray? What's your advice there?
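The weekly look-back on "what did I do several times?" can be approximated with a few lines of code. This is a minimal sketch, assuming your past prompts are already available as plain strings grouped by thread; it does not use or represent the tool mentioned above, and `normalize` and `repeated_requests` are hypothetical helper names.

```python
import re
from collections import Counter
from typing import List, Tuple


def normalize(prompt: str) -> str:
    """Lowercase and collapse whitespace so near-identical prompts match."""
    return re.sub(r"\s+", " ", prompt.strip().lower())


def repeated_requests(
    threads: List[List[str]], min_count: int = 2
) -> List[Tuple[str, int]]:
    """Return prompts seen at least min_count times, most frequent first."""
    counts = Counter(normalize(p) for thread in threads for p in thread)
    return [(p, n) for p, n in counts.most_common() if n >= min_count]


if __name__ == "__main__":
    # Assumed sample data: each inner list is one thread's prompts.
    week = [
        ["Run the tests", "fix lint"],
        ["run  the tests "],
        ["Run the tests", "Fix lint", "Rename the module"],
    ]
    for prompt, count in repeated_requests(week):
        print(f"{count}x {prompt}")
```

Anything that shows up near the top of that list week after week is a candidate for automating into a skill.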

SPEAKER_02 59:13

Just prompt. No, I've been talking to a lot of people and getting them started. And to be honest, I think it doesn't really require code; it just requires you to go in and think about a workflow. So my challenge to a lot of people, even those just getting started with AI, is to launch the Codex app, go through the setup to control the computer, and then just ask, you know, what is something that I'm doing? Check my mail. What is something I'm doing every morning where it could save me five minutes of my time? And just start with something basic like that. You'll be very surprised. Once you start thinking in this way and you watch it interact and do stuff, your imagination kicks in: if the computer can go do that, why don't I just delegate this to it? And why don't I delegate this? It may start to get you thinking, why did I even sit down to do certain things repetitively all the time? And that's how you start to realize what's actually happening in the machine underneath it.

SPEAKER_01 1:00:12

So that actually reminds me, I was having a conversation with a friend of mine who's an animator for video games, and he works at a small studio. I put him onto Codex and he started using it more, and I started describing some of the automations you could do with Blender and other tools. And he was like, oh man. He started playing with it and he felt the power. And the more automations you could add, the more he compared it to mana, if you've played an RPG. Or if you watch, you know, D&D: mana is this pool of magical energy that you cast spells with, and the more mana you accumulate, the more you can accomplish. In some ways, having Codex and these tools do all these automations is like an expression of mana, where you have to decide how you're gonna spend it. But it is power that you can assign to your life to do things for you. And if you think about it that way, you try to increase your mana and improve how you spend it. Interesting thoughts there. So, just fun ways to compare it, yeah. I love that. That's awesome. Yeah.

unknown 1:01:19

All right.

SPEAKER_01 1:01:20

Well, I think it's a good place to stop for today. So thanks to everyone for tuning in, and I've been appreciating the feedback. And yeah, see you all next time. Take care, everybody. Cool, take it easy.