Frontier Systems

Amanda Askell of Anthropic - Office Hours, Episode 2

Anjney Midha Season 1 Episode 2

In this "Office Hours" episode, Anthropic's Amanda Askell discusses her work making Claude "good"—a journey that took her from a philosophy PhD in formal ethics and decision theory to leading character and alignment work at Anthropic. She explains why Aristotelian virtue ethics has proven more practically useful than abstract theoretical frameworks, and unpacks Anthropic's constitutional approach to AI. Rather than imposing strict rules, the constitution describes situations, values, and good judgment, aiming for coherence across domains so models generalize well into new contexts. Askell argues this approach is safer than the "purely corrigible tool" model, which she worries could generalize into an entity willing to do anything it's told, and stresses the importance of flexibility paired with backbone - models that adapt to users but push back when something genuinely harms them. Looking ahead, she sees the next one to two years as critical as models become more autonomous, and hopes for a "rocky but ultimately good" transition with responsible deployment, strong alignment work, and a society that adapts to the disruption. She closes on an optimistic note about meaning beyond work, drawing on Star Trek-style abundance and the idea that purpose comes from relationships and contribution.

SPEAKER_00

That's cool. I know, that's pretty nice. So professional looking. I was just laughing though. I was still on Resident like four years ago. Yeah, we gotta start talking about it. So we're live? Okay. Okay, great.

SPEAKER_01

Welcome to the CS 153 Office Hours. Office Hours is our first live stream. This is an experiment. I think we got a lot of demand for more time with us and some of the speakers. We'll try to actually bring in people from around the world over the next few weeks. But today it's just Mike and me.

SPEAKER_00

And so if you have questions... you know what's interesting, Anj, is that this is the fourth year we're doing the class, and this is the first time we've done office hours. Like, really office hours. I know I would make time for students, but yeah, like, official.

SPEAKER_01

Yes, yes. There were two sessions I did last year that were like three-hour back-to-backs where students came in. Okay, and we kind of triaged. Somehow I got out of it 20 minutes. I don't know how you got out of it. I definitely did that, but it seemed like that was high value.

SPEAKER_00

Yeah, for sure. But let's actually, well, I'll tell the class, but you know, any of you should always feel free to reach out to Anj or me if you have questions, professional questions, if you're starting a company. I probably talk to, I don't know, probably four or five ex-students a week, actually.

SPEAKER_01

From this... okay, so you're much... I'm better at batch. You're better.

SPEAKER_00

Well, you're better at managing your time than that.

SPEAKER_01

No, no, that's not true. I'm just better at batch than I am at... Well, this is more efficient. Yeah, I mean, to each their own, but we will try to get through as many questions as we can, and I will apologize: many of you did email in this week. I think I got like 600 emails, and I'm still digging through them, but the course team and us, we will figure out how to triage things, and many of them are actually repetitive because some of them are quite, well...

SPEAKER_00

There are a lot of questions from people wanting to know if they're getting off the wait list, and obviously there's no way to predict that.

SPEAKER_01

Yeah, so I think that one just depends on course drops. We don't really have a lot of control over that. It depends on basically how many people who are already enrolled drop it next week. So you just gotta wait and watch, guys. All right. Do we have questions that people have? Yes, all right, go for it, Mike.

SPEAKER_00

Oh, here's a question. In your lecture you showed scaling laws hold across pre-training, mid-training, and post-training. Is there any evidence they start to break down, or is the ceiling just compute?

SPEAKER_01

So yeah, this is a great question. Scaling laws for which domain? This is important, right? I think for coding, what we have seen is empirically they're holding, but people misunderstand what the scaling laws mean. It doesn't mean that just because you put one more unit of compute in, you get one unit of capability out. That's not how it works. It just means that you can keep predictably improving capabilities on some distribution if you combine compute with the right amount of data. Yep.

SPEAKER_00

And in a verifiable domain. In a verifiable domain.

SPEAKER_01

And so in coding, that's holding. In material science, where we're sitting at Periodic today, that's holding. In visual intelligence, like we've had with Andy, that's holding. And so, yeah, I think, broadly speaking, capabilities scale predictably with compute.
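Note (illustration, not from the speakers): the claim that capabilities improve predictably, but not one-for-one, with compute can be sketched with a simple power law. This is a minimal sketch that assumes a loss curve of the form L(C) = L_inf + a * C^(-b); the function name and every constant below are invented for illustration and are not a published fit or anything shown in the lecture.

```python
# Toy power-law scaling curve. All constants are illustrative assumptions,
# not fitted values from any real model family.

def predicted_loss(compute_flops: float,
                   irreducible: float = 1.7,
                   coeff: float = 10.0,
                   exponent: float = 0.05) -> float:
    """Return L(C) = irreducible + coeff * C**(-exponent)."""
    return irreducible + coeff * compute_flops ** (-exponent)


if __name__ == "__main__":
    budgets = [1e20, 1e21, 1e22, 1e23]
    previous = None
    for c in budgets:
        loss = predicted_loss(c)
        gain = "" if previous is None else f" (improvement {previous - loss:.3f})"
        print(f"{c:.0e} FLOPs -> loss {loss:.3f}{gain}")
        previous = loss
```

Under these assumed constants, each 10x increase in compute lowers the loss by a smaller amount than the previous one: the curve is predictable, but one more unit of compute does not buy one more unit of capability.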

SPEAKER_00

It's interesting though, because I'm involved with Open Athena, which is a nonprofit doing kind of AI for science. And when we look at some of the projects that we're doing, like in oceanography and whatnot, it's unclear actually in some of the scientific domains: is the compute really related to the models that are being created there?

SPEAKER_01

Yeah, so this is the key thing. There's no substitute for finding the right architecture and algorithm that learns efficiently. Yeah. And until you have that breakthrough, you don't want to prematurely scale compute. Because then you're just throwing compute. Then you're just burning money. You're burning money. Yeah, yeah, yeah. But if you find the right architecture, and then through your ablations and your research you figure out the right representation for the data, then once that's happened, I find the bitter lesson holds. And I do think the transformer model, or flavors of it, like we heard from Andy yesterday, is proving to be much more scalable than people realized. Yeah. But that doesn't mean that you should start throwing compute at the problem before you're ready. That's what I would call compute optimal scaling.

SPEAKER_02

Yeah.

SPEAKER_01

And so in the oceanography example, which is a new frontier, we're probably in the research phase. That's right.

SPEAKER_00

Because it's probably about two or three years behind.

SPEAKER_01

There we go. Yeah, I think that's exactly right.

SPEAKER_00

Yeah.

SPEAKER_01

So I think at the end of the day, as long as the right talent attacks that problem and they have the right data, I would totally expect, if we can get that verifiability loop going, progress should happen there pretty fast. And then it's about compute.

SPEAKER_00

Yeah, I think you're right. That makes tons of sense. Okay, we got another question. Um, what's actually happening during mid-training that's different from fine-tuning? Is it just more compute on more curated data?

SPEAKER_01

This is a great question. The short answer is there's no standard definition of mid-training, but usually the repeatable thing is you just take the base model and you do extended pre-training. You're not doing supervised fine-tuning or reinforcement learning. It's literally just extended pre-training with more tokens in the pre-training data set.

SPEAKER_00

That's what for that foundational model.

SPEAKER_01

For that particular capability. Yeah.

SPEAKER_00

Um, given the historical commodity cycles you showed, when do you think we'll see a meaningful GPU price correction? Um the million-dollar question.

unknown

Yeah.

SPEAKER_00

Or a trillion dollar question.

SPEAKER_01

Yeah. Yeah, so it's a good question. I think there's a couple of things going on. One is, I would say there's a discontinuity in the GPU price for clusters that don't scale. So if it's a small cluster, actually, it's not that valuable for a frontier team to use, because you want to be able to scale pretty elegantly. And so, you know, it's crazy for me to say this, but if it's just a 512-chip cluster, which a few years ago was big, you know, that's a $7 million, $8 million a year cluster. That's almost too small now for a team that's doing serious training.
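Note (illustration, not from the speakers): the figures quoted above imply a rough per-chip hourly rate. The sketch below just does that arithmetic, assuming the cluster is paid for around the clock for a full year; the function name is hypothetical and the result is a back-of-the-envelope number, not a market price.

```python
# Back-of-the-envelope rate implied by "512 chips, $7M to $8M a year".
# Assumption: the cluster is reserved 24 hours a day, 365 days a year.

HOURS_PER_YEAR = 24 * 365  # 8,760 hours


def implied_rate_per_chip_hour(annual_cost_usd: float, chips: int) -> float:
    """Annual cluster cost divided by total chip-hours in a year."""
    return annual_cost_usd / (chips * HOURS_PER_YEAR)


if __name__ == "__main__":
    for annual_cost in (7_000_000, 8_000_000):
        rate = implied_rate_per_chip_hour(annual_cost, 512)
        print(f"${annual_cost:,} per year over 512 chips ~ ${rate:.2f} per chip-hour")
```

Under that full-utilization assumption the quoted range works out to roughly $1.56 to $1.78 per chip-hour.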

SPEAKER_00

What do you think that min size is to do serious training? Like what is the size of that cluster?

SPEAKER_01

I think it's very hard to do meaningful research today, meaningful scaling. Like, if you talk about the hyperscalers and so on, I think less than 4K, 6K as a checkpoint is just not that interesting.

SPEAKER_02

Yeah.

SPEAKER_01

And so I think for smaller teams and for inference, those clusters are much better. Older H100s: super valuable for inference and super valuable for research that's subscale. But then once you have the SOTA breakthrough, you need to be on 4K, 6K, 10K. So at AMP, I mean, we get really excited when we see a cluster that starts at something like 4K but can scale to 16K, 20K. That, I find, is the optimal shape today. Now, I'm pretty sure what's gonna happen is a year from today it'll be like, "Anj, what is that?" And it'll be like 8K, 10K. Yeah, I think that's where the space is. The trend is interesting.

SPEAKER_00

Um what would standardization of compute actually look like in practice? Is there a specific technical or regulatory move that unlocks it?

SPEAKER_01

So the grand unified version of this, right? The big idea would be a universal kernel. Yep. Right? Where you just load up a workload. I mean, imagine if you could just talk to Claude and say, hey Claude, this research, this ablation was really good. Deploy. And Claude, or some AI, goes and figures out how to deploy this on a large, like, multi-tenant cluster, does all the training, and you don't have to worry about which chipset it is, it just runs efficiently. Yeah, and then it should deploy auto-scaling on its own, where you just abstract away all the hardware. That universal kernel where flops are flops, and you don't have to worry about any of the infrastructure. That's the dream.

SPEAKER_00

Yeah, yeah, yeah.

SPEAKER_01

That's the dream.

SPEAKER_00

But at the same time, though, some of the vendors are not incented for that Uber kernel.

SPEAKER_01

You know, I think in the short term they may not be, but if you look at the history of the industrial revolution that we talked about, the major beneficiaries of the standardization of AC/DC, it turns out, were power generators, like power generation firms, because it allows more productivity, more stable consumption of the resource, which means more innovation, which means overall demand for electricity increases over time. So you have to look a little bit past the quarterly earnings grind to say, yeah, maybe this quarter we lose a little bit of market share in training, but we're getting more market share in inference because there's more fungibility. And then over time, instead of stock prices doing this and then this, we'll do this. And that's what everybody wants, right? We just want stable, predictable, fast-scaling growth curves. And I think the problem is that because of this lack of fungibility and because there are no standards, we keep getting this.

SPEAKER_00

Yeah, you get this jagged. Yeah, that makes sense. Let's see. Next question. If context is a new moat, what should a startup without proprietary data do? Is there any path in? What do you think?

SPEAKER_01

Oh, I think there are many, many paths in. I think one big opportunity, which we've been talking about in the class, is new frontiers. Yeah, there are so many data pools all over the place, including personal data pools. I think this is one of the most compelling things about Apple, right? On a latest-generation MacBook, you can do some data generation that is meaningful relative to three, four years ago. I've been working on my own project, which is basically generating data from my notes for the last four or five years. If you use an LLM to go back and read your notes and annotate, like combine timestamps with... I actually put my heart sensor data in from my ECG. Now suddenly you've got a representation that combines your daily calendar with your... Are you putting those just in, like, an MD file? Yeah, literally MD. No, no, no, no. The MD files are the prompts.

SPEAKER_00

Yeah, but like if you look at Obsidian, like you can store a lot of that data.

SPEAKER_01

A lot of my notes are in MD files, yeah. And then the ECG stuff actually is in CSVs and so on. So you gotta do a bit of data plumbing, but you know, the coding models are really good at this stuff, and so I just think there are lots and lots of frontiers where you can say, here's a new form, like a combination of different data sets that have never been represented before in a unique way, and you can increase intelligence on that very fast. So one is just new frontiers, new areas, new tasks, new capabilities that the big guys are not focused on. And then the second one is sensitive data. You know, that's where Mistral is so valuable, because there's a bunch of mission-critical government data, military data, for example, that their partners trust them with in Europe. But there are lots of other places where an enterprise or a customer might not want to share that data too.

SPEAKER_00

Yep, yeah, that makes sense. Um you said progress is fastest in easily verifiable domains. Which domains do you think are next to get unlocked? I mean, you mentioned material science. Well, yes.

SPEAKER_01

So here, definitely engineering, like physics and chemistry reasoning, we're finding is very verifiable. You know, the thing is that materials is a very general-purpose domain, because it combines physics and chemistry and it's very verifiable. But if you verify, like, materials properties, it turns out the models get better at science in general, physics and chemistry in particular. So I'm very bullish on that.

SPEAKER_00

Can I ask a question on that? So, you know, one of the big challenges in these labs, it comes down to the evals, right?

SPEAKER_02

Yeah.

SPEAKER_00

And so when you look in these other domains, like material science, how do the evals work?

SPEAKER_01

There are two or three different types of evals. One is just straight up, like in the case here, it's superconductivity. So you can just measure resistance of the material. So that's very verifiable. But in terms of engineering progress, I think the eval I often look for is: how much time would it have taken a scientist to do this task without the Periodic agent?

SPEAKER_02

Yeah.

SPEAKER_01

And then you do an eval, side by side, of the productivity gains in the rate of experimentation with the agent.
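Note (illustration, not from the speakers): the side-by-side productivity eval described here can be sketched as a per-task speedup ratio. This is a minimal sketch under stated assumptions: the function names are hypothetical, the timings are made up, and using the median to summarize tasks is a choice made for this example rather than anything the speakers specify.

```python
# Side-by-side productivity eval: time without the agent vs. time with it.
# The task timings below are invented placeholders, not measured results.
from statistics import median
from typing import List, Tuple


def speedup(baseline_hours: float, assisted_hours: float) -> float:
    """How many times faster the task was with the agent."""
    if baseline_hours <= 0 or assisted_hours <= 0:
        raise ValueError("durations must be positive")
    return baseline_hours / assisted_hours


def median_speedup(tasks: List[Tuple[float, float]]) -> float:
    """Median of per-task speedups; each task is (baseline, assisted) hours."""
    return median(speedup(baseline, assisted) for baseline, assisted in tasks)


if __name__ == "__main__":
    hypothetical_tasks = [(10.0, 5.0), (8.0, 2.0), (6.0, 6.0)]
    print(f"median speedup: {median_speedup(hypothetical_tasks):.1f}x")
```

A median is less sensitive than a mean to one task where the agent happens to help enormously, which is why it is used in this sketch.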

SPEAKER_00

Yeah, that makes tons of sense. I mean, I'm just thinking of the Evo 2 project over at Arc Institute. And I know that scientists are using that model to basically accelerate experiments.

unknown

Yeah.

SPEAKER_00

To your point. Yeah, yeah, yeah.

SPEAKER_01

Like that, yeah, that makes tons of sense. Okay. But the question was what other domains. So materials, physics, science here: verifiable, because reality is verifiable. Yeah. So wherever, you know, Periodic is doing verification from reality, I'm very bullish. Other areas: robotics is quite verifiable, because, as we're hearing from Andy, it turns out that's also physical verification, where you can actually tell whether robots synthesized something correctly or not. Because that's measurable. Basically throughout the industrial engineering world, there's a quantitative metric. There's a quantitative metric, very easy, very easy. What do you think about these qualitative domains? Creative writing is the hardest one. Yeah, so hard. Yeah, the models are still so bad at creative writing, and maybe that'll progress, but that's one domain I'd love to see students actually attack more, because I think if you curate the data correctly and you have enough compute, maybe we should be able to get good, tasteful writing models. We don't have them yet, but I would love to fund a project around that.

SPEAKER_00

Yeah. Um that's gonna become, like you're saying, like a very curated data set. Very curated data. Yes, yeah.

SPEAKER_01

And you don't need that much data to bootstrap some of the post-training flywheels. How much data do you need? I mean, with DeepSeek, it was a few thousand samples that were just really well crafted. So a few thousand samples. You know, I remember I was talking to a friend who was at OpenAI about a year ago who was trying to get access to all the memos at Andreessen Horowitz, like the investment memos over 10 years, and said, hey, if we could get access to that, it would be one of the richest data sets of hand-curated, handcrafted investment memos. Yeah, theoretically, you should be able to create a fine-tune of a model that's really good at writing like that. Yeah, yeah, yeah. And so that one I'm very bullish on.

SPEAKER_00

Like small... that example though, it's almost semi-structured. It's semi-structured, right? Yeah, so that gives it a little bit of an advantage. Yes, yeah, yeah. Next question. DeepSeek claimed to train a frontier model for a fraction of the cost. Does that break the compute-equals-revenue correlation you showed, or is it just a different point on the same curve?

SPEAKER_01

No, I'm not sure I understood the question.

SPEAKER_00

Could you read out? Uh yeah, I'll read it again. DeepSeek claimed to train a frontier model for a fraction of the cost.

SPEAKER_01

Yeah.

SPEAKER_00

Does that break the compute-equals-revenue correlation?

SPEAKER_01

I see, I see. No, no, no, it does. So look, doing things the first time is very expensive. Doing things a second or third time is always cheap in any domain. And one of the things about DeepSeek is they were not the first. I don't think they were the second reasoning model; it's like the third. Yeah. And so I just call that compute optimal scaling, which is, you know, once you know how to do it, the second or third time around is always easier and cheaper.

SPEAKER_00

That makes sense. Some of us are resource constrained, GPU poor. How can anyone contribute in the AI stack?

SPEAKER_01

Well, one, for students who are in the class, we're gonna be making compute available to them this quarter. And I think there's gonna be some donations as well from various partners in the class. So if you're a student, you'll be fine. You should fill out the form that I think was due yesterday. It was due yesterday. Okay. So then we're gonna be sizing up how much compute is required, and then, you know, we'll have some from the AMP pool. But if you don't have access to it, you just have to do research that is more fundamental in nature, right? You have to do algorithms research.

SPEAKER_00

Which is what a lot of academics are doing because, yeah, they are GPU poor now.

SPEAKER_01

Yeah, that's right. I think that's the only thing. You have to get creative, you have to get innovative, you have to do fundamental algorithms design, and there's so much to be done there. I mean, at the end of the day, the Transformer is pretty inefficient relative to some of the new architectures I'm seeing. I think over the next one...

SPEAKER_00

Well, there's also ways, I'm just thinking. So again, back to Open Athena, we're involved with the Marin project at Stanford. Yeah, Percy Liang has an open source LLM that's all open. Like, anyone can actually contribute to that, even if you're not at Stanford.

SPEAKER_01

Yeah, so using open models is a huge unlock, right? Because then you get to piggyback off of a bunch of money somebody else has put in. I think you'll see more and more pre-trains from at least three or four labs this year, which, you know, is a starting point. So Gemma just came out. That's quite a strong model. It's a pretty strong model because it's distilled from the main Gemini family. I think that the Chinese models aren't bad. Hopefully the Nemotron project will keep getting better. Mistral has a bunch of really good open base models. Black Forest has it. So I think for now, for this year, I'm feeling actually pretty confident that base models are there. Open base models are a good petri dish.

SPEAKER_00

So one of the things that we coming back to this Marin project that's open is like all the data that it's trained on. Yeah. For some of these other models, where you just get, let's say, the weights.

SPEAKER_01

You get the weights, and then you can do your own extended mid-training or post-training with your own curated data, and do RL on your own data, which is what the class is going to converge on: here's a good base model for your task, here's how you curate a data set, and here's the last mile.

SPEAKER_00

Yep. Makes tons of sense. Um, okay, bunch of questions here. Do you think these vertical AI startups, each with their own tools and interfaces, are AGI pilled? Or will we see a more general purpose agent interface that can handle tasks across the enterprise?

SPEAKER_01

I I think that what we're gonna see is more and more enterprises not want to be locked into a single model. And so I think enterprises choose trust, peace of mind. They choose the technology is another.

SPEAKER_00

It's gonna be probably like cloud, right? Large companies are already on, like, AWS, GCP, Azure, whatever, and it's likely gonna be a similar path.

SPEAKER_01

It will be multi-model in each major category. And I think enterprises will choose partners they trust, like, here's a cybersecurity partner, or... like, you're already starting to see that, right? With Mistral, ASML chose them as their primary AI partner. And there's a whole bunch of use cases and tasks across the entire company that Mistral is responsible for, but it's not a point solution, it's an umbrella partnership. And I think that's how a lot of this is working: we have been and are entering a world of systems procurement, not model procurement. Because it's just too complicated for enterprises to try and procure one model from this person... They don't have the sophistication. No, and it's painful, and the security of duct-taping them all together, yeah, it's a nightmare. And so, what I'm observing empirically, at least on the boards I'm on, is enterprise CISOs, CTOs, even CEOs, who are not in the AI business but want to bring the benefits of frontier AI systems to their business, are looking for a handful of trusted partners who are AI native, who get it. It's really true.

SPEAKER_00

Like, as you know, I still advise Mary Barra at General Motors. I mean, it's the same approach. I think you're right. Most large Fortune 1000 companies outside of tech are going down that path.

SPEAKER_01

That's right. Because models are a type of infrastructure, and like with any good infrastructure, you need redundancy. Yeah. So whoever you're gonna work with needs to have redundancy across different model providers. Yeah.

SPEAKER_00

Okay, next question. Claude Code doing 214k GitHub commits a day is wild. At what point does agentic compute consumption dwarf training compute?

SPEAKER_01

At what point is agentic?

SPEAKER_00

I'm not sure what the difference is, but, uh, agentic compute.

SPEAKER_01

Okay, I don't know what agentic compute technically is. Yeah, neither do I. I think that term, agentic, is painful, it's so overloaded. It really is. But directionally, if the question is when will inference compute overtake training, I think it probably already has. But this is the other problem: the distinction between training and inference is actually not that clear either, because you do so much inference for RL. The way you make the post-training happen is you do RL rollouts. On CPUs and sometimes GPUs, you do these rollouts. That's inference, because you're simulating the task happening, and then you take the rollout chains and pipe that back into your training run. That's a great point. Inference is part of the training budget. Yeah, I just see it all as one big compute budget that needs to be flexibly allocated across clusters; it's fungible, and that's where we're going. Training versus inference, RL versus not, GPUs versus CPUs: these distinctions were all point-in-time categorizations. It's fine to help us make sense of the industry, but where we're headed is that at least the teams with the largest workloads just want one big flexible pool of compute that they can... they have their preserved basics.
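Note (illustration, not from the speakers): the point that rollouts make inference part of the training budget can be sketched with rough FLOP accounting. This is a minimal sketch that assumes the common rule of thumb of about 2 FLOPs per parameter per token for a forward pass and about 6 for a forward plus backward pass; the function names, the model size, and the rollout counts are hypothetical, and real systems differ in utilization and in how many generated tokens are actually trained on.

```python
# Rough FLOP accounting for one RL post-training step.
# Assumptions (rules of thumb, not measurements):
#   generating a token (forward only)      ~ 2 * params FLOPs
#   training on a token (forward+backward) ~ 6 * params FLOPs
from typing import Tuple


def rl_step_flops(params: float, rollouts: int, tokens_per_rollout: int,
                  trained_fraction: float = 1.0) -> Tuple[float, float]:
    """Return (inference_flops, training_flops) for one RL step.

    trained_fraction is the share of generated tokens that are kept
    and used for the gradient update.
    """
    generated_tokens = rollouts * tokens_per_rollout
    inference_flops = 2 * params * generated_tokens
    training_flops = 6 * params * generated_tokens * trained_fraction
    return inference_flops, training_flops


def inference_share(params: float, rollouts: int, tokens_per_rollout: int,
                    trained_fraction: float = 1.0) -> float:
    """Fraction of the step's FLOPs spent generating rollouts."""
    inference, training = rl_step_flops(
        params, rollouts, tokens_per_rollout, trained_fraction)
    return inference / (inference + training)


if __name__ == "__main__":
    for kept in (1.0, 0.25):
        share = inference_share(7e9, rollouts=64, tokens_per_rollout=1000,
                                trained_fraction=kept)
        print(f"train on {kept:.0%} of rollout tokens -> "
              f"{share:.0%} of FLOPs are inference")
```

Under these assumptions, rollout generation is a quarter of the step's FLOPs when every generated token is trained on, and it becomes the majority once only a fraction of rollouts is kept, which is one way to see why a single flexible compute budget is a natural framing.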

SPEAKER_00

Put their workload on, and yep, makes tons of sense. Anthropic just hit a $14 billion run rate. At what point does the revenue flywheel make it structurally impossible for new labs to compete?

SPEAKER_01

Did it say 15? I thought they announced they had 30.

SPEAKER_00

Yeah, I thought they announced 32. So it's even bigger. But I guess the point is, when it grows so much faster and their revenue is accelerating, right, what does that mean for new entrants?

SPEAKER_01

I think it's very hard to enter into coding now. Yeah. Like Anthropic, OpenAI, Gemini coding. You know, in the case of mission-critical coding data sets, Mistral has emerged as one provider of coding models in government contexts and so on in Europe. And so I think that's the shape of the coding wars. I think that train has left the station.

SPEAKER_00

Yeah, I think you're right. Yeah. Next question. Is harness engineering becoming more important than LLMs, given some people say LLMs are becoming commoditized? I don't know if I quite understand the question. Is harness engineering becoming more important than LLMs? I mean, you're putting a harness around an LLM.

SPEAKER_01

Yeah, yeah. These are not symmetric comparisons. Yeah, yeah, yeah. Yeah. The harness is the thing that allows a system to work together, and an LLM is an individual component of the system. Yeah. So the short answer is, uh, not apples and oranges.

SPEAKER_00

If you want to try re-asking that question, go for it. Yeah. What do you guys think of the Claude Code leak? I was surprised at how many feature flags were turned off.

SPEAKER_01

Okay, I actually haven't had a chance to look at it, so I don't have an opinion on that.

SPEAKER_00

Yeah. Um, I don't know if I have a really strong opinion. I didn't go through it in detail, but feature flags were turned off. So clearly the company is just like kicking ass.

SPEAKER_02

I mean, yeah, yeah, they're good engineers.

SPEAKER_00

Yeah, I mean, we'll see what happens with Mercor on that same... that was tough.

SPEAKER_01

Yeah, because you have one of those security incidents and then data gets leaked, and enterprises start going, can I trust you? Can I trust you? I gotta bring this in-house. Yeah, so with this data stuff, I mean, there are many teams who can produce data, but it's more and more mission-critical data. So security is the moat, I would say. With enterprises in particular, and frontier labs, if you lose that trust, it's very hard to recover.

SPEAKER_00

I would say that's existential. I I think that's right. Is it possible to create a system to fully and accurately verifiably interpret an AI entirely? Current methods leave blind spots, and scaling continues regardless.

SPEAKER_01

I might need to be able to read the question, sir. Um right here. Accurately verifiably interpret an AI entirely, verifiably interpret an entirely. Yeah, I'm not sure. Um, okay, yeah. I think it the question is about like interpretability of the model, like mechanistic interpretability.

SPEAKER_00

If that's the question, or to understand what's actually yeah, why is the model doing what it's doing?

SPEAKER_01

Yeah, I think interpretability is one of the most exciting underexplored areas in AI research, because it ultimately improves reliability. I think fine-tuning and post-training are basically hacks as guardrails. It's super easy to jailbreak these models, right? And so a more systems-level understanding of what's going on in the model, how to steer it, which, you know, parts of the network are activating for different tasks, I think is quite helpful. Anthropic's had a number of blog posts in this space.

SPEAKER_00

They've done a lot of research. So if you're interested in that area, I'd recommend go look.

SPEAKER_01

Yeah, so mech interp is really exciting. There's a bunch of other techniques for interpretability. I'm very bullish on it. I think one of the challenges, of course, in doing interpretability research is that you need access to the model weights to do it well. And so it's kind of hard to do it if you're not at one of the labs, unless you're doing it with open models. And this is another place where I think, if our goal is to make models reliable more broadly for everybody, then having really good frontier open base models, so that universities and academic researchers can introspect and do open model research, is quite helpful. Or what I would love to see is some kind of industry-wide body where closed source providers make their models easily accessible to interpretability researchers before a broader release. So then you can do a bunch of interpretability research, test it, you know, publish results, and then we all get safer together. But yeah, I do think, deploying these things in mission-critical contexts, I'd feel a lot better if we had cracked interpretability. Yeah, yeah, yeah.

SPEAKER_00

I think that's right. The Windsurf acquisition showed labs will cut off API access overnight. How should developers build on top of foundation models if the rug can be pulled at any time?

SPEAKER_01

Yeah. Yeah, that's a good question. Um I don't know if there's a good answer to that. Yeah. I mean, basically, if you're dependent on one person for your infrastructure and that infrastructure goes out, it's a little bit like multi-cloud.

SPEAKER_00

I mean, it comes back to that same piece. The answer is you have to be able to support multiple models. Yeah. Yeah. And understand what models are most effective at certain tasks. A number of startups I've seen use an ensemble of models depending on what the activity is, right? I'm sure you see the same thing.
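Note (illustration, not from the speakers): supporting multiple models per task, with a fallback when one provider cuts off access, can be sketched as a small router. This is a minimal sketch under stated assumptions: `ModelRouter` and the two provider functions are hypothetical stand-ins, and no real vendor SDK or API is called.

```python
# Per-task model routing with ordered fallback across providers.
# The provider functions are hypothetical stand-ins for real model clients.
from typing import Callable, Dict, List

ModelFn = Callable[[str], str]


class ModelRouter:
    """Try each model registered for a task, in order, until one succeeds."""

    def __init__(self, routes: Dict[str, List[ModelFn]]) -> None:
        self.routes = routes

    def complete(self, task: str, prompt: str) -> str:
        errors: List[Exception] = []
        for model in self.routes.get(task, []):
            try:
                return model(prompt)
            except Exception as exc:  # provider outage, revoked key, and so on
                errors.append(exc)
        raise RuntimeError(f"no provider succeeded for task {task!r}: {errors}")


def provider_a(prompt: str) -> str:
    # Simulates a provider that has cut off API access.
    raise ConnectionError("API access revoked")


def provider_b(prompt: str) -> str:
    # Simulates a second provider, or a self-hosted open model.
    return f"[provider_b] {prompt}"


if __name__ == "__main__":
    router = ModelRouter({"code": [provider_a, provider_b],
                          "summarize": [provider_b]})
    print(router.complete("code", "write a unit test"))
```

Keeping the routing table keyed by task makes it easy to pair each activity with whichever model is most effective at it, while the ordered list per task supplies the redundancy.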

SPEAKER_01

Yes. So that's actually right, where you can have a coding model for your own company. Where you take an open model trained on your data that's specifically custom-tailored for your code base, your sort of stylistic preferences, and so on, and more importantly, sometimes for your security contexts.

SPEAKER_00

Um that's a really good point. And independently, and also what vertical you're in. Like if you're in a highly regulated space, you have to. You have to.

SPEAKER_01

You can't use like a it, you can't use the cloud. Yeah. I mean, if you're running a defense workload, yeah, it's gotta be running on a server in Virginia or whatever, that's fully managed by the government. Completely air gapped, completely air gapped, yeah. And so um I think there's a big market for that. There's lots of providers, like it's still an evolving space, but there's no really good. I I think Mistral has figured this out in the world.

SPEAKER_00

I was gonna say Mistral feels like they've, in Europe, pioneered a way.

SPEAKER_01

Yeah, in the US, I don't think anyone exists yet.

SPEAKER_00

Yeah, let's see. Um, what do you think of Meta's workforce reduction and six trillion a month Claude token usage? 16.

SPEAKER_01

6 trillion a month Claude token. I see. I haven't paid attention to what's going on.

SPEAKER_00

They had uh they had a leaderboard for token use and they took it down after two days.

SPEAKER_01

And why did they take it down?

SPEAKER_00

I'm not sure. They didn't explain why. I don't know if they didn't want to share how much they're burning. I'm not sure. Uh, who knows?

SPEAKER_01

Okay, yeah.

SPEAKER_00

Um, sorry, the only thing that was kind of interesting from that was that, uh, none of the leaders at Meta were really on the board. It was all, like, engineers. Oh, okay, that makes sense. It kind of makes sense, but yeah, people were kind of poking at that a little bit. Um, are we heading toward two separate AI infrastructure ecosystems, US and China? And what does that mean for the industry?

SPEAKER_01

That's been the case for, like, many years. No, no, the infra ecosystem in China has been completely different from the US for like 15 years now. At least 15 years. Yeah, once the Great Firewall was put up. Yeah, so I don't think that's new.

SPEAKER_00

Um, if the shift is toward trusted partnerships, how can trust be built when technology ecosystems generally reward rapid innovation?

SPEAKER_01

How can trust be built when technology ecosystems generally reward rapid innovation?

SPEAKER_00

Well, I think because trust is earned, so it takes time.

SPEAKER_01

Yeah, well, the way to scale trust is through open standards, right? That's what we're talking about in the class. Yeah, like here's the protocol, here's an open standard, TCP/IP, yeah, browser, ACDC, here are the standards, SOC 2 compliance, these are standards, and then you publish them, and then as long as the provider adheres to those standards, then whoever's consuming the model can trust that the model is compliant. So I think that's how you allow innovation to happen: the more those standards are open and accessible, the better. And then you let the people who are innovating focus on the things that matter, which is pushing the frontier in some new area.

SPEAKER_03

Yep.

SPEAKER_01

You know, at some point, you know, SOC 2 was really hard to do for software startups, and now there's this thing called Vanta and all these other things that do SOC 2 as a service, and now many of the startups can just get certification really fast. I think that's what'll happen. But you do need standards. That's the problem. We don't have standards right now.

SPEAKER_00

What spaces are you guys excited to see more people found in? I can take I'll take this one. Um I think so. I've been spending time lately with domains that are not necessarily historically super technical. So, like reinsurance. Uh I've been we're looking at what are the opportunities there. This this week I met with the CIO of Merck and just kind of brainstorming like what what are the unique opportunities there besides like drug discovery, like within the lab?

SPEAKER_02

Yep.

SPEAKER_00

I think you know, I'd encourage folks to like look at these domains that are kind of out in left field. Yes, yes, that you know, let's say historically have they maybe have had tech, but they haven't had like engineering like staff. So, like what does it mean to build an AI underwriter where you are taking all human judgment out?

SPEAKER_01

Yep. Um that's a huge one, actually, is how do you do so? Risk assessment is a really big one for sure.

SPEAKER_00

And obviously they've used models historically, but you know, that's old. Yes, and I think the point is like there's now more data sets you could use to go do those actual actual models, yes, um, which is pretty interesting, I think.

SPEAKER_01

And I think we have a lot of historical data, so you can run the backtest with verification. So that's verifiable. Yeah, so that is one. Um, I would say in the class, we talked about this yesterday with Andy, if you remember, about a good real-time video model being usable, like a multimodal video model being usable for computer use, for robotics, for physical understanding. So I think one area I'm definitely very excited about is, and there's a few good open video models coming, if you take those video models that are, um, yeah, just good multimodal video models, and you use them as action prediction engines, then it's not just generating the next frame, like the next video pixels, but it's generating the next action, right? And that action can be the keystrokes, you know, the keys you're gonna press on your computer, it can be the next arm movement for robotics, and an action is a pretty general purpose thing. So if you replace next token or next video frame with next action, yeah, that is a super powerful primitive. I was thinking probably the models in autonomous driving are closest. Autonomous driving, yeah, computer use. And this is interesting about interaction, because if you think about Andy's lecture, what he was trying to say is, you know, the old world was text to image, right? Prompt in, image out. But when you change the architecture such that a model can generate an action, and then you can take that action and pipe it back into the next step, you get this recursion, you get, yeah, this recursive ability, and that is so powerful because you could say, okay, where is there, like, a need for continuous automation with an action, right?
And I think that uh that that is why robotics is so is going is going through this huge revolution, industrial automation, wherever you've got a factory line that you have continuous data that needs to power some robotic movement or some change, it these action prediction models can bridge software and and and physical very efficiently in ways that used to take like sometimes years to automate before you you will be able to do in weeks and months with these action prediction models. Very, very bullish on that. And I think I think the good news is there will be a variety of open models that obviously BFL has a major investment in that area coming. And I and I think a lot of researchers should take their what they're gonna put out and try to customize it for action prediction. Um presumably there'll be actions like by particular domain. Yes. Um, so I don't know that those actions will generalize immediately without a little bit of last mile customization for a particular embodiment, for example. Like robots have different embodiments, you might have a biped, uni arm, whatever. That last mile integration takes a little bit of work, human work. And that's where there's a lot of deployment magic to do. But if you can hook up, if you have if you have a general purpose action prediction model and you can do the create, you can be creative about how to hook it up in a physical system. Um the returns there are massive.
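The loop described here, where a model predicts an action and that action is piped back into the next step, can be sketched in a few lines. This is a toy illustration only: `toy_policy` stands in for an action prediction model, `toy_environment` for a computer or robot, and both (along with the one-dimensional "move toward a target" task) are assumptions made up for the example.

```python
from typing import Callable, List, Tuple

Observation = Tuple[int, int]  # toy state: (current position, target position)
Action = int                   # toy action: a step of -1, 0, or +1

def toy_policy(obs: Observation) -> Action:
    """Stand-in for an action prediction model: step toward the target."""
    position, target = obs
    if position < target:
        return 1
    if position > target:
        return -1
    return 0  # already at the target, nothing to do

def toy_environment(obs: Observation, action: Action) -> Observation:
    """Apply the action and return the next observation."""
    position, target = obs
    return (position + action, target)

def rollout(
    policy: Callable[[Observation], Action],
    env: Callable[[Observation, Action], Observation],
    obs: Observation,
    max_steps: int,
) -> List[Action]:
    """Predict an action, apply it, and feed the result into the next step."""
    actions: List[Action] = []
    for _ in range(max_steps):
        action = policy(obs)
        if action == 0:
            break
        actions.append(action)
        obs = env(obs, action)  # the action's outcome conditions the next prediction
    return actions
```

Swapping the toy policy for a real model and the toy environment for a specific embodiment is where the "last mile" customization mentioned above would come in.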

SPEAKER_00

Yeah. Let's see.

SPEAKER_01

Uh, I've got one for you, Mike. You've been an exec at Microsoft, Apple, Twitter, and GM. How different is technical leadership at an AI native company versus a legacy institution?

SPEAKER_00

Oh boy, I'd say, um, very different. I'd say, like, when I was at Apple, you know, we were obviously focusing on building products, right? Um, and everyone was very technical. I mean, I worked closely with Craig. Craig Federighi is, like, deeply technical. Everyone really understands at a very detailed level how things work, right? What do we need to build? And, um, when I went to General Motors, working with Mary Barra, I'd say that, you know, she understood that the world, especially with EVs, is basically batteries and software, and she needed to bring in that talent. And so actually now we've got like 300 people in Mountain View working for General Motors. What's interesting though on the AI side, like, you know, I think Apple missed AI for a variety of reasons. Um, but I think a lot of CEOs like Mary, you know, even at the board level, they're having conversations on, you know, what are you doing in AI? And she knows that it's hard for her to attract the right people. Without those right people, it's hard to figure out what is the right use case to go tackle. And so I think this is where there's an opportunity for a lot of startups that could get created to work with a company like General Motors to go solve their particular problem, right? That hopefully generalizes to other domains. If that makes sense.

SPEAKER_02

Yes.

SPEAKER_00

That's why I think this rise of the forward deployed engineer, not rise, but, like, I think most of these successful AI deployments have a heavy services component.

SPEAKER_01

Uh, yeah, I agree with you. Forward deployed engineering was the term of art 10 years ago, 15 years ago, when I was graduating from college and Palantir was getting going. I think the difference I've observed is now it's forward-deployed research, because the people doing the integration actually need a fairly robust understanding of machine learning pipelines: how do you construct the right evals? How do you do RL with the right data sets? How do you get the representation right? And so it's a form of forward-deployed research more than it is traditional engineering.

SPEAKER_00

Yeah. Actually, I like that. I think that's right. I'm gonna steal that. Sure, you can have it. Um, you've both seen companies fail. What's the leadership failure mode that you've seen the most often that people don't talk about honestly?

SPEAKER_01

Oh boy. Oh, that's an easy one for me. It's culture. Yeah, I think culture is, like, it's so easy to mess up culture. Once it's messed up, it's so hard to recover, because once humans lose trust that, you know, the leadership of a company says it's gonna do one thing and it doesn't, you get a few passes, you know, a couple of times from your team. But the more times you say you're gonna do something, and then for whatever reason you didn't, or there's a deviation from plan, uh, your team loses trust in you, right? And that's why I think the mission alignment is so critical, where at every opportunity where you have a chance to show your team, like, hey, this is the mission that matters, we're gonna make pretty hard trade-offs. We may not have the biggest pay packages, we may not even have the best benefits and so on, but this is the mission that matters, and we've got the best team, and here's our unique recipe or point of view on why we can win. And along the way, it's so crazy how many, um, distracting offers happen along the way. I think about a company as a road trip, right? Yeah, you've got kind of a clear way. You kind of have a destination in mind, and then you have a car, and you're in the driver's seat as the founder, and you get your friends together in the car and say, We're going there, guys. And sometimes you get lost, you've got to take some detours and turns. But as long as everyone's like, this is fun because we're all in it together and we're learning and we're improving along the way, you get to your destination, everything is great. The minute the passengers start going, Why are we taking this detour? Uh, you know, my wife Vivian.
Um, we used to go on road trips together in college, and I'd be driving, and I'd take a meandering, like, you know, I'd just stop following the Google Maps for a bit because we were in Napa or whatever, and it was just scenic. I'd be like, let's go enjoy the countryside. And she's a very mission-driven driver. She's like, Why are we not taking the most efficient path there? And I had to explain to her that there's beauty in just enjoying the countryside, and then once she trusted that I would eventually come back to the highway, she was okay. And she's like, Okay, fine, this is one of his, like, meandering detours, but we're gonna get back on track. And then we just plan for a slightly longer road trip, and we also don't take detours when it's not worth taking because there's no scenic view. I love views, I love scenic views. Um, so as long as everyone's aligned on what the destination is and what are the terms of the protocol for when to deviate, then I think you get that.

SPEAKER_00

I think I would add to that though, that there are real things around, like, if you're an enterprise company, are you a painkiller or a vitamin? Because I've seen companies where, you know, the founders are really passionate, the culture is all aligned, but let's say what they're building is not a must-have for a company, then you will die. And then the third, I'd say, is around just dollars. Like, do you spend money as a company as if it was your own money?

SPEAKER_02

Totally.

SPEAKER_00

Yeah, because like I saw this, and I was involved with a company in '99, and I had to lay off a hundred people, 180 people out of 220 in 2001. Yeah, and it was brutal. And it was because there was no fiscal discipline from day zero. It's very hard to add fiscal discipline, like after like three years into a company, it's very difficult.

SPEAKER_01

No, you can't, these are one-way doors.

SPEAKER_00

The trust culture, you don't get to recover those if you make uh the I mean, arguably, the the fiscal discipline is a part of culture. It's like, yeah, does everyone in your company act like it's you know their own dollars that's being spent?

SPEAKER_01

And look, I think there's more and more ways to align that today. Like 10 years ago when startups were just taking off, the tools we had to have visibility into where people are spending and so on were just minimal, super minimal. I mean, it's actually remarkable, given how primitive all the tools were for company building, that Google ever got anywhere. Yeah, now there was a lot of bad behavior inside Google and so on as a result because people were spending. And I mean, you remember when we started working on venture together, the spend was out of control on some of the line items.

SPEAKER_00

I saw this even at Twitter. I mean, yeah, the spending on things that like didn't make sense to me. Uh I think you you know you invest in people, like that's where dollars go.

SPEAKER_01

Yeah, dollars, other things well in people, compute, yeah, compute, of course, spend on compute. Yeah, you know that that what I've learned is. Yeah, spending on people doesn't scale that much because it it hurts culture. Like, but I think what we're going through right now, big big, yeah, big teams are hard.

SPEAKER_00

I mean, actually, I you know, back even to the other question, I had 19,000 people at General Motors. That's a crazy number of people. Yeah, you know, I had 2,000 people at Apple. I mean, that was big, but like, you know, if I look at my team like that ran iMessage, it was like 25 people. I mean, it was tiny.

SPEAKER_01

This is also one of the issues with the layoff era we're in right now, where a lot of the layoffs that are happening now are being blamed on AI. It's not. No, it was just overhiring during the ZIRP era, and a lot of execs were like, I don't want to admit that we messed up. Yeah, but all those of us who knew what was happening, we were like, this VP of whatever does not need 2,000 people headcount.

SPEAKER_00

Yeah, but you just want to be able to see the dollars. I can remember a couple people, uh, that I was trying to hire at Apple who would get offers from Meta at like 2x what we were offering. Now, granted, Apple's cheap, right? But still, like, it was such a crazy era. So I would tell those people, you should go. Like if that's life-changing money, like, yeah, if you're aligned with the mission, yeah. It's that whole thing that our partner John Doerr used to say, you know, mercenary or missionary, like, where do you want to sit?

SPEAKER_02

Yeah, right.

SPEAKER_00

Uh this is a long one.

SPEAKER_01

Okay, uh 23. As LLM performance converges and raw weights become a commodity, is the primary economic moat shifting toward harness engineering, the agentic orchestration, tool integration, and state management layers that turn models into high-value product?

SPEAKER_00

I think this is another attempt at that question.

SPEAKER_01

Okay. I'm gonna just pause for a second and rant a little bit about this moat thing. This moat thing keeps coming up. Because I think venture capitalists have completely, like, overused the term moat. There are no moats, there are no technical moats. I think there are advantages, there are head starts, but the moats are culture, they're trust, they're relationships. It's enterprises trusting you to deliver real value. Maybe domain expertise. Domain expertise. Like, we're four years into this debate, guys. Let's stop debating what the primary economic moat is. There are head starts and competitive advantages, and then the moat comes from building a great execution culture that keeps innovating over and over again, where your vendors trust you, your partners trust you, your banks are lending to you at a lower cost of capital. I look to Visa and MasterCard, for example. Visa is such a great example. We've talked about this, right? Visa is one of the world's most successful businesses. What is their moat? It's just trust. They don't own the banks, they're not a government entity, but they stand for trust. If a transaction in a new market is running through Visa, you know that's secure. Right? And if they had a breach, then they'd be done. And I think models are no different. Like, there are some areas where model capabilities get you a head start, but at the end of the day, what the customer is paying for is not a single model. They're paying for, over and over and over again, the brand trust that you'll keep innovating and you'll stay at the frontier. Using Claude, why? Because for two years now, Claude's been the best at coding. If they had been the best at coding just for one quarter, they'd be in a very different place today. But they've created a way to repeatedly say, you know, we're just gonna keep staying the best.
And sometimes it takes us 60, 90 days because our compute supply chain is backed up, or we're raising some money. But by and large, we want to be the best provider of coding intelligence. That's what enterprises want to pay for. Yeah, is reliability. And I would say Codex has gotten a lot better, though. Yeah, so I think enterprises will pay for Anthropic and, like, the multi-model thing, right? I think they'll pay for Anthropic and OpenAI and Gemini and whoever's the best. And it's very hard to be the best at coding if you're not one of those three or four companies, like Mistral in Europe and so on. But this idea that moats are somehow static and steady state is, like, we gotta stop talking. We'll eliminate that term. Yeah, yeah.

SPEAKER_00

Do you want to do 24? We got 10 minutes here.

SPEAKER_01

Why would video models have better priors on action prediction than something trained from scratch? Well, video models are trained from scratch. I don't think this is an either/or. All right, where's the best place to be on the stack?

SPEAKER_00

Company, infrastructure provider, LLM, applications. I don't know. Where's the best place? I don't know what that is gonna be.

SPEAKER_01

Where's the best place to be on the stack?

SPEAKER_00

It depends on what you're interested in.

SPEAKER_01

Yeah, what do you want to do in life? What is your mission? Again, remember, it all comes back to mission alignment.

SPEAKER_00

Yeah, where do you want to spend your time? You have one asset on this, I mean this short time on this planet, exactly.

SPEAKER_01

And that's the thing where you say, I want to push the frontier of energy. Okay, then go build an energy company solving the energy bottleneck for this AI era. If you see yourself as an amazing chip designer, you want to push the frontier, you want to be known on your tombstone as here lived Mike, the best chip designer, fine, let's start a chip company. Um, that's the heuristic.

SPEAKER_00

How do you see the boundary between application code and infrastructure evolving as AI models become a core dependency in most systems? I think it's a pretty interesting question because it is blurring.

SPEAKER_01

How do you see the... Rosie, I need some more water. Sorry. How do you see the boundary between application code and infrastructure evolving as AI models become a core dependency in most systems? This boundary does not exist, guys. Like, okay, let me try and actually be a little bit more patient. There's the harness, which is a system that allows people to plug in different models in different parts of the stack. And you could call that, I guess, an application. I just think reasoning about apps versus infra doesn't make sense to me. Yeah. What does make sense to me is reasoning about domains, task distributions, context feedback loops that are kind of up and down the stack, like model interface, application deployment. Like, the eval is ultimately an outcome eval.

SPEAKER_00

And and so it's interesting because I actually read that question a little bit differently.

SPEAKER_02

How did you read it?

SPEAKER_00

Well, I think there's this interesting thing I'm noticing with, like, Claude or Codex, where these LLMs, like, you know, as a developer, and I've been doing this for like 25 years, you build abstractions and APIs, but the reason why we did that was basically to enable code to be more human readable and easier to work with. Yeah, that's kind of gonna go away, I think. Like, with an LLM, you don't need an abstraction. Um, basically it's melding, like, the decomposition that we used to have architecturally will probably still exist to some degree, but it's gonna go away.

SPEAKER_01

So I agree that the most natural way to just get work done is you don't have to look at code. You just tell somebody to go do it. So let's say you can just tell your agent to go do it, and then when you want to double check the work, okay, then you go look at the code, you look at the VM. So imagine some kind of interface like Slack, which is just agents, and they're doing the coding, and you're observing, you've got an observability pane, and you're like, okay, can you show me your work because something's off here? Then you introspect the VM and the code.

SPEAKER_00

It should be no different than if you were leading a team.

SPEAKER_01

Exactly. Yeah. So this infra versus app thing, uh, goes away. It goes away. It's just which agents are you hiring? What tools are you using to do introspection? You know, what systems alerting do you have to say, hey, this code was written by this team of agents, you may want to double check it, you know. And that's the solution space I'm really excited about because that's the solution space humans want to interact with, right? The interface layer. Um, but this infra versus apps thing, I think, is an old school way. Sorry, sorry to insult them.

SPEAKER_00

No, no, not at all. Yeah, I'm crusty, man. No, that's not true. I've got more gray hair than I realized too. Uh, here we go. Um, how can a solo founder design organic in-product distribution into their MVP so that their first hundred users naturally recruit the next thousand without requiring active marketing? Well, I mean, this is like the classic network effect. I mean, it depends on the product. I mean, you know, if you can get, like, an invitation-based service, right? Then you get a network effect, and you can do that more product-led growth or PLG.

SPEAKER_01

Yeah. Like, a good example of this was, um, Arena. LM Arena was, you know, a side project at Berkeley a couple of years ago, and it was a testing app. It was basically a Gradio. Like, it was just a, you know, research front end. And Anastasios and Wei-Lin and Jan, they were out of Ion Stoica's lab, were looking for some compute credits. I was at, uh, a16z at the time, and so we gave them some open source grants. And what I observed was this interesting network effect, that, you know, a few of the labs were putting up pre-release checkpoints on Arena to try to get evaluations for how good the model was before they released. These were stealth models. And then it would go viral on Twitter that there was a stealth model on Arena. Oh, then it would drive. So then people would sign up for Arena and then they would test the models, and that would give the labs the eval data they need to decide whether that checkpoint should be released or not. In fact, one of the guys who gave them their first break, uh, is Liam Fedus, who was the co-creator of ChatGPT and set up the post-training flywheel at OpenAI. And he's the co-founder of, uh, Periodic Labs, which is where we're sitting. But that is when this idea of post-training as a loop began. To build that flywheel. Yeah, and that's the network effect, which is once somebody had successfully, you know, tested their model on the Arena before release, then the next lab wanted to do the same. And so then, you know, Gemini wanted to. Just compounding. Uh, and so that's an example of how they did some interesting research on AI, uh, evaluation. They put it out there as an open project, the community started using it, and then this, like, network effect built and built and built. And today, you know, Arena is being used by enterprises to do in-domain safety and evaluation testing for coding and for image and all these, you know, different custom evals.
But um interesting that project started as a again, side project, open open community project, and I think there's tons to be done like that where you can get a community network effect going and get that growth without marketing. Let's see, we got four minutes left. How does AMP manage idle compute? Or is the assumption that all nodes will be 100% utilized? It is absolutely not the assumption that it'll be 100% utilized. That's the goal. But you know, my co-founder Sebastian on the engineering side ran the scheduler, built all the internal scheduling infrastructure at Google, which is the Borg, Xborg, GQM, which he co-designed with Mihai, who joined us as well from Google. And you know, node utilization at Google, I would say which is best in class in the world, is 95 plus percent. I would say node utilization in the independent ecosystem, especially if it's singleton and clusters, is more like 50%. 50, yeah, 60, yeah. So that's a huge amount of wastage. And so um we have a you know, that's what we call the grid. The grid does all that load balancing and scheduling allocation across the AMP ecosystem. And we consider 94% a major outage because um that's the scale at which we're starting to run across our portfolio companies, teams, etc. At Google, I think 96% would be considered a major node outage because of the scale. And so that's that's how we we we have a dynamic uh sort of allocation system called the grid, which pulls capacity across different clouds and vendors and providers and reallocates them in a in a dynamic way.
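To make the utilization gap concrete, here is a back-of-the-envelope sketch of how much capacity sits idle at the rough figures mentioned above (about 50% versus 95%). The cluster size and time window are made-up numbers for illustration, and `idle_node_hours` is a hypothetical helper, not part of any scheduler described in the conversation.

```python
def idle_node_hours(nodes: int, hours: float, utilization: float) -> float:
    """Return node-hours left unused for a given average utilization (0.0 to 1.0)."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0.0 and 1.0")
    return nodes * hours * (1.0 - utilization)

# Assumed example: a 1,000-node cluster over one day.
NODES = 1000
HOURS = 24.0

idle_at_50 = idle_node_hours(NODES, HOURS, 0.50)  # independent-ecosystem figure quoted above
idle_at_95 = idle_node_hours(NODES, HOURS, 0.95)  # best-in-class figure quoted above

print(f"idle node-hours at 50% utilization: {idle_at_50:.0f}")
print(f"idle node-hours at 95% utilization: {idle_at_95:.0f}")
```

Under these assumptions, the 50% cluster wastes about ten times as many node-hours per day as the 95% cluster, which is the gap a pooled scheduler is meant to close.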

SPEAKER_00

I think that's it. All right. Office hours number one, done. See you guys next week. See you guys next week, have a great weekend. Cheers.