Frontier Systems

Anastasios Angelopoulos of Arena - Office Hours, Episode 3

Anjney Midha Season 1 Episode 3


In this "Office Hours" episode, Arena co-founder and CEO Dr. Anastasios Angelopoulos discusses how a Berkeley side project became the gold-standard evaluation platform used by every frontier lab. He explains that benchmarks fundamentally cannot capture post-deployment reality, and what happens when a model meets tens of millions of real users doing real work across coding, creative writing, image editing, and increasingly agentic tasks. Angelopoulos argues Arena is structurally ungameable because the question distribution is constantly refreshed by new users, and shares the origin story of NanoBanana - Google's stealth image model whose viral run on Arena marked the first moment the Gemini app inflected globally with consumers. He unpacks the safety-versus-steerability tradeoff in rankings (suggesting LLMs may eventually need movie-style ratings) and why neutrality is not just ethics but the economic foundation of Arena's business. Looking ahead, he predicts long-running agents will become the central unit of work, fundamentally changing what reliability means.

SPEAKER_02

Welcome everybody to CS 153 Office Hours with Mike and me. We're super lucky to have here today one of my um favorite people in the world and who I think is um behind the scenes toiling away to make sure that frontier AI remains reliable and has been for years, quietly. Um and they also have a bunch of great swag. This is, for those of you who can't see, this is um the Arena swag. It's the lab notes book that I just got, and it is fantastic. It makes it very nice to write on. So thank you for the free swag, Anastasios. Um Dr. Angelopoulos is the founder, co-founder and CEO of LM Arena, now known as Arena, which originally started as a um side project while he was finishing up his PhD at UC Berkeley with Wei-Lin Chiang. The project started um as a collaboration with Professor Ion Stoica, who is a legend, of course, in the systems space at Cal, and also one of the co-founders of Databricks, one of the most iconic infrastructure businesses of all time. Um the project was originally called LM Arena, and then uh after Anastasios and Wei-Lin turned it into a commercial project, they cleaned up the brand and called it just Arena, which I think is fantastic. Um, thanks for joining us.

SPEAKER_00

Thank you so much for having me.

SPEAKER_02

Um Should we jump right in? We're we're gonna or or do you have anything you want to? How about we start with what is Arena?

SPEAKER_00

Arena is an evaluation platform. Um it started as a student project, as you said, at Berkeley. Uh and at the time, we just didn't know how to evaluate LLMs. Uh and it had just ChatGPT had just come out. And so some students at Berkeley had this idea of why don't we put ChatGPT side by side with you know, an open source model that was being built at Berkeley and an open source model that was being built at Stanford, and um see uh which one's better. Just you know, chat with them and see which one's returning a better response. And the platform that we built on that principle grew and grew up until the point that it became the central gold standard evaluation benchmark in AI, still used to this day by all of the major Frontier labs. And what distinguishes it from a standard benchmark is that we have tens of millions of users that are constantly giving fresh voting feedback that's telling us which models are best on their real use cases. So it's impossible to overfit and represents the voices of millions and millions of real users.
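
As an illustration of how side-by-side votes like the ones described above can be turned into a ranking, here is a minimal sketch that fits a Bradley-Terry model to (winner, loser) pairs. The transcript does not say which statistical model Arena uses; Bradley-Terry, the function name `bradley_terry`, the model names, and the vote counts are all assumptions made for this example.

```python
from collections import defaultdict

def bradley_terry(votes, iters=200):
    """Fit Bradley-Terry strengths from (winner, loser) pairs.

    Uses the classic minorization-maximization update:
        p_i <- wins_i / sum_j( n_ij / (p_i + p_j) )
    and renormalizes so the strengths sum to 1.
    """
    models = sorted({m for pair in votes for m in pair})
    wins = defaultdict(int)         # total wins per model
    pair_counts = defaultdict(int)  # number of battles per unordered pair
    for winner, loser in votes:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1

    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        updated = {}
        for i in models:
            denom = 0.0
            for j in models:
                if i == j:
                    continue
                n_ij = pair_counts.get(frozenset((i, j)), 0)
                if n_ij:
                    denom += n_ij / (strength[i] + strength[j])
            updated[i] = wins[i] / denom if denom else strength[i]
        total = sum(updated.values())
        strength = {m: s / total for m, s in updated.items()}
    return strength

# Hypothetical votes: each pair of models meets 10 times.
votes = (
    [("A", "B")] * 8 + [("B", "A")] * 2
    + [("B", "C")] * 7 + [("C", "B")] * 3
    + [("A", "C")] * 9 + [("C", "A")] * 1
)
strengths = bradley_terry(votes)
ranking = sorted(strengths, key=strengths.get, reverse=True)
print(ranking)
```

Because every new vote comes from a fresh prompt, a fit like this can simply be rerun as votes arrive, which is one way to read the claim that the signal stays fresh.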

SPEAKER_02

And um, you know, early on, I remember when you guys launched the project, you said you're launching Arena to help move the frontier of AI reliability. Could you talk a little bit about what that means? Why, what does arena have to do with AI reliability? It's such a big word.

SPEAKER_00

Well, the truth is that still to this day, nobody knows how to reliably deploy AI. And part of the issue is that we lack measurement. Um and the particular type of measurement that's most difficult to get is post-deployment. So, what happens when I take my AI and I put it in a real situation in the wild in front of real users? What's going to happen? Is it going to, you know, sort of go wacky and do unexpected stuff? Um, is it going to remain factual? Is it going to be good in different types of workloads, whether that's coding, whether that's creative writing? And the space is just so wide of possibilities that it really needs new methodology to help us understand uh what reliability even means. Um, because it's not just does the code compile? Um response needs to be sort of qualitatively and quantitatively better along many axes to be more reliable than another. Uh and so Arena is a platform built with that in mind, really, that the post-deployment capabilities of models is what we should be targeting.

SPEAKER_02

Um, could you talk a little bit about why it made sense to pursue this as an academic project? Like why did this have to happen independently outside of you know one of the frontier labs? What structurally about this problem about reliability sort of necessitated it to be a uh a project that was birthed in a university environment and not sort of a traditional industrial lab environment?

SPEAKER_00

Well, at the very core of it, every game needs a ref. Like players don't self-adjudicate their own game because they have incentives. And so, yes, labs have internal evaluation teams. Some of them have very sophisticated internal evaluation teams, some of them less so. Um, but either way, what you can't trust is that a lab is going to self-report their results correctly. Um, instead, you need sort of a neutral third-party entity to help customers uh make better decisions about which AIs are strong and weak in different areas. Um and so it needs that trusted, neutral, independent third-party voice, which is what we provide.

SPEAKER_02

Uh, Mike, do you have any questions?

SPEAKER_01

Yeah, well, we've got a question on on the from someone that's watching that when we go through that. So the question is how do you account for safety differences affecting arena rankings, particularly in image generation? Could models that comply with unethical requests be unfairly ranked higher or more capable?

SPEAKER_00

It's a great question. And it's actually um it's a regression question because there is fundamentally a trade-off that humans like their questions or their responses to be answered. They don't like refusals. People just do not like refusals. So it certainly affects the way that rankings are calculated. Now, I'll start by saying that on Arena, we we have a baseline level of safety that we expect all providers to comply with. So, for example, no CSAM. And we use, you know, we have a filtering pipeline for this, so on and so forth. So it's pretty aggressive. Actually, some of our users feel it's too aggressive, to your point. Um but within that, it's also possible to disaggregate models in terms of the basically the steerability versus safety trade-off. Um at the same time, it is a value judgment that the user needs to make as to what level of safety they would like. Some users are there for NSFW purposes, right? They're like using chatbots for that. And you know, let's we might not want to do that, but we shouldn't also yuck their yum. That's sort of the game that they're trying to play. And so, probably the right way of doing this in the long run, we've considered doing this at Arena and probably will at some point, would be to think about it more like movie ratings, where you can think about an LLM as being like G-rated or PG-13 or rated R or whatever. And then within those classes or subject to those constraints, you can think about what the rankings are. Um, but we we don't have that currently, we just have the basic safety filter.
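
The movie-rating idea can be sketched as ranking only among models at or below a chosen rating. Arena does not have this feature, per the answer above, so everything here is hypothetical: the rating labels, the `leaderboard_within` helper, the model names, and the scores.

```python
# Hypothetical content ratings, ordered from most to least restrictive.
RATING_ORDER = ["G", "PG-13", "R"]

def leaderboard_within(models, max_rating):
    """Rank only the models whose rating is at or below max_rating."""
    allowed = set(RATING_ORDER[: RATING_ORDER.index(max_rating) + 1])
    eligible = [m for m in models if m["rating"] in allowed]
    return sorted(eligible, key=lambda m: m["score"], reverse=True)

# Made-up entries for illustration only.
models = [
    {"name": "model-x", "rating": "R", "score": 1310},
    {"name": "model-y", "rating": "PG-13", "score": 1290},
    {"name": "model-z", "rating": "G", "score": 1275},
]
pg13_board = [m["name"] for m in leaderboard_within(models, "PG-13")]
g_board = [m["name"] for m in leaderboard_within(models, "G")]
print(pg13_board, g_board)
```

Under this framing, a model that refuses less is not compared with a stricter one unless the user has opted into that rating class.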

SPEAKER_02

Cool. We have a whole queue of student questions piling up. So are you fine if we take those one at a time? Dude, whatever you want. Okay, so the first is you did undergrad at Stanford EE, PhD, and postdoc at Berkeley. What did each environment teach you that the other one couldn't? And what should students here be deliberately seeking outside of Stanford? No one, no, no one's uh well, I've definitely forgiven you for getting your PhD at Berkeley. And as I have learned, in many ways, Berkeley is much further ahead than Stanford. Um, but you know what? I'll let you answer.

SPEAKER_00

Yeah, sure. No, I uh I have a strong allegiance to both schools. Um so I came to Stanford as an undergrad. I had really no idea what technology or the business of technology was. Um, you know, I grew up mostly in Los Angeles, um, and so I was really in tune with the fine arts, but not so much in tune with technology. Um and so Stanford, and I was also not really in tune with the level of sort of wealth and power that the techno, the technology industry has. And so it was new to me. I came to Stanford and it was like, what the F is going on? Like, who are all these people that are just like the sons and daughters of billionaires? I was like, I how am I meeting these people? Like, I'm sure that you got you had the same experience, Anj, um, as an as an undergrad. I I I was a bit more clueless than you are, but no, no, I I was clueless, and then I started seeing what was happening, and I was like, this is crazy. And I didn't really want any part of it. You know, I did I was interested in helping people be healthier. I did EE because I wanted to first because I was interested in the connection to music because of like signal processing, and then um to medicine. And then at some point, I realized that I didn't feel like my undergrad education had given me enough um sort of foundational uh understanding. And so I decided to do a PhD, and the PhD I decided to do was in theoretical statistics, theoretical machine learning with Mike Jordan and Jitendra Malik, who I thought could give me a good education. Um, basically, my goal was just to train my mind to be a strong fundamentals thinker for the rest of my life. That's why I did my PhD. Stanford was way more entrepreneurial, it was like in the blood, it's in the water. It's just like swimming everywhere. At Berkeley, because I did a more theoretical PhD, I wasn't seeing the entrepreneurial side of things as much until I started working on Arena with Ion and in Sky Lab, and you know, Sky Lab, formerly RISELab, formerly AMPLab.
And that ecosystem was just like, whoa. So I mean, it's so industrial, right? Um, so my experience there was quite extreme. Um, now my wife is a professor at Stanford. You know, my my sort of plan A was originally to be um a professor at Stanford, and I'm still sort of deferring a slot there. But uh it's uh it's an amazing place too.

SPEAKER_02

Well, unlike you, you actually made it through your PhD. I dropped out, and so I remember when I was at your wedding earlier last year. Um, it was cool to see how proud your parents were of the fact that you were Dr. Anastasios. And I was like, oh, this is what would happen if I had actually had the the brains to make it through my PhD like like Anastasios.

SPEAKER_00

I'm sorry you missed out on the parental. Yeah. I'm sure they're proud of you, buddy. It was Mike's fault.

SPEAKER_02

It was all Mike's fault.

SPEAKER_01

Yeah, it's my fault.

SPEAKER_00

But uh how did you how did you get that to happen?

SPEAKER_01

Well, I I tried to I convinced them to uh that Kleiner was a better use of his time than finishing his degree, which his parents probably always hold against me. But Mike was also a PhD dropout.

SPEAKER_00

Oh nice, congratulations.

SPEAKER_02

Mike dropped out of his band from XPHD too. So of the three of us, only one of us had the endurance to make it through.

SPEAKER_00

Some people, some people think that the the dropping out is actually more of an accolade than the award, you know, than the actual degree.

SPEAKER_02

Well, to be clear, I didn't even make it to the PhD. I barely made it to quals because I was in the master's, the coterm program at Stanford, and Mike sniped me right before I could finish that first year and make it to the PhD. So that it this is there's a progressive funnel problem here. Um, but you you lasted the longest. Okay, we'll go to the next one. Um, question two, you're hiring ML scientists and engineers aggressively. What does a strong Arena hire look like at the new grad level versus what you'd want from someone with five years of industry experience?

SPEAKER_00

You know, that's a great question. Um, for new grads, uh you know, the the definitely the qualities are different. I think we'd be looking for people who are super hungry, super low ego, willing to do anything, um, but also people who have demonstrated a serious ability um to uh like produce something important in some way, like whether that's writing a great paper or whether that's creating a great side project that's you know uh relevant and modern, um, those are the sorts of things that I would I would look for. And you have to have strong, strong, strong fundamental skills to be an ML researcher.

SPEAKER_02

Um, Mike, do you see the queue? Feel free to ask the next one.

SPEAKER_01

Yeah. In five years, do you expect Frontier Labs to still be releasing one general purpose model that tops a single leaderboard, or has the field fragmented into specialist models that each dominate their own occupational arena?

SPEAKER_00

It's a great question. Um, we see examples of both. Uh the industry itself definitely is fragmenting in the sense that multimodal versus also different, you know, niches, model providers are basically carving out different niches for themselves. Like OpenAI, of course, has done super well on consumer. Google is doing well in the prosumer market with the sort of docs and so on and so forth that Gemini has access to. Anthropic is doing really well on the coding market. Uh, and then there's all these multimodal providers like BFL and like Luma and like you know, many of the Chinese labs like ByteDance uh and so on. These labs are uh carving out niches for themselves in video and with image generation and image editing and you know, video editing and so on. Um, so we see fragmentation at that layer. At the same time, in the big like mega labs that are training massive models, there's definitely a tendency for the ultimate model that they train to be an omni model. That is happening at um many of the top, many of the top labs.

SPEAKER_02

How far are we from the first omni model from Google coming out publicly?

SPEAKER_00

Oh, I mean, who says it hasn't already?

SPEAKER_02

Oh, that's fair. Actually, Nano Banana Pro 2, Nano Banana was an omni model, right? Essentially, it was like a jointly trained image and uh text model inspired by FLUX Kontext, which was the first uh real in-context editing model, but uh it wasn't branded omni, I guess. Actually, you know, I think it'd be a fun story for you to talk about the history of Nano Banana on Arena, because I don't think most people know that.

SPEAKER_00

So, yeah, nanobanana, you might not be familiar with where the name comes from. So, why is this thing named nanobanana? And so the reason is that on arena, we have a process for labs to release models in stealth, uh, which means that model providers will come to us before actually releasing their model and they will put it on arena under a code name to let the community use it and to get evaluations, which we send back to them about how people are using the model and how good it is with respect to their peers, so on and so forth. And so the code name, you know, sometime last year, you know, late last year, Google approached us with this image editing model under a code name. And the code name was Nano Banana. Okay, how they chose that name, it was because one of their PMs has sort of a nickname that's related to Nano Banana, and so it was just a funny random choice on their side.

SPEAKER_02

Yeah, her name is Nana, and she loves bananas. And we did, I think, a nano banana party here at Liberty Outside my uh where you know she brought this amazing banana cake. Do you remember that? Were you there for that? I do remember that out there, yeah. So shout out to her for lots of banana cards.

SPEAKER_00

So they so they named it nano banana, and then it went completely ape shit viral on arena. Like we're talking like high tens of millions of like of mouth that are just like here on arena using nano banana, and it just became such a like global sensation um that uh that they just branded everything nano banana from there on out. And so now Google has all this banana brand because of the code name that they put on Arena.

SPEAKER_02

Her her pudding was excellent. The net the banana pudding like was amazing. So again, shout out to Nana, but more importantly, that was the first time where uh I think ChatGP uh the uh the Gemini app like started the started to inflect globally with consumers, that's um right with and started to sort of pick up and catch up to ChatGPT. What's your lesson from that experience? Like what was the you know, typically people think of arena as just a pure sort of model testing platform or evaluation platform, but that was the first thing where I realized it was actually a way to sort of for for the community to test the reliability of product value as well. Does that does that make sense?

SPEAKER_00

Yeah, I'd say my main takeaway. So Arena is actually a much larger consumer platform than people realize. It's got you know mid tens of millions of monthly visitors, it's bigger than like Meta AI and bigger than Manus and bigger than Genspark and bigger than like all these platforms. Uh it's bigger than Hugging Face. It's it's actually quite large. Um, and so when you release something on Arena, it really does give a sense of how it will perform on real consumers, um, on real prosumers, because most of the platform is like on on coders and you know, it has a lot of lawyers and it's got a lot of scientists on it. Um and I'd say that the my takeaway from Nano Banana, and this has been it's not just Nano Banana, but it's been replicated many times, is that um there's almost frictionless switching in this space, and people are really hungry for the best capabilities. And so market share is a really temporary thing. It's just people will flock to the smartest models, and that's why the capex is so high in this area. That's why people want to, it's like so important because if people didn't care, why spend a hundred billion dollars training the next version of whatever and the and but the reality is that people actually do care. People are at this point in history, people are still switching frictionlessly between interfaces to find the most intelligent model, the model that's best for their needs. And that's where a lot of the opportunity, I think, entrepreneurially lies. Um, is that if you can compete in this um really uh yeah, this like really exuberant market, you can just win users.

SPEAKER_02

I I won't go into sort of the endless rabbit hole of opining on how long we think this will last, but it's been four years and it hasn't it hasn't stopped.

SPEAKER_00

So we should we should uh it's only accelerating because the because the product surfaces that it's touching is only accelerating.

SPEAKER_02

Yes, that's right. The diffusion is still super early, from what what I can tell.

SPEAKER_00

Yeah, I mean, you know, from Anthropic, Anthropic has gained like is basically tripled its revenue in the past two quarters.

SPEAKER_02

Yep. We're gonna all right next question. Go ahead, Mike. Yeah.

SPEAKER_01

Um are there any risks to the ranking system of Arena where user preference may not fully align with model capability?

SPEAKER_00

I would say arena provides a particular view on model capability, which is what happens when you take your model and you expose it to a large amount of heterogeneous real users that they're doing their real tasks with it. And then what's their what's their basically stated preference with respect to model performance? Um, it's not the only way of viewing model performance, and there's many other possible signals to extract, some of which we're working on now. So, for example, we have a factuality grader that we're working on that we'll eventually release. Um, we have other implicit signals that we look at. We also do style control to try to um you know adjust for the effect of response length and for markdown and style uh on our leaderboard, because we know that sort of humans tend to like longer responses. So we adjust for that and make sure that that doesn't affect the ranking. Um, but I but there's there's many possibilities for for signals and and arena, um, although expanding is only providing one.
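
One common way to adjust for response length is to add a length covariate to the pairwise model so that model strength is estimated with length held fixed. The transcript only says that Arena adjusts for length, markdown, and style; the logistic formulation below, the function name `fit_style_controlled`, and the battle data are assumptions for illustration, and only the length term is modelled.

```python
import math

def fit_style_controlled(battles, lr=0.5, steps=2000):
    """Fit P(a wins) = sigmoid(beta[a] - beta[b] + gamma * len_diff).

    battles: list of (model_a, model_b, len_diff, a_won), where len_diff is a
    normalized length difference (a minus b) and a_won is 1 or 0.
    Plain gradient ascent on the average log-likelihood.
    """
    models = sorted({m for a, b, _, _ in battles for m in (a, b)})
    beta = {m: 0.0 for m in models}
    gamma = 0.0
    n = len(battles)
    for _ in range(steps):
        grad_beta = {m: 0.0 for m in models}
        grad_gamma = 0.0
        for a, b, len_diff, a_won in battles:
            logit = beta[a] - beta[b] + gamma * len_diff
            p = 1.0 / (1.0 + math.exp(-logit))
            err = a_won - p  # gradient of the log-likelihood w.r.t. the logit
            grad_beta[a] += err
            grad_beta[b] -= err
            grad_gamma += err * len_diff
        for m in models:
            beta[m] += lr * grad_beta[m] / n
        gamma += lr * grad_gamma / n
    return beta, gamma

# Made-up battles: the longer answer wins 80% of the time regardless of model,
# and model X happens to be the longer one in 40 of 50 battles.
battles = (
    [("X", "Y", 1, 1)] * 32 + [("X", "Y", 1, 0)] * 8
    + [("X", "Y", -1, 1)] * 2 + [("X", "Y", -1, 0)] * 8
)
raw_win_rate_x = sum(won for _, _, _, won in battles) / len(battles)
beta, gamma = fit_style_controlled(battles)
print(raw_win_rate_x, beta["X"] - beta["Y"], gamma)
```

In this toy data the raw win rate favors X, but once length is in the model the strength gap between X and Y is close to zero and the preference is attributed to the length term instead.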

SPEAKER_02

There's a question here, which is you're hiring ML scientists and engineers aggressively. What does a strong arena hire look like at the new grad level versus what you'd want from someone with five years of industry experience? Five years?

SPEAKER_00

That was asked earlier, though. Yeah.

SPEAKER_02

Oh, did we already go through that? Sorry, I was reading the other one. Okay, then we've got occupational arenas like medicine, legal, and finance require domain experts whose time is expensive. Is the arena methodology actually portable to those settings, or does expert evaluation need a fundamentally different approach?

SPEAKER_00

Oh, it's I think it's definitely portable. Yeah, we see a lot of experts on Arena right now. Um, and they are asking very sophisticated questions. Um and yeah, I think that I think that we see high quality signal, and labs are telling us that they see high quality signal from those those evaluations. Um we can always do better. I think the the the fundamental thesis of Arena is that if you want to understand the real world capabilities and reliability of AI, um, you need to meet users where they are in their workflow. And so, you know, the the biggest criticism I would say of Arena is that the pairwise battle interface may not always be um relevant to somebody's workflow. And so we're working on going beyond that. You know, now we have agent mode that's been released to probably 10% of our users now. That's a single-threaded evaluation. And we haven't released the leaderboard yet, but we will soon. That is a great example of a single-threaded, um, single-threaded form of feedback collection that we're, you know, that we can do at very large scale. Um, and it also combines explicit and implicit preferences. So maybe that's a slightly better interface for experts, but we already have good signal on. The users are already on the platform, and the goal would be to solicit, you know, basically to like uh give them a useful product so that through their organic usage, we can mine whatever feedback is necessary. Yeah. Cool.

SPEAKER_02

You were about to start a postdoc with Ion Stoica, focused on high-stakes AI evaluation. What was the specific moment you decided the academic path couldn't deliver what you wanted? And do you think the trade-off is generalizable or specific to this problem?

SPEAKER_00

Oh, it's so funny. Yeah, so I actually briefly did a postdoc with Ion for like two months. Um, it was kind of always supposed to be a bridge into whatever was the future of Arena. Um, and so it wasn't really a decision of whether or not to switch out of that into this. However, you know, it is it is a really good question about like, you know, what are the benefits and trade-offs of the academic environment versus the industry environment? And I I have to think about that a lot, especially given my my position. Um, and from my perspective, what I have come to realize is that the academic world is particularly good at a couple of different things. It's good at um it's really good at medicine because like academic hospital systems tend to be connected to uh to academic research environments. Um and it's really good at like theory in general. That's like doesn't require um having like a plan to monetize something three to five years out. Um but industry is better at some things too, and industry is better at doing science that has immediate relevance and that then can be productionized at scale to affect the lives of millions and millions of people. And I definitely knew that I wanted that as part of my life. I didn't want my life story to be, you know, ivory tower. I wanted to be able to like, you know, be in the arena, so to speak. Um and so that's what I did.

SPEAKER_01

We've got another question here. Um, labs have started optimizing for Arena, but how can you keep it from becoming just another gameable benchmark that just measures Arena's voter distribution instead of the actual intelligence of the model?

SPEAKER_00

So listen, every benchmark or you know, every leaderboard is ultimately measuring one thing. Um, but what Arena is measuring is the real preferences of tens of millions of users that are constantly coming in every day. And what distinguishes it from a standard benchmark is that it's not gameable in the same way. A standard static benchmark is a set of questions. Once you see the set of questions once, the second time it's meaningless because of overfitting, right? You can look at those questions over and over and over again, and every time the model sees it, the better it gets at those questions, or maybe it can just look up the answers and memorize them, and then the test performance is not reflective of real world performance. Arena does not have that problem, it just formally does not, like provably does not. And that's because the questions that we get are all unique, you know, we deduplicate, so it's impossible to memorize. In order to do well on Arena, you need to like new users need to come through the door and vote for you. That means you're always held accountable to the user's actual results. It's not like some you know concocted, you know, stilted benchmark that's just uh you know a set of 15 coding questions. Um, and that's what gives it its power. Um, now I think the real question is not so much is it gameable, but rather, is it the right metric? And that is something reasonable people can disagree about because it's formally not overfittable, but you might say that hey, it's a specific set of users that are doing a specific set of tasks. And what if my task distribution is different? Okay, well, that that is the problem that you have to solve with targeting and with scale.
So you can try to capture every use case and speak to every single one of them, and then you can subset the data or you can train models on the data to try to understand like the heterogeneity and performance in different categories or in different, you know, user user bases and so on. Um, that's the way that I see the problem.
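
The answer above mentions two mechanics: keeping questions unique (what appears to be a deduplication step) and subsetting the data to see how performance varies by category. A minimal sketch of both follows. The vote record fields, the whitespace-and-case normalization rule, the helper names, and the sample votes are all assumptions; the transcript does not describe how Arena implements either step.

```python
from collections import defaultdict

def normalize(prompt):
    """Collapse case and whitespace so trivially repeated prompts match."""
    return " ".join(prompt.lower().split())

def dedupe(votes):
    """Keep only the first vote seen for each normalized prompt."""
    seen = set()
    kept = []
    for vote in votes:
        key = normalize(vote["prompt"])
        if key not in seen:
            seen.add(key)
            kept.append(vote)
    return kept

def win_rates_by_category(votes):
    """Return {category: {model: wins / battles}} for each category subset."""
    wins = defaultdict(lambda: defaultdict(int))
    games = defaultdict(lambda: defaultdict(int))
    for vote in votes:
        cat = vote["category"]
        for model in (vote["winner"], vote["loser"]):
            games[cat][model] += 1
        wins[cat][vote["winner"]] += 1
    return {
        cat: {m: wins[cat][m] / games[cat][m] for m in games[cat]}
        for cat in games
    }

# Made-up votes; the second one repeats the first prompt with different spacing.
votes = [
    {"prompt": "Fix this Python bug", "category": "coding", "winner": "A", "loser": "B"},
    {"prompt": "fix this  python bug", "category": "coding", "winner": "A", "loser": "B"},
    {"prompt": "Write a sonnet about rain", "category": "creative", "winner": "B", "loser": "A"},
    {"prompt": "Refactor this function", "category": "coding", "winner": "B", "loser": "A"},
]
unique_votes = dedupe(votes)
rates = win_rates_by_category(unique_votes)
print(len(unique_votes), rates)
```

A reader whose task distribution differs from the overall voter population could look at the slice that matches their use case rather than the aggregate ranking.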

SPEAKER_02

So, can we double-click on that for a sec? So we had Jensen in the class yesterday, and we were talking about the design of the next Feynman system. You know, that's their uh chip, not the current, you know, we're in Blackwell right now, then comes Vera Rubin, then Rubin Ultra, and then Feynman. And so we start talking about what is the right measure, right, for intelligence. Because the the problem or the homework assignment that the NVIDIA teams have to figure out is you know, as intelligence becomes more and more heterogeneous, where you have you know folks training image models and you know unified omni models at at DeepMind to do image reasoning and visual reasoning, or Black Forest Labs where they're working on real-time you know video and action prediction models, because that you know, Andi Blattmann came by the class a few weeks ago and talked about that, or Luma, which is working on unified agents, and Periodic Labs, which we'll have in the class soon, with training you know, superconductivity discovery models. I mean, these are very different kinds of intelligence, right? And and Anthropic working on coding and so on. And as a general purpose platform, uh NVIDIA has to figure out how do you measure the right output in the right way that's aligned with the overall success of the ecosystem, but that's hard to do because the chipset is general, right? And so I asked him, Well, how do you how do you solve that? And he's like, Well, it's an art, not a science. How would you if you were Jensen, right? Given what you know about, you get to see like literally the world's best frontier AI teams testing in real time their next generation of models. And so you have a bit of a crystal ball, right? And then you if you were co-designing the next generation of NVIDIA chip, how would you design it based on what you know now?

SPEAKER_00

Well, you know, I'm not a chip designer, but I can tell you how I'd approach the feedback issue. And I think the answer is very simple, which is that I would try to get as much in the wild in-product feedback as possible and optimize the shit out of that, period.

SPEAKER_02

So but but let's say the eval is concretely like intelligence per like you know, tokens are the measure, tokens per watt essentially is where we're we're in the limit, most most teams are realizing that's where we're converging to across the space. But not all tokens are born equal, right? And so, how would you optimize coding tokens versus image tokens versus super intelligent?

SPEAKER_00

Like, you know, I would look, it's all about the business outcome. So I would look at it and say, what is the business outcome that NVIDIA's customers or NVIDIA's customers' customers want to optimize for, right? In the sense that they they might be selling to B2B2B businesses. So I would be, for example, if let's say I'm making a model that I want to help um a marketplace uh product sell more, you know, sell more units of you know, whatever the marketplace is on. Let's say I'm trying to help Amazon make more sales with my LLM. The unit uh feedback there is dollars. Like but but but if you have you should be able to optimize for dollars. Ah, I see. When you and when you can do that, when you can close that loop, it becomes very, very powerful.

SPEAKER_02

Everything else is an intermediate piece of but I mean, but then in the limit, doesn't that mean basically everyone should just have their own personal arena for their own business where they can define what the outcome is that matters to them? You need to be able to do that.

SPEAKER_00

Yes, that's absolutely right.

SPEAKER_02

Is that where the world is going? Like an arena for every business?

SPEAKER_00

Uh like they take you want an uh you want an evaluation system that can generalize to multiple businesses and that can collect feedback from many enterprises and businesses and help them make great decisions across one another, so on and so forth. But yeah, I mean that's the kind of feedback that you need.

SPEAKER_02

Because what about reliability for for individuals? Like, should I expect to have my own arena for myself over time?

SPEAKER_00

You should expect that the products that you use will learn about you and personalize to you by understanding your conversions across their product and perhaps multiple products.

SPEAKER_02

And who do you think is the furthest ahead today as a reference model for the rest of the space on how to do this in a mission-aligned and sort of user-aligned way?

SPEAKER_00

Probably Anthropic with Claude Code. Because I think that the Claude Code system has many subtle ways of providing implicit feedback. For example, plan acceptance is a piece of implicit feedback, or you know, just the the in in the sort of textual feedback that you give to your agent as it's continuing to code with you. That's also feedback. And I think those are extremely, extremely valuable pieces of information.

SPEAKER_02

Got it. Cool. Sorry, I do have a conversation. Go ahead, Mike. We can take the next question.

SPEAKER_01

Another question here. Uh, does Arena smooth over emergent capability jumps, or do you have a way to track those?

SPEAKER_00

Um, we definitely track emergent capabilities, because, I mean, NanoBanana is a good example, right? Like when those capabilities come out, people find them, people use them, and then, you know, it often results in increased platform utilization. And, you know, people generally vote for the model more because new capabilities emerge and people get really excited about them. So when those things come out, it's definitely not a smooth curve. We see big jumps. Same thing happened with GPT image too.

SPEAKER_01

We've got another one on a personal note. What daily habits or principles have had the biggest impact on your success?

SPEAKER_00

What daily habits or principles? First of all, you know, just to question the premise, I'm not sure that things have been successful yet, right? Job's not done. Um, but to me, one thing that's been very helpful is that I see the CEO role as being a very service-oriented role. You have to prostrate yourself to the company. You have to bow to the business, you have to be an instrument of it. And if you're not doing that, you're not doing your job. And it's actually very freeing once you see it that way. Because you're never worried about, hey, what should I do? What's the right thing for me to do when I go into this meeting or when I do this negotiation, blah, blah, blah. You just have to think clinically about what's the right thing for the company. And somehow that makes it less personal. That means that you have to put aside your ego and it becomes a completely selfless activity. It's actually easier to reason that way. Uh, and I found that to be quite helpful.

SPEAKER_02

But that but that can end up being somewhat zero sum, right? Because if you're totally mission-aligned just on yourself and your business, then you kind of put the blinders on and don't really optimize for the ecosystem. But one thing Arena has to figure out how to do is is build this sort of ecosystem layer, right? Where different labs feel like it's an independent, trusted place. How do you navigate that tension?

SPEAKER_00

Well, I think, you know, ultimately business has dollars and cents, right? It's got a P&L and you have to, like, make money. So I think a lot of the question around the ethics and the, you know, ecosystem and so on and so forth has to come from what is the business that you want to start? Like you have to design a business that you're proud to run. I'm not running a gambling business for a reason, right? Like, because I'm not that guy. I wanted to run a business that helps the world and that can provide a neutral voice to help people make better decisions about how to reliably deploy AI in different settings so that we can take this technology and empower people to use it safely and effectively. Um, and everything basically comes from that. That's like the fountain from which all the plants are watered. Um, so, you know, once you take that as the mission of the company, neutrality is just a consequence. It's actually the economic incentive of the business to be neutral. Because if we're not neutral, then nobody will believe us, and then our business has no value, and then we shouldn't have started this business in the first place.

SPEAKER_02

Yep. And I guess to extend that further, you know, Arena is better off if there's a healthy independent ecosystem of many thriving different intelligence factories, so to speak.

SPEAKER_00

It's incentive. Many flowers should bloom. Right. If many flowers don't bloom, then our business loses value, and then we shouldn't have started it.

SPEAKER_02

But you know, it's you're on that topic, the final project for this class, and we've got 500 students you know at Stanford on campus, and then a few thousand following along online. And I know you and I've chatted about this before, but the final project is the one-person frontier lab, right? And so this is your chance to have many more flowers bloom. Like, what what is your advice to folks as they start thinking about their their projects? This this week is midterms week, as you know, as a former Stanford undergrad. And so next week is when people will start really attacking their final projects. Like, what would be your advice to them on the shape of the problems they should attack?

SPEAKER_00

My biggest piece of advice would be do not solve a problem that is relevant today. You have to skate where the puck is going. And if you don't skate where the puck is going, then you will just become irrelevant. So I would start by making predictions about where the world is going to go. And I literally do this when I think about product design for Arena. I think about where is the world going, what are my predictions? And I will write them down and then I will say, okay, where am I? What should I be doing to take advantage of these predictions? And read a lot. You know, I read market reports, I read the, you know, a16z things that a16z used to put out and still puts out. I read things from Gartner, I read, I mean, like I read all this, like my Twitter.

SPEAKER_02

I see you, you, you, you follow my Twitter religiously, right?

SPEAKER_00

I do follow your Twitter. Uh, anything Anj says, I take as gospel. Um, and so I'm trying to think about where the puck is going in order to build the next generation of my products.

SPEAKER_02

And what are your top three predictions right now for the next six to nine months?

SPEAKER_00

I have many. I would say the most important one is that, and this probably won't be controversial here, but that long-running agents are going to be the central unit of work. Um, and that's quite important actually, because humans are going to be disintermediated for much of the process. Uh, and I think that affects the shape of products.

SPEAKER_02

How so? What do you mean by that?

SPEAKER_00

Think about something like Arena, right? Like Arena, by virtue of the way that it's designed, is an interactive platform. But what happens when tasks are not interactive? What happens when they are long-running agentic tasks that are sort of asynchronous background tasks, or that, you know, involve some aspect of that? How do we do evaluation?

SPEAKER_02

Uh I see, which is why reliability becomes a bottleneck. Because if you're gonna have your agent go run a job, a workload for 30 days, you don't want it reward hacking and doing something you didn't want it to. You've got to trust that it's it's reliable on whatever axis you care about.

SPEAKER_00

No question. I mean, there, reliability is even more important because also the spend there is high. Right. You know, you might spend five dollars or more sending your agent off to do something, and you might not be interacting with it, you might not be watching it. So, what's the right paradigm for that? And what's the right evaluation for that? I think that is unknown, never been built. Um, and we need to build it.

SPEAKER_02

Did you say five dollars? No, no, we're in the five million dollars right now. That's definitely happening.

SPEAKER_00

You're you're really you're up by like, I don't know how many orders of magnitude, but I mean we're well I was saying per background task, but maybe you're talking about expensive background tasks that yeah, per long horizon session, yeah, right.

SPEAKER_02

Exactly multi-session resolution. Uh go ahead, Mike. I think you've got the next one.

SPEAKER_01

Another question. Um, what does intelligence mean to you? Is it a specific set of metrics or something more meta?

SPEAKER_00

Yeah, I don't have a strong definition of intelligence. These things are all a matter of definition. Um, and I don't think much about AGI or what is the best definition of intelligence. I think more about business outcomes. Like how are you going to make some process more efficient, or, you know, help the world by taking some number of, you know, years, quality years of life, and making it go up. Those are the sorts of things that I want AI to be doing. I don't care much if it becomes intelligent, so to speak.

SPEAKER_01

Makes sense.

SPEAKER_02

Um, when labs release a new model, what's the conversation with them actually like? What do they want to learn from Arena that they can't learn internally?

SPEAKER_00

Well, I think the ultimate goal of the evaluations that Arena brings to labs is to give them actionable insights on where they can improve their models. So many labs are in the dark and very curious to know what types of queries they fail at, especially with respect to their competition, you know, or what types of tools they're erroring out on. Um, and let me tell you, yeah, the capabilities have improved, but the failures are also really obvious. Like you can see them in the data when you expose it to the real world. On the benchmark, it's gonna look good because the benchmark is overfit. When you put it in the hands of users, what happens is you ask, you know, model X, which is from a famous company with the best researchers in the world working on it, to make a website for you, and then it'll just output, like, blank, just like empty JSON. Or an empty HTML file. It's like, what is going on here? And that happens to this day, and so they want to see those things.

SPEAKER_02

Yep. I mean, and then doesn't that imply essentially at some point you just need reliability testing like Arena or something like Arena to be used in CI/CD all the time? It's just like CI/CD for reliability. Exactly.

SPEAKER_00

Yeah, that's necessary because how far are you from that surface area of errors is so large that you simply cannot do it without a real in-the-loop, you know, CI/CD pipeline.
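
To make the "CI/CD for reliability" idea concrete, here is a minimal sketch of a gate that flags the kind of blank outputs described above (empty JSON, or an HTML file with no visible text) and fails a build when too many appear. This is an illustration only, not Arena's system: the function names `is_blank_output` and `reliability_gate`, the tag-stripping heuristic, and the 1% threshold are all assumptions.

```python
import json
import re


def is_blank_output(text):
    """Heuristic check: empty text, empty JSON, or HTML with no visible text."""
    stripped = text.strip()
    if not stripped:
        return True
    try:
        parsed = json.loads(stripped)
    except json.JSONDecodeError:
        parsed = None
    else:
        # Valid JSON: blank only if it carries no content at all.
        return parsed is None or parsed == {} or parsed == [] or parsed == ""
    # Not JSON: strip anything that looks like a tag and see what is left.
    visible = re.sub(r"<[^>]*>", "", stripped)
    return not visible.strip()


def reliability_gate(outputs, max_blank_rate=0.01):
    """Return (passed, blank_rate) for a batch of model outputs."""
    if not outputs:
        raise ValueError("no outputs to evaluate")
    blank = sum(is_blank_output(o) for o in outputs)
    rate = blank / len(outputs)
    return rate <= max_blank_rate, rate


if __name__ == "__main__":
    sample = [
        "<html><body><h1>Hi</h1></body></html>",
        "{}",
        "<html><body></body></html>",
        '{"a": 1}',
    ]
    passed, rate = reliability_gate(sample)
    print(f"passed={passed} blank_rate={rate:.2f}")
```

A real pipeline would need far richer failure detection than this single heuristic, which is the point made above about the size of the error surface.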

SPEAKER_02

Yeah, like a real way as a continuous service is sort of the shape in my head of that. But I know we've talked about that.

SPEAKER_00

All this stuff manually, and now now yeah, now we've got a whole team building that out.

SPEAKER_02

Um, so I'm going through the backlog. We skipped a few questions here. Um, how do you think about evaluating agents versus evaluating models? Does anything about the Arena framework need to change when the response is a sequence of actions instead of text? Did we already cover that one?

SPEAKER_00

No, we did. We didn't, but that was a great question. We did, right? Yeah. No, we did not. We did not. You did not.

SPEAKER_02

Oh, please go ahead. Yeah, please go ahead.

SPEAKER_00

Yeah, I do think that the Arena framework needs to change. I think the statistical model needs to change because the feedback structure is different. Because with the voting mechanism that we have today, we're able to use a Bradley-Terry model because you have two responses, and then, you know, there's not like a multi-tool or multi-model chain, and then you vote. But with agents, it'll be a chain of responses, possibly including multiple models or subagents. And then at the end of that, you're gonna have some sort of a human interaction like a text bubble or a thumbs up, thumbs down, or what have you, whatever kind of feedback you collect. And then there's an attribution problem of which stage in the system caused this feedback to happen. Um, and so it will require remaking, yeah, remaking the methodology, which I've been doing.
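
For reference, the pairwise setting described here can be sketched as follows. In a Bradley-Terry model each model i has a positive strength s_i, and the probability that i beats j is s_i / (s_i + s_j). The code below fits those strengths from (winner, loser) votes with the standard minorization-maximization update. It is a minimal sketch under stated assumptions: the function names and toy votes are hypothetical, ties are ignored, and it is not Arena's actual pipeline. It also covers only today's two-response voting, not the agent attribution problem raised above.

```python
def fit_bradley_terry(votes, n_iters=200):
    """Fit Bradley-Terry strengths from (winner, loser) pairs.

    Uses the classic MM update: s_i <- wins_i / sum_j n_ij / (s_i + s_j),
    where n_ij is the number of comparisons between i and j.
    Assumes winner != loser in every vote.
    """
    models = sorted({m for pair in votes for m in pair})
    wins = {m: 0 for m in models}
    pair_counts = {}
    for winner, loser in votes:
        wins[winner] += 1
        key = frozenset((winner, loser))
        pair_counts[key] = pair_counts.get(key, 0) + 1

    strength = {m: 1.0 for m in models}
    for _ in range(n_iters):
        updated = {}
        for i in models:
            denom = 0.0
            for key, n in pair_counts.items():
                if i in key:
                    (j,) = key - {i}
                    denom += n / (strength[i] + strength[j])
            updated[i] = wins[i] / denom if denom > 0 else strength[i]
        # Strengths are only defined up to scale, so renormalize each round.
        total = sum(updated.values())
        strength = {m: s * len(models) / total for m, s in updated.items()}
    return strength


def win_probability(strength, a, b):
    """Modelled probability that model a beats model b."""
    return strength[a] / (strength[a] + strength[b])


if __name__ == "__main__":
    toy_votes = (
        [("A", "B")] * 7 + [("B", "A")] * 3
        + [("B", "C")] * 7 + [("C", "B")] * 3
        + [("A", "C")] * 8 + [("C", "A")] * 2
    )
    s = fit_bradley_terry(toy_votes)
    for name in sorted(s, key=s.get, reverse=True):
        print(name, round(s[name], 3))
```

The fit needs exactly one winner and one loser per vote. With an agent, a single thumbs up arrives after a chain of steps, so there is no such pair to feed in, which is the attribution gap the answer points to.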

SPEAKER_03

Yep.

SPEAKER_00

I mean, like personally, I see you know developing that science because it's I mean, this is essentially what your PhD was in, right?

SPEAKER_02

Was it developing? Yeah, yeah, go ahead.

SPEAKER_01

No, I said um what's the problem if you weren't doing arena.

SPEAKER_00

What would I be doing if I wasn't doing arena? Yeah, what's a problem you would work on? What's a problem I would work on?

SPEAKER_02

If you weren't doing arena, it's a great question.

SPEAKER_00

My mind is just so focused on Arena that I don't think about that that much. But, um, you know, I think that the AI science world is just so ripe for disruption. What do you mean? Well, I think that the next stage of research will involve autonomous scientists that are able to make discoveries. I mean, Periodic is doing this, just, you know, in a particular area, Anj. And I think that's really going to change the entire scientific method and make it more of a system design problem than it is now. Kind of like, you know, we train scientists in the scientific method because scientists are in charge of executing the science. But what happens when the science is actually executed by another system? Like what happens when the actual method is executed by AI? That whole space of things, I think, is very interesting. Everything from the hypothesis generation problem, which I think is a big unsolved problem, to the actual mechanics of how do you get science to be effectively automated.

SPEAKER_02

Well, we're gonna take the kids on a virtual field trip to Periodic Labs, and so they'll get to see in real-time action what you're describing, right, Rosie? For doing that? Yeah, it's in progress.

SPEAKER_00

I think that's a big, big deal. And I think, you know, Periodic's an amazing company doing this at the frontier of, like, materials science and stuff like this. And I think that there's many other companies to be built and academic research to be done to support this civilization-scale rethinking of science.

SPEAKER_02

Yeah, I couldn't agree more. Um, here's the question here that says, What's something you've learned about human preferences that surprised you as a statistician?

SPEAKER_00

I think the richness of human preference data was surprising to me. You can learn really, really sophisticated models based on human preference that can do very granular analysis at the scale of, like, individual users or individual prompts, if you have large enough quantities of it. So human preference at scale is a really different beast, because you would think about it as being so noisy that it would be hard to learn anything, because humans never agree on anything. Like, you know, even if you look at ballot issues in the US, in US politics, the highest agreement between humans is like 70%, right? It doesn't get better, right? Um, and the same is true in Arena, that users rarely agree because they all have their own preferences. But when you look at enough of those at scale and you look at the topic modeling and you look at the user base and blah blah blah, the amount of that that's actually predictable and that can result in really useful and interpretable models of preference and behavior is actually stunning to see. It's a really cool problem.

SPEAKER_02

And and for students in the class who kind of want to go deeper on that problem, what what could they do? Could they collab with arena? Is there open source data sets?

SPEAKER_00

You can, you know, you should apply for a job at arena. We'll we'll love to look at your application.

SPEAKER_02

All right, there's the plug. Um next question is uh if arena was being over. Did we talk about this? If arena was being overfit, how would you statistically detect it before it's too late? We didn't talk about that.

SPEAKER_00

So it's a it's a really good question. Um again, I think we need to look at the premise of the question because when you're constantly getting fresh data, it's not really possible to overfit in the standard sense. And what I mean by that is if the questions are new, that if they're like new IID samples from a distribution, which they are, and the distribution is actually changing, then um in order to like get a high score on the arena, you just need to like do well on new questions. It's like you're constantly getting new samples. So you might have trained on the previous samples, you might have seen the previous samples, but you still need to do well on new ones. Um so it's not like it's it's possible to, you know, it's basically impossible to overfit in the standard sense. And to the extent that like there's questions of multiplicity, so on and so forth, the um the right thing to do there is to kind of monitor, um monitor like how many submissions there are at all times, and also make sure to collect enough data so that the variance of the confidence intervals, like the size of the confidence intervals, decreases enough that we can certify results at the level that we want. Um and and we're doing all of that. So we we try to like dot our i's and cross our t's.
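
As a rough illustration of the point about collecting enough data for the confidence intervals to shrink, the sketch below computes a Wilson score interval for a head-to-head win rate and a normal-approximation estimate of how many votes a target interval width needs. The helper names are hypothetical and this is not Arena's actual certification procedure, which is not described in this conversation.

```python
import math


def win_rate_interval(wins, total, z=1.96):
    """Wilson score interval for a win rate (z=1.96 is roughly 95% coverage)."""
    p = wins / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


def votes_needed(half_width, p=0.5, z=1.96):
    """Normal-approximation sample size for a target interval half-width.

    p=0.5 is the worst case, so this is a conservative estimate.
    """
    return math.ceil(z * z * p * (1 - p) / half_width ** 2)


if __name__ == "__main__":
    for total in (100, 1_000, 10_000):
        wins = round(0.55 * total)
        low, high = win_rate_interval(wins, total)
        print(f"n={total:>6}: 55% win rate, interval [{low:.3f}, {high:.3f}]")
    print("votes for a +/-1 point interval:", votes_needed(0.01))
```

With 100 votes, a 55% win rate is statistically indistinguishable from a coin flip. With 10,000 votes the same rate clears 50% comfortably, because the interval width shrinks roughly with the square root of the number of votes.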

SPEAKER_02

Yep. Um I think we've actually covered most of the questions.

SPEAKER_01

We missed one, which was what's the hardest part of your job that you didn't?

SPEAKER_00

Hardest part of job. I mean, creating a great company is not easy basically in any aspect. And uh it's required a lot of personal growth for me. I'm a very different person now than I was when I was a PhD student. You need to like have a certain level of like an iron stomach that um nothing can ever faze you. But to be honest, that's not even really the hard part. That's table stakes. Because if you don't have that, you will just fail. You cannot be emotional, you cannot have ego, you have to just be like, you know, you have to be zen with very zen.

SPEAKER_02

Yes.

SPEAKER_00

I mean, any entrepreneur will tell you this, but I think the thing that's actually hard, beyond just like, hey, you know, I have a new job now, is that the space is moving so quickly and the dynamics are so aggressive, and the divide between the haves and the have-nots, and what I mean by that is the hyper-growth businesses and the businesses that are just doing okay. Those divides are so big and those lines are so bright that unless you're aggressive about skating where the puck is going, as we talked about earlier, every day you're losing ground. And so having to continually rebuild the business, having to continually restructure things, having to be really aggressive about hiring and firing, having to be, you know, that level of having the foresight and then pushing yourself to say, no, it's not good enough. Like X in revenue is not good enough. I need 10x because I see where we'll be going if we're not aiming for that, and it's gonna change our strategy. Um, those types of questions are actually hard because they require you to have Zen about your sunk cost, and also to be really smart about predicting the future. And it's not to say I'm good at it, it's to say everything else I can handle. That is actually the hard part that, you know, we might fail at.

SPEAKER_01

Thank you so much for your time today.

SPEAKER_00

Of course, it's my pleasure. Thanks for having me. Thanks to all the students for listening. Yeah.

SPEAKER_02

Okay, thanks so much, Anastasios. See you all. Have a great weekend, guys. Cheers.