Benchtalks

Benchtalks #1: Alex Shaw (Terminal-Bench, Harbor) - Building the benchmark factory

Snorkel AI

In this inaugural episode of Benchtalks, Snorkel AI co-founder Vincent Chen sits down with Alex Shaw, MTS at Laude Institute and co-creator of Terminal-Bench, to unpack what the rapid hill-climbing on TB2 reveals about the state of AI agent evaluation — and where the field needs to go.

This interview covers: 

  • Why TB2 went from 20–30% during development to 75–80% at the frontier today
  • The bet on the terminal as the right abstraction for general computer use
  • How Harbor became a benchmark factory — and why that matters for RL post-training
  • The "benchmaxxing" problem and how the community is keeping TB2 honest
  • What Terminal-Bench 3 needs from expert contributors to shape model development for the next year

Full interview/transcript: https://snorkel.ai/blog/benchtalks-alex-shaw-terminal-bench-harbor-building-the-benchmark-factory/


SPEAKER_00

Welcome to the inaugural episode of Benchtalks, Snorkel AI's podcast on benchmarks, evaluation, data quality, and real-world impact. Today, Vincent Chen sits down with Alex Shaw, co-creator of Terminal Bench and Harbor, and a founding member of technical staff at Laude Institute.

SPEAKER_01

Once you have tasks, in theory, that becomes a verifier for your agent to start creating the automation itself. Right. So that gets deployed into production, you get production traces, you take those traces, you turn them into tasks, and hopefully this is the flywheel or the loop everyone talks about. But I do think that that is probably the North Star worth building towards.

SPEAKER_01

Thank you for having me.

SPEAKER_02

Yeah, we're super excited to have you here. So I guess first of all, four months ago, we talked a bit about TB2 and Harbor. That's launched. It's had a ton of momentum since then. What has been most surprising in the last four months?

SPEAKER_01

It's a good question. I think, okay, two things have been pretty surprising. So first is how quickly the model developers have hill-climbed Terminal Bench 2. When we were making it, while it was in development, 20% or 30% were the scores we were seeing. And then by the time we actually released it, the better models were scoring, I think, up to 40 or 45%. And now the most recent submissions from OpenAI, I think, were like 75%. And if you really build a custom harness around it, people have even gotten up to 80%. So I guess in school, that's still a B minus, so fair enough. And we've actually seen every single task in Terminal Bench 2 solved at least once by some agent-model combination over the course of the benchmark. So in theory, that means 100% is possible. So there's still 20% of room to go. But yeah, we're actively working on Terminal Bench 3 for this very reason. These benchmarks are very hill-climbable these days, which is good for all the users of the agents, right? They get better agents because of it. But yeah, that was surprising. And then the other surprising thing is how quickly Harbor was adopted. We put it out there as an experimental thing: people seem to want this, we'll put it out there and see if people pick it up. But yeah, people seem to like it. That's awesome.

SPEAKER_02

On the first point, what do you think that says about either the benchmark or the models that, you know, they're hill climbing so quickly? Any lessons or hot takes on that phenomenon?

SPEAKER_01

It's a little bit confusing, to be honest. The Terminal Bench 2 tasks are, I think, very high quality. Terminal Bench 3 will be even higher quality, so I'm excited about that. But I think it's a relatively difficult benchmark to reward hack, so I do think the models are actually getting better at those specific types of tasks. But when I use Claude Code myself, and I use Claude Code all the time when I'm coding, I see a lot of shortcomings still. Do you know what I mean? At 80% on Terminal Bench, you'd think it would just be solving every problem instantaneously or something. But it introduces a ton of complexity. It still, I think, has a relatively hard time building off of its work and building robust and secure code. And it has me thinking a lot: how do we encode more and more of these shortcomings into tasks that exist in either Terminal Bench 3 or any other benchmark people are building? I think maybe the lesson learned is coding is an extremely broad domain. 89 tasks aren't going to cover all of it. So we need a thousand times more benchmarks than we have right now.

SPEAKER_02

Yeah. And that's a big part of our philosophy in launching open benchmarks and working on benchmarks as a core initiative for us too. It's not just about evaluating these models, looking backward, and snapshotting progress. It's also about shaping and defining, hey, where are these models going? What are the real ways that they're used in practice? Actually articulating that well is really hard. And capturing, hey, what am I doing in my day-to-day that is not in these benchmarks, ends up being quite important, in my view, for closing that gap. So very much agree there. At a high level, what do you think made Terminal Bench 2 stick and make the big splash that it did on all the recent model cards with Claude and Codex? Even as of today, I think there's a new one out there. What do you think made it stick?

SPEAKER_01

I guess I should preface this by saying it does seem like we got a little bit lucky. But by luck, what I mean is I think you need a few different things to have a candidate for a benchmark that becomes a standard in the industry. The two that you have the most control over: one, it needs to be high quality. By high quality, I mean tasks that people care about and tasks that are not easily reward hackable. That's one. Two, I think it needs to be easy to adopt. This is something we agonized over with Terminal Bench 2, which led the way to Harbor: what is the fewest characters somebody could type into their computer to start running Terminal Bench 2? So in theory, if you have Docker installed and running, you could run it in a single line of code without having anything else installed on your computer. So that, I think, was a big one. And then the last one, which you have maybe a little bit less control over, is making a correct bet on a capability that people are going to care about that currently is not being measured by existing benchmarks. For us, this was the bet on the terminal as a powerful tool for computer use, not just coding, but actually doing arbitrary tasks on a computer. And I do have to credit Andy Konwinski, the co-founder of Laude Institute, Databricks, and Perplexity, and Ludwig Schmidt, the Stanford professor. I think a lot of the initial ideas around the terminal came from them. They were thinking, hey, it's a text-based tool that's extremely powerful. So I think they really nailed that, and I think Mike and I executed well on it and expanded the vision.

SPEAKER_02

I love that articulation, and I agree on all fronts. To go a little deeper on the bet on the terminal: why do you think that's the right abstraction? Because I agree with you. It feels like it's not just a kind of CLI agent. It's way more general purpose than that. It's a much broader affordance for general computer use and general-purpose tasks. Why do you think the terminal is such a good abstraction or core foundational layer to build on?

SPEAKER_01

I'll say what I've said since the beginning, which is that before UIs even existed, there were just terminals; that was the way you interacted with a computer. So it makes sense that most things you do on your computer through a UI were either originally done through a terminal or can now be done through a terminal. And in general, these are large language models, and they tend to excel at text-based domains more than they do at vision-based domains. A lot of user interfaces are vision-based. So I think because of that, they excel at writing commands and scripts and code in the terminal. And then I think people underestimate how much you can do with the terminal. Claude Code released Claude Cowork, which is literally just a UI wrapped around Claude Code for people to do document-style work, like lawyers and stuff like that. And then even myself: when I shop for groceries, for example, I will go to this app called Paprika and I'll pick a recipe and I'll add groceries to my grocery list, and then maybe I'll export them into my Reminders app. And I'm doing this all through my user interface. But just as an experiment the other day, I asked Claude Code, hey, can you see my Paprika app? And it was like, I found the SQLite database for it. And I was like, can you read it? Can you see my recipes? And it's like, yeah, I just read your SQLite database, so I see all of your recipes. And I was like, can you pick a couple and then make a grocery list? It was like, yeah, okay, I did that. I was like, can you export that to the Reminders app? And it found some scripting library that Apple has created where you can interact with Apple applications on your computer through the terminal, and added it to my Reminders app. So it's just to show that you can do, I think, almost anything on a computer through a terminal.
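
As a rough illustration of the glue this anecdote involves, here is a minimal sketch of how a terminal agent might chain those steps. The database path and the table and column names are hypothetical stand-ins, not Paprika's real schema, and the Reminders step shells out to macOS's osascript, so it only runs on a Mac.

```python
# Illustrative sketch only: the database path and the recipes table/column
# are hypothetical, not Paprika's actual schema. The Reminders step uses
# macOS's osascript, so this is macOS-only.
import sqlite3
import subprocess

DB_PATH = "/path/to/paprika/recipes.db"  # hypothetical location

def read_recipes(db_path: str) -> list[str]:
    """Pull a few recipe names out of a local SQLite database."""
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute("SELECT name FROM recipes LIMIT 5").fetchall()
    return [name for (name,) in rows]

def add_reminder(item: str) -> None:
    """Add an item to Apple Reminders via AppleScript."""
    script = f'tell application "Reminders" to make new reminder with properties {{name:"{item}"}}'
    subprocess.run(["osascript", "-e", script], check=True)

if __name__ == "__main__":
    for recipe in read_recipes(DB_PATH):
        add_reminder(f"Groceries for: {recipe}")
```

The point is not this particular script; it's that all of these steps are plain text and shell-accessible, which is exactly the surface a terminal agent is strongest on.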

SPEAKER_02

I think that's well said. And more broadly, on the quality point, you mentioned task quality as a differentiator for TB2. What do you think contributes to that? How do you maintain such high quality on a benchmark at the scale you were trying to run towards?

SPEAKER_01

Yeah. Originally, there was no scale. Originally it was just Terminal Bench 1, and it was me and Mike making tasks ourselves, trying to hit some number of tasks to put a benchmark out. At some point we realized Mike and I don't have enough collective knowledge to create a hundred really interesting tasks. We maybe hit 10 each, and then quickly we were like, it's hard to come up with more. So then it became a crowdsourced benchmark where we went and asked for contributions. First and foremost, the thing that helps most is just having smart and capable contributors who understand, or who are able to recreate, difficult problems that they've solved historically in their careers. For us, that was Nicholas Carlini joining early and contributing tons of tasks to Terminal Bench, because he had just solved so many difficult problems. So having really expert-level contributors makes a big difference. And then also providing them with tools to check their own work. We created rubrics and some nice tooling where you could have an LLM judge look at your task and say, hey, this isn't well specified, your instructions are missing output schemas, or something like that. And then the last check, or maybe the last step of the process, which Mike really invented a lot of, was the QA process: once we had the tasks, how are we now going to put them through a bunch of thorough testing to see where they break? Are they brittle? Are they reward hackable? Should they make it into the benchmark? I think all of those things combined led to a high-quality benchmark.
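
One concrete shape those QA checks can take is a null-agent test: the verifier should fail on the untouched environment and pass on a reference solution. This is an illustrative sketch, not the actual Terminal Bench QA harness; the run_verifier helper and its apply_solution flag are hypothetical.

```python
# Illustrative QA sketch, not the actual Terminal Bench pipeline. Assumes a
# hypothetical run_verifier(task_dir, apply_solution) callable that spins up
# the task container, optionally applies the reference solution, and returns
# True if the verifier passes.

def qa_check(task_dir: str, run_verifier) -> dict[str, bool]:
    """Basic brittleness and reward-hacking checks for a candidate task."""
    results = {
        # A task that passes with no work done is trivially reward hackable.
        "fails_without_solution": not run_verifier(task_dir, apply_solution=False),
        # A task whose own reference solution fails is broken or brittle.
        "passes_with_solution": run_verifier(task_dir, apply_solution=True),
    }
    results["ok"] = all(results.values())
    return results
```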

SPEAKER_02

I think that resonates a ton. Even as we're building datasets on our end, often what we're finding is that generation isn't really the bottleneck. It's often the verification and quality-control piece. And the more you shift left, in terms of making things more interactive or giving feedback sooner, the more you have those types of checks or protocols in place, the more you can tighten that feedback loop for quality. So a lot of those principles resonate a ton. I know there's also been some recent news that you've become so big that people are trying to cheat the leaderboard. What has it been like to manage the community there more broadly and really keep the integrity of the benchmark as this thing has blown up quite a bit?

SPEAKER_01

Yeah, so we have simple checks in place. Initially we just had people email us their results and we would manually put them on the leaderboard. That didn't scale very well. So we started a Hugging Face repo where people could submit their results, and they would automatically get simple validations: did they run the tasks enough times, did they use the correct configuration, stuff like that. And then those would upload to the leaderboard. Those help with basic things, but really, just having everyone's submissions publicly available is, I think, what creates the final layer of verification, which is that anybody can go in and dig. Like this leaderboard entry that recently got taken down: we didn't catch it. It was some other contributor who dug in and figured out that they had bundled some gold trajectories into their agent binary. And that helps us create the highest-integrity benchmark we can. But I will say there's no easy way for us to detect, for example, whether the model was fine-tuned on our actual solutions or something. There are things that you just can't verify. So I think the only solution to that is: people will overfit to your benchmark over time, so you need to make new ones fast. And that's what we're trying to do with Harbor especially: how can we build the benchmark factory, the machine that other people can use to make their benchmarks, as opposed to just creating our own benchmarks one by one? Because we need, like I said, way more benchmarks than we have now. And we need them to come out quickly so we're not overfitting to the current one.
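
Those automated checks can be as simple as a schema pass over each submission before it reaches the leaderboard. The field names and thresholds below are illustrative assumptions, not the real Hugging Face submission format.

```python
# Hypothetical submission validator; field names and thresholds are
# illustrative, not the actual Terminal Bench leaderboard schema.
REQUIRED_FIELDS = {"agent", "model", "dataset_version", "n_runs", "results"}
MIN_RUNS = 5  # e.g. require several runs per task to reduce variance

def validate_submission(sub: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means it passes."""
    problems = []
    missing = REQUIRED_FIELDS - sub.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if sub.get("n_runs", 0) < MIN_RUNS:
        problems.append(f"need at least {MIN_RUNS} runs, got {sub.get('n_runs')}")
    if sub.get("dataset_version") != "terminal-bench-2":
        problems.append("wrong or missing dataset version")
    return problems
```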

SPEAKER_02

Let's talk about Harbor. It's for way more than just Terminal Bench now. I think SWE-bench and many other third-party benchmarks are on the registry. What's the vision? How has it evolved since you launched it just a few months ago?

SPEAKER_01

Yeah. Like I said before, when we put it out, it was mostly experimental. We had built a lot of interesting and good infrastructure for running Terminal Bench, and, like I said before, we cared a lot about making it easy to use: what is the simplest way to specify a task on a computer that an agent should automate? What is the simplest way to start getting rollouts on it at scale? So we put out Harbor. From the beginning, the vision was always that we want a lot of the popular benchmarks to be adapted into the framework, because it's flexible enough to support that, and we want people to create their own benchmarks. So I think we have seen like 10 or 20 benchmarks. The one I saw this week was called Runebench, and it's literally agents playing RuneScape. If you know the game, it's a World of Warcraft-style game that was popular probably 10 or 20 years ago. So yeah, people are using it to build interesting benchmarks. We've adapted like 40 or 50 benchmarks into it, like SWE-bench, like you mentioned. And then maybe the surprising thing, or the thing we're trying to do most right now, is expand outside of just coding as a primary use case. Part of this is because of how we've seen people use it: to create finance tasks, law tasks, all of these things in the same vein as Claude Cowork being an extension of Claude Code. Why shouldn't all of these other benchmarks or types of tasks also fit into the Harbor framework? So we're actively developing Harbor towards general agent automation tasks.
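
To make "the simplest way to specify a task" concrete, here is a hypothetical sketch of what a containerized task reduces to: a natural-language instruction, an environment definition, and a programmatic verifier that only inspects the final state. The file layout and the pytest-style checks are illustrative, not Harbor's actual schema.

```python
# Hypothetical verifier for a containerized task, in the spirit of what a
# Harbor / Terminal Bench-style task boils down to. Illustrative layout:
#
#   my-task/
#     instruction.md          # natural-language task given to the agent
#     Dockerfile              # environment the agent works inside
#     tests/test_outputs.py   # this file, run after the agent finishes
#
# Because the verifier only inspects the container's final state, it works
# with any agent and is harder to reward hack than grading the transcript.
from pathlib import Path

def test_report_exists():
    assert Path("/app/report.csv").exists(), "agent never produced the report"

def test_report_has_expected_rows():
    lines = Path("/app/report.csv").read_text().strip().splitlines()
    assert len(lines) >= 100, "report is missing rows"
```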

SPEAKER_02

Yeah. What do you think Harbor does well that was previously a gap for researchers and agent builders in general?

SPEAKER_01

So shortly after we made Terminal Bench, I got wind that Guillermo Rauch, the founder of Vercel and also the inventor of Next.js, was a fan of it. So I tried to hop on a call with Guillermo to talk to him. And the thing I wanted to tell him was that while building Terminal Bench, the thing I was trying to do most was give it the same level of usability and developer experience that something like Next.js has, like all these web frameworks that are known for being extremely nice and usable. So I think we obviously took that same philosophy and instilled it into Harbor. And then I think Tinker is another good example of how an extremely simple interface that abstracts away 90% of the complexity, but still gives you 90% of the flexibility you need, is a really powerful tool. And I think that's what Harbor did well. There were a lot of frameworks out there for spinning up containerized tasks and measuring an agent on them, but nobody had really gotten it to be super simple, super usable, and extremely flexible. And I think that's what we did right.

SPEAKER_02

Yeah. Where is Harbor going and why should people building or interacting with agents be excited about that?

SPEAKER_01

Yeah, it's a good question. What we've seen is, one, model development is an extremely empirical process, which is why these benchmarks are so important: often the benchmark is how you make decisions about what you're doing during the training phase. And what we've seen is that agent development is an equally empirical process. Pretty much anything that touches a model is touching something that is more or less a black box; it's not like some lines of code that you can go read and then know ahead of time what's going to happen. You need tools to be able to measure how your agent is performing and whether your changes are improving it or harming it. And I think what we've also seen is that data, tasks, move between a lot of hands right now, and it's very helpful to have a common language. Tasks, in a way, are a way of specifying: these are the capabilities that we care about, that we're trying to optimize towards. It's your product roadmap, it's your product spec. And having a unified format, and even a framework that people can use to get value out of the data in that format, I think helps solve a coordination problem in the industry.

SPEAKER_02

Makes a ton of sense. I guess moving forward, as you're trying to predict, hey, how do you build out Harbor, how do you build out the next versions of Terminal Bench, what do you view as the invariants? What are the things that aren't changing, that you feel will be true for many years to come, as you continue to shape the future of where evals and agentic development in general are going?

SPEAKER_01

It's a good question. I think there's a lot of talk right now about continual learning and learning online only. I really hope that is a problem that gets solved, because it's fairly obvious how that would benefit all the users of agents. But I do think that in the foreseeable future, offline is going to be extremely important: how do you take these workloads that are happening in online settings and then encode them into tasks and environments that can be used during offline training, offline agent development, these sorts of things? And I don't see any reason why you can't directly take what you're seeing in the online setting and try to encode it into these offline tasks and environments. I think that will probably continue to be a core piece of agent and model development into the next few years. And maybe the trend that we're starting to see now, that I think will continue, is a lot of companies are going to start caring a lot about data and evaluations, and they're going to be trying to turn their workflows into tasks that they can then start automating with agents and models. So I just think the demand for this type of data and this type of optimization is about to explode.

SPEAKER_02

One of the things that we believe internally, very much so, is that by shaping the data, you're shaping your product specs and you're really defining where not just your own company and enterprise products are going, but also where the field is going. And we were joking earlier that probably the best way to get a capability into these models moving forward is to contribute a Terminal Bench task, right? Because that's the way these frontier labs are going to be benchmarking; that's the way progress in hill climbing is going to be defined. And so thinking of data as the way to chart the roadmap for the whole space is something that resonates a ton. With that in mind, where do you want things to go? What are some benchmarks that you're excited to see out there that maybe don't exist today?

SPEAKER_01

So the benchmark that I personally want to see the most is like a meta-benchmark. It's a benchmark that measures the ability of agents to write verifiers. I think verification is often the hardest part of task creation. You can usually tell if a task is a good one because the instruction will be relatively concise, and often the verifier will be relatively concise as well, or it will lean heavily on existing verification infrastructure that exists in the real world. And like I said, creating the verifier is often the hardest part of creating a task. There are so many ways to seed environments and seed instructions, but it's hard to seed verifiers. So in theory, I could take my agent observability rollouts, and that gives me instructions, maybe even environments. If I could just automate that verification piece, then I could start generating tasks like crazy. But right now we don't have a way of knowing how good our agents are at creating verifiers, and it's difficult to measure that. I think maybe you'd have to create a ton of different solutions and make sure that all those solutions still pass against the verifier, or something. Like you need some way to do that, yeah.
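
One hedged sketch of how that measurement could work, following the suggestion above: hold out pools of known-correct and known-incorrect solutions for a task, run each one against the agent-written verifier, and score how well the verifier separates them. The run_verifier callable and the solution pools here are hypothetical.

```python
# Illustrative scoring of an agent-written verifier. run_verifier is a
# hypothetical callable that applies one candidate solution in a fresh
# environment and returns True if the verifier accepts it.

def score_verifier(run_verifier, good_solutions, bad_solutions) -> dict[str, float]:
    """Reward verifiers that accept correct solutions and reject incorrect ones."""
    accept_good = sum(run_verifier(s) for s in good_solutions) / len(good_solutions)
    reject_bad = sum(not run_verifier(s) for s in bad_solutions) / len(bad_solutions)
    return {
        "accept_good": accept_good,  # recall on correct solutions
        "reject_bad": reject_bad,    # robustness to wrong or reward-hacking solutions
        "score": 0.5 * (accept_good + reject_bad),
    }
```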

SPEAKER_02

What do you think that interface looks like? It sounds like it'd be an expert or human plus some sort of agent or model actually helping build these tasks. What is your view of what that ideal loop looks like in practice?

SPEAKER_01

I think it would look roughly like what I was just referring to, which is you take workflows that are encoded into production systems somehow. SWE-bench with GitHub is a perfect example of this; that's a unique example where it's a workflow encoded into GitHub PR history, but it also happens to come with a verifier in that case. That's typically not the case. So agent observability traces, other platforms that people do their work in: using those to seed the environments and the instructions, and then synthesizing verifiers, which probably needs a human in the loop to do the last mile and take those things over the finish line, so you actually have tasks. And then once you have tasks, in theory, that becomes a verifier for your agent to start creating the automation itself. Then that gets put into production, you get production traces, you take those traces, you turn them into tasks, and hopefully this is the flywheel or the loop everyone talks about. But I do think that that is probably the North Star worth building towards.
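
As a loose outline of that loop, here is a sketch with every stage passed in as a callable, since none of these steps exist as a single library today; all of the stage names are hypothetical, and the human-review step is the last-mile piece described above.

```python
# Hypothetical outline of the trace-to-task flywheel described above.
# Each stage is injected as a callable; none of these are real Harbor APIs.

def flywheel_iteration(traces, seed_env, seed_instruction, synth_verifier,
                       human_approves, train_agent, deploy):
    """One turn of the loop: production traces in, a redeployed agent out."""
    tasks = []
    for trace in traces:
        env = seed_env(trace)                  # recreate the working state
        instruction = seed_instruction(trace)  # what the user was trying to do
        verifier = synth_verifier(trace)       # the hardest step to automate
        if human_approves(instruction, env, verifier):  # last-mile expert review
            tasks.append((instruction, env, verifier))
    agent = train_agent(tasks)  # tasks double as training and eval signal
    return deploy(agent)        # deployment yields the next round of traces
```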

SPEAKER_02

That's super exciting. To me, it's about making the way we program agents way more accessible in general. The more you can build verifiers with experts in the loop, of course, steering where the capabilities are going, the more you're able to shape these models in the ways that people actually want to use them, or businesses want to use them. I think it's a great bet. Okay, last question. What do you feel is something that people don't tell you about building benchmarks that is genuinely surprising?

SPEAKER_01

I think it's more fun to create a task than people realize, especially if it's a problem that you have solved historically that was hard for you. At least for me, it's a fun mental exercise to think, okay, what is the best way to frame this up in a way that's verifiable? And then I get to go see if the agents can solve it. So that's maybe an unexpected thing. And then the other unexpected thing is how painful it is to review tasks. It's extremely difficult to look at somebody else's task, try to load it into your brain, and decide whether or not it's implemented well and what capabilities it's testing. So I think it's a lot more of a grind than people think to review the benchmark and put it out, but it's a lot more exciting than they expect to actually create the tasks in the benchmark.

SPEAKER_02

Anything else you want to share more broadly that I didn't ask about?

SPEAKER_01

I guess the only other thing I'd mention is we're working on Terminal Bench 3 right now. And the thing we want to do most is what I was actually telling Vincent beforehand. With Terminal Bench 1, it was mostly tasks we tried to create ourselves, and then we had to pretty much beg people for tasks. And then we achieved some level of success based off of it. Terminal Bench 2 was a little bit easier to get tasks for, from other PhD students and stuff, because it was becoming a more popular benchmark. But I think now we're in a position where, for a lot of people, Terminal Bench 3 is their best opportunity to influence model development. And if we get enough buy-in from people who are great at making tasks, who could otherwise go make a little mini-benchmark themselves but decide, okay, this should be part of Terminal Bench 3, we can create the best benchmark out there, one that should dictate model development for the next year or something like that. So this means trying to get the best people at the best companies to come and contribute a task based on the most difficult thing they've had to solve, something they would be paid a lot of money to solve. So yeah, we're looking for contributors for Terminal Bench 3, and we would love to get as many people as possible involved in that process.

SPEAKER_02

Love it. So if you want to shape AI and where the field's going: Terminal Bench 3. Contribute and you'll play a big role. Awesome. Thanks so much, Alex, for christening Benchtalks. That's right. We'll be excited to have you back soon. Thank you so much. Thanks.