Benchtalks

Benchtalks #2: John Yang (SWE-bench, ProgramBench) - The future of coding benchmarks

Snorkel AI Episode 2

For our second Benchtalks, the series dedicated to the researchers building the measurement toolkits that frontier labs hill-climb on, Snorkel AI co-founder Vincent Sunn Chen sat down with John Yang, a Stanford PhD student and creator of the SWE-bench franchise, SWE-smith, CodeClash, and most recently ProgramBench.

This interview covers: 

  • Why every frontier model scored 0% at launch — until GPT-5.5 cracked the first task (cmatrix)
  • Why ProgramBench grades the runnable artifact, not the implementation — and lets models build in any language
  • The post-training tell hiding in plain sight: how much models (especially GPT) love Python, even when it looks like a handicap
  • The reward-hacking problem — models with internet access cheated up to 36% of the time, and why nine LLM judges still couldn't agree on what counts as cheating
  • The lineage from SWE-bench to SWE-smith to CodeClash, and what ProgramBench needs from expert contributors to grow

Full interview/transcript: https://snorkel.ai/blog/benchtalks-john-yang-programbench/

John:

Honestly, when SWE-bench was released, I got some comments from, I won't name who, they're all very supportive, wonderful people, but at the release they were like, oh, this seems impossible. I don't know if models will ever be able to do GitHub, you know? And funnily enough, I think you get the same wave of comments with the ProgramBench release. This kind of skepticism, I think, is incredibly helpful. And I think it's almost healthy. But to me, the reason I've gone in the direction that I have is to continue to question. And the reason I say it's a big pivot now is that before, you could find things where it's like, well, Vincent and John and humans in general can do this task, can models? And now, I mean, I can't implement FFmpeg from scratch in one month, but yeah, maybe the model can. And I think that's a gear shift that's meaningful. Yeah, it's challenging to get right, and I'm sure we'll get it wrong in some ways, but overall I have faith in that direction.

Narrator:

Welcome to Benchtalks, Snorkel AI's podcast on benchmarks, data quality, and real-world impact. Today, Snorkel co-founder Vincent Sunn Chen sits down with John Yang, the creator of iconic software engineering benchmarks, including the SWE-bench franchise, SWE-smith, CodeClash, and most recently ProgramBench.

Vincent:

I'm here with John Yang. John is a PhD student at Stanford and creator of our favorite software engineering benchmarks, including the SWE-bench franchise, SWE-smith, CodeClash, and most recently ProgramBench. We're super excited to have you. Welcome to Benchtalks, John.

John:

Thanks so much for having me.

Vincent:

Last week you dropped ProgramBench. This was a really exciting new benchmark focused on end-to-end, program-level output, similar to Carlini's work on the C compiler. When you first launched it, every frontier model had a 0% pass rate. I guess just this week, GPT cracked that 0% ceiling. What has the release been like? What has the last week in general been like for you?

John:

Yeah, it's been a fun time. I think a lot of people were quite excited because, in many ways, there have already been really great works that have investigated this problem at a case-study level. And to some extent, I feel like it is a natural follow-up to the SWE-bench setting, where you have this much simpler environment of just: here's a GitHub issue, and then you create a PR for it. And that was meaningful for two years. But now that these models are so good at solving these PRs, the question is, can it put together a large application? And so Nicholas Carlini and Anthropic's blog on the C compiler was a really meaningful inspiration. There's a blog post from Cursor that was talking about the different multi-agent setups they were using to put together a browser. And then Epoch AI, along with METR, had RE-Bench, where they did this case study on four repositories. That captured a lot of the things that we were thinking about as well. So that was super exciting. To kind of TL;DR, I feel like the role of ProgramBench is really to formalize this setting and also to add enough task instances that we can study this domain with meaningful statistical power. Because I think when you have these one-offs, these settings don't necessarily transfer. People use different things. There are certainly different incentives that people have. So, as with all benchmark builders, the hope is: here's this equal ground, equal footing where we can study this task with purpose and with a lot of clarity.

Vincent:

Yeah, no, it's a super exciting launch and overall vision for where these software engineering style tasks are going. Talk to me a little bit more about the style of the eval. In this case, you're evaluating the artifact, not just the code path or the implementation. You're up-leveling the interface and abstractions at which these agents are working. Help me understand that a little bit more. What is the goal with that specific task formulation?

John:

Yeah, I really love the description you just gave. I think I couldn't have said it better. One of the key things for me was that there have certainly been these zero-to-one code generation benchmarks that existed before. There's really great work that Sasha Rush and his student Wenting Zhao did with CommitZero, which was released close to when SWE-bench was released. That paper was a very good reference point and also inspiring for me, because the idea there was: we're gonna generate a Python library from scratch. They took, I don't know, something like marshmallow or NumPy, they wiped out all of the implementation, but they still kept the function headers and the class headers and everything. And the way they evaluated was that the model is effectively asked to implement everything, and then they would use the existing unit test suite. I think the big thing I wanted to do was: how do I make the solution space completely up to the model? Meaning that with CommitZero, you're still imposing a general programmatic scaffold that dictates what language the model should be implementing in, what the classes and the functions are, and the relationships and the parameters and the typing. It's certainly not everything, but it's a very good chunk of system design and language selection and how you think about decomposing modules. And I think these are all things that we want to meaningfully study. So the small eureka moment was when I was talking to Kilian about this idea, and he said, oh, you know, it's not so much the implementation that's important, it's really the artifact, the deliverable that it creates. And so that was when we made this decision of, all right, let's completely put aside implementation as a general principle for the benchmark, and let's do everything around the artifact.
That's why it's called ProgramBench: the program. But specifically in this case, we do executables and binaries that you can run in the terminal.
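
To make the idea of grading the artifact instead of the implementation concrete, here is a minimal Python sketch. It is not ProgramBench's actual harness: the helper names (`run`, `same_behavior`) and the stand-in programs are hypothetical, and a real suite would cover far more than exit code and stdout. The only point it illustrates is that a check can invoke two programs and compare what they do without reading any source.

```python
import subprocess
import sys


def run(cmd, stdin_text=""):
    """Run a command and return what an outside observer sees: exit code and stdout."""
    proc = subprocess.run(
        cmd, input=stdin_text, capture_output=True, text=True, timeout=10
    )
    return proc.returncode, proc.stdout


def same_behavior(reference_cmd, candidate_cmd, args, stdin_text=""):
    """True if both programs give the same exit code and stdout for one invocation."""
    return run(reference_cmd + args, stdin_text) == run(candidate_cmd + args, stdin_text)


# Hypothetical stand-ins for a reference executable and two submissions.
# REFERENCE and CANDIDATE are written differently but both upper-case stdin;
# BROKEN echoes its input unchanged.
REFERENCE = [sys.executable, "-c",
             "import sys; sys.stdout.write(sys.stdin.read().upper())"]
CANDIDATE = [sys.executable, "-c",
             "import sys\nfor line in sys.stdin: sys.stdout.write(line.upper())"]
BROKEN = [sys.executable, "-c",
          "import sys; sys.stdout.write(sys.stdin.read())"]
```

Because the check only looks at observable behavior, `CANDIDATE` could be swapped for a binary written in any language without changing the test.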

Vincent:

Yeah, I find that super interesting in your characterization, that this is a way that we can study the actual ways that coding agents are working. I think in the paper you had a few interesting findings, at least to me. There were flatter directories in general. I find it funny that it brings back up the monolith type of debate in software engineering. Interestingly, we saw GPT one-shot almost 100% of the code bases it tried. What were some of the surprising findings on your end around how agents behave in these more end-to-end types of problems?

John:

One I particularly love was this. In the default inference setting for ProgramBench, when models solve it, the key constraints are that they're not allowed to use the internet, and we don't allow them to use Ghidra or things like strings, that kind of reverse engineering of the binary. So those are the two key constraints, amongst the smaller ones. But we do allow the model to implement its solution in any language it wants, including the native language. So instinctively, you might think that if the executable was originally written in Rust and the model chooses Rust to re-implement its solution, it must have a bit of a leg up, you know. But what we found is that's not always the case. So first and foremost, models don't just pattern match and say, oh, this looks like FFmpeg, I'm definitely going to implement it in C. We have this nice language confusion matrix in the paper that I really like that shows this. I think the key takeaway that was quite fun for me was that models really like Python. And I think that artifact of post-training is very, very evident in the setting. But what's more is that with GPT-5.5, which was the model you mentioned that solved the very first task instance out of the 200 that we have, that cmatrix animation was not written in Python, but the solution it gave was totally written in Python. And if you look at the proportions, I think the Claude models tend to be more like, sometimes I'll use Go or Rust. Rarely C or C++ for anybody, which I think is interesting. I'm not sure how to interpret that takeaway, but yeah, I mean, GPT overwhelmingly really enjoys Python, which initially you think maybe is shooting itself in the foot, but it's very much not. So yeah, that's fascinating.
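
A language confusion matrix of the kind described here can be tallied from pairs of (original language, language the model chose). The sketch below is only an illustration in Python: the function names are hypothetical and the records are made up, not results from the paper.

```python
from collections import Counter


def language_confusion(pairs):
    """Count how often each (original language, chosen language) pair occurs."""
    return Counter(pairs)


def match_rate(pairs):
    """Fraction of tasks where the model chose the program's original language."""
    matches = sum(1 for original, chosen in pairs if original == chosen)
    return matches / len(pairs)


# Made-up records for illustration only, not data from ProgramBench.
records = [("C", "Python"), ("Rust", "Rust"), ("C", "Python"), ("Go", "Python")]
```

Reading the off-diagonal cells, such as `("C", "Python")`, is what would show a model preferring one language regardless of how the reference was written.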

Vincent:

I think one higher-level reflection I've had, I mean, obviously the headliner here is, hey, you've built this really effective, first-of-its-kind benchmark that had a lot of headroom, right? The 0% frontier model score at launch. But I really see this all as almost a research tool. We're able to ask a lot more questions about how to fuzz or validate a lot of these repos, as we're talking about. We're able to ask questions about cost versus quality trade-offs in terms of implementation, and introspect different types of tasks. How does performance differ for bugs versus features? How do you think about this as a research tool, or something that the community should be building on top of more broadly?

John:

Yeah, for sure. I mean, when I read the Cursor blog, just that one case study gave me so much insight and information. We do 200, and I think that's the right way to go to scale these things up. But I think even just the experience of looking at what models do for one task instance is valuable. You run multiple models, and in this case, we fixed the scaffold to mini-SWE-agent, the justification being that this is the successful paradigm that came from the SWE-bench era. And now we want to stress test it. And it's actually good, I think, that it starts a conversation, very concretely, about where your scaffold can be better. But sorry, to answer your question very directly, I would recommend taking one task instance, over-indexing on that, and looking at what the actual trajectory looks like when this model does this. What proportion of time is it spending actually probing the executable? How often is it writing? That's where that insight came from that Claude is a little bit more interwoven, where it'll probe a little bit, implement a little bit. And then the next round of probing is to understand what the gap is between what I have in progress and the reference. Versus GPT says, oh, let me do a really thorough job, flesh out this very thorough specification for myself, and then let me one-shot it against the specification. And so I think it's just the tip of the iceberg in terms of the way we can understand these models as developers, but also understand ultimately what we want to garner and how we want to affect this kind of software development behavior.

Vincent:

So what is your view on how these benchmarks should be used by software engineers or others in dictating the changes in the field, or how folks are going to be approaching this discipline moving forward, with agents in the loop, of course?

John:

Yeah, definitely. I think something that a lot of prior benchmarks, especially Terminal-Bench, have done a really fantastic job of is collecting a lot of community power, community effort. I think this tagline, that you can add your task instance to Terminal-Bench and have that be something that the frontier is evaluated on, big credit to Mike and Alex and Ludwig and obviously the folks at Snorkel, that's very compelling to me. So the way I hope this message carries over, and certainly I'll try to operationalize it, is that whatever your favorite command line tool is, put that in ProgramBench. We'll open up a leaderboard, but I think we'll also figure out how to open up a way to add tasks. And I think there's the whole pipeline of understanding how to take a task, how to build effective tests for it, and, whether it's a prompting problem or a scaffold problem or a model training problem, understanding the failure points that arise when you point it at recreating or developing different programs. I just think there are a lot of general takeaways that will be generally profound. But then there are also things that are really valuable in terms of the nuances. For example, the first one that GPT-5.5 solved is cmatrix, which is this nice terminal matrix animation. And you can see, wow, it actually deals really well with terminal UI things. I think that's a surprising finding that I wouldn't have expected.

Vincent:

I'd love to talk a little bit about the lineage building up to ProgramBench, and maybe even the future of it. Walk me through how you got here, right? A lot of the band is back together for ProgramBench, as we were talking about. But you started with SWE-bench, the SWE-bench franchise of data sets. The community really took that in stride, and there have been a lot of extensions to date. You did a bunch of work on SWE-smith that we're fans of, where you synthesize tasks against environments, tried different methods like CodeClash as well, the tournament-style eval, and most recently ProgramBench. What does that journey and lineage look like for you, and where do you see that going?

John:

Oh man, a lot of luck for sure, and a lot of thanks to the people I've been able to collaborate with. But yeah, I guess if I were to construct a narrative in hindsight, this journey started back when I was a master's student at Princeton. The first person I really worked with, who I think I learned so much from, was Shunyu Yao, who did ReAct and Tree of Thoughts and a lot of this really fundamental agent work. And so, to answer this question in terms of the way to think about this lineage, where it's going, and the things that are still fundamental principles: I think the fundamental things are that there's rigor, that there's high reproducibility. When I collaborated with Carlos on SWE-bench, putting together the task instances (the original was 2,294 task instances), finding those PRs and issues wasn't difficult for me or Carlos at all. We got together in, like, June 2023 and we just sketched this out on a whiteboard in one day. And we spent the next month and we had it. Really, what ended up taking us the next three or four months was reproducibility. I mean, it sounds amazing to people now, but when we released SWE-bench originally, we didn't use Docker, we used conda environments, and that ended up being kind of a nightmare for reproducibility. This is totally my fault, because Carlos was like, maybe we should do Docker. I was like, oh, but it's so heavy, you need to download gigabytes of images. Let's just do conda, it's gonna be fine. Yeah, he was 100% right at the end of the day. So that was a driving theme. And I think Kilian and I invested so much time in making sure ProgramBench was like this too, where people could download it and run it. So that's what's remained the same.
I think what has been different is being willing to challenge our notions of what language models can do. And I think ProgramBench in particular, to me, represents a fairly big pivot. So the very first project I did was called InterCode. This was taking a lot of, like, NL2Bash, and people probably don't remember these, MBPP and HumanEval, a lot of these code benchmarks that were meant to be input-output, no interaction at all, and casting them into an interactive format. And so the question there was, can the model benefit from this? With SWE-bench, it was like, look, this stuff is not realistic, or it is realistic in terms of passing LeetCode questions, but this is not what we do day to day. Can models do this? And honestly, when SWE-bench was released, I got some comments from, I won't name who, they're all very supportive, wonderful people. But at the release, they were like, oh, this seems impossible. I don't know if models will ever be able to do GitHub, you know. And funnily enough, I think you get the same wave of comments with the ProgramBench release. This kind of skepticism, I think, is incredibly helpful. And I think it's almost healthy. But to me, the reason I've gone in the direction that I have is to continue to question. And the reason I say it's a big pivot now is that before, you could find things where it's like, well, Vincent and John and humans in general can do this task, can models? And now, I mean, I can't implement FFmpeg from scratch in one month, but yeah, maybe the model can. And I think that's a gear shift that's meaningful. Yeah, it's challenging to get right, and I'm sure we'll get it wrong in some ways, but overall I have faith in that direction.

Vincent:

I mean, I completely agree. I think one of the things I find really exciting about the approach you and other benchmark builders have taken is that you're really charting the roadmap for where the field is going, right? You're putting forward these challenges, you're charting a path where, hey, this is where we think capability should or can go. And the skepticism is often a good sign that you're going in a direction that's orthogonal to what people believe is possible. But I think that ethos is really, really inspiring and an awesome way for all people building benchmarks to think about their work.

John:

Definitely, definitely. Yeah.

Vincent:

I'd love to talk also a bit about the methodological differences of the benchmarks that you've worked on. You've explored a number of different ways to grade model outputs, from unit tests to tournament-style eval, now to this awesome fuzzing and output-driven validation approach. What have you learned about how to grade these models? What are the trade-offs between them? How do you think about this now?

John:

The evolution of verification itself, I think, is a very meaningful problem. And in a lot of ways, it drives the way I think about it, and I think the way a lot of people think about it. So with SWE-bench, the verification didn't do anything new conceptually, in that MBPP, HumanEval, APPS, all these LeetCode-style things, also had unit tests. The change there was, okay, with HumanEval, even in the paper they mention that they actually have people write the tests. And Carlos and I were like, we're not OpenAI, we don't have this kind of money. How do we look for tests in the open, in the wild? So I think that was the key thing there. Otherwise, from a formulation standpoint, I think it was very familiar to people, which is part of why I'm sure it got adopted. With CodeClash, I was very inspired by the LMArena stuff that the wonderful Berkeley people have done, and I thought, okay, what does it mean to bring this into a code setting? There were some prior works that had done things like Copilot Arena, or even in the products with VS Code and Cursor themselves, you see there's some voting mechanism of proposed version A, proposed version B. I felt like that made sense, but I also questioned whether it would scale, because how many times are you gonna be able to ask someone, do you prefer A or B, where it's two different code snippets, before they throw up their hands and say, I don't care, as long as it works? With CodeClash, it was testing this idea of, all right, let's actually literally take the code artifacts and make them compete against each other. And so that then lent itself very naturally to this idea of how do you rate competition.
I think something I still like about CodeClash, even though it hasn't been picked up as much as ProgramBench, is that technically it deals with the saturation problem a little bit better. It's very open-ended. It's truly, faithfully open-ended in that way. If a model is able to create one solution that's better than another, that's all that's needed for progress, right? So I'm hoping that maybe it's a little bit ahead. It's also possible that because the arenas are all games, people are saying, I want more economically valuable things, which I can understand. But I think that idea of making artifacts compete against each other was certainly then part of the inspiration for ProgramBench, where, okay, form factor-wise the tests are still written as pytest, but they're all invocations of the executable, all calls of the program. It is no longer touching anything about the implementation. So now you completely disentangle the evaluation from the problem specification. The overarching theme is, fundamentally, what do you want to test about the model? I think with SWE-bench, it was, can you implement correctly? With CodeClash, it's, hey, can you evolve your code base and compete such that A is better than B? And then with ProgramBench, it's, can you create this artifact? And I think that "what's the end goal" point of view will then naturally, with some deep thinking, inform what the verification should look like.
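
The question of how to rate a competition between artifacts has a well-known textbook answer in Elo-style ratings. The interview does not say which rating scheme CodeClash uses, so the Python sketch below is an assumption for illustration only; the function names, the starting ratings, and the `k` value are all hypothetical choices.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expectation: probability-like score that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a, rating_b, score_a, k=32.0):
    """Update both ratings after one match.

    score_a is 1.0 if A's artifact won, 0.0 if it lost, 0.5 for a draw.
    The points A gains are exactly the points B loses.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta
```

For example, two artifacts that both start at 1500 end at 1516 and 1484 after a single decisive match with `k=32`. A scheme like this never saturates in the way a fixed pass rate does, since a stronger submission can always move ahead of the current best.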

Vincent:

That resonates. The, hey, what is the outcome that I'm trying to actually evaluate? And then let's find a clever way to actually inject that reward or learning signal as a result. What do you see as the role of human knowledge or human expertise in driving these evaluations or steering these types of tasks more broadly? I know you had a position paper around how humans need to be a bigger part of these coding benchmarks and research in general. Where do humans need to be in the loop as you're thinking about these types of mechanisms?

John:

I've had a couple of conversations with my advisor, Diyi Yang, who's very into human-AI collaboration. And I came into the PhD program being more autonomy-driven. But in the two years since, I've been very convinced that this is something meaningful. And so, to really ground the inspiration: ProgramBench is a much longer-horizon task than SWE-bench for sure, and ultimately it's exciting that the solution space is so large. But in some sense, even as human software developers, we do prefer that solutions have certain characteristics to them. And it's conditioned on the purpose of the software. But generally, we want them to be styled well, we want them to be readable, we want them to be portable. I think there are some general heuristics that are true, where I'm sure some of these labs are probably already post-training good style and common documentation into this. But there's something about empowering the individual to say things and then very quickly steer the model effectively to go with their preferences. I think generally it's just important to me, this idea that people can build software the way they want it built. I don't understand hardware. I don't understand what goes into, like, HDL. Those people are experts. But for you and me, as the people who are putting these tools together, I think a better way to solve the problem is not to figure out what they do and then decide, all right, these are the things we should inject, but more to make the model itself fundamentally steerable. So in the position paper, with Zora, who's a wonderful co-author at CMU who works with Graham Neubig and the OpenHands folks, who are great, she talks about this idea of steerability.
And I think that's something that feels very compelling to me. Sure, when we get to the day where we're at, like, 80% on ProgramBench, that's very exciting. But then what does that mean for the people that ultimately use these tools that are being benchmarked on these things? If it can get 80% on ProgramBench, but it's just implementing everything in Python and in single files, and it's not really using a file system, and the code is not super reproducible, that's a meaningful end state, but I think we can ask more out of the models.

Vincent:

Yeah, I agree with that. I think that end state is more ambitious and harder to evaluate and measure in general. But yeah, we're really glad that you put out that position paper. That and others' thinking is an awesome charting of, again, where we think things might go. I'd love to talk a bit about quality as well. I think one of the pieces that we pay a lot of attention to when we're building data sets or environments or benchmarks is, hey, how do you make sure overall aggregate quality is at a really high standard, and individual task-level quality is at a high standard? Having models or agents in the loop is really effective. It can be a double-edged sword in some cases. In the case of ProgramBench, you used models in many ways to steer and generate a bunch of the individual instances. You've had a lot of learnings from SWE-smith as well, where you generated tasks from well-curated environments. How do you think about these settings where agents and models are used so heavily in the loop, but you also want to maintain really, really high-quality tasks?

John:

So far, to be honest, with SWE-smith and ProgramBench, it's been pretty case by case. But it's kind of a funny thing with benchmarking, where you want to build something that models effectively can't quite do yet, and you want to help them get there. But along the way, if there are certain portions that the model can help with, then you want to get them engaged, you know. So it's kind of funny. But with ProgramBench, I think the big thing was that test suites are not available. Behavioral-style test suites that involve invocations of these programs are not available. We could ask a person to help with this, but we're kind of slow, and to some extent, I'm not sure that a person without true expertise in FFmpeg can really write a good test suite. To concretely answer this question, I think models these days are getting better and good enough that there are probably three vectors I would imagine you could experiment with using the model. One is to assist you with generating the verification. Another, the SWE-smith-oriented take, is to actually create task instances, where you have the code base, it's installed, and then you do things like ask the model to write a funky implementation that breaks tests. And that's another thing. The last vector, I would say, is actually constructing the environment itself. This is something where, for both SWE-smith and ProgramBench, it was very useful. For ProgramBench, it was, okay, if I want to get the actual reference executable, clone the code base and then ask the model to write the compilation script, with all of the source code available, to generate the executable. With SWE-smith, it was, here's the Python library, make sure you install it, but then also make sure you can run the unit tests as well.
So it's hard for me to come up with a good underlying conceptual answer, mainly because I think people are just experimenting with it, and it's just empirically effective. But yeah.
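
The second vector, asking a model for a change that breaks tests, implies a simple acceptance check: a generated change is only useful as a task if some test that passed before now fails. The Python sketch below illustrates that check under stated assumptions. It is not SWE-smith's actual pipeline, and the function names and the dictionary format for test results are hypothetical.

```python
def newly_failing(before, after):
    """Return tests that passed on the original code but fail after the change.

    `before` and `after` map test names to True (pass) or False (fail).
    A test missing from `after` is treated as failing.
    """
    return sorted(
        name for name, passed in before.items()
        if passed and not after.get(name, False)
    )


def is_usable_task(before, after):
    """A candidate change is only useful if it breaks at least one passing test."""
    return len(newly_failing(before, after)) > 0
```

Tests that were already failing before the change are ignored, so a flaky or pre-broken test cannot make a candidate look like a valid task.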

Vincent:

No, I don't think there's a one-size-fits-all solution either, right? I mean, in practice, a lot of what I and our teams find is that injecting human expertise and steering and high-level guidance in the right places efficiently is tricky. It's case by case, it's pipeline or workflow dependent. But having an intelligent way to say, hey, this is where a human needs to be in the loop to steer this part of the program, and this is where, actually, yeah, let the agents rip and take it over, I think makes a lot of sense. So you're certainly at the frontier of that.

John:

I mean, I think like in some sense, we can afford to kind of be very precise and focus more on sort of small scale experimentation. Right. Because unlike before, once you get that right, it's much easier to scale up with agentic systems.

Vincent:

In the ProgramBench paper, you stated that for models that had internet access, you found reward hacking up to, I think, 36% of the time. You also found, back in your early SWE-bench days, let's call it leaderboard hacking, when it came to verbatim solutions. How do you think about leaderboards and open benchmarks? And as a benchmark builder, how do you address this overall challenge of making sure that we're measuring model performance in an honest and trustworthy way?

John:

I think there are two parts to this really great question. The first half is that models are cheating, and the second is how you set up a leaderboard that's open. They're both really meaningful questions. When ProgramBench was launched, a big topic online was: well, humans use the internet, and the internet is certainly useful, so why disallow it? I would say the central thesis we were really going for with ProgramBench is that anytime there's an X percent improvement on ProgramBench, we want it to be undoubted: there's no question, no asterisk around whether the model actually got it. It would be really cool to see a model go on GitHub and find adjacent, relevant code, but not the source code itself. Maybe it checks Stack Overflow, maybe it checks a language specification or something like that. I agree that the internet is useful, but not at the cost of sacrificing the rigor of the benchmark. And so the finding that you pointed out was twofold. Early on in the experimentation process, we allowed the model to use the internet. Then what we would do, and this is big credit to Kilian, is run nine different LMs that look at the trajectories and decide: did the model cheat or not? And you have these cheating criteria specified in natural language. The problem wasn't just the rate. I mean, a third is extremely high, and it would have been very tricky for us to release a benchmark where we're saying zero percent, but also they cheat; the messaging is very muddled at that point. Kilian also found that the judges would disagree. We have a couple of examples in the paper where it's a five-to-four vote, and five are saying, oh, it's not allowed to look it up on SourceForge or something like that.
And the four are saying, well, it was only GitHub that was explicitly disallowed, so SourceForge should be fine. And it becomes this kind of crazy cat-and-mouse game. So I think this is a constant tug of war for benchmark builders, but the typical advice I would go for is that rigor is king, reliability is king, reproducibility is king. Everything else is kind of a cherry on top. If any of those things start to slip, that really matters, because the rigor is core to having people come together and hill climb and not point at each other saying, you didn't do it right, you didn't do it right. That's important.
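To make the judge-disagreement problem concrete, here is a small Python sketch of aggregating per-trajectory votes from a panel of LM judges and flagging close splits. This is an illustrative assumption, not the procedure from the ProgramBench paper: the `Verdict` fields, the simple majority rule, and the `review_threshold` value are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    cheated: bool             # majority decision
    votes_cheated: int
    votes_clean: int
    agreement: float          # share of judges on the majority side
    needs_human_review: bool  # True when the panel is too divided to trust


def aggregate_votes(votes: list[bool], review_threshold: float = 0.75) -> Verdict:
    """Combine boolean 'did the model cheat?' votes from several judges.

    review_threshold is a hypothetical cutoff: if fewer than this share of
    judges agree, the trajectory is flagged for a person to read.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    votes_cheated = sum(votes)
    votes_clean = len(votes) - votes_cheated
    agreement = max(votes_cheated, votes_clean) / len(votes)
    return Verdict(
        cheated=votes_cheated > votes_clean,
        votes_cheated=votes_cheated,
        votes_clean=votes_clean,
        agreement=agreement,
        needs_human_review=agreement < review_threshold,
    )


# A five-to-four split like the one described above, versus a unanimous panel.
split_panel = aggregate_votes([True] * 5 + [False] * 4)
unanimous_panel = aggregate_votes([True] * 9)
```

With nine judges, a five-to-four split has an agreement of 5/9, so under this sketch it would be flagged for manual review rather than counted as a confident cheating verdict.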

Vincent:

For the second half of setting up the leaderboard, yeah, I think

John:

Curating submissions for SWE-bench was certainly a wonderfully interesting experience. As of today, and we've kind of tailed off on accepting solutions at this point, I think we ended up accepting over 300 on the leaderboard. I think the lesson from that experience is to set up leaderboard submission pipelines that enable members of the community to be inspired by, and also double-check, each other's work. There was a big moment in SWE-bench where, because of some things that had happened, and this has been repeated in later benchmarks, we said you must submit trajectories. I'm really glad to say this is something that Terminal-Bench has adopted now, and we'll certainly see that adopted for ProgramBench as well. The moment we asked people for all those trajectories, one, with the people that didn't submit, you naturally remove confusion around how exactly those solutions came to be. And the second thing is, when the trajectories get uploaded, people can look at them, and it does actually inspire new approaches, but it also helps with answering meaningful clarifying questions like, is this cheating? Credit to the FAIR coding team: they were the ones that brought up this behavior in SWE-bench of some of the models fast-forwarding, going to future commits to solve the problem. And thanks to them, because after that we realized a lot of people weren't reporting these things, not necessarily out of malice, maybe because they just hadn't caught it. But yeah, I think just be as open as possible, because it helps with everything: with reproducibility, and with the trust that you can build between you as a leaderboard host and the people participating. Yeah, it's never bad to be too open.

Vincent:

That makes sense. How do you go about that error-mode, failure-mode analysis? I can only imagine, right, as these trajectories get longer and longer horizon, and as the methods, as you pointed out, get more and more creative. How does one try to understand and even dig into the trajectory data to make sense of it?

John:

I think there are two sides of the coin for me on this. One side is, we should definitely invest in better tools, using language models themselves. I think there's been great research around this. For instance, the Transluce folks have the Docent tool, and you can inspect SWE-bench trajectories using Docent. They're all loaded up. You can ask questions about them, and it'll give you an answer aggregated across all of the instances it inspects. So tools like that are absolutely worth investing in. But on the other hand, nothing beats just sitting down and manually scrolling through one of them. So I think it's kind of a balance. If anything, it echoes the thing we said earlier about the role of model involvement and human involvement in these things: take a little bit of time, figure out how it looks for one or two instances manually, go through that, and then scale above and beyond.

Vincent:

Yeah, I'm a big believer in both as well. We used to have, as an onboarding exercise at Snorkel, you just gotta dig into the data, dig into the trajectories. Because to your point, nothing beats building personal firsthand intuition for what the shape of the data looks like, what artifacts might look like, and then translating that to more programmatic approaches. I agree with that method. It makes a lot of sense.

John:

Yeah, there's just the threat of drift if it's always kind of one layer above, you know. Exactly.

Vincent:

I'd love to move to a lightning round, if you don't mind. So hot takes are welcome. No need to go super deep; we'd just love your opinion on some of these high-level ideas. You tweeted that we're moving from a regime where we ask whether language models can do X, the things that humans can do, to one where we're asking, can language models actually do things that were previously impossible? What are the types of tasks that fall into that second category?

John:

I've talked with Ofir Press, who's a longtime collaborator, about this, and I actually think he has some really excellent thoughts. He has this one diagram, which is called something like the five steps or five levels of benchmarking, where one, two, and three are all kind of in the realm of, can humans do this? But then four and five are kind of like, oh wow, humans can't do this. So I think ProgramBench, and this is very much stealing his words, so thanks, Ofir, is at four. Four is in the realm where there is a reference solution that exists. For example, for ProgramBench, the FFmpeg source code exists. You can generate tests on top of it. There's continuing to be progress on it. Basically, we're now compressing the timeline in which we can do that. So models can effectively do the same thing, but in a way shorter time span than people can. And I think that's superhuman in the sense of, yeah, we could still pull this off, but it just took us a way longer time. So I think that's one way to think about it: what are artifacts, or general representations of collective human intelligence, things that took a long, long time, that you can now have the model try to recreate? Recreating is such a good proxy, because if a model could meaningfully create FFmpeg, then you can have faith that maybe it's gonna create equally revolutionary, game-changing software that's novel, that's not FFmpeg, that we haven't seen. But it's hard to grade it on those unseen things. So I think the seen things are good.
The fifth realm, and I think this is maybe where it's more sci-fi to me, and there are maybe people with better formal ways to approach this, is just things that we know are literally unsolved, whether it's those crazy Millennium Prize problems or mathematics and physics, really challenging the boundaries of science and our understanding. I'm sure this is where a lot of the excitement around recursive self-improvement and AI scientists really lies. And I agree with that. I think my personal approach is more that having models be able to recreate the things that already exist is a good stepping stone to then having meaningful empirical faith in creating something new that's just as, if not more, impactful.

Vincent:

Yeah, I subscribe to that mindset as well. I think there's real value in empiricism, right? That's what benchmarks give us. I'm as much a fan as anyone of the vibe eval, or the more future-looking, projection-based eval, but having a way to empirically, scientifically study these models in their current state, I think there's nothing that beats that. Exactly. Exactly.

John:

Couldn't agree more. Yeah. Awesome.

Vincent:

So GPT-5.5 just cracked one of the tasks in ProgramBench. When does the first model hit 80%?

John:

80%. Yeah, I guess for our sakes, hopefully not too soon. Just joking. Maybe a year, maybe a year and a half from now. The way I'm thinking about it is that ProgramBench, not super intentionally, nicely happens to have a gradient of difficulty. We have this difficulty scoring thing that we put in the appendix. If you look at cmatrix, it's literally one of the easiest ones. And our difficulty rating is not empirically based. It's purely based on looking at lines of code and the number of dependencies. It's purely grading with respect to the characteristics of the code base, so it's totally independent of performance. And why I think that's meaningful is, yeah, the one that's solved, it's fantastic, but it's one of the easier ones. So I think there's this nice gradient of difficulty up to the limit of ProgramBench, which is the toughest task instances: SQLite, the PHP interpreter, the Tiny C Compiler. There's this one really amazing dev called Fabrice Bellard, who wrote a bunch of this really fantastic software, like FFmpeg. I mean, there's a lot of true human ingenuity in that software. And I have faith that the models will get there one day. I think 80% on ProgramBench, which is 160 out of 200 instances, those 160 are things that are quite meaningful at that point, in terms of, okay, maybe it gets things like ripgrep or search, tools that are more focused in their utility. The remaining 20%, I'm not sure. Maybe it doesn't take that long. My hunch, at least for a human, is that it's a little bit of a long tail, in that something like ripgrep or jq or some of these things, they're tricky.
Yeah, but compared to FFmpeg, this is kind of a different league. So I think one year is my more bullish take, but I'm happy to be proven wrong. How would you characterize?
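The idea of a difficulty rating that depends only on code base characteristics, not on model performance, can be sketched as follows. The exact formula in the ProgramBench appendix is not given in this conversation, so the log-plus-weighted-dependencies score below is purely an assumption, and the three code bases are made-up placeholders rather than measurements of any real project. The last helper just reproduces the arithmetic that 80% of 200 instances is 160.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class CodebaseStats:
    name: str
    lines_of_code: int
    dependency_count: int


def difficulty_score(stats: CodebaseStats, dependency_weight: float = 0.5) -> float:
    """Hypothetical static difficulty: grows with code size and dependency count.

    No model results are used, so the score is independent of performance.
    """
    return math.log10(stats.lines_of_code) + dependency_weight * stats.dependency_count


def instances_needed(total_instances: int, target_percent: int) -> int:
    """Smallest number of solved instances that reaches target_percent (integer math)."""
    return (total_instances * target_percent + 99) // 100


# Placeholder numbers for illustration only.
codebases = [
    CodebaseStats("large_tool", lines_of_code=1_000_000, dependency_count=12),
    CodebaseStats("small_tool", lines_of_code=1_000, dependency_count=1),
    CodebaseStats("mid_tool", lines_of_code=50_000, dependency_count=4),
]

# Easiest first, giving the kind of difficulty gradient described above.
ranked_names = [c.name for c in sorted(codebases, key=difficulty_score)]
```

Because the ranking is computed before any model is run, a first solve landing on one of the lowest-scored instances is consistent with the rating without having been used to build it.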

Vincent:

You said there's a meaningful difference between the 80% versus the remaining 20%. Do you have words to capture what that shift actually looks like? Oh, for sure.

John:

I mean, I think some of these executables, if you just literally look at them, some have a lot of subcommands. So in some sense, even putting aside the lines of code and dependencies and all that, they just encompass different amounts of functionality. FFmpeg handles so many different audio and video formats, and it does all sorts of different things: you can transcode from MOV to MP4, but you can also apply this kind of encoding to it, and things like that. A model is gonna really have to do some heavy probing there, like downloading different assets. These assets vary by how long the video is, what the quality is, is it just a video of a person? I mean, I'm just coming up with these dimensions on the fly, but I'm sure there are so many things I've missed that FFmpeg accounts for. And so I think, to some extent, beyond just implementation correctness and models as software developers, ProgramBench is the one that truly tests curiosity: did you probe enough? Because if you didn't probe enough, it doesn't matter how good a software engineer you are. And I think something about that almost extends to models then becoming better research scientists and better innovators, in that for so much of what we do, there's no specification, there's no well-written doc for it. It's just someone poking endlessly at this thing until they discover a behavior that wasn't realized before, and they formalize it. It's not a matter of Thomas Edison not being smart enough. He just literally had to try a bunch of different metals until he got the one that worked. So I think this kind of persistence is exciting.

Vincent:

I love that notion that a bunch of software is actually the cumulative result of humans contacting reality and really understanding their own preferences, being curious, and trying to capture all the different ways that they might be able to use it. And for something like ProgramBench to be able to extend into that form of, as you put it, curiosity or persistence, I think is a really exciting way to think about where these models are going. What benchmark should more people be paying attention to?

John:

I think these days my personal preferences and focuses are on relatively long-horizon things. Within the realm of 20 to 30 to 40 turns, which I think is very much SWE-bench, I think we have the formulas in place to solve that pretty well at this point. So in terms of long horizon, I think it's meaningful because it then makes us ask questions about, I guess continual learning is a hot word these days, so I'll lean on that, but very much in the sense of, how do models construct memory for themselves? And that memory doesn't just have to be natural language. It could even be something like: this model and this model both solve this task instance, but which one provides a solution that's more extensible to whoever picks it up next? So concretely, kudos to you guys: the continual learning bench that part from Berkeley and your team did, that feels really great. I almost kind of like it in that the way I see it is that you have ProgramBench, which is like a hundred turns, just as a random number, but then you have continual learning bench, which is also effectively a hundred turns, but you have these checkpoints in the middle depending on where things pick up. One of the tasks I really liked resembles a lot of related SWE-bench issues, where you have issue A, and you can find these, rarely, but they still naturally occur online, where B was blocked by A, but then C was blocked by B, which was blocked by A, and obviously you can keep going there. And so once you unblock A, when you solve B, I mean, it's one thing to solve A, but writing that function in a way where it's at the right layer in the call stack, such that B can correctly invoke it, versus, oh, the solution is too high up in the call stack.
So then when you implement B, you end up redundantly repeating that code when it should be one layer below. That seems really fascinating to me. Yeah.

Vincent:

Well, appreciate the kind words. A lot of the goal there was, hey, how do we measure an agent's ability to learn from experience, learn from actually interacting with an environment, across multiple instances within an environment? It's something that humans can do quite efficiently and well. And I agree that that's one of the ways that we're really excited to also measure autonomy long term. What is a benchmark that you wish existed? Oh man. I think there's a lot of them.

John:

One I think I'm fairly excited about is using coding generally to tackle problems in domains that aren't just coding. To expand on what I mean by that: SWE-bench, for instance, I think is effectively a benchmark where we grade models on their ability to write code for the sake of better code. We fix this bug. But as we've seen with Claude Code and the overall proliferation of these SWE-agents, people are using them for so many things beyond just code. In some sense, I mean, Claude Cowork is fantastic, but I think it's almost like the hood of an engine, where the underlying engine is very much just a coding agent with a couple of extra tool calls. So I would just be really excited about leaning on the imagination of people that are outside of coding. This is maybe speaking in too many generalities, but I truly believe, hey, if you had law problems, could you recast that into something where it's manipulating code to some extent? It's looking through a bunch of documents, but it builds its own retrievers, stuff like that. So I guess this is a very, very high-level answer. If I had to sit down today and brainstorm and start acting on something, maybe something in the biology or medical industry, something along the lines of science: hey, there are these experiments, there are these things that people repeatedly do. I don't want to say too much out of distribution, because I'm not a biology person, but I would just be really excited about understanding more there, and understanding so many of these clinical experiments that are truly, realistically long horizon. What's the blocker there for why we can't deploy an agent that can write code, when code is easily operationalizable?
What do we still need to solve to make it able to do those things? Yeah.

Vincent:

Two reactions. One, and this is something I also talked to Alex about in the last episode, on Terminal-Bench and Harbor: I think the bet on the CLI, the bet on code in general as a universal interface for agents, was an awesome one. Because to your point, it underlies a lot of the affordances that agents are using to interact with the world and actually move it. Two, I 100% agree with this higher-level notion that we need to pull domain experts into the benchmark-building process. A lot of what we as scientists or engineers or researchers have contact with are problems within our domains or adjacent ones. But there's a whole universe of use cases and workflows, and ways that we could be accelerating and taking burden off of people's work, that I think needs to be represented in the broader family of benchmarks. So lots of agreement there. I think the biomedical space is a super interesting one.

John:

Yeah, with you 100%.

Vincent:

Five years out, does a benchmark still look like a benchmark?

John:

If we asked this question around when SWE-bench was created, I think the answer would be, not that different, from like 2018 to 2023. But now, I don't think so. And I think a lot of it has to do with the thoughts we discussed earlier, which is that we're entering this realm of benchmarks and grading models on things that are borderline, if not absolutely, not possible for people to do. And I feel like a lot of the benchmarking that has shaped the language model space has been silos of what people can do. We did GPQA, we had question answering, we had entailment, we had, when the T5 paper came out, uniting all those things. We had machine translation, blah, blah, blah. These are different cuts of what people can do with language. And so fundamentally, having tasks that are inspired by, well, I can do this, but can the machine do this? I think it was a very, very good source of task instances. And we calibrated which benchmarks to pay attention to along the way, based on where models were in terms of their performance. So I think SWE-bench in 2018 probably wouldn't have been that meaningful, but SWE-bench in 2023 was fairly timely. Going forward, I am excited to see how it evolves. I just think the inspiration of, can a human do this, is gonna be less and less the actual source of it, in my opinion. And more so along the lines of how do we engage with it, how do we work with it, how do we interact with it, being more of the core premise than just hill climbing what a person can do.
And in terms of the form factor, I feel like this idea of having environments, whether they're digital or maybe they become real-world environments in the coming years, I think that sentiment will stay around. I think the verification is where I'm more curious how it evolves, because at least in the regime of code, verification has evolved so much in the past three, four years. And I don't expect that trend to stop at all.

Vincent:

Last question for you. How can people leverage ProgramBench? Yeah, definitely.

John:

programbench.com, the URL's there. It was a little pricey, but we got it. So we're gonna set up a leaderboard very soon. We're purposely encouraging people to just pick at the low-hanging fruit. Like, hey, we use mini-SWE-agent. You use your own scaffold if you think it's better. Single-agent versus multi-agent scaffold, give that a go. As with SWE-bench, at some point I would be really excited to see people train their own models on this and see, hey, how far can we get with a 32B model? Given that this task is longer and bigger, how does that affect how we train on code tasks when they get longer and wider and bigger, with larger solution spaces? So anyway, that's for setting up the leaderboard. I would also love to set up a way for people to add tasks, because I think ProgramBench as a formulation is meant to be very generalizable. We want to impose as few constraints as possible on the word "program." We just do executables, but there's no reason someone couldn't take a look at a Mac application or an iOS application or a website or anything that constitutes rendered software. And so those would be, I think, the two calls to action. Kilian and I are looking at the GitHub every day for issues, looking at the website traffic. So we're very much looking forward to hopefully growing this and having people be just as excited as we are about why this benchmark is meaningful.

Vincent:

Well, I'm excited as well. I hope folks can contribute to the leaderboard and add tasks. And yeah, thanks again so much, John. This was an awesome chat.

John:

Yeah, this was fantastic. Thanks so much.