Computers, Coffee, and Beer

Truth About LLM 'Intelligence' -- Two Engineers Who Built the Internet Debate AI Intelligence

Keith Adams & Julien Verlaguet Season 1 Episode 1


Keith Adams (VMware, Facebook, Slack, Pebblebed) and Julien Verlaguet (creator of the Hack programming language, founder of Skip Labs) dig into one of the most contested questions in tech: are LLMs actually thinking?

From Keith's early (failed) experiments training a tiny stack VM at Facebook AI Research, to Ilya Sutskever's prophetic 2014 NIPS talk, to the present-day race between Anthropic, OpenAI, and the open-source world, this episode covers the full arc of how we got here — and where programming languages, coding agents, and AI security are headed next.

Topics include:

  • The "glass of beer" theory of LLMs and why Keith was wrong about Ilya
  • Why LLMs crush coding but struggle with the physical world
  • TypeScript vs. Python — which language will AI-native development run on?
  • Julien's framework for getting dramatically better results from coding agents (hint: it's about state)
  • The Anthropic Mythos announcement: real capability or marketing stunt?
  • China, open-source models, and the 6-month moat problem
  • Whether there's an escape velocity data flywheel in AI coding — and who has it


IN THIS EPISODE:
0:00 — Are LLMs "thinking" or just pattern-matching?
5:54 — Why coding is the perfect test bed for AI intelligence
10:50 — The glass of beer analogy: simulating the world one token at a time
18:10 — LLMs consume 10,000x more data than a human lifetime — does it matter?
21:30 — Should we design programming languages for alien minds?
41:56 — The "never call this method" incident (AI's literal brain)
1:01:47 — Why LLMs will obliterate humans at finding security vulnerabilities
1:08:23 — The six-month moat: why no AI advantage lasts
1:15:47 — Geopolitics, open source, and who controls AGI
1:31:32 — Elon Musk, data centers in space, and printing out lines of code

If you're a software engineer wondering whether AI is coming for your job — or just curious about what happens when two systems-level thinkers who've been in the trenches since VMware's early days get real about the future — this is your podcast.

🎙️ New episodes every week. Subscribe so you don't miss the next round.

From the Computers, Coffee & Beer podcast: Keith Adams and Julien Verlaguet, two engineers who helped build the modern internet, share their honest takes on the biggest ideas in tech.


🎧 Full episodes on YouTube + your favorite podcast app.
@ComputersCoffeeandBeer  

📩 For Business Inquiries: don.broida@gmail.com 


=============================

Channel About: 

Computers, Coffee, & Beer is hosted by Keith Adams and Julien Verlaguet — two engineers who helped shape the modern internet and have the war stories to prove it. 

Keith built the HHVM JIT compiler at Facebook, helped virtualize the world at VMware (employee ~80), and went on to Slack before founding Pebblebed, an early-stage VC firm for technically ambitious founders. 

Julien created the Hack programming language, spent years writing safety-critical compiler software for Airbus and nuclear plants, and now runs Skip Labs. 

Together they dig into the systems, languages, and decisions behind the technology that actually runs the world — told by two people who were in the room when it happened.

=============================
#ai #llm #artificialintelligence #softwareengineering  #techpodcast  #programminglanguages  #machinelearning  #aivshuman  #futureofcoding  #podcast



SPEAKER_00

I used to be able to make this joke when I was young: I got into computers for the women and the money. Because there were no fucking women and no fucking money. But now there sort of are some women. And also there's real money.

SPEAKER_02

Yeah. Now there's money, yeah. So I don't know about the women part, but there's money.

SPEAKER_00

Uh, so what's the name of the damn thing again? It's Coffee, Computers and Beer, right?

SPEAKER_02

Yes.

SPEAKER_00

Killer. Um, can I give it a shot here? All right. Welcome again to Coffee, Computers and Beer. I'm Keith Adams, here with my good friend Julien Verlaguet. Rocking the coffee, rocking the beer time zones. Today we're going to talk about these newfangled machines that think and code, and kind of what they're doing to our industry.

SPEAKER_03

Actually, you said thinking. That might be the first thing to talk about, right? Are LLMs thinking? Because I hear a lot of back and forth on this. There are a lot of smart people who say they're just applying patterns; they just happen to have ingested so much data that it looks like thinking. And others who say, no, no, that's genuine intelligence, and anybody who says otherwise doesn't realize that us humans are exactly the same, right?

SPEAKER_00

Like, we use the data that we... So what's your take on that? Intelligence. Not a technical term, right? As computer scientists, we don't have some special meaning of intelligence that enables us to say, oh, this system possesses more intelligence than that system. There's no grand unified truth. The field where intelligence is maybe a technical term would be psychology. If you're into psychometrics, they have a test that they administer. It's called an IQ test. Scoring higher on it means you're more intelligent, by definition. If you're curious, if you give this thing to cutting-edge LLMs these days, they top out around 140 or so, depending on whether it's textual enough, whether it's one of the ones that is more language oriented. Which is like a smart person, right? It's one of the smart people on your team. It's not a world-beating super genius or anything. So whether they think, or whether they're intelligent, is kind of a vibes-based question, in the sense that I don't think I have special insight into this. But that said, even if thinking is not literally what they're doing, it's interesting to me that it's at least not a misleading metaphor for what the computer is doing. And that's the first time that's been true. As a normal person trying to think about your database query and why it takes so long, if you're using the idea of thinking to try to understand why Postgres is running your query slowly, you are confused, and you are going to spend a lot of time debugging in a way that's not going to work, because you need to understand its query plan and other mechanistic stuff that has nothing to do with intuitions around thought. Whereas if you're trying to understand why your agent isn't able to reliably parse your calendar and figure out who you're about to meet or whatever, thinking analogies are actually gonna be sort of helpful. 
It might be that it is forgetting, it might be that it is getting confused, it might be that it has been given an ambiguous instruction you need to clarify. Cognitive metaphors are actually instructive and helpful there. So I think it's a useful metaphor, both operationally and intuitively, for what the darn things are doing. So is that thinking? Yeah, it's a kind of thinking, probably. It makes intellectual progress on intellectual problems that are ill-posed, the same way people do. What do you think?

SPEAKER_03

Yeah, I think they're not human. And we shouldn't forget that. I guess that's where most people get confused: they believe that if these things don't behave exactly like a human in every single circumstance, then they cannot be considered thinking things. But for me what matters is, can you tell the difference? I give you a human being, or I make you have a conversation with an AI, and if after one hour you're not able to tell me the difference, then why does it matter what the definition of AI is, as long as you're experiencing the same thing, right? It's like in maths, where there's this definition of equality: if, for every property P, P is true of one object exactly when it is true of the other, then you can say that the two objects are equal. That's the line of reasoning I'm applying here. I don't want to go and look inside at how the LLM thinks, because if you do that, it looks like nobody can really understand what's going on in there, unless it's simple things. So the best way to approach the problem is to ask: well, can I tell the difference? There is one place where you can tell the difference, and that's when you ask the LLM to think about the world. That's something that Yann LeCun, whom we both know pretty well, has been talking about a lot, and he actually went off and built a startup around it, to build a system I believe is called JEPA.

SPEAKER_00

Correct. Yeah.

SPEAKER_03

And the idea is that the missing link for LLMs is that they don't understand the world. So if you ask them to reason about the world, they're not going to be able to. Because to them the world is just words, right?

SPEAKER_04

Yeah, yeah.

SPEAKER_03

However, I don't think that makes LLMs uninteresting until they can understand the world, especially in the world of coding. You don't need that understanding of the world to be able to code well. That's basically what I'm saying. And I think that's why we see LLMs having a ton of success in the coding world. Because in the world of code, it's all self-contained. It doesn't need to know about the outside world.

SPEAKER_00

Yeah, yeah.

SPEAKER_03

And so the LLM has all the data it needs to conceptualize what the world should be.

SPEAKER_00

Yeah, absolutely. And if I can tell a dumb story about something, there'll be lots of stories like this along the way, of things I tried that didn't work, by the way. So, a dumb story from the early days of Facebook AI Research, which, you know, I started with Yann about 13 years ago now. I was part of the generation of people that got excited about AI because of what was going on in the ImageNet competition back in 2012. There was this model called AlexNet that kind of wiped the floor with all these specialized computer vision things. And it was built by people who were neural net people. They weren't really computer vision people. It was this fascinating thing, and it was really clear there was an engineering paradigm there. You could pour more data and more parameters at it, and it would just work better. And you could do that on hardware that you bought at Fry's, which we literally did, right? We went and built a gaming rig, basically, at Fry's, and could show, oh gosh, there's lots of headroom here. You know, I thought that this moment where superhuman coding machines surround us was coming for a long time. But all of natural language processing was completely unsolved at the time, right? If you asked a computer to translate a sentence from Chinese to English, it couldn't do it. If you asked it to summarize a paragraph, it couldn't do it, and so on. So saying we're gonna put all of natural language processing in the way of solving coding seemed crazy to me, because coding's in this formal land, right? There's this world of little abstract machines we can fully describe. And real programming languages were too high entropy, too, right? 
Real programming languages have all these characters in them, and curly braces, and blah, blah, blah. So I thought, what we'll do is we'll just make a dumb little virtual machine. We'll make a little stack VM, and then we'll beam search and do old-fashioned AI crap to try and make a model that can write increasingly complicated little programs for this thing. And we'll start off by training on some curriculum of: make me a function that adds two numbers, make me a function that subtracts a number, and so on, right? And this doesn't work. This just straight up didn't converge, right? In hindsight, there were no tools available to us at the time that were gonna make this work.

SPEAKER_03

So wait, wait, one second. How did the AI understand the instructions? You say "make a function", so there must have been some natural language processing in there, right?

SPEAKER_00

Oh no, it had an output space. So instead of having a softmax across 65,000 tokens, it had a softmax across, like, whatever, 10 stack machine instructions. Does that make sense?
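The setup Keith describes here (a tiny stack VM whose "vocabulary" is about ten instructions rather than tens of thousands of tokens) can be sketched roughly as follows. This is only an illustration under stated assumptions: the episode does not list the actual opcodes, so the instruction names, the `run` interpreter, and the `sample_program` helper are all hypothetical, and the logits are supplied by hand rather than produced by a trained model.

```python
import math
import random

# Hypothetical 10-instruction set; the real opcodes are not given in the episode.
INSTRUCTIONS = ["PUSH0", "PUSH1", "ADD", "SUB", "MUL",
                "DUP", "SWAP", "POP", "NOP", "RET"]


def softmax(logits):
    """Turn 10 raw scores into a probability distribution over instructions."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def run(program, args):
    """Execute a program on a stack seeded with args.

    Returns the top of the stack at RET, or None if the program underflows
    the stack or never reaches RET.
    """
    stack = list(args)
    for op in program:
        if op == "PUSH0":
            stack.append(0)
        elif op == "PUSH1":
            stack.append(1)
        elif op in ("ADD", "SUB", "MUL"):
            if len(stack) < 2:
                return None
            b = stack.pop()
            a = stack.pop()
            stack.append(a + b if op == "ADD" else a - b if op == "SUB" else a * b)
        elif op == "DUP":
            if not stack:
                return None
            stack.append(stack[-1])
        elif op == "SWAP":
            if len(stack) < 2:
                return None
            stack[-1], stack[-2] = stack[-2], stack[-1]
        elif op == "POP":
            if not stack:
                return None
            stack.pop()
        elif op == "RET":
            return stack[-1] if stack else None
        # NOP falls through and does nothing.
    return None


def sample_program(logits_per_step, rng):
    """Sample one instruction per step from a softmax over the 10 opcodes,
    stopping as soon as RET is emitted."""
    program = []
    for logits in logits_per_step:
        op = rng.choices(INSTRUCTIONS, weights=softmax(logits), k=1)[0]
        program.append(op)
        if op == "RET":
            break
    return program


def solves_add(program, cases=((2, 3), (0, 0), (7, -4))):
    """Curriculum check for 'make me a function that adds two numbers'."""
    return all(run(program, [a, b]) == a + b for a, b in cases)
```

In this sketch the task is never given in natural language, which is the point Julien is probing: the "instruction" exists only as a pass/fail check such as `solves_add`, and the model's job is to emit a sequence like `["ADD", "RET"]` that passes it.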

SPEAKER_03

I understand, but how did it understand the instructions? When you say make me something that adds one to two, in what form did that come?

SPEAKER_00

Uh, imagine that you have a 10-instruction VM. This thing's output was a softmax over 10. Its output would be like: instruction one, instruction three, instruction seven, instruction eight is return. Was it able to figure out that it should end with return? That kind of thing. This didn't work, right? So don't worry about this too much. But it seemed so much simpler and so much more tractable. The only problem, of course, is: where do you get the curriculum of more and more realistic and richer and richer programs? Well, you don't have one, right? And it's hard to go harvest one from the wild. What would you do? Compile languages down? We ended up solving this problem by round-tripping all the way through all of natural language, right? Round-tripping through: this thing can speak every language in the world, it's read every website in the world, it has read every program on GitHub, it's great at Forth, it's great at Lisp, it's great at, you know, whatever. Extremely shocking to me, and extremely interesting, and extremely crazy. And, I guess by way of admitting how stupid and wrong I've been about this stuff for so long: I was at NIPS 2014, it was still NIPS, it wasn't NeurIPS yet, and Ilya gave his sequence-to-sequence learning talk. Okay, he gave this talk that got the Test of Time Award at NeurIPS 2024. And it was basically LSTMs, right? So kind of spiffied-up RNNs, kind of what we used to do language modeling with. And this paradigm of quote-unquote sequence-to-sequence learning, where he was saying: hey, if you can frame a problem as here's a bunch of input sequences, here's a bunch of output sequences, here's this recipe for how you can generally solve problems that have that shape. 
And then the paper itself basically solved, pretty convincingly, translating sentences between, I forget what the language pairs were, but I think it was Chinese and English, which is hard. And it was one of those things that made NLP people pay attention, because people had been trying to do that for a long time. And so his talk is five slides about this amazing translation system and the model behind it, and then 10 slides that are like: oh, and by the way, this is how AGI is coming, right? By the way, this is how we're solving the whole kit and caboodle. Are you in or are you out? I was shutting down a bar in Toronto with Ilya later that night, basically, right? And I was so pleased with this, I still think this is the cutest little thing I said, which was like: oh, you're just gonna predict the next word, and so you'll be able to think about everything, because the last word in the murder mystery is who did it, and to say what the last word is you'd have to understand the whole story or whatever. Cool. You know, this glass of beer, all the quarks in it are in conversation with all the quarks in the universe, and gravity and blah, blah, blah. So all we need to do to make the world's perfect world simulator is really nail this glass of beer simulator, and then the rest will take care of itself, right? And I swear to God, Julien, I'm not even sure what's wrong with that rhetoric. I'm not even sure what's wrong with my reasoning. I still think that's a pretty good reason language modeling as a project for AGI is a dumb idea. Except he was right and I was wrong, and here we all are, and we live in his future, and it's just a crazy world.

SPEAKER_03

Well, we don't really know if he was right or wrong. What we know is that the approach they had is working. Now, I think there are many things to be said here. True, true. There's nuance, you're right.

SPEAKER_00

You're right.

SPEAKER_03

Yeah. So the first one is falling into the trap of: hey, we have systems that are better at understanding a certain problem, and maybe we shouldn't re-encode everything from scratch, or let the AI reinvent everything from scratch, because we know better. But this is a recurring theme in pretty much everything, right? You had people building expert systems with a bunch of rules. And not just that, pretty much everything. In every single game that had to be beaten, they all started with a rule-based engine, and then eventually they taught the machine, the AI, how the game worked and let the AI rebuild all the rules, and that was always better than anything rule-based that came before. So I think there's an intuition there that makes sense, and everybody fell into this trap, right? The trap of: well, we have systems that know how to parse files. Surely this can help a computer. It doesn't have to relearn from scratch how to parse, right? And counterintuitively, actually, you don't want that. You probably want the AI to have the full pipeline, because if it does, it has a full understanding of what's going on, and it understands things in ways that are different from us. And so oftentimes letting the AI do its thing is actually what you want. The second thing I would add is that your glass of beer analogy is good, but I would say you might not be wrong. Maybe what happened, if I keep on with this analogy, is that they got the glass of beer right. They got it so right that everything in the room looks okay. So when you look around, the physics still works, et cetera.

SPEAKER_00

Yeah, the reflections of the reflections, you know.

SPEAKER_03

And if you start looking really closely, you're like, hmm. In fact, it doesn't have to reinvent all the physics, it just has to give you a convincing image of what the bar you're in looks like, right? So you look around and the physics seems to match. If you look really closely and start asking, you know, quantum physics questions, things are off, but it doesn't matter, right? If you are with your friends having a beer, you're not going to see the difference. Maybe LLMs could be that, right? They could be something that gives the illusion of understanding the world well enough that it's good enough for certain things, basically. And it doesn't mean that it truly understood the nature of the things we are working with. In fact, I'm pretty sure it didn't, because you can run those experiments. Whenever you ask something about the physical world that is clearly nonsensical, the AI doesn't pick it up, right? Yann, I think, had examples where you just describe where you're going. But I think the most common one, which I tried again two weeks ago, was: should I go wash my car on foot? And the AI is gonna say, oh yeah, of course you should.

SPEAKER_01

You know record it after help.

SPEAKER_00

Yeah, saves the environment.

SPEAKER_01

Exactly. And for somebody who, you know, has an IQ of 140, you're like, um, that's a bit disappointing.

SPEAKER_03

It wouldn't pick up the fact that this doesn't make sense. So to summarize, I would say two things. The first one is that this instinct to go back to symbolic computation, because we have computers manipulating symbols really well, is a natural instinct. And I think the jury is still out on whether it's really never going to be useful. For example, in maths, and in fact I know people working this way already today, the AI is trained to produce theorems, but then we have formal verification engines that verify the proofs, and if they are correct, then they're validated, right? So I think a mix of AI and rule-based, symbol-based things is probably going to be the most powerful in terms of reasoning. So that's the first thing. The second thing is that reasoning about the amount of data that is fed into an LLM is very difficult for a human being. It's a little bit like when people talk about different timelines, like when they talk about the dinosaurs and they say, you know, 65 million years ago. Given our relationship to time, I don't think anybody has a good idea of what that is, right? We know it's a number and we can think about it rationally, but when it goes beyond a hundred, a thousand, ten thousand, a hundred thousand is already pretty hard. But millions of years, it's just an amount that we're not used to dealing with. And that's part of the problem with LLMs, right? The scale makes it very difficult for us to reason about what's going on.
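The hybrid Julien mentions in this turn, where a model proposes and a formal engine checks, has a simple general shape: an untrusted generator paired with a small trusted verifier. The sketch below shows only that shape, using integer factorization as a toy stand-in for theorem proving. It does not use a real proof assistant, and `untrusted_proposer`, `verify_factorization`, and `first_verified` are hypothetical names invented for this illustration.

```python
from typing import Callable, Iterable, Optional, Tuple

Candidate = Tuple[int, int]


def verify_factorization(n: int, candidate: Candidate) -> bool:
    """Trusted checker: cheap, deterministic, and independent of the proposer."""
    a, b = candidate
    return a > 1 and b > 1 and a * b == n


def untrusted_proposer(n: int) -> Iterable[Candidate]:
    """Stand-in for a model: emits plausible-looking guesses, many of them wrong."""
    for a in range(2, n):
        yield (a, n // a)  # incorrect whenever a does not divide n


def first_verified(
    n: int,
    proposer: Callable[[int], Iterable[Candidate]],
    verifier: Callable[[int, Candidate], bool],
) -> Optional[Candidate]:
    """Accept a proposal only if the verifier signs off on it."""
    for candidate in proposer(n):
        if verifier(n, candidate):
            return candidate
    return None


print(first_verified(91, untrusted_proposer, verify_factorization))  # (7, 13)
print(first_verified(97, untrusted_proposer, verify_factorization))  # None
```

The design point is that the proposer never has to be right or even honest; correctness rests entirely on the checker, which is the role a formal verification engine would play in the setup described above.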

SPEAKER_00

By the way, I did the Fermi exercise on this yesterday, as it so happens. So I have this at my fingertips right now. Most human beings by their 18th birthday, it kind of depends how you estimate or whatever, but humans in the first 18 years of their life probably hear about 200 million words of speech. About 200 million, right? Which is a big number, right? Like you're saying, that number seems large. It's a bit of a closely guarded secret, but I think Llama 4, on its model card, has said how much training data was used for it. And it is 20 trillion tokens. So that's another big number. You're like, oh yeah, yeah, this big number, that big number. But we're talking about a 10,000 times larger number in the case of Llama 4, right? So that is five orders of magnitude, 10,000 human lives, right, 10,000 birth-to-adulthood lives anyway, more data and more kind of sample inefficiency that the LLMs have going for them. So, you know, encased in our skulls, there is some machine that does lots of other amazing things, and also is 10,000 times more data efficient at learning language than these contraptions we've put together. So to me, that suggests that there's a difference in kind, not just a difference in speeds and feeds, right? That this isn't just a slightly more sped-up version of whatever it is LLMs are doing. So I'm very sympathetic to this person.
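The Fermi estimate in this turn is easy to redo from the two figures Keith quotes. The sketch below takes both numbers at face value and treats one heard word as roughly one training token, which is a simplifying assumption of this example and not something stated in the episode. On those inputs the ratio comes out at 100,000, which agrees with the "five orders of magnitude" phrasing and is ten times the "10,000 times" figure used elsewhere in the conversation.

```python
import math

# Both figures are as quoted in the episode; neither is independently verified here.
WORDS_HEARD_BY_18 = 200_000_000               # ~200 million words of speech
LLAMA4_TRAINING_TOKENS = 20_000_000_000_000   # ~20 trillion tokens

# Assumption for this sketch: one heard word is comparable to one token.
ratio = LLAMA4_TRAINING_TOKENS // WORDS_HEARD_BY_18
orders_of_magnitude = math.log10(ratio)

print(f"ratio: {ratio:,}x")                                # ratio: 100,000x
print(f"orders of magnitude: {orders_of_magnitude:.1f}")   # orders of magnitude: 5.0
```

Either way, the qualitative point made in the conversation stands: the gap is several orders of magnitude, not a small constant factor.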

SPEAKER_03

I mean, that's why we should probably not try to copy humans, and instead have different kinds of AIs doing different kinds of things. Exactly like you don't necessarily copy nature to build better tools, you know. Nobody tried to build a car with legs, thinking, hey, horses have legs, so surely that's a better way of doing things. I think there will be different kinds of AIs doing different kinds of things, and that's okay. And first of all, I'm not sure I fully understand what AGI is, to be honest. Is it something that's better than every human being on Earth at everything? Is that what AGI is?

SPEAKER_00

This comes back to intelligence not being a technical term, right? I think we probably know it when we see it. I think there are practical senses in which the right setup around, you know, Opus 4.7 is a pretty general, pretty intelligent system, but there are all these caveats around that. It doesn't know what it doesn't know, and it's an amnesiac, and it can't learn without an external memory system, and so on.

SPEAKER_03

So, what do you think this implies in terms of programming, programming languages, et cetera? I'm curious to hear your perspective. What does programming look like in five years or ten years?

SPEAKER_00

What a great question. This is actually something I'm curious about your perspective on as a language designer, too. My view of the craft of a language designer until right now has been that you're bridging this world of formal languages, where there's math and there's type theory and formal grammars and whatnot, to some sort of essentially applied psychology, right? Something that's essentially: okay, this is ergonomic for human programmers and groups of human programmers to use to make coherent, maintainable, decent systems with good properties, right? And there are a couple of interesting examples, I think of Haskell as one of them, honestly, and also Rust, of people proving that this notation kind of matters, right? That this notation can channel your energies in ways where it's actually much more likely to produce something safe or correct, versus the first folk notation you might come up with, like BASIC or f or something, right? What I'm wondering is, in a world where, I'm gonna admit, and this is partly because I'm mostly building internal tools for my VC team, right? But I guess we've got projects that are public, and other people use them and stuff. So I think of it as real coding now. And in many cases, I'm not really poring over every character that the model is pumping out. It's good enough for me that the test coverage is there, it's good enough for me that it demonstrably works end to end. And so in that world, who is the audience for the programming language designer? Is the audience these new kind of alien minds that we're talking about, that are very different from our brains? And if so, are there way better languages to make for them than the ones that they were trained on? 
Or are we in an equilibrium here, where just because Python and JavaScript are what these things were trained on, it's going to be Python and JavaScript until the end of time? I'm curious what your thoughts are.

SPEAKER_03

We still live in a world where most of the code is reviewed by humans. And so that psychology that you're talking about is still there, but not as omnipresent as it was before, right? Before, the human was really the center of the programming activity. If the design, implementation, review, all the programming, not just the coding, is done by an LLM, and human beings never have to intervene, never have to understand what's going on in there, then I think it would be the right time to go revisit how these things work. And even if we knew it was the right time to do that, it would probably take time, because of the sheer amount of data that's out there to train LLMs to program in TypeScript and Python. And so I think for the foreseeable future, AIs are gonna be better at TypeScript and Python. And the first ecosystems that will program using AI, and when I say program, I don't just mean coding, I mean the whole lifecycle, are probably going to be centered around, if I had to guess, I would say TypeScript. But the Python people are gonna hate me for it, probably. And my line of reasoning here is that Python is too slow. I would say, if I compare the two, JavaScript is just winning, for two main reasons. The first one is that there is TypeScript, so there is a whole ecosystem with types. Those types are not perfect, they're incomplete, but we'll be able to do much more tooling around that. And the second one is the JIT. Given that JavaScript is JIT-compiled, that means you can iterate a lot faster, and the code is going to be just much, much faster than Python. So I think the first ecosystems won't be able to break free from human psychology, if I can put it this way, just because there will be so much training data and still humans in the loop, and they won't want a system where they cannot understand what's going on inside, right? And that's going to be the case for a while. 
And then maybe one day, let's say for a few years AIs have been very good at maintaining systems all the way through, and no human being ever has to go in there and figure out what's going on, then it will probably be time to go look at how to design languages that are better suited for AIs. And again, I don't know if a human would be the right person to do this design. I think probably AI. You know, an alpha. Yeah, probably an alpha setup.

SPEAKER_00

Like AlphaGo, right?

SPEAKER_03

Like you teach the AI what programming is and then you let it figure out how to rebuild all its own pieces. That would probably be more efficient. But you can also imagine a world.

SPEAKER_00

We can also imagine a world where there's still Python or JavaScript available as a skin on whatever it's actually thinking, or something. Or it could be a presentation layer for a human code reviewer. Whereas what it's actually thinking is, you know, squiggle, squeak, Unicode code point, some other spaced-out thing that we're not well placed to understand. But, um... Yeah, it's true.

SPEAKER_03

That's a bit like how hardware works, right? It exposes an instruction set that really has nothing to do with the real instructions that it's running. So you could think of something like that, right? Where they show us TypeScript, but the way they reason about it has nothing to do with TypeScript, because under the hood it has been translated into something else. And in a way, that's already what's going on, right? You could say that the inner weights are pretty much that. The input gives us something, but then it is translated into vectors, and then all that stuff is interpreted in a completely different way. So one could argue that this is already what's going on. So there's a psychology. What I'm more interested in is what kind of properties those AIs would be interested in in their programming languages. Because the stuff that helps us reason, I don't know if it will always help AI reason. So there's the syntax, and how things look, and how we can reason about those things. And it's pretty clear that that is very important, because if you have the right notations, you can express problems in a way that is much easier to understand. But then there's also the static properties of the language, right? What kind of properties would an AI want? Because, for example, Rust has many more static properties than TypeScript, and yet most AIs don't seem to be as comfortable with Rust as they are with TypeScript.

SPEAKER_00

Right. That's my experience at least. And I think that's probably down to the data, right? That's probably down to Rust sort of being kind of minority language still.

SPEAKER_03

But we don't know that, right? We don't know that we don't know that if it's Could you do an experiment?

SPEAKER_00

Like like what would the experiment that sorts this out be?

SPEAKER_03

Would it be like... It's an expensive experiment, right? You'd have to train, I guess, the same LLM on the same corpus, one corpus in TypeScript, the other corpus in Rust, but the same amount of code, and then compare its coding abilities in one or the other. But I guess it's too expensive of an experiment for anybody to run it seriously.

SPEAKER_00

We can say a few first-principles things here too, by the way, Julien, that I think are a little underappreciated, even by... I think of myself as pretty, you know, coding-LLM-pilled or whatever. Like, I fully accept these as an automation factor that's never going away. It's just too much of a productivity tailwind. It's definitely part of the world. But there's stuff that is part of normal programming that I don't think they do super duper well on yet. And one of those is actually just working on big programs. So even the kind of big context windows that you get now, like a million-token context window, is enough to put, you know, several full-length books into, for instance. It's not enough for big code bases to fit into, right? So, you know, when you were working with Hack, you were supporting a code base that was on the order of 100 million lines of code. Those are lines, not tokens, right? So, you know, let's say a billion tokens, and it's not within sort of foreseeable technology right now that a billion-token context window is going to be tractable. So there's going to be some kind of memory system that supports reasoning about the whole program, or you're going to need to do the thing that humans do, which is try to make the program locally coherent, right? Try to make it so that, okay, I only need to know this thing and the things that it touches to be able to make changes that are very likely to be correct. So, I mean, I think that's one of the properties that I don't think the LLM psychology being different is going to massively change. I think it's still gonna want things to be locally, you know, contained and coherent.

SPEAKER_03

Yeah, that's true. But to keep things locally maintained and coherent, what what does an AI need that is different from a human, right? Like so we we can go back to these eternal debates of you know dynamic programming versus static uh not dynamic programming, sorry, dynamically typed languages versus statically typed languages, right? You'll have the dynamic folk, dynamic pro um uh programming language folks who will tell you, I don't need a type system, I have my unit tests, and you know, I'm going to have to write tests anyway, so why do I need this thing that wastes my time, right? And then you say, okay, but maintaining unit tests is actually more expensive and slower than having a type checker telling you the stuff that you have to fix. And unit tests are not perfect, so they could miss some things that you know the type checker is not going to miss, right? Um, but the same people are going to tell you, yeah, but it's more flexible for other things. My dynamic I like my dynamic language because I can cheat whenever I want to. And you know, different people find different balances. Some people just prefer that flexibility, and some others don't. And for human beings, it I think it it it comes down to personal experience, personality, the kind of code you work on. Exactly.

SPEAKER_00

Yeah, whether you're working on something that outlives you or something that's going to be thrown away next week.

SPEAKER_03

Exactly. So I'm curious of how the AIs are going to, you know, um what kind of what kind of boundaries they'll need in a in a in order to split up the problem into smaller pieces. My experience so far, because I've worked on this for what, a year and a half now, something like that, is that you want basically to split up the program into pieces of states that are independent. So my experience is that the big problem is state management, right? So the the problem with state management is that you need to maintain things in your head, right? Like so, this is another another thing I say often about programming and the difference between programmers and non-programmers. Um the difference between a programmer and a non-programmer, so if I take a non-programmer, I can get them to program things by putting together boxes and arrows, right? If I give them one of those languages that are data flow languages, and I have values that are plugged in into a plus and they understand that it's going in, and etc. etc. Anybody will be able to do that, whether they're a programmer or not. So they are doing a form of programming, but they're not. So does that make them real programmers? I don't think so. I think the test for are you a real programmer is I write you a loop, and that loop is updating things. It's like updating variables and whatnot. Can you play that in your head? Right? That that is what a programmer does, right? A programmer kinds is kind of able to simulate what the memory is doing in their head. And it turns out that this is what's hard about programming. And it's so hard, in fact, that we only want to do it at a very local level. And then we want the overall structure to not work this way at all. I don't want this particular function to go modify some global state at the other at the end of the end of the program because then I cannot reason about it anymore. 
And so my experience with LLMs producing code is the following is if you want good results, you want to split the state into independent pieces and then plug them together as a completely declarative graph of computation. And if you do that, that seems to work really well.

SPEAKER_00

Is that so so different from what you'd how you'd try to decompose something for a team of human programmers?

SPEAKER_03

I think so. I think a team of human programmers would be able to... you know, human programmers do a lot of things when you think about it. Just think about a basic service, right? So you have some request that comes in, right? And already, just at the moment where you listen and you accept the request, well, there are a million races in there, right? You accept this thing, you don't really know what state you're in. Maybe multiple requests happen at the same time. So already your mind is dealing with a ton of stateful stuff and a ton of concurrency issues. And we are so used to dealing with those that of course we know what the right constructions and the right trade-offs are, but that's the one thing where we are simulating it in our head, and we do that relatively well. And I suspect LLMs don't. Now, I might be wrong, but I suspect that's the reason why. So, my experience with an LLM is, when you don't guardrail them enough, you don't force them into a structure where it has to think about things locally... Basically, let's put it this way. You see how React works. React basically is a way of programming where you say, I'm going to show my intent, and I'm never going to worry about updates. Right? I'm just going to say, if something has to change, then it will be another system, a framework, that will take care of it. The only thing that I have to worry about is what shall I see on screen when I'm in this state? Right. And so this has drastically simplified the writing of UIs. Because what was hard about UIs was, well, whenever something changes, you have to go and update the state that you were in by hand, usually in the middle of some callback soup. And of course you're going to get it wrong. And so when you didn't properly update something you were supposed to, you're now in this weird state that nobody understands, right?
And so think of that kind of mental model, but generalize to programs in general, where it's not just for UIs. It's in general, you just want the LLM to be given a good goal, like here are the inputs, here are the outputs, don't worry about the updates, and then split it into smaller pieces, and then let the uh uh another framework take care of the updates.
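As a rough illustration of the model described above (inputs in, outputs out, and a runtime that owns the updates), here is a minimal TypeScript sketch of a spreadsheet-style cell graph. It is not the framework discussed in the episode; `Sheet`, `setInput`, `define`, and `read` are hypothetical names invented for this example, and the invalidation is deliberately naive (it clears every cached value instead of tracking dependencies).

```typescript
// Hypothetical sketch: formulas are pure functions of other cells.
// The author of a formula never writes update code; the runtime recomputes.
type Compute = (read: (name: string) => number) => number;

class Sheet {
  private inputs = new Map<string, number>();
  private formulas = new Map<string, Compute>();
  private cache = new Map<string, number>();

  // Changing an input invalidates derived values (naively: all of them).
  setInput(name: string, value: number): void {
    this.inputs.set(name, value);
    this.cache.clear();
  }

  // A derived cell is declared once, as "this thing is nothing but those values".
  define(name: string, compute: Compute): void {
    this.formulas.set(name, compute);
    this.cache.clear();
  }

  // Reading a cell computes it on demand and memoizes the result.
  read = (name: string): number => {
    const input = this.inputs.get(name);
    if (input !== undefined) return input;
    const cached = this.cache.get(name);
    if (cached !== undefined) return cached;
    const formula = this.formulas.get(name);
    if (formula === undefined) throw new Error(`unknown cell: ${name}`);
    const value = formula(this.read);
    this.cache.set(name, value);
    return value;
  };
}

const sheet = new Sheet();
sheet.setInput("price", 10);
sheet.setInput("quantity", 3);
sheet.define("subtotal", (read) => read("price") * read("quantity"));
sheet.define("tax", (read) => (read("subtotal") * 20) / 100);
sheet.define("total", (read) => read("subtotal") + read("tax"));

const totalBefore = sheet.read("total");
sheet.setInput("quantity", 5); // no manual update of subtotal, tax, or total
const totalAfter = sheet.read("total");
console.log(totalBefore, totalAfter);
```

Each formula only describes how its output follows from its inputs, which is the "don't worry about the updates" part; a production runtime would track dependencies and recompute only what changed.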

SPEAKER_00

And I've known you long enough to sort of skip, as it were, ahead to some of the uh some of the punchlines here, right? But what we're describing is kind of like a spreadsheet or something too, right? You're describing as one of the reasons sort of people that don't think of themselves as programmers have success building pretty complicated, really, applications, I would say, with spreadsheets is because it doesn't present itself as programming, it presents itself as like, well, this thing is nothing but those values, and that thing is nothing but these values. And then there's this very thick runtime that sort of figures out exactly what needs to be recomputed when something changes. Um and so uh you're saying that that you think they I mean out of curiosity, is that do you have like um do you have any sort of case studies of this or anything? Like have you organically tried to get Opus or something to build something with just a big prompt versus building it with with this kind of um reactive scaffolding around it?

SPEAKER_03

Yeah, we have a bunch of benchmarks. So what we did is we have a bunch of scenarios where we try to build it with Claude Code, uh, 4.6. I don't think we've tried 4.7 yet. Codex and whatever, there's a bunch of different systems we are benchmarking against. The thing is, right now our numbers are very good, but most of the time it's because Claude Code definitely gave up too soon. So it's difficult to make a good comparison, because if you just tell Claude, hey, build me this, basically it's going to build a crappy version that doesn't work at all, and it will tell you, oh, it's great, I'm done. You know? And so now you look great and you're like, oh, my system is so much better. But it's true that by tweaking the harness, you could probably get better results. Basically, what you're doing is you're abstracting away everything that's hard, right? In a program, if I have inputs and I ask you to build outputs, especially for an LLM, when you think about it, because the LLM is going to think linearly like that, where I have my inputs, I have thought of intermediate steps, so I'm going to lay them down, and then I just keep on going until I reach the output. This is perfect for an LLM. That's everything an LLM loves, right? And then you abstracted away concurrency, all the kinds of updates, so it doesn't have to think about state ever, cache invalidation. You abstracted away everything that was hard, right? And so of course, if you compare it to a system, even a smarter system, that has to do all these things from scratch, well, yeah, good luck with that, right? So we see really good numbers. We've tried it on hundreds of examples, and so far this approach is very promising. So, excited about releasing it too.

SPEAKER_00

Well, I yeah, I I if you want to, you know, talk some more about it, go for it. If it's sort of still secret and you don't want to steal the thunder, you can also leave this as a teaser.

SPEAKER_03

No, no, it's not secret at all. It will come out very soon, so there's no secret to worry about. And you're gonna love it.

SPEAKER_00

Um if I'm if I'm picking up what you're putting down correctly, you're hinting at a combined kind of like harness and framework that the harness targets, where that combined system makes it a lot easier to get good results.

SPEAKER_03

And a type system. And that's the key too. It's a difficult balance. You need to find the subset of TypeScript that the agent is comfortable with, but at the same time, you want a subset that has good properties for you. And so finding that balance can be tricky, because you will have a tendency to be too strict, but if you're too strict, then the LLM won't find a way to make it work, except if you give the right error messages. So, you know, it's a bit like cooking. You're putting in a little bit of salt, a little bit of this, and you're mixing and...

SPEAKER_00

So are error messages different? Like, I remember error messages were a key thing for acceptance. Do LLMs want different errors?

SPEAKER_03

Yeah, the kind of errors you give to an AI have to be different from those that you give to a human. And sometimes what the AI trips over is really weird. So for example, there was this JavaScript class that has some kind of inner method or inner capability. I don't remember what we were encoding exactly. And I figured, oh, you know what, it would be easy to encode that with a method. So I'm going to add this fake method that does the thing that I want, and I'll be able to, you know, put the declaration for it, and I'm just gonna tell the AI to ignore it. I'm gonna put a comment there that says, don't use this, this is all internal. And the AI just kept on using it. So then I changed the name of this method, and I was thinking of Erling. I don't know if you remember Erling. He had inner things within Facebook that were called do-not-call-me-or-you'll-be-fired, you know, and so I used something like that.

SPEAKER_01

I was like, do not use this under any circumstances, right? Don't never call it.

SPEAKER_00

Or human lives will be lost, death is forever. Yeah.

SPEAKER_01

Yeah, and it still did, it still called the thing. And it's like, but you have a comment that says, never use this, you have this thing that's called never-call-me, and you're still calling it. Like, can you not see, you know?

SPEAKER_03

So yeah, we just removed this way of doing things because it was not working. Another example: you know, human beings don't like noise. So usually when you craft an error message for them, if it starts to be too long, they basically give up and it's just noise to them, and it's not a good error message if there's too much information. That is not true for an AI. Most AIs actually are happy with more information; they'll be able to quickly determine which part they're actually interested in. You'll consume more tokens, so you've got to be careful with that. But other than that, longer error messages with more context are good. And I would also say that for some reason the AI is not very comfortable with variance. I don't know if it's because TypeScript typically doesn't have variance.

SPEAKER_00

Yeah, like let's define variance here.

SPEAKER_03

So imagine I have a collection of A's and I have a collection of B's, and A is a subtype of B, right? And I want to know if collection of A is a subtype of collection of B, given that A is a subtype of B. Well, that's only going to be true if the collection is covariant, meaning it can support the thing that it holds becoming narrower and narrower, or wider and wider, depending on which direction you look, right? And contravariance is the exact opposite.

SPEAKER_00

Covariance and contravariance is like the monad of generic programming or whatever. Like everybody First of all, like it's a filter for whether you're in the fraternity or not. And secondly, like everybody has their own way of explaining it. So like is it like a burrito? Is it like a the spacesuit on an astronaut?

SPEAKER_03

It's like a burrito, exactly. So I I'll go with a classic way of explaining it, and then I'll give you my way of explaining it. So the classic way of explaining it is um whenever you have an object, you have the things that are coming in the object, and you have the things that are coming out of the object. And the idea is whenever something is coming in inside the object, it's always okay to give anything that's coming in something um that is uh more precise.

SPEAKER_00

If you have something that wants a vector of shapes, it's okay to give it a vector of squares.

SPEAKER_03

Exactly. And for stuff that's coming out, it's always okay to consider it as less precise. So the thing is producing a vector of squares, and you're like, yeah, but I want to consider it as a vector of shapes, right? That's fine. And so that's one way of looking at it, right? What's coming in is contravariant, what's going out is covariant. In fact, I believe in C# that's actually how they annotate the generics, with in and out, to say if they are co- or contravariant, but usually plus is the sign for covariant and minus for contravariant. But those are just notations. The way I like to explain it is the following. Imagine you have a function or a method, it doesn't really matter. You have a function and it has annotations on the parameters, and it also has annotations on its return type, right? And I claim that those two annotations are very, very, very different in nature. So if I look at those two annotations, they're both annotations, but they mean very different things. So imagine I have a function that takes an int, a number, as an argument, and produces a number as an output. Those two numbers mean different things. When it's the number for the argument, what I'm saying is, you better give me a number. You better give me a number, because if you don't give me a number, this function will just do horrible things, crash, and we don't know what's gonna happen, right? And for the output, it's different. For the output, it's the function telling you, yeah, yeah, this is a number, you can trust me. You know, trust me, I produced a number, right? And so if I were to describe it as best as I could, I would say the parameter is a constraint. You're telling the rest of the world, I want this to be a number. And the return type is a fact. You're telling the rest of the world, this is a number, trust me. Right?
So, if you think about it in these terms, you never get variance wrong, because then you think, okay, this is a constraint. Since it's a constraint, can I give it something that is even more constrained? Yeah, of course you can. You can constrain more and, you know, that's fine, it will still work as long as the constraints that the function was asking for are there. So if you are a number and something else, like you are an even number, that works. You're a number, great, you know. And the other way around, when it's a fact, somebody tells you, I actually gave you this object B, and I'm telling you it's a B, but it turns out B is also A. You're like, well, let's forget about the fact that it's a B. I just need it to be an A. And if you think about it in these terms, I feel like you never forget, you know, you can always retrieve what co- and contravariance mean.
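The "parameter is a constraint, return type is a fact" reading above can be sketched in TypeScript. This is only an illustration: `Shape`, `Square`, `Handler`, and the function names are hypothetical, and the rejected direction in the comment assumes the `strictFunctionTypes` compiler option (part of `strict`) is enabled.

```typescript
interface Shape { readonly name: string }
interface Square extends Shape { readonly side: number }

// Constraint: "you'd better give me a Shape". A Square is a Shape and more,
// so passing a more constrained value always satisfies the constraint.
function describe(shape: Shape): string {
  return `a shape called ${shape.name}`;
}

// Fact: "trust me, this is a Square". The caller may forget part of the fact
// and treat the result as a plain Shape.
function makeSquare(side: number): Square {
  return { name: "square", side };
}

const description = describe(makeSquare(2)); // more constrained input: fine
const forgotten: Shape = makeSquare(3);      // less precise view of a fact: fine

// The same rule applied to whole functions: a slot that will be called with
// Squares and only relies on getting a Shape back...
type Handler = (input: Square) => Shape;

// ...can be filled by a function that asks for less (any Shape)
// and promises more (a Square).
const renameToUnitSquare = (input: Shape): Square => ({ name: input.name, side: 1 });
const handler: Handler = renameToUnitSquare;

// The opposite direction asks for more than callers are guaranteed to supply,
// and is rejected under strictFunctionTypes:
// const tooDemanding: (input: Shape) => Square = (input: Square) => input;

const produced = handler({ name: "big square", side: 4 });
console.log(description, forgotten.name, produced.name);
```

Parameters may be widened and results narrowed, which is exactly "contravariant in, covariant out" without having to remember which word is which.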

SPEAKER_00

And so... And for those listening at home, by the way, I just want to say, I personally need to go look up which one is co and which is contra every time I'm trying to have this discussion. The world makes sense, it turns out. There is this sort of strange thing about containment and subtypes. But what you want to have be true is usually what your type system, in a language that makes any sense and that people actually use, also demands. And I'm curious, you know, people make mistakes with this too, I guess, is one of the things that you've observed, right? One of the reasons we end up talking about this stuff is because people make these mistakes where they, you know, promise you a vector of squares but are actually returning a vector of shapes, and they kind of scratch their heads about it, right? Are you finding that... like, you're saying LLMs have problems too? Is it the same kinds of problems that humans have, or is it a weird kind of alien set of problems?

SPEAKER_03

I don't know how much of it is because the LLM has been trained on TypeScript, and TypeScript does not enforce any variance whatsoever. So in TypeScript, they have this unsound-by-design approach where they consider everything to be bivariant. So as long as there's a typing relationship between objects, they will accept it. So for example, you have a method, you want to redefine that method with something that should not be accepted, because, let's say, the return type is less precise or something like that. TypeScript just lets it flow. It just says, okay, yeah, and just runs the thing. And so I don't know, because I haven't tested an LLM enough against a language like Java to be able to answer that question: is it just that LLMs are struggling with variance in general, or are they struggling just with TypeScript? So this I don't have the answer to, but I do know that they're struggling with TypeScript. And yeah, the way we fixed it was with a lot more hand-holding, right? So a typical case in TypeScript is that you expect a lot of covariance naturally when in fact it's completely incorrect. So for example, you are writing a program in TypeScript, and let's say you said that you return an object that has some number of fields, right? And then you want to pass something that has more fields, or has fields that are more precise, or something like that. And you return it and you expect that there's a subtyping relationship there, right? Except that this subtyping relationship only exists if the fields are read-only. Because if the fields are writable, then that relationship does not exist, right? I could take my object, then hold on to a reference of the more precise field, then store that somewhere, then upcast it, and then, you know, bad things will happen. Now, this is very subtle, and that's something where the AI needed help.
Whereas like typically the error message says, if you made this field read only, the problem would go away. You know, and things things of the sort. So that is not too different from what you would see with a human being. Uh and and I'll fully admit that we're still at the very beginning of exploring all of that, but it it is fascinating to see, you know, how amazingly good AIs can be, and then how amazingly stupid they sometimes are.
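The writable-field problem described above can be reproduced in a few lines of TypeScript. The type names here (`ShapeRecord`, `SquareRecord`, `ReadonlyShapeRecord`) are made up for illustration; the behavior shown, that TypeScript accepts the upcast even when the field is writable, is standard for object property types.

```typescript
interface Shape { name: string }
interface Square extends Shape { side: number }

interface ShapeRecord { item: Shape }                  // writable field
interface SquareRecord { item: Square }                // same field, more precise type
interface ReadonlyShapeRecord { readonly item: Shape } // read-only view

const squares: SquareRecord = { item: { name: "square", side: 2 } };

// TypeScript accepts this upcast even though `item` is writable.
const asShapes: ShapeRecord = squares;

// Perfectly legal for a ShapeRecord, but it silently breaks `squares`,
// which still claims to hold a Square.
asShapes.item = { name: "circle" };

// Typed as number, yet the stored object no longer has a `side` property.
const observedSide: number = squares.item.side;

// With a read-only field the upcast is sound, because the offending write
// cannot be expressed through this view:
const safeView: ReadonlyShapeRecord = squares;
// safeView.item = { name: "circle" }; // compile error: `item` is read-only

console.log(typeof observedSide, safeView.item.name);
```

This is the situation where an error message along the lines of "if you made this field read-only, the problem would go away" points directly at the fix.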

SPEAKER_01

Like, I've seen them like, you know, this this story with a method. Like, this method is called never call this method. It's a mistake, and you're still calling it. Like, how else do you want me to put this?

SPEAKER_03

So yeah, so far it has been uh a lot of things.

SPEAKER_00

The thing with like uh with the identifier names, like uh it just sort of treating them as little tokens and moving them around or whatever, it's actually an interesting thing. Like, I think one of the reasons one of the reasons uh LLMs kind of had to go all the way through natural language to get kind of good at programming and practice is like the names of identifiers is actually a lot of how we navigate and and a lot of how we handle things as humans. And I remember like when um when Copilot was relatively new, when Copilot was young, I was playing around with uh code base that that was like a unfinished implementation of a database. It was like a weird database implementation strategy. And I wanted to uh and it didn't have joins yet. And so I wanted to like you know make an outer join node or whatever. It's kind of like, you know, class outer join inherits from like expression node or whatever exhibits from and um you know, left curly brace. And I was kind of surprised that Copilot was like happy to kind of fill in the body of that because it was like, oh, it's an outer join. Okay, cool. I understand how this database organizes things. Let me start, let me do the nested loop or whatever, and I'll just do the naive, you know, explode the two things, join. Um so was it the quadratic version or like use a strategy or anything? Okay. But it was still like really impressive to me. Do you know what I mean? I was still sort of basically expecting autocomplete, right? I was still kind of expecting something that was like, okay, I can do an expression at a time or a statement at a time or something. And this is one of the first sort of full method bodies I'd seen it just plop out. And it definitely was like the only clue it had was the name. Like the only clue it had was like that this was my intent. 
Um but it's interesting that it's like not able to self-review the, you know, don't call me, oh, you'll be fired, people will die, death is forever, I'm serious, you know, like the one of the odd things about the the implementation of an LLM is that the actual neural LLM like naturally outputs a distribution, right? It doesn't tell you which thing to select. You actually have to make some sort of sampling strategy. And if you um if you roll one of these things by hand, like just build it all by yourself, and then just do greedy sampling, right? Like say you you always pick the most common token, for instance. That doesn't work, it turns out. Like you just get you get these crappy things that say things like, I don't know, I don't know, I don't know, I don't know, I don't know, for instance. And so the strategy for sampling of these things ends up having all these heuristics inside of it. Um, like one of the ones that works a little bit better is like actually good old-fashioned beam search, right? Beam search from you know, a heuristic search algorithm from the the 70s or whenever the good old-fashioned AI people came up with it, where you've got you know some set of candidates and you've got some scoring function for which candidates are plausible, and you kind of move each of them forward one step and say, okay, this beam looks better, and you cleave this one and expand this one or whatever. Um I remember in the early days of LLMs, like you'd have to uh Gemini did this in particular, like the early versions of Gemini. If you asked it to just output JSON, it would be like, Okie doke, here's some JSON. You know, and you're like, and you could be like, no, seriously, I need just the JSON. I don't want to strip leading text. You need to start with curly brace. JSON, please. It'd be like, I understand. You just want the JSON. Here it is. And the curly brace. 
And and and that strategy of like actually threatening that sort of like cities will burn and lives are on the line, and I'm not kidding here, that would finally get it to like just be like left curly brace. But uh anyway, the the thing about that sampling strategy is like if you're building a formal language, like let's say you really want JSON output, right? We got grammars for JSON. One of the things a grammar can tell you is like which characters are possible here. Right? It's a you know, it doesn't even make sense to have a left curly brace here because we haven't opened a curly brace yet. Um as far as I know, I don't know of a lot of people that have tried to build a hybrid system like that where they're like, hey.
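To make the grammar idea concrete, here is a toy TypeScript sketch of greedy decoding with and without a grammar mask. Everything in it is an assumption made for illustration: the six-token vocabulary, the fixed `SCORES` table standing in for a model's output distribution, and the tiny grammar that only accepts `{}` or `{"ok":true}`. Real systems apply the same masking step to the logits of an actual tokenizer vocabulary.

```typescript
type Token = string;

const VOCAB: Token[] = ["Sure!", "{", "}", '"ok"', ":", "true"];

// Hypothetical stand-in for a model: fixed scores that favour chatty preamble.
// Because the scores ignore the prefix, unconstrained greedy decoding repeats
// the same token forever.
const SCORES = new Map<Token, number>([
  ["Sure!", 5], ['"ok"', 4], ["true", 3], [":", 2], ["{", 1.5], ["}", 1],
]);
const score = (token: Token): number => SCORES.get(token) ?? 0;

// Toy grammar as a state machine keyed by the previous token.
// A missing entry means the document is complete.
const GRAMMAR = new Map<string, Token[]>([
  ["START", ["{"]],
  ["{", ['"ok"', "}"]],
  ['"ok"', [":"]],
  [":", ["true"]],
  ["true", ["}"]],
]);

function allowedNext(prefix: Token[]): Token[] {
  const last: string = prefix[prefix.length - 1] ?? "START";
  return GRAMMAR.get(last) ?? [];
}

// Greedy decoding: always take the highest-scoring candidate. When
// `constrained` is true, candidates the grammar forbids are masked out first.
function decode(constrained: boolean, maxSteps: number): Token[] {
  const output: Token[] = [];
  for (let step = 0; step < maxSteps; step++) {
    const candidates = constrained ? allowedNext(output) : VOCAB;
    if (candidates.length === 0) break;
    const best = candidates.reduce((a, b) => (score(b) > score(a) ? b : a));
    output.push(best);
  }
  return output;
}

const unconstrained = decode(false, 3);
const constrainedText = decode(true, 10).join("");
console.log(unconstrained.join(" "), "|", constrainedText);
```

With the mask in place, the first token can only be a left curly brace, no matter how much the scores prefer a friendly preamble, so no threats are needed to get parseable output.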

SPEAKER_03

No, no, they are. They are there are even some people who are building type checking on the fly. So as the tokens are coming out.

SPEAKER_00

So you have a little whole incremental semantic system or whatever that's doing the whole formal language thing. Uh who's doing that out of curiosity? Do you know? Is it just you guys?

SPEAKER_03

No, that's not us. But I can find the paper for you. I'm really bad with names. But it's interesting with the threats, because I know a friend of mine who's doing that with ChatGPT. So she's not using it for coding whatsoever. She uses it to do her daily stuff. And she tells me, whenever ChatGPT is not doing what I ask or is being sloppy about something, I say, be careful, because I'll go to Claude.

SPEAKER_01

So she's just threatening the AI. Like, I'm going to competition if you don't take this seriously. And I'm like, is this working? I'm like, I don't know, but I try.

SPEAKER_00

Well, and maybe that's an interesting place for a thing we haven't touched on at all here, too, about the economics of building these things, which, you know, they're very expensive artifacts to build. And sitting here talking to you in, you know, April 2026, there are basically two coding models that are clearly at the frontier, and they're Codex and Claude, right? And, you know, Opus 4.7 specifically and GPT-5.4 specifically.

SPEAKER_03

Did you try the la the latest model from Meta? I I haven't tried it yet.

SPEAKER_00

My impression is that Musepark is, like, good, and a cut above Llama 4 and stuff, but that it's not... you know, correct me if I'm wrong, but I think it's in that open source model kind of category, which is increasingly impressive, to be clear.

SPEAKER_03

It's increasing, but yeah, it's really good. Like, uh, Qwen, Qwen 3.5 is good.

SPEAKER_00

All of those coding models, like all the models that are sort of vaguely competitive at coding, aren't just playing predict-the-next-token anymore, right? They're all RL'd, right? After being pre-trained on a bunch of predict-the-next-token, they are the winners of some complicated tournament that your model builder ran, and that tournament was, you know, scored on doing these coding problems well, right? And there's interesting tricks for how they augment the set of problems and how they, you know... there's a kind of universe of not very public information out there.

SPEAKER_03

Which is still how a bunch of stuff works, right? Like a chess engine pretty much still works this way. Yeah.

SPEAKER_00

Well, and by the way, for what it's worth, the bleeding-edge chess things are all like this now too. They all learn through self-play, they're neural, it's not tree search running a heuristic after seven plies deep or whatever anymore. I mention this because I perceive there to be a big gap between what I trust Opus 4.7 with in coding and what I trust even, say, the Qwen models with. And there's a possibility that there's a data flywheel running that keeps that gap big, because, you know, I and lots of other corporations are sitting there sort of trusting the best, scariest work to these models and then giving them feedback, right? Then sort of telling them, no, not like that, you need to do this other thing. And if it's not me giving them feedback, it's the compiler giving them feedback or the type checker giving them feedback. And one of the things I'm wondering is whether there's a kind of escape velocity that's been reached there, where, you know, once you have one of the best coding models, you also start to get a data moat that makes your model better and better. And this isn't even a self-improvement loop argument, or a sci-fi scenario where they paperclip us all and, you know, start building nuclear reactors. This is just sort of regular old coding benchmarks. And the question is, is there an actual monopoly here? Or seemingly a duopoly so far, but does it become almost impossible for a new entrant to enter if you don't have a million people prompting their agents to make your model generate better code?

SPEAKER_03

I don't know. I fear the other way around. I fear that the open source models are very close. And I mean, it's true that Mythos seems to be doing really well on detecting vulnerabilities, but that's a very different class of problems. And I'm not very surprised that an LLM would do well at that. So I was on the security team at Facebook for two years. I wasn't a security expert, I was building analyses to find security holes, but I was in a team with a bunch of security experts and I would see how they operate. And the way they operate is they have a lot of patterns in their minds, of attacks, of known attacks, and they read a lot of code, and usually the stuff that they're looking for is a variant of an attack that they already know. Right. Now they can, if they sit down and find something interesting, try to craft a completely new kind of attack. This happens, but it is relatively rare. What is more common is: this is a variant of this known attack, etc. Like in chess, right? You have variants of things that you already know. And so an LLM is at an incredible advantage there. It has an exhaustive knowledge of all the attacks that were ever done, it is very good at recognizing patterns, and it can read a ton of code much, much faster than any human being would. So it seems to me that an LLM is perfectly positioned to completely obliterate a human being at finding security holes. So I'm not surprised that it's doing how it's doing. However, if you look at the evolution in terms of coding skills, it's getting better, but given the amount of hardware that was thrown at the problem, we seem to be getting diminishing returns, right? Like where you throw 10x the amount of hardware, although I don't know what the exact number is. I don't work for those companies.
But it can be derived from the price of the token.

SPEAKER_00

Yeah, yeah. H100s cost a certain amount per hour and so on. Yeah, yeah.

SPEAKER_03

Exactly. It seems that they're getting marginally better on the coding tasks. And so my worry is that they're going to hit a wall, and when they hit that wall, if they hit that wall, because I don't have a crystal ball, and I tend to be wrong on this AI stuff, so don't listen. So yeah, I should have started with: don't listen to me. I'm always wrong at predicting these kinds of things. So my opinion doesn't matter. But I'd say one risk is they're gonna hit a wall, and then the open source models are really not far behind, right? I think they're what, six months behind now? We have... which one is it?

SPEAKER_00

Oh, uh, Kimi. Kimi. And they are an example of a company that does have a bunch of usage data too, right? So Cursor has people banging on the keyboards in their product, and they are one of the few that is not a model provider. Yeah. Providing an IDE or an agent is a different way to get these traces if you are not.

SPEAKER_03

So I'm less worried about having Anthropic or Codex or Gemini... Gemini is actually pretty good. We should have three serious models, because Gemini is a serious one too. I'm less worried about having one of these three escape with so much data that nobody can catch them than I am about them hitting a wall, and then the open source models catching up. That's why I think Anthropic is trying to build all sorts of things based on Claude. They're going after everything, right? They're building a security analyzer, they're building, I think, a competitor for Lovable, and so the idea is to try to get a captive audience that uses their model, right? Because if they stay with the model that they have today, their audience isn't captive. And the reason is Claude Code is not intrusive enough, which is a good thing, and this is also why it was so popular, right? Like you have a code base, you use Claude Code, it modifies your code base. It's great. Everybody wants to adopt that because it's not intrusive, right? But the downside is if somebody builds anything that's better than Claude Code, you're just gonna switch to it. You're not captive at all. You're just gonna go, all right, well, done with Claude Code, thank you very much. And so I think their strategy is to build all sorts of products that use Claude behind the scenes to try to have a captive audience. That makes sense. Yeah, I'm more worried about that. And then I'm worried about what it means for the market, right? Like if those big AI companies, namely OpenAI or Anthropic, hit a wall, meaning they can throw as much money and as much hardware as they want at the problem and they don't seem to be able to build better models, or significantly better models, what happens to their business model? Right. I think Anthropic is probably in better shape than OpenAI, because OpenAI is burning a lot more cash.
And then the open source models are right behind them. So it will be interesting. But maybe I'm off. Maybe they will just keep on getting better and better, and a new breakthrough will come.

SPEAKER_00

Yeah, it's an interesting empirical question. One of the things that I've always been blown away by in the computer industry in general, in the whole 25 years I've been in it, is how quickly innovation diffuses, right? The presence of a patent, or the fact that something's secret, or some business has a huge advantage around a technique: people still talk and switch jobs and allude to things and say things they shouldn't say in interviews and write blog posts that constrain what the solution must be, and blah, blah, blah. And people kind of figure it out. I think there was a moment about a year ago when DeepSeek R1 first came out, and it was the first open source reasoning model that was worth a darn, right? It was the first open source reasoning model that really could compete in any real way with o1. And it was right before, if I remember right, o3 was released. And so there was this kind of overlap moment where you were like, well, do I feel like using o1 or do I feel like using DeepSeek R1? Depends, right? With o1, I give all my tokens to OpenAI. With DeepSeek R1, I have it say strange things about Tiananmen Square if I make the mistake of asking about Tiananmen Square. But yeah, otherwise, pretty good. Have you never done this? Have you never done this experiment of chatting with the Chinese open weights model about why communism failed, or any of the things that you're not allowed to talk about with a Chinese open weight model? You owe it to yourself. There's this very crudely fine-tuned, you know, jackboots-clacking-on-a-parquet-floor answer you get that's like: actually, communism is working amazingly in the People's Republic of China and they've figured everything out about it. Common misconception, communism alive and well. Anyway.
But so when you talked with the people who actually worked on o1 a little bit, and I won't name names or anything, but it's a small city and people hang out, right? If you talked with people that worked on o1 and you're like, hey, how surprised are you that this was only a six-month moat, right? Because chain of thought, there'd been some papers written about it, but it wasn't clear that this was something that was gonna work really well. And it was throwing a ton more compute at the problem, right? Because you're basically just thinking by having the model have an inner monologue, having it sit there and be like, okay, the user wants me to do this, I can't do this because... If you look at the token usage, it's incredible, right? So it wasn't very obvious that this was gonna work well. They built this thing that did work very well empirically, that cooked a bunch of benchmarks when it was first released. It had that space all to itself for quite a while. And then this open weights thing came out that did all those things and more. And I'll say the DeepSeek people actually wrote a great paper about it too. They were very open about all the pre-training, all the mid-training, all the post-training; there are recipes you could follow. And it moved forward what open weight models could do really quickly. And if you ask how surprising that was, most people that actually worked on o1 were like, yeah, of course, people are gonna figure this out. It's the four-minute mile, right? Once you put something out there, there's only so many ways it could be working, and they're gonna figure it out.
And o1 even did stuff to try to prevent distillation, right? Initially it was launched in a way where you couldn't see the thinking tokens, because they're worried about people just training a model on its thought traces. And still, it's just a couple months. So I think broadly, as long as the frontier keeps moving forward at the rate that it's been moving forward, there is going to be this gap. It doesn't seem likely that the gap gets any bigger than six months. And you're right, if there is a hiccup or there's a wall we hit, then things get interesting. There was this little interlude that is sort of forgotten, when GPT-4 was released, where first of all, there was a long time where GPT-4 was alone in the world, where it was just the only thing that was anywhere near as good as it was. And it was at least possible that OpenAI had some secret sauce. It was possible that they really, really knew something that they weren't saying that nobody else knew. And then the Claude 3 family models came out and they were closer and MSH was clearer, and then the open source models caught up and so on. And I think during that interlude, I entertained the idea that maybe this is it, right? Like we've boiled the ocean, right? We've read all the stuff on the web. This paradigm of gigantic data sets is the only thing we know works really well. Scaling laws say the models are big enough now. It's not like you can just build a bigger model, you need more data somehow, and where are you gonna get the data from? And maybe we're just kind of cooked this way. In practice, that seems to not be true from a bunch of different angles. People have been more creative about finding more data than I expected. There are ways to get things that aren't just the public web.
And also there have been these new paradigms, like chain of thought and reinforcement learning and so on, that have been pretty successful and have a lot of legs. So I think the absolute amount of human ingenuity going into moving that frontier forward, and its track record of success so far, has me a little bit in awe right now. I'm grateful to be one of the consumers that benefits from these things getting better so quickly. I think there's a chance that the self-improvement loop gets closed around coding models if you get a big enough data moat around it. And I'm talking to you a day after the news about xAI having this kind of bizarre deal with Cursor to potentially acquire them, to have the option to acquire them later, where basically they're gonna get a bunch of programming traces in the deal, is what's gonna happen. They're gonna get a bunch of usage traces of their coding model. Interpolating a few steps, I think xAI perceives itself as behind in the coding model race. They're correct, they are behind. Nobody is choosing to use Grok to code with that I know of, unless you work for xAI. And they're betting that this data set is basically impossible to innovate your way around. That unless you have this data set, you can't really compete. And those open source models that you're referring to, soon before their release, anyway, Anthropic started making a lot of noise about Chinese model builders doing distillation attacks on their models, or what they term distillation attacks on their models. And honestly, I'm agnostic about whether that counts as an attack or not, right?

SPEAKER_03

No, but I think that's the key thing missing in your narrative. I think you covered everything. The one thing that's missing is that up until now, when a company had a secret sauce, it could use that secret sauce to dominate the market. And then once it was dominating a market, it had enough data and enough presence that it was impossible to beat. So a good example is Google, right? It had a secret sauce, it knew how to index the web efficiently and rank pages well, and it produced much better results than everything that was out there, and much faster. And then eventually it was basically covering everything and everybody was using it all the time for everything, so much that it became a verb, like Googling something, right? And now if you want to go after Google, it's going to be very, very hard, because it's going to be difficult to get the data, it's going to be difficult to even get in front of users, right? Like, everybody is used... The key difference here is: imagine if, at the time Google was built, a very powerful foreign government who will stop at nothing had not wanted an American company leading that space. That would have been very, very different. And here, you know, everybody could probably guess that I'm talking about the Chinese government, but maybe not only. There might be other governments who don't want to see an AI that is almighty in the hands of an American company, and maybe they want their own thing. So the odds that more than one foreign government has spies working for OpenAI and Anthropic...

SPEAKER_00

That's uh literal certainty, right? I'm sure those companies operate with probability one. Yeah.

SPEAKER_03

Yeah, we're pretty close to certain that this is the case. And so now it's a completely different deal. If you have very powerful foreign countries who will stop at nothing to build the same thing as what you're doing, and don't care about the viability of that thing as a product, right? All they want is for it to work. And if it's open source and everybody can use it, so be it. Right? They just want your product to not be in the hands of an American company. That is a difficult thing to compete with. I mean, the only way you have is to keep on innovating, moving forward fast enough that they play catch-up, and never hiccup, and certainly never hit a wall, because otherwise, you know, they're waiting. And from their perspective, if I was a foreign government trying to play catch-up with those AIs, you could say, well, look at those stupid Americans spending all that money on cutting-edge stuff that we'll get for free, because we're gonna make a copy of it for a fraction of the cost three months later. And thank you, Americans, for spending all that money doing all the hard work for us. So I hope it's not how it goes down, and I hope those companies find a way to keep an edge and benefit from all that work they did, because I'd like to think that those who put in the work get the benefit of the work. But it's a tricky situation.

SPEAKER_00

Reasonable people can disagree about how they view Anthropic's positioning of Mythos as too dangerous for general release. We have an AI research intern at PebbleBed named Jenny, too. She was finding some of these same CVEs with publicly available models and a little bit of jailbreaking and clever prompting. We also have a portfolio company called Vidoc. They're a Polish security company, and they did some numbers last week demonstrating how to find these vulns with GPT-5.4 and with previous generation Opus models. I mean, it could be that there are elements of this positioning of Mythos that are kind of marketing, right? There are elements of it that are: hey, we've got the super secret AI, and it's just not economically viable to sell it yet, so they're just highlighting its capabilities. But taking it at face value for a second, it's at least possible that there was a window there where Anthropic, and the subset of people doing vulnerability research using the new methods, using Mythos and finding these bugs, basically had a kind of super weapon, right? Where they had something that would let them disrupt any country's power grid, get into sensitive and national-security-sensitive systems in those countries. There's a way in which they had a capability that threatens the infrastructure of the entire world. So there's an argument that if you take Anthropic at their word, it's a better, more stable universe if they're not the only entity on earth that has this capability. And I'm not sure which one they want us to believe, right? Like if they want me to believe that Anthropic, this PBC in California that has a very particular set of people and is privately held, should be able to decide who lives and who dies in this way.
They seem like lovely people, I'm sure they take that responsibility seriously, but I'd rather have it be the global order, for all of its faults, deciding those things.

SPEAKER_03

So yeah, I don't know. They definitely want you to think of them as the good guys, and that might be because they are, you know. That's one explanation. And another explanation is that it's a marketing stunt. As you said, there are many possible explanations. One of them is there's not enough money to run the model as it is, or it would cost way too much to release Mythos now, especially given that I believe today's tokens are heavily subsidized. And so if the cost goes up significantly, that means you have to burn a lot more cash just to keep things.

SPEAKER_00

Could you imagine a strange world where they could just charge $200 a token or something and say like how about it? If it's really that important to you.

SPEAKER_03

The problem with this stuff is that you will probably lose the data. What you also want is to make your model better, and to make your model better, people have to keep on using it. And so if you make it so expensive that nobody uses it except for niche cases, you lose the data that could have made your model better. Because what you are especially interested in is when your model was trying to do something and it didn't work, and then the user complains and says, well, this is not working, this is not what you're supposed to do. This is gold to you, right? Because now you know where it's not working and how you're gonna make it better.

SPEAKER_00

I mean, actually, that is an interesting economic moment that Anthropic's in too, which is they have one of the frontier models in market and a successor in the wings. And so they're getting all the traces from people using their leading-edge, commercial, publicly available model anyway.

SPEAKER_03

Yeah, but the two might not be the same, right? I think maybe what helps Mythos is not the same as what's helping 4.6, 4.7. We don't know that, right?

SPEAKER_00

One of the things that is described in the Mythos write-up, and that I think I saw a little bit with Jenny as well, is that you can just kind of compute more and you will find more of these bugs. If you look at the CVE density of changes to the Linux kernel so far, one of the things Jenny did was just build a data set that's all 200,000 or so, I don't know, 120,000 commits to Linux over the last 27 years or whatever, and then go into the CVE database and label all the ones that were CVEs. And it's like 8%. Like one check-in in 12, right? And Linux is supposed to be this very elitist, very high standard, you know, lots of eyes make all bugs shallow, right? One of the most inspected open source projects out there. Your project is probably not as high quality as the Linux kernel, nothing personal, whatever your project is. And the sheer defect density, I think, is really surprising to me. The prior probability of you introducing a CVE being one in 12 is a lot higher than I'd have guessed. And so this is all to say there could be a future where the way that you know your software is secure is that you spent more on tokens to pen test it, right? And that's a strange world. I've been assuming that having all these coding agents is gonna mean we're gonna run massively more custom software, because now it's so cheap to produce and so wonderful to produce. So do we know what Mythos is doing differently? The impression, I mean, the public narrative, is just bigger and better. It's just this 10x bigger thing. They did talk about using a lot of test-time compute when finding these attacks. So this thing sat there and really, really churned and probably had some tree of approaches and whatever that it was using.
But all that said, it could be that there's a perverse dynamic here where you actually want to use the most resourced people's software. You want to use software from the biggest org because they're able to make it more secure, at least if you're a security-critical application. Another possible way this game theory breaks out is it gets so cheap to produce Lean 4 proofs of correctness along with your code that we're all fine, right?
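The "8%, one check-in in 12" figure above is just a ratio over labeled commits. A minimal sketch of that arithmetic follows, assuming a hypothetical `Commit` record with a `cve_ids` field; the names and the tiny data set are illustrative placeholders, not Jenny's actual pipeline or real kernel history.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    """Hypothetical record: one commit plus any CVE identifiers linked to it."""
    sha: str
    cve_ids: tuple[str, ...] = ()


def cve_density(commits: list[Commit]) -> float:
    """Fraction of commits that have at least one linked CVE."""
    if not commits:
        return 0.0
    flagged = sum(1 for c in commits if c.cve_ids)
    return flagged / len(commits)


def one_in_n(rate: float) -> float:
    """Express a rate as 'one commit in N'."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return 1.0 / rate


# Synthetic placeholder data, not real kernel history: 2 flagged commits out of 25.
history = [Commit(sha=f"c{i:02d}") for i in range(23)]
history += [
    Commit(sha="c23", cve_ids=("CVE-PLACEHOLDER-1",)),
    Commit(sha="c24", cve_ids=("CVE-PLACEHOLDER-2",)),
]

rate = cve_density(history)
print(f"density = {rate:.0%}, about one commit in {one_in_n(rate):.1f}")
```

A rate of 8% works out to one commit in 12.5, which matches the rounded "one in 12" quoted in the conversation.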

SPEAKER_03

We all just, like, live in a... Yeah, I really don't think that's gonna be the case. But for the kernel, yeah, it's interesting and surprising. But I'm not surprised that AIs, LLMs, are very good at finding security holes, for all the reasons I mentioned earlier. A lot of finding security holes is about finding a pattern that matches an attack that you've seen before and trying to tweak it to make it work in this context. And so if you know all that was ever produced by mankind and you read code really, really fast, well, the fact that you're able to... Sorry, was it able to find attacks without the source code? Do you know?

SPEAKER_00

I think the thing that got people's attention was that these attacks were often multi-step, right? They involved chaining together lots of different vulnerabilities to get to something practically exploitable. So it wasn't just, okay, there was a stack overflow here and maybe it crashes or whatever. It was like, no, it used that to put this thing here, and it used this thing there to corrupt this thing, and it used the fact that this thing was corrupted to make that thing vulnerable, and then you could actually attack it from the outside, kind of thing. And I think that was where a lot of the test-time compute went: it was kind of playing a game of chess, right? It was trying to move the state of the compromise forward multiple steps.

SPEAKER_03

Which is what usually happens in an attack, right? You found a way in, now you want to escalate. In fact, a good thing to do if you are in a company and you want to keep your company safe is to add a ton of alarm bells in case somebody is reading your own security material. When an attacker gains access to a company, the first thing they want to know is how the security of the company works. So they're gonna jump on the wiki of the company and try to read the security briefings and so on. And so logging that is gold, right? If they've been in the company for two days, they just arrived and they're reading the wiki on security, then great. But if it's somebody who's been in the company for a while, say a year, and their job has nothing to do with that, like they are working in marketing and here they are reading about security, mmm, something fishy is going on there. So it was fascinating to me how much security was also about human engineering, trying to understand what people's next move was going to be, and trying to detect them more than prevent them from doing things. So there's a few vicious things that Erling, the guy I was working with, I mentioned him before, a very, very smart guy, Erling Erlingson...
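The alarm-bell idea described here (new hires reading the security wiki is expected, a long-tenured marketer doing it is not) can be sketched as a simple rule over access-log events. Everything below is a hypothetical illustration: the `WikiRead` fields, the `/wiki/security/` path prefix, and the 30-day onboarding window are assumptions, not anything described in the episode.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WikiRead:
    """Hypothetical access-log event for one wiki page view."""
    user: str
    department: str
    tenure_days: int
    page: str


SECURITY_PREFIX = "/wiki/security/"   # assumed location of security material
ONBOARDING_DAYS = 30                  # assumed window in which new hires read everything
EXPECTED_DEPARTMENTS = {"security"}   # assumed teams with a routine reason to read it


def is_suspicious(event: WikiRead) -> bool:
    """Flag security-page reads by established staff from unrelated teams."""
    if not event.page.startswith(SECURITY_PREFIX):
        return False                  # not security material at all
    if event.department in EXPECTED_DEPARTMENTS:
        return False                  # reading it is part of the job
    return event.tenure_days > ONBOARDING_DAYS


events = [
    WikiRead("new_hire", "marketing", 2, "/wiki/security/incident-response"),
    WikiRead("veteran", "marketing", 365, "/wiki/security/incident-response"),
    WikiRead("analyst", "security", 900, "/wiki/security/incident-response"),
    WikiRead("veteran", "marketing", 365, "/wiki/brand/logo-guidelines"),
]

alerts = [e.user for e in events if is_suspicious(e)]
print(alerts)
```

The point of the sketch is detection rather than prevention: nothing is blocked, the read is simply logged and surfaced for a human to look at.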

SPEAKER_00

We'd have said cracked if that was in common parlance at the time. I think of Erling as kind of having his portrait in the dictionary next to cracked, for the modern definition of cracked.

SPEAKER_02

Yeah, yeah. He is uh he he he is something else.

SPEAKER_03

Like when it comes to security, he clearly has. So he had a few funny stories where he left some, I think they're called honeypots, right? And usually a honeypot is you leave something that looks like the attacker was able to get something. So for example, they were able to SSH into something, and they're going to jump on that because now they think they have access to a machine.

SPEAKER_01

And the difference is that his was super sophisticated, where he really made them think for a very long time that they actually hacked Facebook, and then he had all sorts of you know really twisted things.

SPEAKER_03

So, for example, he gave them access to the source code, but he replaced all the E's with a UTF-8 symbol that looks exactly like an E. Right, except that if they tried to run it, it would not run. But they could not actually debug what was going on. The only thing they could do is change the code and try to make it run.

SPEAKER_01

And so he was really enjoying this. He's like, so they're gonna see the source code, they're gonna be like, great, and then they'll try to make changes and make it work. They won't realize that the E's are not E's, and it's not gonna work.
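The look-alike-letter trick in this story can be sketched in a few lines. The episode does not say which character was used or what language the source was in, so this is an assumption-laden illustration: it swaps the Latin "e" for the Cyrillic small letter ie (U+0435) in a snippet of Python, and shows how a scan for non-ASCII characters would expose the substitution.

```python
import unicodedata

LATIN_E = "e"
CYRILLIC_E = "\u0435"  # CYRILLIC SMALL LETTER IE; an assumed stand-in for the look-alike


def plant_homoglyphs(source: str) -> str:
    """Replace every Latin 'e' with a visually similar Cyrillic letter."""
    return source.replace(LATIN_E, CYRILLIC_E)


def find_non_ascii(source: str) -> list[tuple[int, int, str]]:
    """Return (line, column, Unicode name) for each non-ASCII character."""
    hits = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                hits.append((line_no, col, unicodedata.name(ch, "UNKNOWN")))
    return hits


original = "def area(width, height):\n    return width * height\n"
booby_trapped = plant_homoglyphs(original)

compile(original, "<original>", "exec")  # the untouched source parses fine

try:
    compile(booby_trapped, "<trapped>", "exec")
    parses = True
except SyntaxError:
    parses = False  # 'def' and 'return' are no longer keywords

print("trapped source parses:", parses)
print("first hit:", find_non_ascii(booby_trapped)[0])
```

In a terminal or editor the two strings look the same, which is what makes the trap hard to debug by eye; a byte-level or codepoint-level check like `find_non_ascii` is what gives it away.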

SPEAKER_03

And you know, it was full of stories like that. And yeah, so I guess back to the point. I think models like Mythos are going to be useful, but I think, again, we're going to need a ton of human beings who know how this stuff works, who have built these kinds of systems before, and who are able to bulletproof what the different things are. And when it comes to marketing versus reality, I don't know. I suspect a bit of both. If I had a system that was capable like that, I would try to make a marketing stunt out of it, because why not? It's also a company that's trying to get attention, even though it already has a lot of attention. No, what's more interesting to me is where the Metas of the world are going to be. How is Google going to react? Because Google feels a little bit behind now, and I don't believe these two companies are willing to give up. They spent a lot of money on this stuff, catching up. There's also, you mentioned, xAI. So it's going to be interesting to see. I mean, they have a big advantage, and that's that they have a huge money-making machine behind them, right? So the difference with OpenAI and Anthropic is that those need investors to give them money, and they need to burn that money to get ahead.

SPEAKER_00

The company formerly known as... yeah, yeah. But did you work there as Meta? Was it Meta on your badge at any point?

SPEAKER_03

Uh, when was Meta? Was it 2000? I left in 2019. So yeah, I must have worked there as Meta. But it was so ingrained in me that it was Facebook that I had a hard time changing. So I think the difference is Facebook and Google are burning their own cash, and that is pretty much limitless as long as people are watching stuff on the internet and we can serve them ads, which I don't think is going away anytime soon. And yeah, Anthropic and OpenAI are burning the cash of investors, and that's something that needs a return, a significant return on investment, which maybe they can pull off. And then there's, have you seen this stuff where Elon Musk wants to train models in space? So he wants to send data centers to space. That seems like science fiction to me, but the guy has pulled off some pretty amazing stuff, right? Like you look at those rockets going in and out of space, and I guess it's just really impressive.

SPEAKER_01

And yet again, I mean, I I am both very, very impressed with him with some things. He's a bit like an LLM to me, right?

SPEAKER_03

Very, very impressed with him on some things, and on others, I'm like... so he becomes the CEO of Twitter, and then he wants every single engineer to review how many lines of code they've written and what those lines were. And I think he was asking for people to print them. And you're like, well, I've written software for a while, and no, that's not how... Now, don't get me wrong, there is a possibility that Twitter was over-engineered, that there were too many engineers there, that you could actually run it with way fewer. So I'm not saying that every single engineer was necessary for the company to function. That's not what I'm saying. But evaluating that by asking them to show how many lines of code they've written and print them out, that is weird.

SPEAKER_00

Peter Thiel's kind of like this a little bit, right? I think Peter Thiel has this very weird worldview that I will not attempt to summarize here, but that is not the consensus. And so he has a bunch of unusual positions on a bunch of topics, many of which I think are just wrong. But for his path through life, it's enough if one in fifty of them is extremely right. And if he's correct in places you're unlikely to be correct because of this weird worldview, then the weird worldview is kind of serving him. And I've never actually worked for Elon, but my impression talking to folks that have worked with him relatively closely is that if there is one weird trick, it's asking: why is this physically impossible? Right? First principles. Let's go back to every possible first principle. Why can't we do this crazy thing with data centers in space? Well, something about cooling. Okay, what needs to be true to solve this cooling problem? Well, I need some sort of material that can dissipate heat in the vacuum of space and blah, blah, blah. So, okay, great. So then what's missing? Does that material exist yet? What do you need to... and eventually you get to the leaves of this tree, and there are a set of technical problems to solve, and you can hire people with the right expertise and get them excited to solve these problems that nobody else is able to solve. And if the problem's ambitious enough, they'll be excited and things happen. And I say that as somebody that is actually agnostic about whether the space data center thing makes sense or not. But that trick of saying, well, why can't you do this at all?
Um, it's been so it, you know, arguably it's been successful in Neuralink, it's been successful uh, you know, for Tesla, definitely been successful for SpaceX, probably the most dramatically for SpaceX.

SPEAKER_03

Starlink too. I mean, the amount of success is incredible. It's very, very impressive what he's achieved. So yeah, no, but when he was talking about software at the time he acquired Twitter, I was like, no, that's not how you assess the quality of the output of a software engineer. But then again, people have different dimensions, and they can be extremely good at one, and maybe they have some blind spots on others, and that's completely fine.

SPEAKER_00

Okay, well, just to get something on tape here, thank you for joining us for this LLM-focused edition of Coffee Computers and Beer. It's been a pleasure here talking with my friend Julien. And if you like this, remember to smash like and subscribe, and we'll see you next time.