Frontier Systems
Frontier Systems is where the people building the future of AI explain what they're working on. Our flagship show airs live every Friday from 12–1pm PT, with Anjney Midha and Mike Abbott taking real-time questions from the community alongside a weekly special guest - researchers, operators, investors, policymakers, and founders shaping the frontier.
Reiner Pope of MatX - Office Hours, Episode 4
This week, Anjney Midha and Mike Abbott are joined by Reiner Pope, CEO and co-founder of MatX, who spent a decade at Google - chip design for neural nets on the TPU team, then writing the inference stack for PaLM - before leaving at the end of 2022 to bet that frontier-scale workloads deserved a chip designed from scratch around them. Reiner walks through the architecture decisions behind MatX, including why intelligence per picojoule is the eval that matters, how to manage co-design risk when an error on the logic die costs hundreds of billions in CapEx, the trust boundary problem of working with frontier labs whose model architecture is their core IP, and why one well-balanced chip can serve pre-fill, decode, and training rather than splintering into specialized SKUs. He also gets into the parts of the job nobody talks about, such as scaling supply chain from zero to gigawatts, fitting inside NVIDIA's de facto rack standard while Google's vertical integration runs ahead, and where SRAM stops scaling once context windows pass a million tokens.
That was the first time we've tried Vivaldi as an opening intro, but hey, that's one of my favorite songs, and I think that was a great hype-up for our guest today. Thanks for joining us, Reiner. Very happy to be here. Awesome. So we have people streaming in. Welcome, everybody, to our second Office Hours in CS 153. This week we're super lucky to have with us Reiner Pope, who is the CEO of MatX. Reiner is somebody you're gonna be hearing a lot about over the next few years, and so we're super lucky to have him come give us a bit of a crystal ball into where the future is going. Just before we jump in, I'm going to recap what we covered in class yesterday. Let me share my screen. I feel like I'm back in COVID. Here we go. Okay. And Reiner, can you see this visual?
SPEAKER_01: Coming up?
SPEAKER_00: Yes. Okay, great. So this is a stylized visual of the CS 153 Frontier Factory that we all looked at yesterday in class. We've spent the first few weeks, as you all know, talking about the assembly line of intelligence, right? We've got data, compute, and algorithms. That goes into pre-training, and out comes a foundation model, which you then do mid-training on to get a mid-trained base model. Then you can start customizing that model with post-training, SFT and RL, to give that model capabilities in the real world, like being able to take actions and call tools. That system we typically call an agent, and then you rinse and repeat. And we've done field trips over the last few weeks with folks like Robin Rombach, who was the co-creator of Stable Diffusion. Actually, no, we had Andy Blattmann. Robin was last year. Andy Blattmann from Black Forest Labs, co-creator of Stable Diffusion, was here this year. We took a field trip into the Visual Intelligence Factory with him, Unified Intelligence with Amit, who's training video models at Luma, and of course with Mati at ElevenLabs, who's training audio models. Later this quarter, we'll have Liam and Doge from Periodic. But as we've talked about in this class, this is a systems class, right? And of course, one system is the machine learning pipeline that makes these frontier capabilities possible, but there are other systems that need to come together to keep the frontier of progress going for humanity. Yesterday with Scott Nolan from General Matter, we talked about the energy bottleneck, and that's represented by the clean nuclear energy reactor over here, which maybe one day we'll have on campus at Stanford.
And next to it here, we've got my idealized version of what compute looks like, you know, maybe a decade from now, which is a modular data center sitting right on-prem, so to speak. And so you could have a sort of self-contained factory. We're nowhere close to this today, but we're lucky to have with us someone who has had a vision for a long time about how compute should work, and that's Reiner. So again, Reiner, thanks for joining us. Can you hear us okay?
SPEAKER_02: Yeah, yeah.
SPEAKER_00: Okay, great. So we're gonna spend the first few minutes fireside-chat style, talking about you, what MatX is, and the compute bottlenecks, and then we're gonna jump into Office Hours with the students. They're submitting their questions, which I've got on stream here, and we'll try to do that for at least 30 to 40 minutes.
SPEAKER_01: Great.
SPEAKER_00: All right. So why don't we start with you, Reiner? Tell us about yourself. How'd you get here?
SPEAKER_01: Yeah, so originally a software background. I worked about a decade at Google across all kinds of things, originally web development back in the day, but then, much more relevant to this, I shifted into machine learning teams, pre-neural nets. Neural nets then took over and wiped all of that out. And so I worked on chip design for neural nets, several years doing that at Google. And then coming out of that, the other big thing was working on Google Brain for several years. I helped train Google PaLM, which is one of the predecessors to Gemini. I wrote the inference stack, and that was some of the highest-performance inference software for running LLMs at the time. Very, very strong team on Google Brain. We did a lot of theoretical analysis of how to map models to hardware. We've put out a book publicly for that, called How to Scale Your Model, and put out some papers around that as well. But ultimately, that was all working on TPUs. My co-founder Mike Gunter was on the TPU architecture team at that time. And what we were seeing is, you know, GPT-3 at this time had come out. We know that models are going to be massive. We want to target very large matrices, very low precision, very low latency, and figure out how to do that in the most cost-effective and power-effective way. So at some point we concluded the best way to do that would be a bit of a clean start, focusing on large models specifically rather than the older perspective of accelerating neural nets in general. And so we left Google at the end of '22 to start MatX. The goal of MatX, and we've been doing that for a few years, is to make the best chip physically possible for running large model workloads, specifically focusing on the needs of frontier labs. They drive most of the spend.
That is where most of the build-out is happening. And so really, really nail that workload without consideration for others.
SPEAKER_00: That was a speed run, so thank you. I know there are eternities contained in each of those sentences you just told us, but let's zoom back a little bit to the moment when you decided to leave Google, right? You said it was the end of 2022. I think December 4th, 2022 is when ChatGPT came out. Yeah. Did that have anything to do with your decision to leave, or was that just an independent decision? What prompted you to leave the Google complex, which doesn't happen very often?
SPEAKER_01: Yeah. I mean, on the timing, I had no inside knowledge about whether and when ChatGPT was going to be released. But the thing which was sort of a public secret was that GPT-3 had been around for one or two years in advance of that. And it was very hard to use. It did not have the product-market fit that ChatGPT itself had. You could use it on the website, but you had to get into this mindset of, I'm not talking to a chatbot, I am completing a document, and that's really hard to use. But there were glimmers of it working really well. And one of the questions inside of Google, and Google had its own versions as well, was: these models seem very capable. There are some concerns about hallucination, which still persist today but have been dramatically improved. But then there were also concerns about cost. In this entire SaaS business, Google and all of these other companies providing SaaS, the marginal cost of a user is basically zero. You can serve all of the web traffic for tiny costs, like a millionth of a cent to serve each query. You can serve it for free, in effect. And then for the first time, GPT-3 completely violates that. You know, it costs cents per query. And when you're serving all these queries for free, it becomes extremely expensive very quickly. And so the fear was, how do you even build a product around this? It's way too expensive. Maybe we can never launch something. And I think the big thing that ChatGPT showed is that actually you can launch something, and it's incredibly valuable to users.
SPEAKER_00: And what I find is that friends who were in or around the machine learning community from that era have often asked the question, what is the right eval? Right? I went to grad school for ML, and the thing you learn consistently is you've got to start with the right eval, right? That's what matters most when you're trying to develop a new capability. And I've heard you say a few different times now that intelligence per picojoule is the right eval, both for you and for humanity. Can you zoom in a little bit on why that is?
SPEAKER_01: Yeah, so I think that's a particularly catchy way of putting it. There are always a few axes when you're doing evals; you may need to look at maybe four of them and see that you're doing well on all of them. But this is definitely one of the most important and understandable ones. The overall trade-off is: I know how I can make a model cheaper, I make the model smaller, but then I lose quality there. And conversely, I have this very straightforward recipe to make a model more intelligent: I make it bigger and more expensive. And so if I tell you, look, I've got a model that's way cheaper than anywhere else, it's not very useful. If I tell you I've got a model that's much smarter than everyone else's, well, that is actually useful, but ultimately I'm also gonna care about the subscription cost of the service that you've got. And so what users are gonna end up caring about is the level of quality somehow normalized by dollars. So I really like quality at an iso-cost comparison point. And then let's zoom into what those dollars are spent on. Those dollars are almost all spent running the data center. The model is running in a massive data center. These data centers are, these days, a gigawatt or a substantial fraction of a gigawatt. This takes many, many acres of land and is more power consumption than a town or even a city. So very, very large concentrated buildings. And the big costs going into them are, first, the cost of the GPUs or other chips that are in them, and then second after that is the cost of electricity. Cost of electricity really means you probably need to build a power plant sitting next to the data center, and you need to pay for that as well.
SPEAKER_03: Right.
SPEAKER_01: The main cost today is actually the cost of the GPUs themselves. And then if you trace down that supply chain, you're paying for the logic dies, which are produced typically at TSMC, and you're paying for the memory, which is produced at Samsung, SK hynix, and Micron. So materials costs on silicon. And then the other cost is just this cost of electricity, as we discussed.
SPEAKER_00: And as we extrapolate those costs out, from an operating-assumptions perspective for the system of MatX, which were assumptions that you felt were essentially steady-state assumptions you could hold true? And which assumptions did you feel needed significant prior updates and could actually be updated with the right technology? Is my question making sense?
SPEAKER_01: Yeah, so where are the places where the state of the art is actually as good as you can get, versus where are the ones where there's an opportunity?
SPEAKER_02: Right.
SPEAKER_01: So there's a huge ecosystem here, and many people are innovating in different places. The place where we chose to focus is to not innovate on physical technology at all. So we buy the same power supply and cooling and interconnect and cabling, all of that. We buy the same as anyone else buys. Innovating on physical technology is harder, iteration times are longer, and it's a challenging thing to build a business on. Other people do, but it's not our expertise. So we chose to innovate on the architecture of the logic die of the main chip itself. And a reason that is particularly attractive is that you can do so much in simulation, and you can actually model it out and, to some extent, even in your head, predict what the performance is going to look like. And then zooming in within that, what are the different resources you optimize for? On an AI chip in general, there are maybe five or so big resources you care about. It is the HBM bandwidth and capacity, so the memory bandwidth and capacity, the compute performance measured in FLOPS, and then the interconnect bandwidth measured in bytes per second. When we look at those technologies, it seems to be the case that the marginal improvement you can make on compute performance is much larger than the improvement you can get on the other two things, the memory and the interconnect. The memory and interconnect are fairly well-specified and well-understood problems of just the most bytes per second that you can do, and they've already been pushed very much to their physical limits. But on the compute side, there's a lot more richness in the design space. I can play with things like: how do I lay out the compute cores on my chip? Which number formats do I use? What is the ratio of storage to compute inside the chip itself?
And we find that when we play with those things, you can get improvements of multiple factors over some of the other things.
SPEAKER_00: That was a great summary. Now let's zoom in on those parameters, right? You've got a bunch of parameters when you're innovating on compute, especially on the logic die. But getting those decisions right is quite important.
SPEAKER_02: Yes. Yeah.
SPEAKER_00: The margin for error is quite low, especially because an error you make there ends up literally impacting hundreds of billions of dollars' worth of CapEx, right? Yes. So can you walk us through, and as you can tell, I'm teeing you up to talk about co-design, what the optimal way is to manage the risk around those decisions from a systems perspective, how you've ultimately converged on what the right process is, where the risks are, and how the students should think about the unfair advantages that you've discovered or the asymmetric bets you've made, that you may get wrong, but if you get right, have a chance to change humanity and the industry?
SPEAKER_01: Yeah. So I think the big task and the main work of architecting a chip like ours, especially when the focus is on the digital logic and the architecture of it, primarily comes down to one of resource balance. We have a fixed budget of area, of square millimeters of silicon, that is determined by the manufacturing constraints of TSMC. And within that fixed budget, it is somewhat zero-sum, in that the resources I allocate that silicon to are SRAM, which is the on-chip memory; compute; and then on-chip interconnect, like wiring between different parts of the chip. So your goal is to pick a resource allocation that, when I map my neural network to this hardware, gives me the overall best tokens per second, for example. And it turns out the models are simple enough that you can actually make really good analyses in a quite simple way, based on spreadsheet or Python modeling. And likewise the hardware is actually relatively simple compared to something like a CPU, which is extremely complicated, so you can do a lot of analysis. And so a big part of what we do, especially early in product definition, is saying: if I have a chip with a certain amount of memory and a certain amount of compute performance and a certain amount of interconnect, what is the optimal way to take an LLM and cut it up over maybe thousands of these chips? Considering what the communication bottleneck is between the chips, am I limited by memory capacity, memory bandwidth, or compute performance? And then after considering all of that and trying all of the different ways to cut it up, what is the best way to cut it up for a given hardware? And then finally, the outer loop over this is to, in principle, search over all possible hardware.
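The inner and outer search loops described here can be sketched in a few lines of Python. This is a toy model under stated assumptions, not MatX's actual methodology: the `Chip` and `Model` fields, the "2 FLOPs per parameter per token" estimate, and especially the crude communication term are all illustrative choices, and every number passed in would be hypothetical.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Chip:
    flops: float         # peak compute, FLOP/s (hypothetical value)
    mem_bw: float        # memory bandwidth, bytes/s (hypothetical value)
    mem_capacity: float  # memory capacity, bytes (hypothetical value)
    link_bw: float       # chip-to-chip bandwidth, bytes/s (hypothetical value)


@dataclass(frozen=True)
class Model:
    params: float           # parameter count
    bytes_per_param: float  # e.g. 1.0 for an 8-bit format
    d_model: int            # hidden width, used only for the comm estimate
    n_layers: int


def step_time(chip, model, n_chips, batch):
    """Estimated seconds per decode step, or None if the weights do not fit."""
    weight_bytes = model.params * model.bytes_per_param / n_chips
    if weight_bytes > chip.mem_capacity:
        return None
    compute_s = 2 * model.params * batch / n_chips / chip.flops
    memory_s = weight_bytes / chip.mem_bw
    # Crude stand-in for communication: one activation vector per token per layer.
    comm_bytes = model.n_layers * batch * model.d_model * model.bytes_per_param
    comm_s = comm_bytes / chip.link_bw if n_chips > 1 else 0.0
    # The step is limited by whichever resource is the bottleneck.
    return max(compute_s, memory_s, comm_s)


def best_partition(chip, model, chip_counts, batches):
    """Inner loop: best (tokens/s per chip, n_chips, batch) for one chip design."""
    best = None
    for n_chips, batch in product(chip_counts, batches):
        t = step_time(chip, model, n_chips, batch)
        if t is None:
            continue
        candidate = (batch / t / n_chips, n_chips, batch)
        if best is None or candidate > best:
            best = candidate
    return best


def best_hardware(chips, model, chip_counts, batches):
    """Outer loop: the chip design whose best partition is fastest, or None."""
    scored = [(best_partition(c, model, chip_counts, batches), c) for c in chips]
    scored = [(score, c) for score, c in scored if score is not None]
    return max(scored, key=lambda sc: sc[0][0], default=None)
```

A real model would also account for KV-cache traffic, attention cost, latency targets, and several kinds of parallelism; the point is only that "try every way to cut it up, then loop over hardware" fits in a short script.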
SPEAKER_00: And which of those parameters are within your control, and which ones often require input from customers?
SPEAKER_01: I mean, you said co-design, and I think that's where the really big opportunity is: to not treat the workload as fixed. So if you say, we know what the workload is, it's DeepSeek V3 or maybe V4, which came out today, and that is the workload, I'm gonna make the best chip for that, this is an optimization problem where you've said this one half is fixed and I'm gonna optimize the other half. And then there's this slow long-term iteration loop where you come up with new hardware, and then maybe DeepSeek V6 or V7 will be optimized for this hardware, but they haven't been jointly optimized. And that is a very slow iteration loop to find a global optimum. So the much better thing to do, if you can, is to say I'm gonna jointly optimize the model and the hardware at the same time. Right. That requires you to not just be a model company and not just be a hardware company, but have a bit of both in-house so that you can do these evaluations. So, as an example, the way that we do that is, firstly, we narrow the search space a lot. I described this huge multidimensional search space. In reality, we don't actually search all of it; it's too slow. We have a very strong human prior of what the likely good points are. And then we pick some of those and we train models. We train them from scratch and look at what the quality is. We don't have the same money to train models that OpenAI does, for example, but scaling laws really help, where you can evaluate at a small scale and then predict up to what that is. So, making a few careful hardware bets. I mean, to give some examples, we've said low precision is very important. Other things would be: what is the compute to memory bandwidth ratio?
Maybe we can change that significantly relative to where models are today, and then evaluate how good the models are that are optimized for that hardware.
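As a rough illustration of "evaluate at a small scale and then predict up", the sketch below fits a pure power law to small-run losses and extrapolates it. Everything here is an assumption for illustration: the functional form `loss = a * compute**(-b)` (real scaling-law fits usually include an irreducible-loss term), the function names, and the data, which is generated synthetically from a known curve and is not a measurement of any real model or chip.

```python
import math


def fit_power_law(compute, loss):
    """Least-squares fit of loss = a * compute**(-b) in log-log space."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(v) for v in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def predict_loss(a, b, compute):
    """Extrapolate the fitted curve to a larger compute budget."""
    return a * compute ** (-b)


# Synthetic "small-scale runs" drawn from a known curve (NOT real measurements).
TRUE_A, TRUE_B = 20.0, 0.05
small_runs = [1e17, 3e17, 1e18, 3e18]            # training FLOPs per run
baseline = [TRUE_A * c ** (-TRUE_B) for c in small_runs]
# A second, hypothetical model/hardware design point that is uniformly 3% better.
variant = [0.97 * v for v in baseline]

a_base, b_base = fit_power_law(small_runs, baseline)
a_var, b_var = fit_power_law(small_runs, variant)

target = 1e22  # a budget far beyond anything actually "trained" above
print(predict_loss(a_base, b_base, target), predict_loss(a_var, b_var, target))
```

The comparison of the two extrapolated losses is the part that matters for co-design: it ranks candidate design points without paying for a full-scale training run.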
SPEAKER_00: So ideally, if you're a model company, like Google did with you, you have your Gemini team or your PaLM team, right? And then you have your chip team right next to you. You've got Reiner doing systems co-design back and forth. And that works great when you have full visibility internally; you're within one trust boundary. Yeah, right. But when you leave Google, now you're not in the Google trust boundary. Yeah. And so, how do you navigate the trust boundary between you and your customers, who are some of the world's fastest-growing companies, but for whom model architecture, model innovation, and model workload insight is core IP? Yeah. And they can't have that IP leaking, right? And so, how do you navigate that problem where you're now within the trust boundary of MatX and your customers are in a different trust boundary, whether that's some new frontier lab that you're trying to work with, or it might be Google too, because you might be wanting them as a customer? How do you navigate that co-design problem?
SPEAKER_01: Yeah. I think that this is a real information discovery challenge. Looking historically, there was a time, right up until the release of ChatGPT, when labs were actually publishing all of their technology. That stopped at the end of '22. But there was a time when any hardware company could know what the state of the art was by looking at publications.
SPEAKER_03: Right.
SPEAKER_01: That stopped, and then restarted at lower volume with DeepSeek, with some of their publications. So for a hardware company that is not attached to a frontier lab, looking at DeepSeek and some of the other open source labs and their publications has been the most specific public information. And we've gotten a lot from that. We don't literally take the DeepSeek model; we look at it and then say, okay, we understand the ideas, the ideas have been validated, and what are the different ways you might scale them up, and so on. But that is the biggest, most specific form of information.
SPEAKER_03: Yeah.
SPEAKER_01: That puts you at a real disadvantage compared to a vertically integrated player, like you describe with Google. And the way you manage that is, I mean, you called it trust, and this is really the right thing. An important aspect of running a business like ours is actually developing very close trust with customers. You see this from NVIDIA. OpenAI has a history of giving a lot of product feedback on NVIDIA's products. They're very aligned on that: NVIDIA products being better will make OpenAI do better. And really, you just need to demonstrate that you're impeccable with the trust that's been placed in you. And we also reciprocate a lot. We share a lot of information about our chip architecture so that our customers can make an assessment of: does this already meet my needs, or is there something that should be changed?
SPEAKER_00: You know, on that topic, one technique that keeps coming up over and over again in systems design is you have either full systems integration, right, where you have vertical integration and you just collapse some process architecturally into one node, yeah, or you have different nodes, but then you do pooling and utilization, right? Yes. Because if you pool, then you can drive up utilization. In some sense, the day you decided to leave Google, you became a node. That's right. You went from one regime to the other, right? Yes. And where in your journey are you right now? Should the students think about MatX as a bet ultimately on ecosystem pooling, or do you believe that there's a different way for you to do systems co-design where you actually could make custom implementations for a single company? Where on that X versus Y axis should folks think about the current landscape, where you fit in, and where you think the world will go?
SPEAKER_01: Yeah, so I'll talk about us specifically first, and then I'll zoom out to the ecosystem as well. Us specifically, being a startup working on our first-generation product, we can't afford the distraction of focus to make custom chips on our first generation for multiple different players. So we've designed a chip that we think is a really good chip and has a lot of flexibility for all of the things you might want to do, at not too much of a cost. I think that there's a reasonable economic argument for that to change as you grow larger. Just to put it straightforwardly, the tape-out itself costs maybe $30 million. There's a lot of IP you license as well, and there are other manufacturing processes. So call it $100 million per tape-out that you do. Frontier labs are spending tens of billions or hundreds of billions of dollars on chips. And so that spend is enough to cover many, many different tape-outs, in fact. And so you should be able to develop many different chip products. If you look at the broader market, this is not happening. No one is really developing a hundred different SKUs or even 10 different SKUs of their product. Almost everyone makes one or two SKUs of their product, and the variations are very small. So this is not so much driven by the cost of the tape-out, but more by the engineering focus. Yeah. I think over time we'll see this push towards more specialized products.
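The tape-out arithmetic can be made concrete with a few lines of Python. The $100 million all-in figure and the tens-to-hundreds-of-billions spend range come from the conversation; the 1% share of spend set aside for tape-outs is purely an assumed number for illustration, as is the function name.

```python
TAPEOUT_ALL_IN_USD = 100e6  # "call it $100 million per tape-out" (from the talk)


def tapeouts_covered(chip_spend_usd, share_for_tapeouts):
    """How many tape-outs a given slice of chip spend would pay for."""
    return chip_spend_usd * share_for_tapeouts / TAPEOUT_ALL_IN_USD


# ASSUMPTION: 1% of chip spend is diverted to tape-outs (illustrative only).
for spend in (10e9, 100e9):
    print(f"${spend / 1e9:.0f}B spend -> {tapeouts_covered(spend, 0.01):.0f} tape-outs")
```

Even a small slice of that spend covers several distinct chips, which is consistent with the point that engineering focus, rather than tape-out cost, is what keeps SKU counts low.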
SPEAKER_00: Got it. Great. There's one more question I had, which is if you could talk to the students about pre-fill versus decode and why that's an important architectural difference that they should be aware of as they navigate the compute space, that would be useful.
SPEAKER_01: Yeah. So in general, there are three main categories of workloads for chips running LLMs, which are training, pre-fill, and decode. Pre-fill and training actually look very similar to each other in practice. The chip has to process a set of tokens that run through a forward pass of the model, and the big distinction is: do I have many tokens presented to the chip all in parallel, or are they presented sequentially? And this ends up affecting all of the resource requirements of the workload. So pre-fill is when the AI itself is not producing tokens; it's ingesting tokens given to it from the outside, either from a chat context or because it's reading a file in your Git repository or something like that. All of those tokens are presented to the model in parallel. The model can share a lot of the resources that would be used per token. In particular, it can amortize all of the memory bandwidth, all the memory fetches, over all of those tokens that are processed in parallel. In contrast, decode is when the AI is generating new text, and it has to generate text one token at a time, or maybe three or four tokens at a time, but substantially sequentially. And this ends up not having the same amortization ability, so you don't get the same savings on memory bandwidth. It ends up meaning that decode tends to be more memory-bandwidth dominated, whereas pre-fill and training tend to be more compute dominated. Some people have said, okay, maybe that means we should have a different style of chip for decode versus pre-fill. We at MatX have actually taken a different position: you can make an extremely good chip that works really well for both.
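The amortization argument can be checked with a small roofline-style calculation in Python. This is a sketch under simplifying assumptions: roughly 2 FLOPs per parameter per token, weights read from memory once per step, and no accounting for attention or KV-cache traffic. The function names and any hardware numbers passed in are hypothetical.

```python
def step_profile(params, bytes_per_param, tokens_in_parallel, peak_flops, mem_bw):
    """Compare compute time and weight-fetch time for one forward step."""
    flops = 2 * params * tokens_in_parallel  # ~2 FLOPs per parameter per token
    weight_bytes = params * bytes_per_param  # weights fetched once per step
    compute_s = flops / peak_flops
    memory_s = weight_bytes / mem_bw
    return {
        "flops_per_byte": flops / weight_bytes,  # arithmetic intensity
        "compute_s": compute_s,
        "memory_s": memory_s,
        "bound": "compute" if compute_s >= memory_s else "memory",
    }


def break_even_tokens(bytes_per_param, peak_flops, mem_bw):
    """Parallel token count at which compute time equals weight-fetch time."""
    return bytes_per_param * peak_flops / (2 * mem_bw)


# Hypothetical 70B-parameter model in an 8-bit format on a hypothetical chip.
prefill = step_profile(70e9, 1.0, 4096, peak_flops=1e15, mem_bw=3e12)
decode = step_profile(70e9, 1.0, 1, peak_flops=1e15, mem_bw=3e12)
print(prefill["bound"], decode["bound"])
```

With many tokens in parallel, the one-time weight fetch is spread over all of them and the step is limited by compute; with a single token, the same fetch dominates. Batching many users' decode steps together moves decode toward the break-even point, which this sketch exposes through `tokens_in_parallel`.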
SPEAKER_00: And could you spend just a couple of minutes talking about how? You know, what is the innovation there that allows that generalizability across both those workloads?
SPEAKER_01: Yeah, so really what you want in order to perform well on pre-fill and training is very high compute throughput. What you want in order to perform well on decode is very good memory bandwidth. For us, that means we get that memory bandwidth by having a huge amount of capacity and bandwidth of SRAM on the chip. To some extent that actually solves a lot of the memory bandwidth requirement for decode. But then we also put a lot of compute performance on the chip. Fitting both of them into the same chip, that's the challenge. You only have a certain amount of silicon budget. And so we have a few different architectural optimizations that we've done to be much more area efficient, both around numerics and in the way we connect all the cores together on the chip, so we don't spend too much on wiring.
SPEAKER_00: Fantastic. Okay, thank you so much for that speed run on chip 101. There's a bunch of questions piling up, so I'm just gonna start going through them. The first one is that you went from Haskell and math to TPU architecture to founding a chip company. What did the transition from software to hardware actually feel like? And what surprised you most?
SPEAKER_01: So the software I was doing immediately before switching to chip architecture was performance optimization. I was writing C. I spent all of my time looking at assembly code generated by the compiler and saying, okay, there's an instruction here that wasn't necessary. Can we change it somehow to fix that? So a theme when you are doing this kind of performance optimization is you really want to know the primitives you are working with really well. So know the assembly code and the speed of every single instruction on an Intel CPU. That was my mentality working on high-performance software. That exact same mentality actually ended up applying very strongly on chips as well. So the question is, what are the primitives that you work with on a chip? I think we understand that assembly instructions are the primitives on a CPU, but what are the primitives on a chip? It is the gate library. There is a relatively small set of gates: AND, OR, NOT, and XOR are the standard ones, and then there are some slightly bigger ones, which are muxes and a particular device called a full adder. So one of the things I did quite early on was I just read through the gate library, which has a description of all of these gates and what the area and power are, so what the performance cost of these gates is. It's a small enough list that you can commit it to memory. This is really helpful when you want to reason about the cost of this thing I have in my mind; then I can quickly evaluate that. So the big transition was to learn this new set of primitives. The other thing that ends up being qualitatively different when doing hardware design is that it is extremely, extremely parallel. You can think that there are maybe 100 billion transistors on a chip. All of them operate at the same time.
And so, to some extent, you are designing a chip that is a hundred-billion-way parallel, which is a very different place than having, say, a hundred cores on a CPU. So there's a bit of a mindset shift: what does massive, massive parallelism look like?
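To make the "gate library as primitives" idea concrete, here is a minimal Python sketch of a full adder built from the standard gates, plus a cost estimate read off a small gate table. The area numbers in `GATE_AREA` are hypothetical placeholders, not values from any real library, and `ripple_carry_add` is only one textbook way to chain full adders.

```python
# Hypothetical relative area cost per gate; a real gate library lists its own numbers.
GATE_AREA = {"AND": 1.0, "OR": 1.0, "NOT": 0.5, "XOR": 2.0}

# A textbook full adder uses two XORs, two ANDs, and one OR.
FULL_ADDER_GATES = {"XOR": 2, "AND": 2, "OR": 1}


def full_adder(a, b, carry_in):
    """Add three input bits; return (sum_bit, carry_out)."""
    partial = a ^ b                              # XOR 1
    sum_bit = partial ^ carry_in                 # XOR 2
    carry_out = (a & b) | (partial & carry_in)   # AND, AND, OR
    return sum_bit, carry_out


def area(gate_counts):
    """Total area of a block, given how many of each gate it uses."""
    return sum(GATE_AREA[gate] * count for gate, count in gate_counts.items())


def ripple_carry_add(x, y, width):
    """Add two unsigned integers by chaining `width` full adders."""
    result, carry = 0, 0
    for bit in range(width):
        sum_bit, carry = full_adder((x >> bit) & 1, (y >> bit) & 1, carry)
        result |= sum_bit << bit
    return result, carry


if __name__ == "__main__":
    print(ripple_carry_add(5, 3, 4))
    # Cost of a 4-bit ripple-carry adder under the placeholder area table.
    print(4 * area(FULL_ADDER_GATES))
```

With a table like this memorized, the cost of a candidate block is a quick multiplication rather than a tool run, which is the kind of mental evaluation described above.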
SPEAKER_00Awesome, thank you. The next question is: what do you know after two years at MatX that you wish you'd known on day one?
SPEAKER_01There was some evolution of the market that I didn't expect. What we see is that behind the NVIDIA behemoth is a huge ecosystem of suppliers building things like power supplies, cooling, cabling, and interconnect technology. There have been such strong demands on that ecosystem over the last few years that there's been a lot of innovation on all of these things. It has ended up meaning that if you look at the latest NVIDIA racks, their power consumption is five times higher, or even ten times higher, than where it would have been a few years ago. But this ecosystem is actually able to deliver the power, remove the heat, and make all of that work out. So the bar on how ambitious you can be physically has risen a lot, and that has led to everyone having much better products. I think if we had been even more AI-pilled than we were, we maybe could have predicted this growth in the market. But ultimately it did mean that things like interconnect, memory, and power density ended up being a lot better than we expected, and we had to somewhat pivot to that target.
SPEAKER_00Don't worry, I'm not sure anyone has been guilty of being too AI-pilled, but I find myself saying this phrase a lot more now: better late than never. The bitter lesson is still holding, so you might as well update your priors and not have an ego about it. So thank you for doing that. The third question is: at Google, you were a tech lead on TPU v2 and v3 and also trained PaLM. How much did working on both sides of the stack, software and hardware, shape how you think about chip design?
SPEAKER_01Yeah, so I did not, in fact, lead TPU v2 and v3. I helped with the ideation of some of the later-generation TPUs. So I'll tell the story of how that interacted with the PaLM work. I was definitely coming from the workload mapping and software optimization point of view. A big part of workload mapping is to develop analytical models and say: I have these thousands of different ways of cutting up a model and running it on many different chips, and of running the programs in different orders, so look at the scheduling and select the best one. This is the co-design process that I talked about. When we went through that exercise, one of the big things we found, and I think it is a common theme for what you find in this exercise, is that many chips have more resources than they really need, especially around communication and memory. It's a natural place to be. Customers really like it when you have a lot of surplus on interconnect and memory because they don't want to deal with a resource crunch there. So it's much easier and friendlier to customers to over-provision, but it grows cost as well. Your ideal operating point is that you hit close to 100% of your theoretical compute performance. The question is whether that is limited by the other resources, the memory and interconnect performance. Typically what you do is over-provision those other resources so that you are never limited, so that you can realize all of the compute performance. But then the optimization process for a chip is to bring that margin down so that it's quite a tightrope, and you're very close to the balance point.
So this was a big part of it: we can analyze that very, very well once we are experts on how to map models to hardware. That is a software exercise, but it feeds back into hardware, saying that typically you've over-provisioned in some of these ways and you could make the hardware cheaper. That's really what led into some of the TPU designs.
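The balance-point reasoning above can be sketched as a tiny analytical model. This is an illustrative sketch, not the model used at Google or MatX: it assumes compute, memory traffic, and interconnect traffic overlap perfectly, so the slowest resource sets the step time. The names `Chip`, `Workload`, `utilization`, and `trimmed_chip`, and all numbers, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Chip:
    flops_per_s: float      # peak compute
    mem_bytes_per_s: float  # memory bandwidth
    net_bytes_per_s: float  # interconnect bandwidth


@dataclass
class Workload:
    flops: float       # compute per step
    mem_bytes: float   # memory traffic per step
    net_bytes: float   # interconnect traffic per step


def resource_times(chip, work):
    """Time each resource would need if it were the only one in play."""
    return {
        "compute": work.flops / chip.flops_per_s,
        "memory": work.mem_bytes / chip.mem_bytes_per_s,
        "interconnect": work.net_bytes / chip.net_bytes_per_s,
    }


def utilization(chip, work):
    """Fraction of each resource in use, assuming the slowest one sets the pace."""
    times = resource_times(chip, work)
    step_time = max(times.values())
    return {name: t / step_time for name, t in times.items()}


def trimmed_chip(chip, work, margin=1.1):
    """Shrink memory and interconnect to `margin` times what compute-bound operation needs."""
    compute_time = work.flops / chip.flops_per_s
    return Chip(
        flops_per_s=chip.flops_per_s,
        mem_bytes_per_s=min(chip.mem_bytes_per_s, margin * work.mem_bytes / compute_time),
        net_bytes_per_s=min(chip.net_bytes_per_s, margin * work.net_bytes / compute_time),
    )


if __name__ == "__main__":
    chip = Chip(flops_per_s=100.0, mem_bytes_per_s=10.0, net_bytes_per_s=5.0)
    work = Workload(flops=200.0, mem_bytes=10.0, net_bytes=2.0)
    print(utilization(chip, work))                            # over-provisioned profile
    print(utilization(trimmed_chip(chip, work), work))        # closer to the balance point
```

The `margin` parameter stands in for how close to the tightrope the designer is willing to walk; a margin of 1.0 would leave no slack at all for this one workload.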
SPEAKER_00Great. The next question is: what's the hardest part of building a chip company that nobody talks about?
SPEAKER_01One of the big challenges that we are looking at now is how to manufacture in very, very large volume. Our goal is to sell to frontier labs. They're very attractive because they have a lot of expertise in-house and they drive a lot of volume. But the flip side is that since they have such large volumes, for your product to be meaningful, you need to be able to sell a lot. Selling a lot is great, you can get a lot of revenue. But especially for a startup, going from zero volume to massive, massive gigawatts of volume in a very short amount of time means there's a whole supply chain ecosystem that you need to build up. You need to build trust with your vendors that you are actually going to be around in a few years, and that when they are ready to deliver, you are ready to pay for it. That really shows up with logic die vendors, so TSMC, and with memory vendors, which are SK Hynix, Samsung, and Micron.
SPEAKER_00Yep. Actually, on that topic, could you talk a little bit about the cloud ecosystem and how you think about folks hosting your chips? What happens after manufacturing is that somebody needs to host those chips. Could you talk about how you're approaching that problem, that bottleneck?
SPEAKER_01Yeah. So I'll do the general economy first and then how it relates to us. In the general economy, there is this build-out of data centers happening on a massive scale now, and it's primarily constrained by power. I mean, there's a land constraint, but the bigger constraint is power: being close to power and building out new power plants. This is a substantial constraint, both now and in the past few years. The expectation is that it's a balance of this constraint versus the cost of the GPUs, and there's always a bit of push and pull over time here. Now switching to how we view this: we don't want to additionally be in the business of building data centers, because that is risk stacking. We already have enough risk in developing our product. So we offload the question of building data centers, securing power, and operating the data centers to our customers. That is how the industry already is in practice. For the frontier labs, a huge amount of what they're doing is actually securing all of these data centers as well as chips. To some extent, they operate their own data centers. They also sign a lot of deals with partners who have built up data centers and power supply themselves. Any means necessary is what they're doing.
SPEAKER_00On that topic, this is a constant tension leaders often face, right? You have a mission. In your case, it is to bring that sort of generalizability of workload to frontier teams. I don't know if you feel like I'm summarizing the mission correctly.
SPEAKER_01I mean performance. Very high performance.
SPEAKER_00Performance, right. And to accomplish that, you often need to make a bunch of trade-offs. When you're early in a category, that means deciding what not to focus on.
SPEAKER_02Yeah, right.
SPEAKER_00And just in that answer, you said you decided not to focus on the cloud hosting part. The risk, of course, is that you're now dependent on somebody else to do that well with your chip. And then there exists, again, an opportunity to co-design between you and them. So there's this recursive pattern of co-design that keeps coming up, and I'd like the students to realize that it is a pattern up and down the stack in systems design. But how would you reason about the opportunity to co-design the data center, for example? Given your understanding of the MatX implementation, what would be the ideal data center that would really allow MatX performance to shine, to ultimately deliver the best capabilities to your end customers? Does that make sense?
SPEAKER_01Yeah, it makes total sense. I'll capture what I heard there, which is: how do you in general decide what the ROI of co-design is? You always get returns on co-design, but you take on the risk of distraction of focus and an increased possibility of failure. I think the costs are actually maybe easier to reason about. You can just reason about how much it will take you in capital cost and in building out a team to run a data center operation. In that case, it's extremely capital intensive, so that's a very clear cost. You can also reason about your expertise. Again, for us, that was a very simple thing. We looked at where we as a founding team had very strong expertise, and it is absolutely not in operating data centers. It is in the design of the chips, so we wanted to keep it as tight to our expertise as possible. So that's the cost side of the equation. Then there's the return on investment side: am I missing some huge opportunity for co-design, and how much am I going to regret this lack of vertical integration? There you can try to do some back-of-the-envelope math to bound what the costs would be. The big opportunities we see in having the data center designed right are: do you have enough chips in one data center that you get a lot of interconnect between those chips? And, quite similarly, do you have enough power density in your data center? The tension here is whether I pack my chips very close together physically or farther apart. If I pack them close together, the cables I run between them are shorter, and that has a lot of knock-on benefits.
I can use electrical cables rather than fiber optic cables, and those are cheaper. I don't have to drive them with as much energy because the data is traveling a shorter distance. So packing chips closer together is much better from an interconnect perspective: you can get more bandwidth and lower latency and all of those good things. Packing chips close together is much harder from a power point of view, because now I have to deliver something like a megawatt of power into something like a one-square-meter box. There's a really big power delivery challenge, and there's also a cooling challenge. I have to run liquid cooling to all of these chips and then extract all of that hot liquid and bring it to the building's cooling units. So the big constraint for us ends up being the power density of the data center that you're designing for. The way we approach that is standardization. We said NVIDIA is setting the standards, so we're going to fit within NVIDIA's standards. Given how much of the market they have, we think people are probably going to be building data centers for that. We don't see too much regret there.
SPEAKER_00Yeah, so this is a great callback to our first lecture, where I talked about the compute bottleneck. If you look at the history of infrastructure, we go through these boom and bust cycles, and what comes out on the other side is standardization that ensures fungibility of resources. That happened with railroads and meter gauge, and with electricity and AC/DC. I guess what you just pointed out is an interesting insight, which is that the NVIDIA reference architecture for data center design feels to you like an acceptable standard today, one that unlocks efficiency for new chip platforms to get into data centers that were maybe originally forecasted for the NVIDIA reference design.
SPEAKER_01That is exactly right. And customers think about it in a similar way. They're building data centers, and the lifetime of a data center is really long, it's in decades. That's for the shell at least. For the power delivery network, it's maybe a little shorter. But they absolutely have this fungibility consideration in mind.
SPEAKER_00In contrast, the Google TPU data center design, which is not open source, might be a different standard, right? So that prevents fungibility across these two ecosystems, from TPUs to this other standard. In your view, will that fungibility always remain a constraint? Or do you see a future where compute can flow across data centers that were designed for different specs?
SPEAKER_01Yeah, the standardization sets a bounding box. You can have this much power density, you can have this much interconnect, your racks can weigh this much, and it's always less than or equal to this much, actually. So if you want to fit in a bounding box that someone else has defined, it's totally fine to be less than that. And if you are more than that, like if you consume more power, there are things you can do to adapt. They compromise your product, but you can say, okay, only half as many chips per rack, or something like that. That makes the racks smaller. You may lose some of the other advantages you have, but in this particular case, the design of a rack is a lot cheaper than the design of a chip, so you have some ability to impedance match at the rack level.
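The bounding-box idea amounts to a set of "less than or equal to" checks. A minimal sketch, assuming the box is described only by rack power, rack weight, and slot count; the field names and every number below are hypothetical placeholders rather than figures from any real rack standard.

```python
def chips_per_rack(rack_limits, chip, max_slots):
    """Largest chip count that stays inside every limit of the rack bounding box."""
    by_power = int(rack_limits["power_w"] // chip["power_w"])
    by_weight = int(rack_limits["weight_kg"] // chip["weight_kg"])
    return min(max_slots, by_power, by_weight)


if __name__ == "__main__":
    # Placeholder bounding box and chips, for illustration only.
    rack = {"power_w": 120_000, "weight_kg": 1_500}
    modest_chip = {"power_w": 1_500, "weight_kg": 10}
    hungry_chip = {"power_w": 3_000, "weight_kg": 10}
    print(chips_per_rack(rack, modest_chip, max_slots=72))
    print(chips_per_rack(rack, hungry_chip, max_slots=72))
```

A chip that draws more power than the box was sized for still fits, at the price of fewer chips per rack, which is the rack-level adaptation described above.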
SPEAKER_03Yeah.
SPEAKER_01The Google data centers, I think, are a really interesting example. Over the past several years, we've seen big advantages from that vertical integration for Google. They were really the first to very deeply adopt liquid cooling, and that is a massive advantage in what you can do with your chip. You can run it at almost twice the speed as a result because it can be cooled better. Google has also been deploying much larger interconnect domains. Different generations of TPU have had something like 4,000 chips, or even 8,000 chips, all connected together at very high interconnect bandwidth. If you contrast that with NVIDIA, which recently, in the Blackwell generation, started connecting 72 together, you can see that there's a very big difference in this spec, which is coming primarily from Google's ability to vertically integrate. Being an outside company, we cannot rely on the level of vertical integration that Google has. So we have to fit within the more flexible NVIDIA profile, and a big part of our product design is to make the choices where we're not losing too much as a result.
SPEAKER_00Yep. Awesome, thank you. The next question: you've mentioned being excited about unexplored model architectures. What architectural changes in future models would most benefit MatX's design choices versus hurt them?
SPEAKER_01Yeah. In general, if you take a model, map it to a chip, and then look at how much of your different chip resources are used, you'll get some profile like 100% usage of compute, 50% usage of memory, 20% usage of interconnect, something like that. You should expect that all model companies, when they have a given piece of hardware, are going to adapt their model to move all of those dials up to 100%. For example, if I want to increase my usage of memory bandwidth and I'm talking about the mixture-of-experts layer, I can have a greater degree of sparsity, so more experts while holding the number of activated experts constant. Or I can increase the size of my KV cache, so the number of bytes that I store per token of context. That would adjust the amount of memory usage. To adjust the amount of compute usage, I would change the number of activated experts. And to adjust the amount of interconnect bandwidth, I would change the size of my experts. Bigger experts need less interconnect, smaller experts need more interconnect. So I have all of those levers available to me as a model company, if I were a model company. What we then do is think about what models look like if they are hitting 100% of all of those resources on our chip. Concretely, we are unusual in that we have a substantially higher compute-to-memory-bandwidth ratio. That's because we have the same amount of memory bandwidth and a lot more compute. So the things that are most attractive are things that figure out how to utilize that compute best. The simplest one is the one I already said: more activated experts. Other things that are interesting there are spending compute to decompress the KV cache.
Like, you've got a very compressed KV cache, and then you expand it into a bigger and more useful one when you actually operate on it inside the chip.
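The dials described in this answer can be made explicit with a toy demand model for one mixture-of-experts layer during decode. Everything here is an assumption for illustration: experts are taken to be gated feed-forward blocks with three weight matrices, every token is assumed to pick different experts until all experts are touched, attention compute is ignored, and the function and parameter names are hypothetical.

```python
def moe_decode_demand(batch, context_len, d_model, d_expert, n_experts, n_active,
                      kv_bytes_per_token, bytes_per_param=1, bytes_per_activation=2):
    """Rough per-step demand on compute, memory traffic, and interconnect traffic."""
    expert_params = 3 * d_model * d_expert           # assumed gated FFN: three matrices
    flops = 2 * batch * n_active * expert_params     # multiply-add per routed token
    experts_touched = min(n_experts, batch * n_active)
    weight_bytes = experts_touched * expert_params * bytes_per_param
    kv_bytes = batch * context_len * kv_bytes_per_token
    # Each routed token is sent to its expert and the result is sent back.
    net_bytes = 2 * batch * n_active * d_model * bytes_per_activation
    return {"flops": flops, "mem_bytes": weight_bytes + kv_bytes, "net_bytes": net_bytes}


if __name__ == "__main__":
    base = dict(batch=256, context_len=8192, d_model=4096, d_expert=2048,
                n_experts=64, n_active=8, kv_bytes_per_token=1024)
    print(moe_decode_demand(**base))
    print(moe_decode_demand(**{**base, "n_experts": 128}))   # more sparsity: more memory traffic
    print(moe_decode_demand(**{**base, "n_active": 16}))     # more activated experts: more compute
    print(moe_decode_demand(**{**base, "d_expert": 4096}))   # bigger experts: less interconnect per FLOP
```

In this toy model, raising `n_experts` moves only the memory dial, raising `n_active` moves compute, and raising `d_expert` grows compute without growing interconnect traffic, which matches the directions given in the answer.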
SPEAKER_00I hope folks are taking notes, because each of those bullet points could win a test-of-time award at NeurIPS if implemented correctly. So that's a request for solutions along the lines of what Reiner's talking about, guys. Okay. How do you think about RL-driven chip design? Is that something you're actively using, or is it still more theoretical?
SPEAKER_01Yeah, so there's the RL loop, and then separate from that is just AI- or LLM-driven chip design. For the LLM-driven one, developing a chip is writing code. We write Verilog, and we have another language called Bluespec, to express the chip itself, and then we write the software that runs on it. All of that is just writing code, and existing agents are really, really good at that already. So that is the very straightforward path to implementing a chip faster. Now the RL loop: I think what that's hinting at is, can we have a really powerful optimization process on top of that? To assess where that makes sense, you have to compare against other optimization approaches. The natural other mechanisms are: can I just do exhaustive search on the space? Maybe the space is small enough that I can do that, or maybe I can even solve analytically for the optimum. For a lot of the tasks that I described in co-design, the problem definition is actually simple enough that it completely yields a perfect solution to these simpler methods. So, how do I map a workload optimally to this set of chips? You can solve that analytically under appropriate assumptions. So in the context of the very high-level chip specs, that doesn't necessitate a powerful and complicated optimization process like RL. RL is most useful when the search space is really massively large and an exhaustive search is not practical. That has shown up in chip design especially in the physical design parts of the chip. Where do I place each of the gates individually in the 2D grid that is the chip? Especially, where do I place the memories relative to the gates? Google has several examples of this, and NVIDIA has some examples of this.
That's beyond the realm of what you can fit in your head, and it is a really good fit for RL. I think it's pretty attractive. To date, the wins from it have been pretty modest, because the opportunities in physical design are not as large as in architecture. So the wins are maybe on the order of 10% or so.
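For the workload-mapping case, where the search space is small, exhaustive search is easy to sketch. The example below enumerates every way to split a chip count into data, tensor, and pipeline parallelism and picks the cheapest under a cost function. The cost function `toy_step_time` and its constants are invented placeholders, not a real performance model; only the enumerate-and-take-the-minimum structure is the point.

```python
def factorizations(n_chips):
    """Yield every (data, tensor, pipeline) split whose product is n_chips."""
    for data in range(1, n_chips + 1):
        if n_chips % data:
            continue
        rest = n_chips // data
        for tensor in range(1, rest + 1):
            if rest % tensor == 0:
                yield data, tensor, rest // tensor


def toy_step_time(data, tensor, pipeline, work=1.0, tensor_comm=0.004,
                  data_comm=0.002, microbatches=8):
    """Placeholder analytical cost: ideal compute plus pipeline bubble plus communication."""
    compute = work / (data * tensor * pipeline)
    bubble = compute * (pipeline - 1) / microbatches
    comm = tensor_comm * (tensor - 1) + data_comm * (data - 1)
    return compute + bubble + comm


def best_mapping(n_chips, cost=toy_step_time):
    """Exhaustively score every split and return the cheapest one."""
    return min(factorizations(n_chips), key=lambda split: cost(*split))


if __name__ == "__main__":
    print(len(list(factorizations(64))), "candidate splits for 64 chips")
    print(best_mapping(64))
```

With only a few dozen candidates, there is nothing for a learned search policy to add here. The gate-placement problem mentioned above has vastly more candidates, which is why it is the case where RL becomes interesting.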
SPEAKER_00Great, thank you. Question seven: how do you get model developers to actually optimize for MatX hardware when their entire toolchain is GPU-native?
SPEAKER_01Yeah, you can only expect a little bit of this to happen, because we're a startup. I don't know, five years from now, if we have a huge volume in deployment, the world would be different. But today we have to stick somewhat close. The way we think about that is there are these five specs that we care about: memory bandwidth and capacity, compute performance, interconnect bandwidth. Our goal is to be not substantially behind on anything. We have to be at least on par with NVIDIA on all of these, and then be substantially ahead on a few of them. What that means for a model developer is that your regret of buying a MatX chip is very low because we're not behind, but your opportunity is large. That gives you the upside without paying a lot on the downside.
SPEAKER_00We are almost at time, but if you have time, we'll take two more questions and then wrap. Okay. How do you think about the SRAM versus HBM trade-off as context windows get longer? And does your memory hierarchy need to change meaningfully at one-million-plus token contexts?
SPEAKER_01Yeah. The way we specifically use SRAM and HBM is that we put weights in SRAM and KVs, the context, in HBM. Other chips have done it differently. Groq and Cerebras have had only SRAM, and they've put everything in SRAM. And then obviously NVIDIA, Google, and Amazon have put almost everything in HBM and only use SRAM in transit. Putting weights in SRAM is a good idea. You get low latency. SRAM capacity is not really a constraint there. You can solve the capacity problem by having many, many chips and aggregating the capacity over them. Cerebras has demonstrated that that works, and it achieves much better latency than you can get with the weights in HBM. So weights in SRAM, I think, is a clear and proven win. Now, KVs: do they go in SRAM or in HBM? If you put them in SRAM, you very quickly run up against the capacity constraints of SRAM. The strategy of sharding over many chips to get more SRAM doesn't actually work, and doesn't give you an effective saving for the KV cache. The reasons are fairly technical details, but essentially it amounts to pipelining not scaling well. So that is qualitatively very different: SRAM capacity does not scale large enough to handle long context well, especially long context with large batch size. Our conclusion is that the right thing is to put KVs in HBM. That's the only way that'll actually scale, and it scales up to extremely long context lengths. The main thing you need to be concerned about is memory bandwidth, so think about accessing the KV cache sparsely or compressing it very highly.
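The capacity pressure from long contexts is easy to see with back-of-the-envelope arithmetic. The sketch below assumes a standard uncompressed attention cache that stores one key and one value vector per layer per KV head; the model shape and the per-chip SRAM capacity are hypothetical placeholders, not the specs of any real model or chip.

```python
import math


def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value):
    """Bytes of KV cache stored for one token of context (keys plus values)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_cache_bytes(tokens, batch, **shape):
    """Total KV cache size for `batch` sequences of `tokens` tokens each."""
    return tokens * batch * kv_bytes_per_token(**shape)


def chips_to_hold(total_bytes, capacity_per_chip_bytes):
    """Capacity-only lower bound on chip count; says nothing about whether sharding performs well."""
    return math.ceil(total_bytes / capacity_per_chip_bytes)


if __name__ == "__main__":
    # Placeholder model shape, for illustration only.
    shape = dict(layers=60, kv_heads=8, head_dim=128, bytes_per_value=2)
    one_sequence = kv_cache_bytes(tokens=1_000_000, batch=1, **shape)
    print(one_sequence / 1e9, "GB for a single one-million-token sequence")
    # Assume a hypothetical 1 GB of SRAM per chip.
    print(chips_to_hold(one_sequence, 1_000_000_000), "chips just to hold it")
    print(chips_to_hold(kv_cache_bytes(1_000_000, 32, **shape), 1_000_000_000),
          "chips at batch size 32")
```

The chip count here is only the capacity bound, and it grows linearly with both context length and batch size. Sparse access and compression attack the same product from the other side by shrinking the bytes that must be stored and read per token.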
SPEAKER_00Since you mentioned recursive design, do you see CAD drafting as a real bottleneck in data center projects? Or is it more about power, cooling, and MEP coordination?
SPEAKER_01So this is talking about the data center design, not the chip design. I will say first that my expertise in rack design is not as strong as in chip design. But what I see from our rack design teams is that racks are being pushed very aggressively on physical limits: power, interconnect density, and cooling. A lot of this is a risk-reward analysis. If I push aggressively here, I can get higher performance, but then when I actually have the part physically, I may discover that variations and tolerances are not as good as I wanted, and my product quality and reliability suffer too much. So a substantial amount of the work is managing that risk. I don't know how much of that effort is spent on the actual CAD and design process. I would guess it is a lesser amount, and that more goes to the overall design principles: vertical versus horizontal trays, the density of trays in the rack, and the other components that go in a rack, like power, CPUs, networking, and so on.
SPEAKER_00Fantastic. That is a wrap. That was incredible. The information density from you is incredible, and I'm going to be re-watching our lecture at least three times. But thank you so much, Reiner, and we'll probably have you back next year. Thank you, Gunish. This is a ton of fun. Okay, thanks everybody, have a great weekend. See you all. Cheers.