Semi Doped

Masterclass on Google's TPU v8 Networking

• Vikram Sekar and Austin Lyons


Google's Cloud Next 2026 keynote? Fire. 🔥

The TPU is now two chips instead of one — 8t for training, 8i for inference — but more interestingly, it's two scale-up networking topologies too.

Austin Lyons (Chipstrat) and Vik Sekar (Vik's Newsletter) walk through what actually changed, one day after the announcement. OCS? Yes. AECs? Yep. Copper? Yep. Optics? Yep.

We cover Virgo (Google's 47 petabit/second scale-out fabric, built entirely on OCS), Boardfly (the new scale-up topology for MoE inference that cuts hop count from 16 to 7), and the 3D torus Google still uses for training.

Why is optical circuit switching the substrate of Google's data center? Why do active electrical cables still carry scale-up traffic inside racks? Why did Google split the CPU layer too, with custom ARM Axion head nodes to keep the TPUs fed?

Along the way we trace the Dragonfly topology lineage to a 2008 paper by John Kim, Bill Dally, Steve Scott, and Dennis Abts. Abts went on to build Groq's rack-scale interconnect before landing at Nvidia.

Chapters:
 0:00 Intro
 0:21 Two TPUs for two workloads
 2:31 HBM, SRAM, and Axion CPUs
 7:22 Why networking is the new bottleneck
 17:14 Virgo: rebuilding scale-out on optics
 25:24 3D torus Rubik's Cube scale-up for training
 34:50 Boardfly: scale-up for MoE inference
 42:07 Workload-specific everything

Follow Chipstrat:
Newsletter: https://www.chipstrat.com
X: https://x.com/austinsemis

Follow Vik:
Newsletter: https://www.viksnewsletter.com/
X: https://x.com/vikramskr

SPEAKER_00

The constraint is no longer compute; it is the networking that underlies all of compute today. That is the bottleneck that needs to be solved, and that is what Google's innovations now address.

SPEAKER_01

Hello everyone, welcome to another Semi Doped podcast. I'm Austin Lyons of Chipstrat, and with me is Vik Sekar from Vik's Newsletter. Today we're going to talk all things Google: TPUs, networking, silicon. And Vik is going to try sharing a screen, so if you're listening to this, you might want to watch it on YouTube.

SPEAKER_00

Yeah, this is our first attempt at sharing video, because there are so many developments on the networking side in Google's recent announcements: their whole TPU architecture and how they handle training and inference differently, not just on the chip side, which we'll talk about, but also on the networking side. It's easy to talk about chips in the sense of, oh look, it has so much RAM, it has so many flops. It's very difficult to talk about networking without showing a picture. So I've pasted a bunch of pictures into Google Slides and I'm just going to share that. It's very rough. This is not going to be seriously edited or professional yet. Maybe we'll get there one day. But I think it's useful to show some pictures, so definitely watch it on YouTube if you can.

SPEAKER_01

Yes. And so I'm excited. I think there's a lot to learn here. I'll set the stage for listeners. Google had their keynote yesterday at Google Cloud Next 2026, and in it, they announced the next version of the TPU. And of course, the most exciting thing that most listeners have probably heard is that there's not just one TPU, there are actually two: TPU version 8. There's 8t, which is the training chip, and 8i, which is the inference chip. This is interesting because historically, for quite some time, there's just been one TPU chip. V1 was an inference-only chip, then v2 was training and serving, and v3 and v4 were training and serving. With v5, they split out an efficiency chip versus a performance chip. V6 was back to one chip. But here we are. Now we've got a specific training chip, or training system, as we'll get into, and a specific inference chip.

SPEAKER_00

Maybe I should point out the elephant in the room right away. The TPU 8i has 384 MB of SRAM, which is about three times as much as the 8t, the training chip. And that makes sense. Why not put in more SRAM? You know, Groq's LPUs, which Nvidia is now integrating into their platform, are basically SRAM. So why not do that right now in the TPU 8i? Put in as much SRAM as you can so that you get low-latency inference straight from SRAM: high throughput, faster tokens. So that's one obvious thing to do, put in more SRAM. It's a good decision, and it's an obvious one now after seeing how hard everyone is going for fast decoding.

SPEAKER_01

Yes, yes. And this aligns with what Reiner Pope from MatX told me: they do weights in SRAM and KV cache in HBM. These are architectural decisions to try to get both really fast decode and all the context you need. So we see Google going this direction with three times as much SRAM, but then also 288 gigs of HBM on the inference chip. And I will point out one thing that I thought was interesting, which is that the training chip has less HBM: 216 gigs. By the way, listeners and people watching, Google has amazing technical blogs for these chips, and they talk a lot about why they made these trade-offs. But I think it was interesting because right away in my head, for training, if you're going to have less HBM, you could frame that as: why overpay for HBM you don't need? So if someone's just selling one chip, like a GPU, that's one size fits all, use it for training, use it for inference, you could say, well, look what Google's doing. Maybe I'm overpaying for expensive HBM on a training chip where I don't actually need it.
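A rough way to see why this memory placement matters for decode (a back-of-the-envelope sketch with made-up numbers, not Google's specs): decode is largely memory-bandwidth bound, so the floor on per-token time is roughly the bytes you have to stream divided by the bandwidth of wherever those bytes live.

```python
# Back-of-the-envelope sketch only: weight size and bandwidths below are
# hypothetical, not TPU v8 specs. Decode is memory-bandwidth bound, so
# step time >= bytes streamed / memory bandwidth.

def min_decode_step_ms(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    return weight_bytes / bandwidth_bytes_per_s * 1e3

active_weights = 10e9   # 10 GB of weights touched per token (hypothetical)
hbm_bw = 4e12           # ~4 TB/s of HBM bandwidth (hypothetical)
sram_bw = 40e12         # on-chip SRAM, order of magnitude faster (hypothetical)

print(min_decode_step_ms(active_weights, hbm_bw))   # 2.5 ms per token from HBM
print(min_decode_step_ms(active_weights, sram_bw))  # 0.25 ms if weights sit in SRAM
```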

SPEAKER_00

Yeah, I was wondering about why the training chip has lower HBM. The inference side is obviously so memory hungry now that you want to put in as much fast memory, across as many tiers, as you possibly can, because decoding and inference get so much faster with better memory throughput. So the inference chip is simply maxed out with the best memory you can put down in the current environment. Why is HBM lower for training? I guess you can always put more chips together if you don't have enough memory: just add more accelerators and the cluster's total memory gets bigger. So I guess that's the solution there. But for inference, it only makes sense to max out every tier. Remember Nvidia's memory-tier picture, where HBM is tier one and DRAM is tier two? They don't actually mention SRAM, but it should have been there as tier zero. There's an effort at maximizing the fastest memory. Makes sense, yeah.

SPEAKER_01

Okay. What else was interesting? Both chips use the Arm Axion, Google's Arm-based custom CPU, as the head node CPU. And there was a quote here, I'll read it. The announcement blog said: TPU v8 introduces two distinct systems, TPU 8t and 8i. These new systems are key components of Google Cloud's AI Hypercomputer. So yet another interesting marketing name. It hints that it's a whole data center these days; you're not just thinking at the chip level. An integrated supercomputing architecture that combines hardware, software, and networking to power the full AI lifecycle. While both systems share the core DNA of Google's AI stack and support the full AI lifecycle, each is built to address distinct bottlenecks and optimize efficiency for critical stages of development. Additionally, by integrating Arm-based Axion CPU head nodes across our 8th gen TPU systems, we've removed the host bottleneck caused by data preparation latency. Axion provides the compute headroom to handle complex data pre-processing and orchestration so that TPUs stay fed and don't stall. So they're really framing these as the CPUs to feed the TPUs.

SPEAKER_00

Yeah, that's what's important. Never let the TPU sit idle.

SPEAKER_01

Yes, yes.

SPEAKER_00

That's what it is. It also shows that Arm is a very viable alternative to x86 CPUs, not to be ruled out at all in terms of ISA or architecture. So yeah, I think they're all fair game now. And like we mentioned in an earlier episode, the best CPU is the one you can lay your hands on. Whoever can deliver the most of them, they'll make it work. Arm or x86.

SPEAKER_01

That's the name of the game right now. Okay, now let's talk networking. Where should we start? I noticed they've got this nice little table showing that 8t uses a 3D torus network topology, but 8i uses the Boardfly topology. And this is interesting because not only are we starting to make different SKUs for different workloads, we're starting to have different network topologies for different workloads. So that's super interesting. Take us there.

SPEAKER_00

Yeah. So the chips we just spoke about are cool. I mean, they're good chips. They have HBM, they have SRAM. This is all standard stuff in some sense. The real innovation here, and it is a tectonic shift, is in how networking is implemented in the future of Google data centers. There is a big change, and this is a once-in-a-decade kind of change. I'll explain what that is, but for that I need to explain what was there before, right? So the fundamental realization for scaling AI today is that the constraint is no longer compute; it is instead the networking that underlies all of compute. That is the bottleneck that needs to be solved, and that is what Google's innovations now address. Google has re-imagined the data center, in a sense. And as we talk about it, we'll break down why. They have different chips, as we just mentioned, for training and inference. They have different networking architectures for training and inference as well, and we'll talk about why. And what is most interesting is that a lot of this networking is moving toward optics. You will hear me say OCS a lot, which stands for optical circuit switching, and I want to explain what that is right up front, because I will say it a lot while talking about networking, and people should understand what it actually is. Optical circuit switching is a way to redirect light from one port in a switch to another port in a switch, so that you can connect one TPU here to another TPU there, and you do it entirely in the optical domain. The idea is very simple. It's just like holding up a mirror and reflecting the sunlight coming in from your window: you can point it at different spots on your wall. That's exactly the concept behind OCS. Why convert optics to electronics and do silicon packet switching like all the Tomahawk switches do? You're already in the optical domain, so stay in light. That's OCS, optical circuit switching: you can simply connect one port to another by changing how you shine light into different ports. Okay. And that's becoming the substrate on which Google's networking is built today, and I'll explain in which parts of the network too. So the first thing we should talk about is what Google calls their Virgo networking solution. Their previous network was called Jupiter, and this is from 2015. At that time, it was the industry's first petabit-scale network. Nobody had ever seen anything at that scale before. It was pretty fantastic, and it was built primarily for the internet era, because that's what was driving everything. And data centers relied on a Clos network, which basically means you have these racks and different levels of networking switches. This is where I have to experiment with a picture, because I actually have a picture of what a Clos network looks like. So I hope this works.
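For anyone who thinks better in code than in mirrors, here is a toy model of the idea: an optical circuit switch is essentially a reconfigurable one-to-one mapping from input ports to output ports, with no packet inspection at all. This is purely illustrative and not any vendor's actual control interface.

```python
# Toy model of an optical circuit switch: a reconfigurable one-to-one mapping
# from input ports to output ports. No packet parsing, no buffering; the
# "switching" is just pointing light (e.g. MEMS mirrors) at a different port.

class ToyOCS:
    def __init__(self, num_ports: int):
        self.num_ports = num_ports
        self.circuit = {}              # input port -> output port

    def connect(self, in_port: int, out_port: int) -> None:
        """Reconfigure a mirror so in_port's light lands on out_port."""
        self.circuit[in_port] = out_port

    def forward(self, in_port: int) -> int:
        """Light entering in_port exits here, staying optical end to end."""
        return self.circuit[in_port]

ocs = ToyOCS(num_ports=300)            # high radix, e.g. 300x300 ports
ocs.connect(in_port=12, out_port=271)  # the TPU on port 12 now reaches port 271
print(ocs.forward(12))
```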

SPEAKER_01

While he's pulling that up, for people listening, Clos is spelled C-L-O-S.

SPEAKER_00

Yes, it is.

SPEAKER_01

You might not know that when you're listening, if you're trying to Google it yourself.

SPEAKER_00

Yes, so C-L-O-S, a Clos network, looks something like this. Basically there are these racks, and then you have the first layer, called a leaf switch, that the racks hook up to. The next layer is what's called a spine switch. Then you have another layer on top called a super spine switch. So what ends up happening is that if you have to go from one rack to another rack, or one GPU to another GPU between different pods, and you can see this picture if you're watching on YouTube, you'll see there are two pods, you have to go through a lot of network hops. Think about it: you go from the GPU in the rack to the leaf switch. From the leaf switch, you go one level higher to the spine switch. From the spine switch, you go to the super spine switch, get switched to the right super spine switch, and then you come back all the way down the hierarchy. That is too many hops. And when you have to go from one GPU at one end of the data center to the other end of the data center, it is too much. So that's the whole thing here, and that's why it doesn't work for the AI era. And the reason...
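To put a number on "too many hops", here's a minimal count of the worst-case path just described, leaf up to super spine and back down. It's only the path as described in the conversation, not a model of any specific data center.

```python
# Count the switch traversals on the worst-case Clos path described above:
# up through leaf, spine, and super spine, then back down to the far rack.
path = ["leaf (source rack)", "spine", "super spine", "spine", "leaf (destination rack)"]
print(len(path))  # 5 switch traversals before the traffic even reaches the far TPU
```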

SPEAKER_01

This was definitely designed for, and made a lot of sense in, the cloud server and web app era. You're hitting a database over here, you're hitting an API over there. It's not that you need every leaf talking to all the other leaves and having to hop way up and way back down to do it.

SPEAKER_00

Yeah, it happens sometimes, and the latency is non-deterministic. It happens, latency is what it is, and the problem is training networks don't like that. AI does not like random latencies. Another reason this matters, why the Clos network has so many hops and needs so many layers of switches, is that each switch does not have enough ports. In networking terminology, that's called the switch radix. Saying a switch doesn't have enough ports is just another way of saying it's a low-radix switch. And when you have low-radix switches, you have to multiply the number of ports by adding layers on top. That's why you need so many layers. Now the problem has changed; we have really high-radix switches, and we'll come back to why. All of this communication in Google's Jupiter network was managed by their software layer called Orion. As the internet era grew, around 2022, Google introduced optical circuit switching into the data center, because it actually has a large switch radix. If you look at Lumentum's OCS switch, and even Coherent's, you'll see something like 300-by-300 ports. That's a lot of switch ports, a high-radix switch. So parts of the network also moved to optics; it's not doing silicon switching anymore. They started doing optical switching, and they also started doing wavelength-division multiplexing, which means they send multiple wavelengths. All of that increased the bandwidth. Around 2022 it was at six petabits per second. A petabit is a thousand terabits, so we're talking about six thousand terabits per second. That was 2022. Then they went to faster networking speeds, 400 gig networking, which made it a staggering 13.1 petabits per second. That's 13,100 terabits per second. That's a lot. Remember this number, 13.1 petabits, because later when I tell you about Virgo, I'll tell you what its number is, and you'll see how amazing that is. So from 2015 to 2023, it grew 13 times. That's pretty cool, and they've been growing it pretty quickly. This was great. It worked great for the internet, YouTube videos, web search. But when AI showed up, when you're training a trillion-parameter model with highly synchronous traffic, you see, the internet does not work that way. The internet is, I don't know, Austin is checking it in his time zone, I'm using it in my time zone. Maybe there are spikes because everybody is watching a sporting event at some particular time, but those tend to be regional, and it's okay. It's an asynchronous thing, really, for the most part. AI does not like that. When all of it hits at the same time, the latencies go crazy, and it's always limited by the slowest deer in the pack. The highest latency is what causes problems, and that is what's called tail latency. The highest latency is the limiting factor. So that is the Jupiter network. That was the status of the world until Google changed everything.
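Just to keep the units straight, here's the arithmetic behind the figures quoted above (only the numbers mentioned in the conversation).

```python
# Unit check on the Jupiter bandwidth figures quoted above.
TERABITS_PER_PETABIT = 1_000

jupiter_2015_pbps = 1.0    # "industry's first petabit-scale network"
jupiter_2022_pbps = 6.0    # after OCS and wavelength-division multiplexing
jupiter_2023_pbps = 13.1   # after the move to 400G links

print(jupiter_2023_pbps * TERABITS_PER_PETABIT)   # 13,100 terabits per second
print(jupiter_2023_pbps / jupiter_2015_pbps)      # ~13x growth from 2015 to 2023
```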

SPEAKER_01

Yeah. So they had a network that was made for the previous era: cloud, unpredictable traffic. And in that era, it's not all one big job. With AI training, you're running one job across the network; it's coordination, all the machines contributing to one big job. The Jupiter network was not designed for that. It was for highly distributed work, tons of jobs running at the same time. Even if there was some pattern to the traffic, it was just highly distributed. And then, to your point, with training, not only is it all coordinated, it's coordinated in such a way that there's this tail latency concept where everyone has to do their work and update everyone else. If there's a straggler, the whole bus has to wait for that one straggler.

SPEAKER_00

Yes, that's exactly the word they use in their blog post, too. So stragglers are bad. Yeah.

SPEAKER_01

Yes.

SPEAKER_00

Yeah. All right. So let's now talk about the new network, called Virgo. They call it the Virgo mega-scale network. So fancy. It actually is a mega-scale network, and you'll see why. Virgo is designed for AI. They looked at the network and said, no, this internet-era stuff doesn't work anymore; we need to redesign it to actually make it work. And the biggest change is that they re-imagined each part of the network for what it is. This is why you need a picture again, from their blog, which I'm going to put up if you're watching on YouTube. It really helps, but I'll try to explain it without the visuals as well. Essentially, there are three layers of their networking stack. The first is the scale-up network; we'll talk about scale-up separately, but you've got the scale-up within a pod, however the pod is hooked up. The Virgo fabric itself, where the interconnect happens, is the scale-out, also called the back-end network. This is the east-west connection where you hook up all the racks in the data center.

SPEAKER_01

And there's a tiny bit of nuance there for people listening. With scale-up, you're trying to make all the TPUs act as one and share their memory, so it's like memory coherence: they're all talking at low enough latency that it seems like all their HBM is shared. With scale-out, from a training perspective, we're still trying to make one massive training computer, but everything on the scale-out network is not sharing memory. They're talking as a big coordinated system, just not as tightly coupled. And I know that when it comes to GPUs, everyone thinks, oh, scale-up is within the rack and scale-out is many racks. That's mostly the case, although you can have scale-up between side-by-side racks. But TPUs are a bit of a different beast, because even their scale-up domain has lots and lots of TPUs, more than just a rack. Did I get that right?

SPEAKER_00

Yes, TPUs go to thousands, actually. In terms of memory coherency, there are some improvements I'll mention next, about how the latency between memories can actually be reduced; that's an actual innovation here. That brings me to the last part of this networking picture, which is the front-end network. The front-end network is simply compute and storage, or connecting to the internet. This is not fancy networking, so you can use the Jupiter network for it: the leaf-spine topology, the Clos networking thing we spoke about. That's fine. You don't need to reinvent every part of the network stack, just reinvent what's required. So that's how this basically looks. Now the Virgo network, this is what we should talk about. They collapse it entirely to a two-layer network because they have these high-radix switches, which are all OCS, by the way, and they have, I think, 300-by-300 ports. If they're using Lumentum's OCS, the future of OCS is going to scale to 1000-by-1000 ports, so you can imagine flattening the network even further. I don't know if Coherent is an OCS provider to Google as well; that I don't really know. But all this technology exists from these companies and is viable for use in these data centers. So OCS used to be optional before, but now it has become an integral part of Google's data center networking approach. And the way you can connect with OCS, because it has so many ports and you can connect it in just two layers, means you don't have all those hops. You can get from one TPU to another TPU by crossing two network layers. That makes it really, really fast. And not only that, because of these high-radix switches, you can connect, whatever the number I have here is, 134,000 TPUs to all act as one in the data center. That's insane. They call it campus as a computer. It's crazy.
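Using the same kind of counting as the Clos sketch earlier, here's what collapsing the fabric to two switching layers does to the worst-case path. This is an illustrative count under a simple "climb up, come back down" model, not the exact Virgo wiring.

```python
# Illustrative only: worst-case switch traversals versus fabric depth,
# assuming traffic climbs up through every layer and back down through
# all but the top one. Not the exact Virgo wiring.

def worst_case_switch_hops(layers: int) -> int:
    return 2 * layers - 1

print(worst_case_switch_hops(3))  # three-tier leaf/spine/super-spine Clos: 5
print(worst_case_switch_hops(2))  # fabric flattened to two OCS layers: 3
```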

SPEAKER_01

Yeah, that sounds right.

SPEAKER_00

Yeah. The whole campus is a computer.

SPEAKER_01

That's right.

SPEAKER_00

And do you want to know what the aggregate bandwidth of this whole network is now?

SPEAKER_01

What was it before? Like 13 petabits or something? I don't know, you told us to remember, and I don't remember.

SPEAKER_00

Yes, 13.1 petabits per second. Okay, and now it's 47 petabits per second.

SPEAKER_01

Wow. Wow.

SPEAKER_00

That's about 3.6x, call it nearly 4x faster. One reason is it's optics entirely; there's no silicon-based switching in the back-end network. And they've re-architected a bunch of things to make this happen. So it's really amazing. And you know what happens when you put 134,000 chips together in a campus-as-a-computer: stuff breaks all the time. So they have an enormous amount of telemetry built in so they can continuously monitor these things and keep their goodput high. Goodput is the useful stuff, when it's actually working, not just throughput. When you say throughput, it's like, yeah, the chip is capable of so many flops, but does that actually generate tokens or do what it's supposed to? The goodput terminology nowadays means the working throughput.
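A rough way to write down the throughput-versus-goodput distinction (an illustrative definition, not a formal metric from Google's blog): peak numbers only count when the job is healthy and the work isn't lost to restarts or stragglers.

```python
# Illustrative only: goodput as the share of peak throughput that turns into
# useful, kept work. Not a formal metric from Google's blog.

def goodput(peak_throughput: float, fraction_time_healthy: float,
            fraction_work_kept: float) -> float:
    return peak_throughput * fraction_time_healthy * fraction_work_kept

# A fabric quoted at 100 units that is healthy 95% of the time and loses 5%
# of its work to restarts and stragglers delivers about 90 units of goodput.
print(goodput(100.0, 0.95, 0.95))  # 90.25
```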

SPEAKER_01

Yes, yes. That's good. That's actually good. Yes. It's not just like the theoretical max, it's what actually happens in practice.

SPEAKER_00

Yes. Now I should quickly hit on the memory thing you mentioned. They have this thing called TPU Direct now, which is basically remote direct memory access, or RDMA. RDMA as a technology has been around for a while, and in Nvidia land it has also been called GPUDirect. So TPU Direct is an evolution of that concept, I suppose. The idea is fairly simple, and honestly this is not a brand-new innovation; it's been around for a while. So it's not new, but it's been implemented in TPUs now. What it means is that before, if TPU 1 had to access the memory of TPU 2, it had to go through the host CPUs. The host CPU would interact with DRAM, interact with the network interface card and the networking stack, and they would all have conversations: so, you want to access memory? Cool. Which memory do you want to access? The CPU is involved in all of this. Then it goes and tells the destination TPU's host CPU, hey, this guy wants to access your memory, will you allow it? And the destination CPU says, yeah, fine, let's make it happen. And then finally you get to the HBM of the destination TPU. So the memory coherency, like you're saying, has so many handshakes. It's like so many middlemen managers ruining the organization. So they said: take it out. Get the middleman out of this and remove the host CPU from the picture. That is what remote direct memory access is. TPU 1 talks to TPU 2 directly through the network interface. No CPUs involved.
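One way to see the "middlemen" point is just to list the stops on each path the way they're described above. This is purely conceptual; the stage names are paraphrased from the conversation and this is not the actual TPU Direct control flow.

```python
# Conceptual only: the stops on each path as described above, not the actual
# TPU Direct implementation. Stage names are paraphrased from the conversation.
host_mediated_path = [
    "source TPU", "source host CPU (and its DRAM)", "source NIC",
    "destination NIC", "destination host CPU", "destination TPU HBM",
]
rdma_path = [
    "source TPU", "source NIC", "destination NIC", "destination TPU HBM",
]
print(len(host_mediated_path) - 2, "intermediate stops with host CPUs in the loop")
print(len(rdma_path) - 2, "intermediate stops once the host CPUs are removed")
```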

SPEAKER_01

And this is how GPUs do it too, right?

SPEAKER_00

GPUDirect has also been there, yeah; Nvidia does this too. So this significantly speeds things up: memory access gets faster and latency drops as well. So now we should move on. That was all about scale-out, including the TPU Direct thing. At some point we should get into scale-up, because scale-up networks in this TPU v8 generation come in two flavors. There's the 3D torus approach, which everybody is well familiar with, which is the picture on the screen if you're watching on YouTube, and then now they have something called the Boardfly. Have you heard of any of these things?

SPEAKER_01

No, but I remember hearing about the torus topology from SemiAnalysis. I don't know if this picture is from SemiAnalysis... oh, it is from SemiAnalysis. Okay, perfect. They had some nice diagrams showing how it works. Is this historically how TPUs have done scale-up, through the torus?

SPEAKER_00

I'm gonna very briefly try to explain a complicated 3D picture with words. Yeah, okay. Follow this carefully if you're listening and not watching.

SPEAKER_01

Yes, picture a Rubik's Cube in your brain.

SPEAKER_00

Thank you. I was gonna say Rubik's Cube. I was so gonna say Rubik's Cube.

SPEAKER_01

I'm sorry, I interrupted.

SPEAKER_00

No, no, no, that was perfect. So this is a very large Rubik's Cube, but think of it like the regular 3x3x3 one; that's okay. The Rubik's Cube has all its inner faces, right? You don't really see the inner faces of a Rubik's Cube. All of those are connected to each other with cables. Let's say each movable little cube of the Rubik's Cube we're thinking about is a TPU. The TPUs are connected to each other in a cube like the Rubik's Cube, and the inner faces are connected with copper cables, so every inner face is connected to its neighbors with copper. But now you also want to connect the outside faces of the same row, and the outside faces of the same column, to each other. That's what makes it a torus. And that is done with optics, because you clearly have to go a longer distance to connect faces of the Rubik's Cube that are in the same row or column but on opposite ends, so those are connected with optics. That's how the 3D torus works. And it has a problem when used for inference. In training it's fine: all of these chips are working together all the time, and it's fine. But think carefully about the farthest distance any TPU-to-TPU communication has to travel in this Rubik's Cube. Start from one corner. If you think the farthest point is the opposite corner of the Rubik's Cube, you would be wrong, because you can always use the outside optical cable to wrap around to the other end. So that's not the hardest spot to get to. In a 3D torus, the hardest position to get to is the middle. The middle of the Rubik's Cube requires the most hops to reach. So if you think about how you get from a corner of the Rubik's Cube to the middle, you're going to hop halfway along one dimension, halfway along another dimension, and halfway along the third dimension. That gets you to the middle. So when you have a four-by-four-by-eight Rubik's Cube, which I think is what's on the screen, a TPU v7 configuration, you're going to hop two hops in one direction, two hops in another direction, and four hops in the third direction. Two plus two plus four: eight hops. That's how you calculate it for any 3D torus topology. It doesn't have to be four by four by eight; it could be eight by eight by sixteen, which means you have four plus four plus eight: sixteen hops. And if you look at the Google blog, that's the example they use. They use an eight-by-eight-by-sixteen topology, which means you need a maximum of 16 hops to get from one point of the Rubik's Cube to the farthest point, which is the middle.
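The hop arithmetic above generalizes cleanly: in a torus, the wrap-around links mean you never travel more than half of each dimension, so the worst case is the sum of the half-dimensions. A quick sketch of that calculation:

```python
# Worst-case hop count in a torus: the wrap-around links mean you never go
# more than halfway along any dimension, so sum the half-dimensions.
def torus_max_hops(dims):
    return sum(d // 2 for d in dims)

print(torus_max_hops((4, 4, 8)))    # 2 + 2 + 4 = 8 hops  (the 4x4x8 example above)
print(torus_max_hops((8, 8, 16)))   # 4 + 4 + 8 = 16 hops (the 8x8x16 example from Google's blog)
```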

SPEAKER_01

Let me interject really quick and reflect that back, because we're going really deep and I want to make sure everyone's tracking. So we were talking about scale-out, connecting pods to each other. Now we're talking about connecting individual TPUs to each other. And we don't want to use a Clos network here, because in training, with dense models, you've got neighbors talking to neighbors. Just think of the different layers of the model: you want TPU one to talk to TPU two, TPU two to TPU three, TPU three to TPU four. They talk to their neighbors a lot, so we don't want a world where they have to hop way up and back down just to talk to a neighbor. So the traditional way to lay things out for training is: how can we densely pack as many TPUs as possible so each one has as many nearby neighbors as possible that we can connect easily with copper? And Vik has this picture of a Rubik's Cube showing that connecting them in three dimensions is actually pretty optimal. If I'm in the bottom left corner and that's my little block, my TPU, I've got a TPU next to me along the x-axis, a TPU behind me along the y-axis, and a TPU right above me along the z-axis. So I've got all these neighbors really close to me. And Vik was also pointing out that if I'm in that corner and I want to reach anyone else in my x-axis row, it's one neighbor, or two neighbors, or three neighbors away, or you can add an optical link to connect me from position zero all the way to the other end, call it three. So it's pretty easy to talk to my neighbor's neighbor, but the hardest one to reach is the one right in the middle, because I have to traverse through neighbors in every dimension, down the x and the y and the z, to get there. I'll hand it back to you.

SPEAKER_00

Yeah, thanks. That's a good summary, and it's important: that is what a 3D torus is. In the 8x8x16 case, I showed how you need half of each dimension in hops. Anybody listening might want to pause and think about that, how many hops that is. This networking is serious business. But anyway, the 16 hops is what was mentioned in the Google blog, and 16 hops is a lot. It's a good architecture for training, because all of these TPUs are talking to each other all the time. For inference, it is not a good architecture. Why is that? Because not all TPUs are activated all the time. When you have a mixture-of-experts model, only some of them are going to be active, depending on which experts get activated. So the chips don't all work together all the time, and the traffic ends up traversing a lot more hops. And you want that hop latency minimized, because otherwise it adds to inference time and hurts performance. You don't want that. So the question is: how do you re-architect this for the world of inference with mixture-of-experts models?

unknown

Yes.

SPEAKER_00

And this is why we come to the concept of the Boardfly.

SPEAKER_01

Yes, yes. So let me reflect that back really quick for people. For training, we want all this neighbor-to-neighbor communication. And what Vik was saying is that with a mixture of experts, when you're doing inference, there isn't necessarily all this neighbor-to-neighbor communication, because it might just be, hey, for this token I want expert 21 over there, and for this token I want expert four over there. So it's not the same communication pattern every time, neighbor to neighbor, layer to layer. Now it's this routing, which actually reminds me of what we were talking about earlier, the sort of non-deterministic era we were living in. All this is to say the workload communication pattern is different for training than it is for inference. So that raises the point: why not design the interconnect for the workload, which in this case is mixture of experts? Which, by the way, if you go listen to the Reiner Pope talk, he talked about how they're thinking about this as well, designing the interconnect specifically for mixture of experts. But okay, carry on, Vik.

SPEAKER_00

Awesome. Yeah, that's good context, because we need it. We need to keep taking a break to say back to ourselves what the networking is doing; that's the only way you learn networking. You have to say it out loud. So pause the video and say what we've covered so far out loud. Okay. So now we need to get to the Boardfly. It is not a dramatically new invention, but a modification of an existing idea. The Boardfly approach is fundamentally about reducing the number of hops. Remember how I told you the 8-by-8-by-16 torus has 16 hops? Now we want to get that down, and the Boardfly topology lets you do that. Here's how it works. You have a board, and on that board you have four TPUs. You'll see this picture whenever you search for TPU v8: a board with four TPUs on it. Those are PCB-connected, so it's copper. There are no optics here, it's just a board, a PCB with copper connections. Now you take eight of these boards, put them in a rack, and hook them all up with active electrical cables. You see how scale-up still uses active electrical cables? So it's not like everything is optics and copper is dead, no. AEC is still used to connect the eight boards together.

unknown

Okay?

SPEAKER_00

And this is now called a group. Okay. Now, how they are connected is what's called the Dragonfly approach, and this has been around since the supercomputing days; it's been around for decades. This is not a fantastically revolutionary idea today, though it was at the time.

SPEAKER_01

Yes, let me give you a tiny bit of trivia. I found this when I was Googling, and I've been waiting to tell you; I haven't told you this before. So I was looking up Dragonfly, and there was a 2008 computer architecture paper introducing the Dragonfly network. The authors on the paper: someone named John Kim from Northwestern; William Dally from Stanford, whom you might also know as Bill Dally, who is now at Nvidia and heads Nvidia research, which is, by the way, probably one of the coolest jobs in the world; then another author is Steve Scott, from Cray, which of course is Cray supercomputing, a lot of history there. And for people who don't know, that's in Chippewa Falls, Wisconsin, a town of 14,000 people. So just kind of crazy, kind of cray cray. And the last name on this 2008 paper is Dennis Abts, A-B-T-S; I'm not sure how you pronounce his last name. He was at Google at the time. Then he went to Groq and was an early Groq engineer from around 2017, presumably doing all of their network design and such. And then he went to Nvidia, and he's been at Nvidia ever since. So when I saw the paper, I was like, whoa, these are some of the who's who of networking, and it was crazy to see that this traced back to them from 2008. But with that, I'll give it back to you to keep going. I just thought it was cool.

SPEAKER_00

That was great. That's a great piece of trivia. And I want to add to it. Dennis Abts, when he was at Groq, before he went to Nvidia and all that: if you look at how Groq architects their rack-scale solution, it's also Dragonfly. They don't hook up boards; they hook up individual LPUs in a Dragonfly configuration. That is what Groq does too, by the way.

SPEAKER_01

I see, gotcha. Yeah, because it still reduces the number of hops. Even if they're not connected on boards, it's still going from 16 hops down to something less.

SPEAKER_00

Yes, exactly. And this is called Boardfly because you're not hooking up individual chips in a Dragonfly configuration; you're hooking up boards of four TPUs in a Dragonfly configuration. It's a portmanteau: Boardfly. So that level is still AEC. The next level is that you connect all of these groups together. Remember, we had four TPUs to a board and eight boards to a rack, which you can also call a group, and then you have 36 groups connected to each other in a pod. If you multiply 36 groups times eight boards times four TPUs, you get 1,152 chips. And those 36 groups are all connected together with OCS. Again, you see how OCS is the underlying substrate on which all of Google's networking is built. It tells you how important this technology is now.
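The pod arithmetic from that description, written out (just the counts quoted above):

```python
# Pod size arithmetic for the Boardfly scale-up domain, using the counts above.
tpus_per_board = 4     # four TPUs sharing a PCB (copper traces)
boards_per_group = 8   # eight boards per rack/group, linked with AECs
groups_per_pod = 36    # 36 groups linked to each other over OCS

print(tpus_per_board * boards_per_group * groups_per_pod)  # 1152 TPUs in a pod
```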

SPEAKER_01

Yes, yes. So all of the scale-out in Virgo, was all of that OCS?

SPEAKER_00

Virgo, yes, that is all OCS, because it's a high-radix switch and it connects all of this stuff up in two layers.

SPEAKER_01

Okay, so all the scale out was OCS, and then scale up on the board, it's PCB traces. Nearby boards in a group are AECs, but then group to group is connected via OCS.

SPEAKER_00

OCS, all optics, yes. So optics is the primary driver in all of this networking. And now, what is the benefit? Why do all this stuff? I promise class will end soon, but I think this is fascinating. Let me show you a couple more pictures; please do watch this on YouTube. The Boardfly picture shows you exactly how the hops get minimized. What you can do is go from board to board within a rack, so you have a couple of hops within the rack, then you make one big hop via OCS to a different group, then a couple more hops there, and ultimately you reach your destination in probably six or seven hops. So your hop count has come down from 16 all the way to seven because of this approach of architecting a Boardfly networking scheme for scale-up. That is a big deal. The latency has dropped by over 50% because of this clever way of hooking stuff up. That's why the networking of a data center is vital to the performance it provides. It's critical. So that's about all the spiel I have. It's pretty fancy.
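A rough decomposition of where the "roughly seven hops" comes from, following the path just described: a few board-to-board hops inside the source group, one optical hop between groups, and a few hops inside the destination group. The per-segment counts here are assumptions for illustration, not Google's published routing.

```python
# Illustrative decomposition of a worst-case Boardfly path, following the
# description above. The per-segment hop counts are assumptions, not
# Google's published routing.
hops_in_source_group = 3   # board-to-board over AEC within the source rack/group
hops_between_groups = 1    # one OCS circuit to the destination group
hops_in_dest_group = 3     # board-to-board hops to reach the final TPU

print(hops_in_source_group + hops_between_groups + hops_in_dest_group)  # ~7 hops
# versus 16 hops worst case in the 8x8x16 3D torus example
```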

SPEAKER_01

Yeah.

SPEAKER_00

This episode is very technical, I know, but to summarize, there are two major inventions. One is in the scale-out network, which is pretty much all OCS now. The other is the re-architecting of the scale-up network: going from the 3D torus, which TPUs are typically known for, to an inference-specific design using the Boardfly topology. So that's my long spiel.

SPEAKER_01

Very good. Thank you, Professor Vik. And as a reminder, the reason we wanted to spend the whole time on what Google introduced is that we're seeing these big shifts: from one chip that does everything to two chips, training and inference. And it's not just two chips, it's two scale-up networks, the 3D torus for training, for dense neighbor-to-neighbor communication, and the Boardfly for inference scale-up for mixture of experts. Which is again sort of proving the point a lot of people have been making: it's no longer one size fits all, not for silicon, not for networking. The complete data centers are being architected around the workload. So whether you're a startup in this space or you're tracking optics, this is a big shift that we see from Google. And then, of course, there are lots of interesting questions, like: will we start to see this level of network innovation from others, like AWS with Trainium? Are they going to stick to the way they've architected things for their cloud environment, or are they going to start designing data centers specifically for MoE inference or whatever, even if it means thinking differently than they did in the web server and API era?

SPEAKER_00

Yeah, yeah. It's interesting, right? Now training and inference have distinctly fallen into two different camps. They have different chips, they have different networking solutions. I don't know what's next. Different power solutions too? Different locations based on the direction of the wind? It's extreme co-design, man. It matters. The wind matters.

SPEAKER_01

No, yeah, it definitely is extreme co-design. And then, of course, there are a million other questions I'm sure we'll circle back to, like: is it just these two architectures forever, or are there going to be other workloads that demand something slightly different? For inference, is this sort of one size fits all for all inference workloads, whether they're world models or just textual inference?

SPEAKER_00

Yeah. Maybe we'll build out agentic inference data centers in the future. Somebody will figure out that agentic workloads need a different infrastructure.

SPEAKER_01

No, I mean, okay, I'm glad you mentioned that, because I still want to know more about where CPUs fit in this network topology. Obviously they talked about the Axion CPUs to feed the TPUs, but what about all the other agents that are just running on CPUs and virtual machines somewhere? I know a lot of them are long-running and latency doesn't matter. But what about the ones where latency does matter? Do those come into the networking topology somehow? I don't know. Come tell us, teach me.

SPEAKER_00

There's so much to learn here. This is just scratching the surface. We haven't even talked about other stuff I saw in there, like the CAE, the Collectives Acceleration Engine.

SPEAKER_01

Yes, yes.

SPEAKER_00

To be frank, I don't even know what that is. I haven't gotten around to reading about it.

SPEAKER_01

Yeah, yeah. We'll have to follow up on that. I did a little bit of reading, and it sounded like it actually is related to networking, in that it offloads some communication work to a specific accelerator. I have written down here: the CAE, Collectives Acceleration Engine. Each TPU 8i has two tensor cores and one CAE on a chiplet die, and the CAE offloads all-reduce, all-gather, all-to-all type collectives. It's a workload-specific accelerator, kind of like Nvidia's SHARP. It's also kind of similar to what DPUs do: how can you let the accelerators just do as much matrix multiplication as possible, keep the CPUs out of the way, save those for some of the agentic stuff maybe, or for feeding the TPUs, and then put as much of that networking work as you can onto network-optimized silicon. So it's turtles all the way down, optimizing everything.
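For listeners who haven't run into "collectives": an all-reduce takes each chip's partial result, say a gradient shard, and leaves every chip holding the combined result. Here's a tiny simulation of just the semantics; a CAE-style engine would do this pattern in hardware, and the chip names and values below are made up.

```python
# What an all-reduce collective means, in miniature: every chip starts with a
# partial result and ends with the combined one. Chip names and values are
# made up; a CAE-style engine would offload this pattern in hardware.
partials = {
    "tpu0": [1.0, 2.0],
    "tpu1": [0.5, 0.5],
    "tpu2": [2.0, 1.0],
}

reduced = [sum(values) for values in zip(*partials.values())]  # element-wise sum
all_reduced = {chip: reduced for chip in partials}             # every chip gets the total

print(all_reduced["tpu1"])  # [3.5, 3.5], the same combined result on every chip
```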

SPEAKER_00

Yeah, it's true. All right. That is too much information for anybody to process. We're going to have to split this up into 16 clips, I think. It's like a whole course on OCS and Google networking.

SPEAKER_01

Yes. We hope you liked Professor Vik's lecture, in near real time after Google's announcement yesterday. Thanks for listening. That's it. If you're enjoying Semi Doped, the first thing you should do is tell your friends. We are so happy when we see people sharing our videos, so thank you for the word-of-mouth recommendations. Subscribe to our newsletters if you haven't yet; I'm sure Vik and I will write about this more in depth. And yes, thanks, and we'll see you next week.