AI Proving Ground Podcast

AI's Invisible Bottleneck: Why AI Stalls at the Network, Not the GPU

World Wide Technology

For many, AI success isn’t limited by how many GPUs you can buy; it’s limited by how fast those GPUs can talk to each other without tripping over the plumbing. In this episode of the AI Proving Ground Podcast, two of WWT’s top networking minds — Justin van Schaik and Eric Fairfield — lay out the real choke points slowing AI projects to a crawl and how powerful, modernized network architectures are quietly rewriting the rulebook for scaling AI.

Support for this episode is provided by: Nokia

Learn more about this week's guest:

Justin van Schaik is a Technical Solutions Architect at World Wide Technology, specializing in High Performance Networking, AI and Open Networking. A seasoned technologist, he helps organizations design and deploy advanced infrastructure to support next-gen workloads at scale.

Justin's top pick: The Future of High Performance Networking: Ultra Ethernet Explained

Eric Fairfield is a Technical Solutions Architect at World Wide Technology with a passion for solving data center networking challenges. He specializes in Cisco ACI, VMware NSX and their integration. Outside of tech, Eric has spent over 30 years immersed in motorsports, working with F1, IndyCar, IMSA and other major racing organizations.

Eric's top pick: WWT at ONUG AI Networking Summit—Dallas 2025

The AI Proving Ground Podcast leverages the deep AI technical and business expertise from within World Wide Technology's one-of-a-kind AI Proving Ground, which provides unrivaled access to the world's leading AI technologies. This unique lab environment accelerates your ability to learn about, test, train and implement AI solutions.

Learn more about WWT's AI Proving Ground.

The AI Proving Ground is a composable lab environment that features the latest high-performance infrastructure and reference architectures from the world's leading AI companies, such as NVIDIA, Cisco, Dell, F5, AMD, Intel and others.

Developed within our Advanced Technology Center (ATC), this one-of-a-kind lab environment empowers IT teams to evaluate and test AI infrastructure, software and solutions for efficacy, scalability and flexibility — all under one roof. The AI Proving Ground provides visibility into data flows across the entire development pipeline, enabling more informed decision-making while safeguarding production environments.

Speaker 1:

For many, AI success isn't limited by how many GPUs you can buy. It's limited by how fast those GPUs can talk to each other without tripping over the plumbing: the network. It can be the unsung hero or chief villain of your AI journey. Today, two of WWT's top networking minds, Justin van Schaik and Eric Fairfield, lay out the real choke points slowing AI projects to a crawl. Things like bad multipathing, tail latency spikes and reliability gaps that turn training jobs into week-long sagas. They'll also explain why next-gen Ethernet, not exotic accelerators, is quietly rewriting the rulebook for scaling AI. By the end of this episode, you'll hear why the fastest path to AI value might not start with another GPU purchase, but with a ruthless look at the wires, switches and software stitching it all together.

Speaker 1:

This is the AI Proving Ground podcast from World Wide Technology: everything AI, all in one place. Let's get to it. Justin, Eric, thanks so much for joining us today on the AI Proving Ground podcast. How are you? Doing great. Fantastic. We often hear about GPUs and accelerators being what drives AI progress. Tell me why it's the network, or networks, and not GPUs, that's the AI choke point for organizations that might be stumbling or stalling on their AI journeys.

Speaker 3:

So I'm going to give you a sub slice of that, Dan. With the network, there's always a moving bottleneck. Something is always going to be the fastest part and something's going to be the slowest part. The challenge with AI is that you have to get the GPUs collectively speaking to each other, and that collective action requires a lot of very high-bandwidth, very low-latency interconnectivity. Historically, any kind of flaw in the transport, in the network (retransmits, dying transceivers, slow links), tends to have an inordinate impact. A 1% to 2% fail rate of transceivers can have a 60% impact on the job completion time for a generative AI training run. So that's what we've been dealing with historically. We're just stepping through, trying to fix it every step of the way.

Speaker 1:

Yeah, and real quick, I'll jump to you here, Eric, but first, Justin, why is that? You mentioned how GPUs need to talk collectively. Why do they need to talk collectively before they move forward?

Speaker 3:

I'll use the human brain analogy. One brain cell ain't that smart. You have to have a few hundred thousand, a few trillion, neurons interconnected to make a really powerful neural network that can solve these problems. One large monolithic GPU is not going to cover it. You have to have several thousand working together, and that's where the collective comes in.

Speaker 1:

Yeah, absolutely. Eric, this is going to be a dumb question by design, but why can't we just plug these GPUs, or plug AI, so to speak, into the same network that email runs on, or any other enterprise application for that matter?

Speaker 2:

Well, as Justin alluded to, these GPUs talk to each other collectively and they have to come to agreement within a very specific time frame to be performant. And you don't want these GPUs having to contend with traditional traffic, right? Email, web surfing, March Madness, all those kinds of things. You don't want them to have to contend with that traffic, otherwise it's going to affect the job completion time very significantly.

Speaker 3:

Yeah, anything impacting that communication does have a huge impact, like you just said. To use a basic analogy: these things are essentially doing very high-dimensional math, but they're breaking it up into chunks. So, for example, they all say, okay, we've multiplied X by Y, now we need to carry the seven. Who has the seven? Oh crap, find the GPU with the seven. He's running a little bit slower. You wait until the seven shows up before somebody can carry that seven. That's a very dumbed-down concept, but that's essentially what's happening. Any GPU that slows down, anything at all, will slow the whole process.
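To make that "waiting on the seven" idea concrete, here is a minimal, purely illustrative sketch (not code from the episode or WWT's labs) of why a synchronous collective is gated by its slowest participant. The GPU count and millisecond figures are invented for the example.

```python
import random

def collective_step_time(per_gpu_times_ms):
    # In a synchronous collective (e.g., an all-reduce), every GPU waits for
    # the slowest participant, so the step takes as long as the worst straggler.
    return max(per_gpu_times_ms)

random.seed(0)
# Hypothetical cluster: 1,024 GPUs that normally finish a step in ~10 ms.
healthy = [random.uniform(9.5, 10.5) for _ in range(1024)]
# The same cluster with one GPU slowed to 16 ms (a flaky link, a bad transceiver, ...).
with_straggler = healthy[:-1] + [16.0]

print(f"all healthy:   {collective_step_time(healthy):.1f} ms per step")
print(f"one straggler: {collective_step_time(with_straggler):.1f} ms per step")
```

Multiply that per-step penalty across millions of steps and the outsized hit to job completion time that Justin describes follows directly.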

Speaker 1:

Yeah, and how have networks evolved over just the last couple of years, or three or four years, as AI, or generative AI, has jumped onto the scene? What types of changes have been made in terms of networking that have enabled that fast communication?

Speaker 2:

I would say probably one of the biggest things is bandwidth has been changing very frequently, right? It was just a few years ago that we were in shock that we were doing 100 gig in our data centers. Now the GPU networks are 400 gig, we're moving to 800 gig, and 1.6 terabit is in the very near future. So the amount of bandwidth that's being utilized is amazing, how that's changed, and how Ethernet is having to change as a protocol. This is something Justin's been very involved with: Ultra Ethernet, right? And how Ethernet's going to change its way of communicating to overcome some of the historic challenges of congestion management.

Speaker 1:

Yeah, I do want to get into Ethernet and Ultra Ethernet, but just real quick. You mentioned how the needs are constantly rising, Eric. Is it a sustainable path forward, where we're going to be able to account for all of that moving forward?

Speaker 2:

I would say it's definitely sustainable. But what people have to take into consideration is, as we're getting faster, the fiber needs are going to change. So there's going to be a distinct transition from our typical one pair of fiber to using cables like MPO-12 and MPO-16 to transfer that much data, and even a move from multi-mode fiber to single-mode fiber. We're seeing this with 800 gig and 1.6 terabit: the movement toward single-mode fiber. So it's sustainable; it's just that we have to change the way we're looking at our cabling structure, which is also going to work hand in hand with power and cooling delivery. Those are probably the biggest sustainability challenges right now, the power and cooling more so than the network, exactly.

Speaker 3:

Well, there's always going to be that moving target. One of the old rules of networking is that no matter how much bandwidth you put out there, something will consume it. But yes, we have a sustainable growth path for the actual bandwidth. Latency is about as low as it can get. Reliability has been vastly improving, and that's also what Ultra Ethernet is addressing. We've seen Ethernet modify itself many times over the years. I use voice over IP as a common analogy: when voice over IP first arrived, we had PBXs and dedicated digital voice for your phones. Then eventually they said we can do voice over IP, but you had to have an entirely separate Ethernet network to run it. You couldn't converge it, because it was very sensitive and Ethernet would drop packets and it would be horrible.

Speaker 3:

And then eventually we figured out how to do the QoS properly. We figured out how to have the right bandwidth, all the proper adjustments. Now VoIP is ubiquitous. Ethernet evolves. The requirements that we're seeing here are, in one sense, just the next iteration of every bit of growth we've had to deal with in networking. In another sense, some of the peculiarities are why they're going into the Ultra Ethernet Consortium to deep dive into some really granular aspects of the transport for AI that need to be updated, that need to change, and that's irrespective of whether you're running on Ethernet or InfiniBand. Some stuff has to be modified.

Speaker 1:

Yeah, justin, give us a little bit of that. Maybe dive a smidge deeper with the Ethernet, the InfiniBand, now UltraEthernet, the requirements, you don't care what it is.

Speaker 3:

So if you look at AI as an application, just like we say voice is just an application: AI is Skynet.exe, and as long as you deliver what it needs, it doesn't care how you're delivering it. InfiniBand can do it, Ethernet can do it now, and we're moving into Ultra Ethernet, where it's addressing those very specific problems, such as those retransmits once something drops. Currently, the go-fast juice that actually makes Ethernet or InfiniBand really, really fast is RDMA, remote direct memory access. What that does is allow one machine to put information directly resident in the memory of another machine without having it checked, similar to what we saw with the acceleration from X.25 to frame relay: we now trust the transport, so we just send it straight through. RDMA has always been very, very twitchy. One of the worst parts about it is that if you drop anything in sequence, it does a go-back-N, so it rolls back to the last known good, which means that any kind of successive retransmissions or glitchy things, even one dropped packet per session, can have a huge, outsized impact on the job completion time. So RDMA itself is being rewritten in Ultra Ethernet.

Speaker 3:

Not throwing out the baby with the bathwater, but they are looking at things like granular retransmits. So it only says: you dropped packet seven out of ten, please retransmit packet seven, instead of let's go back to zero and go through the whole thing again. There are a lot of other tweaks going in there. Retransmits in IP-based networks have traditionally been handled at the transport layer; they're moving all of that down to the link level, so it's actually happening in hardware now instead of in a software retransmit. There's no CPU interrupt to make that happen. It's not even going into a TCP offload engine; it's handled directly. And then there are a lot of other things in there, such as quality of service.
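As an aside, here is a tiny, hypothetical sketch of the difference Justin is describing between go-back-N and a granular (selective) retransmit; the ten-packet window and the loss position are made up purely to show the arithmetic.

```python
def go_back_n_resends(window_size, lost_index):
    # Go-back-N: once the packet at `lost_index` is missing, the sender re-sends
    # it and everything after it in the window, even packets that arrived fine.
    return window_size - lost_index

def selective_resends(lost_packets):
    # Granular/selective retransmit: only the missing packets are re-sent.
    return len(lost_packets)

window_size = 10   # packets 0..9 in flight
lost_index = 6     # "packet seven out of ten" goes missing

print("go-back-N re-sends:", go_back_n_resends(window_size, lost_index), "packets")
print("selective re-sends:", selective_resends([lost_index]), "packet")
```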

Speaker 3:

Instead of prioritizing a whole flow: in a large elephant flow for generative AI, the traffic is highly varied, it is not a homogenous stream, and not every packet has to be delivered reliably. So now I can categorize into sub-levels. Am I going to be doing reliable ordered delivery, reliable unordered delivery, unreliable unordered delivery? It can essentially classify every single packet in a workflow so that only the stuff that needs the highest priority gets sent that way. That frees us up a lot, because then we suddenly don't have to treat every part of a flow between two GPUs, or between 1,000 or 10,000 GPUs, the same. It won't all be prioritized the same. It gives us a lot more leeway in what can and can't be done.
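A rough way to picture that per-packet classification is sketched below; the packet kinds and the rules are invented for illustration and are not definitions from the Ultra Ethernet specification.

```python
from enum import Enum, auto

class DeliveryClass(Enum):
    RELIABLE_ORDERED = auto()      # must arrive, and in order
    RELIABLE_UNORDERED = auto()    # must arrive; order can be restored by offset
    UNRELIABLE_UNORDERED = auto()  # best effort; an occasional drop is tolerable

def classify(packet):
    # Toy classifier: tag each packet with the weakest guarantee it can tolerate,
    # so the fabric only works hard where it actually has to.
    if packet["kind"] == "control":
        return DeliveryClass.RELIABLE_ORDERED
    if packet["kind"] == "tensor_chunk":
        return DeliveryClass.RELIABLE_UNORDERED
    return DeliveryClass.UNRELIABLE_UNORDERED

for p in [{"kind": "control"}, {"kind": "tensor_chunk"}, {"kind": "telemetry"}]:
    print(f'{p["kind"]:13s} -> {classify(p).name}')
```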

Speaker 1:

And is that just enabling speed or is it avoiding or mitigating some of those ripple effects that you mentioned earlier if something is not delivered reliably, or what's the benefit of that?

Speaker 3:

So let's roll back to the RDMA again really fast. If you look at a day in the life of a packet, the crawl across the network from point A to point B, we don't look at it from NIC to NIC anymore. We look at that process from GPU to GPU. So you're going to have two 8-byte registers going into a GPU and one 8-byte register coming out; that's a 64-bit flop. Now that goes out to the L1 cache, L2 cache, L3 cache, onto the PCI bus, over to the NIC, gets checked for everything along the way, and then finally hits the wire.

Speaker 3:

RDMA removes all of that error checking and just sends it right out to the wire, and on the other side the other GPU does the exact same thing going back up the stack. So what you're doing here is not necessarily creating more bandwidth, and not necessarily creating lower latency (the speed of light is still the speed of light). We are minimizing the number of touches, the number of steps we have to do in the middle to make sure it happens. So it's a question of efficiency, if you will, more than just throwing more scale at the problem. Does that make sense?
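Purely as an illustration of that "fewer touches" point, the sketch below compares how many software steps sit between the GPUs on a conventional kernel path versus an RDMA-style path. The stage names are a simplification chosen for the example, not a literal datapath specification.

```python
# Simplified send-path stages; the point is the count of per-packet touches,
# not the exact stage names, which are illustrative only.
kernel_path = [
    "copy from GPU to host buffer",
    "syscall into the kernel",
    "TCP/IP stack processing",
    "NIC driver",
    "NIC DMA to the wire",
]
rdma_path = [
    "NIC reads GPU/host memory directly (RDMA)",
    "wire",
]

print("kernel path touches:", len(kernel_path))
print("RDMA path touches:  ", len(rdma_path))
```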

Speaker 1:

Yeah, absolutely. Eric, anything you would build on top of that? Or where do you see the delineation between Ethernet, Ultra Ethernet and InfiniBand, and how does that affect our clients or enterprise IT teams? Do you have to pick one and stay in that lane for a long time, or can you bounce back and forth, or can you even use a best-of-breed type of situation?

Speaker 2:

So that is a fantastic question, because we get this all the time: Ethernet or InfiniBand, what's the best way to look at this? And there are a few things we always have to keep in mind, and one of them is: how well do you know InfiniBand?

Speaker 2:

There are a lot of customers that have deployed InfiniBand that really have challenges operationally, because they don't have anyone who really knows the ins and outs of InfiniBand, how to troubleshoot it. Actually installing InfiniBand is very, very easy, it's very plug and play, but it goes off the rails when it comes to problems. As soon as you have that problem and you don't know how to troubleshoot it, now you have to call the experts. Who are you going to call?

Speaker 2:

There aren't a lot of InfiniBand experts out there, and the ones that are out there are rather busy dealing with other implementations, troubleshooting, stuff like that. So really you have to ask: is Ethernet good enough? And that's something we have shown in the AI Proving Ground time and time again now: in a lot of small networks we're able to deploy Ethernet and it's just as performant, if not more performant, than InfiniBand. And network engineers can go in and troubleshoot it, because they already understand Ethernet operationally. So that's one of the biggest things we look at: what's your operational model? Can you even handle InfiniBand when it goes sideways?

Speaker 1:

Yeah, Eric, how is the need, or rise, of distributed architectures compounding that even more? Or is it?

Speaker 3:

If by distributed you mean kind of the calico tapestry of networking, that's not necessarily going to be happening a whole lot with AI networks. They kind of require a homogenous transport. You're not going to be going from 100 gig to 10 gig to 400 gig with different QoS policies. It requires a fairly homogenous island of communication. You can connect that in with the rest of the network, but not at a spot where it's going to be sharing any kind of traffic with the AI itself.

Speaker 2:

Yeah, what we are seeing when it comes to distributed architecture is that some of it will come down to edge applications, where it's not part of an overall AI training network but an inferencing network at the edge, because the performance needs to happen at that location instead of coming back to a data center and sharing with something collectively. So there is a distributed nature within AI for very specific use cases.

Speaker 4:

This episode is supported by Nokia. Nokia helps you realize your digital potential with trusted, purpose-built IP, optical, fixed and data center solutions that provide superior performance and security and integrate seamlessly into any ecosystem. Nokia: pioneering networks that sense, think and act.

Speaker 1:

Justin, I am curious. At least as of this recording, we just got out, relatively speaking, of Cisco Live, and we've had a bunch of other conferences, NVIDIA GTC, et cetera. Tell me more about this partnership I'm hearing about between NVIDIA and Cisco to bring Cisco's networking and operating systems to NVIDIA's Spectrum-X ecosystem. What does that all signal to the industry?

Speaker 3:

So in terms of industry patterns, it's the best of both worlds, to be perfectly candid. A lot of the industry has been absolutely flocking to NVIDIA solutions because they're excellent, they're fast and they're powerful. But they're also the only solution on the market, with a tightly integrated vertical stack where everything works together. Many of our customers, many customers in the industry, are looking to have some level of diversified risk around those suppliers, and if it's one great supplier for GPUs, they don't want to have that same supplier for the network. So integrating Cisco gives them choice as well. Under the hood, what we're hoping to see is some level of standardization on the best way to do it. Currently there are several schools of thought. At the end of the day, it's: where do you want to reorder your packets? If you're going to use all of your bandwidth, you can spray packets out across the environment, and there's going to be out-of-order delivery. It still has to be delivered in order to the GPU on the other side; that's the RDMA way of things. If it arrives out of order, things go wrong. So you reorder it either inside the network, or you reorder it at the edge on a DPU, a BlueField, a ConnectX-7 or a SuperNIC. And various partners have various methods for doing it. Cisco has one way, NVIDIA has another, Broadcom has another, Arista has another, Juniper has a different one.

Speaker 3:

Also, we are hoping to see, by combining some level of engineering expertise between these major vendors, that they'll start to come up with a best option, with other options available. Cisco uses DLB, NVIDIA uses Spectrum-X. The first stage is just allowing Cisco Nexus to participate in that environment, meaning at layer two there's going to be a handshake between the SuperNIC and the switch in the middle that says: ah, you are a Spectrum-X compatible switch, therefore I will use you. That's the NVIDIA side. And then on the Cisco side, they just have to use some P4 programmability on the Silicon One chips to allow it to speak Spectrum-X, meaning: will it do adaptive routing in the middle? Will it have some kind of congestion metering available as well? All of these things are being factored in. So we're very excited about where it's going, hopeful that it will deliver everything that was expected. And of course, yes, you can quote me on that: of course they're going to deliver as expected.

Speaker 2:

Yeah, to add on to that: one of the things that helped drive InfiniBand was the use of adaptive routing and SHARP together. And really what the Spectrum-X architecture is doing is taking those capabilities from InfiniBand and applying them to the Ethernet world. So NVIDIA has had that special sauce, and a lot of the other Ethernet vendors really didn't have a special sauce outside of implementing things like ECMP entropy tools, DLB (dynamic load balancing, like Justin mentioned), or what we call flowlet switching, or packet spraying. There are a variety of ways around that, and I wrote a whole article around ECMP. And now Cisco, through this relationship, has brought a special sauce to their solution by being able to tie into the NVIDIA adaptive routing architecture. So again, it's very exciting to see where this is going to take things, because it will give us the ability to look at Cisco in NVIDIA reference architectures as well. So very exciting.
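For readers who haven't met flowlet switching before, here is a toy sketch of the idea; the hashing scheme, five-tuple and path count are illustrative assumptions, not any vendor's actual implementation. Classic ECMP hashes a whole flow onto one path, while flowlet switching can re-hash each burst of the same flow onto a different path when the idle gap between bursts is large enough to avoid reordering.

```python
import hashlib

N_PATHS = 8   # hypothetical number of equal-cost uplinks

def ecmp_path(five_tuple):
    # Classic ECMP: one hash per flow, so an elephant flow stays pinned
    # to a single uplink no matter how congested that uplink becomes.
    digest = hashlib.sha256(str(five_tuple).encode()).digest()
    return digest[0] % N_PATHS

def flowlet_path(five_tuple, flowlet_id):
    # Flowlet switching: when the idle gap between bursts exceeds the path-delay
    # skew, each burst (flowlet) can be hashed independently onto a new path
    # without reordering packets inside the burst.
    digest = hashlib.sha256(f"{five_tuple}:{flowlet_id}".encode()).digest()
    return digest[0] % N_PATHS

flow = ("10.0.0.1", "10.0.0.2", 49152, 4791, "UDP")   # RoCEv2-style 5-tuple
print("ECMP always uses path", ecmp_path(flow))
for flowlet_id in range(3):
    print("flowlet", flowlet_id, "takes path", flowlet_path(flow, flowlet_id))
```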

Speaker 1:

And maybe dive in a little bit deeper, Eric, on why that's exciting, and more specifically, why that's exciting for enterprise IT teams.

Speaker 2:

Well, again, if we talk about some of the operational models that are out there, you have a lot of organizations that have standardized on Cisco switching, right? And by Cisco having this partnership with NVIDIA, they're going to be able to get into the NCP program for helping build reference architectures. So a customer can now implement these AI networks with Cisco solutions, knowing that they're designed and accepted by NVIDIA, and they don't feel like they're having to go rogue. So again, it's going to allow us to tie into existing operational models and not have to worry about convincing them to move to a different switching platform to support this.

Speaker 3:

And not just a different platform; in the simplest terms, it's a different operating system. It doesn't really matter where you're standardized, there's a huge pushback and a lot of institutional and technical inertia associated with bringing in an entirely different operating system. A lot of customers are very happy with their Cisco and their CLIs. They do not want to start incorporating Cumulus or SONiC into the environment; it just complicates things. So this allows them to port their same data-center-wide and enterprise-wide skill sets directly into the AI environment without having to go through that sharp learning curve. It's another protocol to learn, a few more tweaks, a special-environment sort of thing, but much more easily absorbed into their operating models.

Speaker 1:

Yeah, Justin, a couple of months ago, or it might even have been back in 2024, you wrote an article, and you used a line that I really liked: quote unquote, "science that needs to be verified." These systems need to be tested on real data bouncing between real servers and GPUs or storage or whatever it might be. You're talking again about having a new OS, a new operating system. How would our clients, or any organization for that matter, start to verify that these systems will work within their own real-world settings?

Speaker 3:

So it's the scientific method. First you define the problem: they have to define what they're trying to accomplish. Then you gather the information, you form a working theory (hey, you know what, Ethernet will work great, or we need Ultra Ethernet, or we need to run on Nexus, or wherever else), and then you test it. The testing part has been the biggest challenge in the industry, really, because it is such a complex ecosystem with a lot of very expensive hardware, and if you're going to make a $10 million or $100 million investment in a full-on AI architecture, you want to understand how it's going to perform before you commit. That's why we actually built the AI Proving Ground, and that's a lot of the work we have coming in: we want to see if what we think will happen will happen. Hence, science that needs to be verified.

Speaker 1:

Yeah, I like that you mentioned the AI Proving Ground here, certainly the namesake for our podcast. Eric, can you explain a little better what the AI Proving Ground is and what it offers clients or organizations out there as it relates to testing, validating and proving out that they're on the right path for their AI journeys?

Speaker 2:

Absolutely. The AI Proving Ground really gives a customer the ability to bring their ideas to reality. How can we test building out an AI system, hardware and software, and make it real to see if we can even do this? How can we make the art of the possible happen, and what architecture is going to work best?

Speaker 2:

Do we need to look at InfiniBand versus Ethernet? Do we need to look at Cisco versus Arista? What software is going to make the most sense? Do we run Slurm, or something else, for our orchestration? And the AI Proving Ground is the perfect place for them to do that, because, one, we have the people to help build it; a lot of times customers may not have the knowledge to do that, they don't have the lab, they don't have the budget to buy all this just to see if it even works. And that's one of the biggest benefits of the AI Proving Ground: we have the people, we have the software, we have the resources and the relationships with our OEMs to make the art of the possible happen.

Speaker 3:

And let's also look at time to delivery here. The vast majority of our customers have excellent internal teams, a lot of intelligence, and they've got their own labs. But the turnaround time on average is going to be six months to a year to get hardware in, rack it and stack it, and get it all going. This is our focus, and we have it down to a martial art, so we can bring it in very quickly and turn a one-year evaluation cycle into a three-month evaluation cycle. It helps them have the right information to make the right choices faster.

Speaker 1:

Well, that vendor ecosystem figures to get even more complex. Every networking OEM seems to be touting AI-integrated offerings. How are you making sense, Justin, of the marketplace? Is it just going to continue to expand and expand and expand, or are we going to see some consolidation or partnerships along the way?

Speaker 3:

A bit of both. You're asking me to give you a prognostication of what the market's going to look like. We will see ongoing alliances happening between, say, Cisco and NVIDIA, or similar kinds of connections where vendors can collaborate on this. We'll see some fracturing as well. We'll see new entrants, with AMD, for example, starting to produce some good performance. And we've seen it with DeepSeek.

Speaker 3:

There was a bit of a disruptor there, as they came up with a less resource-intensive but more efficient way of processing.

Speaker 3:

The numbers were still tweaked a little bit when they published the data, but there are going to be disruptors that change everything. There are going to be continuing alliances, and every once in a while things will fracture; to bring up the sordid past, we had VCE between VMware, Cisco and EMC, and when that fractured, the entire integrated stack became a broken stack. We will be seeing those things happen as well. We try to stay on top of it by collaborating tightly with our partners and keeping our fingers in the wind to see exactly what our customers are asking for. We've had instances where what we're hearing from our customers is significantly removed from the strategic direction of our partners, and that's where we talk to them and say: we are hearing different things than you are hearing, we should work through this.

Speaker 1:

Yeah, well, understanding that it's a relatively chaotic landscape and that there are going to be changes in the market in the near term and long term: Eric, is there any advice or guidance you would give clients on how to handle that rapid pace of change? How should they look at the landscape and be able to advance their organization forward, knowing that there could be changes coming down the line at any moment?

Speaker 2:

I think one of the most important things, especially from an AI networking perspective, is that when you're making choices you have to recognize the systems are going to change rather quickly. What was the new shiny object is going to have lost its luster after 18 to 24 months, very easily. So you have to think about how this plays into the bigger picture: as the systems change, is it possible to reuse some of that architecture somewhere else? A great example of the Ethernet versus InfiniBand argument is: as things change, can you put your InfiniBand network anywhere else?

Speaker 3:

No. You're probably not going to do file and print on InfiniBand.

Speaker 2:

Exactly. So what you have to do is think, again, big picture: as I go from, let's say, 400 gig to 800 gig, can I utilize this 400 gig in different areas of my network? What a lot of people don't think about is: when I make a decision about my high-performance architecture, what do I need to do with it in 18 to 24 months? Is there a place for it somewhere else in the organization that makes sense, instead of just a quick one-off?

Speaker 3:

Yeah, when it comes to the enterprise, you want very much a top-down solution. It involves coordinated action at the C-suite level. You're going to be looking at the CFOs, at the CapEx and the OpEx, but at the same time they have to understand that their standard depreciation cycles of six or seven years are not going to apply. They have to look at a one-to-two-year refresh cycle, consistently. So they have to bring it in to function flawlessly and be able to tear it down and move it out just as flawlessly. Those are huge operational concerns. That's where the CEO has to be able to talk to the managing directors, who have to be able to talk to everybody else. So it has to be very much a top-down solution. For customers doing a grassroots solution, or the historical data scientist sitting in a little island of performance, it will very quickly become apparent that those do not scale. What did work has to change.

Speaker 1:

Yeah. Well, so far in this conversation, which has been fantastic, we've only talked about how the network can support or drive AI. I'm curious, let's flip that script here, Justin: where can AI really enhance, accelerate or make the network more efficient? What types of use cases are we seeing in terms of applying AI to the network?

Speaker 3:

So I mean, we've already seen a lot of machine learning inside networks. Whether you'd put it at a full AI level, I'm not certain, but we've had things like security, where a heuristic engine looks at patterns in the environment, learns the patterns of the environment, notices when something is outside of that norm and then applies a fix with varying levels of autonomy. We've had self-healing networks. So bringing AI into the network is nothing new at all. As we've improved the capacity for this, we have a few other choices now. Some partners have their own little AI networking engines that run on a small L40S or something, and we can put that in and then have natural voice interaction. I can say, hey, Siri, what does my network look like today? And the network will come back and tell you exactly where things are wrong, or: please drill down to that performance issue you're watching in real time and brief me on it. It makes it easier to track down very complex environments and see where the problems are and how to fix them.

Speaker 1:

Yeah. Eric, any other AI use cases on the network that you foresee, either coming soon or maybe in the distant future?

Speaker 2:

I would say the big thing is operational change. Again, how do we make it easier to troubleshoot something, how do we make it easier to manage the network? And I think one of the biggest things it's going to drive is more appropriate data lake design: where is the information, how can we access that information? That's really going to drive these different AI-for-network-operations discussions. You already see it happening. Juniper, with their Mist product, has a fantastic AI agent built into it. Cisco, as you know, recently announced at Cisco Live that they're looking to have AI enablement across their entire platform, to make it easier to look at the network holistically. So the big thing there, I think, is we're going to see better telemetry and observability data lakes, and that's going to be absolutely huge. And that also pushes it back out to the edge again.

Speaker 3:

Going back to that point: if you look at data gravity and edge compute, I'll use cars as an example, Waymo self-driving cars. They have to have a certain level of autonomy; they can make decisions themselves in real time based on environmental inputs, and they do. But they also need some level of coordination with the mothership to ensure they have the latest data. That could be a dispatch to tell them where to go, or a real-time understanding of traffic patterns in the city so they know where to avoid congestion rather than driving straight into it. Then they have to worry about backhauling their own data to the mothership so that the data center has an updated understanding of what's going on. They also have to be able to function completely autonomously without a network: they have local city maps they can continue to use even when the connection goes down, so they can still navigate the streets without a real-time picture.

Speaker 3:

The ultimate evolution of this, and you know, is going to be something like I summon my waymo to my house to pick me up, to take me to work, the, the way most shows up and says good morning, justin. I see from your Fitbit that you did not sleep very well last night. I've taken the liberty of scheduling, you know, ordering your favorite latte at Starbucks and we stopped through for there. However, also, it seems like your, your, your elevate, your heart rate's a little bit elevated. Your, your, your cardiologist has suggested you don't deal with that, so we might cut back on, you know, the extra shots of espresso today.

Speaker 3:

But if you think about that little conversation, there has to be a pre-existing HIPAA agreement to be able to pre-fetch. So when the car is being summoned to you, it has pre-fetched your medical data, it has accessed your Fitbit, it has brought it all in, and the car has a localized understanding as well as direction on what to do with it. And yes, I know there are tons of ethical considerations around that level of integration, the privacy concerns, but that's the theoretical end state, the art of the possible, and AI will help us determine exactly how to make those decisions: where to put the data, how to move it faster.

Speaker 1:

Yeah, no, absolutely. That's fantastic. That's a great analogy. Well, we are running short on time, so I do want to cut it short here. Justin, Eric, thank you so much for taking time out of your day to join us on the AI Proving Ground podcast. Hopefully we'll have you back soon. Thank you very much. All right, Thanks again.

Speaker 1:

First, scrutinize the network and the flow of data, and then decide whether to add more GPUs. Second, AI belongs inside the network, not just on top of it. From anomaly-hunting security engines to conversational network co-pilots, embedding machine learning where packets live turns troubleshooting from a war room into a quick chat. And third, telemetry is the new gold: rich, well-designed data lakes, and the observability pipelines that feed them, let ops teams shift from reacting to predicting, whether the endpoint is a data center switch or an autonomous car at the curb. The bottom line is: if you want AI that scales, start by asking how fast your GPUs can talk, not how fast they can think. The network is the heartbeat of every model you'll build next. If you liked this episode of the AI Proving Ground podcast, please consider sharing it with friends and colleagues, and leave a rating or review. And don't forget to subscribe on your favorite podcast platform or on WWT.com. This episode was co-produced by Naz Baker, Cara Kuhn, Mallory Schaffran and Stephanie Hammond. Our audio and video engineer is John Knobloch. My name is Brian Felt. We'll see you next time.

Podcasts we love

Check out these other fine podcasts recommended by us, not an algorithm.

WWT Research & Insights (World Wide Technology)

WWT Partner Spotlight (World Wide Technology)

WWT Experts (World Wide Technology)

Meet the Chief (World Wide Technology)