Radio Cloud Native by Mirantis

All Infrastructure Is AI Infrastructure: Navigating the New Era of Enterprise AI at Scale

Mirantis Inc. Episode 69

As enterprises scale generative AI and autonomous agents, every layer of the technology stack - from silicon to orchestration (i.e., Metal to Model) - is being reshaped.

In this episode, hosted by Mirantis CMO Dominic Wilde, you will hear why AI workloads reverse decades of cloud abstraction, what full-stack AI infrastructure actually demands (GPUs, CPUs, networking fabrics, power, storage, etc.), how data sovereignty and regulatory frameworks like GDPR and DORA create hard architectural requirements, and what principles - manageability, observability, flexibility, repeatability, and the importance of open infrastructure - must be used to guide enterprise AI platforms into the future. All of this change has informed how the teams at Mirantis built k0rdent, a complete, open source platform for AI infrastructure orchestration.

The full list of topics covered in this episode includes:

  1. The end of abstraction and what's next
  2. Full-stack AI infrastructure
  3. Data/AI sovereignty & compliance
  4. The developer experience gap
  5. Core principles for enterprise-ready AI platforms
  6. The case for open infrastructure
  7. Commoditization to contracts
  8. k0rdent by Mirantis

If you want to listen to more episodes of Radio Cloud Native, please visit https://www.mirantis.com/radiocloudnative/ to download, or find them wherever you prefer to consume your podcasts.

If you are interested in contributing to Radio Cloud Native, please reach out to our podcast team: podcasts@mirantis.com

There's a statement we've been making a lot lately at Mirantis, and it tends to get a reaction. The statement is this: in the very near future, all technology infrastructure will effectively be AI infrastructure. And I want to be clear - that is not hyperbole.

As enterprises scale their use of generative AI models and autonomous agents, every single layer of the technology stack - from silicon to orchestration - is going to be reshaped to support AI workloads. And the reason I can say that with confidence isn't just because we're watching it happen from the outside. It's because we've spent years running large-scale infrastructure for some of the largest enterprises on the planet. We've taken that knowledge and we've been applying it to the problem of AI infrastructure, which, if I'm being honest, is just an entirely new level of complexity on top of everything we already knew.

So let's talk about why this is happening, what it means, and what it's going to take to get it right.

The history of cloud computing was fundamentally a story about abstraction. Virtualization, containers, APIs, orchestration engines - all of it was designed to make the lower layers progressively invisible. You didn't need to know what physical server your workload was running on. You didn't need to care about the operating system underneath. The whole point was to hide the complexity so developers could focus on building applications. 

AI workloads completely reverse that trend.

Performance at scale now depends directly on the hardware and the fabrics underneath. Training and inference workloads are bound tightly to CPUs, GPUs, memory, and the network in ways that previous application patterns simply weren't. Instead of hiding complexity, AI is pulling it back to the surface.

And here's the paradox that creates: the faster AI adoption grows, the more enterprises need to understand and optimize their hardware and infrastructure in meticulous detail. The promise of "serverless" simplicity? It effectively vanishes when your platform engineers are having to confront Non-Uniform Memory Access (NUMA) nodes, PCI lanes, and GPU interconnects just to get usable throughput out of the system.
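To make the NUMA point concrete, here is a minimal Python sketch of the kind of check a platform engineer ends up doing: finding which CPU cores sit on the same NUMA node as a GPU. It assumes a Linux host, where the kernel exposes a `numa_node` file under `/sys/bus/pci/devices/<address>/` and a `cpulist` file under `/sys/devices/system/node/node<N>/`. The function names and the `sysfs_root` parameter are illustrative and not taken from the episode.

```python
from pathlib import Path


def parse_cpulist(text: str) -> list[int]:
    """Expand a simple Linux cpulist string such as '0-3,8-11' into CPU ids.

    Only plain ids and 'start-end' ranges are handled in this sketch.
    """
    cpus: list[int] = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            start, end = part.split("-")
            cpus.extend(range(int(start), int(end) + 1))
        else:
            cpus.append(int(part))
    return cpus


def local_cpus_for_pci_device(pci_address: str, sysfs_root: str = "/sys") -> list[int]:
    """Return the CPUs on the same NUMA node as a PCI device (for example a GPU).

    Returns an empty list when the kernel reports no NUMA affinity (-1).
    """
    root = Path(sysfs_root)
    node_file = root / "bus" / "pci" / "devices" / pci_address / "numa_node"
    node = int(node_file.read_text().strip())
    if node < 0:
        return []
    cpulist_file = root / "devices" / "system" / "node" / f"node{node}" / "cpulist"
    return parse_cpulist(cpulist_file.read_text())
```

Pinning a data-loading process to the cores returned here is one common way to keep traffic to the GPU off the cross-socket link, which is exactly the sort of detail that "serverless" abstractions were meant to hide.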

We've also heard a lot in recent years - and I've been in this industry for a long time - about digital transformation. For three decades, organizations have been talking about digitally transforming their businesses. Well, I'd argue we are now genuinely at the cusp of what that actually means. What's happening now is that business logic - the logic we've traditionally coded directly - is being replaced by AI logic. AI and machines are going to start making the decisions that we currently hard-code into systems on our behalf. And when that happens at scale, the infrastructure challenge we need to solve becomes exponentially more demanding.

Let me get specific about what that infrastructure challenge actually looks like, because there's a common misconception worth clearing up.

When people talk about AI infrastructure, they tend to immediately talk about GPUs. GPUs, GPUs, GPUs. And yes - GPUs are central to the equation. But they are only one part of it. Once models get larger and workloads get more distributed, you have to think about the whole stack.

GPUs don't run without CPUs. CPUs are what feed the data pipelines, handle preprocessing, and manage scheduling. And in many cases - more than people realize - jobs actually run better on CPU as part of a coordinated data pipeline than they do on GPU alone. So when you're designing for AI, you're not just designing for GPU capacity. You're orchestrating a relationship between CPUs, GPUs, memory, storage, security, and everything else in between.

Then there's networking. And I don't mean networking in the simple sense. There are four distinct network fabrics that have to be accounted for when you're building out large-scale AI infrastructure. You have data networks - your east-west and north-south flows. You have wide-area networks (WANs) to connect regions. You have Peripheral Component Interconnect (PCI) interconnects between devices. And you have Remote Direct Memory Access (RDMA) networks, which are essential for ultra-low-latency GPU clusters. Each one of these fabrics matters. And all of it has to be taken into account right down to the storage layer.

Now layer on top of that the scarcity problem.

How many organizations right now are in the middle of trying to procure GPUs and finding themselves with eight-, twelve-, or twenty-week lead times? It's a real and present constraint. But the resource scarcity issue isn't just about GPUs themselves. It's about power. It's about physical data center space. We have a partner right now that is consuming one hundred percent of their available power budget while occupying only twenty percent of their facility's floor space. Twenty percent of the floor, a hundred percent of the power. The rules for designing data centers are being fundamentally rewritten under AI's demands: more power per rack, higher cooling requirements, longer equipment lead times. These are real, physical constraints that aren't going away anytime soon, so in our view, they need to be adequately prioritized in your infrastructure management strategy.
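The arithmetic behind that partner example is worth spelling out. The sketch below assumes the facility was originally designed with its power budget spread evenly across the floor; that is an assumption for illustration, since the episode gives only the two percentages. The function name is hypothetical.

```python
def density_multiplier(power_fraction_used: float, floor_fraction_used: float) -> float:
    """How many times denser than the design average the occupied space runs.

    Assumes the original design spread the power budget evenly over the floor.
    """
    if not 0 < floor_fraction_used <= 1:
        raise ValueError("floor_fraction_used must be in (0, 1]")
    return power_fraction_used / floor_fraction_used


# 100% of the power consumed on 20% of the floor.
print(density_multiplier(1.0, 0.20))  # 5.0
```

Under that assumption, the occupied racks draw roughly five times the power density the building was planned around, and the remaining eighty percent of the floor cannot be used without new power.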

Now let's talk about something that tends to come up in every serious AI infrastructure conversation, especially outside the United States: data sovereignty.

Everyone talks about data sovereignty, but it's worth being precise about what it actually means in the context of AI, because it's more layered than people often think.

Sovereignty is geographic. If you're operating in Germany, there are requirements about where your data gets processed. It needs to stay in Germany, not cross to a data center in the US. For a multinational company, managing that sprawl without it dramatically increasing complexity and cost is a genuine engineering and governance challenge.

Data sovereignty is also a legal issue. There's a real and serious question about whether a foreign government can legally compel a company holding your data to grant them access to it. If the company holding your data operates under the jurisdiction of another country's laws, those laws apply to that data. That's not an abstract concern - it's a real risk that organizations need to design around.
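Those two dimensions, where the data is processed and whose laws the operator answers to, can be expressed as a placement check. This is a minimal sketch only: the `Region` and `Workload` types, their fields, and the region names are all hypothetical, and a real policy would carry far more detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    name: str
    country: str                # where the data center physically sits
    operator_jurisdiction: str  # whose laws the operating company answers to


@dataclass(frozen=True)
class Workload:
    name: str
    allowed_countries: frozenset[str]
    allowed_jurisdictions: frozenset[str]


def eligible_regions(workload: Workload, regions: list[Region]) -> list[Region]:
    """Keep only regions that satisfy both the geographic and the legal constraint."""
    return [
        r for r in regions
        if r.country in workload.allowed_countries
        and r.operator_jurisdiction in workload.allowed_jurisdictions
    ]


# Hypothetical inventory: two German sites with different operators, one US site.
regions = [
    Region("de-local", country="DE", operator_jurisdiction="DE"),
    Region("de-foreign-operated", country="DE", operator_jurisdiction="US"),
    Region("us-east", country="US", operator_jurisdiction="US"),
]
claims_model = Workload(
    "claims-model",
    allowed_countries=frozenset({"DE"}),
    allowed_jurisdictions=frozenset({"DE"}),
)
print([r.name for r in eligible_regions(claims_model, regions)])  # ['de-local']
```

The second region is the interesting case: it passes the geographic test and still fails, because the operator sits under another country's jurisdiction.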

And then there are the regulatory frameworks themselves. The General Data Protection Regulation (GDPR) and the Digital Operational Resilience Act (DORA) in Europe are two that come to mind. These frameworks place hard constraints on where and how models run, and on the provenance of every decision an AI system makes. As AI agents begin to operate more autonomously, the governance requirements only intensify. You need to be able to demonstrate provable governance of every agent, every model, every tool you deploy - and do so in an agile manner so you can course-correct as the models (and the data centers in which they run) evolve.

Then there's multi-tenancy. In cloud environments - especially in Europe's banking sector, as an example - hard multi-tenancy isn't a nice-to-have; it's a regulatory requirement. Workloads that span teams, divisions, or partner organizations need strict tenant isolation that is verifiable and auditable, not just assumed. 

All of these dimensions - geographic, legal, regulatory, operational - come together under the umbrella of data and AI sovereignty. They are all, ultimately, infrastructure problems. And as we should all agree by now, all infrastructure is moving towards being AI infrastructure.

One pain point related to all of this that is definitely worth naming directly: developers building AI applications do not want to deal with any of this. This is completely understandable.

When you've spent your career learning to build applications, you don't want to be thinking about NUMA nodes and PCI interconnects and RDMA network fabrics. You want to focus on what the application does. You want to reason about models, about data, about user experience. In other words, you want to add value to your applications, not simply maintain them in their current state.

But here's the thing - those developers are highly dependent on the infrastructure underneath them. The gap between what the infrastructure demands and what developers are equipped to manage is one of the largest open problems for enterprise AI organizations today. And it's growing, because the AI tooling landscape has exploded.

The solution is not to push more complexity onto developers. The solution is to design platforms that hide the infrastructure detail while preserving the control, security, and performance that the platform needs to deliver at scale. That's a platform design problem. And it's one the industry hasn't fully solved yet.

So what does it take to build AI infrastructure that actually works at enterprise scale? Let's talk about the principles.

The first is manageability. This sounds obvious, but it is often the thing that gets thought about last. Organizations build platforms by hand. They get them running. And then they discover that updating, upgrading, or evolving the platform is an enormous operational burden, sometimes an impossible one. Platforms have to be designed for lifecycle management (LCM) from day one. You have to be able to upgrade them. You have to be able to improve them. They cannot be brittle hand-built systems that freeze the moment they need to change. 

The second is observability. Every layer of the stack - from GPU utilization all the way up to application response times - has to be instrumented. Performance is not optional when it comes to AI workloads. It is a hard requirement. And you cannot manage what you cannot measure.  

The third is flexibility. This is about avoiding lock-in, but more specifically, it's about the ability to change layers of the stack as the vendor landscape evolves, and it is evolving very rapidly. If your infrastructure is hard-coded to a specific vendor's hardware or software, you've effectively become a legacy system the moment that vendor changes direction. You need the ability to swap out components - to adapt - all without having to rewrite the entire platform.

The fourth is repeatability. Templates and declarative patterns are how you capture known-good architectures and reduce the complexity of deployment. There are enormous amounts of reinvention happening right now across enterprises that are building similar AI infrastructure patterns from scratch. That's wasteful. Repeatable, templated approaches are how you move faster and reduce error.
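As a rough illustration of what a templated, declarative pattern looks like, the sketch below captures a known-good layout once and renders per-deployment variants from it. The template fields and the `render` helper are hypothetical and do not represent k0rdent's actual template format.

```python
import copy

# A known-good pattern captured once. All field names are illustrative.
GPU_TRAINING_TEMPLATE = {
    "kind": "ClusterSpec",
    "gpu_nodes": 4,
    "gpus_per_node": 8,
    "rdma_fabric": True,
    "monitoring": {"gpu_metrics": True},
}


def render(template: dict, overrides: dict) -> dict:
    """Return a new spec: the template with per-deployment overrides applied.

    Unknown keys are rejected so a typo cannot silently drift from the pattern.
    """
    unknown = set(overrides) - set(template)
    if unknown:
        raise KeyError(f"not part of the template: {sorted(unknown)}")
    spec = copy.deepcopy(template)  # never mutate the shared template
    spec.update(overrides)
    return spec


# The same pattern, deployed at a larger size.
large_cluster = render(GPU_TRAINING_TEMPLATE, {"gpu_nodes": 16})
```

The point of the design is that each team changes only the handful of values that differ, while everything else comes from the pattern that is already known to work.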

The fifth is what we'd call borderless computing. Resources (compute, storage, networking) need to be locatable and usable wherever they are. Across data centers, across clouds, across edge environments. Secured and observable wherever they run, but not constrained by artificial boundaries.

Lastly, the sixth - and this is one we feel very strongly about as a direction for the industry - is moving from abstraction to contracts. Rather than abstracting hardware away from applications, workloads should be able to declare performance requirements and receive guaranteed commitments from the infrastructure in return. This flips the traditional abstraction model on its head. Instead of the application being isolated from the infrastructure, the application and the infrastructure enter into a contract. The application says what it needs. The infrastructure delivers it predictably and verifiably. That's not fully the reality today, but it's where things need to go.
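One way to picture the contract idea: the workload states its requirements as data, and the infrastructure either commits to them or reports exactly which terms it cannot meet. The sketch below is an illustration under assumed names (`PerformanceContract`, `InfrastructureOffer`, `unmet_terms`) and an assumed set of three terms; as noted above, this model is a direction rather than today's reality.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PerformanceContract:
    """What the workload declares it needs."""
    min_gpus: int
    min_interconnect_gbps: float
    max_p99_latency_ms: float


@dataclass(frozen=True)
class InfrastructureOffer:
    """What a given pool of infrastructure can commit to."""
    gpus: int
    interconnect_gbps: float
    p99_latency_ms: float


def unmet_terms(contract: PerformanceContract, offer: InfrastructureOffer) -> list[str]:
    """Return the contract terms the offer fails; an empty list means it can commit."""
    failures = []
    if offer.gpus < contract.min_gpus:
        failures.append("min_gpus")
    if offer.interconnect_gbps < contract.min_interconnect_gbps:
        failures.append("min_interconnect_gbps")
    if offer.p99_latency_ms > contract.max_p99_latency_ms:
        failures.append("max_p99_latency_ms")
    return failures
```

Returning the list of failed terms, rather than a bare yes or no, is what makes the commitment verifiable: the application can see why a placement was refused and the operator can see what capacity is missing.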

Let me talk about the strategic choice that underpins all of this, because I think it's the most important decision enterprises are going to make about AI in the next few years. There are two paths.

One path is to take a closed vendor stack. Buy a bundled solution from a single vendor. It's fast to get started. There's less integration work upfront. The vendor has made all the choices for you, and you can start running workloads relatively quickly. 

The problem is the other side of that trade. When your infrastructure is locked to a vendor's ecosystem, your ability to innovate is gated by that vendor's roadmap. If they're slow to support a new GPU architecture, you're slow. If they don't prioritize your compliance requirements, you wait. If their pricing changes, you absorb it - because you have no alternative. The speed you gained at the beginning comes at the cost of control over your own future.

The other path is strategic open infrastructure. This means building on open source foundations. It means composable architecture - modular systems where components can be added, replaced, and upgraded independently. It means bringing together best-in-class hardware partners, storage solutions, and networking technologies, overlaid with open source software and a management layer that gives you control across all of it. You still get access to the innovation happening at the hardware layer from companies like NVIDIA and others. But you're not captive to any one of them. You retain the ability to choose your own destiny.

This isn't a new instinct in our industry. It's the same logic that drove OpenStack. It's the same logic that drove Kubernetes. In both cases, open platforms ended up defining the market - not because closed alternatives didn't exist, but because enterprises recognized the long-term value of maintaining control and flexibility. The same dynamic is playing out now in AI infrastructure. 

That balance - rapid time to value on one side, autonomy and control on the other - is exactly the balance that strategic open infrastructure is designed to achieve.

So let me bring this back to the big picture.

We are at the beginning of a shift that is going to be as transformative as the move from physical data centers to cloud computing. But in important ways, it's more demanding. It's more resource-constrained. It requires more specialized knowledge at every layer of the stack. And it comes with sovereignty and governance requirements that the cloud era, frankly, never had to deal with at this level of complexity.

The organizations that get this right are going to be the ones that approach it with clarity about what they're building toward. They'll be designing for manageability, not just capability. They'll be investing in observability as a first-class concern. They'll be building with flexibility in mind, so they're not locked in when the landscape shifts - and it will shift. They'll be thinking about sovereignty and governance as architectural requirements, not compliance checkboxes. And they'll be choosing open platforms that give them control over their own roadmaps.

Enterprises that make these choices now will be positioned to scale AI workloads effectively and safely. Those that don't will find themselves constrained - by vendor dependencies, by technical debt, by governance gaps - at exactly the moment when moving fast starts to matter most.

AI is not going to wait for the industry to stabilize. And the infrastructure decisions being made today are going to define who can compete and who can't in the years ahead.

At Mirantis, this is the challenge we've been building toward. k0rdent - our open source platform for AI infrastructure orchestration - is designed specifically to address these realities: multi-cloud, multi-cluster, bare metal, declarative, and built to give enterprises the control and flexibility that AI at scale demands. But beyond any single platform or product, the principles we've been talking about today are the ones we believe the whole industry needs to move toward.

All infrastructure is becoming AI infrastructure. That's not a prediction about some distant future - it's a description of what's already in motion.

The question isn't whether to engage with this shift. The question is whether to engage with it in a way that leaves you in control - of your data, your performance, your roadmap, and your costs. 

We think the answer is clear. And we're glad to be having this conversation with the people who are going to shape how it unfolds.

Thank you all for listening.