AI Proving Ground Podcast: Exploring Artificial Intelligence & Enterprise AI with World Wide Technology
AI deployment and adoption is complex — this podcast makes it actionable. Join top experts, IT leaders and innovators as we explore AI’s toughest challenges, uncover real-world case studies, and reveal practical insights that drive AI ROI. From strategy to execution, we break down what works (and what doesn’t) in enterprise AI. New episodes every week.
Why Observability Has Become the Control Plane for Enterprise AI
As enterprises rush to deploy AI, observability is emerging as the discipline that determines whether those investments create durable value—or quietly erode it. In this episode of the AI Proving Ground Podcast, WWT's Ivan Wintersteiger, Cisco's Tapan Shah and NVIDIA's Shashank Sabhlok discuss why observability, when treated as foundational, becomes the connective tissue between infrastructure, applications, security and user experience.
More about this week's guests:
Ivan Wintersteiger brings more than 20 years of experience across IT leadership, engineering, and business development, with a consistent focus on end-user experience and security. In his current role, Ivan leads the End User Computing practice, driving transformational solutions across modern device management, application delivery, and identity management to help organizations operate securely at scale.
Tapan Shah leads the Splunk AIOps products—IT Service Intelligence (ITSI) and Splunk On-Call—within the Splunk Observability portfolio. With more than 20 years of experience across observability, systems and network management, application performance monitoring, and AIOps, Tapan has worked closely with large global enterprises across industries, including a significant portion of the Fortune 100. Based in California, he focuses on helping organizations gain end-to-end visibility and operational resilience at scale.
Shashank Sabhlok is a senior product manager in the NVIDIA Enterprise product group, where he leads initiatives around AI factory design and enterprise adoption to drive scalable, high-performance AI across organizations. Prior to this, he was the lead product manager for IBM watsonx.governance, IBM's generative AI governance solution, where he successfully led the product from inception to launch and beyond. He holds an MBAi (MBA + AI) degree from Northwestern University's Kellogg School of Management and McCormick School of Engineering, and a BASc in Electrical Engineering with Distinction from the University of Waterloo.
The AI Proving Ground Podcast leverages the deep AI technical and business expertise from within World Wide Technology's one-of-a-kind AI Proving Ground, which provides unrivaled access to the world's leading AI technologies. This unique lab environment accelerates your ability to learn about, test, train and implement AI solutions.
Learn more about WWT's AI Proving Ground.
The AI Proving Ground is a composable lab environment that features the latest high-performance infrastructure and reference architectures from the world's leading AI companies, such as NVIDIA, Cisco, Dell, F5, AMD, Intel and others.
Developed within our Advanced Technology Center (ATC), this one-of-a-kind lab environment empowers IT teams to evaluate and test AI infrastructure, software and solutions for efficacy, scalability and flexibility — all under one roof. The AI Proving Ground provides visibility into data flows across the entire development pipeline, enabling more informed decision-making while safeguarding production environments.
From World Wide Technology, this is the AI Proving Ground Podcast. Right now, a lot of organizations are racing to build AI capability. They're buying GPUs, standing up platforms, running pilots. And quietly, a new question is emerging underneath all that effort: do you actually know what your systems are doing once they're live? Observability used to mean keeping the lights on, but today it's more about understanding cost, performance, trust and adoption in AI environments that are far more complex and far more expensive than traditional IT. When looking to scale AI, that lack of clarity doesn't just slow teams down, it turns innovation into guesswork.

So today, we're talking with three experts about why observability is becoming foundational in the age of AI and why it sits at the core of Cisco's Secure AI Factory with NVIDIA. Joining us are Ivan Wintersteiger, who leads AI ops here at WWT and sees firsthand how enterprises struggle to operationalize AI; Tapan Shah from Splunk, who's shaping how observability evolves beyond reactive monitoring; and Shashank Sabhlok from NVIDIA, who is working directly on AI infrastructure, where utilization, trust and economics collide. Together they'll talk about what it really takes to see inside AI systems as they scale, where observability stops being a tool and starts becoming an operating discipline, and how it quietly decides whether AI turns into durable infrastructure or a very expensive pilot. So let's jump in.

Okay, well, everybody, welcome to the AI Proving Ground Podcast. Tapan, how are you doing today? Doing great. And Ivan, how are you? Doing awesome. And Shashank, out there virtually, how are you today? I'm great. Thanks for having me. Excellent, excellent. We're talking about observability and how it's shifting in the age of AI. We're going to bring in some conversation about Cisco's Secure AI Factory and get into a little bit of the landscape of what we're seeing around here. Tapan, we'll start with you.
You know, what are we seeing from observability moving beyond that typical legacy IT ops idea and into an actual strategic lever that organizations can use? And how is that working in the age of AI?
SPEAKER_02:No, absolutely. So I think historically what we have seen is, when you talk about traditional monitoring or how things were being done for IT ops, people were always reacting to signals, right? Something went wrong. Hey, I gotta act upon this. From there, the posture is changing a lot more to proactive. How do I get to know of something going wrong ahead of time? How do I figure out, hey, there's a drift happening in my environment? That drift could be just a measurement of a metric, or, now with AI, a lot of drift is in the models as well, right? So there is that landscape changing in terms of the posture becoming more proactive. The second big thing that's happening is that a lot of things were always based on, hey, IT teams are gonna configure their environments in a certain way, they will always expect certain behavior, and when that behavior is not normal, they're gonna generate an alert. From there, it's changing more towards: do I really need to set things up? Can I just use AI, where things like thresholds for metrics are recommended to me based on historical data patterns? That way I'm not spending a lot of time setting things up; the system is telling me how I should be monitoring and observing things. Yeah. So I think those are the two aspects.
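The threshold-recommendation idea Tapan describes can be sketched in a few lines of Python. This is a toy mean-plus-k-sigma rule over historical samples, purely illustrative; real AIOps products use far richer models with seasonality, trend and drift detection.

```python
import statistics

def recommend_threshold(samples, k=3.0):
    """Derive an alert threshold from historical metric samples
    instead of hand-configuring a static one: mean + k * stddev."""
    mean = statistics.fmean(samples)
    stdev = statistics.pstdev(samples)
    return mean + k * stdev

def is_anomalous(value, samples, k=3.0):
    """Flag a new observation that exceeds the recommended threshold."""
    return value > recommend_threshold(samples, k)
```

With latency history `[100, 102, 98, 101, 99]` the recommended threshold lands a little above 104, so a reading of 120 is flagged while 100 is not.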
SPEAKER_00:Ivan, based on what Tapan just said, what are you experiencing as it relates to what you're seeing? How are real-world organizations talking about observability, and how do they approach it?
SPEAKER_04:Yeah, no. Actually, I was gonna bring that up. In my mind, observability is no longer an optional thing, right? Because if you look at it, the whole purpose of it is to proactively fix the problems that are potentially happening in the IT systems, right? And those IT systems provide services to the organization. And those services are so business critical these days that you cannot afford to have downtime. I mean, you can see some of these outages that we've had, either because of cyber attacks or because of wrong code that was pushed or whatever, caused millions or even billions of dollars of downtime, and it's not just monetary; there's also a brand that is getting damaged as a result of that. So it's so critical nowadays. That's why, with pretty much anything that we do and sell, there usually should be some level of observability attached to it.
SPEAKER_00:And Shashank, how is the complexity underneath the hood of all these AI systems and GPUs changing the game with observability, from NVIDIA's standpoint?
SPEAKER_01:Yeah, for sure, right? So to build on what Tapan and Ivan said, GPUs, for an enterprise, are one of the more expensive line items in your AI factory, right? And to make sure that they're utilized and you're using them for inferencing and training your models, observability is not just about making sure everything is okay. Observability now is financial control, right? You want to make sure that you're utilizing your resources and that your models, your GPUs, are working to make money for you and not just sitting there idle.
SPEAKER_00:Yeah. Tapan, do you have a sense? He's talking about how you don't want GPUs to sit there idle. Do you have a sense of how much capacity we're wasting by not having the appropriate amount of observability on those GPUs?
SPEAKER_02:Oh, absolutely. I mean, that's honestly one of the biggest blind spots as we see how things are being done. Every organization is under pressure, right? Use AI. If you don't use AI, you won't be relevant. To counter that pressure, they are just pushing a lot of workloads into AI. Now, what becomes important at that point in time is: how are you actually seeing who is using what? Are your GPUs underutilized? Is one particular LLM or service overutilizing a GPU so that it's resulting in thrashing for other LLMs, right? So how do you measure that? Two, when all this infrastructure is provisioned centrally, how do you cross-charge that to the different organizations and teams that are using it? That's becoming very critical. And eventually all this relates to capacity planning. Are we underutilizing? Are we overutilizing? Are we just spinning up too many workloads and too many GPUs in silos rather than centralizing that and distributing it across different teams? That clarity is becoming important. I was talking to one of our clients, and they actually brought up a very nice statement. They said, in the world of AI, clarity is the new uptime.
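The "who is idle, who is thrashing" question can be illustrated with a small Python sketch. The input shape here is hypothetical (a map of GPU id to utilization-percentage samples, such as you might scrape from a telemetry exporter); the thresholds are arbitrary and only show the shape of the analysis.

```python
def utilization_report(samples):
    """Summarize per-GPU utilization samples and flag waste.

    `samples` maps GPU id -> list of utilization percentages (0-100).
    Flags GPUs that sit mostly idle (reclaim candidates) and GPUs
    that run saturated (possible thrashing or contention)."""
    report = {}
    for gpu, utils in samples.items():
        avg = sum(utils) / len(utils)
        report[gpu] = {
            "avg_util": avg,
            "idle": avg < 10,       # mostly unused capacity
            "saturated": avg > 90,  # possible contention
        }
    return report
```

For example, a GPU averaging 2% utilization is flagged idle while one averaging 97% is flagged saturated; the same per-team breakdown is what would feed a cross-charge report.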
SPEAKER_00:Love it. Well, Ivan, build on that. If GPU optimization is one of the biggest blind spots that organizations have, obviously efficiency and cost are gonna be affected, but what other downstream effects might occur within an organization that are gonna cause the AI strategy to break, so to speak?
SPEAKER_04:Well, you have to look at it also from the employees' or even our customers' perspective, right? If the system is not configured properly and it's not running optimally, you're not gonna get a good experience, again, from the employees' perspective, right? And it's still a new thing, right? So if you don't have a good experience right off the bat, you're less likely to use it and utilize it as a tool, right? And ultimately that's gonna drive adoption the wrong way, right? So ultimately the companies will not see the benefits of what AI can potentially be. And then you'll see these projects, the pilots that are all over the place now, start to fail. And again, that's not what we want. So that's how that relates to that angle as well.
SPEAKER_00:Yeah, no, absolutely. Shashank, really what we're getting at here is a trust factor. The more visibility and observability we have into how these systems work, the more trust it drives, which can drive adoption. Is that kind of the idea here for organizations to think about?
SPEAKER_01:Exactly, right? Just to build on what Ivan said, if you think about it from a persona standpoint, earlier, as Tapan said, or even in your initial question, you would think IT teams would be doing observability. Now it's going a cut above that. You have senior leadership also looking at these things and making sure things are utilized and adoption is happening and so on. And to answer your question about trust, absolutely, because when companies undertake AI initiatives and projects, it's not as simple as just, hey, we're gonna use this endpoint model and build our apps around it. It involves a lot of investment, right? And mostly that investment is infrastructure. So you want to make sure that infrastructure is paying dividends down the line. Therefore, building that trust and ensuring those applications are utilized is important. And that comes down not only to GPU utilization but also to the end applications and models that are being leveraged to solve the specific use case.
SPEAKER_02:Yeah. And the trust part actually doesn't stop just at the infra or the adoption or the consumption level, right? It extends pretty much to how applications or observability solutions even show these outcomes to the users, right? When you are telling the user, hey, in this particular environment, we detected that if this trend of the past KPIs continues, 30 minutes down the line you will have a potential outage.
SPEAKER_03:Yeah.
SPEAKER_02:You can't just say that. You gotta explain why we are saying that. And that also is something that builds trust over time. So one is, yes, you consume the infra and then build the trust, as Shashank said, but two, that extends to when it is shown to the users and the folks who are actually monitoring the system.
SPEAKER_00:Yeah, and even get proactive too. Not just what's broken, but what can we do better? How can we be more efficient? I mean, Ivan, what's the value of that, assuming an organization gets it right, which is not easy to do? What is the value of getting to that area?
SPEAKER_04:Oh my god, yeah. That whole proactive nirvana, call it, because I'm not sure if anybody's truly there. It's a journey that we all have to take on, but we have to be on the journey. You can't afford to wait and then one day say, I'll do it now, right? So it really goes back to the whole idea of how business critical IT services are. We talk about how every company is now an IT company, which means that, whatever products they make, IT plays such a critical role in their business that they have to have this done. And in the AI era, this is even more important than ever.
SPEAKER_02:And I think IT was historically seen, to Ivan's point, more as a cost center for companies. It's now slowly transitioning towards being a strategic partner to the business, right? The lines of business that it's serving. Yeah, that's the big transition happening.
SPEAKER_00:Yeah, no, absolutely. Well, let's roll into Cisco's Secure AI Factory. Tapan, I'll stick with you here. Ivan mentioned how observability is no longer optional, and I think Cisco and Splunk have talked about it being a foundational element of Cisco's Secure AI Factory solution. So why is it important to make that distinction? It's not just an option; it's foundational, not a bolt-on.
SPEAKER_02:Yeah, no, absolutely. So I think one of the main distinctions there is that, again, I'll go back to historically how things were done. One, people were reacting to stuff: hey, I would buy infrastructure, I would run application workloads on top, I would have some sort of monitoring setup, whether it's open source or enterprise, and this is how I'm gonna build the world. And folks still stayed in the realm of, hey, every time things go wrong, I'll have some system notifying me. We can't really be in that world anymore. Yeah. Right. And observability was always seen as an after-the-fact thought: hey, if I have security established in my environment, then I'm done. A lot of companies still think of it that way, that I just need to get the security posture right. But if you don't get the observability posture right, then you are hitting all those traditional problems of availability, performance and clarity all at once. So that's where we see things changing. Now, what we are doing here, essentially, is bringing not just observability as an outcome of monitoring, of hey, we are observing systems and this is what we are seeing, but rather asking how that relates to a company's security posture as well. Do you have a lot of vulnerable systems that are potentially impacting performance? Is that really the reason why your applications are showing latency? How do you connect all these dots together? I think that's where things become very important.
SPEAKER_00:Yeah. Shashank, talk to me a little bit more. We mentioned it earlier, but how does that AI infrastructure, the GPUs, change observability? Is it just where we're pointing that light, or do you have to change how you look at GPU utilization? Is there uniqueness in terms of what's happening on that end?
SPEAKER_01:I think you just mentioned the Cisco Secure AI Factory, right? So yes, we have GPUs, but now that's impacting how you set up your infrastructure, right?
SPEAKER_03:Yeah.
SPEAKER_01:So let's break it down and talk about it. For sure, you need observability on your GPUs; you want to make sure they're utilized. That's your unit of compute. You want to make sure it's not idle, that it's being utilized.
SPEAKER_03:Yeah.
SPEAKER_01:But then to set up your AI factory, you need a lot of components. Right at the bottom, you need your servers, the Cisco UCS servers that use NVIDIA GPUs. You need storage, right? You're training your models; you need to checkpoint them. AI networking comes into the picture, which operates between servers, which we call east-west, and between the customer and the storage and the server, which we call north-south. And then on top of that, you have your orchestration layer through tools like Kubernetes, and your applications on top. I've said a lot of things right now, right? So it's not just about monitoring GPUs. You need to monitor that whole stack top to bottom, that whole AI factory, because anything could be a point of failure in that whole system I described, right? So to top it all off, you need that observability layer. And, as Tapan mentioned, below that, the security layer, to make sure everything is functioning properly and you're getting the maximum use of your AI infrastructure or AI factory.
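The layered stack Shashank walks through, and why you monitor all of it, can be sketched as a toy failure-localization rule: a failure low in the stack usually explains symptoms in the layers above it. The layer names and health-check shape here are illustrative only; real root-cause correlation is far more involved.

```python
# Layers of the AI factory stack, bottom to top, as described above.
STACK = ["servers", "storage", "networking", "orchestration", "applications"]

def lowest_failing_layer(health):
    """Given {layer: bool} health-check results, return the lowest
    unhealthy layer, the most likely root cause for failures above it.
    Layers missing from `health` are assumed healthy."""
    for layer in STACK:
        if not health.get(layer, True):
            return layer
    return None
```

So if both storage and the applications on top report unhealthy, the sketch points at storage first, mirroring the "anything could be a point of failure in that whole system" observation.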
SPEAKER_04:Yeah. Maybe, sorry to jump in. This reminded me, maybe to demystify observing AI as something way out there: it's really nothing new, in a sense. We've been observing infrastructure for many years, we've been observing applications for many years, and if you boil it down, AI is a combination of the two put together in a package. Yes, it has very unique things that we need to be able to handle, and this is where, again, NVIDIA and Cisco and Splunk come up with all the specialty pieces, but at the end of the day, you're literally observing an application that runs on infrastructure, and that's what this is.
SPEAKER_00:Yeah, but is the North Star, so to speak, changing? You mentioned how it used to be downtime. How is that shifting in light of what Ivan just said?
SPEAKER_02:Yeah, so the North Star is absolutely changing, because earlier, again, the focus was on infrastructure and applications. Now the focus comes from the end user experience, or the user impact. Or, to Ivan's earlier point, even the business impact. People want to answer that question of: so what? Hey, I have an outage here. So what? How is it really impacting the company? Yeah. How is it impacting the business? That's becoming the North Star. And to what Shashank said earlier, right? One is the whole full-stack observability for everything that's delivering AI. But two, there are application workloads that run on top of that as well. So when we say full stack, it starts pretty much at the network infra. To Ivan's point, none of those fundamentals are changing. But the way AI is changing things is that what you measure is shifting now. Yeah. Earlier you were measuring just faults and performance, simple metrics. Now you're actually measuring the user experience. You are measuring: before, this application used no AI services; now it's using AI services. What's the real difference? Are we seeing more adoption of that application? Are we seeing that users are sticking more to the application? I think those things are becoming more critical.
SPEAKER_00:Yeah. Well, Shashank, everything's changing at such a rapid pace. And we always look to NVIDIA, of course, as kind of that leading indicator. So is the North Star gonna continue to move? Are we looking at moving goalposts here as we bring in agentic or inference or quantum or who knows what? How is this all gonna change the game of observability, you think?
SPEAKER_01:I think something that's very clear, to add to what Tapan said, is that the infrastructure-plus-application, end-to-end, full-stack monitoring, that's not gonna change.
SPEAKER_03:Yeah.
SPEAKER_01:Whether you introduce agentic AI or RAG or, even down the line, quantum, right? Inferencing is here to stay. A lot of organizations have gone beyond just using it as a chatbot for solving customer queries, right? They're actually using it to drive efficiencies in their business. So I don't think it's going anywhere. And that full-stack story that I just described to you earlier, that'll be crucial no matter what technology paradigm comes moving forward.
SPEAKER_04:That reminds me, actually, from an agentic perspective, where we're moving toward is so-called digital employees, right? They're pretty much powered by this AI infrastructure. Yeah. And just as any employer wants to make sure they're getting, quote unquote, their money's worth out of the investment in employees, and that those employees are performing at certain levels, that's where observability comes into play. It almost plays the role of HR, in a sense. I'm joking a little bit, but you get the point. Yeah, the analogy, yeah. That's where we're headed, and that's why this is such a big deal.
SPEAKER_03:Yeah.
SPEAKER_00:Well, Tapan, how are Cisco and Splunk thinking about this to make sure that, not just from an observability perspective but from the whole Cisco Secure AI Factory perspective, it's a solution that can adapt and weave and keep up with the pace of change here?
SPEAKER_02:Yeah. So the way we are thinking about it is, we are basically looking at three different perspectives to solve this problem. One, as everybody has said so far, we need to monitor infrastructure; that is the base layer. That includes networks and the actual system infrastructure. We need to make sure that infrastructure is available, utilized according to the capacity that's been allocated, and that we know exactly who is utilizing what. So that's the core. The second layer of the solution is, now there are agents. When we say agentic AI, essentially what we are talking about is that each application becomes an agent. Yeah. Right. If I'm a travel company, then I will have an agent which is just for flight search. I will have an agent which is for hotel search, right? So ultimately each of these functional elements of applications is becoming an agent, and we want to monitor and observe those agents as well, how they are performing. In that particular case, what's changing with AI is also not limiting ourselves to performance metrics, but also including quality metrics, like semantic quality, biases, hallucination, toxicity, sentiment. How do you ensure that all these things are part of your traditional observability offering? Because that's what is changing with AI. And then the third layer, which is where security comes into play, is: how do you unify the visibility for the end users so they see the security posture in the context of observability? Which, again, goes back to the previous examples: is my infrastructure secured? Are my applications running secured? Is the data in transit always encrypted? Am I expecting any threats? And how do I relate those threats to an observability problem? Yeah. So that's how the three elements come together.
SPEAKER_00:Yeah. Well, that puts it into great perspective. Ivan, I'm going to ask you, can you put that into a real-world lens for me, to the extent that you can, within the AI Proving Ground, which is the namesake of this podcast? How are we seeing that actually play out in the real world?
SPEAKER_04:Well, actually, we're in the process of building the whole solution in our AI Proving Ground and our ATC, our Advanced Technology Center, right? And so this is a pretty exciting time, because we want to go to market with this solution holistically, right? And that's why we think, again, observability is a key piece to this puzzle. We do not want our customers to be without that layer in the mix. So we're actually combining all that together. And yes, it's an exciting time for us. A lot of innovation is happening. We're working closely with Tapan and his team to really make sure we have all the bases covered. And what's maybe unique, and Tapan can probably talk more about it: the solution from Splunk that's layered on top really is unique from a market perspective. There is nobody else that can see and observe to the levels that they can, right?
SPEAKER_02:To build on that, the specific differentiation there is that everybody observes Kubernetes infrastructure. That's table stakes these days. But what we do is we actually put that in the context of the inferences and in the context of how Kubernetes workloads are using NVIDIA GPUs, things that are not available in the public API domain. We have that sort of intelligence through our partnership with NVIDIA, right? So that's number one, at the infrastructure layer. At the application layer, we are not just observing application availability or latency; we actually can look at the complete trace, where we can see that, hey, this agent was running, this agent made these sorts of calls to these LLMs, and then we use LLM-as-a-judge to evaluate the response. Right? So we do a deep evaluation. So there are those kinds of elements where it's not just, hey, you're running this and we're monitoring this, but we are also checking how efficiently your applications are running and whether the output they're producing is quality output. So you're saying the AI is monitoring AI? Yeah, in fact, it's called LLM-as-a-judge.
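The LLM-as-a-judge pattern Tapan mentions can be shown in miniature: a second model grades the first model's answer. The judge here is stubbed as any callable returning a 1-to-5 score; in practice it would wrap a real LLM call with a grading prompt, and the prompt wording and pass threshold below are purely illustrative.

```python
def judge_response(question, answer, judge):
    """Score an answer with a judge model and apply a pass threshold.

    `judge` is any callable: grading prompt -> numeric score (1-5).
    Stubbed here; a real deployment would call an LLM API."""
    prompt = (
        "Rate the following answer for accuracy and relevance "
        f"on a 1-5 scale.\nQuestion: {question}\nAnswer: {answer}"
    )
    score = judge(prompt)
    return {"score": score, "pass": score >= 4}
```

Wired into a trace pipeline, each agent-to-LLM call in the trace would be fed through such an evaluator, turning quality (hallucination, relevance) into just another observable signal alongside latency.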
SPEAKER_00:Yeah. Shashank, let's bring you in here. Anything to build on top of what they're saying?
SPEAKER_01:No, that's awesome. If AI judges AI, that means you need more GPUs, right? So, yeah, to add to that: they're building out these solutions based on the NVIDIA Enterprise Reference Architectures, which are essentially a blueprint for companies and organizations such as Cisco to develop their technologies. And we also have an observability guide as part of that enterprise reference architecture. It provides the foundational building blocks, everything you need to start, and obviously Tapan and his team are in a position to deliver that and beyond.
SPEAKER_00:Yeah. Ivan, get into delivery a little bit. So they've bought the solution, it's gone through the ATC, we've tested it out. What does day one, or day 365, look like as it relates to observability?
SPEAKER_04:Yeah, well, in general, this is not traditional delivery as we maybe know it, where you come and configure it and then you're out, or, like Tapan was saying, oh, we implemented security and now we're done. Yeah, you're not done at all. You're never done. So this becomes more of an adoption play from a customer perspective. And again, as you're adapting to whatever is happening, you have to be able to see what is happening with the systems and make sure they're working right, and that's where the key piece of observability comes into place: the visibility into what is happening behind the curtain, in a sense. Yeah.
SPEAKER_02:Well, I mean, I... go ahead, Tapan. No, sorry about that. But just to add to that, right, what's changing, to your question on day one, day 365, we call it day n, but there's also a day zero aspect. Earlier, the day zero to day one used to be a huge distance, because you would install, set up things, get it going, do initial configurations, and that's what you'd call your day one, when you start onboarding your teams. Now what's changing is that observability is becoming more of an intent-based observability, where the user will just provide the intent: hey, I want to observe these systems. You know how to do this; just do it for me. Do the setup, everything, for me. I will just provide the intent; you do all the initial heavy lifting that I had to do before myself. Right. And then present me with the outcome that I would be monitoring.
SPEAKER_00:I wanted to add, also from our implementation and delivery side: because we have this stack in our ATC and we are, so to speak, kicking the tires on all of this already, we absolutely will be ready to help customers quickly adopt the solutions and then be around if they run into issues. I think that's also a key piece of the WWT secret sauce, potentially. Yeah. Well, Ivan, I like what you had to say about seeing what's coming as it relates to delivery, but I think that also applies to where observability is going. We touched on it a little with agentic, and Shashank, you mentioned some of this stuff too. But Tapan, where is that puck going, to quote our CEO Jim Kavanaugh, who likes to use that analogy? What are we going to have to observe in the future that's gonna pop up? Maybe it's just more agents running around, but is there something even beyond that that's on our radar?
SPEAKER_02:I think the paradigm is shifting. What I am envisioning is that alerting will become a thing of the past. Okay. Why? Because people do not want to be alerted; self-healing is gonna be the reality. Yeah. You know what the systems are doing, you learn from that, you remediate the problem, and when the problem situation reoccurs, you apply that same remediation, right? People call it autonomous operations or smart remediation, but at the end of the day, we are talking about self-healing systems where agents would interact with agents. Users would always be dealing with AI agents, right? So let's say, again going back to the travel company example, I'm on Expedia, for example, trying to query things. Expedia deploys agents for hotels and flights and so on, and as an end customer, I'm interacting with those agents. When those agents are down, Expedia knows the impact is heavy, because then I'm gonna go to Booking.com, right? So eventually they will not be able to afford any downtime or outage on these things. That means systems have to be self-healing: problems occurred and problems were solved before the actual customer or end user ever reported them.
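The "learn the remediation, reapply it when the problem reoccurs" loop can be sketched as a tiny playbook lookup. The failure signatures and remediation names below are hypothetical placeholders; real autonomous operations involve learned matching and safe rollout, not a static dictionary.

```python
# Toy remediation playbook: known failure signatures mapped to fixes.
PLAYBOOK = {
    "oom_killed": "restart_pod_with_higher_memory_limit",
    "gpu_xid_error": "cordon_node_and_reschedule",
    "cert_expired": "rotate_certificate",
}

def self_heal(alert):
    """If the alert matches a known remediation, apply it automatically;
    otherwise escalate to a human (and, ideally, learn a new entry)."""
    fix = PLAYBOOK.get(alert["signature"])
    if fix is not None:
        return {"action": fix, "escalated": False}
    return {"action": None, "escalated": True}
```

In this sketch, only the unmatched signatures ever reach a person, which is the sense in which alerting fades into the background.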
SPEAKER_04:Yeah, so what you're saying is it's going to be a beautiful thing to be an IT ops guy or a help desk guy, because you're just going to be sitting back sipping coffee while the systems take care of it.
SPEAKER_02:And the observability solution becomes your insurance policy.
SPEAKER_01:Yeah, well, I know that's a little bit tongue in cheek. A quick question there, Tapan: do you think tracing is going to become ever more important then? Because sure, agents can self-heal, but they're also handling sensitive data, and there are so many stakeholders involved. If they get something wrong, that can easily lead to trust erosion. So seeing what exactly is happening, I feel, is going to become extremely important moving forward.
SPEAKER_02:Exactly, and through the observability lens we call that explainability. You're doing all these things, but explain to us what really happened under the hood. Because that, again, aligns to trust as well.
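[Editor's note] A toy sketch of why tracing underpins the explainability Tapan mentions: if every agent action is recorded as a span, a human can reconstruct afterwards exactly what happened under the hood. The recorder and the travel-booking steps below are illustrative assumptions, not a real tracing API.

```python
import time

trace_log = []  # ordered record of spans for the whole request

def traced(step: str, fn, *args):
    """Run fn, recording what ran, with what inputs, and what came back."""
    start = time.time()
    result = fn(*args)
    trace_log.append({
        "step": step,
        "inputs": args,
        "output": result,
        "duration_s": round(time.time() - start, 3),
    })
    return result

# Hypothetical agent steps in a travel-booking flow
traced("search_flights", lambda city: f"3 flights to {city}", "Austin")
traced("book_hotel", lambda nights: f"hotel booked for {nights} nights", 2)

# Explainability: replay the decision path for a human reviewer
for span in trace_log:
    print(f'{span["step"]}: {span["output"]} ({span["duration_s"]}s)')
```

In production this role is played by distributed tracing (e.g. OpenTelemetry-style spans with context propagation), but the principle is the same: the trace is the evidence behind the explanation.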
SPEAKER_00:Well, yeah, and that gets to the question I was going to ask you, Ivan. I think you were joking that the IT ops guy can just sit and sip his coffee, or whatever else he might be drinking that day, but we're talking about explainability, so the human is still in the loop: you're still going to have to go through those logs and through that explainability to understand. And it's easy to put more trust into that AI than is perhaps practical.
SPEAKER_04:So how do you balance that? Yeah, it's going to be a journey. You don't gain trust even with people instantaneously, and it's the same thing with AI. So again, the visibility that the human in the loop, the IT operations person, has is going to be critical to make sure that trust is developed over time, and then you can do more coffee sipping than other things.
SPEAKER_00:Right. And Shashank, we had talked about that self-healing future Tapan mentioned. You, along with the rest of us, are working with some of the leading, most innovative companies in the world that are working with AI.
SPEAKER_01:Are you seeing any of that self-healing start to emerge, any breakthroughs there, or is that still a little ways off? No, I mean, I'm sure organizations that have made heavy investments in AI, in generative AI generally, are probably working there. But for wide-scale B2B adoption, I think it's a while out, though it's not far away. It's definitely something in the near future, and it's something we need to be ready for, which is why I think explainability and tracing become paramount in that case. So I don't think as much coffee sipping will happen, but I do think the role of that IT ops person, what they manage, what they look at, how they problem-solve, is going to change a little bit.
SPEAKER_00:Yeah, it's funny how terms carry throughout an episode; coffee sipping is the one. Tapan, you were going to say something? Yeah, so I was just going to build on top of what Shashank said. Literally, in 2022 I was talking to a customer about autonomous operations and self-healing, and they laughed it off.
SPEAKER_02:Right. They laughed it off, and that same customer, literally a couple of weeks back, said we shouldn't have laughed, because this is becoming reality. Yeah. People are now seeing that it has to happen. It's no longer something people can laugh about.
SPEAKER_04:Yeah, and the key point here is it's a journey they have to go through. It's like a pyramid: what we're talking about is at the top of the pyramid. You can't just build that piece; you have to build all the foundational pieces to get there. So the sooner you get on this journey, the better you're going to be set for the future.
SPEAKER_00:Yeah. Well, that was one of the key messages that Cisco president Jeetu Patel brought up on a prior episode of the AI Proving Ground podcast, which you can scroll down and see: stick with it. If something's not working, it may well work within three to six weeks; that's how fast things are changing. And Shashank, you had mentioned we need to be ready for whatever's next. Given everything we're talking about here, Ivan, how do we advise organizations to be ready for when that tide shifts?
SPEAKER_04:Again, it's that whole thing of making the first step, knowing that the nirvana is out there and we're shooting for it, but you still have to take that first step or two. So really it's about helping customers with the strategy around this, seeing what they have in their systems, and tailoring that strategy so they can get to the next level as quickly as possible. That's ultimately what we're trying to do with every customer we talk to.
SPEAKER_00:Yeah.
SPEAKER_01:One thing I'd add there: I don't agree that observability only comes in when you set that strategy. I think observability is there from day zero, so it's clear this is not something you can take for granted, and as you build, your observability stack builds on top. Then at some point it can probably do all those fancy things we were talking about, like self-healing and so on and so forth.
SPEAKER_00:Yeah. Shashank, I'll stick with you here because we're coming up on the bottom of the episode, but are there any priorities leaders out there need to be considering right now so that at this time next year they're thinking to themselves, wow, we were making great decisions back then? What should they be prioritizing right now?
SPEAKER_01:Right. If you are on this journey of adopting AI, you will most definitely have invested in infrastructure. And I'll come back to the image I painted at the start of this podcast: your AI infrastructure is not just a server with GPUs. It's everything around it: your switches, your Ethernet cabling, even within the server your PCIe lanes and your NVLink lanes, and then beyond that your applications and your customer experience. So make sure that at least on day zero or day one that observability is ready to go, so you know how your system is performing, because the faster you know, the quicker you can use that to drive revenue through your infrastructure.
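[Editor's note] The "observe every layer" idea above can be sketched as a single health snapshot spanning GPU, fabric, and application metrics. The collector, the metric names, and the thresholds below are illustrative assumptions, not NVIDIA or Cisco APIs; in a real deployment these values would come from GPU telemetry (e.g. DCGM), switch counters, and APM agents.

```python
def collect_stack_metrics() -> dict:
    # Stubbed values standing in for real per-layer telemetry.
    return {
        "gpu_util_pct": 87,     # per-GPU utilization
        "nvlink_bw_gbps": 410,  # intra-server interconnect bandwidth
        "fabric_errors": 0,     # Ethernet/fabric link errors
        "p95_latency_ms": 120,  # application-level user latency
    }

def stack_health(metrics: dict) -> list:
    """Flag layers that need attention; an empty list means healthy."""
    issues = []
    if metrics["gpu_util_pct"] < 50:
        issues.append("GPUs underutilized: revenue-bearing capacity idle")
    if metrics["fabric_errors"] > 0:
        issues.append("fabric errors: check cabling/switches")
    if metrics["p95_latency_ms"] > 500:
        issues.append("user experience degrading")
    return issues

print(stack_health(collect_stack_metrics()) or "all layers healthy")
```

The design point is that one check spans all the layers named above, so a cabling fault and a latency regression surface in the same place rather than in separate silos.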
SPEAKER_00:Yeah, certainly a massive shift toward AI-ready infrastructure across the stack. Tapan, we'll end with you on a somewhat similar question: what are the priorities for 2026? What can organizations do right now so that their entire stack, their systems, are easily observable and they can make those right decisions?
SPEAKER_02:I think one point is the technology shift itself. Organizations need to start planning for it; if they stay in the legacy mindset of "we have been working like this forever," they will be deprecated in no time, because things are moving at that pace. Now, how do you do that without boiling the whole ocean? What I have seen work successfully is that organizations start with one particular user group or line of business, the one most open to the latest technology. Move them, and show that as the example: look, these folks were on legacy data center infrastructure and we moved them to something like a Cisco AI POD, which, to Shashank's earlier point, is an ecosystem by itself. It provides the infrastructure, the networking, a Kubernetes layer to run the applications on, and solutions that can observe all of it. So when customers start deploying this, they need to identify which line of business is aligned to the latest technology trends, take them, apply these kinds of solutions, and show the ROI. And the ROI here, as I said, is that clarity is the new uptime: we're not just looking at whether your applications are always available; it's that your users are not impacted and your business is not impacted. Show that, and then make it a cookie-cutter model. But they need to start adopting newer technologies. Within our company, too, there's a clear push that if you don't use AI, you'll not be very useful in the longer run. And I think that's real.
SPEAKER_00:For sure, and I love that: clarity is the new uptime. That's a good key lesson to end the conversation on. To the three of you, thank you so much for taking the time. I know your schedules are busy, and it's not taken for granted when you spend some time with us. Shashank, thank you out there for joining, and to the two here with me, thank you too. Thanks to Ivan, Tapan and Shashank for joining. What this conversation keeps circling back to is a simple lesson that is easy to miss in the rush to adopt AI: you can't manage what you can't see. And in AI systems, not seeing clearly creates waste, mistrust and stalled adoption, and maybe more importantly it wastes a lot of time, money and effort in the process. This episode of the AI Proving Ground podcast was co-produced by Nas Baker, Kara Kuhn, Diane Devery and Addison Ingler. Our audio and video engineer is John Knoblock. My name is Brian Felt. Thanks for listening; see you next time.