When Every Millisecond Matters: Observability at Massive Scale with Todd Persen

The Digital Revolution with Jim Kunkle

"The Digital Revolution with Jim Kunkle", is an engaging podcast that delves into the dynamic world of digital transformation. Hosted by Jim Kunkle, this show explores how businesses, industries, and individuals are navigating the ever evolving landscape of technology.

On this series, Jim covers:

Strategies for Digital Transformation: Learn practical approaches to adopting digital technologies, optimizing processes, and staying competitive.

Real-Life Case Studies: Dive into inspiring success stories where organizations have transformed their operations using digital tools.

Emerging Trends: Stay informed about the latest trends in cloud computing, AI, cybersecurity, and data analytics.

Cultural Shifts: Explore how companies are fostering a digital-first mindset and empowering their teams to embrace change.

Challenges and Solutions: From legacy systems to privacy concerns, discover how businesses overcome obstacles on their digital journey.

Whether you're a business leader, tech enthusiast, or simply curious about the digital revolution, "The Digital Revolution with Jim Kunkle" provides valuable insights, actionable tips, and thought-provoking discussions.

Tune in and join the conversation!

When Every Millisecond Matters: Observability at Massive Scale with Todd Persen

May 25, 2026 • Jim Kunkle • Season 3 • Episode 21

A 10-minute delay between an incident and the data that explains it is not a tooling problem; it is a business problem. We sit down with Todd Persen, CTO and co-founder of Hydrolix, to unpack why milliseconds now determine whether you stop fraud, keep a streaming event online, or lose a customer before your team even sees the alert.

We talk about the real mechanics of modern observability across distributed systems: where latency hides in the pipeline, how it inflates MTTR, and why “monitoring everything” collapses when log volume reaches petabyte scale. Todd shares what he learned building time series infrastructure at InfluxData and why general purpose stores can become painfully expensive for logs, metrics, traces, and high cardinality fields like user IDs. If your observability budget is growing faster than your infrastructure budget, you will recognize the pattern immediately.

From there we go forward into 2026 trends: edge computing and CDN telemetry, bots that look human thanks to agentic AI, and the shift from human dashboards to agents that query data directly. We explore what “headless” observability could look like, why query performance becomes the limiting factor for AI SRE, and how predictive approaches may finally start to shorten time to remediation toward zero. If reliability is your moat in a world where code gets cheaper, this is the playbook.

Subscribe for more conversations on digital transformation, share this with your SRE or platform team, and leave a review with the biggest observability challenge you are facing right now.

Download (PDF Ebook) "The Evolution Of Digital Transformation By Jim Kunkle" Here: https://drive.google.com/file/d/1z1NjoP7SMs3w7hwXVHT6mVc3--RNrD_1/view?usp=share_link

Referral Links

StreamYard: https://streamyard.com/pal/c/5142511674195968

Contact Digital Revolution

Email: Jim@JimKunkle.com

Follow Digital Revolution On:

YouTube @ www.YouTube.com/@Digital_Revolution
Instagram @ https://www.instagram.com/digitalrevolutionwithjimkunkle/
LinkedIn @ https://www.linkedin.com/groups/14354158/

If you found value from listening to this audio release, please add a rating and a review comment. Ratings and review comments on all podcasting platforms help me improve the quality and value of the content coming from Digital Revolution.

I greatly appreciate your support and Viva la Revolution!

Welcome To The Millisecond Era

Jim 0:05

Welcome back to The Digital Revolution with Jim Kunkle, the podcast where we break down the technologies, the leaders, and the ideas reshaping how industries operate in the time of digital transformation. Today's episode goes straight to the heart of modern digital operations: the world where data never sleeps, systems never stop, and the difference between success and disaster can be measured in milliseconds. Think about that for a moment. A millisecond is faster than the blink of an eye. Yet in today's digital landscape, that tiny slice of time determines whether a fraud attempt is stopped, whether a streaming platform stays online, whether a global enterprise catches a threat before it spreads, or whether a customer abandons your service altogether. And behind all of that is one concept: observability. Not dashboards, not logs, not alerts, but the ability to truly understand what's happening across massive, distributed, high-velocity systems in an instant. And that's why today's guest is the perfect person to help us navigate this world. Joining me today is Todd Persen, CTO and co-founder of Hydrolix, a company that has reimagined what observability looks like at massive scale. Todd has spent his career building systems where latency isn't just a metric; it's a mission-critical constraint. From global traffic networks to real-time analytics pipelines, Todd has lived on the edge where traditional observability tools simply break down. Todd, welcome.

Todd Persen

Hey, Jim, thanks for having me.

Jim

It's great to have you. So what I wanted to do in the opening is frame everything and talk about the new reality of real-time everything. I think it's super important to start with that why. And what really sets everything up is that milliseconds today determine security, the customer experience, and operational resiliency.

Why Milliseconds Decide Outcomes

Jim 2:10

Todd, why are milliseconds so important today?

Todd Persen 2:14

Well, I think, as you said, it impacts so many facets of the online experience. We spent a long time waiting for pages to refresh and things like that. And now the rise of agentic workloads is certainly pushing people to think a little bit differently about what the experience is like with any online service or product. And the faster you can respond to something, the better: for a lot of companies, seconds are revenue, and they measure downtime in terms of dollars. So anytime we can get insight into a system more quickly, we can get to a resolution more quickly, and we can get to a place where businesses see smaller and smaller impact. We're never going to live in a world where we have zero problems; we're always going to have problems. But we've got to get to a point where we can remediate them as quickly as possible.

Jim 3:16

And in that thought process, too, when we think of legacy, it was all about monitoring. We monitored everything. But today, what we really need is true observability, because of data volumes and also AI and other systems out there. Everything's doubling within 12 to 18 months, and we know that's going to accelerate to where it might be every four to six months. So the conversation we're going to have today is really important, because in 2026 we have so many AI-driven systems coming online. We've got a lot more bot traffic. We're getting deeper into edge computing and real-time decision loops. So what I want to do now is talk a little bit about your background, which I encapsulated for you as engineering for scale before scale was cool.

Todd Persen And The Observability Shift

Jim 4:04

And what I want to understand is your early journey from the engineering side of everything to really building systems to handle traffic, volume, and observability.

Todd Persen 4:17

Yeah, for sure. I've been in this industry since before the term observability really existed. My first company that I co-founded was InfluxData; I think we started working on that in 2012. There were other tools that existed for doing some amount of instrumentation of applications and understanding at a high level what was going on. But a big reason we started building that company was that we saw this difficulty when we just wanted to get custom metrics, custom visibility into our applications, to really start digging in and understanding what was happening. We just couldn't find tools that did what we wanted. So a lot of where InfluxDB originally came from was that we needed a place to store this data, and nothing else really existed that would go to the scale we needed. There were tools like Graphite, so there were some things, but they were very limited in what they could do. So really it just came out of a need as engineers: how do we build and observe these systems? A lot of other people were doing similar things at the time. And so we saw that period when we were building those tools lead into the modern observability wave that really started to pick up in 2014, 2015. All of this came from needs as engineers. We want to know what's going on. We want to be able to tinker and explore these systems. And obviously now it's a thousand times bigger than it was when we first started.

Jim 5:58

Yeah, definitely. And one aspect that I find fascinating is that in the past we would always talk a lot about latency, but today latency is costing companies money. In your opinion, or from what you have observed out there, what is the cost of being slow?

Todd Persen 6:18

I mean, it's a great question. Every company is in a different situation. As you mentioned, is it a streaming service that has a latency problem or disruption that impacts advertising revenue? Is it an e-commerce site that goes down and misses sales? Most companies that are sufficiently large and dependent on uptime to drive revenue have some number anchored to it: an hour of downtime costs us, I don't know, let's say $100,000 or $500,000. There's some number. Companies have that as an anchor point, and they can think: if I invest this amount of money in my platform and it saves me 20 hours of downtime a year, that's got a dollar value associated with it. And every single one of these things adds up in terms of latency. You mentioned latency, but it comes in so many forms. Is it getting data out of your system? Is it getting data stored into a new system? Is it how long it takes for it to be queried? Is it how long it takes for it to be visualized? And then there's MTTR. So there's latency at every stage of this process when there's an outage or an incident, and you want to try to shorten all of those because you're just adding it all up. You want to know how long it really takes from something happening to the point where somebody on my team can see it, identify it, and begin making a resolution. Every bit of that latency just means more impact to that uptime number.
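
The arithmetic in this answer can be sketched in a few lines of Python. This is only an illustration: the stage names follow the conversation, but the per-stage delays are made-up numbers, and the function names `time_to_visibility` and `downtime_savings` are hypothetical. The $100,000 per hour and 20 hours per year figures are the ones used as examples in the answer.

```python
# Hypothetical per-stage delays, in seconds, between an incident occurring
# and a person being able to see it. The numbers are invented for illustration.
stage_latency_s = {
    "emit_from_source": 5,
    "ingest_and_store": 600,  # e.g. a ten-minute ingest pipeline
    "query": 20,
    "visualize": 10,
}


def time_to_visibility(stages):
    """Total delay is the sum of every stage, so the slowest stage dominates."""
    return sum(stages.values())


def downtime_savings(hours_saved_per_year, cost_per_hour):
    """Dollar value of avoided downtime, given a per-hour anchor figure."""
    return hours_saved_per_year * cost_per_hour


total_s = time_to_visibility(stage_latency_s)
savings = downtime_savings(hours_saved_per_year=20, cost_per_hour=100_000)

print(f"Seconds until the incident is visible: {total_s}")
print(f"Yearly value of 20 avoided downtime hours: ${savings:,}")
```

With these assumed numbers, almost all of the delay comes from the ingest stage, which is why shortening any single stage only helps if it is the one dominating the sum.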

Why Hydrolix Targets Edge Logs

Jim 7:52

And with your current company, what was the genesis? Was there a pain point or something that you were looking to resolve out there? What was really the founding reason for your current firm?

Todd Persen 8:07

Yeah, so Hydrolix is basically a petabyte-scale data lake for logs; that's probably the easiest way to think about it. We do a lot of work with customers that operate at the edge, through CDNs and things like that. We've got a lot of companies that we work with in media, gaming, and e-commerce, and we have a deep partnership with Akamai, who's probably the largest player in the CDN space, so we do a lot of work with their customers as well. A big part of it is that our CEO, Marty, had been working at another company he'd started previously, and they were trying to do some traffic-steering software in the CDN space. As a result, they were generating a lot of logs, because CDN traffic is super high volume. At that time, they were just sticking it in a cloud service, and they were like, wow, this is actually a lot of data, and it's very expensive to handle all this data. So a lot of where Hydrolix came from is looking at the pain points of the past. This would have been eight years ago, and obviously data volumes are even bigger now. And it came from recognizing that every system that's running at scale is generating logs. It's kind of a ground rule of observability. All these systems in all these different verticals are generating tons and tons of logs. CDN is particularly interesting because that's the first place where a customer may experience an issue, the first place that a customer comes in contact with your service, and where you can start to see some of these indications of a problem, whereas things further down the stack may take longer for an issue to emerge. The trade-off, though, is that CDN traffic, because it's at the edge, is the noisiest of all. Every request goes through the CDN.

So for every file that's downloaded, for every chunk of video that's streamed, you can see a log line that tells you how the system is behaving. A lot of the genesis really was looking at this really interesting use case, where the pain point is that it's just a massive amount of data and nothing else can really deal with it. So Hydrolix stepped in to look at the trade-off between retention, cost, and performance, and to optimize all those things so that you can keep logs for as long as you want for these use cases without the cost being insane, while still letting it be fast. That's where you're optimizing things that are fighting each other. So a lot of the innovations we came up with at Hydrolix were about how we keep it fast without making it super expensive.

Where Legacy Stacks Break Down

Jim 10:52

Yeah, because when you look at observability, the challenge is that traditional approaches in most cases are going to break at scale. Some of the aspects of looking at this with enterprises are what you talked about here: the logs, the metrics, the traces, the events, and things like that. So why do legacy observability stacks collapse when you're dealing with that petabyte-scale ingestion you were talking about earlier?

Todd Persen 11:23

Yeah, I think one of the big things, and this goes back to my time at InfluxData as well, is that for this use case we're looking at data that we know is going to be very high volume. We know it's going to be time series data, meaning there's going to be a timestamp on every row, and the other data that you have attached could be numerical, like metrics that we're observing, or it could be strings, like text identifiers about a user, an application, or a region. So we're going to have these rows that have a timestamp plus a lot of other data that might be somewhat arbitrary to the system. You're not always going to know ahead of time what the schema looks like, and you're not going to be able to do planning. A lot of systems that get used by default, things like Elasticsearch, and more commonly now ClickHouse, are great general-purpose data stores, and you can make them do a lot. But when you start to get to the scale that you need to be at for modern observability, you really see that the things I was talking about optimizing for don't work out well. You can run Elasticsearch at a pretty crazy scale, but it is insanely expensive. Same thing with ClickHouse: you can do a lot, but when you really start looking at optimizing for log data or metrics or traces, the key observability pillars, you find that you really need something that was built specifically with that in mind. A lot of the features that make these other systems great for a bunch of use cases are actually detractors from good observability performance and cost. So that's where you essentially have to invest in building your own high-performance infrastructure at the storage layer.

And that becomes a big challenge when you're trying to build a new observability product and scale it up.

Jim 13:23

Yeah, and I would imagine before your company, before your approach and everything, you know, there were probably observability budgets that were growing faster than infrastructure budgets out there.

Todd Persen 13:33

Yep. I think one of the things is that your infrastructure is roughly correlated to your growth and your expansion. You add more capacity as you have more traffic, but observability can very quickly grow exponentially. If you're adding more logs, more metrics, more instrumentation in the hopes of getting more insight, it's very easy to add a whole lot more data than you intend. And it's easy for that to grow and not get checked until you see your bill. Datadog is a great example. It's got a ton of features, all kinds of stuff you can do, but it's metered: the more data you send, the more you pay. I think some people get surprised because there are unintended consequences, especially when we talk about high-cardinality data, where fields have a lot of unique values. User IDs are a common case. You want to be able to tie an event, a log, a metric to a user ID. But if you've got 10 million users, that's 10 million unique values, and that explodes the complexity of the data. Some platforms charge on that. So it's an honest desire to be able to instrument this thing, but people don't realize, with the complexity of the platform, how expensive that can get. So you definitely see some rapid growth of observability data that doesn't tie to the infrastructure costs.
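
To make the high-cardinality point concrete, here is a minimal Python sketch. It assumes a simple model in which the number of distinct label combinations is at most the product of each field's unique values. The field names and the counts for `region` and `status_code` are invented; only the 10 million users figure comes from the conversation, and no particular vendor's pricing model is implied.

```python
from math import prod


def max_series(unique_values_per_field):
    """Upper bound on distinct label combinations: the product of per-field cardinalities."""
    return prod(unique_values_per_field.values())


# Hypothetical low-cardinality fields on a metric.
base_fields = {"region": 10, "status_code": 5}

# The same metric after tagging every data point with a user ID.
with_user_id = {**base_fields, "user_id": 10_000_000}

before = max_series(base_fields)
after = max_series(with_user_id)

print(f"Possible combinations without user_id: {before:,}")
print(f"Possible combinations with user_id:    {after:,}")
```

Real systems rarely hit the full upper bound, since not every user appears in every region with every status code, but the sketch shows why one well-intentioned field can change the size of the problem by orders of magnitude.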

Jim 15:06

So, with the scale of all the data, and with companies still out there approaching observability in a legacy way, what are some of the operational risks of delaying implementing, for example, what your firm does for companies out there?

The Hidden Risk Of Delaying Instrumentation

Todd Persen 15:26

Yeah, and part of this is particularly hard for companies that have been around for a long time, where they've got lots of legacy systems in older languages. I've run into companies in the past, large banks, large multinational corporations, that have literally thousands of applications doing all kinds of things. When you think about the task of going through thousands of applications, for some of which the original author is maybe no longer around, you're thinking: what even matters to monitor here? What do I even need to know about? Do I even want to touch this code? You start thinking about all those questions and doing it thousands of times, and it's a huge effort. So part of this is just that the longer you wait, the harder it becomes to go back, deliberately take that initiative, and add instrumentation and observability to your products. It's easy to try to move fast early on and defer it, but it gets more painful the longer you wait. And then the downside of not having observability logs or metrics or whatever for a long time is that you're missing some of these historical trends, some of these things that you can use as a baseline. Anyone who's not up to date on observability in their systems is probably aware of that, and probably aware of the pain that's associated. So there are lots of the usual side effects of putting off the hard thing until you feel like you're going to get time to actually do it.

Slow Observability Is No Observability

Jim 17:15

Yeah, it's perfect. Let's switch over and talk about the real-world impact and the aspect of where every millisecond matters. Based on my research into the topic we're talking about today, milliseconds can affect anything from fraud detection, bot traffic classification, and content delivery network performance to streaming and media delivery and application programming interface reliability. So let me ask you this: why is slow observability effectively no observability at all?

Todd Persen 17:54

Well, again, from our perspective, Hydrolix does a lot of work with media streaming companies. So let's say you've got an event that you're streaming; we helped Fox with the Super Bowl recently. Look at that from the perspective of a user. Let's say they're watching the game, and you've got 10 minutes of latency, because some systems have a really long ingest pipeline before you can actually see the data. You're watching a dashboard, and then you see, oh, wow, latency is really spiking on their view, or their application's crashing, or whatever the reason is that they're having a bad experience, but your data is delayed by 10 minutes. Well, you're going to go and try to fix a problem that a user has been having for 10 minutes, and they're probably already gone. They've probably already closed the application and moved on, and that impact is there. So in some of these situations, you have a really small window of time where you can actually fix a problem and still retain that user experience. It's not uncommon to see systems where it takes 10 minutes before you can actually see the data that's been sent, because of the way these systems are built to ingest data. So the goal for us in a situation like that is that the time from when the data is emitted from a CDN to when we get it, ingest it, and make it visible on a dashboard should be within seconds. At that speed, the people who are in charge of that experience can make a change.

Maybe it's an infrastructure change, and they can push more capacity or split to another region. It could be an actual application error, it could be an infrastructure problem; there are all kinds of places where these issues could come up. The insight that you get from that real-time observability lets you dig in immediately and figure out where it's happening so that you can fix it. In a system like ours, you could potentially have seen and fixed the problem before other systems would even have surfaced that there was a problem. When we're looking at these live events where user experience matters and advertising targets matter, keeping that window as small as possible is absolutely critical.

Jim 20:23

Yeah, you were talking about your company's ingestion pipeline, and a lot about the compression of the data, indexing, and storing data at massive scale. How do you enable real-time queries on trillions of data points when you're also dealing with so many different events out there?

Todd Persen 20:48

Yeah, it comes down a lot to the design of the architecture. One of our key principles is that data needs to be searchable as soon as it's received. There are a lot of systems going on in the background that take recent data and combine it with older data. This is a relatively common design pattern, where you're taking data and compacting it into larger and larger chunks as it gets older. But one of the key principles is that from the moment we get that data, we need to store it in a format that's searchable. That's a big deal when you're looking at these observability platforms and time series platforms: what can we do to make that possible? Going back to what we were talking about with other tools like Elasticsearch, ClickHouse, et cetera, there are trade-offs we can make. Some of those systems are designed to have certain transactional guarantees, and we can trade some of those off because we know that transactionality is not as important as ingest and query performance. So we can say, listen, we won't give you this guarantee, but we will give you the ability to have your data searchable in two seconds, or under a second. Really, building an observability platform is just dozens of those small decisions and optimizations that make it possible to do really, really well for this kind of real-time time series data. The flexibility that other systems retain for more generic use cases is what makes them less optimal at the scale that we operate at.
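
The general pattern described here, where new data is searchable immediately and a background process compacts small chunks into larger ones, can be sketched as a toy in-memory store. This is a sketch of the generic design pattern only, not Hydrolix's implementation: the `ChunkStore` class, its methods, and the `max_small_chunks` threshold are all hypothetical, and a real system would use on-disk formats, indexes, and size-tiered merges.

```python
class ChunkStore:
    """Toy store: each ingested batch is searchable at once; compaction merges chunks later."""

    def __init__(self, max_small_chunks=3):
        # Each chunk is a list of (timestamp, message) rows sorted by timestamp.
        self.chunks = []
        self.max_small_chunks = max_small_chunks

    def ingest(self, rows):
        """Store a batch as its own small chunk; it is queryable as soon as this returns."""
        self.chunks.append(sorted(rows))

    def compact(self):
        """Background step: once there are too many chunks, merge them into one larger chunk."""
        if len(self.chunks) > self.max_small_chunks:
            merged = sorted(row for chunk in self.chunks for row in chunk)
            self.chunks = [merged]

    def search(self, start, end, needle):
        """Scan every chunk, compacted or not, for rows in [start, end] containing needle."""
        return [
            (ts, msg)
            for chunk in self.chunks
            for ts, msg in chunk
            if start <= ts <= end and needle in msg
        ]


store = ChunkStore(max_small_chunks=2)
store.ingest([(1, "GET /video 200"), (2, "GET /video 503")])
hits_before = store.search(0, 10, "503")  # visible without waiting for compaction

store.ingest([(3, "GET /img 200")])
store.ingest([(4, "GET /video 503")])
store.compact()  # three small chunks exceed the threshold, so they are merged
hits_after = store.search(0, 10, "503")

print(hits_before)
print(hits_after)
```

Note what the toy leaves out on purpose: there is no transaction log or atomic multi-row guarantee, which mirrors the trade-off in the answer of giving up some transactional guarantees in exchange for ingest and query speed.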

Jim 22:31

So your key value is that you're able to store more, query faster, and pay less overall.

Todd Persen 22:38

Yep. That's it. Those three things are the bucket that we've optimized for. That's really the goal for Hydrolix in particular, but also for observability platforms in general; it's kind of everybody's goal with these tools.

Bots Get Smarter So Data Must

Jim 22:53

And when it comes to legitimate bot traffic, you know, how do you handle that?

Todd Persen 22:58

So I think right now we're in a weird spot where we've gone very rapidly from bots being relatively easy to identify, because they had very specific characteristics that made them easy to pick out, to this world where anybody can sit down with Claude or Codex and crank out an agent that resembles human behavior. You can adjust the randomness factors a lot more easily. It doesn't take the sophistication that it used to to build something that's harder to detect. So right now, a lot of what we're doing is, hey, we store the data. We don't necessarily know in real time whether you're a bot or a human, but what we can do is give end users the tools so that they can start to do some analytics. And we're starting to build more tools that do heuristics on this traffic as we're receiving it and say, hey, this pattern over the last couple of minutes looks fishy. This doesn't look like what a human would do. It could be time between accesses, or it could be the order they look at pages in on a website. Because we've solved the hard problem of being able to store this data and make it available, we can now start to do some of that work and say, here are some algorithms that help us pick out bots. Once we have that data, we can flag it for those end users and say, this IP address looks like bot traffic; you can choose to block it or not. We can start to surface that stuff a little more quickly to end users, and then they can make those decisions about whether they want to block or allow those requests.

Because one of the biggest things we're starting to see, and it's becoming kind of a recurring problem, is that bots and agentic workloads can drive significant infrastructure costs. A lot of these companies don't want to lose out on real human traffic, but they definitely want to do whatever they can, as soon as they realize it, to mitigate those extra costs for bots. And then, as a parallel unit of work, companies are starting to look for ways to do agent or bot authentication. We're moving into a world now where there are legitimate use cases that are bot driven, but companies want to have a way of saying, okay, you're a bot, but you're an authorized bot, or you're here to do legitimate commerce. I think there are going to be more and more places where we see that. There's a whole bunch of startups rising up in this space; they call it Know Your Agent, KYA, and things like that, where you need to be able to delineate between legitimate bot traffic, illegitimate bot traffic, and human traffic. All these things are changing very, very quickly right now. But at the core of it, and the whole reason we exist, is that a lot of the differentiation you want to make comes down to having the data, having it accessible quickly, and being able to give those insights to the end users to make those decisions about what's legitimate and what's illegitimate.
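
One of the signals mentioned above, time between accesses, can be illustrated with a very small heuristic in Python. This is a sketch under stated assumptions, not a description of any product's detection logic: the function name `looks_automated`, the use of the coefficient of variation, and both thresholds are illustrative choices. As the answer itself points out, an agent that randomizes its timing would slip past a check this simple.

```python
from statistics import mean, pstdev


def looks_automated(timestamps, min_requests=5, max_cv=0.1):
    """Flag a client whose gaps between requests are suspiciously regular.

    Uses the coefficient of variation (standard deviation / mean) of the
    inter-request gaps. The thresholds are illustrative, not tuned values.
    """
    if len(timestamps) < min_requests:
        return False  # not enough data to judge
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    average_gap = mean(gaps)
    if average_gap == 0:
        return True  # many requests at the exact same instant
    return pstdev(gaps) / average_gap < max_cv


# Request times in seconds for two hypothetical clients.
steady = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]    # perfectly even spacing
bursty = [0.0, 1.2, 9.5, 11.0, 30.4, 31.0]  # irregular, more human-like spacing

print(looks_automated(steady))
print(looks_automated(bursty))
```

A flag from a heuristic like this would only be a hint to surface to the end user, who still decides whether to block or allow the traffic.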

Agents Replace Dashboards In Practice

Jim 26:19

Excellent, excellent. Now that we're at the midpoint of 2026, let's talk a little bit about the observability landscape. There are trends that you're seeing, and your clients are talking about what's important to them. So what are some of the trends that you've seen sprout up over 2026, and what might be coming down the road?

Speaker 26:42

Yeah, I mean, you touched on this a bit earlier. Data volumes are gonna keep growing, you know, Moore's Law or faster. There's a lot that's changed over the last six to 12 months, just in terms of how fast some of these tools are evolving and giving us new capabilities. But I think that's gonna continue to be kind of a requirement for every company: this observability stuff is here to stay. It's super important, but it's gonna keep getting harder and the data volumes are gonna keep getting bigger. So I think that's gonna be a place where scale is gonna be crucial. The ability for humans to get insights, I think, is gonna continue to be relevant, but we're gonna be moving into a phase where the ability for agents to get insights is gonna be even more critical. And I think we're already starting to see people contemplate a shift away from dashboards, where that used to be the foundation for all observability: if I can't see it on a dashboard, it might as well not exist. And I mean, we had already gotten to a place where there was more data than you could graph. There was more data than you could look at as a human. But now agents could do a better job kind of digging in and digging in and digging in. And they don't need dashboards. So things like MCP servers are obviously very handy. But being able to give agents access to your systems is gonna be, I think, a lot more common. And as a consequence, we're coming back to query performance and the availability of data. Your agents are only going to be able to go as fast as they can get the data to make the next decision.
And so we're already starting to see some folks in this space push towards data storage products that are built for agentic workloads, or that have query properties that are more optimal for agentic workloads. So I think we're gonna see that trend continue. Obviously, there's also a whole bunch of companies in the AI SRE space. And so I think more of these things that used to be a human looking at a dashboard are gonna be moving towards agents looking at data. And I think that shift is just gonna continue. So I think there's gonna be a desire to feed more data to the agents, but also to continue to push that performance envelope. And will that lead us to predictive observability? Oh, probably. I mean, that's sort of been the dream for a long time. We've talked about it for years and years and years. It's like, I would love to feed data into a system that just knows what's going on. I don't have to spend a ton of time configuring and tweaking alerts, because the system has enough of a sense for what my system is doing and what's normal that it can alert me to things as they change. Computationally, it's still a hard problem. But I think we're getting to a place where LLMs are pretty good at reasoning about data and coming up with some standard things to look for and search for. And so I think we'll see more systems that do a better job. I don't want to say just anomaly detection, that's such a vague and loaded term, but looking for changes in your data and then making the decision whether to alert you or, if you want to give it the ability, to go and run some sort of runbook and make some sort of change to your infrastructure.
So I do think we'll see more. Predictive is also sort of loaded. It'll probably be more able to pick up trends that a human might have missed. And so it might be predictive, it might be able to find some things before they actually become an incident. But I think the hope is really just that incidents can be more quickly detected and remediated, and drive that mean time to resolution closer to zero. Yeah, that is high, high value there.

Jim 30:41

So for those business leaders who are listening to this podcast right now, when it comes to practical guidance, what should they do now?

Speaker 30:52

Yeah, I mean, again, it's weird right now. Things are strange out there. I think, don't shy away from traditional observability tools. We're still gonna have this period of time, I think, while a lot of what we're building is still maturing. So continue to invest in observability, but stay on top of some of these things like I mentioned, agentic SRE. That's a field that's developing quickly. And again, some of it is not much more than a wrapper around some of these LLMs, while some companies have invested more in the secret sauce. So I think there's still a lot of work to be done there. But experiment with these tools. Even for us inside of Hydrolix, we're seeing some folks shift more towards building with Claude as part of their daily workflow. We're seeing our SREs use Claude for scripting and for a lot of SRE tasks. There's still humans, obviously, involved in a lot of the infrastructure management, upgrades, things like that, because it's still very sensitive, but the automation and tooling is getting a lot better. So yeah, keep investing in observability, but stay open-minded to some of these new tools. Because if the last six to 12 months have shown us anything, it's that new models can give us an overnight step change in capabilities. And so I think we're gonna quickly see a point where, depending on how much you believe the hype, you know, like Anthropic's been talking about Mythos, their new model that's gonna be really good at finding security holes, flaws in software.
I think we're gonna see models come that do better and better jobs at detection of anomalies, at processing more data at scale. And so I think this stuff is just gonna be changing super fast over the next few years. And I don't think we're far away from a place where you can do a lot more of this observability work in kind of a headless, agentic way. I think maybe two, three years from now, we'll have a very different perspective on what a day-to-day observability platform looks like.

Jim 33:13

Yeah. So to close out this episode, let's talk a little bit about the future of observability.

Digital Trust And How To Connect

Jim 33:19

And, you know, I believe from my research that your position has been that observability needs to become the backbone of digital trust. Did I have that correct in my research? Yeah, yep, you got that. And so the question I would have, closing this out, is why the companies that master that millisecond-level insight are gonna dominate their industries.

Speaker 33:45

Yeah, well, I mean, I think we've already seen that with some of these LLMs and coding agents, building software has gotten less expensive. That's sort of been the trend over time, as new languages and compilers and toolchains for building software have gotten better and better. Now LLMs are just another layer on top of that. And so building the code is gonna be less expensive and less of a moat. So it's gonna be easier for companies to step in and compete and build feature parity with an existing competitor. But what's always still gonna matter is the reliability of the service, the trust that you build as a brand. And I think for all those things, being able to deliver all those things, whether it's being delivered by humans or agents, data is the thing that's gonna drive the ability to keep your system running well. In my mind, there's no part of this revolution that isn't still tied to data. And so code's gonna get cheap, everything else is gonna get cheap, infrastructure is still gonna be expensive, and observability is still gonna be critical. And those are places where, if anything, we're gonna need more infrastructure, more observability as this continues to grow. And that's the place where the companies that are being smart about their observability, that are investing in new ways to try and get data in more quickly, get insights more quickly, and remediate more quickly, they're gonna be the ones that customers look to for the best experience, the ones they're gonna trust with their business.

Jim 35:26

Yeah, because observability is so critical, my challenge to everyone listening is this: if you need to, start from the beginning and re-listen to this episode, make sure you share it, and do your research on observability, but also reach out to Todd and his firm. Todd, how can people get in contact with you or the firm, and where can they find information about what you do in observability?

Speaker 35:52

Yeah, sure. So the company name is Hydrolix, H-Y-D-R-O-L-I-X, hydrolix.io. If you want to shoot me an email directly, you can reach me at Todd at hydrolix.io. But yeah, we're out there, get in touch. And yeah, looking forward to seeing where all this stuff goes over the next few years.

Jim 36:15

I am as well. And hopefully, in maybe six months or a year, you and I can get back together and see where things stand at that time. I'm really looking forward to seeing more and more about observability, and also to seeing the business leaders out there, the companies, the organizations that really take this and adapt it and run with it. So, everyone, I greatly appreciate this opportunity to have another conversation with you. Todd, thank you so much. Yeah, Jim, thanks for having me. All right, everyone, please make sure that you share this episode. But importantly, The Digital Revolution with Jim Kunkle is on all major platforms. Please make sure you're following this podcast, and feel free to reach out through the podcast. There's information on how to send us a text, send us emails, follow us, all that kind of good information. So thank you, everyone.

Jim Kunkle

Host