ShipTalk - SRE, DevOps, Platform Engineering, Software Delivery

Starting an SRE Practice in a Critical Industry - Vlad Ukis - Siemens Healthineers

December 04, 2023 By Harness Season 3 Episode 1

Welcome to the latest season of ShipTalk! In this episode, we get to talk to Dr. Vlad Ukis, Head of R&D for Siemens Healthineers and author of Establishing SRE Foundations. 

We cover several great topics from starting and scaling an SRE function in a healthcare/regulated industry to how to go about renegotiating your SLOs. 

* Road to SaaS
* How to renegotiate SLOs
* Changes in SRE practices through the years

Ravi Lachhman  0:05  
Hi, everybody, and welcome back to ShipTalk. I know it's been a little bit of time since we had our last episode, but don't worry, we have something great today. Today I'm joined by Vlad. He's the author of Establishing SRE Foundations. But Vlad, maybe for the listeners and viewers, a quick minute of background about yourself and who you are?

Vlad Ukis  0:23  
Yeah, sure, Ravi, thanks for the invitation to the podcast. I'm really excited to be here and to talk to you about all things SRE, which is Site Reliability Engineering; it's still a new enough thing that not everybody might know the abbreviation. I work for Siemens Healthineers as Head of R&D for the Digital Health Platform. It's built within the company, and it supports digital services that are developed throughout the company and also used by other companies in the healthcare space. So I'm looking forward to explaining what we've learned on our journey to introducing SRE in the organization.

Ravi Lachhman  1:05  
Perfect, Vlad, I'm very excited. You and I talked a little bit before the podcast as a cheat, you know, to get to know each other and discuss how the podcast goes, but I'm really excited. Being in the technology space that I'm in, a purely software space, a lot of times engineering ethics don't come into question too much. It's not life and death, right? It's like, oh, we can always roll back. You can't quite roll back someone's life, and you can't roll back medical devices; the stakes are a lot higher because human lives are involved. We can kind of interweave that story to say, hey, if it was a software company, this could happen, but at Siemens Healthineers, we have to take more due care because people's lives are at stake. But why don't we start with this: SRE. Since the Google book came out, it's been getting a lot more traction. But why would anybody invest in SRE practices to start with? Let's say a company has never invested in the science of SRE. Maybe they had the art of SRE: you know, our website went down, now we need to make sure we have two replicas of it. But why would an organization just decide to invest in SRE practices or an SRE organization?

Vlad Ukis  2:26  
Alright, so you would start thinking about SRE when you need some way to operate your services reliably at scale. Once you put up some services that are used by others, you need to ensure that those services are operated somehow. And there will be a point in time where, if you are successful with your services, the number of users of the services will grow. And also, if you are successful with setting up the delivery processes at the organization, then your teams will also accelerate with pushing change to production. So on the one hand, you've got more and more change happening in production by different teams, but on the other hand, you've got more and more users using the services at the same time. That means the services will have more and more stringent availability requirements. Despite the change being pushed to production, you need to be available as much as possible, because if you go down, then the growing user base will suffer more and more. Typically, you would deploy something in the cloud in order for it to be used by many people, so you've got those central instances running, and you need a structured approach to manage this: a structured approach to manage the operations of services like this. And this is where you would need to come to some process of how to do this, and you would start looking around: okay, so how do people do this? And inevitably, you will arrive at SRE.

Ravi Lachhman  4:07  
Nicely said. Yeah, it's kind of like both sides are starting to grow, right? For example, if you write a service just for yourself, is it really a service? The whole point of writing any software is for other people to use it, to benefit from it, or to be part of a system that's helping multiply the effect of a person. And you said it very nicely: modern software engineering is about velocity. But if you peel back the human psychology of it, or even the system topology of it, any time we introduce change, we're introducing risk, right? If nothing ever changed, there would be no risk, but there'd be no innovation either. And in a modern environment, there's lots of competition for users. Even more intrinsically, you want to better the craft of engineering; we can do a better job, but you have to introduce change at some point. And you said it very well: the velocity of change has increased so much, and the user traffic has increased too. So how do you mitigate both? You want to make sure you don't stifle innovation, but you also don't want your users so upset that they go somewhere else, or, in the case of Siemens Healthineers, at some point someone's life will be at stake, so you're protecting human life at the same time. So that's awesome. And I think another question would be: is it worth it? At what point is it art versus science? Is making the investment worth it? For example, say I'm just a stodgy manager: our service should never go down.
Do we need an SRE for that? What's the negotiation there? Like, oh no, our ravi.com went down too many times and I need to hire an SRE. Is that the critical time to do it? Or should we be investing in SRE before an outage occurs?

Vlad Ukis  6:11  
Right. So I think, to answer this, we can go back to the origins of SRE. Why did SRE actually come into existence at Google? Google was growing exponentially at some point in time, and with their operations capabilities back then, they required more and more operations engineers in order to run Google. And that wasn't sustainable. They were growing very fast, and they couldn't grow the workforce for operating the services linearly with the growth of users. So they needed to come up with something, and that something became SRE. That means with SRE, you are able to run your services in such a way that you don't need to scale the number of operations engineers linearly with the number of users coming onto your platform and onto your services. This is also where the economic argument can be made to that stingy manager that you mentioned: yeah, we could do this for some time, but at some point, if we are successful and the traffic really starts to grow exponentially, we just cannot afford an exponentially growing operations team as well.

Ravi Lachhman  7:18  
Yeah, that makes perfect sense. It's kind of like the modern platform engineering type of world. I view an SRE as a highly specialized engineer, right? You're not going to have one SRE per scrum team or per agile team; they're handling multiple teams or multiple organizations. So how do you go about making sure that the expertise that's built with SRE is there and disseminated across the organization? Your Google example is really good, because they've hired thousands of engineers, but I think at the last SREcon I was at, they had some statistic like, for every thousand engineers there are fifteen SREs, or some low number; it was like less than a percent of the engineers at Google are SREs. But with academic and scientific ways of driving the discipline home, they're able to, as you say, scale SRE. I have an expert in Vlad here, so I like to ask greedy questions. It's almost like a comment: since the dawn of time in technology, there's innovation and there's control, right? You have to create new ideas, but you don't want to break the things you control. So how do you balance innovation and control, Vlad? You have to support changes, but you also have to support the criticality of what you're working on. Any advice for anybody trying to balance that? An open-ended answer, no right or wrong, just: how do you tackle that problem, innovation versus control?

Vlad Ukis  8:45  
I think this is where SRE really has its strength, because it aligns the entire organization on operational concerns very well. So this question can actually be answered more scientifically than as an art. What will happen under the SRE framework is that you will get to the table the people that develop the services, so the developers. Then you'll get to the same table the people that run those services, so the operations engineers; although they could be developers as well, the people fulfilling that role will be at the table too. And then you will also get at the same table the people that are responsible for those services as products, so you'll get the product managers as well. So then this set of people, the trio of product managers, operations engineers, and developers, will set for each service so-called service level objectives, which means, from the reliability standpoint, what service level each of the services we offer shall provide to the users. And because they will agree on those objectives, and the SRE infrastructure will then monitor the fulfillment of those objectives, you will have data to act upon. The SRE infrastructure will detect the services that don't fulfill the objectives that the people developing the service, operating the service, and offering the service as a product agreed should be offered. Therefore, whenever you've got data from production that says, okay, this service doesn't fulfill the service level objectives we thought it should, then you've got the right conversation going on: are the service level objectives that we thought the services should provide actually reasonable ones? If they are not reasonable, then why don't we lower our thresholds?
And if the service level objectives are reasonable, because every time we break SLOs, service level objectives, we actually break the user experience, so the SLO breaches become user experience breaches, then why don't we actually invest more in reliability in order to fulfill the SLOs, so that they don't break the user experience? And this is where, gradually over time, you get the organization to a point where you've got actionable data to work with, where you can invest in areas where it really matters from the user's point of view. And where the data shows that, although you don't fulfill certain service level objectives, they don't lead to user experience breaches, then you just lower the service level that you offer, and you continue running.
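The loop Vlad describes, defining an SLO, measuring its fulfillment from production data, and flagging breaches, can be sketched in a few lines of Python. The service name, objective, and request counts below are hypothetical, purely for illustration:

```python
from dataclasses import dataclass


@dataclass
class SLO:
    service: str
    objective: float  # e.g. 0.995 means 99.5% of requests must succeed


def availability(success_count: int, total_count: int) -> float:
    """Measured availability SLI: fraction of successful requests."""
    return success_count / total_count if total_count else 1.0


def check_slo(slo: SLO, success_count: int, total_count: int) -> bool:
    """True if the measured SLI meets the objective."""
    return availability(success_count, total_count) >= slo.objective


# Hypothetical production data for one service over a reporting window
slo = SLO(service="image-upload", objective=0.995)
met = check_slo(slo, success_count=99_200, total_count=100_000)
print(met)  # measured 0.992 < objective 0.995, so the SLO is breached
```

In a real setup, the success and total counts would come from the monitoring infrastructure rather than being passed in by hand, but the comparison at the core of the feedback loop is exactly this simple.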

Ravi Lachhman  11:43  
Yeah, perfect. That's a very experienced answer, right? If you have to balance innovation versus control, using SLOs is actually very wise, because now everyone's at the same table. You say, hey, let's get our objectives out before we start bickering over, oh, we need to get this out, or no, you shouldn't do that. Usually it's a communication thing; it's a base of understanding. Like, you know what, we need to reply to the user in 500 milliseconds or less, 95% of the time, or 99.5%, or 95%, because that's a Ravi service, 95% availability. Just getting that out in the open, and then everybody works together, right? It's not us versus them, this group or that group. Once there's a base understanding, everyone circles around it saying, hey, let's try to work within that framework of producing something users will be delighted with, without killing ourselves over reliability. SLOs are always a hot-button topic in SRE land, and I always like to ask folks who are very experienced in SRE about something I wasn't very good at: if you have to renegotiate an SLO, how do you go about doing that? So maybe first give a quick definition for those listeners who don't quite know what an SLO is. But they are quite tricky to renegotiate, and I want to hear your experience: if you've ever had to renegotiate an SLO, typically how do you go about doing it?

Vlad Ukis  13:17  
So first of all, to the definition of SLO. An SLO is, as the name suggests, a service level objective for a service. If you've got a service, then before defining the SLOs, you need to define so-called service level indicators that are important for the service. And service level indicator is a fancy name for something that people in computer science have known for years: these are typical qualities that are important in the non-functional requirements world. For example, you can have things like availability, latency, consistency, durability, and so on. These will differ from service to service. But what's very common to all services is, for example, availability. Being available is kind of the basic property of a service, because if it's not available, then you don't need to go and look at other SLIs, like for example latency; if it's not available, well, then it's down and not usable. So I think it's easy to start with availability, because it's going to be applicable to a very broad range of services. Then, once you've defined one of your important SLIs, service level indicators, as availability, you can say: okay, what is actually a reasonable availability for this particular service that sits there in my service network? Does it sit squarely at the bottom of the network, or somewhere at the top, or somewhere in the middle? It depends on that particular service. And then, because you've got at the table again the trio of the product manager, operations engineer, and developer, not one of these parties makes the decision; they agree on what's actually important. And they've got different viewpoints, right?
The developer knows how the service is actually implemented; the operations engineer knows the actual behavior of the service in production, in different production environments, under different loads, and so on; and the product manager has got a vision for the service that they want to offer to the outside world. These are three distinct viewpoints, and they need to come together in order for a definition of an SLO to make sense. That said, at the very beginning, when there is still no data from production at all, it will only be a guess. And it's important to let people know that this is just the initial SLO that we are setting here, because this is what we're thinking; but then, of course, we will be proven wrong by production data once the infrastructure has started tracking that SLO against the actual performance of the service. And this then starts a very powerful feedback loop, where the trio first comes up with an SLO that seems to make sense to them, then they feed it into the SRE infrastructure, then the SRE infrastructure looks at the fulfillment of the SLO, starts doing the alerting when the SLO is not fulfilled, and also starts generating long-term statistics. So that, say, every three months, you can have a look at your performance for each service over the last three months and see which ones are fulfilling the SLOs and which ones are not. And this is then a process that keeps running from there forever. If you're doing it like this, then I think you've already woven the renegotiation of SLOs into the process, because once you look at your long-term trends, say at the end of a three-month period you look at the previous three months, you will already see that some services are not fulfilling the SLOs.
And you will have the necessary conversation, which is: okay, do we need that stringent an SLO for that particular service, for that particular operation of the service, or can we lower the threshold? If we lower the threshold, will that lead to broken user experience or not? And if the people at the table say, okay, I think we could actually give it a try, so let's lower the SLO, then that is already an act of renegotiation, right? And if, on the other hand, the SLOs are fulfilled, but over the last three months you still got a fair share of customer complaints, then you know, okay, actually we need to tighten the SLOs, because although we are fulfilling the current level, it still leads to customer complaints. And that's again an act of renegotiation of the SLO. So this is how I would go about it. I think the core of the matter is that you establish that feedback loop initially, and you never let people step out of the feedback loop.
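The quarterly conversation Vlad outlines, lower the SLO if breaches don't hurt users, tighten it if complaints persist despite meeting it, can be sketched as a simple review rule. All service names and figures here are made up for illustration:

```python
# Hypothetical quarterly review: for each service, compare measured
# availability against its SLO and flag candidates for renegotiation.
def review(slos: dict[str, float], measured: dict[str, float],
           complaints: dict[str, int]) -> dict[str, str]:
    verdicts = {}
    for service, objective in slos.items():
        if measured[service] < objective:
            # Breached SLO: either invest in reliability or lower the target
            verdicts[service] = "breached: invest in reliability or lower SLO"
        elif complaints.get(service, 0) > 0:
            # SLO met but users still complain: the target is too loose
            verdicts[service] = "met, but complaints: tighten SLO"
        else:
            verdicts[service] = "met: keep SLO"
    return verdicts


print(review(
    slos={"upload": 0.995, "search": 0.99},
    measured={"upload": 0.992, "search": 0.998},
    complaints={"search": 7},
))
```

The point is not the code but the shape of the decision: both branches that change the SLO are renegotiations, and both are driven by production data plus user feedback rather than by any one party at the table.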

Ravi Lachhman  17:59  
That's a very solid answer. Yeah, that's as good as I've ever heard it. SLOs are not static, right? They change, they adapt. And I think a lot of times, especially for myself, there's a difference between a greenfield service, something that's new for the first time, and a brownfield service, something that's been running since the dawn of time. If you don't have data, you don't have data. If you and I dropped a service right now, we wouldn't have any historics, because it's brand new; this is version 0.0.1 of what we're building, so we don't know. Versus, yeah, this is version 100, it's been running since the mid-90s, we have a good idea of what's going on. It's about making sure there's continuous improvement, having that feedback loop, as you mentioned, and recognizing that they're not static. At least in my experience, sometimes SLOs get treated as a bit static, but data can feed them and you can readjust them, so I really liked that answer. Shifting gears a little bit here: since you work for Siemens, maybe talk about the journey. When folks, myself included, hear about medical devices or the medical field, you kind of imagine, oh, there are ten hospitals and someone has to drive around to each with a USB stick and install updated software. But today it's actually very modern; you're representing a SaaS-based model, a lot of the software has moved to the cloud. Maybe you can talk about that journey.
Like, hey, if change is not that instant, how do you ensure quality? This is where the criticality of your role comes into play. Maybe a minute or two on the Siemens journey, how that's changed, and the SRE practices that you've put in place at Siemens.

Vlad Ukis  19:47  
You are spot on with the memory stick that is used by the service engineer going from hospital department to hospital department and so on. This is actually one of the reasons why the Digital Health Platform that I'm responsible for exists: to avoid those memory sticks, and to bring the innovation of the cloud and the possibilities of the cloud into the healthcare domain. You're absolutely spot on, because one of the applications that we offer does exactly this. It literally replaces the memory stick and the service engineer, because you can update things over the cloud. There is actually another example in the healthcare domain where, instead of a memory stick, they just use a taxi driver: the taxi driver brings a CD from one hospital to another hospital. That, of course, can be done over the cloud, and this is what we do using the platform. So the journey that we had is similar to the answer I gave a couple of minutes ago when you asked when you would introduce SRE in an organization. We also had a moment where our services started being used more and more. We had a really growing user base, and it was growing along several dimensions: on the one hand, just the number of users accessing the platform, but on the other hand, the number of devices connected to the cloud from hospitals has also been growing. What hasn't been growing alongside is our operational capabilities. We basically just operated the platform the way we happened to, and it wasn't a very structured process. And this is where we started feeling that every time there's an outage, there is more and more impact from the outage. If we've got an outage today, it's one blast radius,
but if we've got the same type of outage, say, two weeks from now, there will be more users and more devices connected, and therefore the impact is getting exponentially larger every time we get an outage. And also, every time we got an outage, our operations capabilities didn't allow us to see improvement, in the sense of: okay, we coped with that outage within a couple of hours, and the next time we get something like this, we can deal with it in one hour, and so on. We didn't see the progress of improvement in our operational capabilities. And that led us to searching for some alternative, searching for methodologies that people use in order to run services reliably at scale. And this is where we learned that there is that thing called SRE, and this is where we started introducing it. So this is the journey of where we came from.

Ravi Lachhman  22:38  
Yeah, thanks for that. Surely, you know, the ability to get updates quicker than someone going via taxi or courier service from hospital to hospital helps everybody, because you can make quicker changes and get the most updated version or critical fixes out there. I have three more questions for you, and I think they're pretty good, pretty interesting questions. So here: I work for Harness, and internally, when we want to experiment with something, we usually get a quarter, like 90 days: okay, you have 90 days to try this out and see what the results are. Let's say we start from scratch at some firm, and we decide we get 90 days to implement an SRE organization or practice. We get one and a half engineers, and we're building the practice from day one, because we've decided we've reached the point where we should. What would you do in the first 90 days? If you just had a small team of folks who want to change reliability for the better at an organization, what can you do in 90 days? Is there a lot, is there a little, what do you recommend? This is an open-ended question again.

Vlad Ukis  23:56  
Yeah. So you said one and a half engineers; that's actually quite a lot. Say I get a capacity of one and a half engineers, that's great. I think you can do a lot in 90 days with one and a half interested engineers. So I think what needs to happen is, first of all, you need to find a team that will be willing to jump on board and try something with the SRE infrastructure that doesn't exist yet. You need to have a consumer, right? You need somebody who would be eager to try. That's one thing. But then, on the SRE infrastructure side, you need to start building the infrastructure. And here I would say the very minimum that you could do, that you'd want to concentrate on, is again to select the one most important SLI that would be applicable to a broad range of services, which would be availability, and then you allow the engineers, using the infrastructure, to specify the SLOs. So the first thing would be: SLOs can be defined, number one. Then, number two, the SLOs can be deployed to the most interesting environments, and that needs to be done in a way that can be done very quickly by the people using the infrastructure, so by the consumer team. They now have the ability to specify the SLOs, and then to deploy the SLOs to be tracked by the infrastructure, very conveniently, in any deployment environment that they are interested in: it could be the production environment, it could be any intermediate environment. Once that is in place, the next thing is the implementation of some alerting mechanism. If you look at the Google books, you'll find chapters about this, but I think to start with, you can just implement a simple algorithm that balances a little bit the timeliness of alerts and the effectiveness of alerts.
Timeliness means you don't alert immediately once the SLO threshold has been broken; you wait a little bit until more error budget has been consumed, and then you alert. And effectiveness is that you alert on something that makes sense to act on. So I would say you implement a little bit of these smarts in the algorithm, so that it's just a bit better than alerting as soon as a threshold gets broken. And I think with that, you are already at a point where a team can do much more than before. First, they can define the SLOs for the services that they own. Second, they can get those SLOs checked in any environment that they're interested in. And third, they can get alerted whenever the SLOs get broken. That then enables them to have the powerful feedback loop that I talked about before. So then you work with the team to really get at the table the product manager, the operations engineer, and the developer, and you practice that feedback loop with them. You meet every couple of weeks or so and say: okay, this is what we fed the infrastructure, this is what we got back, and so on; shall we adjust this a little or not? And then additionally, if we're talking about three months, you start showing them the long-term trends as well. Long term in that case, because we've only had three months and we spent quite some time implementing all this and trying things out, would maybe be the data that you aggregate over several weeks or so. And then you say: okay, short term, here is where we got the alerts and so on, and we adjusted the rules. But long term, if you divide the time into, say, fortnights, here, for the last three fortnights,
this is what we saw: these are the services that fulfill the SLOs, and these don't. Would that data help you with prioritizing reliability or not? So basically, I think over that period you've already elevated the team quite a bit. First, they defined the SLOs together; that didn't happen before. Second, they got feedback from production continuously on whether the SLOs were fulfilled or not, so you involved them in the build, measure, learn feedback loop, where they started actively talking about whether the SLOs made sense and whether breaches led to user experience breaches or not. Third, they got alerted on breaches. And fourth, you showed them that this is not just for the immediate SLO breaches; it's also for the long-term trend analysis that can be fed into prioritization. So you've actually prototyped the whole cycle with them. And I think with one and a half engineers and one interested team, that should be possible within 90 days.
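The alerting balance Vlad mentions, waiting until a meaningful share of the error budget is consumed rather than paging on the first threshold breach, can be sketched like this. The burn threshold and request counts are illustrative assumptions, not a production-ready policy:

```python
# Sketch of error-budget-aware alerting: instead of paging the moment
# availability dips below the SLO, wait until a meaningful share of the
# window's error budget has actually been consumed.
def error_budget_consumed(slo: float, total: int, failed: int) -> float:
    """Fraction of the window's error budget used up so far."""
    budget = (1.0 - slo) * total          # allowed failures in the window
    return failed / budget if budget else float("inf")


def should_alert(slo: float, total: int, failed: int,
                 burn_threshold: float = 0.1) -> bool:
    """Alert only once more than `burn_threshold` of the budget is gone."""
    return error_budget_consumed(slo, total, failed) > burn_threshold


# 99.9% SLO over an expected 1,000,000 requests -> 1,000 allowed failures.
# 30 failures so far is only 3% of the budget: no alert yet.
print(should_alert(0.999, 1_000_000, 30))   # False
# 250 failures is 25% of the budget: time to alert.
print(should_alert(0.999, 1_000_000, 250))  # True
```

Real implementations typically refine this with burn rates over multiple time windows, but even this crude version captures the trade-off Vlad describes between timely alerts and alerts that are worth waking someone up for.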

Ravi Lachhman  28:55  
That was perfect. Yeah, great answer. I think it's very pragmatic: teaching other teams how to measure, right? I can admit I'm pretty bad at measuring sometimes; you just get focused on building, like, oh, we do have to measure this. You can't improve what you can't measure. And having that focus on, okay, let's look at measuring, let's look at learning, can we improve anything, and then it's kind of organic after that: showing incremental success builds success, as you show some improvement from small experiments. That's awesome. Perfect answer, thank you for that. Kind of the last few questions here for the listeners or the viewers. I'll take them in this order. As someone who's been in the SRE industry for a long time, are there any trends you're noticing? Say you compare when you started your SRE journey to today: would you say we're doing things a little bit differently today than a couple of years ago, like, oh, we're more aggressive, or we have different practices or policies in place, or we're continuing to evolve as a practice? Any people, process, or technology where you'd say, you know what, this is really cool? And anything on the horizon that you think will be really neat as the SRE practice continues to evolve?