ShipTalk - A new word for techies, using your observability data lake to do good, and some HEMA-toma inducing sword play - Hilliary Lipsig

In this episode of ShipTalk (The SRE Edition), Red Hat's Chief Mermaid, Hilliary Lipsig, defines a new word that all techies should adopt into their vocabulary. For the main course, Hilliary describes how to use Observability data to do so much more than most companies are doing today. She also shares her favorite hobby which is something I have never heard of before.

Introductions
Just for fun #1 - Hilliary's new word - ooming
Main topic - Using Observability data proactively
Just for fun #2 - Hilliary's favorite hobby

Jim Hirschauer: 0:07

Alright. Welcome to ShipTalk, the SRE edition. I'm Jim Hirschauer, your host for today. ShipTalk is a DevOps podcast, brought to you by Harness, the software delivery platform, and the SRE edition focuses on reliability topics. My guest today is Hilliary Lipsig from Red Hat. Hilliary, welcome to the show.

Hilliary Lipsig: 0:27

Thanks, Jim. Happy to be here. My name's Hilliary Lipsig. I am Chief Mermaid at Red Hat. And if you look me up on LinkedIn, that's really there. If you've seen my talks, that's my title. And the reason for that title is because I am the technical lead for a small subdivision in Red Hat called Strategy Enablement and Architecture, out of our larger service delivery organization. And when they said, "Hey, you really can't have principal site reliability engineer on your job title anymore because you're not really, you know, an SRE that way. What do you think a good business card title would be for you?" I said Chief Mermaid Officer, because it was finally my opportunity to grow up and be a mermaid like I'd always wanted. So they laughed, but we did it. So that is my official unofficial business card title, which is delightful.

Jim Hirschauer: 1:15

That's probably the greatest title I've heard so far, so congrats on that.

Hilliary Lipsig: 1:19

Thank you. I'm very proud of it. Someday they will probably make me change it, but today is not that day.

Jim Hirschauer: 1:25

Awesome. Keep it as long as you can, for sure. Well, Hilliary, you know, we've talked about the format of the show. We have a couple of sections that are called just for fun, and then we get into the main topic. So we're gonna start off with a just for fun section. And when we were talking earlier, you used a word that I had never heard before, so I'm not sure if you made up this word, or at least you're like a really early adopter of this new word, but I feel like we should all know what this word is. So why don't you go and explain that.

Hilliary Lipsig: 1:54

Sure. So people tell me I made up this word. I do not believe it, but the people I thought I learned it from insisted they learned it from me. So who knows where this came from? So when you're looking at a pod in Kubernetes and it has an out of memory, or an OOM, somewhere along the way I said, "Ugh, that's ooming." And people are like, "What?" And I said, "Yeah, ooming." It's like, you know, you say OOM for out of memory. It's ooming, it's a verb. It's delightful. And this has caught on to some degree here in Red Hat. Others have picked it up. Again, I'm credited with this. I'm certain I did not come up with this. I must have heard this somewhere. So if anybody knows who I learned it from, they deserve all the accolades. But I will tell you that, like I said, it's delightful. And I have been on a call with a customer that was very serious, very tense. They were, rightfully so, really upset. And I used the term, and then I paused to explain the term, and you know, it's not like magically everything was better, but suddenly there were smiles on the call. People were laughing, and it seriously reduced the whole tension of the situation because I wasn't taking myself or anything too seriously. Or, the right level of levity is really what I wanna say that brings into the conversation. Yeah. And so then we got back to our very serious discussions.

Jim Hirschauer: 3:20

Well, that's awesome. I love the word, and I'm curious though, can you use the word a little more generically, or does it have to be really related to an out of memory error? Can I use that word if, let's say, the system is just chugging along and performing poorly? Can I say, oh, it's ooming? Is it almost like googling something?

Hilliary Lipsig: 3:43

No, I do feel like it is pretty specific, because it maps back to something very visual: you see OOM on the, like, oc get pod, or not oc, I'm sorry, that's a very OpenShift thing. kubectl get pods, or whatever. Mm-hmm. I work in OpenShift. We have our own CLI. It's opinionated and I have that completely memorized, much to my occasional detriment when I wonder why it doesn't work on vanilla Kubernetes.
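
For readers who want to spot an "ooming" pod without eyeballing the CLI output, here is a minimal sketch. It assumes you have already captured the JSON that `kubectl get pods -o json` prints; the sample data and the helper name `ooming_pods` are made up for illustration and are not something discussed on the show.

```python
# Sketch: list pods whose containers were OOM-killed ("ooming").
# SAMPLE_POD_LIST is hypothetical data shaped like `kubectl get pods -o json` output.

SAMPLE_POD_LIST = {
    "items": [
        {
            "metadata": {"name": "api-7d9f"},
            "status": {
                "containerStatuses": [
                    {
                        "name": "api",
                        "restartCount": 4,
                        "state": {"running": {}},
                        "lastState": {
                            "terminated": {"reason": "OOMKilled", "exitCode": 137}
                        },
                    }
                ]
            },
        },
        {
            "metadata": {"name": "worker-55c1"},
            "status": {
                "containerStatuses": [
                    {
                        "name": "worker",
                        "restartCount": 0,
                        "state": {"running": {}},
                        "lastState": {},
                    }
                ]
            },
        },
    ]
}


def ooming_pods(pod_list):
    """Return names of pods with a container whose current or last state is OOMKilled."""
    names = []
    for pod in pod_list.get("items", []):
        for cs in pod.get("status", {}).get("containerStatuses", []):
            # A container records why it stopped under state.terminated.reason
            # (still stopped) or lastState.terminated.reason (already restarted).
            reasons = [
                cs.get(key, {}).get("terminated", {}).get("reason")
                for key in ("state", "lastState")
            ]
            if "OOMKilled" in reasons:
                names.append(pod["metadata"]["name"])
                break  # one ooming container is enough to flag the pod
    return names


print(ooming_pods(SAMPLE_POD_LIST))
```

Checking `lastState` as well as `state` matters because a restarted pod can look healthy again while still carrying the record of its last out of memory kill.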

Jim Hirschauer: 4:10

Right. All right, good. So I've learned today that I can use this word ooming if I see an out of memory exception, so next time I see one, that's definitely, it's ingrained in my memory now for sure. I love it. I feel like I'm just gonna give you credit cause you're the first person I've heard it from, so if anyone asks me, I'm gonna say I heard it from you. And origin being unknown, I think you really deserve the credit at this point. If it really starts taking off.

Hilliary Lipsig: 4:37

I will humbly accept that credit, but thank you.

Jim Hirschauer: 4:40

Okay. Alright. So let's get a little more serious now and jump into our main topic. So, you know, we had a chance yesterday to chat a little bit before the show, and you were talking about observability. I've been in and around the observability space personally for over a decade. My background as a practitioner goes back into the monitoring realm, which is kind of, I'd say, the granddaddy of modern observability. And I think most of us that are either on the show right now or listening to the show, we all know how and why we should be using observability to identify and resolve production incidents. But yesterday you said to me that we can, and we should, be doing more with all of that data. So I'd love for you to explain your position on that.

Hilliary Lipsig: 5:33

Sure. And I'd like to hope this isn't a controversial position. One of the projects that I'm working on with Red Hat is a data project, and we were working through the personas of how do we be a more data-driven organization, and what kind of data do we really need to be that? And I swear this is not an original thought. I'm sure I learned this from somebody who was much smarter than I am. But I said that, you know, one of the things that we could be doing is actually demonstrating our cloud usage metrics. So what are we doing with CPU and what are we doing with memory? And then using that to compare to, you know, when we're doing the planning for these systems, we typically have some sort of profile. So the first is, of course, that profiling: where are we in our usage compared to how we forecasted, to make sure we're kind of keeping track of that. That all maps back to spend, which everybody is really caring about right now, and which you should always care about, actually.
And the other thing that I said was, you know, it's important to watch trends in this usage as well, and not just looking over the course of a month or the course of a day, but looking a little bit more granularly. And so there's actually an example from the SRE team that I joined at Red Hat. That was a team I came onto that I'm no longer a part of, although I still love them and they love me, I think. But anyway, there were some thresholds set up in Prometheus that would create warning alerts, and it would just be for things like CPU and memory and so forth. And because of these thresholds we noticed things like memory leaks in the code. And there's a couple of symptoms when a memory leak is happening. So it's not just that your memory usage goes up and up and up. But actually your CPU will start churning and churning and churning, even in virtualized hardware. It will work harder and harder to try and reclaim as much memory as possible to try to prevent an out of memory. And so taking a look at that kind of data, and looking at that data a little bit more granularly, we can start actually looking for issues in our software. And of course, we had set alerts on that to let us know that there might be a memory leak. But with more data, cuz again, that was one simple Prometheus rule, you can actually get a little bit more intelligent about it as well.
And so that's one of the things where I am trying to drive this kind of cultural push within Red Hat, and I know I'm not alone in this. I really don't wanna say that there's nobody else doing this. It's just that I've joined onto this type of effort. It's this idea that we should be using our metrics to look and see if we can be improving our software. Because if we profile our software, and then we run our software with the profiles we've defined, so we've got memory limits and CPU requests and, yes, CPU limits. I know that with CPU limits, people are like, you don't need them. I will argue that it's a whole other 25-minute conversation. But yes, CPU limits as well. And so if we're constantly running close to our limits, or even occasionally hitting the limits and then things restart and run fine for a while, those might not necessarily all result in incidents or downtime or alerts, but all these things tell you that your software's probably not operating as well as it could be. And so we wanna look into why our software is not operating as well as it could be, because the nature of working in a cloud native environment or Kubernetes is often, as I have seen, that certain things can kind of be hidden for a while, because it's highly available, because it restarts and brings itself back up to a desired state. There might not be any seriously noticeable blip, but that doesn't mean that the blips that exist are acceptable either.
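
To make the "watch the trend, not just the threshold" idea concrete, here is a minimal sketch of a leak check over a short window of memory samples. It is not the Prometheus rule the team used (that rule is not described in the conversation; Prometheus users might reach for its built-in `predict_linear` function instead). The function names, the sample numbers, and the `min_growth_mib` cutoff are all assumptions chosen for illustration.

```python
def slope_per_sample(values):
    """Least-squares slope of values against their sample index."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def looks_like_leak(memory_mib, limit_mib, min_growth_mib=1.0):
    """Flag steady growth and estimate how many samples remain before the limit.

    Returns (is_leak, samples_left). samples_left is None when no leak is flagged.
    min_growth_mib is a hypothetical cutoff; tune it to your sampling interval.
    """
    growth = slope_per_sample(memory_mib)
    if growth < min_growth_mib:
        return False, None
    samples_left = (limit_mib - memory_mib[-1]) / growth
    return True, samples_left


# Hypothetical samples (MiB), one per scrape interval.
climbing = [100, 110, 120, 130, 140]
steady = [100, 101, 100, 99, 100]

print(looks_like_leak(climbing, limit_mib=200))  # (True, 6.0)
print(looks_like_leak(steady, limit_mib=200))    # (False, None)
```

A fuller version would look at CPU alongside memory, since the conversation points out that a leaking process also burns CPU trying to reclaim memory, and that pairing is a stronger signal than either metric alone.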

Jim Hirschauer: 9:34

Yeah. So, you know, you make an interesting point about being proactive with this type of activity. First of all, most companies have this type of data. They have observability solutions in-house, or in place at least. And there's a wealth of data at most companies. What I think is interesting about this is you made the point really early on that we should always be concerned with costs. And what I've seen happen is companies over-provision their resources with their cloud computing provider. Or, we used to do this back in the days of the plain old data center: we just used to over-provision the heck out of our hardware so that we could avoid these issues for as long as possible, avoid the out of memory exception from kicking in and causing an error, causing the impact to our users. But that, in my mind, especially in today's world, is completely unacceptable. That goes out the window, and we have to do a better job as an IT community of creating services that respond and behave properly, that we don't have to over-provision and really waste a ton of money on. So I think you're really onto something here. You know, besides the fact that you can make an overall better customer experience by having better performing software, you can end up saving your company significant cost across all of the different services that most companies have, due to that over-provisioning.
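
As a rough illustration of the over-provisioning point, the sketch below compares what each service requested against its observed peak plus some headroom. The service names, the numbers, and the 20% headroom are invented for the example; real right-sizing would draw on longer usage history than a single peak.

```python
def rightsizing_report(services, headroom_pct=20):
    """Compare requested memory with observed peak use plus headroom.

    services maps a service name to (requested_mib, peak_used_mib).
    headroom_pct is a hypothetical safety margin, not a recommended value.
    """
    report = {}
    for name, (requested_mib, peak_used_mib) in services.items():
        # Integer arithmetic keeps the suggested size exact and easy to read.
        suggested_mib = peak_used_mib * (100 + headroom_pct) // 100
        report[name] = {
            "suggested_mib": suggested_mib,
            "reclaimable_mib": max(0, requested_mib - suggested_mib),
            "under_provisioned": suggested_mib > requested_mib,
        }
    return report


# Hypothetical services: (requested MiB, peak used MiB).
services = {"checkout": (2048, 512), "search": (1024, 900)}

for name, row in rightsizing_report(services).items():
    print(name, row)
```

In this made-up data, `checkout` asks for far more than it uses, while `search` is running close enough to its request that it would be flagged the other way.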

Hilliary Lipsig: 11:02

Yeah. And I swear I did not intend to derail our entire conversation into CPU limits. But it actually came up very recently, and I explained to somebody all the reasons why I'm in favor of them. And I'll give a high level overview here, which is, one, I'm not only dealing with virtualized hardware, I'm dealing with real hardware. So I can't just let processes do whatever they want. It actually is finite. And if you're overcommitting your CPU on real hardware, you're shortening your hardware's lifespan. So if you think about how expensive a server is, first of all, that's extremely wasteful to shorten its lifespan. Second of all, it's drawing more energy, which is extremely wasteful because it causes more pull there, and it actually impacts the overall ambient temperature, which means the AC needs to run more often wherever that server is located. So as you said, it's not just about the immediate company's fiscal situation, but then we start getting into the environmental impacts of not having performant software. Which I think about all the time, because when I'm thinking about the size of the Red Hat customer base and so forth, at this enterprise scale, these are actually things that start coming up. We have projects around sustainability and environmental impact, as I think a lot of major companies do these days. And so these are the types of things that, you know, we're caring about. These are the types of ways we're wanting to use data. And since we already have the data, it honestly would be almost irresponsible not to use it to figure out how to make everything better holistically, looking at the holistic picture of our software. And since I'm somebody with an operational background, I care about how it's performing.

Jim Hirschauer: 12:38

Yeah, makes sense. So let me ask you this. Sometimes it's really easy to say we should do this thing. The reality that most companies are faced with and, and most SREs or folks who are responsible for reliability and performance of overall systems is that they are typically pretty far behind in their main job responsibilities. So, what advice can you give folks to start to transition from using their observability data in a reactive methodology where they're just putting out fires all the time to being able to truly switch over into this proactive mode where they're, looking at things ahead of time and, and trying to make sure that the systems are actually performant in the right way so that they can right size things. I know it's, it's just a very difficult thing to make that transition in some companies. So how would you suggest folks do that?

Hilliary Lipsig: 13:31

And so I wanna actually point out here that Red Hat is no different. We're struggling through some of these same problems right now. And one of the initiatives I am responsible for at Red Hat is something called hybrid SRE, which is an interesting name. And this is the idea that we stop throwing code over the wall to SREs. I spent 11 years in quality engineering before I moved to SRE, and I experienced in quality engineering how bad throwing code over the wall can be. And so what I did when I was head of quality engineering in my last role was I actually had the quality engineers working on the engineering teams, working on unit testing, making unit tests a little bit more like the standard way quality engineering thinks about testing, like integration and regression, and shifting that left into the unit testing framework. Especially with microservices, that's really easy to do. It was much harder in the monolithic architectures. And we really just need to be doing the same thing, that same type of partnership, with SRE and service engineers, which is what we call them. And so we've kicked off this initiative, and it's about getting engineers closer to their operations and the operations folks closer to the engineering, so that you've got better communication, better partnership. And a lot of that is actually process engineering, where we're putting together processes of like, okay, here's how we're going to do these things. And so by putting in these processes that generally make the software more reliable and give it a higher baseline level of service maturity out the gate, that actually frees up the SREs to do a little bit more of that proactive work. So it really requires a strong partnership between SRE and the software developers who are on the service, kind of dividing and conquering.
So SRE should be putting out guidance and best practices to their engineering teams to say, hey, in our environments, this is what we know. These are the good patterns that make for performant software. And then it's incumbent on engineering to deliver software that matches those patterns. That helps. And then it must be an ongoing partnership of like, okay, how are we performing? How many incidents are we getting? What kind of trends are we seeing in the incidents? There's a great practice that our platform SRE team does, where they take some major incidents, especially anything that's recurrent, and they put together what we call a tiger team, and they'll go and do a deep dive, and it's cross-functional. And so that's really the thing that I have to say is the way to get there. It's what we've been doing. It's what we see working. We've seen some really great outcomes of, you know, certain failures going away completely or being reduced. And we also have requirements around toil. If you're having SRE run your service, then our toil levels must be below 50%. If your service breaches that, then we can't keep our commitments to our customers, which means that engineering teams must come on and start actively doing that toil until they can basically help automate it away or resolve the issues within the service that are leading to that toil, that manual labor. So there's kind of a lot of little pieces that go into this, and it's about relationships and process primarily. And I would just say you need to implement these things iteratively. If you are looking holistically at the problems in your organization or the problems with your communication, there's gonna be some really big issues, and then there's gonna be some probably low hanging fruit. And definitely just do that. Here's a low hanging fruit thing that we could be doing better. Let's go solve that.
And I love the cross-functional tiger teams for that. Similarly, we have a cross-functional chaos engineering game where we play with services and we break them and do incident response, and engineers and CEE and SRE all take turns playing each other's role in an incident response to learn about things. And we usually get good insights into our software out of those games as well, which results in better, more resilient and reliable software.
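
The toil requirement described above boils down to a simple ratio check. Here is a minimal sketch, assuming toil is tracked as hours per service per period; how Red Hat actually measures toil is not covered in the conversation, and the function names and hour counts are hypothetical.

```python
TOIL_BUDGET = 0.5  # toil must stay below 50%, per the conversation


def toil_fraction(toil_hours, total_hours):
    """Share of SRE time on a service that went to manual, repetitive work."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return toil_hours / total_hours


def breaches_toil_budget(toil_hours, total_hours, budget=TOIL_BUDGET):
    """True when toil is at or above the budget, so engineering should step in."""
    return toil_fraction(toil_hours, total_hours) >= budget


# Hypothetical week of SRE time on two services.
print(breaches_toil_budget(toil_hours=22, total_hours=40))  # True
print(breaches_toil_budget(toil_hours=10, total_hours=40))  # False
```

The comparison uses "at or above" because the stated requirement is that toil stays below 50%, so exactly 50% already counts as a breach.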

Jim Hirschauer: 17:43

Wow. I feel like maybe I need to have you back on the show for a completely separate show about that topic chaos engineering and, and working on resiliency. That's a huge topic in and of itself.

Hilliary Lipsig: 17:54

Yeah, it is. And these are all just pieces of how we're putting together this bigger holistic data story at Red Hat, right? That's one of the things that feeds into our overall service health index that we're working towards, and our overall data story of here's here is how we are doing, here's how things are, and then ultimately working towards even more proactive work.

Jim Hirschauer: 18:18

Yeah. You know, I do work for Harness, and what you said really resonated with me. At Harness we are building our software to align exactly with what you were just talking about. So, you know, trying to make sure those processes are in place, and automating the process across the software delivery lifecycle to ensure quality code, reliable code, making its way through the lifecycle so that when it hits production, you're in much better shape overall. So from a philosophical perspective, I think we're completely aligned there. It's really hard to do in practice without good tooling to help you. Process can be very difficult in and of itself, and it's hard to control that unless you actually automate it in some ways, is what we've been finding.

Hilliary Lipsig: 19:03

Yeah, absolutely. A hundred percent. And there are a lot of pieces of the things that I talked about that are automated. And even as much as the chaos engineering stuff, right? Lots of chaos engineering things exist. We run it as a game because it's team building as well. Yeah. And so that also is fun, cuz people get to be red team and they have to reverse engineer their software. How am I gonna break it? We run it as a live game, but there's all kinds of automation opportunities, places where GitOps can really come in. We use GitOps to actually solve alerts in the SRE teams at Red Hat. So places where things like that can really come in and bolster the actions. Automation should really be about making your humans more efficient, right? And so any kind of automated tool that does that is probably a great idea.

Jim Hirschauer: 19:47

Absolutely. Alright, well, we are out of time on our main topic, and it's been really interesting and insightful, and I really mean it. I think I would love to have you back on the show at a later time where we could discuss more about these, you know, chaos engineering game days that you all put on. I'd love to hear some detail about that. So if you're willing, I'd love to have you back on the show.

Hilliary Lipsig: 20:07

Oh yeah, absolutely. That was an initiative I'm extremely proud of. It was actually one of mine. It's not like I came up with the game. It follows the capture-the-flag style pattern. Mm-hmm. But the initiative was mine, and it's caught on and it's gone very well. I'm very proud of that one.

Jim Hirschauer: 20:21

Okay. Yeah, we will definitely save that topic and talk about that on a future show. Right now we're gonna transition to just for fun, number two. So, Hilliary, outside of work, what's your favorite hobby?

Hilliary Lipsig: 20:34

My favorite hobby is called HEMA. It is Historical European Martial Arts. And it is a style of sword fighting. Specifically, I'm learning what is called the Meyer system, and it is a late German style of fencing, like what knights, you know, would've done and would've trained in. Okay. And it also includes, in addition to the sword techniques, grappling and wrestling techniques as well. Oh, wow. It's very fun. For people who follow me on Twitter, you will see there's a clip from my first tournament, first match, of me just getting whacked on the head.

Jim Hirschauer: 21:07

And what's your, what's your Twitter? What's your Twitter handle real quick.

Hilliary Lipsig: 21:11

Caffeinated integrations into caffeinate. It's at Int, the number two, and then Caffeinate (@Int2Caffeinate), which I cannot spell because I have dyslexia.

Jim Hirschauer: 21:20

Okay. No, no problem. I think people will be able to find you.

Hilliary Lipsig: 21:25

Probably. Probably, yeah. So it's a very fun, somewhat expensive sport. The swords we fight with are 48 inches long, so that's four feet in the Imperial system and about a meter and a third [closer to 1.2 meters] in the metric system. And yeah, so they're kind of heavy. They're big. Yeah. And it's great stress relief though, because when you have all the armor on and everybody, you know, is fighting with control, that's a really big thing with the sport: you must use control, cause the swords are heavy and dangerous even though they're training swords. And then after that it's great stress relief, cuz you're just hitting your friends with a really big stick.

Jim Hirschauer: 22:04

I love it. It's, I've, I had never heard of this before you mentioned it. It's, it's like this whole new world of activity that I had no idea existed. It's amazing.

Hilliary Lipsig: 22:14

It is an extremely fun sport. It is a very friendly sport. The people in HEMA are, it's just the type of culture where after a tournament match, right, you've just been wailing on each other, right? You hug, you give a big hug, just hug it out, right? The whole thing. People are smiling, they're hugging, you're sweaty and disgusting. This is a stranger, and you're still hugging this person full force, as hard as you can, because that's the type of attitude that it has. And if you've done other types of martial arts, I will tell you that the tournament vibes are not the same. This is a very joyful sport with some of the absolute coolest people I have ever met, including my best friend and her husband, and my own husband participates.

Jim Hirschauer: 22:56

It sounds amazing from the way you describe it. I, I, I definitely wanna check it out. I live in Austin, so it's highly likely there's a, a place near me that can teach me this. And it sounds like it's good exercise. It's, it's, I'd imagine it's really strenuous.

Hilliary Lipsig: 23:10

It is a full body exercise. Like you, in order to be doing this correctly, you must be engaging like every single muscle. I was at a class last night and everything hurts, but it's like the good type of hurt.

Jim Hirschauer: 23:22

Yeah. Alright. Okay. Listen, Hilliary, thank you so much for being on the show and sharing your new word with us. I love that "ooming" as a verb. So remember that our listeners need to start using that whenever they see out of memory exceptions. I love that you shared a completely new sport and hobby with me, so I'm excited to dig into that. And your main topic was just incredibly insightful. I think it's something that I never really considered: all that we can do with observability data, all that's possible, and all the really good reasons that we should do it. So you're very humble, but you are super insightful. So I just wanted to thank you for everything that you shared today.

Hilliary Lipsig: 24:06

Well, thank you so much for having me on. This was a real joy for me and I'd be happy to come back anytime.

Jim Hirschauer: 24:11

Fantastic. I'm looking forward to it and to all of our listeners, if you are an SRE or if you're in a related role and you want to be a guest speaker on ShipTalk, please send an email to podcast@shiptalk.io and we'll get back to you. That's all for now. Until next time.

ShipTalk - SRE, DevOps, Platform Engineering, Software Delivery
