
Customer Support Leaders
264: Mastering Incident Management - Part 1 of 6; with Kat Gaines
Unlock the secrets to seamless service as we navigate the critical world of incident management with the expertise of Kat Gaines from PagerDuty. Prepare to be armed with a deeper understanding of the complexities that define an incident, and why this knowledge is a game-changer for teams committed to providing uninterrupted excellence. Kat's seasoned perspective will guide you through the layers of incident severity and the essential preparedness for the unexpected. This episode isn't just a conversation; it's a toolkit for anyone in the support and development trenches, laying the foundational strategies for robust incident response.
Embark on a journey through the collaborative core of incident management, where the role of support engineers is reimagined as a pathway to career growth and the position of an incident commander becomes the linchpin of crisis resolution. With a focus on the symbiosis between structured protocols and the agility to adapt, we dissect how cross-functional cooperation and continuous process refinement are the cornerstones of an organization's resilience. From prioritizing responses to managing bugs, this dialogue with Kat is more than just an exploration—it's an invitation to evolve your practices and prioritize your customers in the face of adversity. Join us for an eye-opening series that promises to redefine your approach to incident management and elevate your team's capabilities.
Hello and welcome to Episode 264 of the Customer Support Leaders Podcast. I'm Charlotte Ward. Today, welcome Kat Gaines in Part 1 of a six-part series on incident management.
Charlotte Ward:I'd like to welcome to the podcast today Kat Gaines. Kat, absolute pleasure to have you join me. Thank you so much for coming onto the podcast. I'm going to drop a little teaser right up front that today is episode one of a six-part series that you and I have planned out on a very important topic. But first, would you like to introduce yourself?
Kat Gaines:Yeah, thanks for having me, Charlotte. I'm really excited for this too. I think we were talking about this series and we're finally getting to manifest it, and it's really exciting because it's something I want to talk about a lot right now. So, yeah, I'm Kat Gaines. I am a developer advocate at PagerDuty currently and, for folks not familiar with PagerDuty, we're an operations management platform. You may know this mostly in the context of incident response, which, spoiler alert, is a little bit what this series is about. I've been at PagerDuty for almost 10 years now. Before I was in my current role, I used to run their customer support organization. We started with like five people in the San Francisco office, and then when I left we'd scaled to 30-ish people globally. And so I've seen a lot in that time, not just at PagerDuty but in my career in general. And being, you know, kind of the leaders in the incident response and incident management space, we have learned a lot from our own processes, which, I guess I'll stop giving things away. There's a lot to talk about there too over these few episodes.
Charlotte Ward:Oh man, there really is. I am so pleased to have the opportunity to unpack this in what I hope will be great and practical detail. This six-part series is not a bunch of theory.
Kat Gaines:Like incident response, I get to gloss over it a lot when I'm speaking at a conference or something, but we get to deep dive with all this time.
Charlotte Ward:Yeah, yeah. And I think if people stick with us, they're going to come away with some really practical things they can do after every single one of the next six episodes. I am going to space them out, I think. We don't want to commit people to a six-part course at high speed. They need to take their time with this and absorb the why and the what of each stage. So definitely, we're going to get all six out there pretty soon. But watch this space, because you need time to listen, time to think about what it means to your organization, don't you? And then exactly how you're going to implement it. I'm really excited to get some really actionable advice from all of these conversations coming up. So wonderful. Let's kick off, then, with the very, very simplest question: what is an incident?
Kat Gaines:The basics, yeah, so I threw around a lot of terms in the intro that a lot of people are going to be familiar with, but you might be encountering it for the first time and it's okay if that's the case. So an incident is kind of what it sounds like. It is an incident. It is something happening that is out of the norm of daily operations. We hear people call them different terms here and there, but really it's some kind of issue that needs attention in short order, whether it's drop everything and deal with that now, or okay, we know that's hanging out there, maybe we don't have to fix it right away. There's such a thing as a lower level or lower grade incident, as well.
Kat Gaines:But it's an event that is somehow disrupting the work you're doing, and when I say the work you are doing, I mean kind of anyone. If we're talking to customer support teams, it might disrupt your work helping customers and working with them. For an engineering or development team, it may disrupt their ability to continue working on what they're trying to build and maintain. And then, most importantly, because this is the part that customer-facing teams see the most often, it will disrupt what customers and users are trying to do with the products, the software, the services, whatever it is that we're providing as businesses, right? And so that's where it really comes into play, and it's a lot of what we're going to be talking about over this series too: this is something that your customers often see. We have a lot of theory out there in the world around trying to prevent issues before your customers see them, and that's a beautiful goal. It doesn't always happen. An incident is often that goal not happening.
Charlotte Ward:Right, right and preventing it. You want to catch an issue.
Kat Gaines:Yeah, you want your monitoring to pick it up. You want to be able to say just going to fix that before it becomes customer facing. Stuff happens and that's okay. Your customers can honestly be a huge asset in understanding your systems and the reliability of them as well.
Charlotte Ward:That's very true. That's very true. You touched on a couple of things there that I just think it's worth highlighting. One is that you said an incident can have different priorities. I think we always think an incident is kind of outage level: everyone drop everything, this is affecting tens or hundreds of customers, and it's all hands on deck. But it's important that we realize that really, at the very basic level, exactly as you said, it's the unplanned, isn't it? And I'm going to throw an idea out there: a good incident management system allows you to plan for the unplanned. Would that be fair?
Kat Gaines:Exactly. Yes, no, that's exactly it. I think that one of my teammates has delivered a talk at some point called Plan for Unplanned Work.
Kat Gaines:That's very on the nose. That is exactly what incidents are. You mentioned terminology too, where you said an outage is usually what people think of, just a full-scale, high-priority P1 issue. You might not always call an incident an outage. There is a talk that I give at conferences on occasion around incident communication, which we'll dive into a little bit later in the series, but outage isn't always the right word. Sometimes it's just a service disruption, or sometimes it's just kind of an issue, right?
Kat Gaines:If you say outage, that is a very intense, visceral word for people that sort of just evokes a panic response. Understandably, you don't want your customers to panic, right? And so I think, yeah, understanding the differences of the scale of incident is one of those kind of bread and butter pieces of understanding what an incident is and how you approach it.
Charlotte Ward:Yeah, and I don't want to fast forward us to the terminology, you know, but each of these terms has a specific meaning, and that speaks to the importance of things like common understanding across this whole framework, across all of the processes and responses that run from it, including how you talk to customers, right?
Kat Gaines:Yeah.
Charlotte Ward:So, talking about customers then: as customer-facing teams, why should we care about maybe not just that terminology, but getting incident response right? Why do we need to plan for the unplanned? Because, let's be honest, most of the time we conflate incident and outage for a reason, and that's because most of these outages or high-level service disruptions do need everyone on hand all of the time. So isn't that just what we do?
Kat Gaines:It is just what we do, and you're right. I think that why we need to care really comes into play around the process, and around being involved and feeling appropriately involved. It can be so easy on a support team, or any other customer-facing team, to be kind of the last team thought of in these moments, and you want to flip the script on that if that's what's happening in your organization right now. Maybe you hear about processes after they're already built out and written in stone and can't really be changed, or they're not open to change. Or you often find yourself in a really reactive space of hearing from customers and going, oh, what's the process? Where do I report this? What does this look like? You know it's something bigger than just a bug, it's something really large. Or you're in this space of knowing there's something going on but not even getting that data back to the engineering or product teams around what you're hearing from customers. Often they will see what's on the back end, they'll see what's in the monitoring. They may not see the actual symptoms that customers or users are experiencing in the moment. So if you're someone who finds yourself in any of those places, that is why you have to care: you want to change those experiences for you and for your team, to have a better overall experience while responding to an incident.
Kat Gaines:I really strongly believe that a moment of incident response, for customer support or for anyone, honestly, should be as calm as planned work. Obviously it's a disruption. Obviously it's a little bit of an adrenaline rush. We can't fix that completely, but it should be something that you feel calm around because you know what the process is, you know where to go, you know who to talk to, you know what happens next, you know how you're involved, what your expectations are, what to expect of other people, and you know what to tell your customers. You're never in the dark going, I'm going to just make something up because I don't know what's happening. Right? You don't want to be that person. Yeah, that's really why CX cares: because you want to be in that space of feeling informed and involved and empowered in the process, rather than just a bystander reacting to it. It's about flipping from that reactive to a more proactive state, even when you don't know something's coming.
Charlotte Ward:Yeah, I think you're absolutely right, and I think that moving away from being the downstream consumer, as much of a consumer of that process as the customer is, is a reason why you should care. Because if you're only ever the last stop, then everything's happened before you. What control do you have over any communications? What feedback loops do you have? It feels like if you're only ever having it thrown at you, you can only ever react and you can only ever push it further down to the customer. You're not contributing to the improvement, which is ultimately what we want out of any good incident management. I won't say process, but out of any good incident management, we want improvement for the next time because, as we said right at the top, part of the outcome should be prevention, right?
Kat Gaines:It is. Yeah, it's continuous improvement, it's continuous learning. And to that point you made about being in the same place as the customer, getting the information at the same time and reacting: that's not possible, that's not acceptable. Customers are looking to you as the face of the business to inform them on what's going on. So imagine that you're an agent sitting in the queue and you open up a ticket and a customer is asking, oh, what's going on? I'm seeing this symptom. And you're like, I don't know.
Kat Gaines:And then you find out that someone could have told you about it, maybe an hour before, even a few minutes ahead of time. That's a really crappy feeling, and so it's about being able to feel like, okay, no, I am the voice of authority. I'm not just, you know, the last person in line, the last person everyone's thinking of. I'm a voice of authority here for this customer, for this person, and they can trust me, they can trust what's coming out of my mouth, they can trust that I have a handle on the situation and that I'll keep them updated too.
Charlotte Ward:So they know that, by proxy, they can trust our business. I think it's really important to think of yourself as a customer inside the business when you're in support, because part of being downstream of everything that comes out of product and engineering and sales and marketing and everything else is that you're not just a consumer. You are a customer. You do have rights and expectations that allow you, or at least that should be in place to allow you, to deliver a good service, right? I think it's really hard to deliver a good customer experience for your customers if you're getting a crappy experience as a support organization yourself. And I think that's as much true of supportability, which I talked about recently with Alexis, as it is of incident management. So I think being a customer puts you in the right frame of mind. It makes you a key stakeholder and allows you, as you said at the top, to flip that script. I've got one more question before we dive in. I know you're going to begin to tell us about how PagerDuty handles this, and we will definitely talk more about this over the next five episodes of this series. But one question is, and I suspect I know the answer, and the answer, I think, is that it depends on the organization. I'm just going to put that out there.
Charlotte Ward:But the question is, who should own the incident management process? And I'll tell you now that in my organization, support owns it. It's a deeply technical product. It's a product where we work very closely with engineering anyway, very closely with our professional services and account managers, and obviously very closely with the customer over extended periods of time. And so we really are at the center of that customer relationship when it comes to technical incidents and the technical resolution needed on those incidents. So for us, flipping that script and being at the center of the incident management process is really key at this point. Now, that might not always be the case, and certainly we are iterating and improving, like everyone should, right, on all of their processes, but right now we own it. Incident managers come out of support. They are support engineers, they're frontline folks, you know. So it's basically a hat they wear, which is also a function of the size of organization we are. But I'll throw the question to you: who should own the process?
Kat Gaines:You're exactly right that it depends on the organization. So the structure you just laid out is super valid, and it's also pretty amazing and empowering for a support team to be able to have that kind of ownership over that org-wide process. Honestly, it's a career-building move as well, to be able to say that I own this piece of the process, and I'm not just an expert in the support sense, being a deep expert on this product and providing fantastic support, but I also now know things about incident management and incident response that I can continue to take further in my career, whether that career continues in support or whether I pivot somewhere else down the line, right? And then, to that point, there are companies that make incident management a role in and of itself. It is a salaried role in which you are working on incident management. That is your job: to think about the reliability and then the response, and to continue to iterate on that process.
Kat Gaines:What we do at PagerDuty is that it's co-owned across the organization. There is a team of people who have their day jobs, which may be different jobs across the organization, but there's documentation around incident response. So if you go to response.pagerduty.com, we have these guides around incident response that walk you through different steps. People will see the term incident commander a lot. The incident commander is someone who effectively owns the incident while it is active, and so that person is the authority at the end of the day.
Kat Gaines:They are the person who is making the decisions. They're not making those decisions in a vacuum. They're sourcing the opinions and takes from everyone who's involved in the response process itself, but they're responsible for saying this is what we're doing next, this is how we're handling this, and basically being the final call on any decision made during incident response. So it's a hefty job. It's a lot of responsibility. We have training that, again, we've made available externally, but that we go through internally to make sure that folks who volunteer to do this are ready to go. They can be from any part of the org, though, and then they're the person who works with everyone else. Again, in our documentation, and we'll probably talk about this later on in the series too, there are a number of different standardized roles that we outline, and the incident commander is the person working directly with those people to ensure that everything is going smoothly in the way it needs to.
Kat Gaines:So that's another way to do it. Obviously, it's how we do it. It's not what we recommend as a hard line, this is what you must do, because, again, there are so many different ways to own this and to make it happen. I think the key is making sure that the departments who will be impacted are involved in some way.
Charlotte Ward:Oh, that's so important.
Kat Gaines:Yeah, everyone who's going to be touched by it, no matter who's owning it the engineering and product teams, the support teams, the marketing teams who are going to have to be handling things like maybe, angry tweets, things like that that you don't necessarily want to have to see, or, in a worst case scenario, coming up with larger corporate level messaging if something really intense has happened.
Kat Gaines:Everyone who is going to be touched by this and affected by it has to be involved in some way, and that's not to say, for example, that you need to have maybe a representative from the sales org on every single incident response call. You can group functions by what they do. So customer facing, for example, customer support, is often plenty to be able to be both the voice of the customer in that call and then to also keep the other customer facing teams informed externally of the call, so they don't have to jump in and say I'm the sales rep for this big account, they're blowing me up, what's going on. Instead, they know where to go for information. So it's not that everyone should be on calls, but everyone should understand how they get information and at least be informed, if not involved in designing the process itself.
Charlotte Ward:And I think that's right and I think it's. I should say right now that we based a lot of our incident response on the PagerDuty model. So you'll see, as we talk over the next six weeks, six episodes, you'll hear a lot of our stuff echoed back to you. I'm sure we have the role of incident commander. Incident commanders for us are always support engineers. Again, right now might change, but right now they're support engineers.
Charlotte Ward:And that incident commander role really is, as you described, kind of the hub for the incident, isn't it? That incident commander is aware of the status of every part of that incident at all times. That doesn't necessarily mean they're working the engineering resolution or they're communicating with customers, but they are the person who is there and knows, and is ensuring that it moves forward to resolution, that process is followed, and that all the comms that need to happen happen. For us it works that it's support, because support is a 24 by 7 team. We already have solid handover processes for everything, and no other team really is 24 by 7 at the moment. So for us it's natural that, given that we've got good handovers and we can train people inside a single team that is already 24 by 7 and customer facing, they can kind of wear two hats again.
Charlotte Ward:There's a lot of double hat wearing in smaller organizations, isn't there?
Kat Gaines:Oh my gosh, there's so much.
Charlotte Ward:Yeah, there really is. Double, triple stack of hats, like the Mad Hatter some days, right?
Kat Gaines:The hats just keep going. It's like one of those nesting dolls.
Charlotte Ward:It really is. It really is. But, you know, when you're wearing that hat, that is your primary responsibility. Actually, for us, you might continue to own customer comms but hand off that incident commander hat to another support engineer or eventually, as you said, other people in the org who are trained to take on that role. But I think that for me that's the key: a big part of being an incident commander is that they have to be available. Unplanned things happen 24/7, so you have to make sure you've either got people around or an on-call response, I guess, for those incident commanders.
Kat Gaines:Exactly. Yeah, you have to either be set up in multiple regions, which, when you are, can work out really beautifully, or you have to feel very confident in how you set up your on-call resources. And just to touch on that for a minute, obviously that's a lot of what PagerDuty does and our product does, so I've spent a lot of time thinking about this too over the years. Setting that up well doesn't just mean having someone on call. It means having people on call, having a backup, a secondary at least, on call, and figuring out the severity of an issue and whether you need to add additional levels for what happens if this person doesn't pick up, right? And then it also means ensuring that you're balancing that on-call responsibility with day-to-day work in a way that people don't burn out.
Kat Gaines:So that is something that I think about all the time too: just burnout and capacity, both in the support sense and in the on-call and incident response sense, right? It's something that is high capacity in both types of roles, and especially when you combine them, it just kind of balloons.
Kat Gaines:And so it's about being able to ensure that people are getting the breaks they need, that you can, like you said, hand that hat off to someone else, and that you can have things like rules in place.
Kat Gaines:So maybe there's an incident that's running a long time and you can say, okay, after X number of hours, or however long it is, this person has to go off call and hand off to the backup or the secondary so that they can just go get a sip of water, go rest for a few minutes.
Kat Gaines:Maybe, if it's, you know, worst case scenario, something that's going on for days, they can just go back to their normal routine for a moment, know that someone else has it covered right. That comes obviously with good handoff hygiene and making sure there are clear notes around what's going on so that you can equip that next person. But really ensuring that there's the capacity there for folks to cover for each other and to avoid burning out on being on call so fast because it's something that can just accelerate your level of burnout instantly if it's not done the right way, and so the people care is a huge part of incident response process as well. I think we don't always talk about it as much, but it's the same thing I was saying earlier, where the goal for me is to have it be as calm of a process as day-to-day work, and the goal is also to make sure that everyone is getting their basic needs met.
Kat Gaines:They feel cared for, they're getting compensated with maybe some extra time off if they've had a particularly rough on-call shift, and that we're acknowledging the human aspect of this as well, not just the reliability or customer facing aspect.
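The on-call setup Kat describes above (a primary, at least one backup, and a rule that forces a handoff after a set number of hours) can be sketched in a few lines of Python. This is only an illustrative sketch, not PagerDuty's product or API: the `Responder` type, the `page_in_order` function, and the four-hour cap are all assumptions invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Responder:
    """Hypothetical on-call responder; not a real PagerDuty object."""
    name: str
    hours_on_incident: float = 0.0  # time already spent on this incident


def page_in_order(
    responders: Sequence[Responder],
    acknowledges: Callable[[Responder], bool],
    max_hours: float = 4.0,  # assumed cap before a forced handoff
) -> Optional[Responder]:
    """Return the first responder who is under the hour cap and acknowledges."""
    for responder in responders:
        if responder.hours_on_incident >= max_hours:
            continue  # due a break: skip them and hand off to the next level
        if acknowledges(responder):
            return responder
    return None  # nobody picked up; a real policy would escalate further


# The primary has been on the incident for five hours, so the page
# goes to the secondary instead.
primary = Responder("primary", hours_on_incident=5.0)
secondary = Responder("secondary")
owner = page_in_order([primary, secondary], acknowledges=lambda r: True)
print(owner.name if owner else "unacknowledged")
```

The ordered list plays the part of the escalation levels, and the hour cap is one way to encode the "after X number of hours, hand off" rule so that nobody has to remember to enforce it mid-incident.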
Charlotte Ward:Super important, absolutely. The final thing I really want to touch on before we wrap this session, and I know we're going to deep dive into a lot of this over the next few episodes: I think we started with what is an incident, but I think it's worth just bouncing around for a couple of minutes what triggers an incident and, through that, maybe a little on triage, because I think part of the people care is making sure that you only call people when you should call people. So what does that look like? Just at the simplest level, what does a good incident trigger and triage look like?
Kat Gaines:Yeah, so your trigger can come from multiple sources. It can come from your monitoring: something is not meeting expectations, something is fundamentally broken. I'll give a really dramatic example for PagerDuty, because it's a lot of what we do: if one of our notification providers is having a service issue, we can't send notifications out, right? That's a huge part of our product, and so folks who are listening can probably think of comparable examples. These are the types of things that trigger: something is not working as expected. And triage for it really looks like understanding scale.
Kat Gaines:We were talking about those priorities earlier. Something not working as expected can often just be a bug, and so, however you handle bug management in your organization, you don't even have to touch the incident response process if it's low level enough that, okay, this isn't working as expected. Really, having good definitions, not just for incident response but for any state of something not working as expected, is crucial, and revisiting those definitions often is also crucial. So when we're talking about the human care piece, we can't rely on knowledge from like six years ago to scale with what your team is doing today.
Kat Gaines:Things change, products shift and evolve, people shift as well, processes do. I'm sure that many of us have seen situations where there are, for example, reorgs of internal teams, like engineering teams shift, and so you can't rely on all of that change to maintain the same process or priorities. Different people will have different priorities, different leadership will come in, and so you need to continue to revisit those all of the time. Honestly, an annual process is a great idea, even though it can feel a little bulky. Once you've done it once or twice, it will just be a quick update or two in the next couple of iterations and, hopefully, in future iterations. And so, just start on that small level: let's put incidents out of the way and let's just talk about bugs or product defects of any kind. That's something to immediately say, okay, we know what that looks like, we know what the process is for it, we have really clear definitions of priority. That can be hard to get to, but, again, building out that process of revisiting that often will help, and building really strong partnership across customer-facing teams and development teams really helps on that side as well. And so it's just having that in place and then using that as a jumping-off point for incidents. Again, you're going to have different priority or severity levels of incidents, and it's going to be okay.
Kat Gaines:This is definitely something that's bigger than a bug. It needs a more urgent response of some kind. We can't just file it in a backlog and hang out and wait. However, the scale of how many people are called in, of even just how many different teams are called in, and of the expectations we set externally (because by the time we're talking about an incident, we're talking about eventually communicating externally in the near term) are all going to be impacted by that severity level. And so it's really looking at what are the possible things that can go wrong. You can look at the history of things that have gone wrong, because the minute that you have a company, you have things that go wrong on some level, in some way, every single day, if we're being honest. It's looking at that history and then using it as that jumping-off point to categorize and prioritize. If this happens, how serious is it? If it happens to this percentage of our users, is that a threshold of some kind?
Kat Gaines:If it happens for this amount of time and we realize that it's not self-resolving within X number of hours. That's some kind of threshold. So all of those things are going to play into how you prioritize. It gets really complicated. Again. It's going to be a lot of partnership across the org to make sure that you get it right. And even when you get it right, don't get attached to it. You have to be willing to kill your process one, two, 10 times over to make sure that it actually fits everyone's needs. And then it's also making sure that people feel like they can speak up for their needs in that as well. So often when we talk about designing process, it happens up at the top level with the leadership right.
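The threshold questions Kat raises above (what share of users is affected, and how long the issue has gone without resolving itself) can be written down as an explicit triage rule. The sketch below is a hedged illustration only: the `Report` type, the `classify` function, the category names, and every numeric threshold are hypothetical, and each organization would set and regularly revisit its own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Report:
    """Hypothetical summary of something not working as expected."""
    percent_users_affected: float  # share of users seeing the problem
    hours_unresolved: float        # how long it has gone without self-resolving


def classify(report: Report) -> str:
    """Map a report to a response path using made-up example thresholds."""
    # Wide impact or a long-running problem: call in multiple teams.
    if report.percent_users_affected >= 25.0 or report.hours_unresolved >= 4.0:
        return "major incident"
    # Noticeable impact that needs a more urgent response than the backlog.
    if report.percent_users_affected >= 1.0 or report.hours_unresolved >= 1.0:
        return "minor incident"
    # Everything else goes through normal bug management.
    return "bug"


print(classify(Report(percent_users_affected=0.2, hours_unresolved=0.5)))
print(classify(Report(percent_users_affected=3.0, hours_unresolved=0.5)))
print(classify(Report(percent_users_affected=40.0, hours_unresolved=0.1)))
```

Keeping the thresholds in one small function makes the annual review Kat recommends a matter of changing a few numbers, and gives the people doing the work something concrete to push back on.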
Charlotte Ward:Yeah.
Kat Gaines:That's great. There are people who are seasoned in their career. They are empowered and feel good about making decisions. It's one of their favorite things to do. They don't actually touch what's happening on the ground every day, though.
Kat Gaines:Great leaders do sometimes, but it's honestly just not their priority, and that's not a flaw of leadership, it's just the reality. And so it's about ensuring that the people on the ground in the work are able to have a voice in that process and say, you know, I feel like if this happens, it should be higher or lower priority because of these reasons that I'll list. It's giving them a forum and some kind of accessibility to the people who are making the final decision to say, you know what, this is what I feel really influences this, and understanding that their voice is important in that process, so that they don't feel belittled or set aside, which can happen very easily sometimes. And, honestly, if that's happening to you, go find an organization who values you. You don't have to put up with that. But yeah, from the leadership side, it's making sure that that forum is available for folks and accessible and, again, having the human aspect of it folded in.
Charlotte Ward:Yeah, I couldn't agree more, and I think one final thing I would say on that is that, from my point of view, it's super important that we remember who the consumer is: the customer, at the end of the day. And so whatever you're doing might be uncomfortable. The thresholds might feel not quite right to you in any part of the organization. You know, engineers might say, no, I don't know if I can cope with this level of whatever it is, this is too much or too little. But ultimately you have to do the right thing by your customers, and if that involves re-engineering processes for you to be able to deliver that to your customers, then that's the right thing to do. It's definitely going to take iteration to get there, but being comfortable with that iteration is super important as well. Yeah, this has been an amazing conversation. I always knew it would be. I've been so looking forward to this and I'm so looking forward to the next five.
Kat Gaines:I can't wait. They're gonna be so fun. I've just taken so many notes while we've been talking about, oh, we need to talk about this in that episode. So it's gonna be like eight.
Charlotte Ward:It's gonna be like eight by the time we're finished. Yeah, let's throw the plans out and we'll just keep talking. Amazing, that's fine. Well, I've a sneaky peek of the next five planned episodes, but if we need to add a few more on to iron out the kinks of me throwing a few random curve balls at you, then I'm very happy to do this, because this is super amazing.
Charlotte Ward:I don't need to ask if you'll come back, because I know you are.
Kat Gaines:Sure will.
Charlotte Ward:Yeah, thank you so much, Kat. It's been lovely and I will talk to you very soon. Thank you very much.
Kat Gaines:Yeah, thank you too.
Charlotte Ward:That's it for today. Go to customersupportleaders.com/264 for the show notes and I'll see you next time.