Quality Bits

Testing in Production with Rouan Wilsenach

January 23, 2023 · Lina Zubyte · Season 1, Episode 11

Testing in production: scary abnormality or familiar regularity?

In this episode, Lina talks to Rouan about a QA role that is more than just testing, and about testing in production. You'll hear about the misconceptions around testing in production, the requirements for doing it reliably, and even how to clean up your messy alerting systems.

Find Rouan on:
- His website: https://www.rouanw.com
- Twitter: https://twitter.com/rouanw
- LinkedIn: https://www.linkedin.com/in/rouanw/

Mentions and references:
- Rouan's article "Is QA Dead?"
- Rouan's writing on testing in production
- Elisabeth Hendrickson on checking vs. testing
- Kent Beck's talk about testing at Facebook

Follow Quality Bits host Lina Zubyte on:
- Twitter: https://twitter.com/buggylina
- LinkedIn: https://www.linkedin.com/in/linazubyte/
- Website: https://qualitybits.tech/

Follow Quality Bits on your favorite listening platform and Twitter: https://twitter.com/qualitybitstech to stay updated with future content.

Thank you for listening! 

Lina Zubyte (00:05):
Hello everyone. Welcome to Quality Bits, a podcast about building high-quality products and teams. I'm your host, Lina Zubyte. In this episode I'm talking to Rouan Wilsenach, who has written so many great pieces about the QA role or testing in production. We talk about testing in production, DevOps, the ownership and responsibilities of team members when it comes to those, as well as how to even clean up your sometimes messy alerting systems. I feel it's an extremely interesting episode and I really hope you enjoy it.

(00:57):
Welcome Rouan to Quality Bits. It's very nice to have you here.

Rouan Wilsenach (01:01):
Thank you. It's very good to be here.

Lina Zubyte (01:03):
I've been following your work for quite a while. It's funny that I did not have any kind of contact with you and did not know you, but I knew your work before I got to know you. We started working remotely and we used a website to find some ideas for how to make our standups more interesting (it was fun standups), and then we wanted to add some new ideas and we were like, oh, maybe we could contribute. And it turned out it was yours. So this is how we started talking. It's such a wonderful website, and by the way, I just suggested it to my colleagues as well. So how would you describe yourself to the ones that do not know you or your work?

Rouan Wilsenach (01:48):
I am a software engineer. In my background I've played a variety of different roles, somewhere in between leading teams, helping organizations do software differently, and running some code as well. I currently work for a company called Haven. They're based in the UK, in the leisure industry, and I'm doing some TypeScript and Node and AWS there. I'm based in Johannesburg in South Africa. Most of my time I spend with my family. I've got a lovely wife and two little kids, so we spend all our time together. If I get another chance, I love to write, so I do a bit of writing as well.

Lina Zubyte (02:24):
Nice. Are you writing related to work or fantasy fiction?

Rouan Wilsenach (02:30):
Lots of different things. I actually studied poetry a number of years ago, but I'm working on a novel and yes, I do write the occasional thing kind of software related as well.

Lina Zubyte (02:40):
That's super exciting. And it's also quite common, because I do know quite a few people working in tech who write on the side, which maybe shows that we have our creative spark as well and want to express ourselves in that way too. When we talk about building quality products, what do you think is the foundation of building those?

Rouan Wilsenach (03:04):
Oh, that's a good question. I mean I guess this will be a slightly different answer for everyone cuz it depends a little bit on what you're building, what quality would mean. So what quality means for a social network will be different to what quality means for a share trading platform. It means something different on an embedded device to what it will mean to a cloud-based application. So I think every time you're working on something new, it's important to start with the question of what is important to this thing I'm busy building. I like to call them critical success factors, but really it's just about thinking, well what are the things that have to go well and what are the things that can't go wrong? And you kind of use that to guide your approach to quality.

Lina Zubyte (03:50):
That reminds me of the first episode we did with Steve Upton. We also spoke about what quality is and how much it relates to this answer that we, as consultants, would give, which is: it depends. It depends on what success is to your business and, as you said, critical success factors. I think that's a very nice way to phrase it. It matters first to understand what has to go well and what can't go wrong. So the first time I actually heard your name was when I was reading more and more on the QA role. I started at ThoughtWorks and it was slightly different than what I was used to before. Before, I was more in this traditional kind of QA role: I would have my column, and then I would have some tasks and I would just test them. And then at ThoughtWorks, when I realized I'm working with such smart people, I was wondering, okay, what should I do here?

(04:47):
And I read some articles, and one of them was about the QA role and you had written it. There's this quote that I still use up to today: when it comes to automation, the most powerful benefit of test automation is what it frees you, the QA, to do. I love this so much and it resonated with me a lot, because I realized that it's not just testing and there are so many different areas that we can go into, for example, monitoring and using data, and I did that as well. So how do you see the role and activities of QA?

Rouan Wilsenach (05:28):
Yeah, I mean I think it's really quite a broad term, and I think in the industry we did something which, as an industry, we do a lot: we like to put things into boxes. People who work with software, we find it nice to try and classify things, but it's not always super helpful. So we came up with this QA role and mapped it one-to-one with one person and said, okay, you're a QA, it's your job to make sure that what we did isn't broken. And that's a very narrow-minded view on what quality is about. And so that article that you're talking about, I wrote because we were working with lots of teams where they had these QAs who had a manual testing background, and they would have spreadsheets of test cases that they would go through over and over again to make sure that the software did what it was supposed to do.

(06:16):
And we had engineers teaching them how to automate tests, and most of them didn't have a background in engineering. And so there was a lot of anxiety around it, I think, because there were people who were feeling like, okay, we're automating the thing that I normally do. Do I still have a job? And so I wrote that article, it's called "Is QA Dead?", and it was really just an exploration of whether that's true or not. And I remember I attended a virtual conference and someone called Elisabeth Hendrickson spoke, and she spoke about how checking is not the same thing as testing. And so what she meant is that these automated tests are all about checking that things are okay, but that doesn't all add up to a holistic picture of what quality is. My take on that was that what she calls the exploring part of being a QA is a vital part of playing this quality role.

(07:10):
And that was what I was trying to talk about in the article. It was all about now that we don't have to do all of these manual checks over and over again, you've got so much time, what are you going to be able to do? And the answer is quite a lot actually. And like you were saying, I also don't think that quality should be restricted to just testing things at the end. In the article I go through a few different roles that a QA could play and it includes a quality analyst role which is being involved at the point of requirements. Are we building the right thing at all? Do we understand what can and should and shouldn't go wrong? There's an element of quality assurance, which is making sure that the process itself adds up to good software. Are people writing tests? Are people pairing or people looking at each other's code?

(07:57):
And lastly there was this kind of idea of a quality ambassador as well, someone who's there to advocate for good quality software. And I guess the last thing to add to this: it doesn't just broaden across the SDLC in terms of caring about the full software development life cycle, it also broadens in terms of what parts of the system you're looking at, whether it's code or whether it's how it's running in production. And it also broadens in terms of who's doing it, because you don't necessarily have to have a dedicated QA person for there to be quality in a team. And at the same time, a person who is deemed a quality analyst or whatever you wanna call them, a QA, doesn't just do the things that you traditionally expect them to do. So in general, I like the idea of quality being a holistic picture, and for every team and every project that's going to have slightly different focuses, slightly different people and a different mix of things.

Lina Zubyte (08:52):
It is scary though, especially when we are used to these boxes; as you said, we are used to categorization. I think that's another topic, Cynefin, and how we want to put everything into boxes. But systems are complex, humans are as well, so we cannot just simplify it, and to actually build good products we often may need to adapt our ways of working, work with the context we have and change our activities. For me, one of the essential learning moments was when I worked with a product that had very good monitoring and I realized I cannot test every case. It was used by millions of people. There were different devices, different ways of how the users used it. And I learned how to create dashboards and use this data to actually understand the behavior in production, which did unravel quite a few bugs. So that sparked my interest in testing in production, and later on, working on a different project, also in e-commerce, I thought that it really could be beneficial there. And when I was reading more, I stumbled into your article about testing in production. So how would you define it?

Rouan Wilsenach (10:13):
Testing in production in its simplest form is just about paying attention to what's happening in production. Well, I guess there are two steps to it. It's about paying attention to what's happening in production and learning from it. And so you kind of have to put the right things in place to be able to learn from production. That's what people in the industry now broadly call instrumentation, so that's the right metrics, the right logs. And then there is what people call observability, which is the ability for you to look at what you have gathered in terms of information and draw conclusions. I remember when I first started thinking about some of these things, I was working in a team where we had QA, and there was this classic tension between software engineers and QA, where the QA would find issues and the engineers would say, well, that's such an edge case, no one's ever going to care about that. And the QA would be like, well, the software has to be high-quality so we need to fix it. And it's an argument that happens in a vacuum. It doesn't have any data. One person's saying it's important, the other person's saying it's not important. So how do you know? And the only way to know, without going to ask all of your users, which is difficult when you operate at a big scale, is to pay attention to what's happening in production.

Lina Zubyte (11:32):
Yeah. When I was using these dashboards in my job, I also had these, I would say, humbling moments where I realized that what I thought was a critical issue was actually affecting way fewer users than I expected. And I would use data to quantify the impact of it, which was extremely helpful for prioritization. And I think it was helpful for me to understand the product better and to put things in perspective: whatever I encounter is not a catastrophe, it's just what I encounter, and it doesn't mean that everyone encounters the same things as well. So maybe it's helping people as well, using this data, to see it for what it is, the reality of it, and that the product may work in all kinds of the weirdest ways, and some of the things that we think are not important are actually really important. And a lot of people stumbled into it.

(12:33):
One of the other examples I can think of right now is that we worked in a team where we thought we wanted to deprecate IE11, when it was still a thing, and it was actually the most common browser, so we couldn't just eliminate it. We always tend to think that the way we see a product is representative. I think it's also very common with QAs. We like to say, oh, I'm the voice of the customer and all this. Now when people say this, I cringe. I'm like, not sure. You cannot be the voice of everyone. Only if you have data, if you have this production data, can you actually understand your customers more. Not always understand, but you can at least try to believe them and also make decisions based on that. So when we talk about testing in production, a lot of people are scared of it. They think, okay, then you either don't do any testing, or how can you do it? It's so risky. What are some of the common misconceptions that you hear about this and how would you challenge them?

Rouan Wilsenach (13:41):
Sure. So there was quite a bit of that. When I first started talking about these ideas, we were not cloud native as an industry; microservices were just beginning. And so the modern observability practices that we have today were not really a thing yet. And when I was there saying, hey, test more in production and maybe you don't have to test as much before production, that was really scary for people. But it was based on my experience of where we were working. I was working for a company called TES Global. It's an educational company and it has millions of users. It's a website for teachers and schools to share ideas, and it's a way for teachers to find jobs as well. And what I found when I was working there was that beforehand, when I was working as a consultant, I was working mainly in the FinTech industry.

(14:32):
So it was banks and insurance companies and places where there was this overwhelming feeling the whole time that you couldn't do anything wrong. But when I started working at TES, there was a little bit more of a relaxed attitude, and it was about getting things in the hands of users quickly and seeing whether things work. And so we scaled down. I mean, let me be clear, we still did test-driven development and we still had lots of tests, but we were in a situation where we didn't bog ourselves down with a giant pile of browser tests, which I'd seen happen to many teams before. We just didn't write them. And it was fine. I mean, we had bugs in production, but we were paying attention and we noticed them. We didn't do performance testing before we went to production. I've worked at companies before where releases would be delayed for two weeks to do performance testing, when actually that performance testing was done in an environment that looked nothing like production and added no real value.

(15:29):
And so we kind of freed ourselves of those shackles, and we paid attention when we released a feature to see whether it was performing well or not. And we had the right continuous delivery pipeline in place so that if we noticed something was wrong, we could fix it really quickly. But the concept was scary to people, I think, when I wrote the article. And I think some people also place themselves into a domain automatically where they feel like they can't take these kinds of approaches. So for example, at the time there was this whole Facebook culture of, what was it, move fast and break things. And Facebook could do that. I mean, I listened to a talk by Kent Beck about testing at Facebook, and he was saying that they don't do that much testing before production. What they have instead is these kind of alpha groups.

(16:17):
Everyone at Facebook uses Facebook, and they've got these alpha groups of people who use Facebook, and they get feedback really quickly when things go wrong, and they roll things out gradually. And the perception is: they're a social network, so, whatever, if you can't see your friend's photo today. I work at a bank, so it's really bad if we put the money into the wrong account. And obviously that's true, but I think what's interesting is that, as always, it depends, but there's a spectrum, there's a balance between testing before and in production. And I think that if you are saying quality is so important to me that I don't want to do this quality in production thing, then you're making a mistake, because either quality is really important to you and you need to do quality in production and quality before production, or it's less important to you and maybe you can have a bit more of a balance.

(17:09):
But that was the kind of argument I was having at the time. It's become much less of an argument because we've moved into a distributed, cloud native world where systems are really complex, and one of the advantages of testing in production is that you can test your real production system. It's now, I would say, near impossible to have an environment before production that actually works the same, exactly the same way. You're not going to have the same scale, you're not going to have the same network issues, and you're not going to be able to (or you can, but it's a huge effort) replicate the same number of connected parties and services. I think there has been a general push in this direction. More and more people in these complex systems are realizing that we have to test in production. And there are some companies like Honeycomb and Datadog who are making good money off of that.

Lina Zubyte (18:00):
It's actually even quite arrogant to think that you don't need to test in production, because that would be saying, okay, I know all the scenarios that may happen, and we're not omniscient, we do not know everything that may happen. In your article, you said this sentence: "Testing is great for finding defects you expect to happen, but many production defects are surprises." I really like this quote. It reminded me of many examples that I've seen from data as well, even just sometimes splitting by browser or device. When I worked in e-commerce, we would even stumble into the device category of cars. So someone would be browsing and shopping while they were in their Tesla. It was always full of surprises, and I never could have done this right as just a QA. When we talk about this, it's maybe a bit of a difficult question, but should you test more or less if you test in production?

Rouan Wilsenach (19:09):
Oh, I think it depends. The answer will depend. Yeah, totally. I think that the root of it is that if you are paying attention to what's happening in production, you will know your system better. Knowing your system better will lead you to have the right test cases. And so you'll understand which parts of your system require what kinds of testing. It extends even beyond testing, it even extends to the features that we build. We did a great trick in one of my teams. We would have a conversation and say, we think that this might go wrong, let's write some code, defensive code, around that scenario: if this thing fails, then we take this remediating action. And we started to say no. What we did instead is we set up an alert; we called them unexpected alerts. And so what would basically happen is, in the code you would say: if whatever scariest scenario you're imagining happens, send off an alert, and you don't write any other code to deal with that scenario. And it's amazing, because what ends up happening is that the majority of those alerts never fire. And so you've saved yourself a lot of time writing code that you don't really need. And I think it's the same with tests. You understand what's happening in production well enough, and you're starting to learn how your users work and what they expect, and so what kinds of tests you need more of and need less of.
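As an illustration, here is a minimal TypeScript sketch of that "unexpected alert" idea. The sendAlert helper and the payment scenario are made-up placeholders, not code from Rouan's team:

```typescript
// Hypothetical alerting helper; in practice this might post to Slack,
// PagerDuty, or a metrics platform.
async function sendAlert(name: string, context: Record<string, unknown>): Promise<void> {
  console.error(`[UNEXPECTED] ${name}`, JSON.stringify(context));
}

interface PaymentResult {
  status: 'settled' | 'pending' | 'failed';
  orderId: string;
}

export async function recordPayment(result: PaymentResult): Promise<void> {
  if (result.status === 'failed') {
    // The "scary scenario" we imagined. Instead of writing speculative
    // remediation code up front, fire an "unexpected" alert and only
    // handle the case properly if it actually starts happening in production.
    await sendAlert('payment-recorded-as-failed', { orderId: result.orderId });
    return;
  }
  // ...normal happy-path handling...
}
```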

Lina Zubyte (20:42):
Yeah, I really like the alert part, and it also shows that it's not just one or zero, that there is somewhere in between. And sometimes we can act smart rather than just do everything. We don't need to do everything to have a good product. Another thought that I saw in your article was that "Tests need to earn their keep." I bookmarked this because I feel like we often may even overtest things, because we think it's better to have that, but some tests may just be noise, they may not be necessary at all and we're actually just wasting our time. You mentioned a little bit about what we need. What are the requirements for good testing in production? Could you expand on that a little bit? What do we actually need if we want to do testing in production? For example, you mentioned logs, metrics; how can we keep it fairly tidy as well when we start doing more testing in production?

Rouan Wilsenach (21:46):
There are a number of things that add up to kind of modern instrumentation and observability frameworks. But I think the first thing that you're going to need is some sort of logging. But I'm going to pause that and let's come back to that. The second thing that you need is some sort of monitoring. And the monitoring that you need, I think, needs to operate at two levels. The one kind of monitoring is this kind of classic systems monitoring: this container is restarting over and over again, this system is using too much CPU, we're running out of memory over here, we're seeing a huge uptick in timeouts across the estate, those kinds of things. And then we should also be thinking about business metrics. We've processed 20 widget transactions, have we sent 20 emails? Did it add up to 20 payments through our payment platform?

(22:39):
Making sure that those things add up as well. But the metrics themselves are not super helpful on their own, because we also need some sort of alerting, cuz otherwise you're not going to know that something has gone wrong. You can't rely on checking dashboards all the time. Dashboards can be really useful, and in fact humans are very good at picking up pattern anomalies. So if you're used to looking at a dashboard, you might notice very quickly something that's slightly weird, that's not normally like that, but you still need some sort of alerting to let you know that something has gone wrong. And then that goes hand in hand with the logging, because once something has gone wrong, the next question becomes: what happened? Why did this go wrong? And the logs are always going to be a good place to start. You wanna try and figure out what exactly happened.
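To make that concrete, here is a tiny sketch of the kind of business-metric reconciliation check described above. The MetricsClient interface and the metric names are assumptions for illustration, not a real API:

```typescript
// Hypothetical metrics client: assumes a count(metricName, window) query
// exists in whatever metrics store you use.
interface MetricsClient {
  count(metricName: string, windowMinutes: number): Promise<number>;
}

// Reconcile business metrics: every processed transaction should produce
// one confirmation email and one payment. Alert when they drift apart.
export async function reconcile(
  metrics: MetricsClient,
  alert: (msg: string) => Promise<void>
): Promise<void> {
  const windowMinutes = 60;
  const [transactions, emails, payments] = await Promise.all([
    metrics.count('widget_transactions_processed', windowMinutes),
    metrics.count('confirmation_emails_sent', windowMinutes),
    metrics.count('payments_completed', windowMinutes),
  ]);

  if (transactions !== emails || transactions !== payments) {
    await alert(
      `Business metrics out of sync in the last ${windowMinutes}m: ` +
      `${transactions} transactions, ${emails} emails, ${payments} payments`
    );
  }
}
```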

(23:26):
So try and log extra information when things go wrong. Be careful about what you log, because of things like personal information that you don't want to log. I guess two more things to say, and I'm probably forgetting things, but we'll do our best. One thing that you definitely need to have is the ability to standardize. Standardize is a tricky word, but I've seen this in a number of places. We have this problem where I'm working at the moment, where you have lots of different teams, lots of different microservices, and everyone does their logging and their alerting in a slightly different way. And what that means is that it's very difficult to see the bigger picture. It can become difficult to even run a simple query: how many errors are we having across the system at the moment, or how many timeouts are we seeing across the system at the moment?

(24:17):
So it's really good to try and have some sort of standard tooling around how we log, how we send off metrics, how we do auditing, whatever the thing is, so that things can be clear. People do tracing, like APM kind of stuff, where you see how long things take. All of these different parts of instrumentation kind of add up to what I think a modern observability stack looks like. If that sounds frightening, I don't think it needs to be; I think you can start very small. The simplest thing that you can do is put something in the logs and have something look at your logs and shout at you if it sees a bad thing in the logs. And that's step one. And there is also so much help at the moment; there are some really fantastic tools that can help you monitor what's happening in production.

(25:10):
There really has been an explosion of these because of the kind of cloud native, distributed world that we live in now. And I guess the other thing that I mentioned in the article, and I wonder whether this is still the case, but at the time I said that in order to do QA in production, something that you need is the ability to ship a fix fast. I don't think that's true, really. If I think about it now, I think everyone should be doing quality practices in production. But once you start shifting your quality practice more into production, once you start leaning more on production to know when things have gone wrong, you know, doing the thing where you say, okay, I'm going to trim down the number of tests I have so that I can go faster, that's the point at which you need to be able to ship a fix fast, because you're going to be in trouble if you can't. And I mean, to be fair, it's not to say that things aren't going to go wrong. I think everyone needs to have the ability to ship a fix fast irrespective of what else they're doing. Really, I guess it's a modern hygiene factor for software.
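In the spirit of that "step one", a minimal sketch of something that watches a log stream and shouts when it sees an error. It assumes Node 18+ (for the global fetch), JSON-ish log lines, and a placeholder webhook URL:

```typescript
import * as readline from 'node:readline';

// Minimal "step one": watch a stream of log lines (here: stdin, e.g. piped
// from `kubectl logs -f ...`) and shout when something bad shows up.
// The webhook URL is a placeholder for whatever alerting channel you use.
const ALERT_WEBHOOK_URL = process.env.ALERT_WEBHOOK_URL ?? '';

async function shout(message: string): Promise<void> {
  if (!ALERT_WEBHOOK_URL) {
    console.error(`ALERT: ${message}`);
    return;
  }
  // Node 18+ provides a global fetch.
  await fetch(ALERT_WEBHOOK_URL, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text: message }),
  });
}

const rl = readline.createInterface({ input: process.stdin });
rl.on('line', (line) => {
  // Assumes JSON logs with a `level` field, with a plain-text fallback.
  const isError = /"level"\s*:\s*"error"/i.test(line) || /\bERROR\b/.test(line);
  if (isError) {
    void shout(`Error seen in logs: ${line.slice(0, 300)}`);
  }
});
```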

Lina Zubyte (26:08):
It doesn't mean that you're not testing at all before you're shipping; you're still doing the same things, it's just that production adds something extra. So I get that everything is complementary and supports our ways of working, as you said as well. Learning more about production can help us test better before production too, and understand the product much better.

Rouan Wilsenach (26:31):
Yes, I think there's something about, so one of the original extreme programming values was courage. I think this is interesting because, I don't know about in real life, but in software you can create courage for yourself, which I think is quite cool. I don't feel very brave, but when I have a nice set of unit tests and integration tests and things that tell me whether what I'm doing is breaking everything, I feel a lot braver. I feel confident to make big sweeping changes across a code base, which is often what things need. There are so many times as an engineer where a new requirement comes in and there's kind of a fork in the road: on the one hand, you just kind of squeeze it into how things look at the moment, and that's how you end up with code bases that are very difficult to look after.

(27:18):
On the other hand, you're brave, you reshape the entire system so that it makes sense with this new requirement so that the whole thing is simpler. People say that you should leave a system in its simplest possible state and you can have the courage to do that if you've got some automated tests and that's amazing. I think that having good observability practices can give you courage in the same way that good tests can. Because I can make big sweeping changes across multiple microservices and feel okay about it. I can feel empowered to do that. And so we end up with a system that's of higher quality because we're able to reshape it to be the best possible system it can be all the time and we can be brave enough to do that because we have these production quality practices.

Lina Zubyte (28:03):
Nice. I also see testing as creating confidence nets for us, so I think it fits nicely with courage. I really like that you mentioned the importance of simplicity as well, and that we can start small and then add more alerts or dashboards or logs. What about the systems that overdo it? The systems that have too much noise, too many logs, alerts that keep popping up all the time and then the team just ignores them. Do you have any tips on how to clean that up?

Rouan Wilsenach (28:41):
Yeah, I've had to do this before, it's not fun. People talk about signal-to-noise ratio, and essentially what happens is that you just have noise: you have alerts popping up all the time, and you don't know what any of them are, and so you just kind of ignore them all. And you also get to this kind of broken window syndrome where you just feel like things are really broken, cuz there are all these alerts and everything's always red, and you just kind of shrug it off and don't care about fixing them. I guess the first thing that I'll say is that it often seems worse than it is. Very often there will be even just one or two things that are too noisy, and they can make the whole thing seem too noisy and unapproachable. And I've had it before where you spend less than a day and you cut the noise down by like 80%, because you've just found the two things that were so noisy and you shut them up.

(29:38):
And in terms of an approach to how to fix these things, the first thing to recognize is that on the day that you begin doing this, your system is as broken as it is. So by switching off alerts or by adding new ones or whatever you're going to do, you're not going to make the system any more broken. People have a fear of dealing with the noise because they feel like, oh, everything's important, I dunno what's important, and, you know, they just kind of get overwhelmed. So what I like to say is: start looking at the issues that you have, find the alerts that are noisy and don't seem valuable, and you have to go back to those critical success factors that I was talking about. So there might be things that are going wrong and that you do need to fix or that you do need to understand, but maybe they don't map directly to your main success criteria.

(30:29):
And honestly, if your system has been working and your customers haven't all been leaving or shouting at you, you're probably okay on most of your critical success factors at the point at which you're starting. Remember, the system is only as broken as it is. Start by having a look at whatever it is that's making the noise, the alerts or the logs or whatever. And you have to do a little bit of archeology, a little bit of identifying: what is this, what is that? And then you start quietening things down. So that might mean that you stop logging certain kinds of errors, or more likely maybe you filter the log level of things, so you change something from error to warning and you only look at errors to start off with. You might notice things that are broken, and you can create some issues and put them on your team's backlog and say we'll deal with that, but for now we just want to quieten things down.

(31:19):
You also wanna keep an eye out for the things that are really bad, cuz, you know, you might stumble on that one alert that's been hidden amongst everything else, and you make sure that you prioritize that. But it's really just a process of sifting through things one at a time and doing a bit of prioritization. I will also say that if you're like me and you like things to be neat and tidy, and the approach of trying to silence things, to reduce the noise, isn't making things as neat and tidy as you want, remember that you don't have to fix it all in one day. It can be very overwhelming. Give it some time, you'll improve it as you go. It's also very helpful; for example, I worked on a service once that we built from scratch, and we put observability in from the beginning, and it was fantastic because we never had any errors in the logs that we didn't pay attention to.

(32:06):
If there was an error in the log, it was because something was wrong. We looked at that error, we got alerted about the error and we did something about it. It was glorious. Then I moved to a different team and there were just errors everywhere all the time. And so you have to think about what your baseline is. So in that case we could say, well, great, I don't know why, but we have 200 errors every day. And so what we did is we set up an alert to say, okay, well, if the errors are more than 200 in a day, let us know, because then things have gotten worse than they are right now. And that's a starting point, and you just kind of have to start wherever you are and work your way towards zero, and work your way towards a fuller understanding of what's happening.
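A small sketch of that kind of baseline alert, assuming a daily error count is available from your log or metrics store; the threshold of 200 comes from the example above, and the LogStore interface is a placeholder:

```typescript
// Hypothetical store that can count error-level log events for a day.
interface LogStore {
  countErrors(day: string): Promise<number>; // day as YYYY-MM-DD
}

// Baseline alert: we know (and accept, for now) ~200 errors/day.
// Only shout when things get worse than the status quo, and tighten
// the baseline as the team cleans up.
const BASELINE_ERRORS_PER_DAY = 200;

export async function checkErrorBaseline(
  store: LogStore,
  alert: (msg: string) => Promise<void>,
  day: string
): Promise<void> {
  const errors = await store.countErrors(day);
  if (errors > BASELINE_ERRORS_PER_DAY) {
    await alert(
      `${errors} errors on ${day}, above the accepted baseline of ` +
      `${BASELINE_ERRORS_PER_DAY}. Something has gotten worse than usual.`
    );
  }
}
```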

Lina Zubyte (32:46):
I had the same situation. I had been working in a team that had very good observability practices, and then I moved to another team and they did not, and they were drowning in noise. And as you said, it's hard not to get overwhelmed and it's not fun to clean up. So, not fun, but I do also like the systems neat and tidy. So every cleanup, even step by step, day by day, makes me somehow feel better about the system, even though I did have the inclination to just stop all the alerts, because it's just noise, nobody checks it anyway. So it's quite a journey to somehow crystallize what has to be an actual alert, to redefine maybe the log levels: what is just information, what is a warning, what is an error? What is actually something we should look at right now and be alerted about, and what is just expected?

Rouan Wilsenach (33:52):
I was just going to say something that might work. I've not actually done this before, but if you're in a situation like that, something that might work is that instead of asking, in all of this giant pile of noise, what are the noisy things, can we take away the noisy things? Maybe what you can do instead is have a look at the giant pile of noise and ask, what are the important things in here? And maybe there are one or two alerts in there that we really should be paying attention to. And a simple start might be: if no one's paying attention to the room full of alerts anyway, maybe what we can do is create a new room and just send those two important alerts there and say, okay, everyone, start paying attention to this room, we can ignore the other one. And then we slowly work on moving things over (I'm talking about a Slack room or something here, sorry); we can start moving the alerts into that channel as we realize or recognize that they're important.
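A rough sketch of that routing idea, assuming alerts already flow through a single handler that can post to different channels; the channel names and the allow-list of "important" alerts are made up for illustration:

```typescript
// Route a small allow-list of known-important alerts to a fresh, quiet
// channel that people actually watch; everything else keeps going to the
// old noisy one until it has been triaged. Channel names are placeholders.
interface Alert {
  name: string;
  message: string;
}

const IMPORTANT_ALERTS = new Set([
  'payments-failing',
  'checkout-error-rate-high',
]);

async function postToChannel(channel: string, text: string): Promise<void> {
  // In practice: a Slack webhook per channel, PagerDuty routing, etc.
  console.log(`[${channel}] ${text}`);
}

export async function routeAlert(alert: Alert): Promise<void> {
  const channel = IMPORTANT_ALERTS.has(alert.name)
    ? '#alerts-important'   // the new room everyone agrees to watch
    : '#alerts-legacy';     // the old pile of noise, triaged over time
  await postToChannel(channel, `${alert.name}: ${alert.message}`);
}
```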

Lina Zubyte (34:35):
Oh, that's a great idea. I think it actually uncovers this one big important question of what is actually important for the product. What are the critical success criteria for the product? Cuz once we don't know them, everything seems important, and that's not a way for us to focus and understand our product better. One question that pops into my head is: when it comes to these practices, sometimes we have these DevOps teams and we just assign all the things to them. What is your idea about the ownership of any kind of ops-related activities? I would say even activities like looking at dashboards, creating alerts, checking the logs and things like that.

Rouan Wilsenach (35:25):
So this is an interesting one, and it's a tricky one, because the term DevOps has really steamrolled a lot of things. What you have to think about is what DevOps culture is and where it came from. And where it came from was that we had people building software and people running software, and those people didn't work together, they never talked to each other. There was a lot of animosity. The folks who built the software didn't run it, and the folks who ran the software knew nothing about the software that they were running and didn't care about it. And that was problematic. And so that's when this DevOps kind of movement started. It was all about trying to put that together, and it came along with lots of great tools to do that. Gone were the days of manually configuring servers; we did it with code instead.

(36:15):
And in that process we got lots of developers and operations folks to talk together and work together. But people also misunderstand things and we go in weird directions. So we do have these things called DevOps teams in lots of places. And basically what a DevOps team is, is an Ops team who codes, so we call them a DevOps team, and they all work separately from the engineering teams themselves. And they don't talk to each other, and they don't know what they're running, and the team doesn't know how it runs. And that's exactly the same situation. We've just changed the skill sets of the people who are Ops, essentially. So that's still not healthy. I mean, I think a healthier term, a term that's catching on a little bit, is this term of a platform team, because there is a lot of complexity in managing cloud native, distributed applications.

(37:05):
And so you do need people who are focused on that. But I think the idea should be that they essentially offer a product to the rest of the engineering organization, and hopefully they're informed by a good understanding of what it is that that organization is building, and not just the ones and zeros. And then the engineering teams, the software development teams, need to be using that platform, but they still need to own the full picture of what it is that they're building. If you are owning a software product only up until before it gets into production, you're not really owning it at all. You know, you click deploy to production, you forget about it, you have no idea what it is that you're working on really, because it's only from production that you learn. In some ways, I think that product folk have put engineers to shame over this over time.

(37:51):
I mean, there's been a real movement towards data-driven product decisions. And so if I work with a new product person, they're always like, they wanna see the analytics, they wanna know what customers are doing, and they know so much about the system, and a lot of the engineers are kind of behind and they don't know about the system. And so there's a bit of catch-up to be played there. So this whole notion that it's only for engineers to care about what's happening in production is, I think, outdated. I think everybody needs to care about what's happening in production, which I guess is the short answer to your question.

Lina Zubyte (38:28):
I think it's a great answer. It's team accountability. It's not only about DevOps activities, but QA activities as well. Ideally it would be team accountability. Sometimes we get overwhelmed or we don't know enough, which maybe is just a sign for us to sit down, think about it, revisit the topic. In the team where we had very good observability practices, we actually had this "ops hat" role that we would rotate every week. And what it meant is basically that this person would check the dashboards and also say, okay, we have this alert; if I cannot fix it, I will direct it to someone in the team who can. And that allowed us to reduce this cognitive load a little bit. So not everyone thinks, okay, I have to check it right now, but we have it more balanced, and we also would rotate the role so it wasn't always the same person. Then everyone can exercise the muscle of taking a look at the dashboards, of doing these ops activities within the team, and understanding the metrics as well. So I really like this conversation and I feel like it's a very nice place for us to come full circle. If you had to give one piece of advice for building high-quality products and teams, what would it be?

Rouan Wilsenach (39:52):
I think within the context of what we're talking about today, it would be to pay attention to what's happening in production. By doing that, you're bound to be surprised, you're going to learn some things that you didn't realize and you're going to understand your customers and your users and your product a lot better and be able to make good decisions. And I think that's the key thing.

Lina Zubyte (40:13):
Thank you. I really enjoyed talking to you today.

Rouan Wilsenach (40:16):
Thank you. It's been great.

Lina Zubyte (40:19):
That's it for today's episode. Thank you so much for listening. We're going to add quite a lot of the resources mentioned in this episode to the episode notes. And until the next time, do not forget: first of all, subscribe; secondly, keep on caring and building those high-quality products and teams. Bye.