Code with Jason

277 - Gregory Kapfhammer

Jason Swett

In this episode I talk with Gregory Kapfhammer about flaky tests. We cover their five main causes, why fixing individual flaky tests isn't enough, and how test suite health connects to broader engineering practices, team culture, and the overall quality mindset of an organization.

Links:

SPEAKER_02:

Hey, it's Jason, host of the Code with Jason podcast. You're a developer. You like to listen to podcasts. You're listening to one right now. Maybe you like to read blogs and subscribe to email newsletters and stuff like that. Keep in touch. Email newsletters are a really nice way to keep on top of what's going on in the programming world. Except they're actually not. I don't know about you, but the last thing that I want to do after a long day of staring at the screen is sit there and stare at the screen some more. That's why I started a different kind of newsletter. It's a snail mail programming newsletter. That's right, I send an actual envelope in the mail containing a paper newsletter that you can hold in your hands. You can read it on your living room couch, at your kitchen table, in your bed, or in someone else's bed. And when they say, "What are you doing in my bed?" you can say, "I'm reading Jason's newsletter." What does it look like? You might wonder what you might find in this snail mail programming newsletter. You can read about all kinds of programming topics like object-oriented programming, testing, DevOps, AI. Most of it's pretty technology agnostic. You can also read about other non-programming topics like philosophy, evolutionary theory, business, marketing, economics, psychology, music, cooking, history, geology, language, culture, robotics, and farming. The name of the newsletter is Nonsense Monthly. Here's what some of my readers are saying about it. Helmut Kobler from Los Angeles says, "Thanks much for sending the newsletter. I got it about a week ago and read it on my sofa. It was a totally different experience than reading it on my computer or iPad. It felt more relaxed, more meaningful, something special and out of the ordinary. I'm sure that's what you were going for, so just wanted to let you know that you succeeded. Looking forward to more. Thank you for this. Can't wait for the next one." Dear listener, if you would like to get letters in the mail from yours truly every month, you can go sign up at nonsense monthly dot com. That's nonsensemonthly dot com. I'll say it one more time: nonsense monthly dot com. And now, without further ado, here is today's episode. Hey, today I'm here with Gregory Kapfhammer. Gregory, welcome.

SPEAKER_00:

Hi, thanks for inviting me on the podcast.

SPEAKER_02:

Thanks for being here. So tell us a little bit about yourself.

SPEAKER_00:

So my name is Greg Kapfhammer, and I'm a faculty member in the Department of Computer and Information Science at Allegheny College. That means in part I'm a teacher, I'm also a researcher, and in a few cases I'm actually building software prototypes and sharing them with the world on GitHub.

SPEAKER_02:

You know, it's interesting. I recently learned what computer science is. And it's kind of funny, because I majored in computer science in college, although I didn't finish. I went to college on and off for five years and then just left. But what we were taught, you know, we were taught algorithms and stuff like that. But I never really grasped the concept of computer science. And not only did I not grasp the concept of computer science, I didn't know that I didn't grasp it, and maybe I still don't. But I read this really interesting book recently called The Beginning of Infinity by David Deutsch. Have you heard of him?

SPEAKER_00:

I've heard of the individual, but I haven't read the book. It makes me think of another book called Gödel, Escher, Bach, which you might be familiar with as well.

SPEAKER_02:

I've heard of that one, but I haven't read it. But yeah, for somebody who's interested in computer science, The Beginning of Infinity I would highly recommend. And I imagine the same with Gödel, Escher, Bach. What is that book about?

SPEAKER_00:

Well, in general, it's talking about some of the things that are the foundations of computing. And I do think, even if I go beyond the two books that we're mentioning, there are some deep, magical truths in computer science. Like, what is it that we can compute, and why are certain things computable? Why are other things not computable? And for me, there's a sense of awe and wonder that often comes out of studying some of the foundations of computer science.

SPEAKER_02:

Yeah, exactly. I've gotten those same feelings, but only recently, from reading these few things that I've been reading. So in The Beginning of Infinity, it talks about the concept of universality and how humans are universal explainers. We can observe phenomena and come up with explanations for the phenomena. And we're universal computers also. Anything that can be computed, a human can compute. And it talks about machines that aren't universal computers versus machines that are, like an abacus, for example. An abacus is a computer, but it's not a universal computer. And Charles Babbage's Difference Engine that he designed in the 1800s could have become a universal computer, but I think it was only a design for a machine; he never actually successfully constructed it at that time. I understand the Difference Engine was only successfully built in the 1990s, which is insane. Anyway, I kind of learned about what computer science is a little bit from that, and then I started reading this biography of Alan Turing. And I was really concerned, because sometimes you read these biographies. Like I started one on Nikola Tesla, and the book opened with something like, it was a rainy night, and Nikola... It was written in a way that didn't get into the substance at all. But this book is the opposite. It doesn't even start with anything to do with Alan Turing; it's mathematical background and stuff, and I'm like, yes, this is all substance. And so that's giving me a feel for what computation is, what computer science is, and all that stuff. I didn't mean to go on for as long as I did about that, but I just wanted to mention that I've recently gained an appreciation for what this stuff is all about.

SPEAKER_00:

Yeah, I really agree with the point that you're getting at. And not only are there perhaps deep and magical insights at the foundations of computing, but I think there's also an awe and wonder that goes along with saying, I have an idea for something that I want to build, and then I can actually create it and ship it and see people using it. And so I do think in the practical applications of computer science and in software engineering and data science, there's also a lot of awe and wonder there as well.

SPEAKER_02:

Yeah, yeah, I totally agree. Okay, so we are gathered here today to talk about flaky tests, among other things. How come you're interested in this topic? I'm curious.

SPEAKER_00:

Well, that's a good question. So, first of all, as a software engineer, I have written test suites that have flaky test cases. Oftentimes as a researcher, the place where I choose to investigate is a place where I myself have had a problem. Another thing that I often do is look at large open source projects on GitHub and try to discern whether they may have a similar problem to the one I had when I was building a system. So I got interested in the area of flaky test cases in large part because I myself, as a programmer, have written flaky test cases and then found myself incredibly frustrated, because I didn't know if the bug was in my test suite, or if the bug was in the environment, or if the bug was in the program that I was actually trying to test.

SPEAKER_01:

Mm-hmm.

SPEAKER_02:

Yeah, and it's interesting that you use the word bug, because I haven't heard the word bug associated with flaky tests very much, but I personally have, because I consider a flaky test a form of a bug, and fixing a flaky test is an exercise in bug diagnosis and fixing, and the same techniques and principles that apply to fixing other kinds of bugs apply to fixing flaky tests. And I think that's a significant point, because almost nobody knows how to fix flaky tests. And when they do, it's often very ad hoc. If you ask somebody how to fix a flaky test, I don't think they can explain it a lot of the time. They just kind of go by the seat of their pants. So I think that framing of flaky tests as bugs is significant.

SPEAKER_00:

Yeah, that's a good point. So the way that I think about it is that a flaky test is a test case that may sometimes pass and sometimes fail, even though I'm not changing the source code of the program under test or the source code of the test suite. So if my test case fails and those other areas aren't being changed, in my view, that is a kind of bug. Now, interestingly, I think there's also sometimes a silver lining to flaky test cases, because those flaky test cases can, for example, reveal assumptions in my program, or maybe an assumption in my test suite, or maybe some not-well-understood interaction with the execution environment. So I think in all of those situations they are like bugs. We do have to debug them, and yet they are also often silver linings, because they help me to find ways to improve my system.

SPEAKER_02:

That's a really good point. A flaky test is a symptom, and sometimes the cause of that symptom is just some annoying thing that we have to fix, but sometimes it shows you a genuine problem, and by fixing that problem that manifests itself as a flaky test, we not only fix the flaky test but improve the system also.

SPEAKER_00:

Yeah, that's absolutely the case. And I know when I first started writing test cases, I would not really think about my tests in a full-featured fashion. So I would just think about a test suite where I always ran it in the same order. And then after a period of time, I noticed that I was starting to introduce flakiness into my test suite because I had what I would call order-dependent test cases. And then that helped me to realize, you know what? Maybe when I'm running my test suite, I should always run it in a random order. Because if I do that right at the start of a new project, maybe I'm less likely to introduce order-dependent flakiness into the test suite.

SPEAKER_02:

And why would that make it so you're less likely to introduce order-dependent flaky tests?

SPEAKER_00:

So at least in my experience, if I'm always running my test suite in a specific order, I could get into a situation where, even though I don't know it, test cases are assuming that other test cases maybe won't clear out some of the state from the database or might leave files on the file system. And so they tend to pass, perhaps we might say coincidentally. But if I start to run them in a random order, then in that situation I might shake out some of the assumptions in my test suite, because test cases start failing early. And then I can make test cases that are, for example, less likely to pollute the state and therefore less likely to be order dependent, and then maybe we end up getting fewer flaky tests in the system overall.
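
A minimal sketch of the kind of order dependency Greg describes, assuming pytest; the test names and the shared cart list are hypothetical. The second test only passes because the first one happens to run earlier and leave state behind, and a randomized order (for example via the pytest-randomly plugin) exposes that hidden assumption.

```python
# test_cart.py -- hypothetical sketch of an order-dependent test pair.
# Both tests read and write the same module-level list.
cart = []

def test_add_item():
    cart.append("apple")
    assert cart == ["apple"]

def test_cart_has_one_item():
    # Passes only if test_add_item already ran and left "apple" behind.
    # With a shuffled order (e.g. `pip install pytest-randomly`, then
    # run `pytest` as usual), this assumption shows up as a failure.
    assert len(cart) == 1
```

Giving each test its own data, or resetting the shared state in a fixture, removes the dependency.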

SPEAKER_02:

Right. So when you always run your tests in the same order, problems might arise, but those problems might be masked, because in order for the tests to be revealed as flaky tests, they would have to be run in a different order. And so you could have a test that's a problem, but again, because the problem is masked, you won't know about it until very late. And by that late time, the test might be harder to diagnose and fix than it would have been earlier on.

SPEAKER_00:

Yeah, that's exactly what I have experienced. And even worse, if you have a really large regression test suite and you add a new feature to your system or a performance enhancement to your system, and then all of a sudden a whole bunch of test cases start failing in a flaky fashion, now you're not sure: is it because of the performance optimization? Is it because of the new feature? Or is it because of some artifact inside of my test suite that has nothing to do with the optimization or the feature? So then at that point I have to dive into my test suite again, and I lose focus on the optimization or the feature that I'm building.

SPEAKER_02:

I want to talk about the causes of flaky tests. I have my list of causes, and I think I've found the complete list of correct answers. And so far I haven't encountered anything that shows me that I'm wrong. But before I share my list, I'm really curious to hear your take on what causes flaky tests, and we can compare it to mine.

SPEAKER_00:

Okay, so I hope we're gonna get to the complete list, but let's see if at least we can get close. So a moment ago I mentioned the idea of an order-dependent flaky test, and then I would say the other category would be a non-order-dependent flaky test. So that's one where the order in which the test suite is run doesn't make a difference. And then I would say, broadly within those two categories, there are lots of examples of things that could lead to flakiness. So, for example, it could be related to the network, it could be related to environmental differences, it could be related to things that are about the time or the date. It also could be related to issues with floating-point values and the precision of assertions. So those are some of the examples of different categories of flaky tests that I've encountered. Now, if I'm allowed to ask, what are some of the things on your list?

SPEAKER_02:

Yeah, yeah. So my list includes all the things on your list. I've never thought about the floating point thing specifically, but everything else. And I think it is valuable to group those causes into categories, and in my figuring there are exactly five categories. Before I even say what those are: all flaky tests are caused by some form of non-determinism, and that non-determinism can either be in the test or in the program itself. So let's see if I can repeat the five causes from memory. We're talking about order dependency. That is kind of a description of the nature of the problem, but the root cause is something different. The root cause is, I think you used the word pollution earlier, some kind of pollution or mutation of global state. So that's one cause of flaky tests: global state mutation. Another big cause is race conditions. And that brings us back to the fact that the problem could either lie in the test or in the application code. A race condition in the application code can cause a flaky test, even if there's nothing particularly wrong with the test. You mentioned the network. When there are external dependencies, like dependencies on the network or third parties, which always happens over the network, that can cause flakiness, because sometimes these external services are available and working, and sometimes they're not. So that's three. Randomness is another one. I think, if I understand your floating point precision case, maybe the reason that can cause flaky tests is because it introduces a form of randomness. I don't know whether that's the case or not, though.
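
As a sketch of the race-condition category Jason mentions, here is a hypothetical pytest example where a test asserts on the result of background work. Asserting before the work is guaranteed to finish makes the outcome depend on thread scheduling; waiting for the thread removes the non-determinism.

```python
# Hypothetical sketch: a race between the test and a background worker.
import threading
import time

def test_background_write_without_racing():
    results = []

    def worker():
        time.sleep(0.01)          # stand-in for a slow network call or disk write
        results.append("done")

    t = threading.Thread(target=worker)
    t.start()

    # Flaky version: asserting here would depend on thread scheduling.
    # assert results == ["done"]

    # Deterministic version: wait for the work to finish before asserting.
    t.join()
    assert results == ["done"]
```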

SPEAKER_00:

Yeah, I know what you're getting at in terms of randomness, because, broadly speaking, a test case could use a random number generator, and that could lead to flakiness. Or alternatively, the program could use a random number generator, and then that could also lead to flakiness. So I think you're right, randomness is certainly one of the categories that I would put into the list as well.

SPEAKER_02:

Yeah, and even more subtle randomness, like when you do a database query: a lot of the time the records will come back in the same order as previous queries, but it's not a guarantee. And so you might have one case out of a large number that just so happens to be different. And so that's a case of randomness that the programmer might not even be aware is randomness. Okay, so that's four of the five. And then the fifth one is fragile time dependencies. I've had a case where there's a report, for example, that gets one day's worth of items. And the test for this report works fine if you run it at 8 a.m. But if you run it at 8 p.m., it fails, because in the test you use relative times instead of fixed times. You say, okay, get everything for the next eight hours starting at 8 a.m. That works fine because it's all inside the same day. But if you say it's 8 p.m., give me everything for the next eight hours, that goes to 4 a.m. the next day. And so now your eight-hour chunk spans two days and your test won't pass anymore. So fragile time dependencies is the fifth cause. And everything that I've ever encountered, I've been able to put into one of those five categories. I have not yet encountered anything that tells me there's more than just those five.
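
A small sketch of the fragile time dependency Jason describes, in Python; the Item class and items_for_day function are invented for illustration, and the freezegun library is mentioned in the comments only as one common way to pin the clock.

```python
# Hypothetical sketch of a fragile time dependency and one way to fix it.
from datetime import datetime, timedelta

class Item:
    def __init__(self, created_at):
        self.created_at = created_at

def items_for_day(items, day):
    # "One day's worth of items": everything created on the given date.
    return [item for item in items if item.created_at.date() == day]

def test_report_includes_items_from_the_next_eight_hours():
    # Flaky version: relative to the real clock, an item eight hours from
    # "now" lands on today's report at 8 a.m. but on tomorrow's at 8 p.m.
    #   item = Item(created_at=datetime.now() + timedelta(hours=8))
    #   assert items_for_day([item], datetime.now().date()) == [item]
    #
    # One fix is to pin the clock (e.g. freezegun's freeze_time context
    # manager); another is simply to build the data from fixed timestamps:
    fixed_now = datetime(2024, 3, 1, 8, 0, 0)
    item = Item(created_at=fixed_now + timedelta(hours=8))
    assert items_for_day([item], fixed_now.date()) == [item]
```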

SPEAKER_00:

Yeah, out of the categories that you had and what I'm thinking that I have experienced myself, I can see how everything lines up. So, for example, you mentioned issues that I would have called async/await and concurrency. You mentioned that one. When I talked about floating point, I was thinking about situations where different computers have different representations of floating-point values, and so when you do the computation you might get a slightly different answer, depending upon the CPU that you use to do the floating-point computation. But maybe broadly that one's under the category of what we might call the environment, because sometimes when you run a test case, it might run differently depending on the operating system that you pick, or the way in which characters are encoded on that operating system, or the way in which numbers are encoded on that operating system or platform. So I like the list that you said, and I don't know if I kept every one of the five perfectly in memory, but what I would perhaps add to that is what we might call environment dependencies, which I think I remember you alluding to in the context of the network.

SPEAKER_02:

Okay, well, I'm distracted, because I think you just touched on a sixth category. Because if one computer performs some operation differently from another computer, and the test passes on one computer and not on a different computer, that's not a case that I have been considering before. When I was talking about the environment, I was talking about pollution of the global environment state. That's not a case of pollution of the global state. That's just a difference, a static difference in the environment. It's not something that the tests changed, it's just the way it is. So I think that's a new category.

SPEAKER_00:

Okay, cool. And I can actually give a concrete example of this that I've experienced frequently myself. So I build and maintain some systems in Python, and when you run a Python program, it's often very sensitive, depending on what you're doing and how you're doing it, to the default character encoding for that operating system. So if you run a program through a subprocess call, you're communicating back and forth in text, you've written it so that it works correctly for UTF-8, but the operating system doesn't use UTF-8 by default and you don't force it to use UTF-8, things can break. I've actually been able to find flaky test cases where my test cases stop passing on certain environments because that environment, most likely the operating system, doesn't encode characters in the way that I was expecting.
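
A hedged sketch of the kind of fix Greg is alluding to: pinning the text encoding on both sides of a subprocess call instead of inheriting the platform default. The child command here is just Python printing a non-ASCII string; only the standard library is used.

```python
# Hypothetical sketch: force UTF-8 across a subprocess boundary so the test
# doesn't depend on the operating system's default character encoding.
import os
import subprocess
import sys

def run_child():
    return subprocess.run(
        [sys.executable, "-c", "print('café')"],
        capture_output=True,
        encoding="utf-8",                                  # decode the child's output explicitly
        env={**os.environ, "PYTHONIOENCODING": "utf-8"},   # pin the child's stdout encoding too
    )

def test_child_output_decodes_the_same_everywhere():
    assert "café" in run_child().stdout
```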

SPEAKER_02:

Does that cause flaky tests as opposed to tests that just never pass on that OS?

SPEAKER_00:

That's a really good question. So this is where I think we should dive in a little bit deeper. So in one way, you could kind of call it flaky, because, for example, I could run it in CI on GitHub Actions, and depending upon the kind of Windows OS that GitHub Actions gives me, it might sometimes fail for the character encoding and sometimes it may not, if I don't tightly control what I want the character encoding to be. So I can actually see the argument for why perhaps this isn't a kind of flakiness, because on that specific OS configuration it's always going to fail. But if I don't always get that OS configuration, then I don't necessarily have an easy way to control for it. And so I do still sometimes perceive it as a kind of flakiness.

SPEAKER_02:

Yeah, that's really interesting. So I'm gonna partly take back what I said about that sixth category. The jury's out in my mind as far as whether that's truly a sixth category or not, because you could choose to change your environment, and then that new environment just doesn't work at all for certain tests. Or you could have a situation, and this scenario has never entered my mind before now, a situation where you sometimes get one kind of environment and sometimes get a different one, and the tests pass with one environment and don't pass with the other. Is that a case of flakiness? In a way it is, in a way it isn't. It's a really interesting scenario.

SPEAKER_00:

Yeah, and I would almost think of it more like a source of flakiness in the program under test itself. Because if my program hasn't carefully controlled how it does floating-point arithmetic, and accounted for all of the variations in the IEEE floating-point standard to get the right answer across all architectures and operating systems, or the same kind of thing for character encoding, my program might work on some OS and architecture combinations and might not work on others. But once I find that combination where it doesn't work, it's not a flaky thing, it's a reproducible thing, and it's due to a component of the environment, broadly construed, that I haven't necessarily thought of or tested carefully.

SPEAKER_02:

Yeah, that's really interesting. Because if you're writing a program that, for example, targets Linux but doesn't target Windows, and you run it on Windows and it fails, that's not a bug, because you never intended it to run on that operating system. But if you are targeting those operating systems and the test passes on Linux and fails on Windows, then that is a bug. It depends on the requirements that you've decided your program will have.

SPEAKER_00:

Yeah, that's precisely the case. I mean, you said it depends, and I think that's the right way to think about it. It depends on what your spec says and what OS and architecture combinations you need to accept. With that said, in my experience there are going to be lots of people who try to, in the case of my Python program, pip install it on an OS or architecture combination that I don't have a setup for in GitHub Actions. And even in the past six months or so, I've had people report bugs for some of my systems on GitHub because I just never thought to do a test for a certain version of the operating system and a shared object or a dynamic link library or a version of Python. And all of those things combined are like one big state space for which I haven't fully explored all the possible combinations for running my test suite.

SPEAKER_02:

Yeah, that's really interesting. I think I mentioned in a recent podcast that I realized at one point that bugs tend not to be a case where the program is behaving differently from what I intended. They tend to be cases where there's some scenario that I never even thought of, and it's not behaving as someone would expect under that scenario. It sounds like that's exactly what those bug reports are. It's not that your program is not behaving as you intended; it's just this configuration that you didn't account for yet.

SPEAKER_00:

Yeah. And again, I think your point from a moment ago, that the jury is still out, is a good way to think about it. At least for me, this is a kind of, if I could use a phrase from before, bug in my test suite that I have to figure out. And sometimes it's a bug in my program, and ultimately it leads to me making a change in my test suite, making a change in my CI/CD pipeline, and often also making a change in my program. And so at least for me, when I shake out those bugs, it's a pretty holistic change that goes from the program under test to the test suite to the CI/CD setup.

SPEAKER_01:

Mm-hmm.

SPEAKER_02:

I want to touch on another point. And this is an aspect of flaky test problems that, in my experience, is not very broadly understood, which is that every flaky test problem is really two problems. So there are individual flaky tests, and then there's the broader, more general flaky test problem that an organization or a project has. And I use an analogy for this: there's a boat, there's water in the boat, and there are holes in the boat. I'm having to refer to my own book here because it's a little bit mind-bending. But there are two layers. There are multiple layers of symptoms and causes. In my experience, some test suites are more flaky than others, not just in the sense that there happen to be more flaky tests, but in the sense that there are flaky tests, and you fix those flaky tests, but more flaky tests arise. That's what I mean when I say flaky test suite. Even if you plug the individual holes, more holes appear and more water comes into the boat. So a flaky test is a symptom. And I think a lot of people think that, okay, we have flaky tests, we fix the flaky tests, and the problem is solved. But it's like, okay, you've solved a certain amount of problems at a certain level, but at another level there are still unsolved problems. So fixing individual flaky tests is analogous to bailing water out of the boat. And I'm not saying that in a negative way. There's nothing wrong with bailing water out of the boat. Bailing water out of the boat is necessary, because all tests are flaky. There's always going to be some water in the boat, and so there's always going to be some bailing out necessary. But the question is, why is there water coming into the boat? Water's coming into the boat because there are holes in the boat. And so then there's this other layer of problems: okay, why are there holes in the boat? And that's the question that needs to be addressed in order for the rate of new flaky tests being introduced to decrease. And I shouldn't even say that, because that's not precisely correct. I shouldn't say new flaky tests, I should say new instances of flakiness, because a lot of people are under the mistaken impression that the way flaky tests arise is by developers writing tests that are flaky. And that can happen, but quite a lot of the time people write tests that are just fine at the time that they're written, but then some months or some period of time later, flakiness arises in those tests because conditions have changed. And so that other question needs to be asked of why do holes keep popping up in this boat? Anyway, I'll pause there.

SPEAKER_00:

Yeah, I like the way you developed this analogy. And in particular, a moment ago I heard you use the word symptom. I think symptom is the right word to use, because test flakiness may be a symptom of what perhaps we could call the underlying poor health of the test suite, or the system under test. So there could be some underlying, more fundamental issues suggesting my test suite isn't healthy, and flakiness is one of the ways in which poor test suite health manifests itself.

SPEAKER_02:

Exactly. And I want to emphasize again that it's the health of the test suite and the application itself. And to illustrate the importance of this, I want to use an extreme example. Imagine that you have an absurdly simple web application that's just a single static HTML page, and your test suite consists of one single test that visits that page and asserts that certain text is present on the page. The test suite is almost completely pointless, but that's beside the point. The point is that the app is as simple as can be, and so is the test suite. Now imagine a different web application. The back end is Rails, the front end is, let's say, React. The Rails app uses every gem known to mankind. So does the React app. There are just a bazillion dependencies, there's all kinds of asynchronicity, and the UI is really, really complicated. And it does a huge amount of stuff. There's 10,000 tests. That test suite is going to be super flaky, almost guaranteed. And so I think the lesson there is that your architectural choices for your application have a huge bearing on how flaky your test suite is likely to be.

SPEAKER_00:

Yeah, that's a good point. And a moment ago you mentioned the dependencies, and that's something that I've encountered as well. The more dependencies I take on in my system, in some ways I'm ceding control of my program and my test suite, because I don't know how those dependencies are going to evolve over time and how they may change the assumptions that they had when I first took them on as dependencies. And of course, you and I like dependencies sometimes because they save us from reinventing the wheel, but the more dependencies you take into your application, the more likely it is that different types of flakiness are going to creep into the program or its test suite.

SPEAKER_02:

Yeah. It's kind of like, to me, it's like, okay, have you ever been shot?

SPEAKER_00:

I have not.

SPEAKER_02:

Me neither. I've never been in combat in a war. I'm assuming you have not either. I have not. Okay. So, you know, pretty low chance of getting shot if you don't get yourself into combat ever. But if you go into combat, and then a few years later you go into a different war and put yourself in combat, and do that over and over again, the chances you're gonna get shot increase the more you do that. And so that's the way I look at it with libraries. The more libraries you take on, the more it increases the chances that these things that give rise to flaky tests are going to come into the picture. So it goes back to those five causes of flaky tests. I'll just mention a couple: race conditions and global state mutation. If you have more libraries, you have more configuration. And the more configuration you have, the more opportunities you have for global state mutation. And it's not just global state mutation, it's all those different five causes potentially. And again, the more libraries you bring into the picture, the more opportunities there are for flakiness.

SPEAKER_00:

Yeah, and since you mentioned global state, another thing that came to my mind is that the more you have global state and manage that state, although it may be beneficial from the perspective of your application, that's then something that your test cases have to start managing as well. And in my experience, there's one thing that often works for avoiding pollution of that environmental state, which is to very carefully hermetically seal each test case off from every other test case. And I don't know what your experience has been, but when I do that, I tend to also make my test suite execution time slower, because I'm clearing state out of the database or clearing out shared files. And when I do that, test after test after test, that leads to test case execution times going up, which then, drumroll, leads to the case where I want to run my test suite less frequently, which would then lead to more bugs creeping into my system.
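
A minimal sketch of that "hermetically sealed" style, assuming pytest and a sqlite3 database; the table name and file path are invented. The autouse fixture buys isolation, but its per-test setup and teardown is exactly the runtime cost being described.

```python
# Hypothetical sketch: reset shared database state around every test.
import sqlite3
import pytest

DB_PATH = "test.db"  # invented path for illustration

@pytest.fixture(autouse=True)
def clean_database():
    # Runs around every single test: guarantees isolation, but the repeated
    # setup and teardown is what slowly pushes suite runtime up.
    conn = sqlite3.connect(DB_PATH)
    conn.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER, total REAL)")
    conn.execute("DELETE FROM orders")
    conn.commit()
    yield conn
    conn.execute("DELETE FROM orders")
    conn.commit()
    conn.close()

def test_insert_sees_only_its_own_rows(clean_database):
    clean_database.execute("INSERT INTO orders VALUES (1, 9.99)")
    count = clean_database.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
    assert count == 1  # never sees rows leaked from other tests
```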

SPEAKER_02:

Yeah, vicious cycle.

SPEAKER_00:

It's exactly that.

SPEAKER_02:

Yeah. And that comes back to, you know, you mentioned test suite health and stuff like that. I think that leads into test suite architectural choices. And by the way, everything is connected to everything else. It's very hard to solve just one problem. And I find that when I've consulted with engineering teams in the past, it's rarely the case that they're doing one thing excellently and other things poorly. It's like they're kind of doing everything well or everything poorly. Anyway, that's an interesting point about the cost of isolating the tests, the setup and teardown and stuff like that. I think another option, depending on the scenario, is to have tests that deal with less stuff. So, for example, instead of an end-to-end test that involves the database and requires the database to be cleared before or after each test, you can write tests that just address one particular object and don't involve the database at all. And that test can be less costly to run.

SPEAKER_00:

Yeah, that's a good point. And now I think you're bringing up another trade-off that I have faced frequently, which is: I want my test cases to be realistic, but I also need my test cases to be focused enough that I can quickly run them, and in a way that doesn't introduce a lot of flakiness. So for me, another trade-off has been that as I introduce more mocks or spies or whatever you might want to call them into my test suite, it can help the tests to be more focused and maybe less flaky, but it also sometimes makes them a little bit less realistic. And at least in my situations, I've had a few cases where I think I fooled myself into thinking I was really testing my system well and the test cases were passing, but I had introduced more than one layer of mock objects, which then made my test cases a little bit too unrealistic. So I think you're bringing up a really interesting trade-off between maybe what we would call focus versus realism in terms of the test suite.
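
A hedged sketch of that focus-versus-realism trade-off, assuming the requests library; the PriceClient class and its URL are invented. Patching out the network call makes the test fast and immune to network flakiness, at the price of never exercising the real endpoint.

```python
# Hypothetical sketch: stub the network so the test is focused but less realistic.
from unittest import mock

import requests

class PriceClient:
    def latest_price(self, symbol):
        resp = requests.get(f"https://example.com/prices/{symbol}", timeout=5)
        resp.raise_for_status()
        return resp.json()["price"]

def test_latest_price_without_touching_the_network():
    fake_response = mock.Mock()
    fake_response.json.return_value = {"price": 42.0}
    fake_response.raise_for_status.return_value = None

    # Focused and fast: no network, so no network-related flakiness --
    # but the real endpoint and its quirks are never exercised.
    with mock.patch("requests.get", return_value=fake_response):
        assert PriceClient().latest_price("ABC") == 42.0
```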

SPEAKER_02:

Yeah, I think the realistic versus unrealistic distinction is a good one and makes a lot of sense. I think we can even frame it a different way too, which is maybe the depth of the test. A lot of this stuff we think about in layers. So with a web application, there's the OS, the database, the back-end language and framework, the front-end code, HTML, CSS, all that stuff. And end-to-end tests, of course, exercise the entire stack by definition. And then what you might call unit tests, although that's an ambiguous term that not everybody agrees on the definition of, might exercise just one layer of the stack, like the back-end language. And so there's a cost-benefit judgment that has to be made. The unit tests don't have very much depth, but they are relatively cheap to write and cheap to run. And then the end-to-end tests tend to cost more to write and more to run, and their costs include a greater susceptibility to flakiness, but they have much greater depth than unit tests. And people often want to know what's the right balance, but I don't think that's a great way to frame it. Because with a lot of things, there's not a single right answer. That doesn't mean anything is equally good; that's not true. There are better and worse ways to do it, but there's not a single right answer, and there's not a single rule that can be stated where it's like, follow this rule and you'll get the right answer. It's just that you have to make a cost-benefit analysis on a case-by-case basis to develop a sensible policy for choosing which kinds of tests to write. So it's not a correct policy, it's just a sensible policy.

SPEAKER_00:

Yeah, I see what you're saying, and I agree with it. So often when I'm writing my own test suites, I am thinking about things like what's the test coverage score or what's the mutation adequacy score, both of which I could define in a moment if you'd like. But I also don't want to over-optimize on, for example, achieving a high-coverage test suite, because I could get into a situation where, as I over-optimize on code coverage, I write test cases that are brittle. You know, they're change detector tests that are focusing very heavily on the way I happen to implement the system right now. And then when I change the program under test in a way that's completely valid, my high-coverage test suite could get into a situation where it starts to fail when in fact it legitimately shouldn't have failed. So I think you're right. All engineering is about trade-offs, and although I care about high coverage and I want to get a high test coverage score or a high mutation adequacy score, I also don't want to over-optimize on that, because I could end up with a brittle test suite that has, for example, change detector tests inside of it.

SPEAKER_02:

Yes. Yeah, and that raises another interesting point about tight or loose coupling between tests and the code that they're testing. It's a skill to write tests that are coupled with the appropriate tightness to the code that they're testing. There's a common belief, which has some truth to it, that unit tests are going to be more tightly coupled to the code they're testing than what I'll call system tests, or end-to-end tests. And that often is true, because with a system test you're operating at a very high level. You might visit a page, fill out some fields, click some buttons, and stuff like that. You don't necessarily have to refer to any code at all. You shouldn't be referring to any code. Whereas in a unit test, you are referring to code specifically. You're initializing a certain object and calling methods and stuff like that. But you do have some discretion as to how tightly you couple those unit tests to the code that you're testing. Because, for example, you could have an object with five methods on it, and you could have test cases that explicitly refer to all five of those methods. Or you could choose to design your object a little bit differently and have just one or two public methods while the rest of the methods are private, and you exercise the behavior of all the methods, including the private ones, but you do it through the public interface. And so your tests care that your object works, but they don't care how it works. As long as the specified behavior is achieved, it doesn't matter what the code is. You don't know or care what the specific lines of code are that achieve that behavior. As long as you get the outputs you expect based on the inputs you give it, then you're good, and that gives you the freedom to refactor your code without having to worry about updating the tests also.
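
A small sketch of the coupling choice Jason describes; the Invoice class and its private helpers are invented. The test exercises the private methods only through the public total() method, so the internals can be refactored without touching the test.

```python
# Hypothetical sketch: couple the test to behavior, not implementation.
import pytest

class Invoice:
    def __init__(self, line_items, tax_rate):
        self._line_items = line_items
        self._tax_rate = tax_rate

    def total(self):
        # The one public method; everything below is implementation detail.
        return self._subtotal() + self._tax()

    def _subtotal(self):
        return sum(self._line_items)

    def _tax(self):
        return self._subtotal() * self._tax_rate

def test_total_through_the_public_interface():
    # Exercises _subtotal and _tax indirectly but never refers to them,
    # so renaming or merging the private helpers won't break this test.
    invoice = Invoice(line_items=[100.0, 50.0], tax_rate=0.10)
    assert invoice.total() == pytest.approx(165.0)
```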

SPEAKER_00:

Yeah, I agree with what you're saying, and you've highlighted the issue of whether we should or should not directly test things like private methods and objects, and that's something I myself have struggled with as well, because I do want to somehow test that code, but I don't want to make my test suite lose focus, and yet I also don't want to make it brittle, so that, for example, when I refactor my system and make a new thing private that used to be public, I've got to go in and make heavy changes to my test suite as well. So again, I think we're highlighting another balancing act that as testers we have to be careful about. I mentioned a moment ago the idea of code coverage, and again, it would be very naive for me to simply focus on covering the code, even code in private methods, without actually putting assertions inside of my test cases that check to see whether or not those methods that I'm testing actually did what I expected them to do. So again, there's a balancing act there, because I could get really high coverage, but if I don't have good targeted assertions in my test cases, I might not actually be able to find bugs in that code later.

SPEAKER_01:

Hmm.

SPEAKER_02:

You know, I have a little bit of a controversial take on code coverage. I don't use code coverage tools or pay attention to coverage numbers, because I go off of habits instead. I've gotten to a point in my habits where not writing a test for something is almost unthinkable. And so my habits lead to an outcome of having high test coverage. And you know, test coverage, of course, is just a proxy. You can have very high test coverage without having tests that are particularly valuable. I guess you can't really have it the other way around: you can't have low test coverage but still have a great test suite. That doesn't work. So I don't pay attention to code coverage, and it makes me wince a little bit when people are highly focused on hitting a certain code coverage number, especially when it's tied to, what's the word, punishment's not the word, but hand slapping and stuff like that. It's like, you have to have such-and-such coverage level in order to merge this PR, or whatever. Because that is applying the constraint too far downstream. Because if you have developers who are trying to merge PRs that don't have very good test coverage, then your problem lies further upstream. Your developers don't have good habits, or they don't have the beliefs that you want them to have around testing and stuff like that. And what's probably gonna happen is: they wrote the code already, they're trying to merge it now, and now you're forcing them to reluctantly go back and do the annoying chore of adding a test. And so now they're doing it backwards, they're adding the test to the existing code, which is almost always more annoying than writing the test first and then writing the code to make it pass. So you have a deeper problem, and they're trying to solve this deep problem with a superficial, for lack of a better way to put it, lazy fix, where they don't allow you to merge a PR without a certain code coverage level. None of that was meant as a criticism of what you said. Those are just the thoughts I have about code coverage in general.

SPEAKER_00:

What you're saying is really compelling, and I track with it quite well. So, in quick response, one of the things that I would say is that I have found code coverage helpful, like if I'm tracking maybe the coverage of branches, and then I say to myself, oh, you know, I never did check that transition. So when you mentioned habits a moment ago, I also think about what are, for me, my professional practices when it comes to building the system as an engineer and delivering it in such a way that I have high confidence in its overall correctness. And when I have that habit or that professional practice, I think what you're saying is that it should drive us in the direction of delivering top-notch, high-quality software. And I have still found, though, that when I aim for that, there can be cases where I overlook, oh, I've never tested this branch, or, oh, I forgot, what if that loop executes this number of times? And so at least for me, there have been some circumstances in which code coverage or other kinds of proxies have pointed me in the direction of that aha moment where I realized there's a way in which I might have inadvertently deviated from my professional practice. The other thing that I'll very quickly say is connected to what you do if you haven't shifted it left and you've got this PR and you just want to increase coverage according to some artificial metric. And actually, in my experience, this goes back to test flakiness. Because what happens in those situations? I've worked with various development teams, and one of the things that can happen is somebody just says, fine, I'm gonna automatically generate some test cases using whatever approach you think is cool, whether you're using Copilot or Tabnine or you're just pasting stuff into ChatGPT, and you end up with a whole bunch of tests which you really didn't think about, and you only wrote them in order to get high coverage. And in some of my experiences, guess what happens? You end up with flaky test cases that pass that one time around, you got your PR merged, and then later on down the road, using a term we said before, the test suite health has decreased further. And now you thought you were doing something that helped your team and its velocity, but what you've actually done is hindered your team and its velocity.

SPEAKER_02:

Yeah, it's depressing. You know, whenever I've worked with teams, a lot of times the presenting symptom is that, hey, we have flaky tests, or we want our developers to write more tests or better tests, or whatever it is. But then we dig deeper into the root causes, and there are usually multiple layers of symptoms and causes. What I often find is that the tests are written in kind of a perfunctory manner, with a mindset of: we're writing tests because we have to check that box. We're writing tests because I have to be able to say that I wrote a test for this feature. I don't really want to. And maybe they understand that the tests are valuable because they've been told so, and they kind of understand that, yeah, automated tests, fewer bugs, all that makes sense. In a vague sort of way, they get it. But they haven't really bought into it at the level that's necessary in order to be successful with the practice. It's like they're reading the holy book and repeating the scriptures, but they haven't converted to the religion yet, you know? Anyway, that's what I often see. And then once we dig into those testing skills, it turns out that there are deeper, more fundamental programming skills and habits that aren't present yet. So, for example, deciding what you're gonna do before you do it. There's a certain thing you want to achieve, you have a vague idea of what it is, and people will just go from vague idea to, thirty seconds later, start typing. And that can't be a successful programming methodology under any circumstance. Whether you're writing tests or not writing tests, you're gonna tie yourself in knots very quickly if you just start implementing things without having an idea of what you're gonna do. The industry went from the extreme of big design up front, where you try to plan a year's worth of coding in advance, to no design up front, no planning at all. But that's not what agile programming means. That's not what evolutionary design means. You can have a plan, and it's beneficial to have a plan. The significant thing is that your plans should respond to feedback and contact with the real world, and you should be totally open to changing your plans and throwing out plans. But that doesn't mean that you shouldn't make plans in the first place. Anyway, like I said before, everything's connected to everything else. And when I notice these symptoms with test suites, people want to fix these symptoms, but unfortunately there's this huge, huge picture that needs to change significantly before those symptoms can be solved.

SPEAKER_00:

I think you're right. Often it comes down to, as an engineer, what is our mindset and what are our commitments? You know, what are we deeply committed to when it comes to shipping high-quality software? And I can even say myself, there have been times where I just wanted a quick fix. You know what I mean? I had to be able to figure out that bug, and I had to ship a fix to that bug, and I had a very tight time schedule. And there have been cases where I have done that. Now, regrettably, sometimes I wasn't in the right mindset when I did that, and so I might not have had enough empathy for the other programmers on my team, and perhaps I also didn't have enough empathy for future me, who was going to have to maintain that test suite or better document that bug or whatever it may happen to be. So I think you're right. A lot of it ultimately comes down to the systems that we put in place as engineers, and our mindset when we're following those systems, and our commitment overall to both good engineering practice and to shipping high-quality software.

SPEAKER_02:

And that triggers a whole other rant for me, and I will hold myself back. But I just can't go without saying what I was thinking. So many organizations aren't even trying to do high-quality work, which is frustrating. I'm reading this book right now about how they did software development at Apple, and it's really interesting, the stark contrast between a place that is trying to create high-quality software and the standard way of doing things, which is just get something done and make it so that it technically meets the requirements, but nobody really minds that much if it's not very good. Which is really frustrating, but that's a whole different story.

SPEAKER_00:

Yeah, so we've been talking about the word quality, and I feel as though I'm now obligated to mention the book called Zen and the Art of Motorcycle Maintenance, because it's a book all about that word quality and what it means. And you mentioned Apple a moment ago, which makes me think of the fact that often there was that quality mindset, not only in the system and how it was built, hardware and software, and how it looked, but even how it looked on the inside of the machine.

SPEAKER_02:

Yeah.

SPEAKER_00:

And I think that's important for us as software engineers as well. And that comes out in the book Zen and the Art of Motorcycle Maintenance. We should care about what our system looks like on the inside, even if the people who are using our software don't see our designs and they don't see our setup in CI/CD. And frankly, they don't necessarily care about our test suite and how we wrote it. But all of those things ultimately creep out into people's experience of our software, and they can discern whether it's high quality or not. And sometimes those symptoms are internal, but they tend to sneak out, and other people then experience them.

SPEAKER_02:

Oh yeah. It's like, again, everything's connected to everything else. How you do anything is how you do everything. I'm thinking of that expression, shined boots save lives. I think part of why that expression caught on is because it's so, what's the word? You know, obviously shined boots don't literally save lives. And so it's kind of a humorous saying. But the point, obviously, is that applying a high level of discipline to all areas has a material benefit. And so having a high level of discipline in the way your code is designed, and even in things like the physical environment that you work in, having a clean, tidy, quiet space to work in and stuff like that. Everything's connected to everything else and influences everything else, and it all has a very meaningful material impact.

SPEAKER_00:

Yeah, I agree. The other thing that I would say, to build on your comment from a moment ago, is that I have to care not only about what I'm building and how I test it, but I also have to care about my interactions with my team members. Because the way in which we interact with each other, and how we respect each other and appropriately treat each other, the dynamic in our team and the culture in our organization, that's also a kind of everything-is-connected. And so, as engineers, we have to think about: how do we talk to our team members? How do we document our code so that when other people look at it, they're able to access it and to test it or to understand it? So I think you're right. Everything is connected, and it has to do ultimately not only with the code and the test suite, but also with our relationship with the people who are building the system.

SPEAKER_02:

Totally agreed. And there are probably about ten more topics in there that we could do a whole podcast episode on, but we're running short on time. It's been really good talking with you, Greg. Is there anywhere that you'd like to send people to learn more about you and what you're up to?

SPEAKER_00:

Yeah, if possible, we can include my website in the show notes, but my website is GregoryKapfhammer.com. I've also built a number of open source software testing tools, primarily for the Python programming language, and I link to them on my website. My students and I have also released a number of online learning platforms, so if you or the listeners are interested in studying other topics as well, I link to those from my website also.

SPEAKER_02:

Awesome. Well, Greg, thanks so much for coming on the show.

SPEAKER_00:

Thank you. It was a delight to talk with you today.