Conversations on Applied AI

Kevin Church - Data, Information, Knowledge and Profound Knowledge

August 31, 2020 Justin Grammens Season 1 Episode 8

In this episode, we go deep with the always fascinating Kevin Church, an expert in the field of data and statistics and a Six Sigma Master Black Belt.

A chemical engineer by training and applied statistician by vocation, Kevin has more than 20 years of experience in the practical application of statistical thinking and the scientific approach to improve business performance. Kevin helps organizations by leading teams on Predictive Analytics, Executive Coaching, and everything in between. He's spent the past 3 years at UnitedHealth Group / Optum and has been involved in co-founding and building multiple meetup groups that focus on educating attendees on data science through competitions that help non-profits.

We cover a ton of ground in this episode: everything from where Kevin got his start, to how he has been able to continually apply first principles from W. Edwards Deming and Six Sigma to dozens of challenges in his career that involve data, touching on the applications of Artificial Intelligence to create profound knowledge and delight the customer.

If you are interested in learning about how AI is being applied across multiple industries, be sure to join us at a future Applied AI Monthly meetup and help support us so we can put on future Emerging Technologies North non-profit events!


Enjoy!
Your host,
Justin Grammens

Kevin Church :

So when I think about big data, and I know I listened to your first podcast, and you talked about how, you know, everybody wants to go find the newest, latest, greatest tool, and sometimes we get fascinated with the shiny toy, and that I think can be a problem. You know, one of the powers of big data is we not only have more rows of data, and I don't know that we need more rows of data to be smarter. Oftentimes, like when I hear people analyzing datasets with, you know, millions and millions of rows, I'm not convinced you couldn't randomly sample that and come up with the same answer with a much smaller data set. But what I find powerful is the columns, the amount of information you get, and how the tools now can help you sift through hundreds or even thousands of possible features to find the ones that actually have predictive power.

AI Announcer :

Welcome to the Conversations on Applied AI podcast, where Justin Grammens and the team at Emerging Technologies North talk with experts in the fields of artificial intelligence and deep learning. In each episode, we cut through the hype and dive into how these technologies are being applied to real-world problems today. We hope that you find this episode educational and applicable to your industry, and connect with us to learn more about our organization at AppliedAI.mn. Enjoy.

Justin Grammens :

Welcome everyone, to the Conversations on Applied AI podcast. Today on the program we have Kevin Church. Kevin is a data scientist and Six Sigma Master Black Belt, a chemical engineer by training and an applied statistician by vocation. He has more than 20 years of experience in the practical application of statistical thinking and the scientific approach to improve business performance. Kevin helps organizations by leading teams on predictive analytics, executive coaching, and everything in between. He's spent the past three years at UnitedHealth Group and has been involved in co-founding and building multiple meetup groups that focus on educating attendees on data science through competitions and helping nonprofits. Welcome, Kevin.

Kevin Church :

Thank you for that introduction.

Yes, well, maybe as a starting point, you can give listeners a short background on yourself and the trajectory of your career that got you here today.

Well, as you said, I'm a chemical engineer, and I started off in the oil industry. I grew up in Minnesota and moved to California after graduation, and after about 10 years in the oil business, I met Dr. W. Edwards Deming, who some of the listeners may remember was the American who went to Japan in 1950, at the request of the Japanese government, to help them improve business processes. And improve he did. When I was a kid growing up, "Made in Japan" had a connotation of being kind of cheap junk; things like Japanese toys wouldn't last very long. And now, today, the most reliable automobiles in the world are produced, for instance, by Toyota Motor Company, and that's a direct result of the influence that Deming had.

So by the time I met him in 1987, he was 87 years old, still going strong and teaching these four-day seminars that would have thousands and thousands of people in them. He said at the start of that seminar I was sent to (and frankly, I'd never heard of him; I fly off from California to Chicago, and I'm in this room with 2,000 other people), he says, "If you pay attention to what I'm about to teach you, it can change your life." And I may have been born in Minnesota, but I'm kind of from Missouri when it comes to stuff like that. So I'm like, yeah, that's not gonna happen. But I'll be darned if by the end of the week I wasn't getting in line to get my book signed.

And it really did change my career trajectory, in that I learned the value and the importance of delighting customers. And you do that by improving your business processes. That customer is really the focus of all that you do, and the acts of, you know, increasing shareholder value, turning a profit, all those things are consequences of taking really good care of your customers. So that was one of the most valuable lessons, along with the chance I had to work with one of Deming's masters, a gentleman named Dr. Harold Heller. And Harold is, in my opinion, the finest applied statistician in the world. He had come out of Goodrich Chemical Company, and he was running a consultancy when we met him.

I learned from him this transformation, this process you use to convert data into information, into knowledge, and eventually profound knowledge. And that progression, which a lot of times I see even today, with all the power we have at our fingertips in our computers, we stop at taking the data, you know, extracting the data and putting it in an Excel spreadsheet. That's what we call tables. We don't even bother to graph it. And this isn't graphing in the sense of a graph database; this is just creating a picture, turning the data into a picture, and a picture is indeed worth a thousand words, and then you can learn something. So that's the knowledge part. But what Deming talked about was that you needed to go one step further, and that was profound knowledge.
He really thought that was what the role of the leadership in an organization should be: to understand the causal systems to such a profound extent that you can actually predict performance, right, which is a lot of what artificial intelligence is driving for with predictive analytics. And Deming was talking about that because if you can understand the causal system to a profound extent, then you can distinguish those parts you have control over from those parts you have to be able to react to, that you do not have control over. And the parts you can control, you can then tweak to try to improve your performance. And again, in Deming's world, it was all about delighting the customer.

Excellent, excellent. So you were doing this in a chemical engineering sense at this time, correct?

It started in an oil refinery, but the concepts that Deming taught were pretty universal.

Sure, so it really can be applied anywhere.

So after a couple of years of being a client of Dr. Heller, I transitioned into being one of his employees, and I was a consultant with his small firm for 10 years. And then I've also freelanced since then for about another 15 years. So I've worked in healthcare, manufacturing, oil, chemicals, nuclear power, water treatment, medical devices, and even one time I helped a radio station analyze some of their customer response data. The tools are pretty universal, and every going enterprise has customers. So if you've got customers and you've got data, then there's hope for you, because we have the tools and the methods to be able to improve things.

Yeah. I guess maybe walk me through this process a little bit, of data, information, knowledge, and profound knowledge, and maybe how does Six Sigma sort of apply to that? Because I know a little bit about Six Sigma, but maybe you can enlighten our listeners as to how that works and how that plays into this.

Well, I'll use an example from current events if I might, and that is the tragedy of the killing of George Floyd in Minneapolis on Memorial Day. Anytime there's a tragedy like this, particularly the ones that revolve around gun violence, there'll be lots of people who want to jump up, raise their hand, and propose a solution. And we call that jumping to solution. And what happens is, if you jump immediately to solution, and you don't, you know, do the work of getting to root cause, which is part of what you're doing when you're converting data into information, at least in the world of Six Sigma... Often what you're trying to do is understand the causal system, so that you can figure out what you can tweak or change or remedy to improve the outcomes. But if you just focus on the solution, it's like getting shot by a gun and putting a band-aid on it. It may seem like a good idea, but it's not going to help you.

The quick fix, the jumping to solutions, is expedient. Politicians, I think, want to be seen as being proactive and kind of taking charge of the situation. But inevitably those fixes fall short. Consider the War on Drugs; we've been fighting that war for as long as I can remember. I mean, remember, Elvis Presley went to President Nixon and wanted to be the drug czar. Yeah, we're still fighting that war, because we haven't really addressed the root causes. So that's kind of how that transformation and the data play into it: if you want to understand the causal system, you've got to get into the data. That's the best way to do that, for sure.
And as you improve the process, is there like a looping cycle that happens, I guess, with regards to: we don't have enough data right now, so we need to go out and get more, and once you get more and feed it in, where does the whole direction go?

Well, sometimes there are kind of two situations. And this is where we'll talk about the lessons from the data science Venn diagram, which anyone who listens to this podcast can just Google, "data science Venn diagram," under images, and you'll get all kinds of examples. My favorite one shows these three circles, and at the intersection there's a unicorn. A little side note: if you are in the job market for a job in data science, you will often see the job descriptions written as if they're in search of a unicorn. Right, and frankly, I don't know what the solution to that problem is. But the reason they put the unicorn at the center is because they're rare. In my work, I think I've only met one or two unicorns in my life, and that's people who have outstanding, you know, hacking skills, coding skills, and who are also the subject matter experts. And then in the diagram, in the third circle, they'll put statistics, or math and statistics. And I really think that's too narrowly focused; that should be process improvement, which could include the math and the statistics. Because, you know, data science applications, in my opinion, are ultimately efforts to improve something: improve some process, improve an outcome, improve a result. And therefore we ought to be kind of systematic about how we go about doing that.

In process improvement methodology, particularly the world of Six Sigma, there's the DMAIC five-step methodology, where you define a problem. In the measure step, you create a baseline, to prove in data that in fact the problem exists. The third step, analyze, is to get to the root cause. And then in the fourth step, improve, you finally get to solutions. So rather than jumping straight to solution, there's all this due diligence you have to do up front to make sure that you arrive at the elegant solution. And even that you'll test on a small scale. And the elegant solution is: solve the problem, don't create any new ones.

So when I first heard this concept of defunding the police, and maybe I'm getting too political for an AI podcast, but I thought, well, that'll never work, because of just human nature. It's like saying we don't need regulation of business; you know, look how that worked for us in 2008. No, you're going to need a policing function. But now, if you look, there's more nuance to what people want, I'm learning, and they want to kind of redirect some of the funding. And I heard the City of Minneapolis has actually got some research going on where they're analyzing 911 calls to look at what the nature of those calls is, because they triage the calls into three departments.

Is that fire?

It was fire, police, and maybe ambulance, yeah, those are the three. And they're saying that a lot of the calls go to fire and ambulance, and everything else goes to police, but for some of the police ones, like the domestics, maybe we don't need to be sending people out, you know, who are as highly trained and armed as a police officer. Sure, maybe somebody that can just calm the situation by coming in, rather than the police; maybe a social worker or something instead, right?
So that makes sense: you look at the data, you try to understand it, and you always have to ask yourself the question, what problem are we trying to solve here? Yeah, and the elegant solution, again, is one that doesn't have side effects. If you simply defunded the Minneapolis Police Department, there would be all kinds of adverse side effects from it, right? And that's typical of jumping to solution: you give the appearance of doing something, and you may make the particular problem you're addressing go away, but in all likelihood you will create new problems that may actually be worse. And so that's why we want to get to root cause and then test solutions on a small scale to prove that they are elegant: solve the problem, don't create any new ones. And then in the last step in the DMAIC process, you can ramp that up to full implementation as makes sense. So when you've tested it on a small scale, if it doesn't work, then you're going to learn a lesson, and you're keeping the tuition costs low by initially testing on a small scale.

Excellent, very cool. Well, you had a little formula, E equals Q times A, is that right?

Yeah, that's what I learned three years ago next week, when I started working with UnitedHealth Group, initially as a Six Sigma trainer. My boss, Bev Harbeck, had the change formula E equals Q times A. And this isn't algebra; this is more conceptual. The E is the effectiveness of some proposed solution, and that solution might be some AI application that you've built out for a customer. The Q is the quality, so it's basically how good, technically, that solution is, like on a zero to 10 scale. And the A is the acceptance: how well accepted is that tool by the user community.

One of the first times I taught this, one of the students in our class came back to me the next day and said, well, I built this really cool application that reduced, I don't remember the exact number, but it was over 50% of the effort or hours required to do this particular task. He automated it and cut the time by more than half. He said, but I never went back to see how many of my team were actually using it. And he found out that only 40% of the team was using the tool. So if your quality is a 10 but your acceptance is only a four, then you've stopped short of what you could have achieved. And the acceptance side is really the people side of the equation. And, you know, we could spend an entire day talking about how you do change management, but for sure the people that are going to be the recipients or the users of whatever tool you build need to be involved initially, so that you understand the problem you're trying to solve, and you engage them throughout the process so that they have their inputs in there and they have some ownership, and that'll likely improve your odds of the tool actually getting used.

Yeah, yeah, absolutely. I mean, I think about products and projects that I've built for companies where they haven't really interviewed the customer, right? You talk about voice of the customer, I'm assuming, in this whole process. And if it's painful to use, your customer is not going to use it, no matter how cool you think it is. And so you really need to make sure you get acceptance, whether it be from the end customer or even the internal team, it sounds like, right?

Right. Yeah, for sure, for sure.
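As a rough illustration of that E equals Q times A idea, here is a minimal sketch in Python; the function name and the example scores are hypothetical, not from the episode, and the zero to 10 scales are just the conceptual ones Kevin describes.

```python
def effectiveness(quality: float, acceptance: float) -> float:
    """Change formula E = Q x A, with both inputs scored on a 0-10 scale."""
    return quality * acceptance

# Hypothetical numbers echoing the story above: a technically excellent tool
# (Q = 10) that only 40% of the team adopts (A = 4)...
print(effectiveness(10, 4))  # 40 out of a possible 100
# ...versus a merely good tool that people actually use.
print(effectiveness(7, 9))   # 63
```

The arithmetic is trivial; the point is that acceptance multiplies, rather than adds to, technical quality.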
And so, just with regards to building cool things, have you seen teams just building models for the sake of building models, again, not fully understanding what they're going after? And what are, I guess, some pitfalls you've seen, and some ways you've tried to maybe address that?

Well, I should know better, frankly, but I'm as guilty as the next person: if you put a good data set in front of me, I'll want to dive into it. And my favorite example of this was, and Justin, you might have actually been involved with this, but there's a housing nonprofit in Minneapolis called Aeon that was having difficulty with what they called their water appliances, which are basically things like toilets and faucets, leaking in their buildings. And their tenants, who are all low income, aren't charged a water bill. Like, for instance, if you own your own home, you've got a water meter in your house, and that's how they decide what to bill you, and the onus is on you to make sure you maintain your water appliances so that you're not wasting water, because you'll get charged for it. But in these buildings, Aeon was eating that cost. And so they asked us if we could build a model that would allow them to monitor the billing information and identify rental buildings that had likely problems with appliances, right?

And so we did a great bit of work in a Saturday meetup. We had prepared the data in advance, and we had two teams. I led one team that took more of a dinosaur approach, with just a linear regression model, and we had our model, at least a rough model, done by lunchtime. The other side of the room was using Python and coding, and we got one model done, for one building, before they did, but a couple hours into the afternoon they'd built models for, like, all 50 buildings, right? Once they wrote the code, they could just run the data. And so we then built the tool: when they uploaded their billing information for all these buildings every month, they could identify the buildings that were suspect. Yeah. So it was a pretty cool application of data science.

However, I was doing some work with the City of Minneapolis not too much later, and learned that the next generation of water meters for buildings and residences can be read via Wi-Fi. So the city, right as we speak, is in the process of changing out, I think it's 89,000 water meters, and that's residential water meters in the city, or at least it's bigger than the city, it's their service territory. And they're going to be able to read them all by Wi-Fi, potentially in real time. That's not their plan, right? They just have to read them for billing, but you could read them in real time. And I thought, well, if you've got the technology where you can read the water meter in real time, then you could simply put those meters into these rental buildings. Yeah. And you establish a baseline, like, in the middle of the night, when most people are sleeping, what is your water consumption? And this is where I kind of go back to Six Sigma and the need for statistics: there's a tool called a control chart. Basically, it's a way to look at data to see if a process is stable, you know, random variation about some central tendency. So there's probably some small amount of background water use in a building of that size that you could measure, right?
But it turns out it only takes one toilet with a leaky flapper valve, and the water consumption goes up enough that you can pick it up off the meter head, and a control chart would set off an alarm. And then you wouldn't have to wait for the bill to come out, right? You could send someone out the next day. And with the technology, you could probably have a system that monitors that control chart, spits out an email, and sends it to the building maintenance manager, and the next day they could be investigating. Fascinating. Yeah. So because of the new technology, you can kind of flip the whole solution on its head.
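Here is a minimal sketch, in Python, of the kind of control-chart check Kevin describes, assuming you already had overnight meter readings for a building; the baseline numbers, the three-sigma limits, and the reading cadence are illustrative assumptions, not details from the project.

```python
import statistics

def control_chart_alarm(baseline_readings, new_reading, sigma_limit=3.0):
    """Flag a reading that falls outside mean +/- sigma_limit * standard deviation
    of a stable baseline period (random variation about a central tendency)."""
    mean = statistics.mean(baseline_readings)
    sd = statistics.stdev(baseline_readings)
    return abs(new_reading - mean) > sigma_limit * sd

# Hypothetical overnight (2-4 a.m.) gallons-per-hour readings for one building.
baseline = [3.1, 2.8, 3.4, 3.0, 2.9, 3.2, 3.1, 2.7, 3.3, 3.0]
print(control_chart_alarm(baseline, 3.2))  # False: normal background use
print(control_chart_alarm(baseline, 9.5))  # True: e.g., a leaky flapper valve
```

In practice, the True branch is what would trigger the email to the building maintenance manager.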

Justin Grammens :

Right, flipped on its head from what the teams had originally come up with. Well, and that's the risk of being a data science meetup: we look for places to build models, and that's one where maybe a simpler solution might have been more elegant. Yeah, yeah. As I think you've said, if you own a hammer, I guess everything starts to look like a nail.

So around model development, do you have some common practices that you typically follow, some pitfalls, things that should be avoided? Yeah, I do, actually. And again, this goes back to, you know, the first model I ever built was probably in 1989, so I've been doing this for a long time: mostly linear models, often in the oil industry, and even when I worked at The New York Times and other newspapers, we were using linear models most of the time, with a continuous dependent variable, but also some logistic regressions when we had a binary dependent variable. What were the tool sets at that time? Oh, my gosh. I mentioned Dr. Heller, this applied statistician, and he had

Kevin Church :

programmers writing in Fortran. He had a tool for designing experiments, he had one for doing multiple variable regression, so what we call linear regression, but you're not limited to linear, obviously; you could introduce terms to build curvature into your models. And then they had another Fortran program that did multiple property optimization. So it was all proprietary, custom-written stuff. And then when the next generation came out, he had programmers in India who were writing things, I think in C.

But after I left The New York Times in 2002, I started using Minitab, which is a tool I think came out of Penn State, and actually one of the founders of that was another Deming master, a guy named Brian Joiner, Dr. Joiner. And so I've been using Minitab; that's what the Six Sigma people at UnitedHealth Group use. So that's my tool of choice, and although I'm fascinated with the programming languages, and I can now get around in SQL and do a little bit of work in SAS, Minitab is so much quicker because of the user interface. But again, not as powerful. Sure, if you have to build one model, I'll get done before you do. But remember, our first meetup we did here in town was predicting sales at fast food restaurants, right? And we had 50 restaurants. So I ended up building 50 models where the people who were coding built one and then just replicated it. Oh, I don't want to be disparaging, though it may sound like I will be in a minute here, about how beautiful the technology is; just let me comment on that for a second.

So when I think about big data, and I know I listened to your first podcast, and you talked about how, you know, everybody wants to go find the newest, latest, greatest tool, and sometimes we get fascinated with the shiny toy, and that I think can be a problem. Yeah, you know, one of the powers of big data is we not only have more rows of data, and I don't know that we need more rows of data to be smarter. Oftentimes, like when I hear people analyzing datasets with, you know, millions and millions of rows, I'm not convinced you couldn't randomly sample that and come up with the same answer with a much smaller data set. But what I find powerful is the columns, the amount of information you get, and how the tools now can help you sift through hundreds or even thousands of possible features to find the ones that actually have predictive power. And that's where the power of data science, to me, is the biggest place that's going to move the needle.
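Here is a minimal Python sketch of the two points above: that a random sample of rows often tells the same story as the full table, and that the leverage is in screening many candidate columns for the few with real predictive power. It uses pandas and scikit-learn on synthetic data; the row counts, column names, and true signal are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_regression
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical wide dataset: lots of rows, 200 candidate features,
# but only two columns actually drive the outcome.
n_rows, n_cols = 50_000, 200
X = pd.DataFrame(rng.normal(size=(n_rows, n_cols)),
                 columns=[f"feature_{i}" for i in range(n_cols)])
y = 3.0 * X["feature_7"] - 2.0 * X["feature_42"] + rng.normal(size=n_rows)

# Rows: a modest random sample recovers essentially the same coefficients.
sample = X.sample(n=5_000, random_state=0)
full_fit = LinearRegression().fit(X[["feature_7", "feature_42"]], y)
samp_fit = LinearRegression().fit(sample[["feature_7", "feature_42"]], y[sample.index])
print(full_fit.coef_.round(2), samp_fit.coef_.round(2))  # nearly identical

# Columns: screen all 200 candidates and keep the ones with real signal.
f_stats, p_values = f_regression(X, y)
print(pd.Series(f_stats, index=X.columns).nlargest(5))   # feature_7 and feature_42 on top
```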

Justin Grammens :

Yeah, and that's something that a human would not be able to do, right? It's just, you know, being able to figure out which features are actually having the most impact. Well, I mean, theoretically, I guess you could, but that is a perfect application for an artificial intelligence system: figuring out which features are actually impacting it.

You know, I was listening to a podcast this morning. I forget which company it was, but it was some artificial intelligence company that was doing essentially automated driving. And what they found was, when there was a certain hue to the sky, the car would turn right, for some reason. And after a lot of analysis and scratching their heads trying to figure out what was going on, they realized that when they had actually trained this car, they had trained it in Nevada, and on the days that they had trained it, the sky had that color, and as it was driving around the track, it was making right turns every time. So it basically thought it needed to make right turns based on the color of the sky. And it was one of these things where, you know, maybe they had too much data, maybe it was too high fidelity, maybe they had too many features going into this model, and they kind of shot themselves in the foot. And it took them a lot of time to figure this out. I'm sure they applied some of these techniques that you're talking about to try and narrow down exactly why it was doing that, because when a car is driving like this, it's probably taking thousands of inputs at any one particular time. And I just found it really interesting. I mean, I guess a lot of it comes down to feeding the right data into building the model.

Kevin Church :

Well, you may not know the right data. So, the way I was taught these methods by Dr. Heller was, once you had a model that passed the statistical requirements to be valid, so you know you've done your internal validation, perhaps you've even done some external validation: if it's a continuous dependent variable, you look at your residuals, your errors, and make sure they're normally distributed, they're constant with respect to the dependent variable, they're constant with respect to the independent variables, and, most importantly in my mind, they're constant with respect to time.

And once you satisfy those requirements, then, and of course I'm working with a linear regression, so I have an equation, I can use that equation. Say I've got 10 features that are statistically significant: I can hold nine of them constant at their averages and draw a graph to show you what happens when we vary that 10th one over the range it varied in the data. And then you would prepare these 10 graphs and sit back down with your subject matter experts and say, this is what the math is telling us is going on here. You know, when x1 increases, the dependent variable goes through an optimum at some point. Does that make practical sense? Can you explain this in terms of maybe the science, the thermodynamics, the heat transfer, whatever it is, so that you don't get the color of the sky trying to direct which way the car turns? That would not pass a sanity check.

Again, when we learned to do this, the tools were so clumsy and low-powered that even if you had the data, you couldn't afford to put hundreds of features into a model, right? You kind of did, for those people that are familiar with Six Sigma, as part of doing the causal analysis, a fishbone diagram of brainstorming, where you try to characterize all the different types of influences that could affect the dependent variable. Now, sometimes you have a problem like, again, the toilet is leaking, and you can put some type of automation on the meter head to read the flow, and there's a limited number of things that can cause that. But other times we'd have situations where there wasn't an obvious cause, and that's when it may be multifactorial. And the real tricky one is when the factors interact with one another, where, you know, the coefficient of one feature is dependent on the level of some other feature. The example I use in real estate is, you know, they say it's location, location, location. So if you build a model to predict real estate prices, you will get a dollars-per-livable-square-foot coefficient, right, amongst other things. Now, if you pick up a given house, let's say in my neighborhood, and you move it to Edina, which is a more expensive neighborhood, what's going to happen to the price of that house, right? And mathematically, what's going on is that the coefficient of square footage is changing: the coefficient of square footage is a function of location. That's an interaction, where the effect of one variable is modified by the level of some other variable.
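Here is a minimal Python sketch of the routine Kevin describes above: check that the residuals behave (here, just that they center on zero and don't drift with run order), then hold every other feature at its average and vary one over its observed range to draw an effect curve. The synthetic process data, feature names, and curvature term are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Hypothetical process data: the response curves through an optimum in x1.
df = pd.DataFrame({"x1": rng.uniform(0, 10, 500),
                   "x2": rng.normal(size=500),
                   "x3": rng.normal(size=500)})
df["x1_sq"] = df["x1"] ** 2        # curvature term, like the old regression tools allowed
y = 5 + 4 * df["x1"] - 0.4 * df["x1_sq"] + 2 * df["x2"] + rng.normal(size=500)

model = LinearRegression().fit(df, y)
residuals = y - model.predict(df)

# Residual sanity checks: centered on zero, no drift with run order (a stand-in for time).
print(round(residuals.mean(), 3))                              # ~0 by construction
print(round(np.corrcoef(np.arange(500), residuals)[0, 1], 3))  # should be near 0

# Effect plot: hold x2 and x3 at their averages, vary x1 over its observed range.
grid = pd.DataFrame({col: np.full(50, df[col].mean()) for col in df.columns})
grid["x1"] = np.linspace(df["x1"].min(), df["x1"].max(), 50)
grid["x1_sq"] = grid["x1"] ** 2
effect = model.predict(grid)       # plot effect against grid["x1"] to show the optimum
print(round(float(grid["x1"].iloc[int(effect.argmax())]), 2))  # optimum near x1 = 5
```

The last step is the important one: those curves are what you take back to the subject matter experts for a sanity check against the thermodynamics, the heat transfer, or whatever the relevant science is.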
And interactions are hard to tease out by gut feel, even if you're the subject matter expert. But you'll learn, when you're brainstorming, to listen to the subject matter experts. So when they tell you that, well, you know, the cost to build a house in dollars per square foot is a function of what part of the country you're in, the minute they say that one thing depends on something else, then you know that's an interaction, and you have to go fishing for that. Now, if you're using a linear methodology, you have to actually build that interaction term. If you're using something like a random forest model, or any of those decision tree models, the second- and third-level branches of those trees are essentially interactions. They tell you that, well, when variable A is at this level and variable B is at this level, here's what happens. And so those things come pouring out, and this is part of where the power comes from in those decision tree methods: they very easily pick up on the interactions. But again, I always like to go back to the causal system and scientific principles at the end of the day. And this is one of the big problems I have with black box methods: it's hard to know what your model is actually doing. Like the example you gave, I'll bet it took a lot of work for them to figure out it was the hue of the sky that was causing that problem.

Yeah, yeah. What I was thinking about when you said the black box thing is another big thing I've been seeing in the news, and that's the bias of some of these models, right? You're feeding it so much data that when people come and say this model is biased in a certain way, if you don't know what's really going on inside of it, inside of the neural net, because it's just so complex, it's hard then to justify or fight that, or validate or invalidate it, if you don't understand what's really happening inside. And I guess that's what people are talking about a lot these days: these black box models are really cool, but they are biased, and how do you prove whether they are or aren't if you don't even know what's going on inside?

Correct. Yeah, there's a great book I read a number of years ago called Weapons of Math Destruction, that's M-A-T-H destruction. And the book was excellent; it was full of examples of how data science has kind of gone wrong, because they perhaps didn't understand what was going on under the hood of the black box. Or, in some cases that I've read about, the bias is actually built into the data, so the model has no choice but to be biased in its results, because it was fed data that had the implicit bias built into it. Sure. This is not an area that I would consider myself as having any expertise in whatsoever, but I think, particularly in light of what's going on today, it would be important to try to understand and root bias out of your model. And I know that we've got people at Optum that are working on that very thing.

Yeah, I'll be sure to put links to that book in the summary and description of this podcast. And since we talked, I actually downloaded it on Audible, so they actually have an audio version of it. I haven't started to listen to it yet, but I'm super excited to, so thanks for suggesting that.
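Here is a minimal Python sketch of the real-estate interaction described above, on a hypothetical two-neighborhood dataset: a linear model only sees the different dollars per square foot once you build the sqft-times-neighborhood term by hand, while a shallow decision tree picks up the same structure from its second-level splits. The data and column names are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)

# Hypothetical listings: dollars per square foot is higher in the pricier
# neighborhood, i.e. the sqft coefficient depends on location (an interaction).
n = 2000
sqft = rng.uniform(1000, 3000, n)
pricier = rng.integers(0, 2, n)          # 1 = the more expensive neighborhood
price = 50_000 + sqft * np.where(pricier == 1, 300, 150) + rng.normal(0, 20_000, n)

X = pd.DataFrame({"sqft": sqft, "pricier": pricier})

# Linear model: you must build the interaction term yourself.
X_int = X.assign(sqft_x_pricier=X["sqft"] * X["pricier"])
lin = LinearRegression().fit(X_int, price)
print(dict(zip(X_int.columns, lin.coef_.round(1))))  # ~150 $/sqft base, ~150 more via the interaction

# Tree model: deeper splits on the other variable capture the same
# interaction without any hand-built term.
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, price)
print(round(tree.score(X, price), 3))                 # approximates the interacting structure directly
```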
You know, the other thing that I was thinking about is, you had mentioned not deleting outliers without trying to learn where they come from. Do you have an example of that, or use cases like that?

Well, I'll tell you, the way I was trained as a chemical engineer, when we did laboratory experiments and we would analyze the data, if we found we had an outlier, there are tools in statistics that you can use to try to determine that: you know, you take the average and the standard deviation, and if a given data point is more than three standard deviations away from the average, you might be suspicious about that value being an outlier. And we were taught as engineers to delete the outliers and then go back and rerun the analysis. And the first time I met Dr. Heller and I told him about that, he's like, didn't it occur to you to go investigate it? Because, you know, your systems, your process, your data collection, your technology produced that value. And if you think that value is bogus, there's got to be some other problem you need to go find. You know, could it be the measurement?

Like, I had an example when I did some work with Chevron Chemical back in the 1990s, where they had a lab technician who turned his back on a wet chemistry titration, which breaks kind of the first rule of doing a titration. For any of those of you who have taken chemistry, you know, you've had an Erlenmeyer flask with some liquid in it and a magnetic stirrer, and you're dropping some titrant into that, waiting for a color change. And you can't turn your back on this thing, because with one drop you'll see a little splash of color, and with the next drop the whole thing changes. And this guy was multitasking. Now, he's multitasking because they cut the staff and they had extra work to do, and that always brings to mind Dr. Deming: you can hire a monkey to cut costs, because it takes no particular genius to pull costs out of the system, just get rid of the people, right? But to remove waste, you know, that type of thing, bad, defective product, rework, delays, that kind of thing, to work those out of the system, that's how you really go about reducing costs.

But every time this guy walked away from this titration... Sometimes he got back in time, and other times he turned around and it had turned pink, and instead of rerunning the test, he just recorded the value and sent it out to operations. And now you've got a bogus high number that the operator gets in his hand and says, oh, my concentration of the fertilizer we're producing was too high, I'm going to change the process. And now you've just turned good product into bad product. And that's the result of having bad data and that outlier. When we ran a test at the lab where we did blind replicate samples, we found out that it wasn't that there was an outlier every time this one person was involved in a pair, but every time there was an outlier, this one person named Gene was involved in the pair. And it turned out he was the supervisor, who felt the most pressure to get the work done because his staff had been cut. Yeah, sure, sure. So, and there's more to that story, but outliers are a golden opportunity to learn. Which kind of gets to my next point: we need to learn how to look at the errors in our models as opportunities to learn. And so I'll go back to that meetup I mentioned, where we were predicting the sales for the 50 fast food restaurants.
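Here is a minimal Python sketch of the three-standard-deviation screen Kevin mentions, written to flag suspects for investigation rather than silently delete them; the lab results and the analyst column are hypothetical, loosely echoing the blind-replicate story.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Hypothetical titration results: mostly stable chemistry plus a couple of rushed readings.
results = np.concatenate([rng.normal(10.0, 0.2, 60), [12.5, 13.1]])
analysts = ["ann", "lee", "gene"] * 20 + ["gene", "gene"]
df = pd.DataFrame({"analyst": analysts, "result": results})

mean, sd = df["result"].mean(), df["result"].std()
df["suspect"] = (df["result"] - mean).abs() > 3 * sd

# Don't delete the suspects -- investigate them.
print(df[df["suspect"]])
print(df.groupby("analyst")["suspect"].mean().round(3))
```

The groupby at the end is the "go investigate" step: rather than dropping the flagged values, you ask who or what keeps showing up when they appear.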
It turned out they were Wendy's restaurants in the Southeast. And what we did was, we had three years' worth of, I think it was weekly, data, and we would plot the errors, the model errors, over the, you know, three years, 52 weeks a year of data you had for this one store. You've got these errors, and the average error defaults to zero; by definition, your average error for your model is zero. But when you plot these errors over time on a control chart, and you put reference lines at plus or minus one, two, and three standard errors, you find out that there are times when the model doesn't work on average. What I'm remembering is, we stacked these charts up, so you might have had 2013, 2014, and 2015 data on top of one another, and every October, like around the second weekend in October, you would have these shifts where the model was under-predicting. And so what our analyst did was a Google search to figure out what happens in this small town in Alabama in October, and you know what it was? It was college homecoming. Right, and they had their homecoming weekend, and lots of people showed up, and Wendy's had a ball. It's not like, you know, the alumni are going out for an expensive meal; they are going out to Wendy's. But you learn. So I call that going to school on the residuals: you knew the model was having trouble, and it just happened to be roughly the same weekend in October all three years. And then she did the work and found out what that was. So we built a homecoming feature, and then we could actually go back to this Wendy's restaurant and say, here's what homecoming is worth, right? So, for instance, now with COVID, if they canceled homecoming, we could tell you, well, here's the kind of hit you're going to take in your revenue based on that. So that's what I mean by going to school on the residuals: you have to study those errors, and you plot them versus time, you plot them against the x's, and you see if you can learn something that basically enhances your feature engineering.

Nice. Yeah, absolutely. One of the things we talked about was then improving your model over time, right? I think a lot of people just build a model and they just continue to use it forever without actually enhancing it. Is that something you've seen?

Yes. Yeah, it's rampant, actually. And I'm kind of a troublemaker, so I knew I could get on people's nerves a little bit if I would say, at one of these meetup groups or at a MinneAnalytics conference, that I'm not a fan of internal validation. Because, you know, to my way of thinking, if you take your analytic data set and you randomly sample it, split it into, you know, 70% train and 30% test, it's a random sample, so really those two data sets should have the same information in them. Now, if you're using some of these high-powered black box techniques, absolutely you have to do internal validation, because you can build a model that fits that 70% very well but not the 30%, so you're kind of obligated to do that. But to me, the real test is when you build the model and you put it into production: you need to monitor those residuals going forward to see how the model works on data it's never seen before. And again, I think of the concept of stability: the errors in the model should be small, but they should be consistent.
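Here is a minimal Python sketch of "going to school on the residuals" with a hypothetical weekly sales series: fit a simple model, flag the weeks whose errors sit beyond three standard errors, notice they land on the same week every year, and then add an indicator feature for it. The weekly cadence, the week-41 bump, and the dollar figures are all invented.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)

# Hypothetical three years of weekly sales for one store: a trend plus a
# bump every year around week 41 (think: homecoming weekend).
weeks = np.arange(156)
week_of_year = weeks % 52 + 1
sales = (50_000 + 30 * weeks
         + np.where(week_of_year == 41, 9_000, 0)
         + rng.normal(0, 1_000, 156))

X = pd.DataFrame({"week": weeks})
model = LinearRegression().fit(X, sales)
resid = sales - model.predict(X)

# Control-chart style screen: which weeks fall beyond +/- 3 standard errors?
flagged = pd.DataFrame({"week_of_year": week_of_year, "resid": resid})
print(flagged[np.abs(resid) > 3 * resid.std()])   # week 41, all three years

# Feature engineering from what you learned: add a homecoming indicator.
X["homecoming"] = (week_of_year == 41).astype(int)
better = LinearRegression().fit(X, sales)
print(round(better.coef_[1]))                     # ~9,000: roughly what homecoming is worth
```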
And so in the work I do now with Optum, we typically run control charts. You know, if we have weekly data, then every week we plot the model error on this control chart, and we run it out over time, and we look to see if the error randomly bounces around zero. And if it doesn't, if there's some bias, if at some point there's a shift, that suggests either there's a new factor that's influencing your outcome, or perhaps one of your coefficients changed. And, you know, sometimes you're actually trying to change the coefficients. If you're trying to get more people, for instance, to accept an offer you have as a business, you might actually want to break your model.

Like, when I worked with the Art Institute of Chicago, we built models to predict the museum attendance. And long after I'd quit working with them, they had this great system for monitoring the models. They had a bunch of different flavors: they had attendance for members, attendance for just the paid public, attendance for, like, the local Chicago area, attendance for tourists, you know, people coming in from outside the state, outside the country. And when they were voted the number one museum in the world by TripAdvisor members, their foreign attendance model broke three weeks later. Think about that. So imagine you're sitting in Europe, you're a TripAdvisor member, you see the latest rankings and say, wow, I didn't even know the Art Institute of Chicago was such a prestigious museum, I'm going to book a trip. You're not going to go tomorrow, right? You're going to go in three weeks, right? So they actually saw that their model broke, but the only reason they knew that is the model. If the model is explaining 90% of the variance in the dependent variable, then when you're running these control charts, the noise you're looking at isn't 100% of the variation in attendance, it's only 10% of the variation in attendance. And so you monitor those residuals, and then when you get what, you know, control chart people would call a rule violation, a non-random event, a shift away from zero, your errors are no longer averaging zero, and that's an opportunity to learn something. And in this case, you know, they had this effect, and then after a number of months that effect kind of died off again. Yeah, sure. But they said, you know what, we never saw a bump in the local attendance. So they ran a marketing campaign letting people in the greater Chicagoland area know that the Art Institute of Chicago was number one according to TripAdvisor, and then their local attendance model broke.

Oh, interesting. So they leveraged that.

Yes, they were smart enough to know how to leverage that. It's just a brilliant application. And by the way, it's one person doing all that work. That's one of the unicorns I know. He builds the models, he knows how to do the coding, he's essentially the database administrator for, like, their membership systems, he built the interface, he tied in all the weather data he had to go get, and, you know, the number one predictor of tourist attendance is hotel bookings, so he gets information from the bureau of tourism. And he built all of these systems to monitor that stuff. I actually had him present to a team I was working with at UnitedHealth Group, and they were like, how many people are in your group? And he said, you're looking at them.