Information theory and the complexities of AI model monitoring
Mar 26, 2024, Season 1, Episode 16
Dr. Andrew Clark & Sid Mangalik
In this episode, we explore information theory, the not-so-obvious shortcomings of its popular metrics for model monitoring, and where non-parametric statistical methods can serve as the better option.
Information theory and its applications in AI. 3:45
The importance of information theory in computer science, citing its applications in cryptography and communication.
The basics of information theory, including the concept of entropy, which measures the uncertainty of a random variable.
Information theory as a fundamental discipline in computer science, and how it has been applied in recent years, particularly in the field of machine learning.
The speakers clarify the difference between a metric and a divergence, which is crucial to understanding how information theory is being misapplied in some cases.
Information theory metrics and their limitations. 7:05
Divergences are a type of measurement that don't follow simple rules like distance, and they have some nice properties but can be troublesome in certain use cases.
KL divergence is a popular test for monitoring changes in data distributions, but it's not symmetric and can lead to incorrect comparisons.
Sid explains that KL divergence measures the surprisal or entropy difference between moving from one data distribution to another, and is not the same as the KS test.
Metrics for monitoring AI model changes. 10:41
The limitations of KL divergence and its alternatives, including Jensen-Shannon divergence and population stability index.
They highlight the issues with KL divergence, such as asymmetry and handling of zeros, and the advantages of Jensen-Shannon divergence, which can handle both issues, and population stability index, which provides a quantitative measure of changes in model distributions.
The popularity of information theory metrics in AI and ML is largely due to legacy and a lack of understanding of the underlying concepts.
Information theory metrics may not be the best choice for quantifying change in risk in the AI and ML space, but they are the ones that are commonly used due to familiarity and ease of use.
Using nonparametric statistics in modeling systems. 15:09
Information theory divergences are not useful for monitoring production model performance, according to the speakers.
Andrew Clark highlights the advantages of using nonparametric statistics in machine learning, including distribution agnosticism and the ability to test for significance without knowing the underlying distribution.
Do you have a question or a discussion topic for the AI Fundamentalists? Connect with them to comment on your favorite topics:
LinkedIn - Episode summaries, shares of cited articles, and more.
YouTube - Was it something that we said? Good. Share your favorite quotes.
Visit our page - see past episodes and submit your feedback! It continues to inspire future episodes.
Speaker 1:
The AI Fundamentalists, a podcast about the fundamentals of safe and resilient modeling systems behind the AI that impacts our lives and our businesses. Here are your hosts, Andrew Clark and Sid Mangalik. Hello everyone, welcome to today's episode of the AI Fundamentalists, where today we will be talking about information theory and distance metrics. And before we get into our topic, two things have been really hot in the news. One, Gary Marcus, our favorite, came in with some really deep thoughts, like what if all of this is just a hoax or bunk? And then the other one is the NVIDIA conference this week. So nothing we can ignore, but plenty of news to discuss. Let's get into what Gary Marcus has been thinking about the whole evolution of generative AI so far.
Speaker 2:
What's really funny with the timing of his blog post and the NVIDIA stuff is you have two opposite sides of the spectrum. Gary Marcus is basically saying, yeah, what if everybody realizes that this is kind of a... I mean, it's what we've been saying the whole time on this podcast: generative AI is not nearly as wonderful as everybody says it is and it's not going to do as much. And basically his article is kind of outlining, hey, Microsoft, Google, everybody's kind of backing away from these statements about how generative AI is going to transform everything and that general intelligence is right around the corner. It's great. It's just now documenting stuff that he's been saying for a long time and we've been saying, and now other people are kind of backing down on their pushing that this is the next, you know, this is the new oil or whatever. And that's what's great. And that's contrasting with all the absolute nonsense coming out of the NVIDIA conference about Gen AI going to be general AI within five years, and all this nonsense, if only you buy their computer chips.
Speaker 1:
Yes, a great market message for shareholders, for sure, and you can expect that from conferences. You've got to get everybody excited. But it's still with the subject matter of AI and what we're trying to predict as a technology cohort for businesses and for people, some of these statements do tend to ride the line of dangerous if we don't predict them with some sense of accuracy and levity. So I think that's the one thing that really also sits in the juxtapositions of what Gary Marcus is talking about in his article. And what if this doesn't pan out the way everybody thought, versus hyping up shareholders on where this is going to go.
Speaker 2:
Yeah, that's what's just wild. I highly recommend everybody read this Gary Marcus article, as it's really laying out some of these things: people are still struggling. We're like a year in now and businesses are still struggling to see where this actually moves the needle, besides "cool, it can help me do autocorrect." Guess what, you were already doing that before this hype anyway. Or, like, GitHub Copilot is kind of cool. It helps you write docstrings a little faster, but then you have to review it because it's wrong half the time.
Speaker 1:
But where is this?
Speaker 2:
Like, it's a slight productivity enhancement, but it's not the zero to one moment. Well, let's get into it. Information theory. Sid, you want to kick us off here?
Speaker 3:
Sure, sure, sure. So information theory, our main topic of the day, is a field that's based on probability and statistics, and it was originally designed to study how information was transferred and used, quantified as bits. I.e., one of the major questions of information theory was: how many bits of information do I need to transmit this data to you? What's the minimum number of bits needed to move data from point A to point B? And this was incredibly useful back in the early days of networking, when we were just trying to decide, you know, what kind of infrastructure you need, how do you represent data and how do you move it around. So information theory is pretty key to the field of computer science.
Speaker 2:
Yeah, information theory is hugely beneficial. It's huge in cryptography. It was used in, you know, the Enigma and things like that, cracking the code during World War II, and it's a huge baseline of basically anything of how do we even communicate via computers: how does the internet work, how do you do email. All of that is based in information theory, and one of the key concepts of information theory, as Sid said about transferring bits from one place to another, is called entropy. So entropy is really the quantification of the uncertainty of a random variable. So it's about surprise: the more unlikely an occurrence, the more surprising it is, and the more uncertain the outcome, the higher the entropy level. So that's really what you're trying to look at.
Speaker 2:
So, like, a coin flip, as an example, has two equally likely outcomes. It has less information and lower entropy than, like, a roll of a die, because you don't know which of the six you'll get. So there's a higher entropy level because there's more uncertainty. So that's really the framing piece of information theory, besides bits and things like that, of how information theory is often used. So it's this very large discipline. It's the baseline for a lot of theory in computer science on how information is used.
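The coin-versus-die comparison above can be checked with a few lines of Python. This is a minimal sketch using the standard Shannon entropy formula in bits; the function name `shannon_entropy` and the two example distributions are illustrative choices, not something taken from the episode.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    # Outcomes with zero probability contribute nothing to the sum.
    return -sum(p * math.log2(p) for p in probs if p > 0)


fair_coin = [0.5, 0.5]    # two equally likely outcomes
fair_die = [1 / 6] * 6    # six equally likely outcomes

coin_bits = shannon_entropy(fair_coin)
die_bits = shannon_entropy(fair_die)

# The die has more possible outcomes, so it carries more uncertainty.
print(f"coin: {coin_bits:.3f} bits, die: {die_bits:.3f} bits")
```

A fair coin works out to exactly 1 bit, while a fair six-sided die comes to log2(6), roughly 2.585 bits, which matches the point that more equally likely outcomes means higher entropy.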
Speaker 3:
Yeah, and for a little bit of historical context, information theory as we know it today came out of Claude Shannon's work in the 1940s, from our favorite lab, Bell Labs. And so, you know, this has been kicking around for a while, and this is pretty fundamental to computers as we know them, even, you know, 30 years before the internet was really a thing.
Speaker 2:
Yeah, there's so much that can be traced back to Bell Labs. It's crazy. Too bad that's still not a research institute, but so much key stuff in computer science and related fields came from them.
Speaker 2:
So why are we talking about information theory today? And well, we've been seeing it pop up a lot recently. We're not 100% sure why. Specifically in the field of machine learning and adjacent modeling fields, we're seeing a lot of information theory specifically around monitoring. So we wanted to get into a little bit about what that is and why, and we'll have an accompanying blog post to really get into what the specifics are of the math, of what we're talking about.
Speaker 2:
But first we have to talk about what is the difference between a metric and a divergence, because there's, I think, some misunderstanding here. And this is where information theory, as we've set up, is a very powerful and fantastic discipline. We just think it's being misapplied right now for a lot of people using it for, like, machine learning, monitoring or data monitoring and things like that, where I don't think it's the best tool for the job. We recommend you go back to listen to our non-parametric podcast. There's a lot of overlap and people using these divergences from information theory and applying them in areas where they don't necessarily make sense. But first to understand that, I think we need to understand what is the difference between? What is a metric actually? Because normally when you're wanting to quantify information, you have metrics and you can run those statistical tests on them. So, Sid, do you want to talk us through, like, what's the difference between a metric and divergence and why does this even matter?
Speaker 3:
Yeah, so let's think of a metric as having a set of rules that are required for something to be a metric. So, for example, distance is a very common metric, and when I say distance, I mean like a simple Euclidean distance, right, how far is it from point A to point B? And just at a high level, I'll describe what that looks like. In the blog post, we'll give more specifics of what that looks like. A distance must always be positive.
Speaker 3:
You can't have a negative distance from point A to B. If A and B are the same point, there should be no distance. The distance from A to B should be the same as B to A. And if you take a detour from A to B by going through a C, that needs to be either the same distance as just going A to B or greater. And these might seem obvious, these might seem simple, but when we consider divergences, they don't really follow these rules. So for use cases like modeling the drift of a model's outcomes or inputs, using divergences can get you into a little bit of trouble, especially if you don't understand what they're doing under the hood. So you should think of metrics as a very specific subset of measurements you can take, which divergences don't satisfy, but they do have some nice properties.
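For reference, the four rules described above are usually written as follows, for a distance function d and any points x, y, z. The symbols are standard textbook notation rather than anything used in the episode, and the second rule is stated in its usual two-way form (zero distance exactly when the points coincide).

```latex
\begin{align*}
d(x, y) &\ge 0                 && \text{(non-negativity)} \\
d(x, y) &= 0 \iff x = y        && \text{(identity)} \\
d(x, y) &= d(y, x)             && \text{(symmetry)} \\
d(x, z) &\le d(x, y) + d(y, z) && \text{(triangle inequality)}
\end{align*}
```

A divergence such as KL keeps non-negativity but gives up symmetry and the triangle inequality, which is the distinction the hosts build on in the rest of the episode.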
Speaker 2:
One other thing to note on the divergences that is key, and we'll get into a couple of ones we've been hearing lately that are being applied, and kind of the alternatives. But one of the things that's also important to note with these divergences is, since they're not metrics and they're not rooted in statistics, you can't do, like, p-value tests and have a certainty of something you're detecting. It's just more like, hey, I see a distance change, okay, cool, how much am I being surprised by the change in distributions? Things like that. It's just a little harder to compare apples to apples, so it just doesn't work quite as well for these monitoring use cases. So let's talk about a couple of the common ones. There's three big ones from information theory that people are using for, like, machine learning monitoring and data monitoring. So Sid, do you want to walk us through some of these common ones and what the issues are?
Speaker 3:
Sure. So our three main tests: the first one is probably the most popular, probably the one you hear the most about, and is called KL divergence. That's, forgive my pronunciation, Kullback-Leibler divergence. This gets thrown around in a lot of papers, this gets thrown around in information theory, and you should think of this as basically: if we have data distribution A and data distribution B, what is the surprisal, or the entropy difference, in moving from A to B? You know, do we really consider B a different sample? So this may remind you of the KS test, but this has some different properties and is rooted more in information theory. And, notably, this test is not symmetric, which causes a lot of problems for a lot of people: they test the difference for going from A to B but then don't remember that the distance from B to A is not the same distance. So there is actually a very important directionality there which is not always noted.
Speaker 2:
Yeah, and that makes it very difficult to be comparing or using as, like, a standard. Hey, I'm trying to monitor these inputs to my model and I want to see what happens over time. Well, the shift if, like, the data is decreasing in value is going to be different than if it's increasing, so it's kind of hard to keep an eye on it. You'd have to have extra rules on top for how to monitor for that. The other crucial thing that you would have to have error catching in your code for is zeros: it's undefined at zero, so it actually is going to throw an error if there's no change at all, like if there's no surprise in the distribution change. So there's all this error catching you're going to have to have in place, and all this workaround, to even get a KL divergence to work in a continuous use case.
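The directionality and the zero handling discussed above can be seen in a small sketch of the textbook discrete KL divergence, measured in bits. The function name `kl_divergence` and the three-bin distributions are made up for illustration. Under the standard definition, the case that needs guarding is an empty bin in the distribution being compared against: the divergence is infinite there, while two identical distributions simply give zero.

```python
import math


def kl_divergence(p, q):
    """KL(P || Q) in bits for two discrete distributions over the same bins."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue  # a term with p_i = 0 is taken as 0 by convention
        if q_i == 0:
            return math.inf  # P has mass in a bin where Q has none
        total += p_i * math.log2(p_i / q_i)
    return total


# Hypothetical binned distributions of one model input at two points in time.
baseline = [0.5, 0.3, 0.2]
current = [0.8, 0.1, 0.1]

forward = kl_divergence(baseline, current)   # baseline -> current
backward = kl_divergence(current, baseline)  # current -> baseline
print(f"forward: {forward:.4f} bits, backward: {backward:.4f} bits")

# An empty bin on the right-hand side makes the divergence infinite.
with_empty_bin = [0.9, 0.1, 0.0]
blown_up = kl_divergence(baseline, with_empty_bin)
unchanged = kl_divergence(baseline, baseline)
print(f"empty bin: {blown_up}, no change: {unchanged}")
```

Swapping the arguments gives two different numbers for the same pair of distributions, so a monitoring threshold has to fix a direction and stick with it. A naive implementation that divides by an empty bin would raise an error rather than return infinity, which is the kind of guard code the hosts are describing.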
Speaker 3:
Yeah, and so this is kind of why we see the introduction of, you know, Jensen-Shannon divergence, which is meant to basically improve on KL divergence by fixing at least those two problems. Right, it's supposed to fix the problem of asymmetry, so A to B should be equal to B to A, and it's supposed to fix the issue of handling zeros, right? So if you were to encounter a zero issue, you would hit an undefined response. But Jensen-Shannon can't experience zero outcomes, so it kind of sidesteps that issue.
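A short sketch of why Jensen-Shannon avoids both problems: it compares each distribution to their midpoint mixture, which is non-zero wherever either input is non-zero. The helper names `kl_bits` and `js_divergence` and the example distributions are illustrative assumptions; the formula itself is the standard definition, in bits.

```python
import math


def kl_bits(p, q):
    """KL(P || Q) in bits; assumes q_i > 0 wherever p_i > 0."""
    return sum(p_i * math.log2(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits between two discrete distributions."""
    # The mixture M is positive in every bin where P or Q is positive,
    # so neither KL term can hit a division by zero.
    m = [(p_i + q_i) / 2 for p_i, q_i in zip(p, q)]
    return 0.5 * kl_bits(p, m) + 0.5 * kl_bits(q, m)


# Each distribution has an empty bin that the other one populates.
p = [0.5, 0.5, 0.0]
q = [0.0, 0.5, 0.5]

js_pq = js_divergence(p, q)
js_qp = js_divergence(q, p)
print(f"JS(p, q) = {js_pq:.3f}, JS(q, p) = {js_qp:.3f}")
```

Both directions give the same finite value, and with base-2 logarithms the result always lies between 0 and 1. It is still a score without a p-value attached, which is the limitation the hosts return to later.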
Speaker 2:
The final one that we've been seeing a lot recently is population stability index. This one is very popular in the ML literature and really the model risk management literature as well, not just machine learning, because it came out of a paper in 2004 in model risk management on how to quantify changes in model distributions. Back then, I guess as a starting point, it was a good starting point, even though we had statistical metrics back then. So I'm not quite sure why this got a big following. But because it was published by consulting for model risk management, it got a big following and is used by a lot of major banks today. But the issue is you have to calculate specific bins and then calculate the percentage of data from each distribution that falls within a bin, and you have to specify those bins. It also has the possibility of a zero outcome with the user-specified bins, and what is actually the level of criticality? You can't do a p-value test. People have heuristics of, oh, if this is this amount of change, it means this.
Speaker 2:
But there's so much of this configuration, and it was because of, like, one influential paper, like the first thing to jump on when model management thought they had to start looking for changes in models. I forget which regulation, maybe one of the first Basel Accords, something like that, says, like, hey, you need to have some sort of monitoring in place. This was the first paper addressing it. Thus it got a following, and it's kind of a legacy thing that I'm seeing in blog posts and things, and other companies using today for all types of data monitoring.
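To make the configuration burden concrete, here is a minimal sketch of the usual population stability index calculation: bin both samples, then sum (actual - expected) * ln(actual / expected) over the bins. The function names, the synthetic data, the bin edges, and especially the `floor` used to patch empty bins are all assumptions made for illustration, not a reference implementation.

```python
import math
from bisect import bisect_right


def bin_fractions(values, edges):
    """Fraction of values in each bin; `edges` are the interior cut points."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_right(edges, v)] += 1
    return [c / len(values) for c in counts]


def psi(expected, actual, edges, floor=1e-4):
    """Population stability index over user-specified bins."""
    total = 0.0
    for e_i, a_i in zip(bin_fractions(expected, edges), bin_fractions(actual, edges)):
        # Empty bins would mean log(0) or division by zero, so they are
        # replaced by a small floor (a common workaround, assumed here).
        e_i, a_i = max(e_i, floor), max(a_i, floor)
        total += (a_i - e_i) * math.log(a_i / e_i)
    return total


expected = [i / 100 for i in range(100)]   # synthetic baseline sample
actual = [x + 0.25 for x in expected]      # same shape, shifted upward

psi_same = psi(expected, expected, [0.25, 0.5, 0.75])
psi_shift = psi(expected, actual, [0.25, 0.5, 0.75])
psi_coarse = psi(expected, actual, [0.5])                           # fewer bins
psi_tiny_floor = psi(expected, actual, [0.25, 0.5, 0.75], floor=1e-6)
print(psi_same, psi_shift, psi_coarse, psi_tiny_floor)
```

The same shift produces noticeably different scores depending on how many bins are used and how empty bins are patched, and the result is then read against rule-of-thumb cutoffs (0.1 and 0.25 are commonly quoted) rather than a significance test.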
Speaker 1:
Given that legacy and some of the history that you shared, why is everybody talking about information theory now?
Speaker 2:
I think it's two things. I don't think people are really talking about information theory. They're just using information theory metrics, and honestly, Sid and I had a big conversation about this, we honestly don't know why they're being used so much. I think it's just this legacy. I think there's a little bit of, like, in MLOps, I've been seeing this come up a lot more, because those are usually computer science background individuals that are now trying to provide continuous integration and monitoring and trying to professionalize the deployment of models. So it's a little bit like, hey, this is the toolkit you learned in computer science. It was built here. We're going to use this.
Speaker 2:
For a lot of people, if they're not trained in statistics itself, statistics is kind of this mythical thing that doesn't make that much sense, and it's complex, and statisticians aren't always the easiest to work with. They make everything esoteric and it's just kind of confusing if you don't have a statistics background. Information theory you will learn in a computer science degree. So I think it's a lot of just: it was built here, we're familiar with it, it makes sense to me. That type of toolkit, more than a conscious choice when weighing the options.
Speaker 3:
That's right, and this is not to, like, you know, strictly dunk on information theory, but it's to note that, for the use case that we're thinking about here in the AI/ML space, we want really robust models, we want really good quantification of change and risk, and these might not be the right metrics to use for it, even though these are the metrics that we know about.
Speaker 2:
Correct. It's kind of like when you have a hammer, everything is a nail a little bit. Information theory, as Sid mentioned, we're not dunking on it at all. Information theory is a hugely powerful, highly important discipline. We're just saying the specific divergences that come from information theory aren't, as Sid mentioned, for what our use cases are building solid, resilient models that are fit to purpose. They're not the best tool for the job. Nothing to bash the field at all, but if you're monitoring data inputs and stability of models, this should not be what you're using.
Speaker 1:
So let's get into that. Where might or might not it be a good idea to do this?
Speaker 2:
For the deployment of production modeling systems, I don't think these are ever a good idea. There's the legacy toolkits that say so, but there's not a reason. Like when we talked about non-parametric statistics in our podcast: in every use case, I think those are outperforming. Someone even did an exhaustive study. Evidently AI has a blog post where they actually compared a lot of different methods, like for statistical power and finding a change if a change happens and things, and the KS test we talked about in the non-parametric podcast just absolutely thwomped these information theory measures. Like, they weren't detecting changes when changes occurred, errors were happening. Like, they're not even doing a good job of checking things, regardless of the theory behind it. Even a simple distance metric was being more useful than these divergences. So I don't actually see, and let's see if Sid disagrees, I don't see a use case in our field why you should ever use these.
Speaker 3:
No, I mean, I think that's about right. I mean, maybe you'll find some use for this in, like, early exploratory data analysis phases, but not at the point where these models are in production and need to be monitored. It's a great tool when you're building your model, you know, perhaps you're looking at differences in gradients, but it's not as useful for detecting drift, which is probably what you're more interested in if you're on the MLOps side.
Speaker 2:
I am a statistician, so I'm probably a little biased. But there's the fact that you can have that precision and have the ability to test for significance and actually have theory behind knowing what's my confidence, what does a change look like. There's having those actual things versus heuristics of, oh, if it's over one, that normally means there's a problem. That level of precision, to me, is a huge pro for statistical methods and a huge con against these.
Speaker 2:
I think the other reason some people have liked these is that for statistics, as we've talked about, unless you know the actual underlying distribution, you wouldn't know whether you'd use a Z test, a T test or all these different things. It's very confusing. I think one of the draws of these metrics, especially for MLOps folks, is it doesn't matter what your distributions are. They're distribution agnostic, so you don't have to dive into that whole "I don't understand my distributions" problem. And non-parametric statistics are not that popular. That's something that Sid and I have been talking about for years, but in the statistical world there's a lot of people still doing that, like, you need to know your distribution, you need to choose a specific test. So I can see that if you don't want to go down that road, this is close enough, and that could also be why there's some traction. But you have an answer now. Non-parametric statistics supplant this in that they're the easy button, but provide you a lot more statistical power and accuracy, and you don't have these drawbacks.
Speaker 3:
Yeah, and if you need specific tests to Google, look up the KS test (we have a whole episode dedicated to non-parametrics) and look up your chi-squared test. These two will give you a lot of what you're actually looking for from these divergences and supply you with meaningful p-values and also usable test scores, if you want to look at the test scores. Definitely.
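As a rough illustration of what the KS test adds over a divergence score, the sketch below computes the two-sample KS statistic (the largest gap between the two empirical CDFs) and an approximate p-value from the asymptotic Kolmogorov distribution. The function name `ks_two_sample`, the synthetic samples, and the 0.3 cutoff for the small-statistic shortcut are assumptions of this sketch; in practice a statistics library would normally be used instead of hand-rolled code.

```python
import math
from bisect import bisect_right


def ks_two_sample(sample_a, sample_b):
    """Return (D, p) for a two-sample Kolmogorov-Smirnov test.

    The p-value uses the asymptotic Kolmogorov distribution, which is an
    approximation that works best for larger samples.
    """
    a, b = sorted(sample_a), sorted(sample_b)
    n, m = len(a), len(b)
    # Largest absolute gap between the empirical CDFs, checked at every
    # observed value from either sample.
    d = max(abs(bisect_right(a, x) / n - bisect_right(b, x) / m) for x in a + b)
    lam = d * math.sqrt(n * m / (n + m))
    if lam < 0.3:
        # The series below converges slowly here; the true value is about 1.
        return d, 1.0
    series = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                 for k in range(1, 101))
    return d, min(1.0, max(0.0, 2.0 * series))


baseline = [i / 200 for i in range(200)]   # synthetic reference sample
shifted = [x + 0.3 for x in baseline]      # same shape, shifted upward

d_same, p_same = ks_two_sample(baseline, baseline)
d_shift, p_shift = ks_two_sample(baseline, shifted)
print(f"no change: D={d_same:.3f}, p={p_same:.3g}")
print(f"shifted:   D={d_shift:.3f}, p={p_shift:.3g}")
```

No distributional assumption or binning is needed, the statistic is the same in both directions, and the p-value gives a significance threshold to alert on instead of a heuristic cutoff.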
Speaker 2:
We did a blog post on this as well. We can link it in the show notes, and then we'll have a blog post diving into these metrics a little bit more and, as Sid mentioned, some of the why: why are these not metrics? What are those properties, triangle inequality and things like that.
Speaker 1:
So we'll actually dive in with the math on proving why what we're saying makes sense, and then we can link to the non-parametric episode. Given all the news on AI, and I want you to put your frame of reference in AI right now and the state of things, what would be some of the things that people should look at right now if they are trying to experiment? From that context, what would you recommend?
Speaker 2:
Definitely still non-parametric. I would also be recommending you think very hard about using an LLM at all.
Speaker 3:
But yeah, especially if you know you need guarantees of success or correctness. You're going to be struggling if you're not at least doing something that's more, like, RAG-oriented, where you have some store of data that you're calling from. And, you know, think about, like, do we really care about metrics if we're gonna do everything fuzzy anyways?
Speaker 1:
You mean in experimentation versus production?
Speaker 3:
That's right, that's right. Okay.
Speaker 2:
And LLMs in general. If you're just playing around and playing with something fuzzy, do you even care about accuracy? If you're using an LLM, definitely refer back to our podcast on knowledge graphs, because even the RAG use case isn't that accurate. So, I don't know, non-parametric still wins for this use case. But then, as Sid made a great point, like, if you're not caring about accuracy, you're not caring about performance. If you are caring about those things, use normal NLP classification or whatever. If you're using LLMs, with the current architecture, you don't really care about accuracy anyway. So use whatever method you want to use. Hopefully this was helpful. We'd love the feedback on this.
Speaker 2:
And just to be very clear, we're not putting down information theory. It's a hugely important discipline. We are just saying that these divergences do not make sense in our minds for robust production monitoring of systems and checking for differences in distributions. That's all. It's just been something we've been seeing come up a lot. So definitely, if you disagree or have something to the contrary, please let us know. And, as always, we love ideas on new podcasts and things like that.
Speaker 1:
Perfect. Well, that's all for today's episode. Please check out our homepage. Our feedback form is there. We love hearing your comments. Keep them coming Until next time.