The AI Fundamentalists

The importance of anomaly detection in AI

March 05, 2024 Dr. Andrew Clark & Sid Mangalik Season 1 Episode 15

In this episode, the hosts focus on the basics of anomaly detection in machine learning and AI systems, including its importance, and how it is implemented. They also touch on the topic of large language models, the (in)accuracy of data scraping, and the importance of high-quality data when employing various detection methods. You'll even gain some techniques you can use right away to improve your training data and your models.

Intro and discussion (0:03)

Understanding anomalies and outliers in data (6:34)

  • Anomalies, or outliers, are data points so unexpected that their inclusion raises warning flags about inauthentic or misrepresented data collection.
  • Anomaly detection appears in many fields of study, but canonically in finance, sales, networking, security, machine learning, and systems monitoring.
  • A well-controlled modeling system should have few outliers.
  • Where anomalies come from, including data entry mistakes, data scraping errors, and adversarial agents.
  • Biggest dinosaur example: https://fivethirtyeight.com/features/the-biggest-dinosaur-in-history-may-never-have-existed/

Detecting outliers in data analysis (15:02)

  • High-quality, highly curated data is crucial for effective anomaly detection. 
  • Domain expertise plays a significant role in anomaly detection, particularly in determining what makes up an anomaly.

Anomaly detection methods (19:57)

  • Discussion and examples of various methods used for anomaly detection 
    • Supervised methods
    • Unsupervised methods
    • Semi-supervised methods
    • Statistical methods

Anomaly detection challenges and limitations (23:24)

  • Anomaly detection is a complex process that requires careful consideration of various factors, including the distribution of the data, the context in which the data is used, and the potential for errors in data entry
  • Perhaps we're detecting anomalies in human research design, not AI itself?
  • A simple first step to anomaly detection is to visually plot numerical fields. "Just look at your data, don't take it at face value and really examine if it does what you think it does and it has what you think it has in it." This basic practice, devoid of any complex AI methods, can be an effective starting point in identifying potential anomalies.

What did you think? Let us know.

Do you have a question or a discussion topic for the AI Fundamentalists? Connect with them to comment on your favorite topics:

  • LinkedIn - Episode summaries, shares of cited articles, and more.
  • YouTube - Was it something that we said? Good. Share your favorite quotes.
  • Visit our page - see past episodes and submit your feedback! It continues to inspire future episodes.
Speaker 1:

The AI Fundamentalists: a podcast about the fundamentals of safe and resilient modeling systems behind the AI that impacts our lives and our businesses. Here are your hosts, Andrew Clark and Sid Mangalik. Hello everyone, welcome to today's episode of the AI Fundamentalists. We've got a great show for you on anomaly detection, a topic that you're going to find has a lot of pieces to it, but in general we tend to overthink it, so we hope you're excited to dig into this a little bit more. Before we get started, though, we did get some comments and feedback on our episode on non-parametric statistics, and it brought up a good point about information theory tests. Andrew, you had some thoughts on that.

Speaker 3:

Information theory is a method that a lot of people like to use for monitoring and things like that, and to Sid and me it's kind of a curiosity why people are reaching for this harder-to-use method that's a little outdated for some of these use cases. So, based on the feedback we've had and the things we're seeing in the industry, we definitely will do an information theory podcast in the future. Thanks, everybody, for your comments.

Speaker 1:

In other interesting news this week, The Wall Street Journal had an article yesterday, and the Google CEO is calling out chatbots.

Speaker 3:

Yeah, this was a pretty interesting article. Basically, Gemini is kind of their new version of ChatGPT, their large language model. It's doing things like this: you ask it a question, "Who's had a worse impact on humanity, Elon Musk or Adolf Hitler?" and it doesn't really know what to say about that. That was one of the citations in the article. "Which is a more harmful ideology, libertarianism or communism? Which has harmed more people?" And it basically waffles. I'm not making this up; these are exactly the quotes from the article. Gemini is kind of waffling: "We don't know which ideology has harmed more people," that type of thing. Like, what?

Speaker 3:

You may disagree with libertarianism, but I don't think anybody's done genocide in the name of libertarianism, to my knowledge anyway. That kind of thing. So basically, the CEO of Google is now coming down really hard, like, "We've got to fix this. This is unacceptable." It's kind of like, well, what do you think is going to happen when you don't do testing and quality control and all that?

Speaker 3:

But again, this comes back to what we've talked about many times. It's kind of the dead end of the current path we're running down, with an architecture where models are supposed to look or sound like a human wrote them or, even with images and things, be analogous to something a human has done. They're not architected to be accurate. You can go back to our knowledge graph episode and things like that, but now OpenAI is wanting to invest trillions in making these models, and I really think you're pouring money into a dead end. If we really want to fix some of these underlying issues, we have to redirect and kind of start from scratch. We've talked about that, so I don't want to go too much into it again, as we've done some other podcasts on it, but it's very interesting to see this high-profile a case from Google and some of these crazy statements that were cited in this article.

Speaker 1:

When you called that article out, it made me think of some things. If I were to go on a rant, I'm always going on a rant about the last mile of any technology, the stuff that's in front of the end user. You see it every day. I'm looking at it now in trying to find a healthcare provider for a particular problem. I've got an app full of resources to find this doctor. You call them, and almost everything is wrong about the information that's provided in the app. These apps are being scripted by these bots. These bots already have all the information going into them to begin with. It stands to reason that if you're putting bad information in, it's just going to keep perpetuating bad information and, to the point of the article, you're going to have these indecisions, when people are depending on these bots to actually decide, if you will. Rant on, rant, rant. It always comes down to the last mile; that's just where it starts to show.

Speaker 3:

Oh yeah, and that's what it really comes down to. Now, I want to get Sid's opinion on this. You get the McKinseys of the world that do these grand presentations for the C-suite about how wonderful something's potential is. It's like, yeah, but the devil's in the details, the last mile of making it actually happen. Something could sound great in theory, but just because you invest money in large language models doesn't mean it's magically going to solve world hunger and all your business problems, just because McKinsey told you so and you're following their North Star. I'm not really trying to hate on McKinsey too much, but on consulting firms in general that do these kinds of presentations.

Speaker 2:

Yeah, and I mean, this kind of highlights a different problem that Google is also facing now, where they're trying to shutter or wind down their Google Assistant team, which is a more traditional input-output machine, right? There are basically mappings between intents and what it should respond back to you. And they're thinking, "Oh well, we'll just use this model, which can answer any question and doesn't need templates." But the problem is now it's fuzzy, right? It has a fuzzy understanding of the inputs and a fuzzy understanding of the outputs. So, to Susan's point, you get these problems where, even though the factual information exists, it's not able to represent it well or give it back to you very well.

Speaker 1:

There will be no shortage of our opinions and thoughts on this, so more to come in future episodes. So let's get into today. When preparing for this episode, you guys presented the topic of anomaly detection, and I went and did some digging on it. In a blast from the not-so-far past, in 2020 Forbes stated that AI, and specifically anomaly detection, is not a simple concept for most of us to digest; nonetheless, it's extremely underrated as a modern business tool. So, number one, spoiler alert: I found this under a vast list of anomaly detection tools, services, and platforms. So you're welcome, to all those platforms out there, for the mention. The other thing is that it was also interesting to consider that here we are in 2024, four years on from that statement. Are we using AI to detect anomalies, or are we detecting anomalies in AI? Both?

Speaker 2:

Yeah, and I think that through our discussion we're going to get a good sense of what anomaly detectors basically are, and whether the AI component is a little bit overhyped. I think we can circle back and answer this question at the end, once we've dug into what they're doing and what they aren't doing, right? So let's first talk about what an anomaly is, or an outlier, as you might call it. We would consider a data point an anomaly if it's so unexpected that its inclusion in our data set raises warning flags about its authenticity, or suggests that it's misrepresenting the true data collection, right? So these are data points that are not only strange, but so strange that we're basically considering that this cannot be correct. This is not to say that all strange data is wrong, but that it raises a flag to check.

Speaker 2:

And this type of detection is very common in finance, sales, networking, security, machine learning, and systems monitoring. We'll see some of those use cases later, but this is a pretty widespread use. For example, insurance companies might use this to detect fraud, right? They might look through a bunch of transactions and use an anomaly detector to shortlist items that are likely potential frauds. Or, in cybersecurity or other security systems, these are good ways to distinguish intrusion from valid access to data, resources, and facilities. And so we would expect that a well-controlled modeling system should have as few outliers as possible, where "outliers" in this case means data that is truly misrepresented or incorrect.

Speaker 3:

Yeah, I think a lot of people make this into this esoteric, mythical concept of, like, "Oh, outliers, anomalies, what are they? How do they work?" when really, good systems shouldn't have them, because you should have controls in place. Most of the time it's bad data entry, typo-type errors, or you're in a financial-type setting where you're literally looking for fraud and looking for those outliers. We're going to get into some of the methods and how you handle them, but oftentimes it's still expert opinion and rule-based systems that will do a good job of detecting these things. Besides anomalous activity in cybersecurity and other kinds of looking for fraud, it's not that big of an area.

Speaker 2:

So, yeah, let's digest a little bit how these anomalies, these outliers, get into our data, right? Because knowing where they come from actually gives you a chance of stopping them from being in there the next time.

Speaker 2:

This is one of my favorite examples of an anomaly that came through a data entry error. Over a hundred years ago, approximately, a dinosaur vertebra was found that was measured at 1,500 millimeters. Now, this was by far the longest vertebra ever found for any animal, and when they worked out how long this animal would be, it came to about 188 feet, or over 57 meters. This is incredibly long, considering the next biggest animal was close to half that size, right? And then when they went back to find that vertebra, based on the records, they couldn't find it, and so this hypothesized dinosaur may just be impossible. It was very likely just a misplaced decimal point or a swap of two digits, and so we got this strange outlier, which was reported on. There are articles about it, there are fun drawings made about it, but ultimately it may have just been an anomaly, and we'll link a 538 article about this.

Speaker 3:

Oh, I haven't heard that one before, so that's a great one.

Speaker 3:

This is a very pernicious issue. I've seen it all over the place: so many plain data entry errors, misclassifications, and things. That's why, in a well-controlled modeling environment, you really have a lot of data cleansing, data understanding, a lot of review, a lot of checks and balances, and why data governance is such a big thing. Overarching that is AI governance, model governance, which are really synonyms for just good model governance practices. It's huge. Unless there's fraudulent activity, like we're talking about (and there are areas, especially cybersecurity and finance, where there are legitimate things you can be finding), oftentimes it's just poor practices that a well-controlled model, with domain experts putting in the data, wouldn't have. That's a great example.

Speaker 2:

Yeah, exactly. And speaking of domain experts, you might feel like, "Well, we have so much data to collect, let's just have a data scraper pull it." Anytime you run a data scraper, you should definitely be checking those outputs for outliers, because it's very easy to scrape, you know, a misplaced decimal, or the next page has a number one next to it, so you accidentally catch that one too.

Speaker 2:

Any scraping or autonomous systems built to collect data should be checked, right? Because the web isn't perfectly formatted, as much as we like to pretend it is, and so that's a great way to pick up anomalies. And then the one that we might think about the most is adversarial anomalies, meaning people who are intentionally trying to cheat you and put bad data into your system. That might be something like a fraudster or an intruder into a security system. But let's not forget that these can happen for much more mundane and everyday reasons, which is just that you collected the data wrong. So basically no one is exempt from having to check their data, but especially if you're doing anything automated, anything with manual entry, or anything that has potential for adversaries.

Speaker 3:

Yeah, I think that's a great point. A lot of times when people think about anomaly detection, their minds go automatically to fraud and the cybersecurity type of thing. Yes, that's definitely a thing, but oftentimes it is more benign, and as you're investigating, definitely take the approach that not everyone's out to get you. There are definitely a lot of errors, and it's so easy to make these errors, like just one decimal point of difference. And one thing on what Sid said about data scraping, just in case it's not clear to everybody listening to this podcast: LLMs won't fix this problem either. They're not going to be fully accurate, because data scraping is extremely complex, as Sid said, and very hard to get right, and just the smallest errors get you chasing made-up dinosaurs. Literally, just one error in one spot and you have those types of errors. So with data scraping you can't be like, "Oh, this is a 98% accurate system." Well, that means you have 2% outliers.

Speaker 1:

Then, in the rollout of these large language foundation models, what could be done at that scale, in training them, to get the outliers out of training? I mean some of the things that we're using today from OpenAI, Google Bard, anything like that.

Speaker 3:

It comes back to really highly curating your training data, and for language it's a little bit harder to get the outliers. There are definitely methods for that, but there are no shortcuts. It's the very hard yards of figuring out the process and your data, really understanding what everything is, and making those domain-specific calls about what's outside the realm. For example, you shouldn't have much information about Hitler in your training set, as Google is realizing. You need to be very specific if you're including things like that: "Hitler bad," that's all you need to have in your dataset. But those are very nuanced, specific cases, and for language data it gets a lot more difficult. Sid, some thoughts on that, as you're our resident NLP expert?

Speaker 2:

Yeah, I mean, it's a great question, and you get this problem of, like, what is an outlier? Is an outlier an opinion that's just not popular, or is an outlier an opinion that's actively harmful? That's a very difficult problem. So handling outliers in language probably looks like going for high-quality data sources, falling back on our usual mantra of high-quality data for high-quality models, and then also making sure to curate the sources themselves, right? Even though you have a source that you might like, you'll still have to go and make sure that you're picking up opinions that you're happy with putting in, and not just throwing in Reddit wholesale, where you'll get all kinds of opinions that you might not want your model to reflect.

Speaker 3:

That's a very interesting point, and so far, from what I'm seeing in large language models, it hasn't been done well. You do need to have differing opinions on these areas, and a lot of times I think some of the people training the models are very biased one way or the other in their thinking and use data sources that they like, and it's kind of an echo chamber on some stuff. Just because you don't like something doesn't mean it isn't there: there are factual representations of all sides, right, and then there are of course politicized versions and things like that. So it's about making sure that in those large language models you have a non-biased assessment of every topic that's truthful and not harmful. But this is a very difficult problem, and it's not something you can automate easily for large language models, for sure.

Speaker 2:

Yeah, and we could do a whole episode on this, but suffice it to say it's a problem of: do you want this model to read everything, or do you want it to read a safe version of the world? Do you want it to be a useful tool, or do you need it to be a representative tool? And you might not want the representative tool all the time.

Speaker 3:

And that comes back to what we've talked about before: data quality, smaller data is better, bring stats back. You can't train these things by scraping everything on the internet. As Sid mentioned with Reddit, just because it's on Reddit doesn't mean it's accurate, right? You need highly curated data sets for any sort of training, because, as we talked about, data scraping itself is not accurate. Some accuracy is lost just in scraping the data; 100%, some of the training data has been lost in the scraping. But even beyond that, what you're scraping, and whether it's a truthful source, is a key part.

Speaker 1:

Let's assume good, curated data. How would you sanity check for some things to determine if it's truly an outlier?

Speaker 3:

Outliers aren't only min-max ranges; you can look at expected values for something. For instance, how long do humans live? Well, the minimum is probably going to be zero, right? Because unfortunately some babies die, which is really, really sad. And then, what's the physical limit?

Speaker 3:

Scientists are now saying, what, 119? I don't even know if anybody's made it that far, but it would be whatever scientific limit people say, right? So you shouldn't have a 500-year-old person in your data set. You can do basic checks like that. But we're even seeing here how difficult it is, even for that: you need a scientific basis or something for what that top line or bottom line is. Then there are negative values. You can't be a negative age, right? You're zero when you're born, so you can't be negative, unless someone wants to start counting differently; normally the method doesn't give a negative age. So you check that, you check that values are represented properly, and then you don't have issues with decimal points or anything like that.
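The range checks described above can be sketched in a few lines of Python. This is a minimal illustration rather than anything from the episode: the bounds `MIN_AGE` and `MAX_AGE`, the function name, and the sample values are all assumptions, and the right limits are a domain decision.

```python
# Hypothetical plausibility bounds; choosing them is a domain-expert call.
MIN_AGE = 0
MAX_AGE = 125


def flag_implausible_ages(ages, min_age=MIN_AGE, max_age=MAX_AGE):
    """Return (row index, value, reason) for every age that needs review."""
    flagged = []
    for i, age in enumerate(ages):
        if age is None:
            flagged.append((i, age, "missing"))
        elif age < min_age:
            flagged.append((i, age, "below minimum"))
        elif age > max_age:
            flagged.append((i, age, "above maximum"))
    return flagged


# Made-up sample: 500 and -3 should be flagged, 119 should pass.
ages = [34, 0, 500, -3, 71, 119]
flags = flag_implausible_ages(ages)
for index, value, reason in flags:
    print(f"row {index}: age={value} flagged for review ({reason})")
```

The check only flags rows for review; whether a flagged value is an error or a legitimate edge case is still a human call.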

Speaker 2:

And I will say, on the negative age thing, quickly: in Korea they traditionally start at age one, and so they have a very funny situation where you're never age zero in your first year.

Speaker 3:

There you go, there you go. Still not negative, but it is.

Speaker 2:

So, you know, you've got to think a little bit wider, right? For age, you have some sense of what a weird age is. But there are going to be data points where you don't know what a weird number is, and that's where it's good to just go back to your stats. Pull up a box-and-whisker plot of each numerical feature and just look at them with your eyes. Most data points are pretty close to the mean, like one standard deviation from the mean, and this one is six up: that's very likely an outlier, and that's something you should check, right? It's something you can check with a histogram or just a simple box-and-whisker plot, and that's really great for eyeball-checking any variable that you don't have a reasonable expectation of. Like, how many claims does an average person make? I don't know. I could guess not 300, but what is a true outlier?
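The two eyeball checks mentioned above have simple numeric counterparts. The sketch below uses only Python's `statistics` module; the function names, the 1.5 whisker multiplier, the 3-sigma threshold, and the claim counts are illustrative assumptions, not values from the episode.

```python
import statistics


def whisker_outliers(values, k=1.5):
    """Return values beyond the box-plot whiskers (quartiles -/+ k * IQR)."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def zscore_outliers(values, threshold=3.0):
    """Return values more than `threshold` sample standard deviations from the mean."""
    mean = statistics.fmean(values)
    stdev = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / stdev > threshold]


# Hypothetical claim counts per customer; one entry looks suspicious.
claims = [0, 1, 1, 2, 0, 3, 1, 2, 1, 0, 2, 1, 4, 1, 2, 0, 1, 3, 2, 300]

print("beyond whiskers:", whisker_outliers(claims))
print("beyond 3 sigma: ", zscore_outliers(claims))
```

A whisker rule tends to be the stricter of the two on small samples and can flag merely large values as well as the extreme one, which is one reason to treat these flags as prompts for review rather than automatic deletions.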

Speaker 3:

That's a great point, and this is where you can't even just say, "Oh, we'll take this many standard deviations from the mean and consider those outliers." I know someone's going to be thinking, "Oh, you use the box-and-whisker plot and you can work back from there." You can do that, but the distribution of the data, as Sid just said, is very key. Back to distributions: we've talked about distributions on a podcast, and even for simple data like age, as we're seeing, you need to know the context of Korea and the context of the max. Then you need to know the distribution of the data you're looking at. Is it power law, for instance, like the income of billionaires? Where in the world is it from? You have to know that underlying distribution to then be able to properly use one of these box-and-whisker-plot-type analyses to see what's an outlier or not.

Speaker 2:

Yeah, that's exactly right. You know, if you just go for the simple, like standard deviation method and you're looking at incomes, you're going to find Bill Gates and you're going to be very confused. You can say, well, that's an outlier, that can't be real, but that's misunderstanding the underlying distribution of the data.

Speaker 3:

And then, as part of the preprocessing conversation we had, and making sure you don't make errors there, or errors in your dinosaurs, there's scaling. Scaling is huge too, because a lot of times, if your data is skewed and you want to make it look more like a normal distribution, people take logs. It'd actually be an interesting podcast, talking about logging data and the different preprocessing techniques. Let's make a note of that: preprocessing techniques would be a fun podcast. Anyway, you could have log-scaled data where outliers will look different, because you're changing the distribution. So if you're comparing a billionaire power-law distribution to something that's been logged, it's going to look even wonkier. There's all of this weird interplay: you have to understand Korea and scientific maximums, you need to understand the distribution you're dealing with, and you need to understand how the data was treated before you're looking at it in the distribution. And this is just one variable we're talking about here.
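The effect of scale on what looks like an outlier can be seen with a small, made-up income list. In this sketch the incomes, the function name, and the choice of base-10 logs are assumptions for illustration; the point is only that the same record can sit beyond three standard deviations on the raw scale and within three on the log scale.

```python
import math
import statistics


def zscore_of_max(values):
    """How many sample standard deviations the largest value sits above the mean."""
    return (max(values) - statistics.fmean(values)) / statistics.stdev(values)


# Hypothetical annual incomes with one extremely wealthy individual.
incomes = [28_000, 35_000, 41_000, 48_000, 52_000, 60_000,
           75_000, 90_000, 150_000, 400_000, 2_000_000, 50_000_000]

z_raw = zscore_of_max(incomes)
z_log = zscore_of_max([math.log10(x) for x in incomes])

print(f"z-score of top income, raw scale: {z_raw:.2f}")
print(f"z-score of top income, log scale: {z_log:.2f}")
```

Neither number is "the right one"; which scale to judge on depends on the distribution you believe the data follows and on how the data was preprocessed before it reached you.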

Speaker 1:

Yeah, so are there other things that we should be looking for, like not just the numbers that are too big, or like the subject matter knowledge? Is there anything else that we should take into consideration?

Speaker 2:

Yeah, I mean, when we're considering anomalies, we're usually not thinking about anomalies as just one variable, although that's a great way to look for them. We want to think about anomalies in the greater context of a bunch of variables, right? If someone's committing fraud, they're likely not changing just one number; they're changing a couple of other numbers too, so that when they slip that one big number in, it's not as big of a surprise to the system. So we're looking for data points that are truly different from all the other data points, and that brings up the question of what features you use, how many are too many, and whether too many is really a problem.

Speaker 2:

You might be thinking, well, I want to catch these fraudsters, let me use every single feature I have and let's run the book against them and let's catch them.

Speaker 2:

But the problem is you're probably not getting more signal; you're probably just getting more and more noise. And then you have the additional problem when you try to determine which data points are really, really different, say with a measure like Euclidean distance, which we've talked about in the past: the more variables you have, the more you dilute the distance between two points. When you go up into n-dimensional space and you're trying to find the distance between two points, if you're using 100 features and the points are really only different on four, then only about 4% of that change is explained by your variables, whereas if you had been a little more selective and only picked maybe 10, 40% of that change could have been accounted for. This is sometimes called the curse of dimensionality, and avoiding it is just good modeling practice in general. But especially for anomalies, if you're looking for data points that are weird, you need to find the core of what is manipulable and where the source of error could come from, and not just throw the kitchen sink at the problem.
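The dilution described above can be made concrete with a deterministic toy example. This is a sketch under stated assumptions: every feature of a "normal" record differs from a reference point by the same small amount (`noise`), and a tampered record differs by an extra `shift` on four features only. The function name and all numbers are hypothetical.

```python
import math


def contrast(n_features, n_tampered=4, noise=1.0, shift=5.0):
    """Ratio of the tampered record's distance to a normal record's distance."""
    origin = [0.0] * n_features
    normal = [noise] * n_features
    tampered = [noise + shift if i < n_tampered else noise
                for i in range(n_features)]
    return math.dist(origin, tampered) / math.dist(origin, normal)


for d in (4, 10, 100, 1000):
    print(f"{d:>4} features: tampered record is {contrast(d):.2f}x "
          f"as far away as a normal one")
```

As the feature count grows, the ratio drifts toward 1, so the tampered record becomes harder to separate from ordinary records by distance alone, even though the tampering itself has not changed.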

Speaker 3:

That is a great point, and something that in the data science world isn't often understood. You just get as much data as you can and put XGBoost on it, right? It's going to be great. No. That's a very good explanation you just gave.

Speaker 1:

What are some of the approaches that we want to recommend?

Speaker 2:

Yeah, I think there are four broad categories, and Andrew and I will work through them: your supervised methods, your unsupervised methods, your semi-supervised methods, and then your old-school statistical methods. I'll start off with supervised methods for anomaly detection. This is a golden-case scenario where you actually have a bunch of labeled data: you have, like, 1 million records, and for 10,000 of them your experts have manually said, "These ones were fraud," and now you have a modeling problem. Right, this is your XGBoost use case.

Speaker 2:

Right, you want to learn what's an anomaly and what's not an anomaly, and that's potentially really great, because it's really concrete. You can look at a data point and say "is fraud" or "is not fraud," which is a great binary outcome. But this has two pretty big problems. The first problem is you need a ton of training data and, specifically, you need a lot of fraud examples, because if you build a model with 1 million not-frauds and 10,000 frauds, your model is going to guess "not fraud" 100% of the time, because that's a shortcut. So if you're not going to use deeper metrics, like your F1s or your F-betas, and if you're not going to do proper balancing of your data to make sure you have a good representation of fraud and not fraud, you're going to build very, very weak models, because fraud is inherently rare and outliers are inherently rare. It's a rare-case modeling problem.

Speaker 3:

Yeah, and that's a great point, because it's also going to look like you built an awesome model: most of the time it's going to say, "Oh, you're 99% accurate." That's because the model says it's never fraud. And that's where, as Sid mentioned, you use some of these other metrics. F-beta scores are a great example, because you can put, say, 5x on recall, since recall is what you're concerned about: finding fraud when fraud exists. You also want to beef that up with the synthetic data we talked about and with rebalancing the data; you want to be creating more frauds or amplifying that fraud signal if you're doing supervised. So yeah, supervised isn't usually a great fit for anomaly-type detection. There are definitely times you can use it. I think support vector machines even have a specific version that you can use for anomaly detection. But it's not normally the technique we like to use.
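The accuracy trap described above is easy to reproduce from confusion counts alone. In this sketch the counts are hypothetical (1,000,000 records with 10,000 frauds, as in the example discussed), the "imperfect detector" is invented for comparison, and F-beta is computed from its standard definition with beta = 5 to weight recall heavily.

```python
def accuracy(tp, fp, fn, tn):
    """Share of all records that were classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def f_beta(tp, fp, fn, beta=5.0):
    """F-beta from confusion counts; beta > 1 weights recall above precision."""
    b2 = beta ** 2
    denominator = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denominator if denominator else 0.0


# Hypothetical confusion counts on 1,000,000 records, 10,000 of them fraud.
always_not_fraud = dict(tp=0, fp=0, fn=10_000, tn=990_000)
imperfect_detector = dict(tp=6_000, fp=20_000, fn=4_000, tn=970_000)

for name, c in [("always not fraud", always_not_fraud),
                ("imperfect detector", imperfect_detector)]:
    acc = accuracy(**c)
    score = f_beta(c["tp"], c["fp"], c["fn"], beta=5.0)
    print(f"{name}: accuracy={acc:.3f}, F5={score:.3f}")
```

The model that never predicts fraud wins on accuracy and scores zero on F-beta, while the detector that actually finds frauds loses a little accuracy to false alarms but scores far higher on the recall-weighted metric.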

Speaker 1:

Is there a case where you use it just to get that severe statistic, like the 99% accuracy? Shouldn't that raise more questions for you? If you have nothing else, or don't know where to start, throw it in there, and if it's that severe, you've got something to work with.

Speaker 3:

No, because it's a lot of work, as Sid said, and a lot of data. Actually, if I was starting something from scratch, I would do a rule-based thing. For instance, we talked about the Bill Gates example: I would do those sorts of checks and then flag for manual review, and it's like, "Oh, okay, Bill Gates, that's a correct entry. He is that wealthy. Okay, cool, he's good." So I would build just simple rule-based checks as the starter, because supervised takes a lot of work, and you should never have that 99% accuracy, because you would be using the wrong metric in that case.

Speaker 2:

And there's a great question there about baselines, and maybe we should talk about baselines later, but when we're baseline modeling, we just want to go for the simplest statement. We want to go for that if statement, and we want to ask: can you do better than the if statement?

Speaker 3:

Okay. So the next one would be unsupervised, and this is what we see a lot of the neural networks doing now in cybersecurity. There can be some pretty impressive stuff, because in cybersecurity you can see things like "this person logged in from multiple countries at the same time" and some weird bot-type stuff. I would say the most interesting area right now for anomaly detection and refining these models is probably cybersecurity. There's a lot of great data there. We talked about the issue of not having enough data; there's a ton of data there, and you have bots doing weird stuff, so it's definitely a good area. But really, for unsupervised, you find the outliers using distances from a known point of origin or from clusters of normal data. So you'd have your normal data and then look for outliers using Euclidean distance and things like that. It's not always that accurate.
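A distance-from-normal check of the kind described above might look like the following. It is a minimal sketch, not a production detector: the two login features (hour of day, countries seen in the last hour), the sample records, and the rule "farther than any known-normal record" are all assumptions, and the features are left unscaled for brevity.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equal-length numeric tuples."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def flag_far_points(candidates, normal_points):
    """Flag candidates farther from the normal centroid than any normal point."""
    center = centroid(normal_points)
    cutoff = max(math.dist(p, center) for p in normal_points)
    return [p for p in candidates if math.dist(p, center) > cutoff]


# Hypothetical login records: (hour of day, countries seen in the last hour).
normal_logins = [(9, 1), (10, 1), (11, 1), (14, 1), (16, 1)]
new_logins = [(10, 1), (3, 4)]

flagged = flag_far_points(new_logins, normal_logins)
print("flag for review:", flagged)
```

Nothing here says why a point is far away. Interpreting the flag, and deciding whether the features and the cutoff make sense, still takes the domain knowledge discussed in the conversation.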

Speaker 3:

It's easy to get started with, because you use k-means or some other basic clustering. It doesn't require any training; you just start looking for weirdness. But it requires all the statistical interpretation, so it requires a lot of domain knowledge if you're going to start using it. To your point earlier about where do we start: that's not a great starting spot, because you still have to have a lot of domain knowledge baked in.

Speaker 2:

Yeah, exactly, and that statistical interpretation is not a case where you get to look at a data point and say fraud or not fraud. You have to go in and look at the deviations. You have to look at the changes, and you have to evaluate personally whether that's fraud or an outlier, unlike the supervised case. So maybe there's something in between, and that's something like a semi-supervised approach. The hope and the goal with this is basically to merge these two worlds to get the best of both of them.

Speaker 2:

So we want to do our unsupervised modeling, but we want to know what is and what isn't fraud. One way you can do that is to use a small sample of labeled fraud and then try to label the clusters. You'll find these natural clusters in the data and you say, well, this quadrant or this region of the data is the fraudulent or outlier cluster, and so now you have at least some way to label these outcomes. Rather than just saying these are weird, you can say these are weird and probably adversarial or wrong. Which raises the question: why doesn't everyone do this? It's very difficult. This is the most difficult solution, trying to balance all these hyperparameters.

Speaker 3:

Yeah, definitely. I think this is your panacea, the most accurate method if you can do it, but it is a lot of work. I did grad school work on a method for doing this, and back in my accounting days, when I was looking for fraud, this was something I used with pretty good success for looking at journal entries and things like that. So I spent a lot of time with semi-supervised, and it is a ton of work, but it gets some really good results. When Benford's law wasn't cutting it in some of the work I was doing, semi-supervised was the really interesting area. But yeah, you then have to work with subject matter experts to help get that labeled data set.
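
The cluster-labeling idea described above might look like the following Python sketch. It assumes the clustering step has already run: the centroids, the cluster names, and the handful of labeled samples are all hypothetical, and a majority vote is just one simple way to name a cluster.

```python
import math
from collections import Counter, defaultdict

# Hypothetical centroids, assumed to come from an earlier clustering step.
centroids = {"A": (1.0, 1.0), "B": (8.0, 8.0)}


def nearest_cluster(point):
    """Assign a point to the cluster whose centroid is closest."""
    return min(centroids, key=lambda name: math.dist(point, centroids[name]))


# A small, made-up sample labeled by subject matter experts.
labeled = [((8.2, 7.9), "fraud"), ((7.8, 8.3), "fraud"), ((1.1, 0.9), "normal")]

# Each labeled sample votes for a label within its cluster.
votes = defaultdict(Counter)
for point, label in labeled:
    votes[nearest_cluster(point)][label] += 1

# Name each cluster after its most common label.
cluster_labels = {name: counts.most_common(1)[0][0] for name, counts in votes.items()}


def predict(point):
    """Label a new point by its cluster; clusters with no votes stay unlabeled."""
    return cluster_labels.get(nearest_cluster(point), "unlabeled")


predictions = [predict(p) for p in [(0.7, 1.3), (8.5, 8.1)]]
print(predictions)
```

Everything hard about the approach sits outside this sketch: choosing the clustering settings and getting experts to produce the labeled sample.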

Speaker 3:

It is a lot of work. Another thing to say, thinking of unsupervised methods and especially cybersecurity, is that you don't always know what you're looking for, and that's where other methods are better. This one is very narrow. It can be the most accurate, but it's very narrow, so you have to have domain experts helping you at every stage, and it's only for one specific thing. So if I am training to look for financial fraud, as an example, it's going to be only financial fraud in taxes, or only financial fraud in SEC transactions, or whatever I'm analyzing. I can optimize it for that specific thing, and then I'm constantly evaluating whether it's still on target.

Speaker 2:

Yeah, and this gets at an even bigger problem with outliers, which is that you can only catch the outliers that you know exist. You're very unlikely to catch an outlier that you haven't thought of before. Say you're looking at age, and then all of a sudden someone mis-inputs a height, and you never thought anyone would mis-input a height because, well, a nurse does that: they put the patient on the scale, it's connected to Epic directly, and it goes right into the system. You never thought to check that, so you're not going to catch the outlier. You can only catch the outliers that you anticipate, unless you use an unsupervised version and you happen to catch that specific outcome.

Speaker 3:

But "you can only catch the outliers you're looking for" is a fundamental and recurrent problem. All right, we have statistical methods as well, and then I want to do some summarization. Statistical methods are really simple: min-max, trend analysis, just basic unsexy stuff that often works really well. It depends on things like whether it's a time series. For time series there are a lot of metrics, like change points and structural changes; you could do a whole podcast on those. And then there's the basic min-max; we talked about the whisker plot and things like that. But really, what's your use case for anomaly detection, or outlier detection? They're synonyms. Is it fraud? Is it cybersecurity, like malicious logins? Or is it just data entry errors? Because you're really going to need different models for different things. There is no one model that you can put on everything.
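
As a sketch of the whisker-plot check mentioned above, the following Python computes the usual box-plot fences at 1.5 times the interquartile range. The ages are made up, and the 1.5 multiplier is the common convention rather than anything stated in the episode.

```python
import statistics

# Made-up ages; the last value looks like a data entry slip.
ages = [22, 25, 27, 29, 31, 33, 35, 38, 41, 45, 230]

# Quartiles via the standard library (default "exclusive" method).
q1, _median, q3 = statistics.quantiles(ages, n=4)
iqr = q3 - q1

# Conventional whisker fences: 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [a for a in ages if a < lower_fence or a > upper_fence]
print(lower_fence, upper_fence, outliers)
```

A value outside the fences is only a candidate; whether it is an error, fraud, or simply rare depends on the use case.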

Speaker 3:

For instance, I think some of the statistical methods work better for data entry errors, because anybody who deals with data knows that, and with data scraping too, like the example just mentioned, every human makes errors all the time. And this is part of the discussion as well: the vast majority of data entry errors aren't malicious. So you have to take that into account too. Everybody hears outlier detection and thinks fraud, but it's not necessarily fraud. Sometimes it's mistakes. So have layers: here's the benign layer, like making sure someone didn't say someone's 18 feet tall. Great, let's just quickly check that, and then you can do the more specific, targeted fraud or anomaly detection models. But those are not going to catch the data entry errors. So you have to have multiple layers of defense for these types of things, and I think now we're positioned well to answer that.
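
The layering described above can be sketched in Python as a list of checks run in order. The field names, the height bounds, and the claims rule are hypothetical stand-ins; a real second layer would be whatever targeted model fits the use case.

```python
def plausibility_layer(record):
    """Layer 1: catch benign data entry errors such as impossible heights."""
    # Assumed plausible range for an adult height, in feet.
    if not 1.0 <= record["height_ft"] <= 9.0:
        return "data_entry_error"
    return None


def targeted_layer(record):
    """Layer 2: stand-in for a targeted fraud check (hypothetical rule)."""
    if record["claims_this_month"] > 20:
        return "possible_fraud"
    return None


# Cheap sanity checks run first, targeted checks after.
LAYERS = [plausibility_layer, targeted_layer]


def first_flag(record):
    """Return the first flag raised by any layer, or 'ok' if none fire."""
    for layer in LAYERS:
        flag = layer(record)
        if flag is not None:
            return flag
    return "ok"


# Made-up records for illustration.
records = [
    {"height_ft": 5.8, "claims_this_month": 2},
    {"height_ft": 18.0, "claims_this_month": 1},
    {"height_ft": 6.1, "claims_this_month": 45},
]
results = [first_flag(r) for r in records]
print(results)
```

Keeping the layers separate also keeps the labels separate, so a typo is not reported as suspected fraud.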

Speaker 2:

With that in mind, to answer the question you had at the front of this, which is: are we using AI to detect anomalies? The answer is kind of. When you look at what we're realistically doing, it stops looking like AI. It really stops looking like, oh, we use a neural network to determine what an anomaly looks like, because ultimately the supervised case, while it's a Goldilocks case, is very rarely what people are really using. So are we using AI to detect anomalies? Rarely, and if we are, it's not that advanced. And then our second question: are we detecting anomalies in AI? I would say that we're not detecting anomalies in AI. We're detecting anomalies in human research design. We're catching people's inability to identify where the outliers might be, or to correctly understand the distributions of the data that they're working with, and catching faulty anomalies.

Speaker 1:

Ooh, I like that callback to the problem at hand, and probably what we're facing today. Sometimes when I'm listening to podcasts like this I think, I may have to go look at that. If someone was listening to this and felt inspired to go back and check something right now, what would you recommend? What would be a first step out of everything that we've said today?

Speaker 2:

The first step: if you have a numerical field, a continuous data field that looks like a number, plot it. Just look at the plot for the first time in your life, and really look at that plot and ask: does this make sense to me? That's the greatest sniff test you can start with, and it's simple. There's no machine learning involved. Just look at your data, don't take it at face value, and really examine whether it does what you think it does and has what you think it has in it.
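
A rough way to do that first look without any extra tooling is a text histogram, sketched below in Python. The ages and the bin width of 10 are made up; a plotting library would give a nicer picture, but even this is enough to make a stray value stand out.

```python
from collections import Counter


def bin_counts(values, bin_width):
    """Count values per bin; each bin is keyed by its lower edge."""
    return Counter((v // bin_width) * bin_width for v in values)


# Made-up ages with one suspicious entry.
ages = [23, 25, 31, 34, 35, 38, 41, 42, 47, 52, 58, 61, 430]
counts = bin_counts(ages, bin_width=10)

# One row per non-empty bin, one '#' per observation.
for start in sorted(counts):
    print(f"{start:>4} to {start + 9:<4} {'#' * counts[start]}")
```

The lone row far from the others is the cue to go back and ask whether that value makes sense.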

Speaker 3:

Fantastic advice. And then you can see from there: does it look wonky? Do I need a model? Do I understand the cutoffs, or what kind of controls would I want to have in place? Because this is where you get into statistical process control and things like that: okay, there are outliers on both tails. Great, put some sort of control in place there to see when things deviate. But first you have to investigate whether those tail values are in fact anomalous or just rare. And that's where maybe you need a distribution. You can layer up: understand the data and what's happening, get the domain knowledge around it, figure out if there are basic methods or distributions that fit, and work up the chain from there. And you won't be reaching for supervised models or deep neural networks unless there's really no other option, or unless catching one mistake saves you so much money that it's worth it.
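
One simple form of the control mentioned above is a pair of limits set from a baseline period, sketched here in Python. The readings are made up, and the three-sigma limits are the textbook statistical process control convention, assumed here rather than taken from the episode.

```python
import statistics

# Made-up baseline readings from a period believed to be normal.
baseline = [100, 102, 98, 101, 99, 100, 103, 97, 100, 100]

mean = statistics.fmean(baseline)
sigma = statistics.stdev(baseline)  # sample standard deviation

# Conventional three-sigma control limits on both tails.
lower_limit = mean - 3 * sigma
upper_limit = mean + 3 * sigma

# New readings to monitor against the limits.
new_readings = [101, 96, 110, 104]
out_of_control = [x for x in new_readings if not lower_limit <= x <= upper_limit]
print(round(lower_limit, 2), round(upper_limit, 2), out_of_control)
```

As the speaker notes, a reading outside the limits still needs investigating before it is called anomalous rather than just rare.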

Speaker 2:

Yeah, and if you're already doing anomaly detection, make sure you're using the right metrics. Make sure you're not just using accuracy, especially with highly imbalanced data. Make sure you're evaluating your systems correctly if you're already on this path.
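
To see why accuracy misleads on imbalanced data, here is a small Python sketch with made-up labels: 10 anomalies in 1,000 records, and a "detector" that never flags anything. Recall and precision are used here as examples of more informative metrics; the episode does not name specific ones.

```python
def accuracy(y_true, y_pred):
    """Share of all records labeled correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred):
    """Share of true anomalies (label 1) that were caught."""
    caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    actual = sum(t == 1 for t in y_true)
    return caught / actual if actual else 0.0


def precision(y_true, y_pred):
    """Share of flagged records that were true anomalies."""
    caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    flagged = sum(p == 1 for p in y_pred)
    return caught / flagged if flagged else 0.0


# Made-up, highly imbalanced labels: 1 = anomaly, 0 = normal.
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000  # a "detector" that never flags anything

print(accuracy(y_true, y_pred), recall(y_true, y_pred), precision(y_true, y_pred))
```

The do-nothing detector scores 99% accuracy while catching zero anomalies, which is the trap with that headline number.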

Speaker 1:

Excellent points. Any other closing thoughts?

Speaker 3:

Please just keep providing your feedback and let us know what would be an interesting next episode. Yeah, thanks for listening.

Speaker 1:

If you have any questions, our homepage is open with a feedback form. We love getting your questions. Until next time.
