Digital Pathology Podcast
148: Statistics of Generative and Non-Generative AI – 7-Part Livestream 4/7
You might be using AI models in pathology without even knowing if they’re giving you reliable results.
Let that sink in for a second—because today, we’re fixing that.
In this episode, I walk you through the real statistics that power—and sometimes fail—AI in digital pathology. It's episode 4 of our AI series, and we’re demystifying the metrics behind both generative and non-generative AI. Why does this matter? Because accuracy isn't enough. And not every model metric tells you the whole story.
If you’ve ever been impressed by a model’s "99% accuracy," you need to hear why that might actually be a red flag. I share personal stories (yes, including my early days in Germany when I didn’t even know what a "training set" was), and we break down confusing metrics like perplexity, SSIM, FID, and BLEU scores—so you can truly understand what your models are doing and how to evaluate them correctly.
Together, we’ll uncover how model evaluation works for:
- Predictive Analytics (non-generative AI)
- Generative AI (text/image generating models)
- Regression vs. Classification use cases
- Why confusion matrix metrics like sensitivity and specificity still matter—and when they don’t.
Whether you're a pathologist, a scientist, or someone leading a digital transformation team—you need this knowledge to avoid misleading data, flawed models, and missed opportunities.
🕒 EPISODE HIGHLIGHTS WITH TIMESTAMPS
- [00:00] Warm greetings and a peek into my citizenship journey 👋
- [02:30] How exam attire differs across countries
- [04:00] Model evaluation isn't about memorizing metrics—it's about understanding concepts
- [06:30] Story: My first exposure to AI misuse in pathology
- [08:00] Confusion matrix basics: TP, FP, TN, FN
- [11:00] Metrics breakdown: Accuracy, Sensitivity, Specificity, F1-score
- [15:00] Regression-based metrics and why they matter
- [18:00] Statistical challenges in Generative AI
- [21:00] What is "Perplexity" and why low scores matter
- [24:00] BLEU, ROUGE, and Next Sentence Prediction explained
- [28:00] SSIM and FID scores for image quality in AI
- [31:00] When metrics mislead: superficial similarity vs. real insight
- [35:00] Best practices: Ensemble models, human-in-the-loop, and adversarial testing
- [43:00] Choosing the right metric for the right model
- [46:00] Closing thoughts on trust, testing, and trailblazing
📘 RESOURCE FROM THIS EPISODE:
🔗 Read the full paper discussed in this episode:
"Statistics of generative and non-generative artificial intelligence models in medicine"
💬 Final Thoughts
Statistical literacy isn’t optional anymore—especially in digital pathology. AI isn’t just a buzzword; it’s a tool, and if we want to lead this field forward, we must understand the systems we rely on. This episode will help you become not just a user, but a better steward of AI.
🎙️ Tune in now and let's keep trailblazing—together.
STATISTICS OF GENERATIVE AND NON-GENERATIVE AI
[00:00:00]
Introduction and Greetings
Aleks: Good morning. Welcome, my digital pathology trailblazers, to the 5:00 AM digital pathology club. That was one of my digital pathology trailblazers calling us today. Let me pull up the chat so I can say hi to you and see you joining. So when you join, let me know. Just say hi in the chat and let me know where you are tuning in from.
And everybody who is at 6:00 AM or earlier gets a golden star. I'm gonna just say hi in the chat.
#Trailblazers. Okay, my hashtag didn't work out.
Now... no. Why is it... oh no, my keyboard setup is off, from whatever. Doesn't matter. Welcome, welcome everyone.
Cultural Differences in Exam Attire
Aleks: So let me tell you, I'm dressed very officially for you, because I had a test yesterday, and [00:01:00] in Poland, when I was studying, we used to dress up for exams, at least for vet school.
Regardless of whether it was an oral or a written exam, everybody would dress up; guys would come in suits. And then I went to Spain for an exchange, and people would come in, like, tracksuits, whatever you call those suits you wear for sports. Just super casual clothes, and I'm like, okay, these are cultural differences.
Oh, I see people from Brazil. Hello. And also Scotland. Scotland, 11:00 AM, thank you for joining on a Friday, Ivan. That's like a lunch and learn.
Citizenship Journey
Aleks: So anyway, going back to my official clothing: I had my citizenship appointment yesterday at US Immigration. What is it called, USCIS? Anyway, my [00:02:00] immigration journey to the US is almost over.
I am just waiting for my naturalization ceremony, guys, so I will officially be Polish American soon, within like a month or two. So that's why those were the clothes I was wearing yesterday. Actually, I put on the same clothes I was wearing yesterday. But why am I even talking about it?
Today's Topic: Statistics
Aleks: Because today we're gonna be talking about different tests and statistics. And... I was looking at the wrong window again.
When I was thinking about this lecture... oh, Monica, welcome, Edinburgh, Scotland, again. We have good representation from Scotland. Good. So when I was looking at the paper, and I don't know if you've noticed, I was like, oh, I need to prepare so much, because there are so many metrics.
And I [00:03:00] was thinking, okay, are we gonna explain every single metric? No, we're not gonna do that. You're not gonna, like, die of boredom or overwhelm, which you would if we did that.
Model Evaluation Concepts
Aleks: And we're gonna explain the model evaluation concepts that are so perfectly explained in this paper, in addition to all the metrics.
So you can go check the tables. I went through the tables and learned about metrics that I had never heard about, so you can do that too if you want. But we're gonna be discussing higher-level concepts, because my goal with all these courses, all these videos, is to bring you to the level where, in a digital pathology team, in a computational pathology team, you have all the knowledge you need to contribute immediately and have productive, good scientific discussions with your team members. And different team members are gonna have different [00:04:00] expertise. Mine is toxicologic pathology, right? So I'm on the pathology end. There may be other pathologists, I know Monica is here, another pathologist. There may be computer scientists, there may be software developers here, and we are all gonna be on the same page so that we can bring digital pathology forward.
And if you are just joining, let me know in the chat. Just say hi and where you're tuning in from, and let's dive into our topic. The topic of today is statistics. Let's see if my thing works. Coffee, 'cause it's 6:00 AM here. And thank you for the good wishes for my citizenship. Thank you. I already passed the test.
So, an interesting question. I have had more difficult tests in my life, but every test to me is like a game that you have to prepare for. And I started applying this when I did several language tests: I would be fluent in the language at some point, [00:05:00] but the test is its own game.
And one of the cool, unexpected questions, which I was actually expecting because it was on the list, was: before Eisenhower became a president, he was a general. Which World War was he a general in? Let me know in the chat. I did answer that one correctly.
Once I see it in the chat, I will pull it up. And hello, Thomas, Iris, Freberg, Germany. Hello. Okay, without further ado, you have had enough time to join me if you planned to join. Okay. So: statistics in generative AI and non-generative AI. I don't know, non-generative was spelled like this.
There are different versions, right? Or not. Okay, one more thing before we dive into this, a little story that I have for you. [00:06:00] So I started doing this, and by this I mean digital pathology image analysis, in Germany after my residency. That was 2016. And with our team we visited a collaborator of ours, and they had a little symposium they were organizing... ooh, our 5:00 AM club showing up from Houston, Texas.
So we went to this little symposium, a pathologist was presenting some image analysis data with, like, fantastic results. And I'm like, oh, these are good results, like good metrics, high accuracy and stuff like that. And then a computer science colleague comes to me and very quietly tells me... so, I had a super cool relationship with my computer scientist people.
We would teach each other each other's [00:07:00] fields. And he comes to me, so that nobody hears it, and he's like, can you tell your pathology colleagues to stop showing data from the training set? And I'm like, oh, sure, I'll tell them. I had no idea what a training set was, or that you had to have a test set and a validation set.
But that was 2016. People had started doing this before, but the awareness of using the right metric, the right way of validating your model, was not there yet, especially among people without the mathematical, without the computational background, including pathologists. But now...
Things have changed, and now we can actually dive into the paper. Because our friends here, our usual suspects... sorry, Mann, I crossed you out... Mann actually sent me a couple more papers that I'm not gonna be presenting to you, but I will read on my own. [00:08:00] They dive very deep into different metrics.
I might include them in the podcast video recording and the email that you get. So the important thing is that grasping... oh my goodness, me and my underlining... grasping the underlying statistical principles that govern the design, validation and reproducibility is important. The activity in this domain demands it, demands understanding, knowing about this.
And if this is not your area of expertise, it's not my area of expertise either, but I know who to reach out to. And in this first job that I had, where we did a lot of these different metrics, we actually had a team that was designing those tests, and we would have discussions about which metric is okay and how it can feasibly be checked against pathologist annotations.
So anyway, the long story [00:09:00] short is: it is our responsibility to know that we have to test this, because if we don't test, we're being irresponsible, and we don't wanna be that.
Generative AI vs. Traditional AI
Aleks: Fundamentally, in this series we always talk about generative and non-generative AI, and these two categories use different metrics.
And after we go through this introduction, we're gonna go to the images, so don't worry. And if you're just joining, because I see new people joining, let me know where you're tuning in from. In the paper they call it generative and traditional, also known as non-generative, predictive analytics. There are a couple of synonyms, but basically generative
creates something new, like text, and the classical kind calculates or predicts some analytics, right? And they rely on certain common statistical measures of performance. There are some that you'll know, the more classical ones, because they come up with any lab test, and we're gonna dive into them. [00:10:00]
But there are some new ones that are going to be applicable to generative AI, like perplexity. Do you know the software Perplexity? It's a search engine, a retrieval-augmented-generation-based search engine. I love it, my best friend. But: love but verify. That's, like, the overarching message.
I did have to check some sources. Anyway, there's the bilingual evaluation understudy score and other different ones. And these are less familiar, because these methods are new for us, so we need to familiarize ourselves with them. The typical classification-based ones are confusion matrix based, and I have a little confusion matrix that I drew for you.
You'll recognize it instantly; even if you don't know the term confusion matrix, you will know what I'm talking about. But everybody has heard about accuracy, sensitivity, maybe the F1 score or Dice coefficient, the receiver operating [00:11:00] characteristic and the area under the curve.
And some of you may have heard about regression metrics, like root mean square error and R squared. I highlighted this word here: better stewards. That goes with what I just said, that it is our responsibility to know how to test this stuff. And I am the first person who has to check herself after being super enthusiastic about the new model, new method, new whatever, right?
Love, but verify. So by understanding the similarities and differences of these different metrics, we become better stewards of this transformative space of digital pathology. This is part of digital and computational pathology, and if we wanna be digital pathology trailblazers, we need to know it.
So without further ado, let's check where we are today. First... how do I make this [00:12:00] again? We are already here, guys. We're past the halfway mark. We have three more to go, and then we're gonna be back to our regular abstract reviews, or maybe some additional paper reviews. If you have suggestions, let me know.
Today we are at statistics of ML, machine learning, in medicine. We are not gonna start with this table. Let me start with...
Actually, I take it back. We are just gonna go in order as we go through the paper. In the paper you will find amazing graphics, amazing tables. If you don't have time to read the paper, just go through the tables and graphics.
Generative AI Performance Measures
Aleks: So in table one, we have the common generative AI performance measures.
There is a list. We have the name of the measure, the basic description, and the key points, where I have a beautiful heart, because this explains the limitations to you.
So, like in the story [00:13:00] at the beginning, the limitation of data from your training set is that the scores are gonna be very high, right? Because you trained on it, so it's totally overfitted. And then there's the type of data assessed, and this is generative AI, so most of it is gonna be text, but we also have some things for image, and for image and text. I'm gonna mention them, and then, based on some graphics, we're gonna dive deeper into them.
So we have perplexity; we have bilingual evaluation understudy, this is the BLEU score; recall-oriented understudy for gisting evaluation, that is the ROUGE score; WER, the word error rate; and the metric for evaluation of translation with explicit ordering, METEOR. Like, stuff I had no idea existed. But that's okay, because [00:14:00] now we know it exists.
Consensus-based image description evaluation; SPICE, which is semantic propositional image caption evaluation; Elo; FID, IS, SSIM, which have names that we will look at later; and basically you have a mean opinion score as well. And the Dice similarity coefficient, which is very similar to something that we're gonna have in the next table.
This one is used for segmentation. And in this table, let's say you have a model: you go through the table and you check, okay, which one fits best? We have the different measures, we have the description, we have the limitations. So, for example, perplexity is not intuitive, and it is difficult to interpret absolute values.
Is this gonna be a good metric for our use case or not? Bilingual evaluation understudy does not consider recall and is [00:15:00] insensitive to meaning and concept. If meaning and concept are important, then that's not gonna be a good metric. Or you can combine metrics, right? So these are the things that you can use for generative AI.
And when you look at the output, for the different data categories that we had, image, text, and tabular data, right? So here, for generative AI statistical assessment methods for the outputs, we have image and we have text. I'm gonna declutter it a little bit. And here we have a list of what we can use for image: we can use FID, we can use the inception score,
SSIM, MOS, and Dice. Dice is for comparing masks. This is something you may have heard about from classical computer vision, where we compare annotations against the [00:16:00] masks of the model, the model output against ground truth generated by a pathologist or a subject matter expert.
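For anyone who wants to see what that mask comparison looks like in practice, here is a minimal sketch in Python with NumPy. The tiny masks are made up for illustration; real ones would come from a segmentation model and a pathologist's annotations.

```python
import numpy as np

def dice_coefficient(pred_mask: np.ndarray, truth_mask: np.ndarray) -> float:
    """Dice similarity between two binary masks: 2*|A intersect B| / (|A| + |B|)."""
    pred = pred_mask.astype(bool)
    truth = truth_mask.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    total = pred.sum() + truth.sum()
    return 1.0 if total == 0 else 2.0 * intersection / total

# Toy example: a model's mask vs. a pathologist's annotation mask
model_mask = np.array([[0, 1, 1],
                       [0, 1, 0],
                       [0, 0, 0]])
pathologist_mask = np.array([[0, 1, 1],
                             [0, 0, 0],
                             [0, 0, 0]])
print(f"Dice: {dice_coefficient(model_mask, pathologist_mask):.2f}")  # 0.80
```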
And thank you so much for joining, Scott. 6:00 AM, golden star for the early mornings, and actually a golden star for everybody. Thank you for joining me. Text: if we are dealing with text, we have a different set of metrics. We have BLEU, we have ROUGE, we have BERTScore, BARTScore, PandaLM, or perplexity, right?
So my message here is, even if we don't dive into the scores, there are specific metrics for a specific output and for a type of model, generative versus non-generative, and we need to figure out which metric is good. So accuracy doesn't cut it anymore. And actually, this is the example that people always give,
especially if you have an unequal data distribution, for explaining test results like [00:17:00] high accuracy. Okay? If you have a set of a hundred samples where there's one diseased person, and the model says everybody is healthy, then it's 99% accuracy, but you totally missed the sick person, right?
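To make that concrete, here is a small sketch of that exact scenario, assuming scikit-learn is available; the labels are invented to match the one-in-a-hundred example.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 100 patients, exactly one is diseased (label 1), the rest healthy (label 0)
y_true = np.zeros(100, dtype=int)
y_true[0] = 1

# A useless "model" that calls everybody healthy
y_pred = np.zeros(100, dtype=int)

print("Accuracy:   ", accuracy_score(y_true, y_pred))                  # 0.99
print("Sensitivity:", recall_score(y_true, y_pred, zero_division=0))   # 0.0 -- the sick patient is missed
```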
So each of those metrics has the limitations that we have seen in the table, and you need to know what the limitations are and which metric to pick. And for image we have these green ones; for text and generative AI we have the orange ones that we just mentioned. So let's dive a little bit deeper into some of the generative AI statistical assessment methods.
Let's see if I can make it bigger without compromising... sorry. I wanna just show one at a time. Can I do it? Ah, shoot. I always have trouble showing you [00:18:00] this. I wanna put one fourth of it on the screen. Why can't I do it? I should be able to. Okay.
Apologies for that. Okay.
Perplexity Score Explained
Aleks: Let's focus on perplexity. Is there any way of doing this better? That's okay, that's as good as we can get right now. Perplexity, right? Perplexed is like confused, right? So the less confused the model is, the better the score. That's why a low perplexity score is the good score, whereas a high perplexity is the bad one.
So you want perplexity to be low. We have training data, prior text, and you can see this as the dotted lines. And when you check the perplexity score, the sample output that matches the original well has dotted [00:19:00] lines on the image, and the sample output that matches the original poorly doesn't have dotted lines.
It has stars. So it is dissimilar from the training data, meaning it has high perplexity; like, it is very confused. It is less likely to produce... and I cannot really say confused, although I can say whatever I want, right? My point here is that in this context, common words have a specific meaning, because they correspond to a statistical metric that can be calculated.
So perplexity has a common-language meaning, but it is also a score. The same with confusion. Anyway, I know you get what I'm trying to explain. But here, a low perplexity score is actually good. Why did they put it in red? They should put it in green or something. Blue always reads as more positive than red.
And the model is [00:20:00] more likely to produce the right next word at each stage of generating a whole output. And there's gonna be a lot about this: the model generating the next sentence, the model generating coherent text, the model basically creating something that makes sense.
And this is what all the generative AI metrics wanna convey to us. And there are different ones for different things, right? And you probably need to combine them. Welcome to my chat, sorry guys. Okay, and we have people from Chile. Hello, welcome. So that is perplexity.
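For those curious how the number itself arises, here is a minimal sketch: perplexity is the exponential of the average negative log-likelihood the model assigns to the tokens it generates. The per-token probabilities below are invented purely for illustration.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-likelihood of the tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that is confident about each next word (low perplexity = good)
confident = [0.9, 0.8, 0.95, 0.85]
# A model that is "confused" about each next word (high perplexity = bad)
confused = [0.1, 0.05, 0.2, 0.1]

print(f"confident model: {perplexity(confident):.2f}")  # close to 1
print(f"confused model:  {perplexity(confused):.2f}")   # around 10
```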
BLEU and ROUGE Scores
Aleks: Let's move to BLEU and ROUGE.
BLEU and ROUGE scores, however you want to pronounce them. Again, we have human-written text; this is our human-written text. We have a totally different understanding of ground truth now. It's not [00:21:00] gonna be the classical annotation, the classical showing which cell it is, because we're working with text. And we are gonna do a comparison of the original text
to the AI-generated text. And in these two scores, the AI-generated text with similar fluency and syntax... like, how do you even quantify fluency and syntax? It's amazing that you have to do that now, right? So maybe you could even apply those scores to grading human-generated text, like high school papers or something.
So if the scores are good, then the AI-generated text is similar to what we had from the human. And if it's not a good score, then the AI-generated text is dissimilar; it has dissimilar fluency or syntax. So it doesn't even say correct or incorrect, it's just [00:22:00] dissimilar from what was there originally.
So these are the BLEU and ROUGE, sorry, text comparison methods. You basically compare texts, right?
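As a rough illustration of what these text comparison scores look like in code, here is a sketch assuming the nltk and rouge-score Python packages are installed; the sentences are made up.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "the tumor cells show marked nuclear atypia"
candidate = "the tumor cells show severe nuclear atypia"

# BLEU: n-gram precision of the candidate against the reference text
bleu = sentence_bleu(
    [reference.split()], candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE: recall-oriented overlap (here unigrams and longest common subsequence)
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)

print(f"BLEU:    {bleu:.2f}")
print(f"ROUGE-1: {rouge['rouge1'].fmeasure:.2f}")
```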
Next Sentence Prediction Accuracy
Aleks: Our next candidate here is called next sentence prediction accuracy. And I put this little note here: is it logical and coherent? That is what this metric is about. It checks whether the next sentence after a given sentence makes sense. So let's say we start with sentence A.
The model is gonna give each of these candidate sentences a score, and this one is very high, right? So the score for predicting sentence G would be higher than if it predicted sentence H. And just for those who are listening to this and not viewing, it's figure three in the paper. We [00:23:00] are discussing figure three, so you can
listen to this and look at the paper as well. So if the desired sentence combination is A and G, this model has relatively good next sentence prediction accuracy. However, if the desired sentence combination was A and C, where C on the left-hand side is very low, below 0.2, or A and H, here we had H, then it would have very bad...
Introduction to Prediction Accuracy
Aleks: ...next sentence prediction accuracy. The good thing about this is that the name of the metric actually tells you what it does, and the summary of it is: is it logical and coherent? You give the model a sentence, and it's supposed to predict the next one. Is that next sentence logical and coherent? If it is, good score; if it's not, bad score.
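Here is a minimal, purely illustrative sketch of how such an accuracy could be tallied: for each starting sentence, check whether the model's highest-scoring candidate is the true next sentence. The scores and labels below are hypothetical.

```python
# Hypothetical next-sentence scores, as in the figure: given a starting sentence,
# the model assigns a score to each candidate continuation.
examples = [
    # (scores per candidate, label of the true next sentence)
    ({"G": 0.91, "H": 0.12, "C": 0.05}, "G"),   # model prefers the correct sentence
    ({"G": 0.40, "H": 0.85, "C": 0.15}, "C"),   # model prefers a wrong sentence
]

correct = sum(
    1 for scores, true_next in examples
    if max(scores, key=scores.get) == true_next
)
nsp_accuracy = correct / len(examples)
print(f"Next sentence prediction accuracy: {nsp_accuracy:.2f}")  # 0.50
```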
And we can also apply different metrics [00:24:00] to images.
Understanding Image Metrics: SSIM and FID
Aleks: And images are always more complex than text... well, it depends, sometimes they're less, sometimes they're more complex. But we have these FID and SSIM scores, and I don't wanna leave you hanging on what they mean.
I will check what they mean right now; it's in the paper, but they didn't highlight it.
So SSIM is the structural similarity index measure, and FID is the Fréchet inception distance. These metrics are used to evaluate the quality of images, especially those generated by models like GANs, generative adversarial networks. So: SSIM, structural similarity index measure, and Fréchet inception distance.
In normal-person language, [00:25:00] FID compares features of the image to the original, real image, whereas SSIM compares the perception, so luminance and those kinds of image qualities, to the real image, whereas here it is features. So, on the left, this is still figure three, and in this particular graphic...
I don't see the differences between these images, because it's a very small histopathological image. But the point is that if the SSIM is high, it's good, and if the FID is low, it's good as well, and then the image is similar. Whereas if these scores go in the opposite directions, SSIM goes down and FID goes up,
then the [00:26:00] images are dissimilar. And how does this apply to generative adversarial networks? You have a combination of two networks, where one is checking the output of the other one with those scores, or probably some other scores as well. But basically the point here is that you have the real image, and you compare the output of a network, of a model, to this real image
on different levels: on the feature level and on the image-property similarity level, so contrast, luminance, colors and all the stuff that you basically see but don't even think you see. And that is what those scores are about.
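Here is a rough sketch of both ideas in Python, assuming scikit-image, SciPy and NumPy are available. The images and the "Inception" features are random stand-ins; in real use, the FID features would come from an Inception network run over real and generated images.

```python
import numpy as np
from skimage.metrics import structural_similarity
from scipy.linalg import sqrtm

rng = np.random.default_rng(0)

# --- SSIM: perceptual similarity (luminance, contrast, structure) between two images ---
real = rng.random((64, 64))
generated = np.clip(real + rng.normal(0, 0.05, real.shape), 0, 1)  # slightly noisy copy
ssim = structural_similarity(real, generated, data_range=1.0)
print(f"SSIM: {ssim:.2f}")   # close to 1 -> similar

# --- FID: distance between feature distributions of real vs. generated images ---
real_feats = rng.normal(0, 1, (200, 16))     # stand-in for Inception features of real images
gen_feats = rng.normal(0.5, 1, (200, 16))    # stand-in for features of generated images
mu_r, mu_g = real_feats.mean(0), gen_feats.mean(0)
cov_r = np.cov(real_feats, rowvar=False)
cov_g = np.cov(gen_feats, rowvar=False)
covmean = sqrtm(cov_r @ cov_g)
fid = np.sum((mu_r - mu_g) ** 2) + np.trace(cov_r + cov_g - 2 * covmean.real)
print(f"FID: {fid:.2f}")     # lower -> distributions (and images) more similar
```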
Masked Language Modeling Accuracy
Aleks: One more that is pretty cool is masked language modeling accuracy.
So what's happening here? You're masking. This one is easy: you're masking words. Lemme show it to you. [00:27:00] Sorry... nope... does it work? Yes. We have an original sentence, right? The scariest predator that humans can encounter in North America is the bear. Okay. And then we mask, we cover, some words, and the new sentence is, as you can see on the screen: the scariest [mask] that humans can encounter in [mask] is the bear. Something scary in somewhere is the bear. And then the AI model is supposed to fill in the blanks. So obviously, if the model says mask one is predator and mask two is North America, it did a good job. Whereas if it says herbivore and Antarctica, it didn't do a good job.
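A toy sketch of how that accuracy is scored, with the masked positions and "model outputs" hard-coded for illustration:

```python
# Sentence: "the scariest predator that humans can encounter in north america is the bear"
# We mask word 2 ("predator") and word 8 ("north") and ask the model to fill them in.
masked_positions = {2: "predator", 8: "north"}

# Hypothetical model outputs for the two masks
good_model = {2: "predator", 8: "north"}
bad_model = {2: "herbivore", 8: "antarctica"}

def mlm_accuracy(predictions, targets):
    """Fraction of masked words the model fills in correctly."""
    hits = sum(predictions[i] == word for i, word in targets.items())
    return hits / len(targets)

print("good model:", mlm_accuracy(good_model, masked_positions))  # 1.0
print("bad model: ", mlm_accuracy(bad_model, masked_positions))   # 0.0
```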
That's a cool one too. But what you see here is that these [00:28:00] metrics are totally different from what we were used to from classical computer vision. And if you're just joining, let me know where you're tuning in from. Thank you so much for all the... oh, I see the General Eisenhower answer: World War II.
Yes. So for those who joined later, I had my naturalization interview yesterday, on the way to becoming a US citizen, and that was one of the questions: in which war was General Eisenhower a general before he became a president? And the correct answer comes from Thomas: it was World War II.
Moving on, we are moving to table two, and we're gonna go through these very quickly, just to mention them, so that you have heard the names. Most of them you probably know already, because they have been around for [00:29:00] longer than generative AI: the non-generative AI statistical metrics for classification.
But actually, before we dive in, I wanna show you something that I forgot to show last time from the book: classification. Do we have classification here?
Not specifically; I'll come back to it. This is the computer-vision-to-pathology vision translation: what do you actually do when you detect objects, when you do semantic segmentation, and why you need different metrics. That's from the book. If you don't have the book yet, which I suspect most of you have, you can get it right now: you can scan this code, and that's gonna be a free ebook, my free ebook, Digital Pathology 101, where you will find this image on page... which page is that? 84. So if you want, grab it, but we're gonna go back to the paper. Does the code stay on screen? Okay. I have [00:30:00] to... how do I take off this code?
Oh, hide. I'm happy they make this streaming software so user friendly. I'm just gonna pull the chat back up so that I can see what you're saying. Okay.
Confusion Matrix and Non-Generative AI Metrics
Aleks: So, common non-generative AI metrics. We know these. They are confusion matrix based metrics, and I'm gonna show you my confusion matrix. I drew a confusion matrix for you, because they didn't have a confusion matrix image in this paper, which I guess is maybe too basic, but you will know this super well if you have ever evaluated or learned about test performance, not just AI performance, but any test performance.
Confusion matrix. This is the confusion matrix. True positive [00:31:00] and false positive: TP, true positive; FP, false positive; TN, true negative; FN, false negative. And we have the actual values and the predicted values. So if our model classifies an actual positive as a positive, then we have a plus and a plus, it gets a green color, and it lines up on the diagonal. And these confusion matrices can grow a lot, because you can have a lot of classes, and for every class you have a prediction. But the prediction comes down to: is it a true positive or a false positive, a true negative or a false negative?
So all the other metrics are based on this, all the confusion matrix based metrics. There are other ones that are regression based, which quantify, in [00:32:00] general, how much error a model is making. But what it comes down to with accuracy, sensitivity, specificity, the F1 score, everything that we have in the table, is: okay, is an actual positive classified as a positive, is an actual negative classified
as a negative? Yes or no? How many of these are classified correctly? And there are different combinations of that.
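For a concrete feel, here is a minimal sketch that derives those confusion matrix based metrics from hypothetical labels, assuming scikit-learn:

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             recall_score, precision_score, f1_score)

# Hypothetical actual vs. predicted labels (1 = positive, 0 = negative)
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 0, 0, 1, 0, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}  FP={fp}  TN={tn}  FN={fn}")

print("accuracy:   ", accuracy_score(y_true, y_pred))
print("sensitivity:", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("precision:  ", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("specificity:", tn / (tn + fp))                    # computed by hand from the matrix
print("F1 score:   ", f1_score(y_true, y_pred))
```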
So let's go back to our table. We have sensitivity...
come on... sensitivity, specificity, balanced accuracy, ROC AUC and the curves. And not only do you have the description and another name, because of course they have different names, there is also a super important piece of information: is it prevalence dependent? So accuracy... where is accuracy?
It should be... yeah, precision, or positive predictive value. We don't have accuracy here yet. For example, precision, the positive [00:33:00] predictive value, is prevalence dependent. So you're gonna check: okay, what is the prevalence of the disease, or whatever feature I'm classifying or looking for? If it's
dependent on prevalence, then maybe we need a different metric, or a combination of metrics. That was the example where I said, okay, there's just one diseased person among a hundred people: if the model says everybody is healthy, the accuracy is gonna be very high because you only have one diseased person, but a prevalence-dependent metric like the positive predictive value behaves very differently in a population like that.
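To show what prevalence dependence means numerically, here is a small sketch using Bayes' rule: the same test, with the same sensitivity and specificity, gives a very different positive predictive value as prevalence drops. The 90%/95% figures are invented for illustration.

```python
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value from sensitivity, specificity and disease prevalence."""
    tp = sensitivity * prevalence
    fp = (1 - specificity) * (1 - prevalence)
    return tp / (tp + fp)

# Same hypothetical test (90% sensitive, 95% specific) at different prevalences
for prev in (0.5, 0.1, 0.01):
    print(f"prevalence {prev:>4}: PPV = {ppv(0.90, 0.95, prev):.2f}")
# prevalence  0.5: PPV = 0.95
# prevalence  0.1: PPV = 0.67
# prevalence 0.01: PPV = 0.15
```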
And there are different combinations of this, like the fall-out. Then precision, which is the positive predictive value, the negative predictive value... oh, we do have accuracy, of course we do, and yes, it is prevalence dependent, right? This is so useful, guys. This table is useful. If you're working with this, just print it out and paste it on your, I don't know, whiteboard or [00:34:00] wherever.
Because that's something, if this is not your day to day, which, if you're like me, it probably is not, you still need to know. And I knew it; every time I studied it I knew it, but then I forget about it. I don't know what else is like that, but these metrics, I've given presentations on them, super comfortable, confident, like I knew what I was talking about, and then down the road
I forgot. With this table you don't have to do the searching anymore. You already have all the information you need to pick the right metrics, or, if somebody is giving you performance metrics for something, you can go to this table and check, okay, the prevalence dependence, what are the important points, right? And all these other things.
And all these other things. F1 score that is pretty similar to the D [00:35:00] coefficient, but dice coefficient is being used for images and then MCC also known as t that checks the correlation. Coefficient, that is the correlation coefficient between observed and predicted ification. So this is also important BI classification P-R-A-U-C, calculated from various recall sensitivity or true positive rate and precision values.
We have combinations of all of these, all the metrics that are mentioned in table two. And this is a gift from the authors to us. Thank you, thank you so much. Yeah, Monica is saying we all go through this, through the confusion of the confusion matrix. Okay. Everything is listed there.
This is fantastic. So, a lot of it was confusion matrix based: the [00:36:00] true positives, true negatives, false positives, false negatives. If it has anything to do with that, it's gonna be confusion matrix based, and those will be the metrics that you're gonna be using.
And the ROC AUC curve is an example of that. There is another option: regression-based metrics. These basically calculate the error magnitude: how much does the output of the model differ from the ground truth it was trained against? The goal is to minimize the errors, but this is a lot less granular.
So it's more about overall performance than specific metrics. Stuff to remember: for [00:37:00] classification we have confusion matrix based metrics; for regression, we're checking how much error the model has made.
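As a quick sketch of the regression side, assuming scikit-learn and NumPy; the predicted and measured values are invented:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical continuous outputs, e.g. predicted vs. measured biomarker values
y_true = np.array([2.0, 3.5, 4.0, 5.5, 7.0])
y_pred = np.array([2.2, 3.1, 4.4, 5.0, 7.3])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(f"RMSE:      {rmse:.2f}")                       # average error magnitude, in output units
print(f"R squared: {r2_score(y_true, y_pred):.2f}")   # fraction of variance explained
```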
Generative AI vs. Traditional Machine Learning
Aleks: Now, in figure five, let's compare the traditional, non-generative, and generative AI methods and their statistics. This is super cool. Actually, I should have put figure one here as well. I don't know how to guide you through this, because if you're listening from the beginning to the end, you're mid-presentation, but that's okay.
Figure one compares the benefits and drawbacks of predictive analytics, that is traditional machine learning, and generative AI. So let's just quickly go through this; I want you to hear it, and you can see it while we're reading. For predictive analytics, traditional machine learning, non-generative: since the outputs are simpler and we know what is going on during data generation, the
[00:38:00] interpretability of the data and its validity becomes much clearer. It is very good at providing precise predictions for specific, well-defined tasks. What is not so great is the limited scope of outputs, due to following human-programmed rule sets, and an inability to learn from previous experience. And we're gonna comment on this ability to learn a little bit in a second.
But that's a drawback. And also, noisy or incomplete training data can lead to inaccurate predictions and classifications; that is true across the board: bad data, bad results. And the statistical methods for these are the confusion matrix based ones: accuracy, precision and everything we talked about, recall, sensitivity,
and regression. Okay, the first decision you're gonna make is: are we doing confusion matrix based metrics, or are we doing regression based? And the key points are that they [00:39:00] have greater interpretability, and this ties back to the category itself as well as to the statistical methods. So predictive analytics have greater interpretability,
and the metrics of predictive analytics are gonna have greater interpretability as well. The thing here is, and we have plenty of them, they cannot always capture the full complexity of non-linear, multifaceted data, which is more common in the real world. And since humans are designing aspects of the model and the statistical analysis, we introduce our own biases into the process, which is not limited to predictive analytics.
It's probably more like a feature, not a bug, as they always say in the software development world. We are biased by our experience, by our education, and by everything we carry inside our heads, [00:40:00] and this then translates into everything we create, including the design of statistical analyses and statistical performance metrics for predictive analytics.
Right? And when it comes to generative AI, there is a wider range and scope of outputs due to being able to iteratively learn, and it therefore presents greater possibilities. It can augment existing data sets, which is cool for richer training outcomes, and it's better at capturing complex patterns and relationships. With text, you can analyze text, you can
summarize text; it's not just matching ground truth. The problems here are difficult interpretability and verification of validity. The generative processes are very complicated black boxes. So we have AI, we have the black box components that we cannot fully explain, even [00:41:00] though we're working on explainable AI as a
scientific method, or as a part of AI. But gen AI is even more complex. And then, obviously, poor training data quality can lead to unrealistic, biased or harmful output; that's across the board. And there are greater privacy concerns due to the immense capabilities and the ability to learn. And the ability to learn is something that can be confusing. It's not that the models we're using, ChatGPT or Claude or whatever, are constantly learning while you are inside the user interface; you can iterate with them with different prompts. But to train the model and make it better,
you have to start a training process, and there is an option for that. For privacy reasons, I opted out: in ChatGPT right now there is a setting, oh, do you want your interactions with this chat [00:42:00] to be used for further training of this model? And I said, no, I don't want that. But it's not something that's happening in real time;
it's something where the developers use the new data that you're generating to later train the model. And the statistical methods that we have mentioned were the ROUGE score, BLEU score, FID, SSIM, MLM accuracy, perplexity score, masked language modeling accuracy, next sentence prediction accuracy. So everything is checking:
does this text, and most of it is still text, make sense? And sometimes these methods appear to be valid but are in truth prioritizing superficial similarities over meaningful relationships. So if the data is superficially similar, it will get high scores, even without a meaningful relationship.
So that's something you have to consider. They're also highly sensitive to data quality issues: minor [00:43:00] flaws in the training data lead to biased, poor-quality output. That is a problem across the board. And one of these things is challenging, right? You don't have a single number, you cannot put a threshold, oh, above this and that positive predictive value
it's okay, because that's not a metric you would use here. So validating this is gonna be an interesting endeavor.
I am just checking how far along we are. Okay, we're almost at the end, guys. Amazing. If you have questions, let me know right now, if you have anything to discuss. Oh, and we have people from Argentina. Hey, hello, hello. Great to have you. If you haven't said hi, say hi before we get off the livestream.
Future Directions and Solutions in AI
Aleks: But let's discuss this figure six, where they have this looking into the future, looking into the future.
Basically, they discuss different metrics in the paper, and for every use case you need to [00:44:00] choose the right metric; all of them have limitations. So the solutions to generative AI statistical limitations are to maybe create large language model based assessment platforms, integrate large language models into scoring solutions, and assess performance based on coherence, fluency, and relevance,
not just data similarity, not just sentence similarity. And human expert assessment platforms enable experts to review and rate generated samples. So this is the human-in-the-loop concept, which I'm a fan of, as you cannot always predict everything, and a person will be able to catch it.
You can enable experts to review and rate generated samples and provide high-quality, human-curated evaluations. And then, in the future, you can automate this process to scale, because obviously, if you have a human in the loop [00:45:00] all the time, and I know this from the annotation examples, it's not scalable.
And then there is also something called adversarial testing. You train an AI to evaluate an AI; maybe that is part of adversarial testing, like the generative adversarial networks, where one network is checking the output of the other network. And once they are on the same page, then maybe you show it to a human, or however you wanna design it.
The goal is to create robust and reliable evaluation frameworks to improve model performance and trustworthiness in generative AI applications in medicine. Yeah, improve model performance and trust, that's what we want. Whereas when it comes to solutions for traditional, non-generative ML statistical limitations, we can use something called ensemble methods, where we combine the predictions of multiple models to improve performance, improve accuracy and [00:46:00] robustness of classification and regression models.
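As a rough sketch of the ensemble idea, assuming scikit-learn and using synthetic data in place of real slide-derived features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data; in practice these would be features extracted from slides
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Combine the predictions of several different models by (soft) voting
ensemble = VotingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000)),
        ("forest", RandomForestClassifier(random_state=0)),
        ("svm", SVC(probability=True, random_state=0)),
    ],
    voting="soft",
)
ensemble.fit(X_train, y_train)
print(f"ensemble accuracy: {ensemble.score(X_test, y_test):.2f}")
```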
Then there is the challenge of data drift: we design data drift assessment and identification, improving machine learning operations. And then we can also do secondary and tertiary generalization testing to minimize overfitting issues. And enhanced model transparency ties back into explainable AI; the abbreviation is XAI, which is actually the official abbreviation of explainable AI. So: identification of feature importance to minimize explainability issues. And the goal here is to develop reliable AI-integrated systems for performing the statistics associated
with traditional machine learning, for example classification and regression models. So that is it for [00:47:00] today.
Course and Book Information
Aleks: We are past the halfway mark. What I am doing right now already, and I think this is already available online... let me make myself bigger... is putting this on YouTube, with me talking to you and my engagement with you and our discussions.
You'll also find it in a course. Let me show you a QR code for this course. I always say you either pay with time or with money. If you don't have that much time to sift through my interactive way of doing this live, then there is a version in this course of
all the live streams that has the audience interactions cut out, and it's just down to the information. So if that's what you're after, you can scan this code and [00:48:00] check it out on the website. If you don't have the book yet, get the book, because I'm gonna be updating the book.
And in the updated book version, there is gonna be a summary of these live streams based on those papers. So where is the book?
Here is the book... ah, I bumped into my computer. This is the book. You can also get it for free in the digital version, and this QR code is gonna take you there.
Conclusion and Final Thoughts
Aleks: Other than that, if you have any questions, let me know in the chat. Let me know in the comments. Reach out on LinkedIn and I will talk to you in the next episode.
Have a fantastic rest of your day. Have a fantastic weekend, and let's keep trailblazing together.