Digital Pathology Podcast

219: POLARIS: Reliable AI Classification and Risk Stratification of Colorectal Polyps

Aleksandra Zuraw, DVM, PhD Episode 219



Paper Discussed in this Episode:

Reliable classification of polyps based on artificial intelligence: a development and validation study. Julbø FMI, Henriksen AL, et al. eClinicalMedicine 2026;93: 103826.

Episode Summary:

In this journal club deep dive, we explore a groundbreaking 2026 study that tackles the massive bottleneck in gastrointestinal pathology caused by successful colorectal screening programs. We examine POLARIS, an AI triage system designed to safely clear over 50% of a pathologist's routine workload. But what happens when the algorithm fiercely disagrees with the human diagnosis? In a blinded showdown, the AI proves it's not just an efficiency tool—it might just be the ultimate safety net for catching high-risk cancer cells that human eyes overlook.

In This Episode, We Cover:

The Pathology Bottleneck: Why the success of colorectal screening programs is drowning labs in biopsy slides, and how the subjective, visual nature of diagnosing polyps leads to dangerous inter-observer variability.

The 5:2 Triage Strategy: How POLARIS categorizes gigapixel slide images into five biological classes (0 to 4) and translates them into two highly actionable buckets: "Review" (the complex and malignant) and "No Review Required" (normal tissue and routine tubular adenomas with low-grade dysplasia).

Beating the "Clever Hans" Effect: How researchers prevented the AI from "cheating" by recognizing the digital fingerprints of different scanner brands, like Aperio vs. NanoZoomer. By using an image registration tool called elastix to perfectly align slides scanned on both machines, they heavily penalized the algorithm mathematically for relying on color profiles, forcing it to focus purely on biological morphology.

The Showdown - Humans vs. AI: A blinded consensus review was conducted on 40 highly contentious cases where the AI aggressively disagreed with the original patient medical record. Three independent expert pathologists were brought in to break the tie without knowing the AI's or the original doctor's diagnosis.

The Shocking Results: The expert panel sided with the AI over the original human diagnosis in a staggering 92.5% of the disputed cases, proving the established clinical "ground truth" isn't infallible.

The RGBA Heat Map: How POLARIS functions as an active assistant, leaving normal tissue transparent (scaling the alpha channel to zero) while highlighting severe cellular atypia in glowing red, acting as a hyper-accurate topographical map for pathologists.

Key Takeaway:

AI in digital pathology isn't about autonomously replacing human experts; it's a hyper-sensitive navigational aid. By safely managing the flood of routine low-grade cases and accurately highlighting hidden high-risk dysplasias that exhausted human eyes miss, POLARIS corrects human errors and elevates the baseline standard of diagnostic care across the entire pipeline.

Support the show

Get the "Digital Pathology 101" FREE E-book and join us!

You know, usually when we talk about a medical diagnosis, there is this um this built-in assumption of absolute precision,


right? Like it's a math problem.


Exactly. We treat it like engineering or something. You break your arm, the X-ray produces this jagged white line on a dark background and the doctor just points to it. Broken or not broken, it feels entirely binary.


Yeah. And the visual evidence in radiology often gives that uh that illusion of objective certainty. But when you step out of the radiology suite and into the world of pathology,


it's a whole different ballgame.


It really is. That clean binary landscape just vanishes.


We're looking at a diagnostic reality that is inherently murky. I mean, if you picture a modern pathology lab, you might imagine this quiet, serene environment with a scientist peering into a microscope


again in the movies,


right? But the physical reality is um it's a flood. It is an endless river of glass slides, tissue samples, and a human being squinting through a lens trying to find a microscopic needle in a massive haystack


and that flooded highly subjective reality is exactly the target of our discussion today. Welcome to the digital pathology podcast everyone.


Glad to be here


for all the trailblazers out there. You know the laboratory professionals, pathologists and clinicians navigating this daily grind. Our mission today is to unpack a massive journal club study that addresses this exact bottleneck.


And it is a fascinating one.


Oh, it's huge. We are diving into a groundbreaking paper from eClinicalMedicine, published recently in March 2026. The study is titled, uh, let me get this right: Reliable classification of polyps based on artificial intelligence: a development and validation study.


Yeah. And we really have to highlight the international team behind this because it wasn't just a small local project.


No, not at all.


This was a massive collaborative effort. It was led by joint first authors F. M. I. Julbø and A. L. Henriksen, along with corresponding author Andreas K. P.


And they coordinated this impressive lineup, right? Spanning institutions across Norway, the UK and the Netherlands.


Exactly. They are tackling a global health care dilemma that essentially sits at the intersection of well public health success and logistical failure.


Right. Because colorectal screening programs are actually highly effective.


They are they do exactly what they're supposed to do.


They catch polyps early. They save lives. But the paradox of that success is that uh we are basically drowning our pathology labs in biopsies


which is a good problem to have in theory but a nightmare in practice.


Yeah. So, the central premise of this paper is evaluating an AI system designed to safely clear over 50% of a pathologist's routine workload.


50%. Let that sink in.


Right. While simultaneously acting as a safety net to catch the high-risk cancer cells that human eyes actually miss.


And to really understand the magnitude of this problem, you have to look at the supply and demand crisis in gastroenterology right now.


It's getting worse, isn't it?


Oh, absolutely. Global populations are aging and organized screening programs are expanding their reach. So we are fundamentally catching more polyps


which means more slides,


more slides, more data. Simultaneously the pipeline of specialized gastrointestinal pathologists is shrinking.


Oh wow. So fewer experts.


Exactly. We have an ever-increasing volume of highly complex visual data being routed to a dwindling number of human specialists. It's a bottleneck.


And beyond just the sheer volume of work, the work itself is um incredibly subjective. Diagnosing these polyps isn't like reading that broken bone on an X-ray we talked about.


No, not at all. It's nuanced,


right? You could give the exact same borderline tissue slide to three different highly experienced pathologists and you might get three slightly different interpretations.


You might even get a different interpretation from the same pathologist depending on, you know, whether they view the slide fresh on a Monday morning or at the end of an exhausting Friday shift,


which is terrifying. Honestly,


It is. That intra- and inter-observer variability is a known vulnerability in the field. Yeah.


And it creates this urgent demand for automated objective decision support tools.


Which brings us to the star of today's show. This study introduces a tool named Polaris.


Yes. Polaris,


which stands for polyp artificial intelligence-based risk classifier. Polaris separates itself from previous algorithms through its underlying clinical logic. Specifically, what we can call its 5:2 strategy.


Yeah, let's break down that clinical logic because handing a clinician a complex mathematical probability doesn't actually help clear their desk. They need actionable data.


Right. "There's an 80% chance of this or that" isn't a diagnosis.


Exactly. So, Polaris initially classifies these digital whole slide images into five distinct biological classes.


Yeah.


These represent an increasing risk of malignancy.


Okay. So, 0 to four,


right? Class zero is your perfectly normal tissue. Nothing to worry about. Class one is tubular adenomas with low-grade dysplasia.


Okay.


Then class two steps up into tubulovillous adenomas with low-grade dysplasia, or some of the more tricky non-neoplastic polyps


getting a bit more complicated.


Yeah. Class three covers serrated polyps and then class four is the critical end of the spectrum. We're talking adenocarcinoma and high-grade dysplasia.


So sorting them 0 to four is the um the technical achievement here. But the clinical application is how Polaris groups those five classes into just two highly actionable buckets. Right?


That's the brilliance of it. The first bucket is just called review. This captures classes two, three, and four.


So the bad stuff. Right. These are the increasingly abnormal, diagnostically challenging, or outright malignant findings. A specialist pathologist absolutely must lay eyes on these.


No question.


No question. The second bucket is no review required. And that covers classes zero and one.


And grouping class one into that no review required bucket. That is the major differentiator of this research, isn't it?


It's massive. We have seen older AI models in this space, things like, uh, IGUANA and others, attempt to triage cases before


and they were more conservative,


extremely conservative. They played it incredibly safe.


They only excluded class zero, the completely undeniably normal tissues, from the pathologist's queue. But Polaris is actively making the call to safely group tubular adenomas with low-grade dysplasia, that's class one, into the no review bucket alongside the normal tissue.
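
A minimal sketch, in Python, of the 5:2 triage logic just described, for anyone who wants it written down. The class labels follow the conversation above; the function name and structure are illustrative only, not the authors' code.

```python
# Minimal sketch of the 5:2 triage logic: five biological classes collapsed
# into two actionable buckets. Labels follow the episode; code is illustrative.

CLASS_NAMES = {
    0: "normal tissue",
    1: "tubular adenoma with low-grade dysplasia",
    2: "tubulovillous adenoma (low-grade) / tricky non-neoplastic polyp",
    3: "serrated polyp",
    4: "high-grade dysplasia / adenocarcinoma",
}

def triage_bucket(predicted_class: int) -> str:
    """Map the five classes (0-4) onto the two triage buckets."""
    if predicted_class in (0, 1):
        return "No Review Required"  # normal tissue + routine low-grade tubular adenomas
    if predicted_class in (2, 3, 4):
        return "Review"              # complex, diagnostically challenging, or malignant
    raise ValueError(f"Unknown class: {predicted_class}")

if __name__ == "__main__":
    for cls, name in CLASS_NAMES.items():
        print(f"Class {cls} ({name}) -> {triage_bucket(cls)}")
```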


And from a clinical standpoint, that operational shift just I mean it cannot be overstated.


It changes everything


because tubular adenomas with low-grade dysplasia, especially the ones under 10 millimeters, are incredibly common routine findings in these screening populations, right?


Oh, they're everywhere. And they possess a very standard consistent cellular morphology. They look the same.


And more importantly, finding one doesn't mean you rush the patient into surgery.


No, not at all. They are associated with long surveillance intervals. Finding one usually just means putting the patient on a schedule for a follow-up colonoscopy in, say, a few years. It does not trigger an immediate aggressive oncology intervention.


You know, it makes me think of um designing a hospital triage system.


Okay. How so?


Well, a highly inefficient triage system would only send the perfectly healthy people home and then it would force everyone with a scraped knee to sit in the waiting room right next to the patients having heart attacks.


That's a great analogy,


right? By recognizing that small tubular adenomas, the low-grade dysplasia, essentially the routine scrapes and bruises of gastroenterology, Polaris suddenly clears out the bulk of the waiting room.


It just empties it.


Yeah. It leaves the highly trained specialists to focus their cognitive energy entirely on the complex cases in the trauma ward.


But, you know, building a triage system that reliable, one that clinicians can actually trust to dismiss cases, requires an unprecedented amount of training data and a highly sophisticated architecture.


I was going to say, you can't just throw a few hundred images at an algorithm and call it a day.


Definitely not. The development data set they utilized to build Polaris contained more than 15,000 whole slide images.


Wow. 15,000


from nearly 3,000 patients participating in the UK bowel cancer screening program between 2014 and 2018.


And handling 15,000 whole slide images requires serious computational power. I mean, for you trailblazers listening, you know these are not standard iPhone photographs.


Far from it.


These are massive gigapixel images of tissue. If you try to feed a single whole slide image into a standard neural network all at once, the memory requirements would just instantly melt your hardware.


Your computer would catch fire. So to solve that hardware constraint, the researchers utilized an open source foundation model called H-optimus-0.


Okay.


And they combined it with a technique known as attention-based multiple instance learning.


Multiple instance learning. So how does that actually work in practice?


Well, it works by breaking that massive gigapixel image down into thousands of tiny, manageable square tiles.


So it chunks it up.


Exactly.


Yeah. The AI then extracts a feature vector from each tile. It's essentially a mathematical summary of the cellular structures within that specific little square.
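
To picture that tiling step, here is a rough sketch using the open-source OpenSlide library to cut a gigapixel whole slide image into fixed-size tiles before feature extraction. The tile size, pyramid level, and crude background filter are assumptions for illustration, not the study's actual preprocessing.

```python
# Rough sketch: chop a gigapixel whole slide image (WSI) into small tiles.
# Tile size, level, and the background filter are illustrative assumptions.
import numpy as np
import openslide  # pip install openslide-python (needs the OpenSlide C library)

def iter_tiles(wsi_path: str, tile_px: int = 256, level: int = 0):
    """Yield (x, y, tile) RGB arrays covering the slide at the given pyramid level."""
    slide = openslide.OpenSlide(wsi_path)
    width, height = slide.level_dimensions[level]
    for y in range(0, height - tile_px, tile_px):
        for x in range(0, width - tile_px, tile_px):
            # read_region takes level-0 coordinates; fine here since level=0.
            region = slide.read_region((x, y), level, (tile_px, tile_px)).convert("RGB")
            tile = np.asarray(region)
            if tile.mean() < 230:   # skip mostly-white glass background
                yield x, y, tile
    slide.close()
```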


Okay, so instead of forcing the AI to look at the entire beach all at once, it scoops up thousands of individual buckets of sand, but it still needs to know which bucket contains the threat, right? Which one has the cancer cells?


And that is where the attention mechanism comes in. The algorithm acts kind of like a detective. It learns to ignore the vast expanses of perfectly normal tissue tiles.


It just tunes them out,


right? It learns to heavily weight or pay attention to the specific tiles that exhibit unusual or suspicious cellular architecture and then it aggregates the data from those highly weighted tiles to formulate a single comprehensive diagnosis for the entire slide.
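
For the technically curious, here is a minimal attention-based multiple instance learning head in PyTorch, in the spirit of what was just described: per-tile feature vectors are scored, softmax-weighted, and pooled into one slide-level prediction. The feature dimension, hidden size, and layer choices are illustrative assumptions, not the POLARIS architecture.

```python
# Minimal attention-based multiple instance learning (MIL) head, showing how
# per-tile feature vectors are weighted and aggregated into one slide-level
# prediction. Dimensions and details are illustrative, not the paper's model.
import torch
import torch.nn as nn

class AttentionMIL(nn.Module):
    def __init__(self, feat_dim: int = 1536, hidden: int = 256, n_classes: int = 5):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(feat_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, 1),      # one attention score per tile
        )
        self.classifier = nn.Linear(feat_dim, n_classes)

    def forward(self, tile_features: torch.Tensor):
        # tile_features: (n_tiles, feat_dim) embeddings from a foundation model.
        scores = self.attention(tile_features)              # (n_tiles, 1)
        weights = torch.softmax(scores, dim=0)               # normal tiles get ~0 weight
        slide_embedding = (weights * tile_features).sum(0)   # weighted sum over tiles
        logits = self.classifier(slide_embedding)             # one prediction per slide
        return logits, weights.squeeze(1)

# Toy example: 3,000 tiles, each summarized as a 1536-d feature vector.
model = AttentionMIL()
logits, attn = model(torch.randn(3000, 1536))
print(logits.shape, attn.shape)  # torch.Size([5]) torch.Size([3000])
```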


That's brilliant. But um anytime we discuss training AI on massive data sets of images, the immediate concern that pops into my head is the clever Hans effect.


Ah yes, the lazy AI problem.


Exactly. AI models are notoriously lazy. They will find the path of least resistance to a correlation even if that path is totally biologically irrelevant.


Right? They cheat.


They do. Like, we have seen dermatology algorithms designed to spot skin cancer that actually just learn to identify the yellow rulers that doctors place next to malignant tumors in photographs.


I remember that study.


Yeah. If the ruler is in the picture, the AI calls it cancer. It completely ignores the actual skin. So in digital pathology, what is the equivalent of that yellow ruler?


In digital pathology, the yellow ruler is almost always the scanner used to digitize the slide,


the actual machine,


the machine itself. The physical slides in this UK data set were digitized using two completely different commercial scanners: the Leica Aperio AT2 and the Hamamatsu NanoZoomer XR.


Okay, two different brands,


right? And every brand of scanner leaves a subtle non-biological digital fingerprint on the image. They have different color profiles, different contrast leveling, different sharpening algorithms.


Oh, I see. So, if the Aperio scanner was used slightly more often for the cancer slides, just simply by coincidence, the AI would just learn to look for the Aperio color signature.


Exactly. It would act like a camera critic. It would be reacting to the Instagram filter applied to the image rather than the actual tissue biology.


That's a huge flaw. How do we trust that Polaris isn't doing exactly that?


Well, the researchers anticipated this exact vulnerability and they applied a brilliant technical fix during the training phase.


What did they do? They utilized an image registration tool called elastix. They took a subset of slides and scanned them on both machines.


Oh, clever.


Very. Then using elastix, they perfectly aligned the two digital versions down to the micrometer. This allowed them to extract matched tiles. So they had the exact same physical piece of tissue digitized by two different scanners, side by side.


Wow. So they built a perfectly controlled AB test directly into the training loop.
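
As a rough idea of what that could look like in code, here is a heavily simplified sketch using the itk-elastix Python wrapper to register a NanoZoomer rendering onto an Aperio rendering of the same slide and then cut matched tiles from identical coordinates. File names, downsampling, tile size, and default registration parameters are all assumptions, not the authors' pipeline.

```python
# Heavily simplified sketch of aligning two scans of the same physical slide
# with elastix (via the itk-elastix wrapper), then cutting matched tiles from
# identical coordinates. File names and parameters are illustrative only.
import itk          # pip install itk-elastix
import numpy as np

# Downsampled grayscale renderings of the same slide from the two scanners.
fixed = itk.imread("aperio_slide.png", itk.F)        # reference image
moving = itk.imread("nanozoomer_slide.png", itk.F)   # image to be warped

# Register the NanoZoomer rendering onto the Aperio rendering (default stages).
registered, transform_params = itk.elastix_registration_method(fixed, moving)

fixed_np = itk.array_from_image(fixed)
registered_np = itk.array_from_image(registered)

def matched_tiles(a: np.ndarray, b: np.ndarray, tile: int = 256):
    """Yield pairs of tiles cut from identical coordinates in the aligned images."""
    h, w = min(a.shape[0], b.shape[0]), min(a.shape[1], b.shape[1])
    for y in range(0, h - tile, tile):
        for x in range(0, w - tile, tile):
            yield a[y:y + tile, x:x + tile], b[y:y + tile, x:x + tile]
```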


That's exactly what it is. You feed the AI the Aperio tile and the NanoZoomer tile of the exact same cells.


And then what? How do they force it to ignore the colors?


They added a specific penalty term to the algorithm's loss function based on those matched tiles. If the AI analyzed the Aperio tile and the NanoZoomer tile and produced different risk scores, the algorithm was heavily penalized mathematically.


So it gets slapped on the wrist.


Exactly. A big mathematical slap. So the AI quickly learned that relying on color profiles or contrast differences resulted in failure. It was forced to abandon the scanner's digital fingerprint and focus entirely on the underlying biological morphology.
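
A minimal sketch of what such a scanner-consistency penalty could look like, assuming the model outputs class probabilities: the standard classification loss plus a term that grows when matched Aperio/NanoZoomer tiles of the same tissue get different predictions. The squared-difference form and the lambda weight are illustrative; the paper's exact penalty isn't spelled out in this conversation.

```python
# Minimal sketch of a scanner-consistency penalty: the usual classification loss
# plus a term punishing different predictions on matched Aperio/NanoZoomer tiles
# of the same tissue. The squared-difference form and lambda are assumptions.
import torch
import torch.nn.functional as F

def training_loss(logits, labels, aperio_logits, nanozoomer_logits, lam: float = 1.0):
    # Standard supervised term on the diagnosis labels.
    ce = F.cross_entropy(logits, labels)

    # Consistency term: predictions on the same tissue, scanned twice, should match.
    p_aperio = torch.softmax(aperio_logits, dim=-1)
    p_nano = torch.softmax(nanozoomer_logits, dim=-1)
    consistency_penalty = ((p_aperio - p_nano) ** 2).mean()

    return ce + lam * consistency_penalty

# Toy usage: 8 training examples with 5 classes, plus 32 matched tile pairs.
loss = training_loss(
    torch.randn(8, 5), torch.randint(0, 5, (8,)),
    torch.randn(32, 5), torch.randn(32, 5),
)
print(loss.item())
```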


That is so elegant. Testing that theory in validation must have been incredibly satisfying. I mean, if the penalty worked, the model should produce basically identical results regardless of the machine.


And the results were definitive. During validation, Polaris achieved a 97.93% agreement in its predictions across the two different scanner types.


That's practically identical.


It is. The model proved to be extraordinarily robust against non-biological information. Okay, so having a model trained to ignore the camera and focus on the cells across 15,000 slides is a great foundation, but the real world of clinical pathology is messy.


Oh, very messy.


Taking an algorithm out of its uh its development sandbox and exposing it to totally unseen data that is the ultimate trial by fire because if the model was overfit to the specific staining protocols of that one UK lab, it would just fall apart externally.


Which is why the external validation in this paper is so critical. The study utilized two distinct arenas for external validation.


Okay. What were they?


The primary arena was a massive, geographically separate data set of 10,842 whole slide images from Cheltenham General Hospital.


Over 10,000. That's a huge test set.


And crucially, these slides span diagnoses from 2008 to 2019. So the AI had to successfully analyze over a decade of changing lab protocols, fading stains, varying slide preparation techniques, all of it.


A real-world gauntlet. What was the second arena?


The second was a specialized external test set of 495 slides from the University Medical Center Utrecht in the Netherlands.


Okay. Testing in a different country introduces entirely new patient demographics and lab standards. But why isolate those specific 495 slides from Utrecht? Was there something special about them?


Yeah, that Utrecht cohort was specifically curated to have a high proportion of high-grade dysplasia cases.


Oh, I see.


It was a targeted stress test. They really wanted to evaluate the model's ability to catch the absolute most dangerous high-risk polyps in an unfamiliar data set.


Got it. So, the performance metrics on those external data sets, this is the core of the paper. How did it actually do?


Well, in the primary analysis, Polaris achieved an 86.65% balanced accuracy.


86%. Okay.


But balanced accuracy doesn't tell the whole story in clinical triage. Right. The critical metrics are sensitivity and specificity.


Right. Because missing a cancer is much worse than double-checking a healthy slide.


Exactly. And Polaris achieved a 98.94% sensitivity for high-grade dysplasia and adenocarcinoma.


Wait, 98.9


98.94%.


Wow. So that sensitivity metric really represents the safety net. At almost 99% if a slide contains advanced dangerous cancer cells, Polaris is successfully identifying it and pushing it to the review bucket almost 99 times out of 100.


The false negative rate for severe malignancy is remarkably close to zero.


That is incredible. And on the other side of the equation, the efficiency side


Well, the algorithm achieved an 83.04% specificity for safely excluding the normal tissues and those tubular adenomas with low-grade dysplasia.


Okay, so that specificity is the efficiency engine of the whole system. It means the AI isn't just playing it overly safe and throwing every slightly blurry slide into the review bucket,


right? It's not crying wolf.


It is confident enough to accurately tell the pathologist, hey, these eight out of 10 routine cases are perfectly fine. Do not waste your time looking at them.


And synthesizing those two metrics translates directly to workflow optimization. Polaris is maintaining a near 99% safety net for the critical cases while simultaneously demonstrating the capacity to accurately identify and exclude over 80% of the routine low-risk cases.


So in a live laboratory setting, that operating point translates to potentially clearing over half of all colorectal polyp slides from a pathologist's desk entirely.


Half the desk swept clean


safely.
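
To see how those two figures turn into a workload number, here is a small back-of-the-envelope calculation. The sensitivity and specificity are the values quoted in the episode; the case mix per 1,000 slides is invented purely to illustrate the arithmetic.

```python
# Back-of-the-envelope: how sensitivity and specificity translate into cleared
# workload. The case-mix numbers are invented for illustration only; the
# sensitivity and specificity are the figures quoted in the episode.
sensitivity_hgd = 0.9894   # high-grade dysplasia / adenocarcinoma slides flagged for review
specificity = 0.8304       # class 0-1 slides correctly cleared as "No Review Required"

# Hypothetical case mix per 1,000 screening slides (invented for illustration):
n_low_risk = 650      # class 0-1: normal tissue or low-grade tubular adenoma
n_high_grade = 40     # class 4: high-grade dysplasia or adenocarcinoma

cleared = n_low_risk * specificity                  # safely removed from the queue
missed = n_high_grade * (1 - sensitivity_hgd)       # dangerous slides not flagged

print(f"Cleared without pathologist review: {cleared:.0f} of 1,000 slides ({cleared / 1000:.0%})")
print(f"Expected class-4 slides not flagged: {missed:.2f} per 1,000 slides")
```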


But you know, AI models are never infallible. They will eventually make a mistake. And the mark of a truly rigorous study is how the researchers handled the errors. They didn't just publish the accuracy rate and ignore the outliers, right? They dug deeply into the discordance. They looked closely at the cases where Polaris and the human experts fiercely disagreed.


And that discordance review is arguably the most illuminating section of the entire study. The research team isolated 40 highly contentious cases from the validation set.


40 cases,


right? These were cases where Polaris aggressively disagreed with the original clinical diagnosis, meaning the diagnosis that was recorded in the patient's actual medical record, the diagnosis we generally accept as the ground truth.


So, they took these 40 slides and what did they do?


They handed them to three independent expert pathologists for a blinded consensus review.


Okay, I love this setup. So, the AI flags a slide as cancer. The patient's historical medical record says it was totally normal. And three new experts are brought in blind, completely unaware of what the AI or the original doctor said, to break the tie.


Exactly. It's the ultimate showdown.


So, what happened? Who was right?


The outcome of that blinded review completely upends how we think about ground truth. In 92.5% of those 40 disputed cases,


wait, over 90%.


37 out of 40 slides, the human consensus panel actually agreed with the AI over the original clinical diagnosis.


That is wild. So, the AI wasn't failing the test. The human answer key was wrong


pretty much. Yeah. Yeah,


it is literally the equivalent of a software spellchecker catching a deeply buried typo in a traditionally published, heavily edited novel. The ground truth established in routine high volume clinical practice is clearly not infallible.


Not at all. And the real world implications of those corrections are profound. I mean the paper highlights a specific, frankly terrifying discrepancy.


But what happened?


The slide had originally been evaluated by a human and labeled in the patient's clinical record as class zero perfectly normal tissue.


Okay. But when Polaris analyzed that exact same slide, it flagged it with a high probability for class 4,


which is high-grade dysplasia or adenocarcinoma. You cannot have a wider diagnostic gap than perfectly normal versus severe cancer.


Exactly. And during the blinded re-evaluation, the three experts analyzed the tissue and agreed entirely with Polaris.


Wow.


The original human pathologist, likely dealing with that flooded queue and visual fatigue we discussed earlier, had simply missed a critical high-risk lesion, and the AI caught it. That specific case proves that Polaris is fundamentally more than just an efficiency tool designed to save lab processing time. It is an active safety tool. It's designed to prevent catastrophic medical misses.


It's a safety net.


But to be fair, that still leaves three cases out of the 40 where the AI was genuinely considered wrong even after the consensus review. And analyzing how the AI fails is just as important as analyzing how it succeeds.


Definitely. And understanding those false negatives requires looking at how Polaris actually communicates its findings to a human,


right? Because it doesn't just spit out a number.


No, it does not just print out a risk percentage. It generates highly detailed RGBA heat maps. The AI blends its tile level predictions into a visual overlay directly on the digital slide.


RGBA. So red, green, blue, and A is alpha, right? The transparency channel.


Correct. The alpha channel is key here. Tiles that the AI predicts are completely normal remain transparent. The alpha channel is dialed down to zero, so the pathologist just sees the raw tissue,


just a normal microscope view.


But as the algorithm detects increasing risk, the overlay becomes opaque and the colors shift.


How do the colors map out?


They map directly to the classes. Class one, low-grade areas, show up as green. Class two areas turn yellow. Class three serrated regions are orange. And class four, the highest risk areas of high-grade dysplasia, they glow bright red.
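
Here is a rough sketch of how per-tile predictions could be turned into that kind of RGBA overlay, with the alpha channel held at zero for normal tiles and the class colors mentioned above. The exact color values and alpha scaling are illustrative, not the tool's actual rendering code.

```python
# Rough sketch of building an RGBA overlay from per-tile class predictions:
# normal tiles stay transparent (alpha of zero), riskier tiles get opaque
# class colors. Color values and alpha scaling are illustrative only.
import numpy as np

# RGB per class, following the mapping described in the episode.
CLASS_COLORS = {
    0: (0, 0, 0),        # normal: color irrelevant, stays transparent
    1: (0, 200, 0),      # low-grade tubular adenoma: green
    2: (255, 220, 0),    # tubulovillous / tricky non-neoplastic: yellow
    3: (255, 140, 0),    # serrated: orange
    4: (255, 0, 0),      # high-grade dysplasia / adenocarcinoma: red
}

def tile_overlay(pred_class: np.ndarray, confidence: np.ndarray) -> np.ndarray:
    """Build an (H, W, 4) uint8 RGBA overlay from a grid of per-tile predictions."""
    h, w = pred_class.shape
    overlay = np.zeros((h, w, 4), dtype=np.uint8)
    for cls, rgb in CLASS_COLORS.items():
        overlay[pred_class == cls, :3] = rgb
    # Alpha: fully transparent for normal tissue, scaled by confidence otherwise.
    alpha = np.where(pred_class == 0, 0.0, confidence) * 180  # keep below full opacity
    overlay[..., 3] = alpha.astype(np.uint8)
    return overlay

# Toy example: a 4x4 tile grid that is mostly normal with one red hotspot.
classes = np.zeros((4, 4), dtype=int)
classes[1, 2] = 4
conf = np.full((4, 4), 0.9)
print(tile_overlay(classes, conf)[..., 3])  # only the hotspot tile has nonzero alpha
```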


So the algorithm essentially hands the pathologist a topographical map highlighting the most suspicious cellular architecture in bright red.


Exactly. So let's apply that heat map to the most complex false negative case in the study.


Okay, let's look at a miss.


There was one specific slide where the final expert consensus determined the tissue contained high-grade dysplasia, but Polaris had ultimately classified the overall slide as low risk.


It missed it.


Well, mathematically, yes. The algorithm's aggregated average for the entire slide fell just below the threshold to trigger the review bucket.


Okay, so it got routed to the no review pile,


right? But while the final categorization was technically a miss, the heat map told a very different story.


What did the heat map look like?


When the researchers pulled up the visual overlay Polaris generated for that slide, it featured a bright red cluster highlighting the exact region of tissue the human experts were intensely debating.


Oh, so the AI did see the suspicious cells. It highlighted them in red. But because the rest of the slide was overwhelmingly normal, the aggregated score was diluted.


Exactly. The dilution is a limitation of the aggregation math, but the biological detection itself was flawless. It saw the bad cells.


That's fascinating.
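
A tiny numeric illustration of that dilution effect, using a plain average over per-tile risk scores. The numbers are invented, and the real system aggregates with the attention weighting discussed earlier; this only shows how a small hotspot can sink below a slide-level threshold while still lighting up locally.

```python
# Tiny illustration of the dilution problem: a small cluster of high-risk tiles
# can vanish in a slide-level average even though it lights up locally.
# All numbers are invented; this is not the paper's aggregation scheme.
import numpy as np

n_tiles = 5000
tile_risk = np.full(n_tiles, 0.02)     # overwhelmingly normal tissue
tile_risk[:25] = 0.95                  # one small, intensely suspicious cluster

slide_mean = tile_risk.mean()          # naive average over the whole slide
slide_max = tile_risk.max()            # what the local heat map still shows

review_threshold = 0.5                 # illustrative slide-level cutoff
print(f"Mean slide score: {slide_mean:.3f} -> review triggered: {slide_mean > review_threshold}")
print(f"Hottest tile:     {slide_max:.2f} -> visible as a red cluster on the heat map")
```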


And here's the crazy part. Even the human experts were deeply conflicted about that specific red highlighted region. When the experts were unblinded months later and asked to review it again, the debate continued.


Wait, they still couldn't agree?


No. All pathologists noted intense architectural complexity and cytonuclear atypia in that exact spot, but they could not achieve a unanimous consensus on whether the cellular changes met the strict criteria for high-grade or low-grade dysplasia.


So it was a true edge case.


Two pathologists leaned low-grade, one argued high-grade.


It represents the ultimate diagnostic gray area.


But the crucial takeaway for any clinician using this tool is that the AI did not ignore the abnormality. If a human pathologist was utilizing Polaris as an active assistant, that glowing red cluster would immediately prompt them to zoom in and evaluate it themselves, right, regardless of the slide's overall numerical score. And the authors are explicit about this dynamic: Polaris is engineered as an advanced decision support tool. It is a navigational aid. It's not an autonomous replacement for pathology.


It's an assistant. And by confidently identifying the clear-cut normal tissues and the standard routine low-grade tubular adenomas, the whole workflow can fundamentally change. Trained laboratory technicians or junior staff could potentially handle the triage of those mathematically secure routine cases. And that frees up the highly specialized pathologists to operate exclusively at the very top of their license.


They can spend their day solving the complex boundary cases, the severe atypias and the high-risk lesions that the AI highlights in red,


which is exactly what they went to medical school for. But um moving from retrospective validation to a live clinical environment, that is the necessary next frontier, right?


Absolutely. The paper concludes by calling for prospective clinical utility studies and rigorous user acceptance trials. Because before a tool like this can be integrated into a national screening roll out, we have to measure its impact in the wild.


Exactly. How much time does it actually save a lab per week? How do human readers alter their workflow when presented with a pre-analyzed heat map?


Does it bias them?


Right. Or does diverting the simple cases cause cognitive fatigue for the specialists who are suddenly only looking at highly complex, difficult slides all day?


That's a great point. Integrating software into the delicate human ecosystem of a bustling hospital always reveals unexpected friction points. But the foundational evidence presented here, validating against scanner bias with elastix, testing against over 10,000 diverse external slides, and achieving a near 99% sensitivity for severe dysplasia, is just a massive leap forward for the field.


It changes the core mechanical nature of pathological review. The human job description transitions from an exhausting physical search across vast areas of normal tissue to highly focused clinical interpretation of targeted anomalies.


We began this discussion by picturing a flooded pathology lab where the expectation of clean binary precision constantly collides with the exhausting reality of subjective human eyesight. Now imagine that same lab equipped with this AI


a totally different picture.


The flood is actively managed. Over half the stack of routine slides is safely cleared before a doctor even sits down. And the remaining complex slides arrive pre-highlighted, with the algorithm specifically pointing out the architectural complexities.


It elevates the baseline standard of care across the entire diagnostic pipeline.


It absolutely does. And that leaves us with a final provocative thought for all you trailblazers to consider as you head back to the lab or clinic. We saw clearly in the discordance review that this system reliably caught subtle high-grade dysplasias that experienced human eyes completely overlooked.


It found the typo.


It effectively corrected the human answer key, creating a new demonstrably higher standard for identifying ground truth in pathology. So if decision support tools can consistently perform at this level, catching the critical human misses and mathematically standardizing our subjective grading systems, how long will it be before not running an AI analysis on a slide is considered medical malpractice?


Yeah, it is the big question. The legal and ethical framework surrounding diagnostic liability will have to adapt to that exact question very soon.


Until that day comes, keep questioning the ground truth, keep analyzing the data, and keep pushing the boundaries of the field. Thanks for joining us on this deep dive into digital pathology, trailblazers. See you next time.