Digital Pathology Podcast

202: Deep Learning for Histopathological Classification of Salivary Gland Tumors

Subscriber Episode · Aleksandra Zuraw, DVM, PhD · Episode 202


Paper Discussed in this Episode:

Deep learning-based histopathological classification and subclassification of benign and malignant salivary gland tumors. Weber A, Schuster D, Heyer J, Becker C, Burkhardt V, Werner M, Spörlein A, Bronsert P, Schulz T. European Archives of Oto-Rhino-Laryngology 2026.

Episode Summary: In this journal club deep dive of the Digital Pathology Podcast, we explore a chaotic microscopic landscape to see if artificial intelligence can master one of the most high-pressure diagnostic environments in medicine. We examine a groundbreaking 2026 study on rare salivary gland tumors, exploring how state-of-the-art AI models performed when tasked with distinguishing benign lesions from complex malignancies. We uncover where the AI achieved absolute perfection, where it catastrophically failed, and why its "mistakes" might just be a window into hidden biological truths.

In This Episode, We Cover:

The High-Stakes Minefield of Salivary Glands: Why diagnosing these tumors is a delicate and complex task. With 36 potential entities and a practically zero margin for error, misdiagnoses can lead to devastating revision surgeries and permanent facial nerve palsy for the patient.

Training the Machine: How researchers used 20 years of slide data and the "Reinhard color normalization method" to mathematically standardize color palettes. This prevented the AI from "cheating" by simply memorizing fading colors or specific lab stains.

The AI Arsenal - CNNs vs. Vision Transformers: A look at the diverse algorithms deployed in the study, ranging from convolutional neural networks (like Xception and ConvNeXt) that scan local pixels, to Vision Transformers that analyze global image context, processing massive slides tile by tile.

The Perfection of Binary Triage: The stunning success of the AI in the initial benign vs. malignant test. Models like Xception achieved a 100% Negative Predictive Value (NPV), meaning they never missed a single cancer, proving their potential as a flawless morning triage tool for pathology labs.

The Subclassification Wall: Why the AI bombed when trying to identify the specific type of malignant tumor (like squamous cell or acinic cell carcinoma). We explore the deep learning rules of data volume and tissue heterogeneity, and why rare, morphologically chaotic diseases effectively starve algorithms of the data they need.

Explainable AI & The "Clever Hans" Dilemma: By using Class Activation Maps (heat maps), researchers tracked the AI's "eyes". While it often smartly focused on proven biological markers like enlarged, hyperchromatic nuclei for cancer, it sometimes made correct diagnoses by staring at random, non-traceable artifacts, raising severe trust issues for clinical deployment.

Key Takeaway: Deep learning models are currently fantastic, ultra-reliable screening assistants for binary benign/malignant triage, but they aren't ready to replace human pathologists for complex subtyping without massive, multi-institutional datasets. However, the AI's occasional focus on obscure visual data forces us to ask: is the machine just learning random artifacts, or has it successfully discovered subtle microscopic biological truths that human experts haven't even learned to see yet?

Get the "Digital Pathology 101" FREE E-book and join us!

Imagine looking at a microscopic landscape um so incredibly chaotic, so dense with overlapping cellular structures that even world-class surgical pathologists have to just, you know, squint, step back, and order a whole battery of expensive chemical tests just to figure out what they're actually looking at.


Yeah. And you're staring down this microscope at a salivary gland, and the stakes of what you write in your diagnostic report, well, they are going to dictate whether a patient wakes up from surgery with their face intact or, you know, permanently paralyzed,


right? It is arguably one of the most high pressure complex diagnostic environments in all of medicine. The margin for error is effectively zero.


Exactly zero. But the visual evidence you have to work with is just notoriously ambiguous.


Welcome trailblazers to a brand new Journal Club deep dive of the Digital Pathology Podcast. Today we are setting out on a highly specific, totally fascinating mission. We are unpacking a groundbreaking new paper that was just published online on March 5th, 2026.


And that's in the European Archives of Oto-Rhino-Laryngology, right?


Yeah, exactly. Yeah. The study is titled Deep learning-based histopathological classification and subclassification of benign and malignant salivary gland tumors.


And um we should absolutely credit the team behind this. It was spearheaded by co-first authors Andreas Weber and Daniel Schuster, right? Alongside co-last authors Peter Bronsert and Tobias Schulz. Because what this team has done is take some of the most advanced artificial intelligence architectures on the planet and they've aimed them squarely at this chaotic microscopic landscape.


Okay, let's unpack this because before we get into all the heavy computing, the neural networks, the pixel processing,


yeah,


we really need to understand the physical reality of the operating room.


Oh, absolutely.


Because for you, our trailblazers listening right now, whether you are a seasoned pathologist or a resident or, you know, an AI developer trying to understand the medical side of things, the stakes of these specific tumors are just astronomical.


They truly are. So um to set the stage, let's look at the core clinical challenge here. Salivary gland tumors are quite rare. In western nations, we are looking at an incidence rate of roughly 2.5 to 3.0 cases per 100,000 people.


Wow. So really rare,


very and you will typically find the benign ones showing up in the parotid gland. That's the uh the major salivary gland just in front of your ears. Oh yeah.


While the malignant ones, the actual cancers, they tend to hide out in the minor salivary glands that are just kind of scattered throughout the mouth and throat,


right? But, you know, rarity does not mean simplicity. In fact, it's like the exact opposite. I was reading through the 2022 World Health Organization classification guidelines that they reference in the paper. And the sheer volume of possibilities is staggering.


Oh, yeah. It's a huge list.


There are 15 different benign epithelial entities and 21 different malignant epithelial entities.


Exactly. So, you have 36 different potential diagnoses for a type of tumor that a standard community pathologist might only see, I don't know, a handful of times a year,


right? Trying to confidently diagnose one of 36 rare variations on a standard H&E, you know, hematoxylin and eosin stained slide. It sounds almost impossible.


Yeah,


it's like identifying a specific type of needle in a haystack made entirely of other slightly different needles.


That is a perfect way to put it, which is exactly why that initial H&E evaluation is often just the starting point. Because it's so morphologically chaotic, pathologists frequently have to run subsequent immunohistochemical staining,


right? Yeah.


Right. They apply specific antibodies to the tissue to see what proteins light up, which you know helps them narrow down the diagnosis, but that takes precious time and it costs a lot of money.


And meanwhile, you have a surgical team waiting on those results or a patient whose entire surgical plan hinges on that final pathology report because if you get it wrong, the physical consequences are literally devastating.


Yeah. If we connect this to the bigger picture, the nightmare scenario is this. A tumor is initially diagnosed as benign based on a quick look or a limited biopsy. The surgeon goes in, removes it conservatively, but then a week later, the final extensive pathology report comes back and confirms it was actually malignant.


Oh no.


Yeah. That patient now requires revision surgery. They have to go back under the knife to ensure clear margins.


And going back into the parotid gland is incredibly dangerous because of scar tissue. Right.


Precisely. The parotid gland intimately wraps around the facial nerve, which controls, well, all the muscles of facial expression. With that first surgery, you create scarring. When a surgeon has to go back in for revision, that dense scar tissue makes the delicate branching facial nerve incredibly difficult to distinguish from the surrounding tissue.


So, the risk of severing or damaging that nerve just skyrockets.


It does, which leads to post-operative facial nerve palsy, meaning half of the patient's face could be permanently paralyzed.


Wow. That is the clinical minefield. If we can get this diagnosis right the first time, or better yet, if we have an automated screening tool that can rapidly flag malignancy from a basic H&E slide while the patient is recovering or even still in the OR, we save nerves. We save faces.


Exactly. And that is the exact gap this research team is trying to bridge with artificial intelligence


because the human eye takes time to process that kind of chaos. They wanted to see if a machine could do it faster and more accurately.


But to teach a machine to see cancer, you first have to feed it history. So the researchers gathered data from 184 patients. That's 131 benign cases, 53 malignant cases, and six cases of normal tissue,


right?


But what really caught my eye is that this data spans a 20-year period from 2003 to 2023. At first, I thought like, why use such old slides? But then it hit me: fading and lab variations. A slide stained back in 2003 is going to look chemically totally different than one stained yesterday.


That is a brilliant deduction, and it is a massive hurdle in digital pathology, because if you only train an AI on slides from, say, the last two years at one specific hospital, the AI often just memorizes the specific shade of pink that hospital's lab uses.


Right. It cheats.


Exactly. By using a 20-year span, you introduce 20 years of different staining intensities, fading, and different lab technicians. You force the algorithm to ignore the superficial colors and actually learn the underlying biology.


And they didn't just rely on the age of the slides to force that learning either. I noticed they also applied something called the Reinhard method during their data preprocessing.


Ah yes, the Reinhard color normalization method. It's a mathematical technique that essentially standardizes the color distributions across all the images.


Okay.


It maps the color space of every slide to a target reference image. So whether a slide is faded pale pink or violently bright purple, the AI sees a standardized palette, further preventing it from cheating by just memorizing color artifacts.
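
To make that idea concrete, here is a minimal Python sketch of Reinhard-style color normalization: matching the per-channel mean and standard deviation of each tile to a reference image in LAB color space. The array names, the use of scikit-image, and the implementation details are illustrative assumptions, not the study's actual preprocessing code.

```python
# A minimal sketch of Reinhard color normalization, assuming two RGB numpy
# arrays: `tile` (the slide tile to normalize) and `reference` (a target image
# whose color palette defines the standard).
import numpy as np
from skimage import color

def reinhard_normalize(tile: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Match the per-channel mean and std of `tile` to `reference` in LAB space."""
    tile_lab = color.rgb2lab(tile)
    ref_lab = color.rgb2lab(reference)

    # Channel-wise statistics over all pixels
    tile_mean, tile_std = tile_lab.mean(axis=(0, 1)), tile_lab.std(axis=(0, 1))
    ref_mean, ref_std = ref_lab.mean(axis=(0, 1)), ref_lab.std(axis=(0, 1))

    # Shift and scale each channel so its distribution matches the reference
    normalized_lab = (tile_lab - tile_mean) / (tile_std + 1e-8) * ref_std + ref_mean

    # Back to RGB, clipped to the valid range
    return np.clip(color.lab2rgb(normalized_lab), 0.0, 1.0)
```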


Makes sense. So once they had the data normalized, they took these 335 whole slide images, digitized them at a massive 40x magnification, and used an open-source pathology software called QuPath. Pathologists manually annotated the tumor regions, and then the software chopped those massive images up into tiny, manageable tiles of 250 x 250 pixels.


Yeah. And you have to chop them into tiles because whole slide images are gigabytes in size.


Huge files,


right? No graphics card on Earth can process a full slide through a deep neural network all at once. You feed it tile by tile.
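
As a rough illustration of that tile-by-tile workflow, here is a small Python sketch using the OpenSlide library with the 250 x 250 pixel tile size mentioned above. The function, the file path, and the crude background filter are hypothetical; the study itself used QuPath for annotation and tiling.

```python
# A minimal tiling sketch, assuming the `openslide` library is installed and
# `slide_path` points to a whole-slide image file.
import numpy as np
import openslide

TILE = 250  # tile edge length in pixels, matching the episode's description

def iter_tissue_tiles(slide_path):
    """Yield (x, y, tile) RGB arrays of TILE x TILE pixels from one whole-slide image."""
    slide = openslide.OpenSlide(slide_path)
    width, height = slide.dimensions  # full-resolution (level 0) size
    for y in range(0, height - TILE + 1, TILE):
        for x in range(0, width - TILE + 1, TILE):
            region = slide.read_region((x, y), 0, (TILE, TILE)).convert("RGB")
            tile = np.asarray(region)
            # Crude filter: skip mostly-white background so only tissue reaches the model
            if tile.mean() < 230:
                yield x, y, tile
```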


And the neural networks they fed these tiles into are essentially an all-star roster of modern computing. They didn't just try one approach. They deployed a whole suite of convolutional neural networks, or CNNs, specifically VGG19, ResNet-50, Inception-ResNet V2, Xception, and ConvNeXt.


A very solid lineup.


And then they pitted those against a vision transformer


which is a really important methodological choice for the trailblazers who might not build AI models every day. It helps to understand how these actually differ. A convolutional neural network, a CNN, is basically like a scanner. It moves a tiny mathematical window pixel by pixel across the tile, looking for local patterns: first simple edges, then textures, then shapes like a cell wall.


But a vision transformer does something completely different. Right.


Exactly. A vision transformer or VIT breaks the image down into patches and uses an attention mechanism. It doesn't just look locally. It looks at the global context. It tries to figure out mathematically how a patch of pixels in the top left corner relates to a patch in the bottom right corner. It's a fundamentally different way of seeing.
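
For listeners who want to see the two families side by side, here is a minimal sketch that instantiates one CNN and one Vision Transformer as binary benign/malignant classifiers. It assumes the `timm` library; the specific model names, the randomly initialized weights, and the two-class head are illustrative, not the paper's actual training setup.

```python
# A minimal sketch contrasting a CNN and a Vision Transformer, assuming `timm`
# and `torch` are installed; weights are randomly initialized here for simplicity.
import timm
import torch

# A CNN (Xception): builds up local features -- edges, textures, shapes -- layer by layer
cnn = timm.create_model("xception", pretrained=False, num_classes=2)

# A Vision Transformer: splits each tile into patches and lets self-attention relate
# every patch to every other patch, capturing global context
vit = timm.create_model("vit_base_patch16_224", pretrained=False, num_classes=2)

# Both take a batch of RGB tiles and return benign vs. malignant logits
dummy_tiles = torch.randn(4, 3, 224, 224)
print(cnn(dummy_tiles).shape, vit(dummy_tiles).shape)  # torch.Size([4, 2]) twice
```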


So they unleash this diverse arsenal of AI on the tiles for their very first task, which is a simple binary classification. Look at this tile. Is it benign or is it malignant? And the results, um, I actually had to reread the table. The CNN model called Xception hit a 100% balanced accuracy score. 100%.


That's a staggering number to see in a peer-reviewed medical paper.


Here's where it gets really interesting because I have to admit when I see 100% accuracy in any deep learning paper, an alarm bell rings in my head. I immediately think of overfitting where the AI just memorized the test answers rather than actually learning the subject because the real biological world is incredibly messy. Is a perfect score actually realistic?


That is exactly the right instinct. In clinical AI, 100% accuracy often looks like a mirage. That is why we have to look past the topline accuracy number and look at the underlying clinical metrics. What's fascinating here is how that 100% translates into a specific metric called negative predictive value, or NPV.


Right? How often is the AI right when it tells you something is not cancer?


Exactly. Both the Xception model and the VGG19 model achieved a 100% NPV. What that means in plain clinical English is that out of all the testing data, they never, not even once, misclassified a truly malignant tissue tile as benign.


Wow. So it might occasionally be a little overcautious and flag something benign as suspicious, but it absolutely never misses the actual cancer.
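
For anyone who wants that metric spelled out in code, here is a tiny Python sketch of NPV on toy labels, where 1 means malignant and 0 means benign. The helper function and the example arrays are illustrative, not data from the study.

```python
# A minimal sketch of negative predictive value (NPV) on hypothetical labels.
import numpy as np

def negative_predictive_value(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """NPV = true negatives / all negative (benign) predictions."""
    predicted_negative = y_pred == 0
    true_negative = np.sum(predicted_negative & (y_true == 0))
    return true_negative / max(np.sum(predicted_negative), 1)

# The model flags one benign case as suspicious (over-cautious) but never calls
# a malignant case benign, so NPV is still 1.0 (100%).
y_true = np.array([0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 1, 1, 1])
print(negative_predictive_value(y_true, y_pred))  # 1.0
```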


Exactly. And for you listening right now, imagine how that changes your morning workflow in the pathology lab. You don't necessarily need the AI to replace your final diagnostic judgment. You need it as an ultra reliable triage tool.


Oh, that makes so much sense,


right? You log into your workstation and the AI has already scanned the hundreds of cases that came in overnight. Because of that 100% NPV, it pushes all the highly suspicious, potentially malignant cases right to the top of your queue. You know exactly which cases need your immediate attention, and the lab can start prepping the immunohistochemistry for those specific slides before you've even finished your first cup of coffee.


It takes a chaotic haystack and immediately hands you all the needles, but as the researchers make very clear, a binary yes-or-no to cancer is just the first baby step. The real test of an AI's worth is subclassification. It's one thing to know the house is on fire. It's a completely different challenge to identify exactly what material is burning so you know which fire extinguisher to use.


And this is where the AI really gets its mettle tested against the sheer complexity of biology.


Right? Let's start with the benign subclassification. The models were asked to differentiate between normal tissue, Warthin tumors, and pleomorphic adenomas. And honestly, they performed brilliantly here too. Xception hit a 95.57% balanced accuracy,


which is clinically vital. You might think, well, if it's benign, who cares which specific type it is? Just leave it alone. But the biological behavior of these tumors is very different.


How so?


Pleomorphic adenomas have a much higher recurrence rate after surgery, typically around 2.9 to 6.7%. More importantly, if they recur over many years, they carry a distinct risk of malignant transformation. They can actually turn into cancer.


Whereas a Warthin tumor generally just stays put and stays benign. So knowing exactly which benign entity you're dealing with dictates the patient's monitoring schedule for the rest of their life.


Exactly. The AI handled the benign subtyping beautifully.


But then the AI hits the malignant wall,


a very, very steep wall. Yeah. The researchers asked the AI to subclassify the malignant tumors into four specific categories: squamous cell carcinoma, acinic cell carcinoma, adenoid cystic carcinoma, and mucoepidermoid carcinoma. And the performance just tanked across the board.


It really did.


The Inception-ResNet V2 model was the best of the bunch, but it only managed a 71.51% balanced accuracy. And that vision transformer we talked about, it absolutely bombed on squamous cell carcinoma. It had an F1 score of just 0.1689,


which is catastrophic. And you know it's important to understand what that F1 score actually means. Simple accuracy can lie to you if a disease is rare. If you have 99 benign slides and one malignant slide, an AI that blindly guesses benign every single time is technically 99% accurate. But it's completely useless to a doctor. Right?


The F1 score prevents that illusion.


It balances precision, meaning how many of its positive guesses were actually right, with recall, meaning how many of the actual positive cases it successfully found. A score of 0.1689 out of a perfect 1.0 means the vision transformer was essentially blind to that specific cancer.
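
Here is the 99-to-1 example from a moment ago as a few lines of Python, showing how raw accuracy flatters a do-nothing model while F1 exposes it. The toy labels and the use of scikit-learn are illustrative assumptions, not numbers from the study.

```python
# A minimal sketch of accuracy vs. F1 on a hypothetical, highly imbalanced set.
from sklearn.metrics import accuracy_score, f1_score

y_true = [0] * 99 + [1]   # 99 benign tiles, 1 malignant tile
y_pred = [0] * 100        # a model that blindly guesses "benign" every time

print(accuracy_score(y_true, y_pred))                          # 0.99 -- looks impressive
print(f1_score(y_true, y_pred, pos_label=1, zero_division=0))  # 0.0  -- blind to the cancer
```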


Okay, so help me visualize this. Think of the binary test, benign versus malignant, like asking the AI to look at a photo and tell you if there's a vehicle in it. That's relatively easy. But this malignant subclassification, well, it feels like asking the AI to tell the difference between a 2018 Honda Civic and a 2019 Honda Accord, but it's only allowed to look at a deeply scratched bumper. And it's only ever been shown three pictures of a Civic in its entire life.


That is a brilliant analogy and it highlights the two foundational rules of deep learning that the AI collided with here. Data volume and data heterogeneity. If we connect this to the bigger picture, first there's the volume. Because these malignant salivary gland tumors are so rare, the researchers only had between 19 and 26 whole slide images per malignant entity.


That's practically nothing for an AI.


For algorithms like a vision transformer, which are notoriously data-hungry and require massive datasets to build their global attention maps, that is practically starvation.


It just hasn't had enough reps in the gym to learn the subtle architectural differences between the scratches on the bumper.


Exactly. But the second and arguably bigger issue is the heterogeneity, the chaos of the tissue itself. These cancers don't just look one way. Take acinic cell carcinomas or adenoid cystic carcinomas. Depending on the patient, or even depending on which part of the tumor you slice, they can present with solid cellular patterns, tubular patterns, or cribriform patterns, which look like Swiss cheese under the microscope.


Oh wow. So the AI might see a tubular pattern in its training data and learn to associate that with acinic cell carcinoma. But then in the real world test data, it encounters a solid pattern of the exact same biological disease. To the AI, it looks like a completely different universe and it just throws its hands up.


Precisely. It hasn't seen enough variations of the broken part to recognize it in a novel context. The authors of the paper rightly conclude that the only way to solve this is through massive multi-institutional datasets. You need hospitals across the world pooling their rare heterogeneous cases together to teach the AI the full chaotic spectrum of the disease.


Which brings up an incredible logistical challenge for the future. But for now, we need to talk about my absolute favorite part of this paper. We've talked about what the AI got right, and we know exactly where it failed, but how do we actually know how it's making these decisions? We constantly hear about the black box of AI, where data goes in, an answer comes out, and the math in between is just a mystery,


which is the single biggest barrier to getting AI deployed in actual hospital settings. A doctor cannot stake a patient's life on a black box,


right? But the authors didn't just accept the black box. They use something called class activation maps, or CAMs, for the CNNs and attention maps for the vision transformer. I love describing these as eye-tracking for algorithms.


Yeah, that's exactly what it is.


Through some very clever reverse engineering of the math, it literally generates a visual heat map over the tissue slide, glowing bright red or yellow over the exact pixels the AI was prioritizing when it made its diagnosis.


It's an indispensable technique. And what they found when they analyzed these heat maps is deeply revealing about how machine learning interprets human biology.
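
As a rough idea of how such a heat map gets computed, here is a minimal Grad-CAM style sketch in PyTorch. Grad-CAM is a common variant of the class activation map idea, not necessarily the exact method used in the paper, and the model, target layer, and input tensor names are assumed placeholders.

```python
# A minimal Grad-CAM style sketch, assuming a PyTorch CNN `model`, one of its
# convolutional layers `target_layer`, and a preprocessed tile tensor `x` of
# shape (1, 3, H, W). All names here are hypothetical.
import torch
import torch.nn.functional as F

def grad_cam(model, x, target_layer, class_idx):
    """Return a normalized heat map showing which pixels drove the `class_idx` score."""
    stash = {}
    fwd = target_layer.register_forward_hook(lambda m, i, o: stash.update(act=o))
    bwd = target_layer.register_full_backward_hook(lambda m, gi, go: stash.update(grad=go[0]))

    score = model(x)[0, class_idx]  # logit for the class we want to explain
    model.zero_grad()
    score.backward()
    fwd.remove(); bwd.remove()

    # Weight each feature map by its average gradient, sum them, keep positive evidence
    weights = stash["grad"].mean(dim=(2, 3), keepdim=True)
    cam = F.relu((weights * stash["act"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=x.shape[2:], mode="bilinear", align_corners=False)
    return (cam / (cam.max() + 1e-8)).squeeze().detach()  # values in [0, 1]

# Example usage (names are placeholders):
# heat = grad_cam(model, tile_tensor, last_conv_layer, class_idx=1)
```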


Let's break down what the AI actually saw. When classifying the benign tumors, the models zeroed right in on the interstitial connective tissue. It wasn't looking at the cells themselves as much as the structural scaffolding around them. But for the malignant tissue, models like ConvNeXt and the vision transformer heavily prioritized the cell nuclei,


which from a pathological standpoint makes perfect sense. Malignancy is biologically characterized by nuclear atypia. Because cancer cells are dividing uncontrollably, their nuclei often become enlarged. They stain much darker, a concept called hyperchromasia. And their borders become highly irregular.


So it's looking at the right things.


Yes. The fact that the AI naturally learned to focus on the high contrast edges of these hyperchromatic nuclei to detect cancer proves it's picking up on genuine established biological markers.


And it gets even better for the Warthin tumors. The heat maps show the AI smartly focusing exactly where a human would, on the cell-rich lymphoid stroma and that very characteristic oncocytic epithelium. It's basically reading the pathology textbook.


It is until it isn't.


Ah yeah. And this is where we get into the weeds of why AI is still an assistant, not a doctor. Because while the AI was looking at the right things a lot of the time, the paper explicitly details a caveat about obscure weightings. The authors candidly admitted that sometimes, even when the models got the binary diagnosis perfectly right, the heat maps showed the AI looking at non-traceable or random structures. Basically, the AI was staring at an empty patch of background glass or a random preparation artifact and confidently declaring, "Yep, that's cancer."


This raises an important question, and it strikes at the heart of the trust barrier in digital pathology. It highlights the double-edged sword of explainable AI. On one hand, the model gives you a 100% accurate triage result, but on the other hand, if a human pathologist looks at the heat map and says, um, there is literally no biological reason for the AI to be looking at that specific patch of empty pixels, what happens to their confidence in the system?


It shatters completely, because it means the AI is falling victim to the Clever Hans effect, like that famous horse that seemed to do math but was really just reading its trainer's body language. If the AI is diagnosing cancer by looking at a scratch on the glass that happens to only appear on slides from the cancer ward, it isn't learning pathology. It's learning the quirks of the data set.


Exactly. And while techniques like the Reinhard color normalization help mitigate those confounding variables, they don't eliminate them entirely. In medicine, being right isn't enough. You have to be right for the right reasons. If a model relies on an obscure non-biological artifact, that model will fail catastrophically the moment you deploy it in a different hospital with different glass slides. Pathologists must be able to audit the AI's biological reasoning.


So, what does this all mean? We've taken this deep dive into the clinical minefields, the vision transformers, the F1 scores, and the heat maps. Where does this leave our trailblazers heading into the clinic tomorrow?


Well, it leaves us at a highly promising but very clearly defined frontier. The core takeaway from this paper is that right now, today, these deep learning models are fantastic rapid screeners. Using an architecture like the Xception model to binary-sort H&E slides into benign or malignant in a post-operative setting, or as an automated first pass in the lab workflow, is incredibly viable. That 100% negative predictive value proves it can catch the needles in the haystack.


But it is not ready to replace a pathologist for exact malignant subtyping.


Not even close. To overcome the subclassification wall, the field has to evolve. As the authors themselves conclude, the only path forward is massive multi-institutional datasets to feed these algorithms the sheer volume of rare edge cases they need to learn the heterogeneity. Furthermore, a standard H&E stain might simply not contain enough visual information for an AI to subtype a highly complex acinic cell carcinoma. The future is almost certainly multimodal,


meaning we integrate the AI's visual analysis of the H&E slide with the data from immunohistochemical markers and maybe even overlay patient mutational data. It's going to be a massive team effort. The AI, the human pathologist, and the molecular data all working together to map that murky landscape safely.


Exactly. It's an evolution of our diagnostic tools and not a replacement of the human expert.


I love that perspective. Well, Trailblazers, that brings us to the end of our journal club deep dive for today. But before you head back to the microscope or back to the operating room, I want to leave you with one final thought to mull over.


Oh, lay it on us.


We talked about those obscure weightings the AI used, how it sometimes made a perfectly accurate diagnosis by staring at supposedly random, non-traceable structures on the tissue slide. We rightfully assume that means the AI is making a mistake or relying on a weird artifact. But think about this. When a deep learning algorithm capable of processing millions of data points and pixel relationships simultaneously looks at a seemingly empty space on a slide and successfully predicts cancer over and over again, is the AI malfunctioning? Or has it discovered a subtle microscopic biological truth hidden in the tissue architecture? A truth that human pathologists just haven't learned to see yet. Keep pushing boundaries, trailblazers. We'll see you next time on the Digital Pathology Podcast.