Digital Pathology Podcast
208: A Comprehensive European Colorectal Cancer Cohort Dataset
Paper Discussed in this Episode:
A comprehensive European Colorectal Cancer Cohort dataset. Holub P, Törnwall O, Garcia Alvarez E, et al. Sci Data (2026). https://doi.org/10.1038/s41597-026-06822-2.
Episode Summary: In this journal club edition of the Digital Pathology Podcast, we explore a monumental effort to clear up the diagnostic "muddy waters" of Colorectal Cancer (CRC). We examine a groundbreaking 2026 paper detailing a massive European dataset of 10,780 CRC patients that provides an unprecedented "playground" for artificial intelligence. This episode asks how we can accurately predict cancer recurrence years down the line, and explores whether a 70-terabyte multimodal dataset might help algorithms uncover hidden biomarkers that could make traditional tumor staging completely obsolete.
In This Episode, We Cover:
• The "Gray Area" of Oncology: Understanding Stage II Colorectal Cancer, where primary tumors are removed but clear lymph nodes leave oncologists gambling on whether highly toxic chemotherapy is necessary to prevent microscopic recurrence.
• A Continental AI Playground: A look at the sheer scale of the BBMRI-ERIC consortium's dataset: 10,780 patients from 26 biobanks across 12 countries, purposefully prioritized to include at least five years of clinical follow-up data.
• The Three-Dimensional Disease Map: How the dataset links standard clinical records (the "street addresses") with Whole Genome Sequencing blueprints and 26 terabytes of gigapixel Whole Slide Images (the "satellite view") to give machine learning models a complete biological picture.
• The Messy Reality of Raw Hospital Data: Why structural translation to OMOP and openEHR isn't enough. We highlight the terrifying logical errors caught by the consortium's automated plausibility scripts—from negative treatment durations to patients receiving chemotherapy after being marked as deceased.
• Hacking GDPR for Rapid Research: How the project uses envelope encryption (Crypt4GH) and a 14-day "time-limited veto" system to securely grant researchers global, free access, proving that patient privacy and rapid scientific speed can seamlessly coexist.
Key Takeaway: If deep learning algorithms trained on thousands of pristine digital slides and genomic blueprints can identify new morphological biomarkers and predict cancer recurrence with pixel-level accuracy, we may be looking at the beginning of the end for the century-old TNM staging system. This democratized dataset finally provides the massive statistical power needed to fundamentally redefine patient stratification.
You know, usually when we talk about a medical diagnosis, there's this uh this expectation of absolute precision,
right? Like it's math or something.
Exactly. It feels almost like engineering. I mean, you break your arm, the X-ray shows that jagged white line on the bone, and the doctor just points at the film and says, "Well, there it is."
It's binary. It's totally clean. The problem is visible to the naked eye, and the solution is, you know, straightforward.
"Broken or not broken." We really like things to be visible, to be perfectly categorized. But then you step into the world of oncology, specifically predicting cancer recurrence, and suddenly that X-ray machine is just completely broken.
Oh absolutely.
We're looking at a diagnostic landscape that is, well, honestly, it's murky.
It is the absolute definition of diagnostic muddy waters. I mean you are no longer just looking at what is physically there right now. You're trying to predict what microscopic cellular events might happen say 5 years in the future based on these tiny fragments of biological clues.
And finding clarity in those muddy waters is exactly why we are here today. So, welcome, trailblazers, to this journal club edition of the Digital Pathology Podcast.
Glad to be here.
If you are joining us, you know that building a better predictive lens is basically the holy grail of our field.
And today we are looking at a monumental effort to build that exact lens.
We really are. It's a massive undertaking.
We're doing a deep dive into an article in press from the journal Scientific Data. It's titled "A Comprehensive European Colorectal Cancer Cohort Dataset."
And we should clarify this isn't just um a small institutional study from a single hospital somewhere. Right?
This paper is authored by Petr Holub, O. Törnwall, Eva Garcia Alvarez, and just a massive consortium of researchers operating under the BBMRI-ERIC network
which is huge.
It is. That's a European research infrastructure consortium that literally connects biobanks across the entire continent.
The mission for today's review is straightforward, but the implications are just huge. We are going to unpack how this central data set of exactly 10,780 colorectal cancer patients, or CRC for short, is practically opening up a gold mine for digital pathology.
A total gold mine for AI research and biomarker discovery. Yeah.
More importantly, we want to give you, the listener, a road map for how you might actually use this massive resource in your own clinical research. Okay, let's unpack this.
Let's do it. Before we get into the terabytes of digital slides and all the genome sequencing, we really have to talk about the core medical problem that forced 26 different biobanks across 12 European countries to team up in the first place. I mean, why build this specific data set for this specific disease?
Well, it really comes down to a very specific, incredibly frustrating clinical gap. Colorectal cancer is, uh, it's the third deadliest cancer worldwide.
Wow. Third deadliest.
Yeah. Now, we do have screening tools. We have FIT tests, those fecal immunochemical tests that look for blood in the stool, and of course we have preventative endoscopies.
Right. The gold standard.
Exactly. But detecting the cancer early is only step one. The real clinical hurdle we face today is in patient stratification, particularly when you look at stage 2 colorectal cancer.
Wait, let me stop you there for a second. Stage one is usually pretty straightforward, right? The tumor is just confined to the colon wall. So you just do surgery.
And stage three or four, the cancer has visibly spread to the lymph nodes or other organs. So you are obviously looking at aggressive systemic treatments like chemotherapy. So why is stage 2 suddenly this mystery?
Because stage 2 is the ultimate gray area. Biologically the primary tumor has grown deeply through the wall of the colon. But
but the lymph nodes are clear.
Exactly. When the pathologist looks at the nearby lymph nodes under a microscope, they appear totally clear. There is no obvious spread.
Okay.
So you have a patient sitting in front of you. The primary tumor has been surgically removed, but you just don't know if microscopic cancer cells, these micrometastases, are hiding somewhere in their bloodstream or their liver just waiting to grow.
So the oncologist is essentially forced to gamble.
Basically, yes.
The evidence right now is really lacking on whether that initial surgery alone is sufficient for these stage 2 patients or if you need to subject them to the severe toxic side effects of chemotherapy just to prevent a potential recurrence that might not even happen.
Right. And what's fascinating here is that the inherently slow progression of colorectal cancer from premalignant lesions actually makes it an ideal candidate for biomarker discovery
because you have a longer window to observe the changes.
Exactly. If we can find biological markers that indicate a high risk of those micrometastases, we can target the chemo only to the patients who actually need it
which saves everyone else from the toxicity.
Spot on. We already use some biomarkers today. For instance, we look for KRAS mutations, which act kind of like a broken "on" switch for cell division, or microsatellite instability, which basically tells us that the tumor's DNA repair mechanisms are failing.
But I'm guessing those aren't enough.
No, they're not. Those existing markers don't give us the full picture. We need new ones. We need better predictive footprints of the carcinogenesis process itself. And to find entirely new subtle biomarkers, you need immense statistical power. You cannot find a novel pattern in a data set of like 50 patients.
It's like trying to predict the weather. I mean, looking out your window once doesn't help you forecast a hurricane.
No, definitely not.
But having satellite data from 12 countries tracking the exact same storm systems for five straight years, that's how you build a predictive meteorological model.
That is a great analogy,
which is exactly why the BBMRI-ERIC network had to aggregate data on a continental scale. They needed the statistical weight of 10,000 patients to let the algorithms actually find the hidden patterns.
And that five-year metric you just mentioned, that is vital to how this data set was built. The consortium purposefully prioritized cases that had at least 5 years of follow-up data
because otherwise the AI doesn't know the ending of the story.
Exactly. If you want to train a machine learning algorithm to predict survival or recurrence, a slide from a surgery performed yesterday is useless,
right? Because we don't know what happens to them yet.
The AI needs to know what actually happened to the patient long term to establish a ground truth for its predictions.
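To make that "ground truth" idea concrete, here is a minimal sketch, in Python with invented column names rather than the dataset's actual schema, of how a researcher might derive a usable five-year recurrence label from follow-up records:

```python
import pandas as pd

# Hypothetical follow-up table; column names are illustrative, not the dataset's schema.
followup = pd.DataFrame({
    "patient_id": ["P001", "P002", "P003"],
    "surgery_date": pd.to_datetime(["2012-03-01", "2013-07-15", "2015-11-20"]),
    "recurrence_date": pd.to_datetime(["2014-06-10", None, None]),
    "last_followup_date": pd.to_datetime(["2018-01-05", "2019-07-15", "2017-02-01"]),
})

FIVE_YEARS = pd.Timedelta(days=5 * 365)

# A patient is a usable training example only if the five-year outcome is actually known:
# either a recurrence occurred within five years, or follow-up covers the full window.
recurred_within_5y = (followup["recurrence_date"] - followup["surgery_date"]) <= FIVE_YEARS
followed_full_5y = (followup["last_followup_date"] - followup["surgery_date"]) >= FIVE_YEARS

followup["label_5y_recurrence"] = recurred_within_5y
followup["usable_ground_truth"] = recurred_within_5y | followed_full_5y

print(followup[["patient_id", "label_5y_recurrence", "usable_ground_truth"]])
```

In this toy table, the third patient has neither a recurrence nor five years of follow-up, so there is simply no label to train on, which is exactly why the consortium prioritized mature cases.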
So, they knew they needed massive statistical power and long-term outcomes to solve the stage 2 gray area.
Yeah.
They gathered 10,780 patients from 26 biobanks.
Yep.
So, what exactly are we looking at here? What are the actual files in this database that make it so valuable for the trailblazers listening today?
Well, first you have the deeply structured clinical data
and to guarantee that mature 5-year follow-up period we just talked about, most of these patients were diagnosed prior to 2016.
Makes sense.
The database meticulously tracks histopathology. So, that includes your standard TNM staging and WHO grading, but it also integrates the molecular markers we rely on today,
like the KRAS mutations you mentioned,
right? It has the status of KRAS and NRAS mutations for exons 2, 3, and 4. It tracks microsatellite instability, mismatch repair, gene expression, and even the HNPCC risk situation.
That's Lynch syndrome, right?
Yes, Lynch syndrome, based on the Amsterdam criteria. And it pairs all of that biological data with detailed treatment responses.
Okay, so we have the clinical records and the genetic flags. But if I'm a researcher building an AI for digital pathology, I can't just feed it a spreadsheet of KRAS mutations and clinical notes. I need the AI to actually see the tumors.
Oh, and they definitely accounted for that. That is the main event of this paper. They didn't just collect clinical notes. They physically digitized 3,260 whole slide images.
Wow.
Yeah. These are from formalin-fixed paraffin-embedded, or FFPE, colon tissues covering 1,433 of those patients. Just to put that in perspective for anyone not working directly with imaging servers: a single whole slide image scanned at 20x or 40x magnification is not a standard JPEG. We're talking about gigapixel territory.
Yeah, massive files.
You can zoom all the way down to see individual cell nuclei.
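For a sense of what "gigapixel territory" looks like from a researcher's keyboard, here is a minimal sketch using the open-source OpenSlide library. The file path is a placeholder and this is not code from the paper:

```python
import openslide

# Placeholder path; the dataset's actual file naming is not shown here.
slide = openslide.OpenSlide("example_colon_ffpe.tiff")

# Full-resolution dimensions are typically on the order of 100,000 x 100,000 pixels.
print("Level-0 dimensions:", slide.dimensions)
print("Pyramid levels:", slide.level_count, slide.level_dimensions)

# Read a single 512x512 patch at full resolution, e.g. to inspect individual nuclei.
patch = slide.read_region(location=(50_000, 60_000), level=0, size=(512, 512))
patch.convert("RGB").save("patch.png")

slide.close()
```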
The scale is really hard to wrap your head around. That totals 26 terabytes of primary image data.
26 terabytes. That's wild.
And to top it all off, they integrated whole genome sequencing data saved in standard VCF files for 425 patients, specifically pulled from the Uppsala Biobank.
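On the genomics side, VCF is a well-standardized variant format, so reading it is similarly routine. A minimal sketch with the widely used pysam library, again with a placeholder file name:

```python
import pysam

# Placeholder file name; assumes a bgzip-compressed VCF, as is standard practice.
vcf = pysam.VariantFile("patient_example.vcf.gz")

# Print the first few variant records: chromosome, position, reference and alternate alleles.
for i, record in enumerate(vcf):
    print(record.chrom, record.pos, record.ref, record.alts)
    if i >= 4:
        break

vcf.close()
```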
So we are looking at a three-dimensional map of the disease. Essentially
pretty much
the clinical notes are like the street addresses telling us where the patient lived and what happened to them over 5 years. The VCF files, the genomic data are the actual structural blueprints of the houses showing the foundational DNA mutations.
I love that.
And those 26 terabytes of whole slide images are the high-resolution Google Earth satellite view of the entire neighborhood. You really need all three layers to truly understand the landscape of the cancer.
If we connect this to the bigger picture, this multimodal data set is just the ultimate playground for artificial intelligence.
Oh, I bet.
Think about what a machine learning researcher can actually do here. You aren't just giving an AI a picture of a tumor and asking it to find other tumors. You are giving an AI the morphological satellite view, the genetic blueprints, and the long-term clinical addresses.
It's the whole package.
It really is. You can train an AI to look at the pixel level morphology of a slide and discover entirely new imaging biomarkers that correlate with a five-year survival outcome
because the AI might see things we can't
exactly. It could see subtle patterns in the collagen structure or immune cell grouping that human eyes literally cannot process.
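As a rough sketch of that biomarker-discovery recipe, and only a sketch, one common pattern is to turn tissue patches into feature vectors with a pretrained network and then test whether those features predict the five-year outcome. The model choice, file names, and workflow below are illustrative assumptions, not the paper's method:

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# A pretrained CNN as a generic feature extractor for tissue patches.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()  # drop the ImageNet classification head
backbone.eval()

preprocess = T.Compose([
    T.Resize(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def patch_features(png_path: str) -> torch.Tensor:
    """Return a 512-dim embedding for one tissue patch (hypothetical file)."""
    img = preprocess(Image.open(png_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return backbone(img).squeeze(0)

# A downstream step (not shown) would aggregate patch embeddings per patient and
# fit a simple classifier against the 5-year recurrence label sketched earlier.
```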
Okay, I am completely sold on the vision. It sounds amazing,
right?
But as someone who has tried to get two different calendar apps to sync and just failed miserably, I have to ask the obvious logistical question here.
Oh, I know where this is going.
You have 26 different regional biobanks. They're in 12 different European countries operating under entirely different health systems and languages.
Yep.
They are sending in 26 terabytes of gigapixel images.
Okay.
Plus clinical notes, genomic data, survival timestamps. I mean, how did this not turn into a complete chaotic mess of incompatible file formats? I imagine a hospital in Sweden uses a totally different software architecture than a clinic in Italy.
Oh, it was incredibly difficult. Honestly, the data harmonization effort is arguably the most impressive technical achievement of this entire paper.
Really, the harmonization?
Yeah. To avoid that chaotic mess, the consortium rigorously adopted FAIR principles, meaning the data had to be findable, accessible, interoperable, and reusable.
Which sounds fantastic in a boardroom presentation, right? But what does interoperable actually mean when you are dealing with a dozen different medical languages and, like, legacy hospital software from the 2010s?
It basically means building a massive, multi-layered translation engine. They had to painstakingly map the clinical data from whatever tabular formats the biobanks were using, like simple Excel or CSV files, into highly structured openEHR templates.
openEHR. Let's try to avoid the alphabet soup. Can you break that down for me conceptually?
Fair enough. openEHR is a standard that separates the meaning of the clinical data from how it's physically stored.
Okay.
Think of it as defining the universal clinical concept of blood pressure using a specific archetype rather than just relying on a specific column in a hospital's local spreadsheet that happens to say BP.
Oh, I see. So, it standardizes the concept itself.
Exactly. Once they had the concept separated, they mapped that data to the OMOP common data model.
And how does OMOP fit into this translation engine?
Think of OMOP as a universal electrical adapter. Hospital A is using a three-prong plug from 1995. Hospital B is using modern USB-C. OMOP is the adapter that lets the centralized AI system plug into both data streams without blowing a fuse.
That makes perfect sense.
It ensures semantic interoperability, meaning a diagnosis code in Italy mathematically means the exact same thing as a diagnosis code in Sweden. Finally, they converted all of it to HL7 FHIR standards, which is essentially the modern internet language for healthcare data. It allows the database to be securely and rapidly queried by outside researchers.
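A toy example of what that semantic harmonization step buys you: two hospitals code the same diagnosis in different layouts and coding systems, and a mapping step pulls both onto one shared concept. The mapping table, layout, and concept label below are invented for illustration; real OMOP mappings target standard concept IDs rather than strings:

```python
import pandas as pd

# Two hypothetical local exports with different layouts and coding systems.
hospital_a = pd.DataFrame({"pat": ["A1"], "dx_code": ["C18.7"], "dx_system": ["ICD-10"]})
hospital_b = pd.DataFrame({"patient": ["B7"], "diagnos": ["153.3"], "kodverk": ["ICD-9"]})

# Invented mapping: local (system, code) pairs -> one shared concept label.
concept_map = {
    ("ICD-10", "C18.7"): "malignant_neoplasm_sigmoid_colon",
    ("ICD-9", "153.3"): "malignant_neoplasm_sigmoid_colon",
}

def harmonize(df, patient_col, code_col, system_col):
    """Reshape one local export into a shared layout with a mapped concept."""
    return pd.DataFrame({
        "patient_id": df[patient_col],
        "condition_concept": [
            concept_map.get((s, c), "unmapped")
            for s, c in zip(df[system_col], df[code_col])
        ],
    })

harmonized = pd.concat([
    harmonize(hospital_a, "pat", "dx_code", "dx_system"),
    harmonize(hospital_b, "patient", "diagnos", "kodverk"),
], ignore_index=True)

print(harmonized)  # same concept label, regardless of the source coding system
```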
So they essentially built a universal translator for European colorectal cancer data.
Yes.
But here's the thing about translators.
Standardizing the format of a document doesn't mean the information inside the document is actually accurate.
No, it certainly doesn't.
I mean, a perfectly formatted, OMOP-standardized Excel cell can still contain a complete lie or just a really bad typo from a tired resident entering data at 3:00 a.m.
And this is where the quality control reality check hit the consortium hard. The central system didn't just accept the data because it was formatted nicely.
They checked it.
They ran strict automated XSD validations which checked the structural integrity of the files and more importantly custom R script plausibility checks.
Plausibility checks.
Yeah. These scripts looked at the actual logic of the medical timelines.
Here's where it gets really interesting, because the paper actually lists some of the errors these logic scripts caught. And honestly, they are terrifying if you consider that this is raw hospital data, data that AI models could theoretically be trained on without these checks.
The errors are a really stark reminder of the reality of medical data.
Yeah. One of the automated flags caught patients who were logged as surviving for over 4,000 weeks post therapy,
which is about 77 years, right?
For a post-therapy colorectal cancer survival time, that is, uh, highly implausible, meaning someone likely swapped a birth date with a treatment date.
Highly implausible is a very polite way to put it. They also caught treatments with negative durations.
Oh. The records literally claimed the cancer treatment ended before it even began. And the anatomical impossibilities were just wild.
Tell me about it.
They had records indicating a surgeon performed a left hemicolectomy, removing the left side of the colon, on a tumor that was explicitly located in the right colon.
That's a pretty big typo.
Or records showing a patient starting a brand new chemotherapy regimen after the patient had already been officially marked as deceased.
It sounds absurd, but it highlights a crucial truth for the trailblazers listening today. An AI is literally only as good as its training data. Raw hospital data is surprisingly messy. The ground truth written in a chart isn't always true.
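The consortium's actual plausibility checks were custom R scripts; the minimal Python sketch below, with invented column names and toy data, just shows how mechanically simple this kind of timeline and laterality logic can be while still catching exactly the errors described above:

```python
import pandas as pd

# Hypothetical harmonized patient table; column names and values are illustrative.
df = pd.DataFrame({
    "patient_id": ["P1", "P2", "P3"],
    "treatment_start": pd.to_datetime(["2013-02-01", "2014-05-10", "2012-09-01"]),
    "treatment_end": pd.to_datetime(["2013-01-15", "2014-08-10", "2013-01-01"]),
    "date_of_death": pd.to_datetime([None, "2014-04-01", None]),
    "tumor_side": ["right", "left", "left"],
    "surgery_type": ["left hemicolectomy", "left hemicolectomy", "left hemicolectomy"],
})

flags = pd.DataFrame({"patient_id": df["patient_id"]})

# 1. Negative treatment duration: therapy ends before it begins.
flags["negative_duration"] = df["treatment_end"] < df["treatment_start"]

# 2. Treatment recorded as starting after the patient was marked deceased.
flags["treated_after_death"] = df["treatment_start"] > df["date_of_death"]

# 3. Anatomical mismatch: surgery side does not match tumor laterality.
flags["side_mismatch"] = [
    side not in surgery for side, surgery in zip(df["tumor_side"], df["surgery_type"])
]

# Rows that need a look back at the source hospital records.
print(flags[flags.iloc[:, 1:].any(axis=1)])
```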
Imagine the administrative friction there, though. You have this automated central hub in Europe running these scripts, constantly kicking out error reports, and repeatedly telling regional biobanks, "Hey, your official clinical records are logically impossible."
Yeah, nobody wants to hear that.
The local biobanks were forced to dig back into their source hospital records, pull the original paper charts, and actually find the real truth.
The paper notes that some biobanks found this iterative feedback loop quite burdensome.
I can imagine
they actually wanted to contractually cap the number of data quality check iterations because it was demanding so many man-hours.
Wow. They wanted to limit how clean the data had to be.
Yeah. But BBMRI-ERIC had to hold the line. They compromised by focusing on the most critical errors first, but they didn't just accept garbage data. They forced the harmonization because an AI trained on a patient who received chemo after death is an AI that will fail in the real clinic.
Okay, so they fought through the friction. They did the hard work and built this 70-terabyte masterpiece. It is pristine, highly structured, and cleanly harmonized.
It really is an incredible resource.
But medical data is heavily guarded for good reason. If I'm an AI researcher or an oncologist listening to this right now, how do I actually get my hands on it? Does this data just sit locked in a European server gathering dust because of GDPR regulations?
No. And that's the beauty of how they structured the governance. First, the data is entirely free for approved academic and industrial researchers globally.
Free globally
completely. There are no geographic restrictions. If you just want to visually explore the whole slide images, they've actually integrated the xOpat viewer directly into a cBioPortal instance.
So, you don't even have to download the images to look at them.
Exactly. You log in securely using multifactor authentication, and you can zoom in on those gigapixel slides and run statistical analyses right in your web browser.
That's convenient. But if I'm training a deep learning model from scratch, I can't just look at slides in a browser. I need the raw files. I need to download the bulk TIFF slides and those VCF genomic blueprints directly to my local servers.
For bulk downloads, they implemented a highly secure envelope encryption scheme using an open-source tool called Crypt4GH.
Okay, let's do an ELI5, explain like I'm five, for envelope encryption. How does that actually protect the data in transit?
Okay, imagine you want to send a highly sensitive blueprint through the mail. Instead of just putting it in a paper envelope, you lock it inside a titanium briefcase.
Okay, I'm with you.
Crypt4GH essentially locks the gigapixel image data using the researcher's public cryptographic key before it even leaves the central server.
Oh, before it even ships,
right? It is mathematically protected while it travels across the internet. And you can only unlock the briefcase locally using your specific private cryptographic key.
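To make the briefcase analogy concrete, here is a minimal envelope-encryption sketch in Python: the bulk data is sealed with a one-time symmetric key, and that key is sealed with the recipient's public key. This illustrates the general pattern only; the actual Crypt4GH format uses its own container and cipher choices, which are not reproduced here.

```python
from cryptography.fernet import Fernet
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa

# The researcher holds the private key; only the public key is shared with the hub.
researcher_private = rsa.generate_private_key(public_exponent=65537, key_size=2048)
researcher_public = researcher_private.public_key()

oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)

# --- Sender side (the central data hub) ---
data = b"...gigabytes of whole slide image bytes..."
data_key = Fernet.generate_key()                # one-time symmetric "briefcase" key
locked_data = Fernet(data_key).encrypt(data)    # bulk data sealed with the fast symmetric key
locked_key = researcher_public.encrypt(data_key, oaep)  # the key sealed in the "envelope"

# --- Receiver side (the researcher's own server) ---
recovered_key = researcher_private.decrypt(locked_key, oaep)
original = Fernet(recovered_key).decrypt(locked_data)
assert original == data
```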
So, it's locked with my specific padlock before it even goes in the mail. That handles the technical security, but what about the legal approval? BBMRI-ERIC might be the central hub, but those 26 regional biobanks still maintain legal sovereignty over their specific patient data, right?
They do. Yes.
If I request access, am I waiting for 26 different hospital committees to review my paperwork over the next 3 years?
No. And honestly, this is the most significant administrative solution in the entire project. They instituted a one-month service level agreement managed through a platform called the BBMRI-ERIC Negotiator.
A one-month SLA for 26 hospitals. How?
When the Central Access Committee approves a project's scientific relevance, they trigger what is called a time limited veto.
A time limited veto. Walk me through a real world scenario of that because that sounds pretty intense for a hospital administrator.
Let's say you submit your AI biomarker project. The central committee says, "Yep, it looks scientifically sound." An automated notification goes out to the 26 biobanks. The clock immediately starts ticking.
Okay.
They have exactly 14 days to veto the release of their specific data to your project. But, and this is critical, they can only veto on legal or consent related grounds.
So, they can't veto the release just because they are studying the same thing and want to hoard the scientific novelty for their own local researchers.
Exactly. They cannot sit on the data out of academic competitiveness. They have to provide a legitimate legal reason.
But what happens if a biobank is just busy? I mean, they miss the email, the primary administrator is on vacation, and the 14 days just expire.
If they don't reply within 14 days, approval is assumed. The system automatically unlocks the data and releases it to you.
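Boiled down to its "silence means consent" core, the access workflow might look something like the sketch below; the field names and decision strings are invented for illustration, not the Negotiator platform's actual API:

```python
from datetime import date, timedelta

VETO_WINDOW = timedelta(days=14)

def access_decision(notified_on: date, today: date, veto_received: bool,
                    veto_is_legal_or_consent_based: bool = False) -> str:
    """Simplified time-limited veto logic as described in the episode."""
    if veto_received and veto_is_legal_or_consent_based:
        return "withheld: legitimate legal/consent objection"
    if veto_received:
        return "escalate: veto must cite legal or consent grounds"
    if today - notified_on >= VETO_WINDOW:
        return "released: no veto within 14 days, approval assumed"
    return "pending: veto window still open"

# Example: a biobank that never replied within the window.
print(access_decision(notified_on=date(2025, 3, 1), today=date(2025, 3, 16),
                      veto_received=False))
```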
Wow. So, what does this all mean? It means this consortium essentially proved that strict GDPR compliance and rapid, efficient cross-border medical research can actually coexist. Yes, they absolutely can.
You really don't have to sacrifice speed for patient privacy if you build the infrastructure correctly from day one.
And this raises an important question. If this federated, standardized, rapidly accessible model works so well for 10,000 colorectal cancer patients, could it scale to all diseases?
That's the dream, isn't it?
It is. In fact, this infrastructure is currently serving as a pilot use case for the upcoming European Health Data Space or EHDS.
Right. The EHDS.
The EHDS is Europe's massive continent-wide initiative to mandate such secondary use of health data. This CRC cohort is basically the proof of concept that the EHDS vision is technically and legally possible today.
That is incredible. A blueprint for the future of European healthcare data. All right, trailblazers, we are coming to the end of our time today, but I want to speak directly to you for a second.
Whether you are at a tech startup developing the next generation of machine learning algorithms for digital pathology, or you are an oncologist hunting for that elusive new biomarker to definitively stratify stage 2 colon cancer patients, this data set is sitting there waiting for you.
It really is.
It's clean, it's massive, and it's free. It is a resource that can fundamentally accelerate your work.
It is the statistical power we have been waiting for. Completely democratized.
But before we sign off, I want to leave you with one final thought to mull over. Building on everything we just discussed today, let's go back to that AI playground.
Okay.
We currently rely on human pathologists to look at glass slides and assign a TNM stage: tumor, nodes, metastasis. It is the absolute bedrock of oncology.
It is. We use it every day.
But if a deep learning algorithm trained on these 3,260 pristine digital slides manages to identify entirely new, incredibly subtle morphological biomarkers that human eyes have simply missed for decades, what happens to the traditional TNM staging system?
Oh wow, that is a disruptive thought,
isn't it?
Yeah. If a machine can predict recurrence with pixel-level accuracy based on patterns we can't even perceive, the old classifications really start to look a bit archaic.
Will the TNM system eventually become completely obsolete? Could we see a future where patient stratification isn't based on a human categorizing the visible size of a tumor, but is replaced entirely by a purely algorithmic, pixel-driven risk score?
It's very possible
if the machine can predict recurrence better than the traditional human classifications. The muddy waters don't just clear up. They completely change their chemical composition.
We might just be looking at the beginning of the end for the diagnostic frameworks we've used for a century.
A fascinating possibility to end on. Thank you to everyone for joining the deep dive today. Keep pushing boundaries, keep hunting for those hidden patterns, and we'll catch you on the next deep dive into the data. This has been the Digital Pathology Podcast.