Data Science x Public Health

This Is Why Cross-Validation Doesn’t Work (And Nobody Talks About It)

BJANALYTICS


Cross-validation is one of the most common tools in machine learning.
It is supposed to give you a reliable estimate of how your model will perform.

But what if that estimate is quietly misleading you?

In this episode, we break down why cross-validation often fails in real-world healthcare and public health data. From data leakage and time dependence to population shifts and deployment mismatch, you will learn why validation strategies that look rigorous can still produce fragile models.

👉 Enjoyed the episode? Follow the show to get new episodes automatically.

If you found the content helpful, consider leaving a rating or review—it helps support the podcast.

For business and sponsorship inquiries, email us at:
📧 contact@bjanalytics.com

Youtube: https://www.youtube.com/@BJANALYTICS

Instagram: https://www.instagram.com/bjanalyticsconsulting/

Twitter/X: https://x.com/BJANALYTICS

Threads: https://www.threads.com/@bjanalyticsconsulting

SPEAKER_01

Welcome to this deep dive. Whether you are evaluating a new predictive AI tool for your work, or you're just insanely curious about why so many data science models spectacularly crash when they hit the real world, well, this is for you.

SPEAKER_00

Yeah, exactly. And today we are pulling from some really fascinating excerpts from a piece called Beyond the Fold: Realistic Model Validation Strategies.

SPEAKER_01

Right. And our mission here is to uncover why models that look absolutely flawless in development so often fail upon deployment, and why the standard testing ritual the industry relies on, cross-validation, might actually be secretly setting us up for failure.

SPEAKER_00

Which is a huge deal. I mean, to understand the problem, you really have to think about building a predictive data model, like building one of those intricate ships in a bottle.

SPEAKER_01

Oh, yeah, I love that analogy. Like you spend months assembling the rigging and smoothing the sails, right?

SPEAKER_00

Right. Getting every single detail absolutely perfect inside this perfectly controlled glass environment. But you would never crack open that glass bottle, toss the tiny wooden ship into a real churning ocean, and then expect it to survive a hurricane.

SPEAKER_01

No, of course not. The environment inside the bottle doesn't have the chaos of the actual sea. So let's unpack how we're building these bottles.

SPEAKER_00

Well, step one for most data scientists is cross-validation. I mean, you have a data set and you don't want to waste any holdout data.

SPEAKER_01

Right. You split it into chunks, or folds as they call them.

SPEAKER_00

Exactly. So you train the model on some folds, test it on another, rotate them around, and then average the score. And honestly, it feels mathematically rigorous.
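
As an editor's sketch (not from the episode), the split-train-rotate-average procedure described here looks like this in scikit-learn; the dataset and model are placeholders:

```python
# Illustrative k-fold cross-validation: train on 4 folds, test on the
# 5th, rotate through all 5 assignments, then average the scores.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, random_state=0)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=5)  # one score per fold
print(scores.mean())  # the single number that "feels rigorous"
```

Note that this single averaged number is exactly what the episode goes on to question: it is only trustworthy if the folds are exchangeable.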

SPEAKER_01

But it relies on a hidden assumption, right? Something called exchangeability.

SPEAKER_00

Yeah, exchangeability. This basically means assuming your folds of data are similar enough that simply shuffling them gives you a fair estimate of the future.

SPEAKER_01

You know, it makes me think of studying for a history final by breaking the textbook into five chunks. You study four chunks, test yourself on the fifth, and then just rotate.

SPEAKER_00

That captures the mechanical flaw perfectly.

SPEAKER_01

Because it assumes the final exam will only ask questions perfectly proportional to the chapters you studied. It completely ignores the fact that the teacher might test you on current events.

SPEAKER_00

Right, and randomly splitting data completely ignores the messy reality of the world. I mean, it ignores time structures, geographic clustering, or even the fact that the exact same patient might have repeated records in the system.

SPEAKER_01

So if the data isn't perfectly interchangeable, that mismatch has to be what corrupts the model.

SPEAKER_00

Yeah, and that corruption is what the source material calls data leakage, which is basically when information from outside the training data set bleeds into the model.

SPEAKER_01

Oh wow. How does that usually happen?

SPEAKER_00

Well, a really common way is through pre-processing. So say you have missing blood pressure readings in your clinical data.

SPEAKER_01

Okay.

SPEAKER_00

If you calculate the average blood pressure of your entire data set to fill in those missing values before you split it into folds, your training data is suddenly using math influenced by the test data.
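
The leak described here can be made concrete with a short sketch (an editor's illustration using scikit-learn, not code from the episode): imputing with the full-dataset mean before splitting lets test-fold statistics contaminate training, while fitting the imputer inside a pipeline refits it on the training folds only, at every rotation:

```python
# Sketch of pre-processing leakage vs. the leak-free alternative.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[rng.random(X.shape) < 0.1] = np.nan          # simulate missing readings
y = rng.integers(0, 2, size=200)

# LEAKY: the mean is computed over ALL rows, including future test folds.
X_leaky = SimpleImputer(strategy="mean").fit_transform(X)
leaky_scores = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=5)

# LEAK-FREE: the imputer is refit on the training folds only, each rotation.
pipe = make_pipeline(SimpleImputer(strategy="mean"),
                     LogisticRegression(max_iter=1000))
clean_scores = cross_val_score(pipe, X, y, cv=5)
```

On small synthetic data the two scores may look similar; the point is that only the pipeline version keeps the test folds genuinely unseen.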

SPEAKER_01

Oh, I see. Because the test data is supposed to represent the unseen future, but you've just smuggled its characteristics into the past.

SPEAKER_00

Exactly. The model is getting a sneak peek at the answers. Or consider the repeat patient issue.

SPEAKER_01

Right. If you randomly shuffle records, a single patient with 10 hospital visits might end up in both your training set and your testing set.

SPEAKER_00

Yeah. So the model isn't learning the complex biological markers of a disease. It's simply recognizing that specific patient's unique data signature.
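
One standard remedy for this repeat-patient contamination (an editor's sketch, not from the episode) is grouped splitting, which guarantees that all of a patient's visits land on the same side of every split:

```python
# Sketch: GroupKFold keeps every visit from one patient in a single
# fold, so the model can't memorize a patient's unique data signature.
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
patient_ids = np.repeat(np.arange(20), 10)   # 20 patients, 10 visits each
X = rng.normal(size=(200, 4))
y = rng.integers(0, 2, size=200)

gkf = GroupKFold(n_splits=5)
for train_idx, test_idx in gkf.split(X, y, groups=patient_ids):
    # No patient ID appears on both sides of the split.
    assert not set(patient_ids[train_idx]) & set(patient_ids[test_idx])
```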

SPEAKER_01

And the source actually uses a great phrase for this: confidence built on contamination.

SPEAKER_00

I love that phrase. Because the model isn't learning the underlying signal of the problem, it's just learning the convenience of the data set.

SPEAKER_01

And this contamination is dangerous anywhere, but it seems like it hits public health and healthcare the hardest.

SPEAKER_00

Oh, absolutely. I mean, public health environments are inherently unstable. Pathogens evolve, human behaviors shift in response to policies, testing availability fluctuates constantly.

SPEAKER_01

Right. And we usually deploy a predictive model precisely because those conditions are shifting.

SPEAKER_00

Yeah. So if you just test your model on a shuffled version of its own past, you know, a past where a different variant was circulating or testing was scarce, you aren't proving it works in the current reality.

SPEAKER_01

Okay, let me push back on the practicality of fixing this though. If cross-validation is flawed and we need to validate against real-world shifts, does that mean a team building a flu forecasting model has to wait for an entirely new flu season just to test it?

SPEAKER_00

That is a totally valid concern.

SPEAKER_01

Because wouldn't that completely paralyze development?

SPEAKER_00

Well, it doesn't mean cross-validation should just be thrown in the trash. I mean, it remains a highly valuable tool for internal comparison and for tuning the mechanics of the model during development.

SPEAKER_01

Okay, so it's a tool for iteration.

SPEAKER_00

The problem is treating it as a shortcut to trust.

SPEAKER_01

Ah, I get it. So we use cross-validation to build the engine, but we still need a real-world test drive before selling the car.

SPEAKER_00

That is exactly the solution. Validation has to actually match deployment.

SPEAKER_01

So a flu model needs temporal validation, like testing it on historical seasons that the model has literally never seen, not just shuffling last year's data.
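
Temporal validation of this kind has direct library support; as an editor's illustration (not from the episode), scikit-learn's forward-chaining splitter guarantees every test fold lies strictly after its training data:

```python
# Sketch of temporal validation: the model is always scored on weeks
# that come after everything it trained on, never a shuffled past.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

weekly_cases = np.arange(156).reshape(-1, 1)   # ~three seasons of weekly data
tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(weekly_cases):
    assert train_idx.max() < test_idx.min()    # train strictly precedes test
```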

SPEAKER_00

Or if it's a hospital risk model, you have to run site-level validation. So testing it across different hospitals to ensure it doesn't just work on one specific facility's patient demographic.
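
Site-level validation can be sketched the same way (an editor's illustration with hypothetical hospital labels, not from the episode): hold out one facility entirely, train on the rest, and repeat for every site:

```python
# Sketch: leave-one-hospital-out validation. Each fold tests the model
# on a facility it has never seen during training.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)
site = np.repeat(["A", "B", "C", "D"], 50)     # hypothetical hospital sites
X = rng.normal(size=(200, 3))
y = rng.integers(0, 2, size=200)

logo = LeaveOneGroupOut()
for train_idx, test_idx in logo.split(X, y, groups=site):
    assert len(set(site[test_idx])) == 1       # exactly one held-out hospital
```

A large gap between per-site scores is itself the warning sign: it means the model learned one facility's demographics rather than a portable signal.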

SPEAKER_01

The goal is to ensure the model survives contact with the real world rather than just testing a fantasy version of the environment.

SPEAKER_00

Exactly.

SPEAKER_01

So here is the core takeaway for you as you look at any predictive model in your life or work. Always ask if it was tested against reality or just a shuffled version of its own past.

SPEAKER_00

Which leaves us with a really interesting question. If our default scientific models inherently assume a neat static world, what other systems in our daily lives are secretly operating on a fantasy version of reality, just waiting for the glass bottle to break?