Data Science x Public Health
This podcast introduces the concepts of data science and public health, then explores the intersection of the two fields in greater detail.
The Trick That Makes Observational Data Look Like a Clinical Trial
What if you could run a clinical trial… without randomizing anyone?
In this episode, we break down propensity score methods — one of the most important tools in biostatistics for turning messy observational data into something closer to a fair comparison.
You’ll learn:
- Why observational studies are biased by default
- How propensity scores balance treated vs untreated groups
- The 4 major methods (matching, weighting, stratification, adjustment)
- When they work… and when they fail
👉 Enjoyed the episode? Follow the show to get new episodes automatically.
If you found the content helpful, consider leaving a rating or review — it helps support the podcast.
For business and sponsorship inquiries, email us at:
📧 contact@bjanalytics.com
Youtube: https://www.youtube.com/@BJANALYTICS
Instagram: https://www.instagram.com/bjanalyticsconsulting/
Twitter/X: https://x.com/BJANALYTICS
SPEAKER_01: Welcome to this deep dive into propensity scores. So imagine trying to prove a new heart surgery works, right? But you can't run a randomized trial.
SPEAKER_00: Because it might be unethical, or you only have past observational hospital data.
SPEAKER_01: Exactly. And since doctors naturally give the most aggressive treatments to the sickest patients, your data is completely rigged from the start. It is like trying to see which runner is faster, but the coach forced the strongest runner to take an uphill track.
SPEAKER_00: You definitely cannot just look at the finish times there.
SPEAKER_01: Well, you can't. The hidden differences, I mean the confounders, just completely ruin the comparison. So today we are exploring how propensity scores mathematically level that track to solve this unfair data problem.
SPEAKER_00: And to level that track, you really need a single common metric, which is basically the breakthrough that Paul Rosenbaum and Donald Rubin published back in 1983.
SPEAKER_01: Oh wow. So this has been around for a while.
SPEAKER_00: They figured out a way to take a patient's entire medical history, dozens of complex variables like age, blood pressure, and kidney function, and feed it all into a logistic regression model.
SPEAKER_01: Okay, let's unpack this for a second. So instead of predicting a health outcome, the model predicts the probability that the patient would receive the treatment in the first place.
SPEAKER_00: Exactly right. It spits out a single percentage.
SPEAKER_01: So if I have two completely different patients, but the math says they both have a 40% chance of getting the surgery based on their backgrounds, they essentially have the same odds at the starting line.
SPEAKER_00: Yes. Even if one ultimately got the surgery and the other just got medication, their identical score creates the foundation for a fair comparison. The math strips away the bias of who actually received the intervention.
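The step described above can be sketched in a few lines of Python. The patient data here is simulated, and the hand-rolled gradient-descent fit is just a minimal stand-in for any standard logistic regression routine:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Simulated patients: sicker patients are more likely to be treated,
# which is exactly the confounding-by-indication problem described above.
severity = rng.normal(0.0, 1.0, n)
age = rng.normal(60.0, 10.0, n)
treated = rng.binomial(1, 1 / (1 + np.exp(-(-0.5 + 1.5 * severity))))

# Fit a logistic regression of treatment (not outcome!) on covariates.
X = np.column_stack([np.ones(n), (age - age.mean()) / age.std(), severity])
w = np.zeros(X.shape[1])
for _ in range(2000):
    p = 1 / (1 + np.exp(-X @ w))
    w -= 0.1 * (X.T @ (p - treated)) / n  # gradient of the log-loss

# The propensity score: P(treatment | covariates), one number per patient.
propensity = 1 / (1 + np.exp(-X @ w))
```

Treated and untreated patients with similar scores are the "same odds at the starting line" pairs the episode describes.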
SPEAKER_01: So if the score gives us this perfect baseline, what is the next step? You can't just stare at the probabilities. How do researchers actually inject this single number into a massive database of raw hospital records?
SPEAKER_00: Well, there are four main ways to do it. The most intuitive one is matching. You literally pair a treated patient with an untreated patient who has the exact same propensity score, and then you compare their outcomes.
SPEAKER_01: Wait, but if we strictly pair exact matches, aren't we throwing away everyone who doesn't have a perfectly paired score? That feels like trashing half your puzzle pieces just because the edges do not line up perfectly.
SPEAKER_00: It does discard a lot of data. That is a very real issue. And it is why researchers often prefer inverse probability of treatment weighting, or IPTW.
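A greedy 1:1 nearest-neighbor matcher makes both the pairing and the discarding concrete. The scores and the 0.05 caliper below are illustrative, not from any real study:

```python
import numpy as np

# Hypothetical propensity scores for a handful of patients.
treated_ps = np.array([0.35, 0.50, 0.62, 0.90])
untreated_ps = np.array([0.10, 0.33, 0.48, 0.55, 0.61])
caliper = 0.05  # maximum allowed score difference for a valid pair

pairs, used = [], set()
for i, ps in enumerate(treated_ps):
    # Closest untreated patient who has not already been matched.
    dist, j = min((abs(ps - u), j)
                  for j, u in enumerate(untreated_ps) if j not in used)
    if dist <= caliper:
        pairs.append((i, j))
        used.add(j)
    # Otherwise this treated patient is simply discarded.

# pairs -> [(0, 1), (1, 2), (2, 4)]; the 0.90 patient finds no match
# within the caliper, and two untreated patients go unused.
```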
SPEAKER_01: IPTW. Okay. So how does that avoid tossing out the unmatched pieces?
SPEAKER_00: So instead of throwing them out, you mathematically weight the full sample. Think of it like cloning. Like if a patient with a very low propensity score somehow got the surgery anyway, they're statistically rare and highly valuable.
SPEAKER_01: Oh, I see. So we mathematically clone or upweight them so they represent all the untreated people who look just like them.
SPEAKER_00: It creates a completely balanced pseudo-population.
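The cloning intuition is simple arithmetic: weight each patient by the inverse probability of the treatment they actually received. A minimal sketch on simulated data, where the true propensities are assumed known (in practice they must be estimated):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

severity = rng.normal(0.0, 1.0, n)
propensity = 1 / (1 + np.exp(-(-0.5 + 1.5 * severity)))
treated = rng.binomial(1, propensity)

# IPTW: 1/e for the treated, 1/(1-e) for the untreated. A rarely treated
# patient who got treated anyway receives a large weight -- the "clone".
weights = np.where(treated == 1, 1 / propensity, 1 / (1 - propensity))

def weighted_mean(x, w):
    return float((x * w).sum() / w.sum())

t, c = treated == 1, treated == 0
raw_gap = abs(severity[t].mean() - severity[c].mean())
iptw_gap = abs(weighted_mean(severity[t], weights[t])
               - weighted_mean(severity[c], weights[c]))
# In the weighted pseudo-population the severity gap mostly vanishes.
```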
SPEAKER_01: That makes so much more sense. We use the rare cases to balance the scales. What about the other two methods?
SPEAKER_00: Well, there is stratification, where you divide the entire sample into five groups based on their scores and estimate the effects within those buckets.
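Stratification can be sketched the same way: cut the sample into propensity-score quintiles, estimate the effect inside each bucket, and take a size-weighted average. The data is simulated with a true treatment effect of +2:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000

severity = rng.normal(0.0, 1.0, n)
propensity = 1 / (1 + np.exp(-(-0.5 + 1.0 * severity)))
treated = rng.binomial(1, propensity)
# Illustrative outcome model: treatment adds +2, severity hurts the outcome.
outcome = 2.0 * treated - 1.5 * severity + rng.normal(0.0, 1.0, n)

# Five strata by propensity-score quintile.
edges = np.quantile(propensity, [0.2, 0.4, 0.6, 0.8])
stratum = np.digitize(propensity, edges)

effects, sizes = [], []
for s in range(5):
    in_s = stratum == s
    t, c = in_s & (treated == 1), in_s & (treated == 0)
    effects.append(outcome[t].mean() - outcome[c].mean())
    sizes.append(in_s.sum())

stratified = float(np.average(effects, weights=sizes))
naive = float(outcome[treated == 1].mean() - outcome[treated == 0].mean())
# `stratified` lands much closer to the true +2 than the confounded `naive`.
```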
SPEAKER_01: That sounds pretty straightforward. And the last one?
SPEAKER_00: That would be covariate adjustment, where you just toss the propensity score into your final outcome regression model as a control variable. I should say it is generally considered the least robust option.
SPEAKER_01: Got it.
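Covariate adjustment amounts to a one-line regression: include the score as a control in the outcome model. The sketch below (simulated data, true effect +2) also hints at why it is considered the least robust option: the score enters the model linearly, so any nonlinearity in its relationship with the outcome leaves residual bias behind.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000

severity = rng.normal(0.0, 1.0, n)
propensity = 1 / (1 + np.exp(-(-0.5 + 1.0 * severity)))
treated = rng.binomial(1, propensity)
outcome = 2.0 * treated - 1.5 * severity + rng.normal(0.0, 1.0, n)

# Outcome regression with the propensity score as a control variable.
X = np.column_stack([np.ones(n), treated, propensity])
beta, *_ = np.linalg.lstsq(X, outcome, rcond=None)
effect = float(beta[1])  # coefficient on the treatment indicator
# Far better than a naive comparison, but only as good as the assumed
# linear relationship between the score and the outcome.
```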
SPEAKER_01: But if these math techniques are so powerful at simulating a trial from messy data, why spend millions on an actual clinical trial? Like, why not just mathematically weight an insurance database and call it a day?
SPEAKER_00: Well, here is the catch. Propensity scores only adjust for measured confounders.
SPEAKER_01: Ah, measured being the operative word there.
SPEAKER_00: If there is a variable that affects both the treatment and the outcome, like, say, a patient's underlying frailty or their specific diet, and it just wasn't recorded in your database, the algorithm cannot calculate a probability for it. Unmeasured confounding is the absolute Achilles heel of this entire method.
SPEAKER_01: It is like a ghost in the machine. It invisibly skews the outcome without leaving a single trace in the data. You can balance the blood pressure and age perfectly, but if you miss the ghost, the results are still haunted.
SPEAKER_00: Haunted is a good way to put it. And beyond the ghosts, the math itself requires sufficient overlap. The scores cannot all sit at the extremes of zero and one.
SPEAKER_01: You need a mix of patients with similar scores in both treatment groups, or there's nothing to compare.
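A minimal common-support check, on hypothetical scores: find the score range where both groups actually appear and flag everyone outside it.

```python
import numpy as np

# Hypothetical propensity scores for each group.
ps_treated = np.array([0.15, 0.40, 0.55, 0.70, 0.95])
ps_untreated = np.array([0.02, 0.05, 0.20, 0.45, 0.60])

# Common support: the score range where both groups exist.
lo = max(ps_treated.min(), ps_untreated.min())   # 0.15
hi = min(ps_treated.max(), ps_untreated.max())   # 0.60

keep_treated = (ps_treated >= lo) & (ps_treated <= hi)
keep_untreated = (ps_untreated >= lo) & (ps_untreated <= hi)
# The 0.70 and 0.95 treated patients (and the 0.02, 0.05 untreated ones)
# have no comparable counterparts, so they are commonly trimmed away.
```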
SPEAKER_00: Exactly. Before even looking at the final outcomes, researchers have to run rigorous balance diagnostics, using metrics like standardized mean differences to check the covariate balance between the groups.
SPEAKER_01: So if the standardized mean difference isn't close to zero across all your variables, your weighting or matching basically failed to actually balance the population?
SPEAKER_00: Correct. You have to mathematically prove the track is level before you ever peek at the finish line.
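The diagnostic itself is a short formula: the difference in group means divided by the pooled standard deviation. The data below is simulated so that one variable is balanced and one is not:

```python
import numpy as np

def smd(treated_vals, control_vals):
    """Standardized mean difference with a pooled standard deviation."""
    pooled_sd = np.sqrt((treated_vals.var(ddof=1)
                         + control_vals.var(ddof=1)) / 2)
    return float((treated_vals.mean() - control_vals.mean()) / pooled_sd)

rng = np.random.default_rng(4)
# One variable the adjustment balanced, and one it missed (simulated).
balanced_t, balanced_c = rng.normal(0.0, 1, 800), rng.normal(0.0, 1, 800)
skewed_t, skewed_c = rng.normal(0.8, 1, 800), rng.normal(0.0, 1, 800)

# Rule of thumb: |SMD| < 0.1 counts as balanced; the skewed variable
# fails the diagnostic, so the track is not yet level.
```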
SPEAKER_01: This is wild because, you know, it isn't just some niche academic theory. This exact framework is how researchers estimated COVID-19 vaccine effectiveness using real-world health records when trials just weren't feasible.
SPEAKER_00: And it is exactly how doctors right now are comparing surgical versus non-surgical cardiac treatments. It really is the engine running modern real-world evidence.
SPEAKER_01: Which is incredible. So to wrap up, here is a thought for you to take away. The next time you see a flashy headline claiming a new diet or a new drug definitely works based solely on observational data, ask yourself: what unmeasured confounder did the researchers completely miss?
SPEAKER_00: What is the ghost in their machine?
SPEAKER_01: Exactly. Thanks for joining us on this deep dive, and keep questioning the data.