The 90% Problem: Why Clinical AI Pilots Don't Scale

The Clinical Realist

Healthcare innovation is broken. We have billion-dollar AI running on 1990s infrastructure. We have startups dying in "Pilotitis." And we have leaders frozen by analysis paralysis.

Dr. Sarah Matt (The Clinical Realist) is here to fix the disconnect between the tech stack and the trauma bay.

Join Dr. Matt—physician, strategist, and author of The Borderless Healthcare Revolution—as she cuts through the hype to reveal what actually works in modern medicine. No buzzwords. No fluff. Just the raw, unvarnished truth about how to lead, build, and survive in the future of healthcare.

If you are tired of the "Star Trek" vision and want the "Clinical Reality," this is your show.

Subscribe to The Sarah Matt Briefing for weekly insights on healthcare AI, access strategy, and the business of medicine: https://drsarahmatt.com/newsletter-signup

April 08, 2026 • Season 1 • Episode 11

You ran the pilot. The AI worked. The clinicians loved it. The outcomes were solid. Then you tried to roll it out system-wide and it fell apart. This is the 90% problem, and it is not a technology problem. It is an adoption problem. In this episode, Dr. Sarah Matt breaks down why pilot success is the wrong measure for scaling readiness, the four structural gaps that separate a winning pilot from a failed implementation, what clinical governance actually needs to look like before you scale, and the decision architecture that separates health systems that scale AI from the ones that do not.

What you will take away:
- Why measuring tool accuracy in a pilot tells you almost nothing about system-wide adoption
- The four governance elements that must exist before any rollout decision
- How to build the decision architecture before the pilot ends, not after implementation fails
- The question health system leaders should be asking before they even design the pilot

Website: https://drsarahmatt.com
Book a conversation: https://calendly.com/sarahmattmd
LinkedIn: https://www.linkedin.com/in/sarahmattmd/

—

Resources & Links:

📖 Get the Book: "The Borderless Healthcare Revolution" is available now on Amazon and major retailers.

💼 Work with Dr. Matt:
Looking for a keynote speaker or strategic advisor?
Visit: drsarahmatt.com

🔗 Connect on Social:
LinkedIn: https://www.linkedin.com/in/sarahmattmd/
YouTube: https://www.youtube.com/@DrSarahMatt-ClinicalRealist

📧 Subscribe to The Briefing: drsarahmatt.com/newsletter-signup

—
Disclaimer:
The views expressed on this podcast are those of Dr. Sarah Matt and her guests. They do not necessarily reflect the official policy or position of any affiliated institutions. This content is for informational and educational purposes only and does not constitute medical advice or a professional consulting relationship.

SPEAKER_00 0:00

Every AI pilot succeeds. And that's not actually a compliment. It's a design flaw. So when a health system announces that a pilot demonstrated positive results and they're expanding deployment system-wide, they've often proven very little about whether that tool is actually going to work in the real clinical environment they're about to send it into. The gap between what pilots prove and what health systems actually need to know before expansion is where most AI deployments go wrong.
So I'm Dr. Sarah Matt, and this is The Clinical Realist, where we work through what AI in healthcare actually requires. Not what vendors promise and not what conference panels describe, but what the operational and clinical reality looks like when you're the one accountable for making those tools work.
So today we're talking about clinical validation in healthcare AI, specifically the structural reasons why most AI pilots in health systems are designed to succeed, and why that success tells you almost nothing about whether you should expand deployment. This episode is for CMIOs, CMOs, quality officers, physician executives, CNOs, executives who are clinical in any way, anyone who owns the clinical validation process or who's being handed pilot results and asked to sign off on a system-wide rollout. So if you've ever looked at a pilot summary deck and felt like something was off but couldn't name exactly what it was, this episode's for you.
So let me start with the structure of a standard healthcare AI pilot, because the structure is actually the problem. A typical pilot runs on a motivated clinical team, usually the team that requested the tool or championed the vendor relationship. And it runs on a controlled patient population over a very defined time frame, with the vendor providing implementation support that will not be present at scale. These conditions are not representative of real-world deployment at all. They're actually optimal conditions.
So the pilot is designed to demonstrate capability, not to predict performance under realistic operational pressure. Three forces combine to guarantee that most pilots produce positive results, regardless of how effective the underlying tool actually is.
The first is what we call the Hawthorne effect in clinical AI. When clinical staff know they're being observed and evaluated, their behavior changes. They document more carefully, they follow protocols more consistently, they engage with the tool more deliberately. So the pilot results capture that behavioral change as much as they capture the tool's performance. When the pilot ends and the observation ends with it, performance often regresses toward the mean. What the pilot measured as a tool performance improvement was partially a staff behavior improvement that was never sustainable at scale.
The second is volunteer bias. The clinical teams who participate in AI pilots are almost always the teams most interested in the technology. They're not representative of the clinical staff who will be required to use the tool system-wide. Pilot results capture the performance ceiling for motivated early adopters. Rollout performance is going to reflect the median clinical staff member's engagement with a tool they did not volunteer to use, were not involved in selecting, and were told is now part of their workflow. Those are not the same people. They do not produce the same numbers.
The third is vendor support as a confounder. During pilots, vendors are typically providing implementation support, really quick troubleshooting, and active optimization. That level of support does not continue at scale. What the pilot measures as tool performance is often partially vendor support performance. So when the vendor steps back after go-live, the tool performs differently. And not because the tool changed, but because a significant component of its effective performance was the vendor's involvement in making it work.
And I want to give you a concrete example of what that looks like in practice. A large academic medical center piloted an AI-assisted radiology workflow tool on one of their highest-performing radiology teams. The team had requested the tool, had a physician champion who had been following the vendor for two years, and received 12 weeks of embedded vendor support during the pilot. The pilot showed a 22% reduction in turnaround time and strong clinician satisfaction scores. The system approved full radiology department rollout.
Now, rollout performance at six months showed an 8% reduction in turnaround time on the original team and essentially no measurable effect on the other seven radiology teams. The tool had not changed. What had changed was the removal of the three conditions that made the pilot work: the motivated team, the vendor support, and the observation effect. The pilot had proved that the tool could work under optimal conditions. It had proved nothing about whether it would work under normal conditions. And that distinction cost the system approximately 18 months of remediation effort and a second vendor engagement.
Even when pilots are conducted in good faith, and most are, let's be honest, there are no bad actors here, there are still four common design flaws that make their results statistically and operationally uninterpretable as evidence for system-wide rollout.
Design flaw one is sample size and population mismatch. Most health system AI pilots run for 60 to 90 days on one or two clinical units. And that sample size is rarely adequate to detect meaningful effect sizes for the outcomes that actually matter: adverse events, near misses, downstream clinical outcomes. Pilots are sized to demonstrate workflow integration, not to produce statistically valid clinical outcome data.
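To make the sample size point concrete, here is a rough sketch of the standard normal-approximation formula for comparing two event rates. The function name and the rates plugged in (a 2% baseline adverse event rate falling to 1.5%) are hypothetical illustrations, not figures from the episode, and a real study design would need a statistician's input.

```python
from math import ceil
from statistics import NormalDist


def patients_per_arm(baseline_rate, expected_rate, alpha=0.05, power=0.80):
    """Approximate patients needed in each arm to detect a change in an
    event rate (two-sided comparison of two proportions, normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_beta = NormalDist().inv_cdf(power)           # critical value for the desired power
    variance = (baseline_rate * (1 - baseline_rate)
                + expected_rate * (1 - expected_rate))
    effect = baseline_rate - expected_rate
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Hypothetical rates, for illustration only.
print(patients_per_arm(baseline_rate=0.02, expected_rate=0.015))
```

Under these assumed rates the answer comes out above ten thousand patients per arm, which is far more than one or two units typically see in 60 to 90 days. That is the sense in which a short pilot can show a trend on rare outcomes without validating it.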
So when a pilot summary reports improved outcomes, it's almost always reporting directional trends in small samples, not validated evidence of clinical effect. There's a meaningful difference between "we saw a positive trend" and "we have validated evidence." Most pilot summaries describe the first and present it as the second.
Design flaw two, the comparator problem. So what is the AI tool being compared to? In most pilots, it's compared to the baseline performance of the participating clinical team before the tool was introduced. And that's not a rigorous comparator. The performance of a team that just completed training on a new tool, that is being actively observed, and that has extra vendor support available cannot be fairly compared with that same team's earlier baseline, because the baseline is not the counterfactual performance of that team without the tool. Without a true control group, a comparable clinical team in the same period operating without the tool, the pilot can't really establish that the tool caused the improvement. It can only establish that the improvement occurred. And those are not equivalent claims.
Design flaw three, outcome selection. Pilots typically measure what is measurable in 60 to 90 days: throughput, documentation time, alert acknowledgement rates, clinician satisfaction scores. And these are not the outcomes that matter most to your patients or your malpractice carrier. They're the outcomes that are convenient to measure in a short time frame. The clinical outcomes that determine whether a tool is actually safe and effective, such as downstream adverse event rates, diagnostic accuracy at scale, and long-term workflow challenges, require longer time frames than most pilots run. A pilot that shows improved documentation time is not a pilot that shows improved patient outcomes. That equivalence is often implied in pilot presentations and almost never justified.
Design flaw four, missing the workflow failure modes. So pilots catch workflow success.
They rarely surface workflow failure modes because failure modes appear at low frequencies, in edge cases, and in the combinations of patient complexity and workflow pressure that are underrepresented in controlled pilots. The failure mode that matters in a real clinical deployment is the one that appears in the 400th patient, not the first 50. Pilots don't get to 400 patients under realistic conditions. So the failure mode goes undetected until it appears in the rollout, at a point where the organizational commitment to the deployment is already fully made and reversal is politically difficult.
I've seen this pattern with clinical decision support tools specifically. The failure mode that tends to matter is not the tool giving a clearly wrong answer in an obviously complex case. Clinicians catch those. The failure mode that matters is the tool giving a subtly insufficient answer on a case that looks routine but is not, at the moment in a shift when workflow pressure is highest and verification is least likely. That combination does not appear in a 60-day pilot on a motivated team with a vendor representative available for questions. It appears in month seven of a system-wide rollout, on a night shift, on a Sunday.
So you don't have to run a randomized controlled trial before you deploy AI. I'm not saying that. So here's what a more rigorous validation framework could look like.
The most important single change is prospective criteria defined before the pilot starts. Before the pilot begins, define in writing what results would lead to expansion and what results would lead to stopping. Not "we'll review the data and decide." Specific performance thresholds, defined in advance, agreed to by the clinical and operational teams. This forces the question of what you actually need to see before the vendor's pilot results are on the table and the organizational pressure to expand is already in motion.
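One way to picture prospective criteria is as a small written specification that the pilot results are later checked against. The sketch below is an assumption-laden illustration: the `Criterion` class, the `decide` function, the metric names, and every threshold are hypothetical, and it assumes each metric is scored so that higher is better.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One pre-agreed threshold pair for a pilot metric (higher is better)."""
    metric: str
    expand_at_or_above: float  # result needed to support expansion
    stop_below: float          # result that triggers stopping


def decide(criteria, results):
    """Check pilot results against criteria that were written down in advance."""
    # Missing data never counts as a pass: hold until the metric is reported.
    if any(c.metric not in results for c in criteria):
        return "hold"
    if any(results[c.metric] < c.stop_below for c in criteria):
        return "stop"
    if all(results[c.metric] >= c.expand_at_or_above for c in criteria):
        return "expand"
    return "hold"


# Hypothetical criteria, agreed before the pilot begins.
criteria = [
    Criterion("turnaround_reduction_pct", expand_at_or_above=15.0, stop_below=0.0),
    Criterion("high_comorbidity_agreement_pct", expand_at_or_above=90.0, stop_below=80.0),
]

print(decide(criteria, {"turnaround_reduction_pct": 22.0,
                        "high_comorbidity_agreement_pct": 86.0}))
```

The design choice worth noticing is that the thresholds live in `criteria`, which exists before any results do. The later review only runs the check; it does not get to renegotiate what counts as good enough.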
So when the pilot summary arrives, you're not deciding whether the results are good enough. You're checking whether the results meet the criteria you already specified. Those are very different conversations with very different political dynamics.
The second element is staged deployment, not pilot to full rollout. The binary of "the pilot succeeded, now we expand" is a false choice. A staged deployment, one unit, then two units, then one facility, then a second facility, gives you multiple cycles of performance data under increasingly realistic conditions. Each stage allows you to see whether the pilot results hold as the conditions that made the pilot succeed are progressively removed. So you're not betting the organization on a single 90-day data set. You're building an evidence base for expansion decisions.
The third element is pre-specified failure mode identification. So before deployment, ask the vendor for the documented failure modes their internal testing identified. Ask what patient populations the model underperformed on in validation. And ask what workflow conditions produce the worst outputs. This information exists. Vendors generate it in the course of building the tool. Most health systems never actually ask for it. The ones that do can design monitoring that specifically watches for the failure modes the vendor already knows about, rather than discovering them when they appear in the deployment.
I pushed a vendor on this in a pre-deployment conversation once. The initial response was that their tool had performed consistently across all patient populations in their validation study. I asked for the validation documentation. And the documentation showed that their validation study had a 43% exclusion rate for patients with more than three comorbidities, which was exactly the patient population we were deploying into.
And so the tool had been validated on a population that was structurally less complex than the patient panel we wanted to use it for. And that wasn't vendor deception. It was a validation scope limitation that would not have surfaced unless I had actually asked for the documentation. So we negotiated a modified monitoring structure that specifically tracked performance on high-comorbidity patients in the first 90 days of the deployment. The failure mode appeared on schedule at roughly week 11, and we caught it before it became a patient safety event.
The fourth element is clinician-reported friction as a leading indicator. The earliest signal that a tool will fail at scale is often the friction clinicians report during the pilot. Not friction as complaint, but friction as workflow signal. Where does the tool create more work than it saves? Where does the output require the most verification? And where do clinicians report overriding the tool most frequently? Those friction points in the pilot are the preview of the failure modes in the rollout. Most pilot evaluation frameworks treat clinical friction as a change management problem to address with training. A more useful frame is that friction is diagnostic data about where the tool is not working as designed, or where it's working as designed and it was designed poorly.
Raising clinical validation concerns about an AI pilot is not anti-AI and it's not anti-progress. It's the clinical leader's job. The challenge is framing it so it lands that way. The framing problem is real. When clinical executives, physician executives, raise clinical validation concerns about AI pilots, they're often read as skeptical about AI in general, or as protecting turf, or as slowing progress. That reading is wrong. And it's also completely unnecessary.
The frame that works is this: I want this deployment to succeed, and that requires us to know what the pilot actually proved. Here's what I need to see in the data before I can sign off on expansion. That's not a no.
That's a conditional yes with specified conditions. The difference matters because it positions the clinical executive as a partner in making the deployment succeed rather than as an obstacle on the way to launch.
The distinction between technical validation and clinical validation is also worth naming explicitly in these conversations. IT and vendors are equipped to validate technical performance: uptime, integration, output generation. They're not necessarily equipped to validate clinical performance: whether the outputs are clinically appropriate, whether the workflow integration is safe, whether the failure modes are acceptable in the patient population you're serving. So clinical validation is not a check-the-box exercise. It's a substantive review that requires real clinical judgment. The clinical executive who owns clinical validation is not doing duplicative work. They're doing work that no one else in the organization can do.
What should you ask for from the vendor before signing off on expansion? The full validation documentation, not the pilot summary deck. The performance data by patient subgroup. The failure mode documentation from their internal testing. The comparator methodology the pilot used. If the vendor can't or will not provide these, that is information about the tool's readiness for expansion. Specifically, that the vendor does not have the documentation that would justify the expansion you are being asked to approve.
And let me address the institutional standing question directly, because I know it's going to come up. Physician executives, nursing executives, and other clinical executives who are accountable for clinical outcomes have both the standing and the obligation to ask for validation evidence that supports that accountability. This is not optional due diligence.
If a patient harm event occurs after expansion and it emerges that the validation data supporting expansion was inadequate, the clinical executive who signed off is in the review meeting. That context is the backing for every hard validation conversation. So you're not being difficult when you ask for the evidence base. You're doing the job that your accountability actually requires.
So the goal of this episode is not to make you more skeptical of AI. It's to make you more precise about what validation evidence you actually need before your accountability is attached to a deployment. Pilots are not evidence of readiness, they're evidence of possibility. The gap between those two things is governance, specifically the governance decisions that determine what you need to see before expansion begins, who has authority to slow or stop a deployment if the post-launch data diverges from the pilot, and how clinical validation concerns get to the decision table in time to matter.
So if your health system is currently in the gap between a successful pilot and a system-wide rollout decision, this is the moment to ask for the data. Not after. Not after the expansion is approved. Now.
So I'm Dr. Sarah Matt. New episodes of The Clinical Realist come out every Wednesday. And if this was useful for you, share it with a clinical leader who is about to be handed a pilot results deck and asked to sign off on the expansion. They need this conversation before that meeting, not after. See you next time.