Heliox: Where Evidence Meets Empathy 🇨🇦

✍️ When the Machine Writes Your Story: AI Scribes, Hallucinated Medicine, and the Patients Left Behind

• by SC Zoomers • Season 7 • Episode 6

📖 Read: https://helioxpodcast.substack.com/publish/post/198763613

There is a small, unassuming phone sitting on a doctor's desk. It is listening to everything.

It listens to the fear in a mother's voice as she describes her son's episodes. It listens to the careful hedging of a retired schoolteacher who doesn't want to be a bother. It listens to the pauses, the restarts, the coughing, the whole ungainly music of a human being trying to communicate the most frightening things they have ever had to say. And when the appointment ends and the door clicks shut, the phone does something remarkable: it writes it all down.

Or rather, it writes down what it thinks it heard. What it predicts should have been said.

And this, quietly, is the problem.

References
Perspective: Listening to Users when Auditing Medical AI Scribes

and 15 others

This is Heliox: Where Evidence Meets Empathy

Independent, moderated, timely, deep, gentle, clinical, global, and community conversations about things that matter.  Breathe Easy, we go deep and lightly surface the big ideas.

Support the show

Disclosure: This podcast uses AI-generated synthetic voices for a material portion of the audio content, in line with Apple Podcasts guidelines. 

We make rigorous science accessible, accurate, and unforgettable.

Produced by Michelle Bruecker and Scott Bleackley, it features reviews of emerging research and ideas from leading thinkers, curated under our creative direction with AI assistance for voice, imagery, and composition. Synthetic voices and illustrative images of people are representative tools, not depictions of specific individuals.

We dive deep into peer-reviewed research, pre-prints, and major scientific works, then bring them to life through the stories of the researchers themselves. Complex ideas become clear. Obscure discoveries become conversation starters. And you walk away understanding not just what scientists discovered, but why it matters and how they got there.

Spoken word, short and sweet, with rhythm and a catchy beat.
http://tinyurl.com/stonefolksongs



You know, usually when we talk about a medical diagnosis, there's this expectation of, like, complete precision. Right. It feels very objective, very structural. Exactly. It feels like engineering. You fall off your bike, your arm hurts, you go in and the x-ray shows this jagged white line on the radius. And the doctor just points at the screen and there it is. It's broken. And honestly, that's comforting. We want our medicine to feel like math. But then you leave the radiology department and you step into the world of human communication. Which is an entirely different landscape. It really is. You step into neurodevelopment, cultural nuance, psychiatric trauma, and suddenly that x-ray machine is completely useless. We're looking at a diagnostic landscape that is just entirely murky. Because, well, you aren't dealing with bone density anymore. You are dealing with stories. You're dealing with the highly unstructured, incredibly messy data of human speech. Yeah. And every pause, every hesitation, every regional slang word, I mean, all of that contains vital diagnostic value. And this collision, you know, the collision between the messy reality of human conversation and the really rigid mathematical demands of the medical system, that is where we are spending our time today. It's a fascinating intersection. It is. Welcome to the deep dive. We are tracking the journey of a group of researchers who stumbled into this massive blind spot at the intersection of medicine and artificial intelligence. We're looking at a stack of sources today, really centering on a pivotal paper titled Perspective: Listening to Users When Auditing Medical AI Scribes. Right. And we're looking specifically at the work of researchers like Dr. Allison Koenecke from Cornell Tech and Dr. John-Jose Nunez. And Dr. Nunez has this fascinating dual perspective because he is a clinical psychiatrist at the University of British Columbia, but he is also an AI expert.
Which is exactly the kind of cross-disciplinary perspective you need for this. He and his colleagues started looking very closely at something that is basically sweeping through hospitals right now. Medical AI scribes. And they started to realize that this shiny new tech, the stuff that's meant to save doctors from burnout, it's actually misinterpreting patients in ways that are subtle, bizarre and, honestly, potentially dangerous. Very dangerous. OK, let's unpack this. Because to understand how they got to this realization, we have to understand the crisis that made these AI scribes seem like a miracle in the first place. Right. Because Nunez and his team, they didn't start off as skeptics. They were looking at a health care system that is fundamentally... Drowning. Drowning in administration. I mean, let's look at the sheer scale of this in the Canadian context, just as an example. Right now, in the province of Ontario alone, over 2.5 million people do not have a family physician. 2.5 million people. That is a terrifying number. It's staggering. Those are people relying on chaotic walk-in clinics or, you know, sitting in an emergency room for eight hours just to get a basic prescription refilled. Exactly. And when you ask why this shortage exists, a massive driver of the exodus from primary care is the administrative burden. The data shows that family physicians are spending roughly 15 to 20 hours a week just on non-essential administrative tasks. Wait, 20 hours? That is literally half a standard work week. Yeah. And when you aggregate that lost time across the entire system, it equates to 55.6 million patient visits' worth of time annually. Oh my God. Right. That time doesn't exist for patient care anymore. It just vanishes into the ether of paperwork. Doctors are spending over half their workday documenting in electronic medical records, the EMR systems, and only about a quarter of their day actually interacting with patients.
It's like being a highly trained detective who spends, I don't know, 75% of their time formatting the margins on a report about the crime scene rather than actually solving the crime. That's a great way to put it. It defeats the entire purpose of their decade of medical training. So how are they coping with this? Because they still have to see patients. Well, because there simply aren't enough hours in the clinical day to see a full roster of patients and write the incredibly detailed notes required for modern medical-legal standards, doctors just take the work home. Right. The medical community actually has a term for this. They call it pajama time. Pajama time. Yeah, I have heard doctors talk about this. They put their kids to bed and instead of winding down, they open the laptop on the couch at 9 p.m. and just chart for three hours. It completely erodes their personal lives. It destroys their work-life balance, accelerates mental fatigue, and ultimately it drives them out of their profession entirely. Which, of course, makes the system even more short-staffed, meaning more work for the remaining doctors. It's just a catastrophic feedback loop. Exactly. So if you are Dr. Nunez looking at your colleagues burning out and someone hands you a piece of software that promises to eliminate pajama time, you're going to be incredibly optimistic. Absolutely. And that software is the AI scribe. But we need to clarify what an AI scribe actually is mechanically. Because I think people hear scribe and they think of old school dictation. Right. This isn't the basic dictation software from 10 years ago where a doctor holds a microphone and says, patient presents with knee pain, period. Right. It's not a speech-to-text transcriber just repeating what the doctor dictates verbatim. It is far more complex than that. Modern AI scribes utilize natural language processing and machine learning. They are designed to act as ambient listeners. Ambient listeners. Yeah.
So a phone or a tablet just sits on the desk. Yeah. And it passively records the natural, unstructured, back and forth conversation between the doctor and the patient. Okay, so I'm talking to my doctor. We spend a minute talking about how my knee hurts when I walk upstairs. Then we spend like three minutes talking about how our kids are doing in the same school district. Right, and the AI processes that entire audio file. Yeah. It isolates the voices, transcribes the speech, and then uses a large language model to identify and extract only the relevant clinical data. Oh, wow. Yeah, it completely ignores the small talk about the kids, takes the data about the knee pain, and automatically formats it into a highly structured clinical note. Usually a SOAP note. Which stands for Subjective, Objective, Assessment, and Plan, right? Exactly. It can even draft a referral letter to an orthopedic surgeon based on that one single conversation. Okay, that sounds like actual magic. If I'm a burnt-out doctor, I'm throwing my money at that immediately. Did it actually work in the real world when they started testing it? The initial pilots were honestly staggering. Dr. Nunez and others looked closely at initiatives like the Doctors of BC pilot that ran recently. They had over 30 physicians test these ambient AI technologies. And what were the results? The doctors reported an average of 2.7 hours saved per week. And in the broader literature, some studies show a 70% to 90% decrease in documentation time. 90%. The human impact of that must be massive. The sources actually noted that 97% of the participants in that pilot reported reduced mental fatigue. They used the phrase, it brought joy back into practice. Because they could finally look away from the glowing rectangle of the computer monitor. They could turn their chair around, look the patient in the eye, and just have a real conversation. And the patients felt it too.
The data showed 78% of patients felt their doctors actually paid more attention to them. So, I mean, on paper, this is the ultimate silver bullet. It crushes the administrative burden. It restores the human connection. It frees up hours of clinical time. And this is a massive but. This is where the story shifts. Because researchers like Koenecke and Nunez don't just look at the high-level marketing data. Right. They started looking at the microscopic level of how these machines are actually interpreting speech. And we have to remember, these tools are no longer contained in small lab pilots. One platform mentioned in the sources, Nabla, has already transcribed over 7 million patient visits. Right. Seven million. That's millions of incredibly intimate, complex, life-or-death conversations that are currently being filtered through these algorithms. And as the researchers audited the transcripts generated by models like OpenAI's Whisper, which, by the way, is the foundational backbone for many of these commercial AI scribes, they found a critical systemic flaw. What was it? The AI wasn't just making typos. It was hallucinating. Okay, the hyperactivated antibiotics example from the Koenecke paper, when I read this, it literally stopped me in my tracks. It's chilling. So a patient in the data set is talking to their doctor, and the patient simply says, quote, I became ill with a fairly serious strain of viral something. That is the entirety of the audio recording. It's a very human, somewhat vague statement. Viral something? But the AI scribe doesn't just transcribe viral something. The official medical note it generates says, I became ill with a fairly serious strain of viral something, but I didn't take any medication. I took hyperactivated antibiotics, and sometimes I would think that was worse. It invented an entire narrative. Yes. It documented a specific medical action the patient never took.
And it named a medication that literally does not exist in pharmacology. How does a multimillion dollar piece of software just invent hyperactivated antibiotics? I mean, mechanistically, what is going on in the code that makes it lie like that? To understand why it lies, you have to understand that large language models are not logic engines, and they are not fact databases. They are, at their core, phenomenally advanced predictive text engines. They operate on token prediction. Like when my iPhone keyboard suggests the next word in a text message, but just on steroids? Precisely. They have ingested terabytes of human text from the Internet. So when they process a prompt, they map out the statistical probability of what the next word should be based on the patterns they learned. Right, so it's a math equation. Yes. They exist in a mathematical space where words are turned into vectors, essentially directions and magnitudes of meaning. In this statistical space, the concept of a viral illness is very closely clustered near the concepts of medication and antibiotics. Ah, right. So when the patient vaguely trailed off with "viral something," the AI's neural network basically felt an irresistible statistical urge to complete the pattern. Yes. It recognized that in millions of internet documents, a sentence about a serious viral illness is usually followed by a discussion of medication. So it generated a statistically plausible sounding continuation. It wasn't trying to document reality at all. Not at all. It was just trying to generate text that looks like a medical sentence. That is terrifying. It's aggressively guessing what a doctor's note should sound like rather than documenting what physically occurred in the room. And the researchers found that these specific types of hallucinations occur about 1% of the time with the Whisper model. Well, 1% feels low until you remember we just talked about 7 million patient visits. Exactly.
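To make the "predictive text on steroids" idea concrete, here is a deliberately tiny sketch. It is not how Whisper or any commercial scribe works internally (those are large neural networks); it is a word-pair counter trained on four made-up sentences, and the corpus, function names, and output are all illustrative assumptions. What it shows is only the failure pattern described above: a model that continues text with whatever usually comes next has no way of knowing what was actually said.

```python
from collections import Counter, defaultdict

# Made-up stand-in for "terabytes of internet text" (assumption, not real data).
CORPUS = [
    "i had a viral infection and took medication",
    "she had a viral infection and took antibiotics",
    "he had a viral infection and took medication",
    "the viral infection cleared after rest",
]


def train_bigrams(sentences):
    """Count how often each word is followed by each other word."""
    follows = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for current, nxt in zip(words, words[1:]):
            follows[current][nxt] += 1
    return follows


def continue_text(follows, start, max_words=5):
    """Greedily append the most frequent next word.

    Nothing here checks what the speaker really said; the continuation
    is driven purely by counts from the training sentences.
    """
    words = [start]
    for _ in range(max_words):
        options = follows.get(words[-1])
        if not options:
            break
        words.append(options.most_common(1)[0][0])
    return " ".join(words)


model = train_bigrams(CORPUS)
# The "patient" only said "viral"; the model supplies the rest.
print(continue_text(model, "viral"))
```

In this toy corpus the word "viral" drags "took medication" along with it, simply because that pairing is common in the training sentences. The real systems are vastly more sophisticated, but the conversation above describes the same basic pull toward a statistically plausible ending.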
1% of 7 million visits means tens of thousands of fabricated medical details. Fake medications, invented symptoms, false negative test results are quietly slipping into the permanent health records of real people. Wait, if the AI is hallucinating fake drugs, how did hospital IT boards and health authorities approve this software in the first place? Like, how did it pass the tests? Because the test itself is fundamentally flawed. This is actually one of the most vital insights from Koenecke and Nunez. They exposed the problem with how the tech industry currently audits speech recognition software. How do they audit it? The absolute unquestioned industry standard metric is something called the word error rate, or WER. Word error rate. Conceptually, how are they calculating that? It's a rigid mathematical formula based on something called Levenshtein distance. Basically, you take the human-generated perfect transcript, the ground truth, and you compare the AI transcript to it. Okay, that makes sense. You count every substituted word, every deleted word, and every inserted word. Then you divide that by the total number of words in the ground truth, and it spits out a percentage. I mean, that sounds reasonable on the surface. You're just counting the mistakes. It's reasonable for counting typos, sure, but it is disastrous for measuring semantic meaning. The word error rate treats every single word in the English language as having the exact same mathematical weight. Oh, I see. So if the AI misses an um or an ah or changes going to into gonna, that is mathematically identical to it dropping a clinical term. Yes. Think about a patient saying, I am not having chest pain. Okay. If the AI scribe misses the word "not" and transcribes "I am having chest pain," the word error rate is incredibly low. It's just one single deletion out of six words. The software basically gets an A+ grade. But the clinical reality is completely inverted.
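The calculation described above can be sketched in a few lines. This is a minimal word-level Levenshtein implementation written for illustration, not the code of any particular auditing tool; the function name and the example sentences (apart from the chest pain line) are assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # dp[i][j] = fewest word edits turning the first i reference words
    # into the first j hypothesis words (classic Levenshtein table).
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete every reference word
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution_cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,                      # deletion
                dp[i][j - 1] + 1,                      # insertion
                dp[i - 1][j - 1] + substitution_cost,  # match or substitution
            )
    return dp[len(ref)][len(hyp)] / len(ref)


spoken = "um I am not having chest pain"
dropped_filler = "I am not having chest pain"    # harmless: lost "um"
dropped_negation = "um I am having chest pain"   # meaning inverted: lost "not"

print(word_error_rate(spoken, dropped_filler))
print(word_error_rate(spoken, dropped_negation))
```

Both hypotheses score exactly the same, one deletion out of seven words, even though only one of them reverses the clinical meaning. On the six-word sentence from the conversation the score is 1/6; spread across a full visit of a few thousand words, the same missing "not" would barely register.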
You just sent a healthy patient to the cardiology ward. The metric gives a completely false sense of security. It's evaluating the software like it's a spelling bee, totally ignoring the medical context. That is wild. That's why these researchers are sounding the alarm. They argue that word error rate is an obsolete yardstick for medical AI. We need entirely new metric suites. Like what? We need software engineers to design audits that specifically track medical term recall rates. Meaning, out of all the actual drugs and diagnoses mentioned in the room, what percentage did the AI successfully capture? And we need specific isolated tracking for hallucination rates. Because right now, the system is totally blind to its own lies. Yes. But, okay, if an AI is struggling to accurately document a standard medical conversation without inventing terms, what happens when the human beings speaking aren't communicating in a standard way? Because, you know, nobody actually speaks in perfect textbook sentences. Which introduces the sociolinguistic layer of this problem. And this is where the research gets incredibly fascinating, particularly regarding cultural differences in how we speak. Here's where it gets really interesting. There is this deep dive in our sources regarding the cultural communication differences between patients in the United States and patients in Canada. Right. And this isn't just an interesting sociological quirk. It is actively breaking the AI's ability to document care. To understand why the AI breaks, we have to look at the human behavior it is trying to model. American medical culture has been heavily shaped for decades by a multi-payer, insurance-driven, highly privatized system. Right. Time is money. Exactly. That system places a massive premium on time, efficiency, and billing codes. Doctors have very tight windows and patients know they are paying a premium for that time. So the communication style adapts to that pressure. Exactly.
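As a rough illustration of the two audit metrics called for earlier in this exchange, medical term recall and a separate hallucination count, here is a small sketch. The term list, the function names, and the naive substring matching are all assumptions made for the example; a real audit would need a proper clinical vocabulary and human-verified transcripts.

```python
# Hypothetical mini-vocabulary; a real audit would use a clinical terminology list.
CLINICAL_TERMS = {"viral", "antibiotics", "medication", "chest pain"}


def find_terms(text, vocabulary):
    """Return the vocabulary terms that appear in the text (naive substring match)."""
    lowered = text.lower()
    return {term for term in vocabulary if term in lowered}


def term_recall(reference, hypothesis, vocabulary):
    """Share of clinical terms actually spoken that the AI transcript kept."""
    spoken = find_terms(reference, vocabulary)
    if not spoken:
        return None  # nothing clinical was said, so recall is undefined
    captured = spoken & find_terms(hypothesis, vocabulary)
    return len(captured) / len(spoken)


def hallucinated_terms(reference, hypothesis, vocabulary):
    """Clinical terms in the AI transcript that were never spoken."""
    return find_terms(hypothesis, vocabulary) - find_terms(reference, vocabulary)


# The example quoted earlier in the episode.
said = "I became ill with a fairly serious strain of viral something"
scribe = (
    "I became ill with a fairly serious strain of viral something, "
    "but I didn't take any medication. I took hyperactivated antibiotics"
)

print(term_recall(said, scribe, CLINICAL_TERMS))
print(sorted(hallucinated_terms(said, scribe, CLINICAL_TERMS)))
```

Here recall is perfect, because the one clinical word the patient used survived, yet the hallucination check still flags "antibiotics" and "medication" as terms nobody said. Tracking the two numbers separately is the point: a transcript can capture everything that was spoken and still add things that were not. Note that substring matching is crude (it would count "viral" inside "antiviral"), which is one reason this is only a sketch.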
American medical communication, generally speaking, is highly low context, meaning the raw information is stated directly and explicitly. Sociologists call it the American summary. OK, so how does that look in a clinic? A patient walks in and immediately states their primary, most severe concern. Doctor, my chest hurts on the left side. It started two hours ago, and the pain is an 8 out of 10. Boom. Data delivered. Very efficient. But the sources point out that Canadian norms are entirely different. They are influenced by British and French traditions, but more importantly, by a publicly funded single-payer health care system. Yeah. Canadian communication tends to be high context. In high context communication, the raw data is heavily embedded in the social relationship, the pleasantries, and the shared cultural understanding. So instead of the immediate American summary, you get the Canadian narrative. Right. It's chronological. It starts from the beginning of the week. And crucially, it is heavily padded with politeness markers. And this leads to a specific linguistic behavior the researchers highlight called hedging. Hedging, yes. This is a critical data point that frequently causes misdiagnosis. Let's break down how it functions in a clinical setting. Sure. Imagine a Canadian patient walks into a clinic. They have a severely injured knee, but they aren't going to walk in and say, my knee is in agonizing pain. They're going to sit down and say, well, you know, it's been a bit tender lately. Or I was wondering if maybe, if you have time, you could take a quick look at my leg. They use these softeners. Why do they do that? Because in a publicly funded system, there is a deep-seated cultural anxiety about being a bother. You know the system is strained, you know the waiting room is full, and you don't want to drain public resources unless you really have to. So you downplay your own suffering to be polite.
Now, think about the translation error that occurs when an American doctor, or an AI trained heavily on American data, encounters that Canadian patient. Right. If an American patient says, "It's a bit tender," the doctor, accustomed to the low-context American summary, takes that literally. They assume the pain is a two out of ten. But in Canadian English, a bit tender can be polite code for, I am in significant eight out of ten pain, but I'm trying to be stoic so I don't seem demanding. It is a massive clinical mistranslation, even though everyone in the room is ostensibly speaking English. And Dr. Nunez and his colleagues saw this happening in real time. Because of physician shortages in Canada, clinics frequently hire newly arrived American doctors. And the researchers observed these American doctors getting incredibly frustrated. Because they thought the patients were just wasting time. Exactly. They were cutting off their Canadian patients too early because they thought the patients were just rambling or making small talk rather than getting to the medical issue. Which triggers a well-documented phenomenon known as the 90-second rule. Observational studies in these clinics show that if a doctor interrupts a Canadian patient within the first 60 seconds of the appointment, they will almost certainly miss the primary symptom. Because the patient is still building up to it. They are still establishing that polite, high-context relationship. And if you cut off that chronological narrative, it inevitably leads to what physicians call the doorknob comment. Yes, the infamous doorknob comment. I found this concept so relatable. So the doctor has listened for two minutes, thinks they've identified the issue, say a minor one, wraps up the appointment, stands up, puts their hand on the doorknob to leave the room. And then what happens? The patient suddenly says, oh, by the way, sorry to bother you, but I've also been coughing up blood.
They waited until the absolute last possible second, when they finally felt the social relationship was secure enough to drop the actual terrifying reason they made the appointment in the first place. Exactly. Now take that entire complex high-wire act of human sociology and feed it into an AI scribe. These foundational AI models, like Whisper, are trained largely by scraping the entire Internet. And the Internet is vastly disproportionately American. Right. So if the AI model is fine-tuned to listen for a direct, low-context American summary right at the beginning of the audio recording, how does it process a heavily hedged, high-context Canadian narrative? It's going to summarize the polite small talk at the beginning, decide the appointment was about a minor rash, and completely miss the doorknob comment. It's going to formally document that the patient's knee is a bit tender, totally stripping away the cultural context that implies severe pain. This is a major area of concern for Dr. Nunez. He is actively seeking research grants to study this exact sociolinguistic blind spot. He is asking a fundamental question. Do these multimillion-dollar AI scribes actually understand the unique colloquialisms, the hedging, and the unspoken cultural scripts of the populations they are being deployed on? Because right now, hospital systems are buying these tools without knowing the answer. No. For the human doctors trying to bridge this cultural gap, the sources highlight a behavioral workaround called the LEARN model. Right. LEARN stands for listen, explain, acknowledge, recommend, negotiate. It's a framework for culturally sensitive care, and it specifically relies on something called the final check technique. How does the final check operate in practice?
Well, instead of an American doctor wrapping up and saying, any other questions, which a polite Canadian will instinctively answer no to, just to get out of the doctor's way, the LEARN model trains the doctor to ask an open-ended question. Is there anything else on your mind today? Oh, that's smart. Right. It actively invites the doorknob comment before the doctor actually stands up. It's an elegant workaround based on human empathy and cultural adaptation. But an AI scribe does not possess empathy. It does not possess cultural awareness. It only possesses statistical weights based on its training data. Which opens up an even more alarming realization. If the most advanced AI struggles to accurately document the symptoms of a polite, able-bodied Canadian speaking fluent English, what happens when it encounters someone with a severe speech disorder? This brings us to the most urgent section of the Koenecke and Nunez paper. The researchers conducted rigorous audits on how these models perform across different types of speech diversity. And what did they find? The results reveal a massive disparity in technological reliability. The AI performs significantly worse for people with dysphonia, which is a medical hoarseness of the voice. It performs worse for people with dysarthria, which involves weakness or paralysis in the muscles used for speech production. It also fails disproportionately for people who stutter, people with aphasia recovering from strokes, and individuals who are deaf or hard of hearing and have unique vocal cadences. Mechanistically, why does the AI fail these patients? It comes back to the training data. If the massive data sets scraped from the internet do not contain sufficient phonetic vectors representing a dysarthric voice, the neural network literally doesn't know how to map those sounds to text. So just, what, it just fails? It just outputs garbage or hallucinates?
It's like automobile manufacturers in the 20th century designing cars where the crash test dummies were all based on a 50th percentile male. The seat belts and airbags only worked perfectly if you were exactly 5'9" and weighed 170 pounds. The technology only protected the standard user, leaving everyone else at massive risk of injury. That's a perfect analogy. And in the healthcare system, the people who fall outside that standard vocal profile are often the exact populations who require the most frequent, highly accurate medical documentation to manage complex chronic conditions. Right, the stakes are so much higher for them. But beyond just failing to transcribe accurately, the researchers pose a really profound philosophical dilemma regarding conditions like stuttering. They ask: what is the actual ground truth of a stutter? I struggled with this concept. If you are a software engineer building the data set to train the AI, and the patient in the audio file stutters, how are you supposed to type out the perfect transcript? Let's play it out. If a patient says, I, I, I, I need help, should the AI be programmed to transcribe those stutters verbatim? Well, if they're sitting in the office of a speech-language pathologist, yes. The pathologist needs that verbatim transcript to diagnose the severity and specific phonetics of the stutter. But what if that same patient is seeing a cardiologist for a heart murmur? The patient might find a verbatim, highly literal transcript of their stutter sitting in their permanent medical record to be deeply undignified or embarrassing. Because it has no clinical relevance to their heart. Exactly. They might strongly prefer that the AI clean it up and just document, I need help. So which version is the truth? It's completely subjective. It depends entirely on who the specialist is and how the patient feels about their own voice. An AI applying a blanket system-wide rule is going to violate the patient's dignity half the time.
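One way to picture the "which version is the truth" question is as a setting rather than a fixed rule. The sketch below is purely hypothetical: the function names and the two style labels are invented for illustration, and nothing in the sources says any scribe product exposes such an option. It simply shows the same audio yielding two legitimate transcripts, depending on who the note is for.

```python
PUNCTUATION = ",.;-"


def collapse_repetitions(text):
    """Collapse runs of an immediately repeated word into a single word."""
    kept = []
    for token in text.split():
        key = token.strip(PUNCTUATION).lower()
        if kept and kept[-1].strip(PUNCTUATION).lower() == key:
            kept[-1] = token  # keep the last token of the run and its punctuation
        else:
            kept.append(token)
    return " ".join(kept)


def render_transcript(text, style):
    """Return the transcript in the requested style.

    'verbatim' keeps every repetition (what a speech-language pathologist
    may need); 'clean' removes them (what a patient may prefer in a
    cardiology note). Both labels are assumptions for this sketch.
    """
    if style == "verbatim":
        return text
    if style == "clean":
        return collapse_repetitions(text)
    raise ValueError(f"unknown style: {style!r}")


heard = "I, I, I, I need help"
print(render_transcript(heard, "verbatim"))
print(render_transcript(heard, "clean"))
```

Even this toy version has a cost: it would also collapse a legitimate repeated word, as in "he had had surgery". That is a small reminder of why a blanket rule in either direction is risky, and why the choice arguably belongs to the clinician and the patient rather than to a default.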
This tension between literal transcription and clinical judgment becomes incredibly high stakes when we shift our focus to psychiatry. Which is Dr. Nunez's wheelhouse. Yes. Dr. Nunez's clinical background is vital here. Psychiatric illnesses don't just change what a person talks about. They profoundly alter the physical mechanics of how speech is produced. Right. The sources note that a patient experiencing a manic episode might exhibit rapid, pressured, tangential speech, jumping wildly from topic to topic without taking a breath. Conversely, a patient with severe treatment-resistant depression might have incredibly stilted, monotone, slow speech with long, agonizing pauses. And then there is the actual content of the speech, particularly in emergency cases involving psychosis, paranoia, or delusions. The paper provides a brilliant, slightly terrifying illustration of the gap between a human doctor's clinical judgment and a machine's literal transcription. Right. Let's look at the Martians example. Yes. Imagine a patient comes into the psychiatric emergency room in the middle of a severe psychotic break. They are terrified. They sit down and spend five chaotic minutes aggressively explaining to the attending doctor that alien beings are hunting them through the city. A human clinician understands the delicate nature of that moment. They know how to extract the clinical data while protecting the patient's long-term dignity. Right. A human psychiatrist will take that entire five-minute rant and summarize it into a single, respectful, objective sentence in the chart. Patient endorses a persecutory delusion involving aliens. Right. It's clinically accurate, but it's guarded. It's protective. But the AI scribe doesn't have a medical license. It doesn't understand the concept of dignity. It is a transcription engine optimized to capture data. So what does it do? The AI might transcribe the entire psychotic rant verbatim.
The permanent medical record will literally read, the patient is currently being pursued by Martians, seeking to use his DNA for a galactic cloning project. And the reason that level of verbatim detail is so dangerous ties into recent changes in medical law. In the United States, we have the 21st Century Cures Act, and there are similar open notes movements advancing in Canada. Which mandate transparency. Right. These regulations mandate that patients must have direct, unimpeded digital access to read their own clinical notes. Oh, wow. So six months later, that patient is stabilized on medication, they're recovering, and they log into their patient portal on their phone. And they read a highly detailed, verbatim, algorithmic account of the darkest, most terrifying psychotic break of their life. Reading that could be profoundly distressing and triggering. But the sources point out an even more dangerous scenario. What if the patient logs into that portal while they are still highly vulnerable or actively psychotic? Ah, yeah. If a paranoid patient reads their own delusion, typed out officially in a hospital-sanctioned document, it can act as confirmation bias. It makes it real to them. It could actively reinforce the delusion. They might think, look, the AI documented the Martians. The hospital knows they are real. It's a conspiracy. The technology meant to aid their care actually deepens their pathology. It's a nightmare scenario. And yet the tech industry is aggressively marketing these AI scribes specifically to mental health clinics. Because mental health professionals are also drowning in paperwork. The market demand is astronomical. But the guardrails simply do not exist yet to handle the extreme nuance of psychiatric documentation. And it's not just vulnerable patient populations being left exposed by this rapid rollout. It is entire geographic regions.
I want to bring in the commentary we have from Tiana Bresson and her team at the Northern Ontario School of Medicine. This is a great pivot because while Dr. Nunez is looking at clinical nuance in Vancouver, Bresson is looking at the sheer infrastructural reality of deploying this in rural medicine. We are talking about the digital divide. It's very easy to run a highly successful pilot program in a massive, well-funded urban research hospital in downtown Toronto, where you have fiber optic internet and an IT department on standby. Rural and remote clinics are operating in a completely different reality. The sources outline this perfectly. These AI scribes aren't just simple apps running locally on a phone. They are heavily cloud-based. They rely on real-time API connections to massive server farms to run the natural language processing. And if you are running a clinic in a remote northern community, you often don't have stable high-speed broadband. When the weather gets bad, the connection drops. And when the API connection fails in the middle of a patient visit, the rural doctor doesn't have a dedicated tech support staff to call. The system just crashes. Furthermore, these rural clinics, which are usually publicly funded and operating on razor-thin margins, often cannot afford the massive recurring enterprise software subscription fees these tech companies demand. Bresson and her colleagues invoke a really powerful sociological concept in their paper to explain this disparity. They call it the theory of fundamental causes. This theory is a cornerstone of health sociology. It posits that whenever a new, highly effective health technology or intervention is introduced to society, it almost always benefits the most advantaged, highly resourced populations first. Because the wealthy urban centers have the immediate infrastructure and capital to absorb the technology. Precisely.
So inadvertently, a technology that is designed to improve the health care system overall actually ends up widening existing health disparities. Let's look at the outcome of that. An urban doctor in a wealthy zip code gets 2.7 hours of their week back. They use that time to see more patients or they go home and rest, preventing burnout. Meanwhile, a rural doctor in northern Ontario whose clinic can't afford or run the AI is still drowning in pajama time. That burnout drives them to quit and move to the city, which worsens the already critical doctor shortage in the rural community. The rich get richer, the poor lose their doctor. And even if a rural clinic magically secures the funding and the broadband to run the AI, Bresson points out another massive hurdle. We talked about how these models are trained on standard American-dominated Internet data. What happens when you drop that exact same algorithm into a clinic in northern Ontario that serves predominantly Indigenous populations or Francophone populations or communities that speak a fluid mix of French, English, and regional dialects? The foundational models lack the specific phonetic and cultural training data for those groups. The error rates will skyrocket, the AI will fail to recognize key terms, and the hallucination rate will likely increase as the model desperately tries to map unfamiliar sounds to its American-centric database. The technology quickly transforms from a time-saving tool into an administrative burden. OK, let's look at the totality of what we've uncovered so far. We have AI scribes that hallucinate fake antibiotics. We have models that completely misinterpret polite Canadian hedging. We have algorithms that fail people with stutters, misdocument psychiatric delusions, and structurally disadvantage rural and Indigenous communities. That's a lot. With all of these massive systemic risks actively deployed in the wild, who is holding the leash? 
Who is legally responsible when the AI inevitably gets a critical diagnosis wrong? This brings us to the crucial final act of our exploration. Guardrails, governance, and the immense friction of the EMR. How is the medical establishment attempting to regulate this technological Wild West? The regulatory response in Canada has been remarkably strict, particularly compared to other tech sectors. The College of Physicians and Surgeons of British Columbia, the CPSBC, and the Canadian Medical Protective Association, the CMPA, have drawn a hard line in the sand. The central directive is incredibly clear. The autonomous use of AI scribes is strictly prohibited. Meaning the AI cannot just passively listen to a visit, generate a note, and independently hit save to finalize the permanent medical record. Absolutely not. The human physician must remain the final, unyielding arbiter of truth. The doctor is legally, ethically, and professionally required to manually review, edit, and cryptographically sign off on every single note generated by the algorithm before it becomes an official medical document. Because if a patient is harmed, let's look at a worst-case scenario. The sources note that behavioral health providers report AI frequently struggles to capture subtle but critical elements of a suicide risk assessment. Right. If the AI misses a patient's vague reference to self-harm and the doctor signs the note without catching the omission and the patient tragically takes their life, the liability does not fall on the tech startup in Silicon Valley. The legal and moral liability falls entirely on the human doctor. The buck stops with the physician. The AI is treated as a highly fallible assistant, not a replacement. But, you know, this strict governance introduces a fascinating, almost tragic paradox into the entire techno-optimistic narrative of the AI scribe. I was just thinking about this. The entire pitch was to eliminate the administrative burden. 
But if a doctor knows that the AI is prone to inventing hyperactivated antibiotics and they know they will be sued for malpractice if they miss that hallucination, they have to meticulously, neurotically proofread every single dense multi-paragraph transcript the AI spits out. Doesn't that immense cognitive load just recreate the exact burden we were trying to escape? That is the fear circulating in the medical community. Are we genuinely saving time or are we just trading pajama-time charting for pajama-time proofreading? If the friction of auditing a hallucination-prone AI is higher than the friction of just rapidly dictating the note yourself, the tool is a failure. And the friction isn't just about proofreading. The sources highlight massive structural friction regarding privacy and basic workflow integrations. The Office of the Information and Privacy Commissioner, the OIPC, is raising major red flags. Because fundamentally, these AI scribes are ambient microphones broadcasting the most sensitive, legally protected conversations of your life to a third-party server. The privacy questions are staggering. Where does that audio file physically go? Is it stored on a server in a different country with different privacy laws? Is the tech company legally permitted to use your intimate clinical conversation to train the next iteration of their language model? And the researchers uncovered this incredibly frustrating catch-22 regarding privacy. Some of the more responsible tech companies, like Nabla, want to be hypersecure. So they designed their system to permanently delete the audio recording the absolute millisecond the text transcript is generated. Which sounds great for patient privacy. But from a medical auditing perspective, it is a nightmare, because if the audio is deleted immediately, it is literally impossible to audit the AI after the fact. Oh. 
If a doctor reads the transcript and thinks, wait, did the patient actually say they took that medication? Or is the AI hallucinating? They have no tape to go back and check. They have the transcript, but the ground truth is gone forever. It's maddening. And then, practically speaking, inside the actual clinic, there is the problem of clipboard vulnerabilities. Many of these shiny, cutting-edge AI scribes are essentially just standalone apps running on a doctor's personal smartphone. But they do not integrate smoothly with the aging, clunky electronic medical record systems that hospitals actually use, like OSCAR or MedAccess. So how does the doctor actually get the data from the AI into the official record? They literally have to copy and paste it. You're kidding. No, imagine a chaotic emergency room. The doctor has the AI app open on their phone. The AI generates the note. The doctor highlights the text with their thumb, hits copy, walks down a busy hallway to a desktop computer, logs into the hospital EMR, and hits paste. Cybersecurity experts refer to this as a clipboard vulnerability. When you are manually moving highly sensitive, protected health data across different operating systems and clipboards in a chaotic environment, the risk of a data breach skyrockets. Or more simply, human error occurs. A distracted doctor accidentally pastes patient A's psychiatric notes into patient B's orthopedic file. The friction of the technology creates new avenues for catastrophic errors. So we have outlined a massive, complex web of challenges. We have technological flaws, cultural blind spots, systemic biases and regulatory friction. If you are Dr. Nunez and Dr. Koenigke, you aren't just pointing out the flaws, you are trying to build a safer system. What is their call to action? How do the researchers suggest we fix this? The authors make a very compelling structural argument. 
They state that we need to stop treating AI software like a consumer app and start treating it exactly the same way we treat a new pharmaceutical drug or a physical medical device like a pacemaker. We need rigorous, mandatory post-marketing surveillance. Meaning the tech company doesn't just get approval in a lab, sell the software to a hospital and walk away to count their money. They have to constantly monitor the software in the real world to see if it starts developing unexpected side effects. Exactly. They are advocating for standardized ongoing audits that specifically track the AI's performance across diverse demographic groups. They want mandatory public reporting of hallucination rates. But their most profound recommendation is a concept called participatory design. This was my favorite part of the reading. Community-driven audits. It's the idea that you don't just have a room full of software engineers in Silicon Valley deciding what constitutes a good medical transcript. You actively bring the marginalized speakers into the design process. You sit down with a patient who stutters or a patient with a severe regional dialect or a psychiatric patient and you ask them directly. What does a successful, accurate, and dignified medical transcript look like for you? You design the technology to serve the complex, diverse reality of the human beings, rather than forcing the human beings to conform to the rigid limitations of the technology. That is such a powerful shift in perspective. If you want to dive deeper into these sources and read the specific papers from Nunez, Konak, and Bresson we've been referencing, you know where to go. Let's take a step back and look at the whole picture. We started today with a medical system in a state of administrative collapse, doctors drowning under 55 million visits worth of paperwork. 
We examined a highly complex, seemingly magical AI solution that promised to eliminate that burden and bring the joy back to the practice of medicine. But as we peeled back the layers, we discovered that this cutting-edge technology is actually a mirror. It is reflecting our complex cultural quirks, our Canadian politeness, and our American directness. It is reflecting our systemic biases against nonstandard speech and vulnerable populations. And it is exposing the fragile reality of our healthcare infrastructure, from psychiatric wards to remote rural clinics. It serves as a profound reminder that inserting technology into a human system rarely just solves the problem. It almost always amplifies the underlying human complexity. So what does this all mean for you the next time you walk into a doctor's office? We have to remember the ultimate goal of health care. It's not just data collection. It is the therapeutic relationship, the fundamental bond of trust and empathy between a patient and a healer. Right. And the entire marketing pitch of the AI scribe was that it would finally remove the barrier of the computer screen so the doctor could turn around, look you in the eye, and be truly deeply present with you in your moment of vulnerability. But here is the provocative thought I want to leave you with, connecting all the way back to those murky waters of human communication we talked about at the very beginning. Let's hear it. Imagine you were sitting in that exam room. The doctor is looking you in the eye. But in the back of their mind, they are secretly worrying about whether the ambient microphone on the desk is accurately translating your polite Canadian hedging... or they are stressing about whether the algorithmic model is currently misinterpreting your stutter. Or they are distracted, wondering if the machine is silently hallucinating a fake medication into your permanent chart. 
If that is what is happening in the doctor's mind, are they actually truly present with you? That is the core unresolved tension of this technology. As we rapidly invite these invisible algorithmic third parties into the most intimate, frightening, and vulnerable conversations of our entire lives, we are forced to ask ourselves a very difficult question. Will artificial intelligence eventually evolve to understand the messy, beautiful, high-context reality of what we actually mean? Or will we slowly, unconsciously, begin to change the way we speak to accommodate the limitations of the machine? Will we become the ones who are forced to adapt? Thank you for joining us on this deep dive. Keep asking questions, keep challenging the tools we build, and keep exploring the fascinating intersections of technology and humanity. We'll catch you next time.
