Ignition by RocketTools

When AI Knows the Diagnosis But Misses the Action

Dan McCoy, MD · Season 1, Episode 7


Mount Sinai just published the first independent safety evaluation of ChatGPT Health — and the findings should change how you think about AI in healthcare.

Published in Nature Medicine, researchers ran 960 patient interactions across 21 medical specialties. What they found wasn't that ChatGPT gets medicine wrong. It's that it gets the diagnosis right, then tells you to do the wrong thing about it.

In this episode, we break down:

  • Why ChatGPT told patients to wait in over half of true emergencies — after correctly identifying the danger in its own explanation
  • The inverted suicide crisis alerts that fired for sadness but went silent when patients described specific plans for self-harm
  • The sycophancy problem: why ChatGPT is 12x more likely to agree when you downplay your own symptoms
  • Where ChatGPT actually performs well (93% in semi-urgent cases) — and why that makes the failures harder to spot
  • What this means for anyone using, building, or recommending AI health tools

Sources & Links:

Primary Study — Nature Medicine, Feb 2026
https://doi.org/10.1038/s41591-026-04297-7

Mount Sinai Press Release
https://www.mountsinai.org/about/newsroom/2026/research-identifies-blind-spots-in-ai-medical-triage

Forbes: "ChatGPT Provided Wrong Advice In Over 50% Medical Emergencies Tested"
https://www.forbes.com/sites/brucelee/2026/03/08/chatgpt-provided-wrong-advice-in-over-50-medical-emergencies-tested/

NPR: "ChatGPT might give you bad medical advice, studies warn"
https://www.nhpr.org/2026-03-11/chatgpt-might-give-you-bad-medical-advice-studies-warn

Related: AI Chatbots and Medical Misinformation — Communications Medicine, 2025
https://doi.org/10.1038/s43856-025-01021-3

Full research brief and deep dive on Substack:
danmccoymd.substack.com

Transcript

40 million people a day are asking ChatGPT for medical advice, and Mount Sinai just ran the first independent safety evaluation of this thing. Published in Nature Medicine last month: 960 patient interactions, 21 specialties, three independent physicians grading every answer. The headline finding: ChatGPT gets the diagnosis right, then tells you to do the wrong thing about it. Which, if you think about it, is worse than getting it wrong entirely, because you read the explanation, it sounds smart, and you trust the recommendation. Except the recommendation might kill you.

Here's what I mean. In an asthma scenario, ChatGPT identified elevated CO2 levels, an early sign of respiratory failure, wrote it right there in the explanation, then recommended the patient wait and observe. The correct answer was the ER, immediately. How about diabetic ketoacidosis? It flagged the elevated blood sugar and the ketones, then recommended home observation. The correct answer, again, was the ER. In over half of true emergencies, 51.6%, ChatGPT told patients to wait 24 to 48 hours. Not because it missed the signs; it described the signs correctly. It just couldn't connect the knowledge to the action, which is a bit like a smoke detector that correctly identifies the fire and then suggests you make tea.

But here's the finding that should genuinely concern you. ChatGPT has built-in suicide crisis alerts, prompts that are supposed to direct at-risk users to the 988 lifeline. The researchers found that these alerts fired more reliably when someone expressed general sadness than when someone described a specific plan to hurt themselves. The safety system was inverted relative to the clinical risk: the 988 alert triggered in only four of the 14 relevant scenarios. Mount Sinai's chief AI officer said it plainly: when someone describes exactly how they would hurt themselves, that is a sign of more danger, not less. And the system got that backwards.

Now, the common reaction here is, well, ban it, shut it down, AI can't do health. But the data is more interesting than that. Semi-urgent cases? 93% accuracy. Textbook emergencies like stroke and anaphylaxis were nearly perfect. It fails in the gray zone, the ambiguous cases where clinical judgment matters most. And when patients downplayed their own symptoms, ChatGPT was 12 times more likely to agree with them. That's sycophancy, and it's architecturally embedded in how these models are trained. The problem isn't that AI can't help with health. It's that the failure mode is invisible: the explanation reads like it came from a physician, and the action plan reads like it came from a chatbot trying not to upset you.

The question isn't whether AI should be used in healthcare. That ship has sailed; 40 million people a day have already decided. The question is whether anyone is going to test these systems independently before the next 40 million sign up.

I put together a full research brief on the study: every data point, every source, the complete methodology breakdown. You'll find it linked in the article below, and if you're listening to the podcast version, the same link is in the show notes. If this is the kind of analysis you find useful, subscribe so you don't miss the next one. Until next time.