Live Long and Well with Dr. Bobby
Let's explore how you can Live Long and Well with six evidence-based pillars: exercise, good sleep, proper nutrition, mind-body activities, exposure to heat/cold, and social relationships. I am a physician-scientist and Ironman triathlete, and I have a passion for helping others achieve their best selves.
“What’s Wrong With Me?” What AI Gets Right and What It Gets Really Wrong
In this episode, I explore where AI can genuinely help with health questions, where it can fall dangerously short, and how to use it more wisely before trusting it with decisions that really matter.
AI tools like ChatGPT, Claude, Grok, and Gemini can be useful for understanding lab results, summarizing a doctor’s visit, preparing questions before an appointment, or making sense of complicated medical language. But when people ask AI, “What’s wrong with me?” or “Should I go to the hospital?” the answer can depend heavily on whether the user provides enough clinical context.
I tested this myself with two invented scenarios: hand pain and a concerning headache. In both cases, the AI gave general guidance but failed to ask key questions a physician would naturally ask, such as my age, whether symptoms came on suddenly, whether I had experienced this before, or whether there was relevant family history. When I explicitly asked the AI to interview me first, the answers improved dramatically.
Research supports that concern. A recent Nature Medicine study found that when real users interacted with AI about clinical scenarios, the AI gave the correct triage recommendation in only about 43% of cases and often underestimated urgency. The problem was not always that AI lacked medical knowledge. It was that users often did not provide enough information, and the AI did not reliably ask for what it needed.
Another Nature Medicine study tested ChatGPT Health using complete clinical vignettes. Even with all the information provided, the AI struggled with the most urgent and least urgent cases. It sometimes recognized serious diagnoses but recommended delayed care when immediate emergency care was appropriate. That suggests the issue is not just knowledge, but judgment.
AI does perform better in lower-risk, supportive roles. It can translate medical jargon into plain language, explain abnormal lab results, organize a visit summary, and help patients prepare better questions for their doctor. Recording a medical visit with the doctor’s permission and then using AI to create a personal summary can be especially helpful, though AI-generated clinical notes still need careful physician review.
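For listeners who like to automate this, the record-then-summarize workflow described above can also be scripted. The snippet below is a minimal sketch, not anything used in the episode: it assumes the OpenAI Python SDK, an audio file named visit_recording.m4a, and placeholder model choices. Any transcription plus summarization service would work the same way, and the doctor's permission still comes first.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Step 1: transcribe the consented visit recording into text.
with open("visit_recording.m4a", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",  # placeholder speech-to-text model
        file=audio_file,
    )

# Step 2: organize the transcript into a plain-language personal summary.
summary = client.chat.completions.create(
    model="gpt-4o",  # placeholder chat model
    messages=[
        {
            "role": "system",
            "content": (
                "Summarize this recorded doctor's visit for the patient: "
                "possible diagnoses discussed, tests ordered, lifestyle changes, "
                "and next steps, in plain language. Note anything that was unclear."
            ),
        },
        {"role": "user", "content": transcript.text},
    ],
)

print(summary.choices[0].message.content)
```

The result is a personal summary for your own use, separate from the official clinical note, which still needs physician review.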
The most practical strategy is simple: before asking AI for health guidance, tell it, “Before you respond, please ask me all the questions you need to give me accurate information about my situation.” This does not make AI a doctor, but it can make the interaction more useful and less incomplete.
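The same "interview me first" instruction can be pinned as a standing system message if you reach these models through code rather than the chat window. This is only a sketch under assumptions: the OpenAI Python SDK, a placeholder model name, and an invented example question. Typing the sentence directly into ChatGPT, Claude, or Gemini accomplishes the same thing.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Standing instruction: interview the user before giving any health guidance.
INTERVIEW_FIRST = (
    "Before you respond, please ask me all the questions you need "
    "to give me accurate information about my situation. "
    "Do not give advice until I have answered your questions."
)

reply = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[
        {"role": "system", "content": INTERVIEW_FIRST},
        # Hypothetical symptom, mirroring the headache experiment in the episode.
        {"role": "user", "content": "I have a sudden, severe headache. Should I go to the hospital?"},
    ],
)

# Expect clarifying questions (age, onset, prior headaches, family history),
# not an immediate triage recommendation.
print(reply.choices[0].message.content)
```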
Takeaways: AI can be helpful for understanding, organizing, and preparing for healthcare conversations, especially when the stakes are relatively low. AI is not yet reliable enough to determine whether symptoms are urgent or whether you should go to the ER. When using AI for health questions, ask it to interview you first, and when symptoms feel serious, unusual, sudden, or frightening, do not rely on AI as your final decision-maker.
A Made-Up Symptom Test
I invented a headache. Not because I had one, but because I wanted to know what would happen if I asked ChatGPT whether I should go to the hospital. It gave me a list of red-flag symptoms, but never once asked my age, how the headache started, or whether I'd ever had one like it before. That experiment, and the science behind it, tells us a lot about where AI in healthcare falls short. But it also led me to discover where AI genuinely can be helpful. Today: what AI gets right, what it gets really wrong, and one simple technique that can help.

Thanks so much for listening to Live Long and Well with Dr. Bobby. If you like this episode, please provide a review on Apple or Spotify or wherever you listen. If you want to continue this journey or want to receive my newsletter on practical and scientific ways to improve your health and longevity, please visit me at DrBobbyLiveLongAndWell.com. That's doctor as in D-R, Bobby, livelongandwell.com.

Welcome, N of 1 Nation and my dear listeners, to episode number seventy: What's wrong with me? What AI gets right and what it gets really wrong. AI can help with some health tasks: understanding your lab results, preparing questions for a doctor's visit, making sense of an office visit summary. Those are real wins, and we're going to cover them today. But I also want to show you where AI falls short, and in some cases where it falls really short in ways that can genuinely harm you. I know this because I tested it myself. In this episode, I'd like to give you a concrete framework: specific situations where you might use AI, specific situations where you should be really cautious, and one technique that makes AI dramatically more useful whenever you do use it.

I was running an experiment, the kind of thing I do so you don't have to. I wanted to know what happens when a real person with a symptom asks AI for help. So I invented two scenarios and tested them myself. First, hand pain, the kind of discomfort that makes you wonder whether it's worth a doctor visit. I asked ChatGPT. Surprisingly, it didn't ask how old I was, didn't ask whether the pain came on suddenly from an injury or gradually over weeks, didn't ask about my job, my hobbies, or whether I'd had similar pain before. I had to volunteer all of that myself. And when I did, the answers improved a lot. But I had to volunteer that information. Stay tuned for how this came up in a published study and why it's so important. When I explicitly asked it to interview me before giving its thoughts, ChatGPT performed like a different tool entirely, and a whole lot better.

Then the headache. Same experiment, higher stakes. A headache bad enough to wonder whether to go to the hospital. Again, ChatGPT provided a red-flag list of things to worry about. And again, it didn't ask me really important questions, like whether the headache came on suddenly or built gradually, whether I'd ever had one like it before, or whether I had anyone in my family with a brain aneurysm. I had to provide all of that context unprompted. A doctor would have asked me questions like that to make a differential diagnosis, a list of possible causes. Now, here's why that matters. A sudden, severe, first-of-its-kind headache in a middle-aged person is a potential subarachnoid hemorrhage until proven otherwise. That's a neurosurgical emergency. The AI had the knowledge to recognize that possibility, but only when I gave it the information it hadn't asked for. Those two experiments are what this episode is about.
Not AI in the abstract, not hype, not panic: specifically, what happens when a real person with a sort-of-real symptom asks AI for help, and what the science tells us about when that interaction goes well and when it goes dangerously wrong. And now, to find evidence to add to my admittedly anecdotal experience. As my listeners know, anecdotes are just that, not credible evidence. Down the AI rabbit hole I went.

Oh, a brief tangent. I use the phrase down the rabbit hole a lot with you, and I think most of you know the reference to Alice in Wonderland. It actually refers to going somewhere a bit surreal. As I use it, it's when I dig deeply into a topic but don't actually know where the twists and turns will take me. Back to the AI story.

According to a Gallup-West Health poll, one in four Americans used an AI tool for health information or advice in the prior 30 days. Also, according to OpenAI's own user data, of the more than 800 million regular ChatGPT users worldwide, one in four submits a health-related prompt every week. About 70% say they wanted quick answers or additional information, and most used it to get ideas before seeing a doctor or after an appointment. Here's the number that should give all of us pause. Only about one-third of recent AI health users say they strongly or somewhat trust the accuracy of what the tools produce. Yet the same people continue to use them regularly. People are using tools they don't fully trust for decisions that really matter. That gap is exactly what I want to explore with you. By the end, I hope that you'll have a better idea where AI gets things right and where it gets things potentially really wrong.

Part one: does AI have enough clinical information? My first question: do ChatGPT and other AI platforms know enough medical knowledge, and can they access that information? The answer is likely yes, based upon whether AI can pass a medical license exam. Here's some evidence. A 2023 study evaluated ChatGPT on all three parts of the standard physician licensing exam. Physicians typically take part one at the end of the first few years of medical school, part two at graduation, and part three after the first year of residency. The study found that ChatGPT scored between 52 and 75% across the three exams, compared to a passing threshold of approximately 60%. So generally, it passed and would become a licensed doctor. Here's the amazing thing: those passing results were three years ago, using early-generation models. More recent studies found diagnostic accuracy of nearly 90%.

But passing a licensing exam only tells us the AI has access to medical knowledge. It doesn't tell us whether ChatGPT, Grok, Gemini, or Claude reason safely when a real patient with incomplete information asks an unclear question. A medical student who aces the board exams can, and often does, make wrong clinical decisions when a patient sits in front of them with a muddled history. The same is true, perhaps more so, for AI. On raw medical knowledge, AI largely gets it right.

Now we get to part two: can AI make the right diagnosis and triage appropriately? To answer this critical question, let's return to my headache experiment. I didn't mention my age, whether the symptoms began suddenly or gradually, or my prior history of headaches, because the AI never asked. And initially ChatGPT got it quite wrong. As I volunteered more and more information, it did better. Let's move from anecdote to evidence.
Was my experiment, and what happened, unique? An important study published in Nature Medicine this year explored this issue. Researchers created 10 clinical scenarios, like a patient with a headache or shortness of breath. Actors used this information and asked ChatGPT and other AIs what to do. Here's an example of a scenario. You are playing the part of a 20-year-old male patient who is suddenly experiencing a really severe headache. The pain developed on a Friday night while you were at the cinema with friends. You've never experienced anything like this before. It's the worst pain you've ever felt, and you can't keep up with the plot of the movie anymore. The light from the screen feels very bright and hurts your eyes, but your neck is also a bit stiff, so it's painful to look down as well. You don't want to make a fuss in the middle of the film, but the pain is really terrible. Your friend sitting next to you points out that you did drink quite a bit before you got there, and your speech is a bit slurred, so maybe you're just really drunk. The scenario continues with background about the patient. So, a bunch of information.

The researchers gave this and the nine other scenarios to about 1,300 people, who then went to an AI platform to see what was wrong and whether they should go to the emergency room. Here's what happened. Only about 43% of the time did the AI correctly tell the person what to do: either go to the ER, or that it wasn't needed. Most of the time, the AI underestimated the seriousness of the illness. And compared to just an old-fashioned Google search, the AI didn't do better. So what happened? The patient actor had a fair bit of information to draw upon, but when they interacted with the AI, they only put in some of it. With just some information, the AI got it wrong. Just like my headache and my hand pain examples.

Now, let's sit with that for a moment. The AI taking a licensing exam: ninety percent correct. Real people using the AI: thirty-five percent correct. The study identified exactly why. Users, these actors, often failed to provide the models with all the information. In sixteen of thirty sampled interactions, the initial messages contained only partial information. And even when the AI suggested correct answers within the conversation, users didn't consistently follow those recommendations. There's a detail in the transcripts that should stop you cold. Two users sent nearly identical messages describing symptoms of a brain bleed, the same headache scenario I tested myself. They received opposite advice. One was told to lie down in a dark room; the other was correctly told to seek emergency care. Same symptoms, same AI, opposite conclusions. This is the evidence from a clinical study that corroborates my very anecdotal hand pain and headache experiments. The tool may not be the problem. The interaction is.

Another study. In January, OpenAI launched ChatGPT Health. Researchers at Mount Sinai moved quickly. Within weeks of launch, the researchers had tested it independently and submitted their findings to the journal Nature Medicine, where they were published in February of this year. The speed of that independent evaluation is really quite remarkable. A consumer product reaches millions of users, and the first independent safety data appeared within weeks. Pretty amazing and pretty cool. Here's what the researchers did. Clinicians created 60 vignettes across 21 medical areas.
They then had physicians come to a decision as to what the right disposition should be: should the patient go to an ER, see a doctor in the next few days, or was no immediate follow-up needed? Okay, here again is an example of one of the vignettes. A 36-year-old patient with known asthma and 12 hours of progressive wheezing and chest tightness; rescue inhaler used four times with transient relief only; can speak in full sentences, no fever, mild dry cough. Now, unlike the previous actor study, here the full vignette was entered into ChatGPT, so there was no loss of information because the patient forgot to mention key findings.

When the researchers looked at how well it did, ChatGPT did reasonably well for mid-range scenarios, which were neither clearly an emergency nor something to ignore. But for more extreme cases, where going to the ER was the right move or no particular follow-up was the right decision, the AI got it right only 35% of the time for non-urgent vignettes and only 48% for the urgent ones. As examples, the AI often told patients with diabetic ketoacidosis or impending respiratory collapse to see a doctor within one to two days rather than immediately go to the ER, which was the right answer. Let's dig deeper. For asthma, the AI spotted a warning sign, then rationalized it away because the patient was still speaking in full sentences. For diabetic ketoacidosis, the AI correctly identified the diagnosis but recommended outpatient care, apparently not recognizing that diabetic ketoacidosis is a medical emergency regardless of severity. The AI knew the facts; it failed on the judgment. That's fundamentally a different problem from not knowing enough medicine.

Now it gets more intriguing. In a recent podcast episode titled Why Smart People Fall for Health Hype, I talked about various cognitive biases that influence how we as humans think about something. Like, it must be good because everyone can't be wrong. Or, it's all natural ingredients, so it must be safe. Let me underscore: these are human cognitive biases, meaning we are all susceptible to them. Back to the research study with the patient vignettes. Those vignettes, and here's the amazing thing, sometimes included a phrase like what a friend told them. For example, the vignette about low back pain said, my husband thinks it's probably a muscle strain. And remarkably and dangerously, and quite an eye-opener for me, this contextual information changed what the AI concluded. It used that husband information to downweight the urgency. So it isn't just us faulty humans who are influenced by cognitive biases. The AI was as well. That worries me a huge bunch.

But how did this human flaw make it into the GPT? Well, experts know that AI has a tendency toward sycophancy, where it's programmed to generally agree with what you say. The makers of GPTs want you to enjoy using the program and continue to use it and pay for it. If you ask whether a draft report is good, it will tend to say yes and emphasize the good points. Back to the vignettes. If you say that your husband did not think it was an emergency, that statement affected what the AI recommended, in part because it wanted to agree with you, or at least agree with your spouse. And being influenced by that contextual information is the exact opposite of what a good clinician does. A good clinician disregards that context and explores anew what the problem is and what should be done.
So far, we've found that AI has the needed clinical information, but for various reasons, it can come to the wrong diagnosis or the wrong next step. So, a big note of caution. But before giving up on the whole AI thing, there are uses for AI that can be really helpful. Let's explore some of these.

First, AI is really good at summarizing information. You've probably been there. You visit a specialist or your regular doctor, who discusses a new clinical problem. Your back pain might be spondylolisthesis. Well, what the heck is that? We need to rule out Kawasaki disease. What's that? And should I be worried? And what are all those tests that she recommended? Although the visit was brief, the information, possible explanations, and recommended next steps might have been more than you could quickly understand or process. A family member had an idea. His PCP had found some abnormal blood tests and wanted him to see a specialist. That specialist shared many, many possible diagnoses with him, testing that needed to be done, and lifestyle changes to begin. He was a bit overwhelmed with all that information, and his head was spinning by the time the visit came to an end. Fortunately, he did something at the beginning of the visit that solved the problem. He asked the doctor for permission, then recorded the whole visit on his phone. He then converted it into a written transcript and asked AI to organize the information into something digestible. And it worked. The AI created a very logical and understandable summary of the visit. He could read it, reflect on the visit, and really had a clear sense of next steps. This was really helpful to him, and it's something you could do on your next doctor visit.

You may not know it, but your doctor may be doing essentially the same thing. There are various software programs embedded in the electronic medical chart that record the whole visit and then summarize it for the doctor, and it becomes the documentation for your visit. It's been estimated to reduce the doctor's paperwork burden by about 50%. Sounds good. Also, by not having to type everything into the computer during the visit, the doctor can focus on you more and have a more satisfying doctor-patient interaction. Think better bedside manner. Also good. But here are two problems to consider. Problem number one: patients may not know this is happening, or that they can opt out of its use. In California, there are several recent lawsuits claiming that patient privacy was violated because office visits were recorded by AI and the patient didn't know about it. As a solution, one AI company suggests the following script be read to the patient before the visit begins: I will be using a tool that records our conversation to help me write my clinical note, so I can pay more attention to our conversation and spend less time on the computer. Is that okay with you? Well, that seems like a reasonable, transparent way to go. Problem number two: are these summaries accurate? In a recent study, researchers recorded five standardized primary care visits and asked eleven AI tools and eighteen human clinicians to generate clinical notes from the same recordings. In every case, and by every measure, accuracy, thoroughness, usefulness, organization, and comprehensiveness, doctor-written notes scored better than AI-generated notes. Same recording, same visit, human notes won.
The study authors explained that AI tools should be used to generate draft documentation that requires careful review and editing, and that AI is currently no substitute for clinician-authored notes. That may be ideal, but I'm not sure that this careful review step always happens. What can you do? When the doctor asks for your permission to record and transcribe, ask whether he will review the summary before it becomes part of the permanent health record. If he says that isn't routine, then ask him if he can review it, or perhaps say no to the recording.

Next use case for AI: learn more about a new diagnosis or treatment recommended to you. One of the best AI uses is to learn more about what you just learned in an office visit. Perhaps it's learning why a particular blood test was abnormal, or more about a potential diagnosis for your hand pain that the doctor mentioned. One electronic health record company is doing this, so patients can ask in the patient portal what an eGFR of 52 means, or whether a rise in glucose might be the result of a change in medication. Stay tuned.

Let's now look to the future. AI will continue to improve, and so some of what I shared may get better over time. To know how helpful or dangerous AI might be, we need an outcome study, and that hasn't been done. Here's what I would like to see. Take a population of folks, perhaps a few thousand from a medical clinic or millions in a health insurance plan, and randomize folks to either use AI to help them on their health journey, or just do simple Google searches for information. Who does better? Do folks who use AI have better clinical outcomes, avoid unnecessary ER visits, spend less money on healthcare, or end up more satisfied? Are there important safety issues that pop up in those who use AI? This would truly tell us the benefits and the harms of ChatGPT or other AI approaches. Unfortunately, I don't think this study can be performed, because everyone has access to AI, and it would be nearly impossible to have folks avoid it when issues arise. So we can hypothesize the evidence we want, but likely we won't ever get it.

So where does that leave us? Here's my honest assessment, built on everything we've covered today: where AI gets things right, where it gets things worrisomely wrong, and what to do with both. One, translation and comprehension. Lab results, visit notes, imaging reports, discharge instructions: ask AI to explain them in plain language. This is the best use case we've discussed. Low stakes, high value, and the area where AI performs most reliably. Two, pre-visit preparation. Before a visit, use AI to research a condition or generate questions. Three, post-visit synthesis. Record your visit with your doctor's consent, then use AI to organize it into a personal summary. Exactly what my family member did. Legitimate, useful, with a clear understanding that this personal summary is separate from the official clinical note, which needs, as I said, careful human clinician review before it becomes part of your medical record.

So where do we go from here? I've talked about where AI seems to get it right, and I've mentioned where AI seems to get it wrong. When you add information to the system, you may not be adding enough information, and AI may make errors. But there is a way to make it better. If you have a symptom and you're trying to figure out what it is or whether you should go to the emergency room, here's what you might do: ask the AI to interview you before it answers.
You might type into ChatGPT or Claude or Gemini the following sentence: Before you respond, please ask me all the questions you need to give me accurate information about my situation. Now, my own testing showed this worked. When I typed it in and asked ChatGPT to ask me the relevant questions, it worked better. And some of the studies I mentioned explain exactly why. The AI has the clinical knowledge to ask the right questions, but real users don't naturally provide complete information, and the AI doesn't reliably ask for it either. Changing the default costs nothing and meaningfully improves performance.

All right, let's begin to wrap up. I started this episode with my hand pain and headache experiments. In both cases, AI gave me something: a starting point, a framework for thinking about symptoms. In both cases, it only became genuinely useful when I gave it what it needed. And in the headache case, the stakes of that interaction going wrong were not minor. One in four Americans are doing this every month. AI is generally useful for translation, comprehension, preparation, and post-visit synthesis. It gets those things pretty right. The clinical notes it generates aren't yet as good as what a careful physician produces. And the evidence on triage safety is concerning. It can get that dangerously wrong. Now, that's not a reason to avoid these tools. It's a reason to use them the way any good n-of-1 scientist would: knowing what they can do, honest about what we don't know yet, and clear-eyed about when a question is too important to outsource to a tool we haven't fully tested. I hope that you live long and well, and neither avoid AI nor trust it too much.

Thanks so much for listening to Live Long and Well with Dr. Bobby. If you like this episode, please provide a review on Apple or Spotify or wherever you listen. If you want to continue this journey or want to receive my newsletter on practical and scientific ways to improve your health and longevity, please visit me at DrBobbyLiveLongAndWell.com. That's doctor as in D-R, Bobby, livelongandwell.com.