
Dirty White Coat
Mel Herbert, MD, and the creators of EM:RAP, UCMAX, CorePendium, and the collaborators on "The Pitt" and many of the most influential medical education series present a new free podcast: “Dirty White Coat.” Join us twice a month as we dive into all things medicine—from AI to venture capital, long COVID to ketamine, RFK Jr. to Ozempic, and so much more. Created by doctors for clinicians of all levels and anyone interested in medicine, this show delivers expert insights, engaging discussions, and the humor we all desperately need more of!
AI vs. Doctors: Navigating Medicine's Future with ChatGPT and Human Expertise
Mel interviews Graham Walker of "MDCalc" fame. This episode explores the evolving role of AI in healthcare, focusing on a study that compares ChatGPT's performance to that of human doctors in managing complex medical cases. We discuss the implications of these findings, the potential for misinformation, and the future of AI integration in clinical practice.
• Examination of the BMJ study on AI vs. doctors
• Real-world application of AI in patient care
• Concerns around AI misdiagnosis and misinformation
• Future prospects of AI in healthcare settings
• Impacts of AI integration on workforce and private equity
• Human-AI collaboration as a path forward
Speaker 2:All right, so tell us who you are.
Speaker 3:I am Graham Walker. I'm an emergency physician here in sunny San Francisco, California. I'm about half-time clinical now, Mel. The other half of my time is spent doing kind of AI and tech transformation for the Permanente Medical Group, which is the medical group, you know, 10,000 physicians, for Northern California Kaiser Permanente. And then I created a website, MDCalc, used by a lot of ER doctors as well, and a new company called OffCall, trying to improve physician burnout. Entrepreneur, troublemaker, and I like talking about AI as well. That's the other new exciting thing that I think we're all trying to figure out.
Speaker 2:I like that, very humble: MDCalc, that a few people use. A handful. Dude, it is amazing what you guys have done. I've watched it from when you were babies to now. It's one of the most used things in all of medicine, not just emergency medicine. It's incredible. So, good work. Thank you. You should be very proud. Let's talk AI. I've followed you for a long time. You are a geek in the best sense of the word, and so I'm loving that you're into AI, because obviously this is the next big thing. Here on the show we've been talking about AI. We talked with some Israeli students, they were medical students when they did the study and they're now in residency, about AI and test taking, and their study found that it was better than Israeli docs at sitting exams. So this new paper came out, and it's titled "ChatGPT versus doctors on complex cases of the Swedish family medicine specialist examination: an observational comparative study." This was in BMJ Open, and the date I'm trying to find, which I cannot find.
Speaker 3:Accepted November 22nd 2024. Great.
Speaker 2:Okay, so can you tell us about this study, and then we'll get sort of your general insights about where you think this is going? Because since the last one, we did have a lot of people asking: what does this mean for medicine in the years ahead?
Speaker 3:Yeah.
Speaker 3:This paper is almost a year old at this point, Mel. It looks like it was received March 6th of 2024. And, as you know, AI is moving so quickly that even the publication lag time, which I think is improving in some of these studies, is still insufficient, right? So this used GPT-4, which you can still technically access, but ChatGPT actually won't use GPT-4 anymore; it'll now use 4o, which is a more recent, more modern model. And they changed the question they asked, which I thought was really important here. They stopped asking multiple choice questions. They stopped giving these models access to a question stem and then, you know, pick A through D as an answer choice, which the models are pretty darn good at in many specialties, including emergency medicine. Instead, they simulated what is a way more realistic scenario, which is: hey, we're going to give you a question stem and ask you, how would you manage this patient, or what would you do next?
Speaker 3:So, you know, I think there are a couple of examples; I looked in the supplementary content. One's like, I think, a four-year-old with constipation, and another an elderly patient with pneumonia where some goals of care have been discussed, but not a ton. Those are what you and I practice with every day. We have kind of an open-ended, undetermined patient. We don't know what's going on with them yet.
Speaker 3:These were graded not on a correct answer; you essentially got points in this Swedish family medicine exam for tackling a particular subject. So you might get a point for asking the four-year-old about their diet and asking a family history, and you might get another point for asking, you know, specifically about milk intake or cheese intake or whatever four-year-olds get constipated by. And so instead of there being just a right answer, these Swedish family medicine docs get graded on the comprehensiveness of their answer, and that's going to include differential diagnosis and social determinants of health and social factors.
Speaker 3:And it found that GPT-4, again a slightly older model, did worse than even average family medicine doctors and did way worse than kind of expert family medicine doctors. And to me, I highlighted this study, I liked it a lot, because this is so much more representative of what we do every day. In addition, these tools are being compared to physicians much of the time, but it takes a doctor to get the content out of the human being, the patient. You have to know the way to ask the question, to ask if they're having pleuritic chest pain. So it way more simulated a real-life patient encounter, and it felt way more face valid to me that these tools actually aren't as good as all the other headlines are saying.
Speaker 2:I think that's a really great point. They've gotten so good at exams and, of course, they've gotten good at exams because they have the entire world's knowledge at their fingertips.
Speaker 3:And they're trained on exams too, yeah. Yeah.
Speaker 2:Yeah, that's fine. Okay, so I was thinking exactly this same thing: how many times have you had a patient come in that says, my ear hurts, my hair hurts, I've got this bump over here, and yesterday I had crushing retrosternal chest pain? You're like, what? Pick one, maybe two. Yeah, yeah.
Speaker 2:And you do spend so much time trying to tease out what are the things that matter. So I'm really interested in moving this to the real-world circumstance, where maybe the AI listens to the patient and you listen to the patient, and you come up with your differential and it comes up with its differential. You know, part of me wants the humans to win. I really want to kick AI's ass. But then the other part of me is like, no, the way we get better at looking after people is if AI gets really fricking smart and can help us and work with us. So I'm sort of torn. I want the humans to win, but we need all the help we can get. If we can reduce the number of misses, that's great.
Speaker 2:Do you think that we are getting to a place with the new models? I mean, there's 4o, there's 4o mini, and there are newer models than that. I was just saying to somebody, I heard Sam Altman or somebody saying some of the newest, latest, greatest models cost about a thousand to three and a half thousand dollars per prompt, because that's how much effing electricity they use. So where are we headed? Give us your prognostication.
Speaker 3:I've heard several prognostications. One is that we are going to need medical-specific models. You're paying a price to ChatGPT because ChatGPT can answer questions about Lynch syndrome, but it can also give you a recipe for meatballs, and it can tell you about Napoleon and about Napoleon Dynamite. Because it can do all of these things, you're paying a computational and an electricity price for that. So not only would a medical model potentially be cheaper and able to run on smaller hardware, it would also potentially be more accurate, because it's not going to have knowledge about, you know, other things besides medical practice. I've heard two other things. One is that there are some rumors that, like, GPT-5 didn't go very well, there was some model collapse, and so they're having to use different techniques to get further advancements as well. And I would agree with you: I think the best of both worlds is a human that is using AI to help us make sure we're not missing stuff. You know, I want a little guardian angel on my shoulder that's kind of watching over me, and it's like, whoa, Graham, you're about to discharge somebody; did you consider they might have X or Y or Z?
Speaker 3:The challenge is, Mel, the medical training system is so good right now, and has been for many years, that most of the time the doctor is right.
Speaker 3:I mean, how often, when you see somebody with an ankle sprain, are you wrong that they have an ankle sprain, especially if you then get an x-ray that they may or may not need? Now you've kind of confirmed, yeah, you have an ankle sprain. And so my worry is that these tools, if not used properly or not used by trained medical professionals, might generate more work and more cost and more waste. Because, you know, maybe it will say, well, hey, did you consider it's not just an ankle sprain but a septic joint, or something, right? Something kind of crazy that probably, you know, is not really within the realm of the differential diagnosis. But you could imagine, if somebody doesn't know how to differentiate an ankle sprain from a septic joint, they might just listen to the LLM and say, well, we should consider this; therefore, I'm going to stick a needle in your ankle, I'm going to order all these additional tests, instead of the right thing to do, which is to reassure the patient: here's some Motrin, here's an ACE wrap.
Speaker 2:Yeah, I think that's a significant concern, particularly once we drop the lawyers in there. So if you've got AI scraping the data of the chart that's being created by you and the AI, and the AI says, well, PE is a possibility: in which patient is PE not a possibility? It's always possible. And you always see, like the medical student, the really smart medical student, like, this could be a PE, and you're like, it could be, but no, we're not going down there. The sensitivity of these LLMs could be so high that, like you said, it could actually make things a lot worse. We're trying to do less testing. We're trying to focus in on what's possible, and our miss rates are low enough on this stuff, and we don't want to go back to scanning everybody. So that is a concern, and maybe that gets fixed over time. What about in your work on MDCalc, for example? Are you finding utility for AI in helping you create that massive database that you have?
Speaker 3:We are working on some generative AI tools that are still in kind of internal development. They are not necessarily around building new scores, because generative AI is really not going to be good at that; that's more on the other side of AI, which is predictive AI, if you think of kind of two different camps. But they're around helping people find the right scores, or helping people find information in this exponentially accelerating, growing body of medical knowledge that is harder and harder to manage and understand.
Speaker 2:We are, internally, and we'll release it soon, using AI on our textbook, an LLM, GPT-4o mini, to help us with search, and it's been really good. But there's one thing one has to remember: it is trained on the internet, on all of the knowledge. So we say, just look at our textbook, and it refuses to do that.
Speaker 3:Oh interesting.
Speaker 2:It just cannot stop looking elsewhere. So we have to keep tweaking and tweaking and tweaking. And there was a great example that Mike Weinstock came up with, not on our search but on another LLM. He asked, what is a good muscle relaxant in pregnancy? And it came back with rocuronium, which is technically true, absolutely true, and also terrifying. For the non-medical people listening, rocuronium is actually a paralytic agent we use to paralyze people in order to put in the tube to breathe for them. So if you gave a pregnant woman rocuronium, yes, indeed, you would relax her muscles, and she would stop breathing and die.
Speaker 2:So we keep finding these issues where when you train on a big data set, even if you ask it to look just at a small data set, it can't help itself. It sort of jumps out of the textbook. We've asked it please again, tell us where you got this information, and sometimes it doesn't. It's like I don't know what you mean. I can't tell you where I got it, I just know it. So are the newest models any better at this hallucination stuff, or is this just intrinsic to?
Speaker 3:Both. They are intrinsic to the technology, right? These tools are just trying to predict; it's just math, and we don't see the math, they convert the math back into English. But you know, these tools are just trying to predict the best next word based on your query or your prompt.
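For a concrete picture of what "predict the best next word" means, here is a deliberately toy sketch in Python. Nothing about it reflects how ChatGPT or any real model is built: the vocabulary and scores are invented for illustration, and a real model computes its scores with a neural network over tens of thousands of tokens. The loop is the idea being described: score the candidate next words (the math), convert the scores to probabilities, pick a word (back into English), append it, and repeat.

```python
import math
import random

# Toy "language model": given the text so far, score each candidate next word.
# Real models compute these scores with a neural network; these numbers are
# made up purely for illustration.
VOCAB = ["pain", "pressure", "wall", "x-ray", "tightness", "."]


def toy_scores(context: str) -> dict:
    scores = {word: random.uniform(0.0, 1.0) for word in VOCAB}
    if context.endswith("chest"):
        scores["pain"] += 2.0  # pretend "chest pain" is the likeliest continuation
    return scores


def softmax(scores: dict) -> dict:
    # Turn raw scores into probabilities that sum to 1 ("the math").
    exps = {w: math.exp(s) for w, s in scores.items()}
    total = sum(exps.values())
    return {w: v / total for w, v in exps.items()}


def generate(prompt: str, n_words: int = 4) -> str:
    text = prompt
    for _ in range(n_words):
        probs = softmax(toy_scores(text))
        # "Predict the best next word": greedy decoding appends the most
        # probable word, then repeats with the longer context.
        text = text + " " + max(probs, key=probs.get)
    return text


print(generate("The patient reports chest"))
```

Swap the hard-coded scores for a trained neural network and, at a very high level, this same one-word-at-a-time loop is what the large models do when they answer a prompt.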
Speaker 3:So while I remember GPT-3.5 had a lot of hallucinations, and Google Gemini especially I found really guilty of hallucinating a lot, it does seem like I'm finding hallucinations less and less, especially hallucinations that I can confirm, right? I mean, I'm not seeing it come up with fake journal articles as much, even when I'm trying to trick it and force it to do that. So I think it's improved, but it's certainly not perfect. I kind of think of this, and maybe the future versions as they improve, a little bit like Wikipedia, Mel, in that anybody can technically change Wikipedia. So you could go to the Wikipedia page on appendicitis right now and someone could have edited it and changed it to say it's in the left lower quadrant, not the right lower quadrant. But the odds are it's actually unlikely that Wikipedia is wrong. And that's how I kind of think about this: you have to have a little bit of skepticism, but much of the time it is going to be right. I wouldn't call it trust but verify; I'd say consider but verify, because you can't yet say how accurate it is, but much of the time it's correct.
Speaker 3:My colleague Jonathan Chen, an informaticist down at Stanford, talks a lot about this not being hallucinations but confabulations, meaning these things don't know that they're making a mistake. There are some techniques you can do to drive down the confabulation rate by actually sending the model's response back to itself and saying, hey, does this seem correct? And again, this is the weirdness of this technology: often, if you give its response back to itself, it can say, oh no, boy, we've made a mistake, and it can self-correct. So that's one of the examples of how, behind the scenes, the tech companies are reducing the hallucination or confabulation rate, by not sending the data directly back to you.
Speaker 2:There's kind of an intermediary step that's being used and that's I think part of the reason why the time from prompt to answer is increasing, If I understand it correctly. I don't really understand the technology, but they basically are giving it more time to think. Yeah, and part of that more time to think is that internal stuff. It's like asking it that same question back to it a different way or already checking it before you even check it. Is that sort of what's happening in the background there?
Speaker 3:Yeah, and you know, there's this concept of prompting, where you kind of give it an instruction and tell it how you want it to behave. So, you know, one prompt could be respond like Donald Trump, another could be respond like Yoda, and it's going to behave differently; you're going to get a different response. And so one of the things that they are doing is, say you type in a medical question. It will then do several analyses of that, and it might kind of have multiple steps in the prompt, so it's not responding immediately. It might say, okay: number one, we've identified that this is a medical question. Number two, the question appears to be about appendicitis. Number three, I'm going to give an answer to the question. Number four, create a prompt to test it: hey, you're an emergency physician, you're receiving this prompt about appendicitis, does this seem accurate? And then, if the tool says, yep, that seemed accurate, step five is to actually send it back to the user. So that's one of the ways they're trying to get rid of hallucinations, by doing some internal self-checks.
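To make that multi-step flow a bit more concrete, here is a rough sketch of what such an answer-then-self-check pipeline could look like in Python. The call_model helper, the prompts, and the step structure are hypothetical stand-ins invented for illustration; no vendor has published exactly this, and production systems are far more elaborate. The shape is the point: classify the question, draft an answer, send the draft back to the model for a critique, revise if needed, and only then return the result.

```python
def call_model(prompt: str) -> str:
    """Placeholder for a real LLM call (for example, an HTTP request to a
    chat-completion API). A hypothetical stand-in, not any vendor's interface."""
    raise NotImplementedError("wire this up to the model of your choice")


def answer_with_self_check(user_question: str, max_revisions: int = 2) -> str:
    # Step 1: classify the question so later prompts can be domain-specific.
    topic = call_model(
        "In a few words, what topic is this question about?\n" + user_question
    )

    # Steps 2-3: draft an answer in the context of that topic.
    draft = call_model(
        f"You are answering a question about {topic}. Give a careful answer.\n"
        + user_question
    )

    # Step 4: ask the model to grade its own draft, revising if it objects.
    for _ in range(max_revisions):
        verdict = call_model(
            "You are an expert reviewer checking the answer below for accuracy.\n"
            f"Question: {user_question}\nDraft answer: {draft}\n"
            "Reply OK if it is accurate; otherwise describe the error."
        )
        if verdict.strip().upper().startswith("OK"):
            break
        draft = call_model(
            "Revise the answer below to fix the described problem.\n"
            f"Problem: {verdict}\nAnswer: {draft}"
        )

    # Step 5: only now is the (checked) answer sent back to the user.
    return draft
```

Each extra pass through call_model is another model invocation, which is one reason this kind of self-checking adds the latency Mel asks about above.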
Speaker 2:And can you explain, you mentioned it before, this idea of the collapse of the LLM? I've heard this term, and it's like it's run out of data. Is that what it means? It's like it can't get any smarter because we've given it everything, or is there something else going on?
Speaker 3:I don't know the details of it, especially because it starts to get highly technical, and I think only people that are actually CS people know this level of specificity. But yeah, it does seem like these tools require massive amounts of data to kind of, quote-unquote, learn or be trained, and to some degree they've run out of that data. Now lots of people are talking about who has the most human-generated content. Well, it's social media, right? You have all these human beings providing free content all the time. Some of that data is starting to get dirty with a bunch of AI garbage in it too. So there are certainly some tech commentators who are saying, well, we've already poisoned the well. Even if Facebook wanted to use all of its own data, they'd have to cut it off in maybe 2022 or something, before AI took over, because if you look on Facebook now, there's so much AI-generated garbage. And actually, if you click on the comments section, a lot of people think that many of those comments are just bots, which is terrifying.
Speaker 2:Yeah, so they talk about training these by having the LLMs create their own content so that they can then train on their own content. That, to me, seems like incest, and we know how that works.
Speaker 3:I agree. It also seems like it would eventually remove all creativity and all nuance, or different ways of describing the same thing. As you're training the model on itself, you know, the words that are more and more popular are going to be the ones it uses more and more frequently, and if you run this enough times, maybe you'll never see the words chest pressure again. Every single time it'll just be chest pain, and it actually will lose the ability to understand what chest pressure is, or understand that it's kind of a synonym for chest pain, because it's heard chest pain so many times that it's almost starting to forget chest pressure.
Speaker 2:The idea that these will be trained on social media is terrifying.
Speaker 2:The idea that these will be trained on social media is terrifying. So you've got the bot problem, but also you have the human problem, which is people act like complete a**holes on social media because there are no social filters. And we have this on our program all the time and we're mostly physicians and clinicians and people will say the most horrible things when they don't realize that there are humans behind this. So they'll say, mel, this program is a piece of shit. And then I'll email them like why do you think it's a piece of shit? And please don't talk to our support team like that. And they're like oh my gosh, I'm so sorry, I didn't mean to say it like that, I was just a bit tired after a night shift, but that's the kind of crap that happens on social media all the time. So it's like we're training it. If we do this and I'm sure that everybody is, and I'm sure that Elon Musk is with Grok we're training it on the most toxic part of humanity. And so what comes out of that?
Speaker 3:I could not agree more, Mel. There was a paper that just came out, I want to say a week ago. It looks like it's called "Medical large language models are vulnerable to data-poisoning attacks."
Speaker 2:And I can confirm that that came out January 8th, 2025, in Nature Medicine, the open-access version.
Speaker 3:So it's from NYU. The lead author is a medical student that I emailed, who was just like, oh my God, you're killing us. It's awesome. So there's this thing called The Pile, which is big; imagine, like, billions of documents from the internet. A lot of the large language models are trained on this, because they've all kind of agreed this is a good amount of content that you can use to create a large language model. So what these authors did is they took The Pile, and I think The Pile has, like, I don't know, a billion documents in it, right, and they added just 50,000 medical misinformation documents.
Speaker 3:A tiny amount; I mean, it's 0.001% of documents that were intentionally misinformation. So, you know, whatever: ivermectin cures cancer, vaccines don't work. They came up with some even more erroneous ones, like, I don't know, oh, beta blockers can be used for GI bleeds, whatever it is. And then they ran these models and asked them questions, and they found that with just that tiny fraction, 0.001% of the data, you could get the models to tell you, with full confidence, dangerously incorrect medical information. And I found that absolutely fascinating, because if we are training it on Twitter or X, and we have a little bit of white nationalist, Nazi stuff in there, it's going to have sentiments that no one would want treated as truthful or accurate, you know, dangerous information in there. And this was specifically on a medical model. So you could imagine, very dangerously, if doctors were using a tool like this and a bad actor wanted to cause problems or spread misinformation, you could relatively easily, by inserting just a tiny bit of poison, destroy the entire system.
Speaker 2:That's fascinating that so little data can have such a huge effect, and I think it can obviously go both ways. So the white nationalists could get in there. We saw from Google's one of their first iterations that they just were so politically correct it couldn't make a white George Washington. Yeah, that is so weird. So it works both ways ways. It seems that this is a it's kind of a terrifying revelation that if you feed it a little bit of misinformation whatever that misinformation is it seems to highlight it for some reason.
Speaker 3:Well, and you know, even the stuff that wasn't malicious was just silly. The initial version of Google Gemini was trained on a bunch of Reddit data, and, sure, there are some dark sides of Reddit, but there are also just some people who are joking and being silly and being stupid. I mean, there are tons of screenshots online of, like, Google's AI recommending you add glue to pizza, or that pregnant women should eat no more than two rocks per day. And actually, it's interesting, you can go into the Reddit threads and find the users who said this stuff in the context of making a joke or being sarcastic or something like that. But it has somehow bubbled to the top of these tools, maybe because Google said, oh hey, trust Reddit, listen highly to Reddit, maybe for a good reason. I mean, often Reddit's a good place to, you know, read about product reviews or get opinions from real human beings. But yeah, you can see there are major downsides if we don't understand this or do this correctly.
Speaker 2:You can imagine, like, somebody saying, yeah, put glue in your pizza, it makes it taste better, and I'd click on that and say, that's so funny. Click. But the AI is like, oh yeah, this is what humans must do. It's like, no, we're joking. Crazy. So what do you see? You work at Kaiser. How do you see implementing AI in the next few years? We've heard about AI scribes starting to listen in to our conversations and helping us with our charting. We've heard about AI going through your charts and doing what we said at the beginning, like, Mel, did you think about PE in this patient, or something else. Where do you see this going in the next few years?
Speaker 3:I would love to have AI scribes that not just write my note but help me with all the other stuff you have to do during an encounter, which is, like, write orders, come up with a diagnosis, add in your billing codes, add in your quality metrics junk of, oh, here's the reason the STEMI was delayed, or here's the reason I didn't give 30 per kilo of fluids for sepsis. I think those tools will be able to help us with that piece, to make it less onerous, and they'll be kind of teed up. I mean, I would love to tell a patient, with my AI scribe listening, I am worried that you have appendicitis. Ding, it's now collected, you know, suggesting appendicitis, for when I get back to my desk. And I'd like to, you know, do a CBC, a chemistry panel, give you a gram of IV Tylenol, four milligrams of IV Zofran, and do a CT scan to evaluate for appendicitis. And then I come back to my desk, I review those orders, looks good, I click sign. That saves me another 10 or 20 clicks, and maybe, you know, it adds my sepsis fallout thing to my note as well.
Speaker 3:So I think those are all pretty low-hanging fruit that are, you know, kind of what I would consider low risk. They're, you know, human in the loop, right? It's not doing something for me, it's teeing it up, and I'm still the one that's clicking sign on those orders or confirming the diagnosis. It's an annoying part of my day, but it also helps get the work done of seeing a patient in the ER or the clinic or whatever. So I think those are the easy wins. And then I think this year we'll start to see more people admitting that they're using ChatGPT to review cases and kind of deciding how they want to evaluate ChatGPT further.
Speaker 3:You know, I mentioned the Wikipedia example. The other example I've thought about is, like, maybe ChatGPT is a pretty good intern or, you know, a really good R2 or something like that, where, oh, it has some really good ideas, but you as the attending are still deciding if you should listen to those ideas or not. I always think the best type of med student, or the best type of intern, is one who wants to do too much, and then me, as the attending, I'm like, no, no, we're good, we don't need to do a lactate, we don't need to do blood cultures. I like that you're considering sepsis, but in this case let's not do that. So I like the idea of these models helping me consider all possibilities and not miss something, but then I'm obviously still the one in charge deciding, yeah, we're not going to pursue mesenteric ischemia in this patient.
Speaker 2:Well, thanks for your insights. I reserve the right to call you again when some more studies come out. It is frustrating with this literature, because of what you said at the beginning: the delay between submission and publication, and this is moving so fast that there are new models all the time. So we always feel like we're talking about stuff that's six months or a year old, and that makes it a little difficult. That's why it's nice having you sort of prognosticate about what might happen next. I was hoping that this would just give us more time with the patients, that if it can do a lot of the busy work, we could have more time with the patients. But then I thought, all of those for-profit hospitals are going to say, that's great, now let's fire three emergency physicians and you have to see 35% more patients.
Speaker 3:I'm like you. Yeah, that's my fear. In my medical group, we are physician-owned and physician-run, and I don't think that's the intent at all. The intent is actually to keep our physicians from being burnt out, and to keep them with our medical group and not leaving the group or leaving medicine. But you could imagine a private equity group or something having a particularly different opinion about that.
Speaker 2:Yeah, I think it's a real problem. That's a whole other discussion, which we're going to have with some experts soon, about private equity in medicine and whether it's destroying it. So that will be coming up soon. Graham, thank you so much for your time, and thank you for all that you've done for emergency medicine and for MDCalc, which is an amazing thing. If you haven't seen it, people, go check it out online.
Speaker 3:Come on, come on, come on, it's a stupid thing to say thank you. Mel, I'll talk to you soon.
Speaker 2:Sounds great.