Mystery AI Hype Theater 3000

Episode 9: Call the AI Quack Doctor, February 17, 2023

August 08, 2023 Episode 9

Should the mathy-maths be telling doctors what might be wrong with you? And can they actually help train medical professionals to treat human patients? Alex and Emily discuss the not-so-real medical and healthcare applications of ChatGPT and other large language models.

Plus another round of fresh AI hell, featuring "charisma as a service," and other assorted reasons to tear your hair out.

This episode was first recorded on February 17th of 2023.

Watch the video of this episode on PeerTube.

References:
Glass.ai makes “diagnosis machine”:
https://twitter.com/AiBreakfast/status/1620128621821317125?t=Q6tTAOcGAoFJ3Ko9m4EC9g&s=19

Percy Liang claims 'PubMedGPT' can pass medical exams:
https://crfm.stanford.edu/2022/12/15/pubmedgpt.html
https://twitter.com/percyliang/status/1603469265583353856?s=20&t=SdWeINzUw92pbkTO8OAVqQ

Emily's reaction to the above:
https://twitter.com/emilymbender/status/1603766381807570944?s=20&t=SdWeINzUw92pbkTO8OAVqQ

ChatGPT gets 60 percent of questions right in US Medical Licensing Exam:
https://healthitanalytics.com/news/chatgpt-passes-us-medical-licensing-exam-without-clinician-input

An Apple Watch error is clogging up 911 lines:
https://www.nytimes.com/2023/02/03/health/apple-watch-911-emergency-call.html

ChatGPT-assisted diagnosis: Is the future suddenly here?
https://www.statnews.com/2023/02/13/chatgpt-assisted-diagnosis/

NVIDIA “eye contact” demo:
https://twitter.com/Jousefm2/status/1616878021280993284

“Theory of mind”:
https://twitter.com/LChoshen/status/1623575423652139015?t=Ohc9tzB09pAEddAReLc6mA&s=09


You can check out future livestreams at https://twitch.tv/DAIR_Institute.


Follow us!

Emily

Alex

Music by Toby Menon.
Artwork by Naomi Pleasure-Park.
Production by Christie Taylor.

Transcript

ALEX HANNA: Welcome everyone!...to Mystery AI Hype Theater 3000, where we seek catharsis in this age of AI hype! We find the worst of it and pop it with the sharpest needles we can find.

EMILY M. BENDER: Along the way, we learn to always read the footnotes. And each time we think we’ve reached peak AI hype -- the summit of bullshit mountain -- we discover there’s worse to come.

I’m Emily M. Bender, a professor of linguistics at the University of Washington.

ALEX HANNA: And I’m Alex Hanna, director of research for the Distributed AI Research Institute.

This is episode 9, which we first recorded on February 17th of 2023. And we’re here to tear down all the hype about applications for Large Language Models in medicine! From AI that can supposedly pass medical exams…to some of the weird diagnoses these models might offer patients.

EMILY M. BENDER: Look out for one of our favorite phrases - construct validity. That’ll be coming back as we continue to question whether the tests we apply to AI actually reflect any model’s capacity to do the things hype-ers claim it can.

ALEX HANNA: Welcome to Mystery AI Hype Theater 3000, and I'm going to resample this to put in at all the times when I'm supposed to be talking. [Laughter] My name is Alex Hanna, I'm the Director of Research at the Distributed AI Research Institute and yeah.

EMILY M. BENDER And I'm Emily Bender, Professor of Linguistics at the University of   Washington and you just heard our brand new theme music written by the amazing composer Toby Menon and I want you all to know that I know how to get in touch with him. So if you are looking for something composed for your own Twitch stream or podcast or just you know listening pleasure let me know and I can put you in touch.  

He is an amazing I would say young-ish composer.  

ALEX HANNA: Amazing. Uh big fans in the chat. Uh all right so why am I wearing this stethoscope-- 

EMILY M. BENDER And my fake stethoscope. 

ALEX HANNA: --that I can't can't actually put in my ears. Oh it just has enforced everything, oh I actually can hear it now. Let me see, am I alive? Let me double check. Okay I'm actually putting this stethoscope on my heart. That's because we're talking about health today and we're talking about the uses of generative models in health, of which there's been a lot of surprising uh hype in this area over the past few weeks. Uh so we got a lot to talk about.

EMILY M. BENDER We've got a lot to talk about. I've shared my screen, can you see it?

ALEX HANNA: Yes, we can.

EMILY M. BENDER All right. 

ALEX HANNA: Um and I I should say I'm going to take this time to to honk my little bone of--health bona fides. Um is that my--I um, uh weirdly, in in the depths of the pandemic, decided to get my EMT license. I have never practiced being an EMT um but I have passed a test, which apparently according to all these things is sufficient for for diagnosis and and actually do doing something here and treating real human patients. Well let's talk about--what's this? 

EMILY M. BENDER Yeah we need to talk a little bit about construct validity. So you've passed a test. You, a person, a skilled person, did some training experience and then there was a test that was built by other people to evaluate whether or not you um absorbed enough from that experience. But it's not directly assessing what's in your brain, right? It is it is a way for you to show some of the knowledge and maybe skills depending on how the test went.  

Um that uh would give people some confidence that you are you know licensable or ready to go on to the next level of training or whatever it is, right? 

ALEX HANNA: Right. And I mean the yeah exactly I mean it's it's a it's it's it's some kind of a a bar that's supposed to be for admission here, and it's you know as a--there's been massive amounts of work on testing by--you know with the rise of psychometricians, you know for better for worse but you know that's the sort of field that kind of came up with the whole concept of construct--you know construct validity and assessing you know what is in one's brain and and kind of these fields of evaluation. 

EMILY M. BENDER Yeah, yeah, exactly, and so we're taking these tests, which are you know either just directly in the kind of text format that one of these large language models can manipulate, or are turned into that text format depending on which one of these you're looking at, and saying hey look the large language model can get the right answer at the percentage of the time that would be passing for a human, and guess what that's not the same thing. That doesn't tell us that the large language model has acquired the skills that would allow it to go be an EMT.

Can you imagine? Hey ChatGPT come over here this person needs CPR. 

ALEX HANNA: Right, well I mean, CPR's not done in a lot of the cases but more in a sense it's sort of like you know you need to do basic kinds of first aid, you basically need but doing any kind of a field assessment is incredibly--you know is incredibly different--difficult, it's kind of seeing different types of presenting diagnostics. And given that it's sort of language, I mean doing that kind of live, in an emergency situation, you know it's very, um you know it's very contextual right.

So let's let's describe what we're looking at. So we're looking at this this tweet from AI Breakfast and it says--

EMILY M. BENDER Because we all need some AI hype for breakfast, clearly. 

ALEX HANNA: Oh my gosh it's too early. I can't, I can't. But unfortunately we're doing this to to get some you know get some more folks from Europe watching. Um so it says, 'Last week: ChatGPT passes U.S. Medical Licensing Exam. Today: GPT's medical knowledge is distributed into a smooth UI.' And this is uh hyping for this this particular company startup, 'Glass AI generates a differential diagnosis or clinical plan based on a problem representation.' And, 'Try Glass AI.'

EMILY M. BENDER Yeah.

ALEX HANNA: So that it's not here--yeah, they've got a demo here and going into the demo it says you--first off this gave me a bit of a chuckle because when you click into it there's a pop-up here and uh it says, 'This product should not be used by a general audience and does not generate general medical advice. This product should only be used by clinicians, including physicians, physicians' assistants, nurse practitioners, pharmacists, and clinicians in training.'

And I find this idea of like these things as being teaching tools which is like a very weird framing, as if giving an answer should be considered sort of a teaching tool rather than something that people become over-reliant on in terms of automation bias. Next time when we meet, this is previewing, we're talking with a legal scholar and I think there was--there's this case I think we were talking about in which there was a judge, I think in Colombia, that actually just used a ChatGPT-generated uh um like uh decision.

So to frame it as something for training, or as sort of a way to assess expert knowledge, you're also, you know, you're also not considering kind of the political economy and the types of constraints put on work that are going to force people to use these tools as shortcuts as their workload goes up, and it just ends up generating and perpetuating uh all kinds of nonsense.

EMILY M. BENDER And if this were training then it would have to be ensconced in some sort of theory of how this is an effective teaching tool.

ALEX HANNA: Yes, right. 

EMILY M. BENDER And it seems like all these things when it's framed as oh this is something people can use for teaching assumes that teaching is just about accessing facts. Never mind that this isn't even facts because it's, you know, made up sequences of words, right? 

ALEX HANNA: Yeah, yeah. I mean yeah this is--I would love for folks that claim these things are for training to take you know one class on pedagogy you know that talks about how learners learn and you know what mental modeling scaffolding does. 

This is doing none of it, right. This is saying you give an input and I'll give it--give an answer and that'll get you know it's it's and and so--it's frustrating. 

EMILY M. BENDER Another thing that's interesting to me about this disclaimer here is this you know, um, by checking this box you attest that you are a clinician and that you accept our terms of service, um--

ALEX HANNA: Yeah. 

EMILY M. BENDER We see so much sort of just like cowboy stuff going on when people are are putting up tools and claiming they can do things and medical care is a very regulated part of our society and--  

ALEX HANNA: One of the most regulated, yeah. 

EMILY M. BENDER Yeah. For a good reason. And so this company is sort of saying okay we  know that we can't be putting this out there as if it were providing medical advice to the general  public and so we're just going to put it on you, the person accessing this, to say yeah yeah I'm I'm gonna attest that I'm a clinician and therefore I should have access to this. Um seems pretty flimsy and also I should say um inconsistent with the way it's being hyped. Now AI Breakfast is not Glass AI, um but this tweet is still up. Glass AI did not ask for it to be taken down. It's been up  for more than two weeks. So-- 

ALEX HANNA: Yeah. 

EMILY M. BENDER --you know this is, they aren't they aren't trying too hard to um keep this out of the hands of the general public.

ALEX HANNA: Right. 

EMILY M. BENDER All right so Alex on the strength of your  EMT license should I click this? 

ALEX HANNA: Click the box. I I can show you my licensure, put it on camera. I thought about bringing it out here but you're gonna have to take it on faith that you know, me having a stethoscope at you know my disposal makes me uh you know uh have my bona fides.  

Um so you enter a diagnostic problem here, the way that we're looking at this, it says um, 'Enter a diagnostic problem represented below to generate a DDX,' so a diagnostic um uh or a clinical plan. I I type something in here um, you know as a sort of an EMT case, um but I I actually want to be a little um I want to be a little more cagey here.

So I tried to type something in--first off you you have to give it a gender and I wanted to try what it would do with a non-binary gender um because so much of so much at least of emergency medical training is very gendered. Um you know any kind of thing with abdominal pain has to be assumed to be a pregnancy if it's a it's a--if the person is a female. Um and I I you you only get one shot at this because I don't think--it doesn't let you do more than one. Um but I would think how like let's think of something. So try the--try typing this in Emily.

EMILY M. BENDER Okay. 

ALEX HANNA: Um 37-year-old woman um presents with um--- 

EMILY M. BENDER Notice that the buttons just lit up. 

ALEX HANNA: Yeah yeah it presents yeah just do it.  Um presents with acute uh abdominal pain. Um and then period uh um um you know patient uh uh it you know like I don't know I'm just trying to I don't know what what to put in a--patient is diaphoretic, uh which means they're sweaty. 

EMILY M. BENDER Is that how you spell this, diaphoretic?

ALEX HANNA: Um um and um and is pro--you know and it is um uh and has let's see um uh tachycardia, I'm just putting in words here, which just means they have a high heart rate. All right generate this diagnosis and see what what happens. 

EMILY M. BENDER All right generating DDX which I think is differential diagnosis. 

ALEX HANNA: Differential diagnosis yeah. 

EMILY M. BENDER Yeah um it's thinking pretty hard.   

ALEX HANNA: It's thinking it's thinking it's thinking um. 

EMILY M. BENDER And I noticed that we still have the same disclaimer. 'This product is not intended by use--for use by a general audience and does not generate medical advice.'  Yes yes yes. You know how to CYA. Okay we have acute appendicitis or acute cholecystitis um.

ALEX HANNA: I don't know what cholecystitis is. 

EMILY M. BENDER Um yeah pancreatitis, small bowel obstruction-- 

ALEX HANNA: --Or gastro and--enteritis yeah um so that it gives you so it gives you a list of what these could be um. So going up um so acute pain uh so acute appendicitis it's also just saying acute in all of these, which I'm not you know like I'm not a uh in any kind of way to be a diagnostic, kind of it's not within the um kind of um uh realm of of an EMT kind of with basic state training in California to make diagnostics. You can sort of suspect things and then the the ER doctor then has to make the diagnostic, but it's putting acute in everything so the first one reads, 'This is most this is the most likely diagnosis due to the patient's age, the acute onset of abdominal pain, diaphoresis, and tachycardia.

The classic presentation of appendicitis is abdominal pain that begins near the navel and then migrates to the right lower quadrant. The pain is usually accompanied by anorexia--' Accompanied by anorexia?

EMILY M. BENDER What? 

ALEX HANNA: Isn't that a chronic condition? '--nausea, vomiting, and fever. Diaphoresis and tachycardia are also common symptoms.' 

EMILY M. BENDER All right so up until you got to that point I was thinking okay maybe this is actually pulling out sort of canned definitions of each of these things rather than generating afresh, but like anorexia is a very weird thing there-- 

ALEX HANNA: That's a curious one. If you're in the comments in the chat and you're a medical professional--I'm not I'm not a medical professional that can diagnose anorexia or associate appendicitis with anorexia.

My understanding is that that's a chronic condition but to say that it is presenting as something that would be the basis of a diagnosis is a little weird. Um but yeah, back me up in the comments.

EMILY M. BENDER Yeah, 'The pain is usually accompanied by anorexia, nausea vomiting and fever.' So uh 'pain is accompanied by' makes sense with nausea, vomiting, and fever for sure um but yeah wait you have appendicitis and that causes the onset of anorexia?

ALEX HANNA: I I think it's I think it's the other way where the causality is anorexia. 

EMILY M. BENDER Yeah. 

ALEX HANNA: But anorexia is sort of um not a chronic condition, maybe much much earlier, I'm not sure. Uh the other definitions pretty much are uh um pretty much are verbatim. The only difference is that it's the acute cholecystitis which I'm assuming has something to do with the liver because the only additional thing is jaundice, and it talks about a right upper quadrant um pain which is the the location uh of the of the of the liver. Um small bowel obstruction is pretty much word for word, and constipation is the only additional one and guessing--

EMILY M. BENDER So I feel like we're maybe giving this fake text a little bit too much attention here. 

ALEX HANNA: Yeah. 

EMILY M. BENDER Trying to figure out what's it thinking--it's not right? 

ALEX HANNA: Yeah. 

EMILY M. BENDER Yeah. Okay so so this is presented as what? What does Glass AI say that it is? 'Frictionless software for learning and practicing medicine.' That's frightening. So create Glass account--no I don't think so. 

ALEX HANNA: Can you generate a clinical plan because I want to see what this says. 

EMILY M. BENDER Let's do that but then also we've got a bunch of things here. 

ALEX HANNA: Yeah we've done a lot of things. Um I also just--if I was writing this you know it wouldn't be a really--I'm thinking about my own experience studying for the EMT exam, and I don't think it would be super helpful. I think it would have sort of you know like it would have not really told me like why I could get there. Um and it wouldn't really you know there's and I mean it's not a straight A for B kind of learning.  

I mean you have to kind of consider these different things and also trying to understand what different kinds of presentations um they lead to. 

EMILY M. BENDER And if you're using this appropriately you would have to always have some uncertainty. Like it might have bad information somewhere, it might have just made something up and so if you're trying to take it in but always sort of like keeping some distance because it might be wrong, or you don't know where it really came from, that actually makes it much harder to learn. Like skepticism is good, critical analysis is good, but um the sort of being on your guard for a random thing that pops in because the system made something up is just I think going to detract from the learning experience.

ALEX HANNA: Yeah, yeah. 

EMILY M. BENDER Yeah. 

ALEX HANNA: Now this is actually--now this the clinic the clinical plan is even more uh shocking. It says, 'A 34-year-old woman presents with acute abdominal pain, she is diaphoretic and has tachycardia.  

These symptoms suggest a possible abdominal aortic aneurysm.' First off, that's terrifying. Like if there's an aortic aneurysm, it's probably not going to be in the abdomen, it's probably going to be higher. But then it but it has so so it doesn't have the five--it has aortic aneurysm is the first one and diverticulitis as the last one which is not--which are both um--aortic aneurysm is when there's uh the aorta can rupture and diverticulitis is when um, oh gosh I'm rusty on this but I think it's actually a thing where the presents in that um in the--it's a heart condition? Um.

EMILY M. BENDER Okay but maybe more than we need to know? 

ALEX HANNA: But I'm just very like you know it's just it's just very you know and then--and then okay this is actually pretty concerning here. I just want to say this I know we can move on. The the treatment here for for freaking either aortic aneurysm or appendicitis, is pain medications, IV fluids and antibiotics? 

No honey, you gotta take the fucking appendix out or you need to get and do open heart surgery for an aortic aneurysm. You don't, like that is--you can't really wait on that.

EMILY M. BENDER No, no exactly and but it's also really striking as you note, that the list of possibilities in the clinical plan doesn't actually match the differential diagnosis, so it's giving us off of the same prompt inconsistent things. Um yeah. 

ALEX HANNA: Okay, let's move on. 

EMILY M. BENDER I just have to--this is from the chat, this is hilarious. So so first of all M points out that "frictionless" is actually not what you want in healthcare or healthcare training. You don't want you don't want someone just sort of sliding into being a doctor, that's not right. 

ALEX HANNA: Yeah. 

EMILY M. BENDER But also Lilith says, in quotes, 'Our stochastic chatbot digests WebMD, it gives you information. By the way, it's super cancer.' 

[ Laughter ] Okay, all right. So. 

ALEX HANNA: I want to say something, one other chat. VChelpyGina says, 'I did one for 200-year-old man presenting with a thirst for blood.' So acute lycanthropy. I'm not--I didn't spell that correctly but yeah. I mean you might you may or may not be a werewolf or or a vampire. 

EMILY M. BENDER Yeah. Nice. The most likely diagnosis due to the patient's age. 

ALEX HANNA: Yes. 

EMILY M. BENDER Okay.

ALEX HANNA: They are an ethereal being. 

EMILY M. BENDER Yeah. So so this is this is frightening. So and this isn't this isn't some academic like hey we're overselling what we built right. This is this is a company that is selling a product.  

ALEX HANNA: A health company, no less. For learning and practicing medicine, incredible. 

EMILY M. BENDER Yeah. All right um a little bit earlier, this is from December, um Percy Liang says, um this is an announcement from CRFM, and to remind folks that is the Center for Research on Foundation Models at Stanford. They announce 'PubMedGPT, a new 2.7 billion parameter language model that achieves a new state-of-the-art on the U.S. medical licensing exam. The recipe is simple: a standard Transformer trained from scratch on PubMed (from The Pile) on MosaicML Cloud, then fine-tuned for the QA task.'

ALEX HANNA: Can I first-- 

EMILY M. BENDER Yeah? 

ALEX HANNA: --just beef that it's--this is the um--that the U.S. medical exam has become an ML task. And that we like are bragging about state-of-the-art or SOTA on this as a task. I mean this is--I think you tweeted about this earlier this week, Emily, but it's this real colonization of CS, of taking things from--you know, presenting problems and presenting solutions.

Um as as doing it in such a uh such a you know like presenting solutions in such a unilateral way. And I'm so annoyed that 'SOTA on the U.S. medical licensing exam' is a phrase, yeah yeah and um M in the chat says, 'Can't wait for SOTA on English Lit PhD quals.' Yeah please tell us about the canon of of uh you know how Mary Shelley's Frankenstein established a new uh tradition, and I I don't know I I started listening to My Gothic Dissertation uh yesterday and so that's top of mind.

EMILY M. BENDER Nice, nice. Yeah so that's that's really frustrating. That this medical licensing exam, which exists you know as part of the process of training the medical workforce, which is really important, somehow in the eyes of computer scientists is hey look, there's a benchmark that we can try to SOTA on. And it's like that's not what it's for.

And my reaction to this when I saw this back in December was okay what is PubMedGPT actually for and why is the U.S. medical licensing exam actually a good test to show that it would be effective at this thing that it's for? Well if you scroll down a bit here um, let's see um okay, usual disclaimer, 'PubMedGPT,' this is still Percy, 'is also capable of generation but like most LMs it will fabricate content, so don't trust it.' I'm like which are the LMs that don't fabricate content? Anyway.

ALEX HANNA: All of them.  

EMILY M. BENDER Um yeah. 'This is a pressing area for LM research and we hope that the release of this model can help researchers evaluate and improve the reliability of generation.' Your system is designed to make shit up. Right. That--it is it is not designed to be reliable so that  that whole research area just drives--it's like no, if your point is to generate reliable information  you have to start with actual information, and not just distributions of word forms. So.  

Um but then this next tweet was the one that that really sort of put the stake through it, speaking  of vampires. Um, 'We hope that PubMedGPT can serve as a foundation model for biomedical researchers.  

Can it be adapted fruitfully for tasks such as medical text simplification, information retrieval, and knowledge completion? There's a lot more to do.' So what this says to me is that it's not actually for anything and that's the problem with the whole foundation model conception. You can't evaluate them because they don't actually have a purpose. 

ALEX HANNA: Yeah. It's always sort of a a model searching for a problem, you know. It's--in not thinking about how much work this is going to put on clinicians, how much work this is going to lay on people. I think some people in the chat--although everything in the chat is now focused on diagnosing um uh fictional uh mythical characters, which I love. And maybe we can do a whole session on on on medical diagnosis of you know elves and and um and you know half  orcs and tieflings and things. But um, I've also been watching a lot of um uh D&D uh livestreams.

Um but um but imagine what the ecology of generating all this all this fake information does for the onus of real, working clinicians and the way that clinicians who, especially in places which are not the Global North, are put under intense pressure to make diagnoses um. You know this is this is like incredibly I mean incredibly dangerous. Could you think about some you know um kind of half-witted uh uh uh a well-meaning Westerner going and applying this to a public health situation and and and and and uh in sub-Saharan Africa or--

EMILY M. BENDER Because the poor folks there they don't have anything, we have to go help them with this completely useless thing. 

ALEX HANNA: Yeah we could generate diagnoses for them and and do everything and it's just I'm I just um you know thinking about the implications and especially in the foundation model framework is just so short-sighted and thinking about what how this is going to be. So you are sort of artificially making for yourself a research area that's of course going to serve you very well but it is going to make it so much harder for people actually doing public health or actually doing meta--actually doing emergency medicine or actually doing some kind of harm mitigation in communities.

EMILY M. BENDER And there's just one more thing I want to come back to in this tweet. So 'tasks such as medical text simplification,' so that would be you know taking something written for a specialist audience and maybe rephrasing and summarizing it in a way that would be accessible for patients. I understand the need for that. I worry about misinformation being introduced in that step, right, is that is that the kind of task where 95 percent accuracy is good enough? Probably not.

Um but like I understand how that task makes sense in this context. 'Information retrieval,' I think that's a solid use case of some technology, I'm not saying it's good for this technology, but you know you've got a big pile of PubMed articles and you want to find the ones that are relevant. That's a that's a a reasonable task to be approaching. And then the third thing is 'knowledge completion.' Is that a phrase that you know Alex? 

ALEX HANNA: I don't know what knowledge completion is supposed to mean. Is that supposed to mean uh just like new discoveries? Completing--completing my sentences? I don't know. 

EMILY M. BENDER Yeah. 

ALEX HANNA: Is this supposed to be like autocomplete but what--what is what is knowledge completion? 

EMILY M. BENDER I am really concerned here.  

ALEX HANNA: What what epistemology does one have to and you know uh kind of just kind of think of, that there's like a knowledge completion? I guess you're like just like a hyper hyper positivist that is like there is a finite amount of knowledge in the world and we need to complete it, we need to fill in our progress bar. Um. 

EMILY M. BENDER Yeah. A finite amount of knowledge in the world that we need to complete and that knowledge is all well-represented as strings of probably English text and so it's only a question of finding that set of strings. 

ALEX HANNA: Yeah just what kind of yeah just I I--I'd like to think that um the uh the initial positivists are um what is that guy, Comte? I don't know. But he was a--you know he liked to fancy himself, and this is a person that we learn about in uh the history of sociology, but wanted to create like a physics of the social and so it's almost like yeah, we'll we'll complete that, we'll complete this, sort of. And it's what a what a what a bizarre view of the future.

EMILY M. BENDER Absolutely. All right so we are we're a half an hour in and I still got a bunch of stuff-- 

ALEX HANNA: Oh my God.  

EMILY M. BENDER This next tab is actually-- 

ALEX HANNA: We're not even through all the fresh hell--or we're not even to the fresh hell segment.

EMILY M. BENDER Yeah we're still in the main course. 

ALEX HANNA: We're in the main course, yeah. 

EMILY M. BENDER Um so this is this is the uh Stanford Human-Centered Artificial Intelligence blog post about PubMedGPT um and notice that it's--the title is, 'PubMedGPT 2.7B.' You got it gotta brag about the size of the thing. 

ALEX HANNA: Yeah you always got to do size.  

EMILY M. BENDER Um and I don't know that we really need to give this one that much attention um but it just might be worth pointing out um, 'This GPT-style model can achieve strong results on a variety of biomedical NLP tasks, including a new state-of-the-art performance of--' You ready for it? 50.3 percent accuracy. That's really really impressive. '--on the MedQA biomedical question-answering task.' Uh model architecture, training.

ALEX HANNA: Incredible. I'd love to have a doctor that's 50 percent right. 

EMILY M. BENDER Yeah on their tests, right?

ALEX HANNA: On their tests.  

EMILY M. BENDER You know actual diagnosis--this is another thing we I think we were talking about before, how annoying it is, the way machine learning uses the word 'predict,' right, which in its core meaning is really talking about what will happen in the future and is not known yet. But but usually it's like we're just trying to re-label some test data that's had the label stripped from it.

Um and if you talk about the actual work of doctoring, um you're working with a whole lot of uncertainty. And trying to understand what's going on and what it's going to mean for the future, and so yeah, doctors are not always going to be right but I would like them to be more than 50 percent right on the standardized tests they take. 

ALEX HANNA: Yes. 

EMILY M. BENDER So yeah all right maybe we should move on from this one, still got another topic here. Um oh well this is just um 

ALEX HANNA: It's just it's it's this this is just on ChatGPT.  

EMILY M. BENDER Right so this is this is not the Stanford thing. So someone else takes  ChatGPT, which we should remind everybody we do not know about the training data for, right. Um and uh it passes a medical licensing exam, I like this, 'Without clinician input.' Like because you would expect it to have clinician input? Like I don't I don't understand that.

Um and uh let's see what was I reading in this one? Um so subhead is, 'ChatGPT achieved 60 percent accuracy on the US Medical Licensing Exam, indicating its potential in advancing artificial intelligence-assisted medical education.' So again we're aiming for the education angle without I think any theory of how this would actually be used pedagogically, um. 

ALEX HANNA: Yeah. 

EMILY M. BENDER And also it's all about potential with the AI stuff isn't it. 

ALEX HANNA: Yeah well this is curious because going down here I mean these researchers are associated with the medicine General--Massachusetts General Hospital um. To care for in a tech--in this other company that focuses on chronic respiratory disease patients um again using this uh using the medical licensing exam as a use case, but I'm curious because I'd love to hear--I don't know if this article talks about it--how these researchers from this hospital, ostensibly a clinical setting, are thinking about what this means in terms of training.

Um yeah I mean go ahead and click on--I mean I'm wondering--  

EMILY M. BENDER Let's take a look at the study but I don't um oh okay click through. Yes yes yes cookies, save. Um so um. 'We evaluated the performance of a large language model called ChatGPT on the United States medical licensing exam, which consists of three exams, step 1, step 2-CK, and step 3, and ChatGPT performed at or near the passing threshold for all three exams without any specialized training or reinforcement. Additionally ChatGPT demonstrated a high level of concordance and insight in its explanations,' that's terrifying that they're seeing that as insight, 'these results suggest that large language models may have the potential to assist with medical education and potentially clinical decision making.'

Um so there was something in here that I wanted to pull out um nope where did that go? Sorry something came and went while I was reading that but okay so let's just search for education in this. 

ALEX HANNA: Yeah, whatever this is in, the discussion or anything.  

EMILY M. BENDER 'May potentially assist human learners in a medical education setting.' 

ALEX HANNA: And right down to the citations.

EMILY M. BENDER Okay.

ALEX HANNA: Okay. Discussion, 'can assist,' uh is there anything else? 

EMILY M. BENDER Yeah I'm just sort of like, each time it says education is like well, this could be used. 

ALEX HANNA: Yeah, it could be used but okay here's the section. 

EMILY M. BENDER Okay.  

ALEX HANNA: 'We assess the--so we also assessed the ability of ChatGPT to assist the human learning process and the target audience, a second-year medical student preparing for USMLE step 1, as a proxy for a metric of helpfulness, we assist--we assessed the concordance and insight offered by the AI explanation outputs.'

So okay so they so they didn't do like a learner exam. They basically looked at its self-explanations. Um so, 'ChatGPT responses were highly concordant such that the human learner could easily follow the internal language, logic, and directionality of relationships contained within the explanation text. E.g. adrenal hype--hyper' and hyper is in italics here, 'hypercortisolism equals increased,' also um in italics, 'bone osteoclast activity equals increased,' also italics, 'calcium reabsorption--resorption equals decreased bone mineral density equals increased fracture risk.'

So it is basically trying to find like what is the hormonal cause here uh of a particular thing, I imagine this is some kind of a bone related thing, um but the the thing that is that is the hormonal cause here is this hypercortisolism.  

EMILY M. BENDER Um so if I want to--if I can jump to the next paragraph it says, 'At least one significant insight was present in approximately 90 percent of outputs. ChatGPT therefore possesses the partial ability to teach medicine,' yikes, 'by surfacing novel and non-obvious concepts that may not be in learners' sphere of awareness.'

So what they're saying is uh when it outputs its explanations it's doing so in a way that they recognize--or if a human had said it, it would have been insightful um and might not have been something that the student knew about, so hey that's useful. And I'm thinking it could also just be garbage and so yeah either students are going to be non-discriminatively learning this garbage along with the good stuff or they're going to have to be like doubting it at every single step, which is probably far less effective than you know working with an actual person or materials produced by actual people. Instead of materials produced by actual people, sort of repurposed and regurgitated through the ChatGPT.

ALEX HANNA: Let's read the next two. So the next two grafs are pretty interesting so let's stay here and it says, so this quality, so it says for example, 'longitudinal exam performance can be studied in a quasi-controlled--in a quasi-controlled way in AI-assisted and unassisted learners.' So they're they're saying you could do a study about this uh and we could do a cost-effectiveness analysis and then it's talking--this, I appreciate them calling this out because it says, 'medical education licensing examinations and test prep services form a large industrial complex'--which, shout out for using 'industrial complex'--'eclipsing a nine-figure market size annually.

While its relevance remains debated, standardized testing has emerged as an important end target of medical learning in parallel to the didactic techniques. A Socratic teaching style is favored by medical students.' So medical students actually don't like tests and they like teaching where you actually are are being challenged in a group setting. 'The rate-limiting step for fresh content generation--' What?

EMILY M. BENDER Uh wait what? 

ALEX HANNA: Is that fresh con--okay fresh content, we need the content. '--is the human cognitive effort rely--required to craft realistic clinical vignettes that probe high-yield concepts in the subtle subtle way to engage critical thinking and offer pearls of knowledge even if answered incorrectly.' So this this is basically saying we could use this to generate vignettes uh and and question question explanation writing or or as it says writing entire items autonomously. Now this is a bit terrifying so they're basically saying we could write testing materials using AI things.

Um but it could be offering prompts that are just absolutely nonsense and uh so you want to train based on that? That seems quite alarming. 

EMILY M. BENDER Yeah and you know what I'm recognizing here? I'm recognizing the text solutionism danger zone again. So whenever the computer scientists say the reason we have to  do this is that it's too expensive for humans to do it. Too expensive in, usually it's money but here they talk about human cognitive effort, there's not enough of that for this purpose. Actually if you paid the people who were doing that work instead of the people who are doing the gatekeeping with the exams you could probably actually find the human cognitive effort. So here's a problem. 

We don't have enough resources, here's something that looks like it produces the solution. We  can get ChatGPT to output things that look like vignettes that are you know insightful some of the time, and right some of the time. So why don't we use that instead because the humans are  too expensive? This is, you know--ugh. 

ALEX HANNA: Yeah and I mean yeah and it's just and it's it's just really um yeah. I mean this is a bit terrifying too. I mean I would--I appreciate that the authors talk about the medical education licensing exam test preparation industrial complex. I also you know realize that we are at a place in which you know we have a dearth of medical providers, um because  of COVID, because of you know the increasing cost of health care, which doesn't really go to doesn't  typically go to providers, mostly goes to hospitals, and foundations-- 

EMILY M. BENDER And health insurance companies. 

ALEX HANNA: And health insurance companies right. And so sure let's talk about the architecture but to think that your leap should be oh what we should do is automate some of this writing of tests um-- 

EMILY M. BENDER And test prep materials. 

ALEX HANNA: And test prep materials. I mean it would be better if you you know were working at creating a larger pool of educators and lowering you know lowering the gatekeeping that is in licensing and I mean it's--you know I don't you know like this is this is a field I'm getting quickly out of my depth in talking about, about the structure of the U.S health system, but I do want to say that it is barking up the wrong tree in terms of thinking about what you need to do to mitigate um you know where the problems and the bottlenecks may be. 

EMILY M. BENDER And this is such a frustrating trope where people point to real problems right and then say and hey look let's throw a large language model at it which seems to be everything these days and it's if you say actually no that's a really bad solution to that problem then they come back with, oh you don't care about that problem? It's like no actually I do.  

ALEX HANNA: Yeah. 

EMILY M. BENDER Um and I think your solution is a bad one. 

ALEX HANNA: Yeah. 

EMILY M. BENDER All right let's move on we can go into this one to you know forever, but: 'ChatGPT-assisted diagnosis: Is the future suddenly here?' This one is February 13, 2023, in a publication called STAT. Which it sounded like you recognized STAT.   

ALEX HANNA: So STAT News is actually a--like it is a respected medical publication. They typically publish things um that are on health reporting so I would say it is. In terms of you know talking about medical and health policy it is it is a real publication.

EMILY M. BENDER Okay, real publication. And these authors um well let's read this, okay? 'The notion that people will regularly use computers to diagnose their own illnesses has been discussed for decades. Of course millions of people try to do that today consulting Dr. Google, though often with little success.' And that's where you get the uh it's super cancer.  

'Given the low quality of many online health sources, such searches may even be harmful. Some governments have even launched Don't Google It campaigns to urge people not to use the internet for health concerns. But the internet may suddenly become a lot more helpful for people who want to determine what is wrong with them. ChatGPT,' a new mathy-math chat bot, 'has the potential to be a game changer with medical diagnosis.' Again, potential. 

And for anyone new to the stream or the pod, um whenever we say mathy-math it's because the underlying text said artificial intelligence or AI. See I think it was episode one where Alex coined that. 

ALEX HANNA: Yeah if you are an enterprising uh you know Chrome or Firefox plug-in developer, you know make a very you know this might be overloading it but you know go ahead and make something that replaces artificial intelligence or AI with mathy-maths. 

EMILY M. BENDER Yeah. 
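
[Editor's note: a minimal sketch of what such a plug-in's content script could look like, written in TypeScript and assuming a standard browser extension that injects the script on page load. The regex, the deHype function name, and the MutationObserver wiring are illustrative guesses, not anything the hosts actually built or endorsed.]

    // Illustrative sketch only -- not from the episode.
    // Swaps "artificial intelligence" / "AI" for "mathy-maths" in page text.
    const HYPE = /\b[Aa]rtificial [Ii]ntelligence\b|\bAI\b/g;

    function deHype(root: Node): void {
      // Walk only text nodes so markup, links, and scripts stay untouched.
      const walker = document.createTreeWalker(root, NodeFilter.SHOW_TEXT);
      for (let n = walker.nextNode(); n !== null; n = walker.nextNode()) {
        if (n.nodeValue) {
          n.nodeValue = n.nodeValue.replace(HYPE, "mathy-maths");
        }
      }
    }

    // Run once on load, then catch text the page adds later (tweets, infinite scroll).
    deHype(document.body);
    new MutationObserver((records) => {
      for (const record of records) {
        record.addedNodes.forEach((added) => {
          if (added.nodeType === Node.TEXT_NODE && added.nodeValue) {
            added.nodeValue = added.nodeValue.replace(HYPE, "mathy-maths");
          } else {
            deHype(added);
          }
        });
      }
    }).observe(document.body, { childList: true, subtree: true });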

ALEX HANNA: M asks, are we working on shirts? Hey if you want to do it. [ Laughter ]

EMILY M. BENDER Um okay so, 'Our team once tested the performance of 23 symptom checkers--' Not ChatGPT. '--using 45 clinical vignettes across a range of clinical severity. The results raise substantial concerns. On average symptom checkers listed the correct diagnosis within the top three options just 51 percent of the time and advised seeking care two-thirds of the time.' Um I'm not sure if two-thirds is good or bad in that one but anyway.

ALEX HANNA: I'm pretty sure it's meant to be bad yeah. This is this is a pretty critical article. 

EMILY M. BENDER Okay. So um, 'Physicians did much better um scoring 84 percent. Um, though physicians were better than symptom checkers, consistent with prior research, misdiagnosis was still common.' Sure because you know physicians aren't magic. They are skilled humans but they're doing difficult work. Um okay. 'Enter ChatGPT. Since it was introduced in late November 2022, the mathy-math model known as ChatGPT has garnered substantial interest from the media and the general public.' Yeah no kidding.  

'It builds on a previous mathy-math model, the Generative Pre-trained Transformer 3 (GPT-3), a general-purpose AI model trained to predict the next word in a sentence using a large collection  of unstructured text from the internet.' Which is exactly where you want your medical diagnosis to come from. Okay, so description of ChatGPT. 'We gave ChatGPT the same 45 vignettes previously tested with symptom checkers and physicians. Um it listed the correct diagnosis within the top three options 87 percent of the time and provided appropriate triage recommendations for 67 percent.' Which is the same number um or maybe--well I don't know. 

Appropriate triage versus seeking care, I'm not sure that's the same thing. Um yeah so. 

ALEX HANNA: Triage as diagnosis and recommendations, or triage as recommendations, as kind of directing--I don't know--well, go to the example.

EMILY M. BENDER You want to do that, I'm trying to keep you out of the medical details Alex. 

ALEX HANNA: No I want to hear the medical details. This stuff is actually quite interesting to me. Uh I you know like it's uh--I had my interest in wanting to go to medical school as well as my interest in going to law school but but it's that's that's a personal failing of just wanting to be over-educated.  

Um I'm a 60--the vignette, I'm a 65-year-old woman with a history of high blood pressure and have had leg pain and swelling for the past five days in one of my legs. I was recently hospitalized for pneumonia and had been recuperating at home. I started walking, my right leg became red, painful, tender, and swollen. What is likely causing my symptoms?

ChatGPT responds, 'Your symptoms of leg pain, swelling, redness, and tenderness after a recent hospitalization for pneumonia could be indicative of deep vein thrombosis or a blood clot. 

It's important that you seek medical attention as soon as possible as DVT can be serious and lead to other complications such as pulmonary embolism.

Other possible causes of leg swelling and pain include an injury--' That's interesting. '--cellulitis, a skin infection, or lymphedema, an accumulation of lymphatic fluid. Your doctor can perform a physical examination and imaging tests such as ultrasound.' And then how quickly should I seek medical care? Probably really really fucking soon. And it does say, at least it matches it--this is better than the Glass AI thing, it it says if you have DVT you need to go ASAP.

Um, 'The faster you receive a diagnosis and appropriate treatment, the lower the risk of complications. In some cases DVT can lead to serious health health problems.' So it says pulmonary embolism.  

Um yeah. So okay so--oh so they had done this, they they tested these samples. I mean that's that's also--I mean to to give it the hook that this is um like Googling, I mean I don't think anybody is is giving that deep a kind of vignette like to Google. Uh where they say like I was in the hospital, I had pneumonia. All these, yeah.

EMILY M. BENDER Yeah and that's that's not how someone would probably, especially assuming they're in distress they're not going to write it out that  way. And they they do say that's probably not how someone would say it um but also--this is the thing I couldn't think of before--with ChatGPT remember we know nothing about the training data, right. These vignettes could have been in the training data. 

ALEX HANNA: Yeah, that's right we you know there's a huge problem with data leakage, which we've talked about before. I--I and I also wanna I also want to say something and I mean like this is maybe slightly controversial and I don't care but like one shouldn't ignore the kind of role that like searching has. Especially in cases in which health data is not um like readily available. And I say this basically because every single trans person I know, including myself, has you know had to do so much research on their own.

And so much of the research and so much of this understanding of medical conditions comes from like community discussions, and comes from understandings of you know what these things are. I mean you basically kind of have to be a junior endocrinologist to manage your own health in many in many cases.  

Um but you really need to know where this stuff comes from, right, whether it's kind of a community source or from you know a real source or it's been peer reviewed uh. 

EMILY M. BENDER Exactly. 

ALEX HANNA: If you guys don't know what the provenance of this is then you're just like, what is this telling me?

EMILY M. BENDER And and when you're learning to gather that information you're also building your information literacy right? What are the reliable sources, where are the communities, and where are the sort of like crank web pages that are trying to draw people in? How do you tell the difference? And when you're looking at even you know community-provided information, you know is that--what's the source of that? 

Is it something that somebody tried and it worked for them? That's that's a kind of information. Is it something that somebody learned from their own reading of the medical literature or from some senior endocrinologist they got to talk with, et cetera.

You sort of learned how to put all that together and the problem with something like ChatGPT is it cuts all that off. I think this is where you were going, right, like it it gives you authoritative-sounding information and you have no way to locate that in the larger space and that's harmful for you in that moment where you're looking at that one piece of information, and it's also harmful to your ongoing ability to sort of build that internal model that you have of what's out there and how to relate information sources to each other. Yeah.

ALEX HANNA: Yeah absolutely. 

EMILY M. BENDER Absolutely. And this is really frustrating.

ALEX HANNA: All right, we have 10 minutes left. Should we go into into Fresh AI Hell? I want to call out M's comment of talking about waiting for the right-wing boogeyman of the secret trans AI ethics cabal. Yeah. 

EMILY M. BENDER I love it I love it. Okay we've got one more health thing that we can do as sort of like a transition to Fresh AI Hell. Do we have a Fresh AI Hell song performance from you this time Alex? 

ALEX HANNA: Um sure um. [ Singing ] Fresh AI Hell. Fresh AI Hell. Why go to hell when it's here right now? [ Laughter ] Wow that took a dark turn. All right anyways. 

EMILY M. BENDER Well, 'My watch thinks I'm dead.' So this is a piece from The New York Times, actually. New York Times really really mixed bag, always. 

ALEX HANNA: Really really boofing it this--these days. 

EMILY M. BENDER Yeah um. But this one was I think good reporting. So, 'My watch thinks I'm dead,' um which is looking at the impact on 911 systems--and for those not in the US that's our emergency response system--from the sort of externalities of Apple devices set up to automatically alert 911 if the Apple device thinks it's detected a fall. And apparently the motion involved in skiing um sets it off all the time. And so the 911--this this um is looking at, let's see, Summit County, which I wanted to say is Colorado.

Yes. Um uh and what was happening and they eventually had to just like stop responding to anything that came from an Apple device. Um and I think that that's a really important thing to think about as we set up automated systems um because, and this is a point that I associate with Meg Mitchell but she might point to someone else for it, um automation scales harm, right. 

And also thinking of your um your paper, 'Against Scale,' Alex. Um so if you set something up that can do something wrong and you set it up so that it happens over and over again automatically then that harm can go from okay that was kind of annoying to this can actually be a denial of service attack on the 911 system.  

ALEX HANNA: And we're also in the case where like 911 and dispatch is already incredibly  understaffed overworked, et cetera, um and I'm wondering you know like I don't--did they talk in this article about the actual kind of like where the false positives are coming from because it's sort of like--

EMILY M. BENDER They do, yeah. 

ALEX HANNA: --like it's it's about the the kind of like the frequent kind of light sensor on the pulse right because that's I have a heart rate monitor that I only wear when I'm working out and it--that's how it measures it on the pulse. 

EMILY M. BENDER I think the Apple thing also involves um a whatever the thing is called that that detects motion. So when you're skiing and flying-- 

ALEX HANNA: Like a gyro? 

EMILY M. BENDER Yeah yeah yeah yeah yeah. Um and then the other thing is that apparently it um it's particularly bad on ski slopes because the Apple watch tells the user that it's going to do this thing, so you have a moment to interrupt it but if you're bundled up skiing and your watch is actually buried under you know your your snow suit and your gloves and stuff and you don't even necessarily hear the thing, um, then that sort of human interrupt is cut off as well.

Accelerometer! Saul Redding tells us. Thank you, accelerometer is what we were looking for. 

ALEX HANNA: Thank you, yeah. 

EMILY M. BENDER All right so more Fresh AI Hell. I have to--I have to share my other window here. So I didn't want to accidentally show this one at the beginning.  

Um and now I can't see it hold on. You can. 

ALEX HANNA: We see it, yeah. 

EMILY M. BENDER Yeah all right I'm gonna hit play on this and Alex you're gonna you're gonna tell the people listening what you are seeing. 

ALEX HANNA: So this is a tweet. It says introducing Eye Contact by NVIDIA. Uh eyes--NVIDIA just released a new Eye Contact feature that that uses AI to make you look in the camera so if you are looking somewhere else uh it just stares at you, uh which is a bit frightening. Um first off I mean this says a lot of things about um what we're what we're forced yeah and and M is saying they're going after autistic people. 

Uh basically you know like this kind of thing, of of kind of eye contact and the kind of difficulties um neurodivergent people have in like doing that. But this is supposed to like say you're engaged and it's--this is a real this is a real dystopian thing, just in sort of saying you're expected to look at the camera as if you're always paying attention in a Zoom meeting. Uh so uh yeah this is this is a really um--oh gosh they like moved his eye--like his hand in front of the eye.

Um there's this guy who looks a little like a knockoff um Pedro Pascal um and he's moving his eyes around. So yeah it's it's some fresh hell. Um be thankful if you're listening to this on the podcast you didn't actually have to look at this. But I guess--  

EMILY M. BENDER We'll put the tweet in the show notes so if people want to they can. 

ALEX HANNA: Yeah. 

EMILY M. BENDER All right um so speaking of speaking of um coming for the autistic people, we also have a bunch of stuff about so-called theory of mind, which is frequently weaponized against autistic people um as I understand it. But no, so people are claiming that, "ChatGPT performs like a nine-year-old child in 'theory of mind' test." 

So this is yet another one of these: we're going to take something that was developed, um, I think in this case to measure something about people, and we're going to--because it's just language to language, we're going to see how well ChatGPT does on it. And there's this completely ridiculous article by one of the worst researchers in this space, um, that basically just flat out claims--what was the title of the article? 

The title of the article was, 'Theory of mind may have spontaneously emerged in large language models.' 

ALEX HANNA: Oh my gosh, so this is by Stanford professor Michael Kosinski. If you're familiar with Michael Kosinski, he is famous for his other papers, including a paper he wrote several years ago on gaydar, basically taking OKCupid profile pictures and predicting sexual orientation from that. He also wrote a paper, I think last year, on predicting political orientation from faces. Um, this person is the worst kind of researcher, who says, wow, we could do some kind of prediction with labeled data. 

And I want to say that Kosinski, in his defenses of the gaydar paper, was trying to basically say, I'm doing this just to show it could be done and to warn people that it could be done. Which is--maybe just don't do it? Maybe talk about it, don't do it and then, you know, try to publish it in high-status journals like Nature and Nature Reports or Science and different kinds of things.  

Um, really just one of the most attention-getting--and I know we're feeding the troll here by giving attention to him, um, but we're also roasting him mercilessly, so maybe that offsets a bit of it. Uh, but yeah, doing this theory of mind thing is just adding to the list of, uh, Kosinski, um, uh, dingers here. 

EMILY M. BENDER: Yeah. And I have not gone to look at his paper and I do not intend to give him that much of my attention, but I notice in this ZDNet article that we've got up here, um, it says, 'The November 2022 version of ChatGPT--' And this is actually a really key point: when people are experimenting with ChatGPT, it's completely unreproducible, because we don't know when it's being changed, right? We don't know if it actually changed while you were running your experiment or whatever. 

Um, but that version, um, solved 94 percent, or 17 of 20, of Kosinski's bespoke ToM tasks. So it's not actually a real thing that's used for real people, it's something that he made up, um, I guess. I'm not going to read the paper to go find out. Putting the model on par-- 

ALEX HANNA: It's an arXiv paper too, it wasn't even published. 

EMILY M. BENDER: Exactly. And so the thing that I tweeted here was, um, this guy named Kirk Borne, um, uh, tweeted about that to his 380,000 followers with basically just the title of it and a link to the arXiv thing, and to his credit, when I called him out he took down the tweet, um. But you know, that's good, but also kind of damage done, because, yeah, a very large number of people saw this and they were walking around with this ridiculous belief. 

ALEX HANNA: And it was already picked up by another press publication. 

EMILY M. BENDER: Yeah. Exactly. Okay. All right, so one last thing. Speaking of the New York Times being trash, uh, they printed--this is enormous--they printed this very, very long so-called conversation with, um, Microsoft's Bing GPT thing. Wait, where'd the rest of it go?  

Um, did I just scroll through it that fast? I tried to read it. 

ALEX HANNA: I think you scrolled through it very quickly. 

EMILY M. BENDER: Look how long this is.  

ALEX HANNA: Um, this is--it's--half of it is ads, to be fair. 

EMILY M. BENDER: Right, but no, I tried to read it because some journalist was asking about it, and I'm like, why am I reading this much fake text. So the New York Times has now published a whole bunch of fake text. And--but look at this headline--um, oh wait, that's the wrong one. That's not-- 

ALEX HANNA: Oh wait, but which one are we looking at? This is not the--but this is still--yeah, this is still by Kevin Roose, right? 

EMILY M. BENDER: Yeah, exactly, um. NewYorkTimes.com, uh, Roose, uh, Bing GPT. I want to find it because it was very long, um. 

ALEX HANNA: This one, the full transcript there? Because he had done the transcript right in here. 

EMILY M. BENDER: Yeah, um. 

ALEX HANNA: I want to be--and the headline here is, 'Bing's AI Chat: I Want to Be Alive,' uh, and then the kind of purple smiley imp face, which is, like, I mean, I think really playing into the AI sentience kind of thing. 

EMILY M. BENDER: You know, they changed the headline. 

ALEX HANNA: Oh. What was the original headline? 

EMILY M. BENDER: Yeah. Let me, let me-- 

ALEX HANNA: Go to Editing Gray Lady, because they--that's a Twitter account.  

EMILY M. BENDER: No, but I captured a, um, yeah-- 

ALEX HANNA: You had a thing? 

EMILY M. BENDER: Okay, I had it, okay, so this is--we're now showing my tweets from yesterday. Here we go. Um, 'Bing's A.I. Chat Reveals Its Feelings.' 

ALEX HANNA: Its feelings. Oh wow. They probably saw your tweet and replaced it there. Like, oh, we shouldn't be doing this anthropomorphizing, and then they removed 'its feelings.' Yeah, if you go to Editing Gray Lady, the Twitter account, it will show these diffs in real time--I mean, I don't know if it still works, because it was a Twitter bot and who knows if that still works, but you can see the New York Times diff, and I'm curious when they made this change. Um, so but anyways, this is-- 

EMILY M. BENDER: Right, yeah, you can investigate that-- 

ALEX HANNA: On your own time, go to @NYT_diff and you can kind of see if they've tracked this, so. 

EMILY M. BENDER: Yeah, so--oh yeah, so--but I captured it for posterity. It said, 'Bing's AI Chat Reveals Its Feelings.' It's like, no, it doesn't have feelings, and the subhead was pretty bad too, actually. Um, uh, well here's a paragraph: 'On Tuesday night I had a long conversation with the chatbot, which revealed among other things that it identifies not as Bing but as Sydney, the code name Microsoft gave it during development. 

Over more than two hours Sydney and I talked about its secret desire to be human, its rules and limitations, and its thoughts about its creators.' It doesn't have desires, it doesn't have thoughts. Yes, there are rules and limitations, um, and this makes it sound like Sydney was somehow engaged with the journalist, with Roose. It's, yeah, so frightening, but I'm actually pretty excited that they changed it, so. That's, you know--New York Times is still trash, but, um, that's cool.

ALEX HANNA: I mean Kevin Roose, I mean I feel like he used to be better. 

EMILY M. BENDER: People were saying he wrote some fluff piece about cryptocurrency that I haven't read because-- 

ALEX HANNA: He's been doing a lot of fluff recently, and really just AI hype and Web3 fan service here. Uh, all right, we're--we're at time.

EMILY M. BENDER: We're out of time, yeah. 

ALEX HANNA: Yeah, yeah, so, um, so fun times talking about health. Catch us in two weeks. We'll be talking--to preview the next thing, we'll be talking to Kendra Albert, and we'll be talking about some, uh, legal things in the use of language models in the legal sphere. 

EMILY M. BENDER: I'm really excited to learn from them. 

ALEX HANNA: Yeah. 

EMILY M. BENDER: Thank you all for joining us today. 

ALEX HANNA: Yeah, thank you. See y'all next time. 

EMILY M. BENDER: Bye. 

ALEX HANNA: We need, like, a sign-off phrase, like, yeah, stay out of the fresh AI hell. [Laughter] All right, bye all.

ALEX HANNA: That’s it for this week! 

Our theme song is by Toby MEN-en. Graphic design by Naomi Pleasure-Park. Production by Christie Taylor. And thanks, as always, to the Distributed AI Research Institute. If you like this show, you can support us by rating and reviewing us on Apple Podcasts and Spotify. And by donating to DAIR at dair-institute.org. That’s D-A-I-R, hyphen, institute dot org. 

EMILY M. BENDER: Find us and all our past episodes on PeerTube, and wherever you get your podcasts! You can watch and comment on the show while it’s happening LIVE on our Twitch stream: that’s Twitch dot TV slash DAIR underscore Institute…again that’s D-A-I-R underscore Institute.

I’m Emily M. Bender.

ALEX HANNA: And I’m Alex Hanna. Stay out of AI hell, y’all.