Mystery AI Hype Theater 3000
Defining AGI: Oops! All Eugenics, 2025.12.08
There's a new definition of artificial general intelligence in town, and unsurprisingly... it's bad! Alex and Emily rip up the tissue-paper-thin premises behind this latest attempt to define "intelligence." Plus, we discover that AI hypers love using logos that look like buttholes.
References:
- "A Definition of AGI" landing page and paper-shaped object
Fresh AI Hell:
- "What If Sea Monkeys Constantly, Sometimes Dangerously, Bullshitted People"
- Doctronic, the "AI doctor"
- NIST reports companies cheat on "AI" evaluations
- "Microsoft Lowers Sales Staff's Growth Targets For Newer AI Software"
- "PEN Guild wins landmark arbitration on AI protections"
- "AI" assistant pop-up whack-a-mole
Check out future streams on Twitch. Meanwhile, send us any AI Hell you see.
Our book, 'The AI Con,' is out now! Get your copy.
Subscribe to our newsletter via Buttondown.
Follow us!
Emily
- Bluesky: emilymbender.bsky.social
- Mastodon: dair-community.social/@EmilyMBender
Alex
- Bluesky: alexhanna.bsky.social
- Mastodon: dair-community.social/@alex
- Twitter: @alexhanna
Music by Toby Menon.
Artwork by Naomi Pleasure-Park.
Production by Ozzy Llinas Goodman.
Alex Hanna: Welcome everyone to Mystery AI Hype Theater 3000, where we seek catharsis in this age of AI hype. We find the worst of it and pop it with the sharpest needles we can find.
Emily M. Bender: Along the way, we learn to always read the footnotes, and each time we think we've reached peak AI hype, the summit of bullshit mountain, we discover there's worse to come. I'm Emily M. Bender, a professor of linguistics at the University of Washington.
Alex Hanna: And I'm Alex Hanna, director of research for the Distributed AI Research Institute. This is episode 69, which we're recording on December 8th of 2025, and today we're returning to one of our favorite and/or least favorite topics- that is, definitions of artificial general intelligence.
Emily M. Bender: There's a new AGI paper shaped object in town, and that's actually a very charitable description of this thing. It seems like maybe they heard the criticism from Dr. Timnit Gebru, Dr. Margaret Mitchell, and others that quote, "AGI is unscoped technology, and therefore an invalid engineering goal," and tried to just paper over that gigantic logical flaw in their alleged science.
Alex Hanna: Following up on Microsoft's "Sparks of AGI" paper- well, we should say paper-shaped object- this piece sports an impressive author list, headlined by Dan Hendrycks and his Center for AI Safety. Authors include economist Erik Brynjolfsson, former Google Board Chair Eric Schmidt, and computer scientist turned existential risker Yoshua Bengio. But just like the "Sparks of AGI" paper, the whole endeavor is built on an attempt to define and measure intelligence, a eugenic project at its root. And also like the "Sparks" paper, we don't have to go too far down the citational trail to find actual outspoken eugenicists. So let's get into it.
Emily M. Bender: Indeed. Yeah. Scratch an AGI definition, find a eugenicist. So we're starting here on this webpage that they created as an advertisement for their paper shaped object. And this is a trend that I'm seeing more of, and it strikes me as very weird. So I'm comfortable with the idea of research projects having webpages where you acknowledge the people involved, and funding sources, and links to all of the papers from that project. And you'd expect that kind of a webpage to be within the institution that is hosting the project. But that's not what this is. This is a pretty flashy, in terms of web design, page that looks like it's a summary of the paper. And then when we go through to the paper, it's kind of most of the paper.
Alex Hanna: Yeah, it is 90% of the paper, except for very thin related works and an introduction. And a little bit to me is it's the veneer of scientificism. And not scientificness, it's like scientificism, where it has this veneer of looking really authoritative. And then it has this kind of interactivity, which may play a little bit more to policy makers, and other types of people that have power and resource granting abilities. So that's what that signals to me. And if it was that, and then the paper had a much more, let's say, justified definition, that's like, okay- of each of these things. Yeah, but it's funny that it's 90% of the quote unquote "paper."
Emily M. Bender: And I think it's doing a little bit of additional work there, because when you click over to the paper, it looks like, oh, this is gonna be a lot to read. It's 57 pages long. Maybe I'll just stick with this summary here and just believe that there's more in the paper, and there really isn't. But should we talk a little bit about this group of authors and their affiliations?
Alex Hanna: Yeah, let's get into it. So at the top I mentioned Dan Hendrycks, and then Erik Brynjolfsson, who is a very, let's say AI-friendly economics professor. And then, who else? Going down the list, we see a number of different people, but there's Max Tegmark and Jaan Tallinn, whom we've talked about as being within the TESCREAL universe- in the extended universe, rather. Well, not extended, they're like wholly within that, especially Tallinn, who's a big funder of these pieces. And Max Tegmark is the head of the Future of Life Institute. And then Eric Schmidt, which is interesting. And then Yoshua Bengio, which is maybe not surprising. But, Emily, you wanted to go into some of these institutions, especially.
Emily M. Bender: Exactly. So I was skimming the institutions, like who's where. And I noticed Law Zero, I thought, huh? Who's there? Oh, Bengio's there. Okay. Did he start a law firm? So I looked up their website. It's not a law firm. This, I think, is a reference to Asimov's Zeroth Law. Just cracked me up. And this website is very funny, and if you scroll down, we can see who's funding Law Zero. So Future of Life, Gates Foundation, Coefficient Giving, Schmidt Sciences, and something called Silicon Valley Community Foundation. And then just below that, part of the SVCF thing, it says, "Made possible by the generosity of Jaan Tallinn and Survival and Flourishing DAF-" Donor-Advised Fund- "a donor-advised fund of SVCF." So this is TESCREAL through and through, and I guess we now just have to know that Gates is also TESCREAL.
Alex Hanna: I mean, Gates is, I think he's more just really interested in, and really wowed by AI. And we talked about Gates and his really just obsession with many of these when we were talking with DAIR researcher Adrienne Williams. And especially Gates' funding priorities in education and health, and the way that they really set the priorities in those fields of philanthropy. So how should we progress? Should we go through the quote unquote "paper"?
Emily M. Bender: Let's do the paper, yeah.
Alex Hanna: So let's read the abstract. So it says, "The lack of a concrete definition for artificial general intelligence, AGI, obscures the gap between today's specialized AI and human level cognition. The paper introduces a quantifiable framework to address this, defining AGI as matching the cognitive versatility and proficiency of a well-educated adult."
Emily M. Bender: So I was annoyed in the first sentence already. "The gap between today's specialized AI and human level cognition" already presupposes that those two things are on a scale that you can measure a distance between. And I don't agree, already. Also, "well-educated adult"? They don't actually have a definition. Even when they're trying to define something, they don't have a definition. This whole document wavers back and forth between something that knows something, something that is smart. And they seem to really be wedded to the idea that whatever it is that allows us to do our cognition is also something we can put on a scale, and some people do it more and better than others. And this well-educated adult thing, I think, fits into that, because somehow the education is supposed to move you up that scale.
Alex Hanna: A hundred percent. I mean, this is already the reification move. That you can quantify intelligence and then rank people in a stacked ordering. And there's already comments about this in the chat. So sjaylett says, "Well, racism has definitely entered into the second sentence." Yeah, so really thinking about, I mean, there's so much here. So well-educated adults, cognitive versatility, and proficiency- what do each of these terms mean and what are they signaling? And what kind of normativity do you have coming into that? It's pretty bad. And then it comes into it further. Also, magidin says, "Maybe we should refer to-" I'm assuming here the paper-shaped object- "as papyrus and papyri to distinguish them from actual preprints and papers." And I say no, because that was the technology of my ancestors. And they can have papyri over my dead body. So I would suggest that we think of maybe, you know, let's have other things.
Emily M. Bender: I think that the other thing is that something written on papyrus was done with care, I imagine. And the papyrus itself was a pretty solid thing. So this is more like toilet paper rolls.
Alex Hanna: Yeah. magidin says rags. Yeah.
Emily M. Bender: Which fits in with what abstract_tesseract says. So just back to the website for a second, he says, "These websites and paper shaped objects continuing not to beat the 'logo shaped like a butthole' allegations."
Alex Hanna: It's really bizarre. I don't know why there's been this mimetic convergence to the butthole logo in all the AI firms. And I think I saw something like a TikTok about it, but if there's not one, then I'm gonna...
Emily M. Bender: You gonna have to make one? Yeah.
Alex Hanna: I'm gonna make one. I'm gonna do a remix with Izzy Roland in the episode of Game Changer where she's like, "I would like a visual effect of buttholes everywhere," and it's just gonna be different AI logos. So, yes.
Emily M. Bender: That is hilarious. All right, so I'll keep going with the reading, 'cause you got stuff to say.
Alex Hanna: Yeah, yeah. Why don't you read this?
Emily M. Bender: "To operationalize this, we ground our methodology in the Cattell-Horn-Carroll theory, the most empirically validated model of human cognition. The framework dissects general intelligence into 10 core cognitive domains, including reasoning, memory, and perception, and adapts established human psychometric batteries to evaluate AI systems."
Alex Hanna: Okay, yeah. So already pretty bad. So I'm not familiar with this theory, the Cattell-Horn-Carroll theory, but I was pretty curious about these individuals. So I went to the Wikipedia page, was poking around, was reading- you know, don't fault me for reading Wikipedia. I'm just like, this is a starting place. And then going down the articles, I mean, psychometrics is not my field. First off, if you go to the page, it's very funny, 'cause it says like, this uses parenthetical referencing, which is deprecated on Wikipedia. So I'm wondering if this is a pretty stale theory. I haven't looked at the history or anything, and it might be. And it's really interesting that they, with no backing, they say that this is the- what was it, the most rigorous? Or most-
Emily M. Bender: Most empirically validated.
Alex Hanna: Most empirically validated, which I'm like, okay. I'm not really quite sure where they're getting that. And I'm like, okay, skimming it. And it looks like what they're citing here is there seems to be these kind of subsets of the g factor- so the kind of classical Charles Spearman thing, general intelligence- which of course has many people who have written about the eugenic roots of that. And then I was like, well, who are these people? And so I'm like, all right, let's look at Cattell. So do you want to go to the Cattell tab? And I was like, maybe we just control F eugenics. Let's see if the man is beating the charge of- and oops, just immediately, "Views on race and eugenics."
Emily M. Bender: It's just like, if that's a heading in your Wikipedia article, it is not a good sign.
Alex Hanna: I know, right? And you're just like, great, okay. So there's two people here. "William Tucker and Barry Mehler criticized Cattell based on his writings about evolution and political systems. They argue that Cattell adhered to a mixture of eugenics and a new religion of his devising, which he eventually named Beyondism and proposed 'a new morality from science.' Tucker notes that Cattell thanked the prominent neo-Nazi and white supremacist ideologues Roger Pearson, Wilmot Robertson, and Revilo P. Oliver in the preface to his Beyondism. And a Beyondist newsletter with which Cattell was involved favorably reviewed Robertson's book 'The Ethnostate.'" Which is, a book called 'The Ethnostate'? That's wild. What are you doing there? And then the rest of the article, you can read it yourself. Cattell responded. He was nominated for an award. You know, Mehler campaigned against him, and then he withdrew from the award. And then there was a bit of revisionism and such about his participation here. And then if you head over to John Carroll's page, searching eugenics won't get you there, but it doesn't take too much scrolling to see. "In 1994, he was one of the 52 signatories on the 'Mainstream Science on Intelligence' letter, a public statement written by Linda Gottfredson and published in the Wall Street Journal, as a response to what the authors viewed as inaccurate and misleading reports made by the media regarding academic consensus on the results of intelligence research, in the wake of the appearance of 'The Bell Curve' earlier the same year."
Emily M. Bender: Are you having déjà vu yet, here?
Alex Hanna: Yeah. If you remember an episode in which we talked about the "Sparks of AGI" paper, they made direct reference, until I believe Emily pointed it out on Twitter or Bluesky, that you are referencing this person who's a eugenicist, who has taken money from the Pioneer Fund, all these pretty out eugenicists.
Emily M. Bender: And also, so in that "Sparks of AGI" paper, they needed a definition of intelligence. So they pulled this editorial, but didn't read the second page of it, which was basically saying, yeah, "The Bell Curve" is right, and just reproducing all the race science and co-signing it. So it wasn't even just funded by eugenicists, it was race science. And when I was reading this statement here that this is the most empirically validated model of human cognition, I thought, huh. Did they go find something that was actually about how do we integrate all the various things that we do? No, it's actually just more of the same. And it's not a model of human cognition. It is a way to rank people against each other, by this one g factor. It's all the same.
Alex Hanna: Yeah. And I think that's really important too, because even if it wasn't these people who had explicit eugenic affiliations on their Wikipedia page or said these things, it's a eugenicist project. The rank ordering, the quantification. And so I think that is the project, right? It doesn't matter if you have 10 different aspects of g, if g sums up to a single unit, like, it's the project.
Emily M. Bender: It's the project. And even if you can find somebody who managed not to actually step in it directly with eugenics who's subscribed to this project, it is still the same project. All right, let's finish out the abstract here.
Alex Hanna: Sure. Gosh, how have we gotten this far. Okay. So I'll take the next bit of it. So, "The framework dissects general intelligence into 10 core cognitive domains, including reasoning, memory, and perception, and adapts established human psychometric batteries to evaluate AI systems. Application of this framework reveals a highly quote, 'jagged' cognitive profile in contemporary models. While proficient in knowledge intensive domains, current AI systems have critical deficits in foundational cognitive machinery, particularly long-term memory storage. The resulting AGI scores- e.g., GPT-4 at 27%, GPT-5 at 57%- concretely quantify both rapid progress and the substantial gap remaining before AGI."
Emily M. Bender: All right, so the first thing is "adapts established human psychometric batteries to evaluate AI systems." There's the construct validity problem. The construct validity of the psychometric stuff is already not clear. But those were developed for measuring something about people, and you can't just take that thing up and then use it to measure something else without redoing the construct validity work for the place that you're using it, as we've been saying for years. But of course they don't cite the Grover paper here, "AI and the everything in the whole wide world benchmark." That would be too much actual science for them.
Alex Hanna: Yeah, and the thing I think that is sending me, too, is the citation to psychometrics, which, you know, psychometrics, I don't have a strong opinion of psychometrics. I'm maybe more ambivalent, because there are interesting aspects of psychometrics, which should be considered. Like if you're considering a Likert scale, what does it actually mean when someone is putting strongly agree versus strongly disagree on a Likert scale? And what does it mean when you have a point estimate? And that is important stuff. But psychometrics is about measurement and it's about measurement validity. And just taking something and calling it visual intelligence or whatever, that's not what's happening in the brain, in cognition. It's taking the metaphor and mistaking the metaphor for measurement. And that is presupposing, indeed, that the original measurement even was valid.
Emily M. Bender: Yeah. And so then they say, "reveals a highly 'jagged' cognitive profile." Which is presupposing and reifying that cognitive profile makes sense here at all. And relatedly, "rapid progress and substantial gap." It's like, no, none of this, it's all fake. But there's two things from the chat that I wanna bring up. So magidin says, "We can't convince people that our toys are intelligent, so we will invent a scale on which to measure them, and then slap a post-it that says 'intelligence measurement' on top of it. That'll do the trick."
Alex Hanna: Yeah. That's a hundred percent what's happening, yeah.
Emily M. Bender: abstract_tesseract replies, "*Spinal Tap voice* this one goes up to 11."
Alex Hanna: But it doesn't! You can't score over a hundred. This one weirdly, for each category, goes to two. No, it goes to 10! Sorry, I missed it. This is getting to their bizarre percentage system, which we'll get into in a second.
Emily M. Bender: Yeah. So what I want to do is actually scroll down a little bit, and maybe actually get to that part. So here's their jagged profile, this spiderweb graph showing scores for GPT-4 and GPT-5 on their 10 dimensions, each of which goes up to 10. And this, I think is where they, this paragraph here, I'll read quickly- try to get to the whole thing, and then we'll take it apart. "Decades of psychometric research have yielded a vast battery of tests specifically designed to isolate and measure these distinct cognitive components in individuals. Our framework adapts this methodology for AI evaluation. Instead of relying solely on generalized tasks that might be solved through compensatory strategies, we systematically investigate whether AI systems possess the underlying CHC narrow abilities that humans have. To determine whether an AI has the cognitive versatility and proficiency of a well-educated adult, we test the AI system with the gauntlet of cognitive batteries used to test people. This approach replaces nebulous concepts of intelligence with concrete measurements, resulting in a standardized quote, 'AGI score,' 0% to 100%, in which 100% signifies AGI." Percent of what?
Alex Hanna: Yeah, it's very funny. First off, a percentage is a very funny thing. Like, what is this- I mean, just the kind of- if you were gonna do this, maybe you could have, even barring how silly the entire project is, maybe an index would be more- even an index is weird, but they're like, no, we need to have this thing where if it says a hundred- also, why 100%, too? 'Cause this last sentence is very funny: "100% signifies AGI." So like, 99% wouldn't- like, do you want confidence intervals? It's just a bizarre way to do this.
Emily M. Bender: I think they wanna be able to say, we are X percent of the way to AGI, and it's just, yeah, it makes no sense.
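To make the scoring scheme concrete: as described above, each of the ten domains is scored out of 10 and the domain scores are summed into a 0% to 100% "AGI score." A minimal sketch of that arithmetic in Python follows; the equal weighting, the function name, and the example numbers are our own assumptions for illustration, not the paper's actual rubric.

```python
# Toy sketch of the aggregation described above: ten domains, each scored
# out of 10, summed into a 0-100 "AGI score." Equal weighting and the
# example numbers below are assumptions for illustration only.

DOMAINS = [
    "general_knowledge", "reading_writing", "math", "on_the_spot_reasoning",
    "working_memory", "long_term_memory_storage", "long_term_memory_retrieval",
    "visual_processing", "auditory_processing", "speed",
]

def agi_score(domain_scores: dict) -> float:
    """Sum ten 0-10 domain scores into a 0-100 'AGI score' (equal weighting assumed)."""
    missing = set(DOMAINS) - set(domain_scores)
    if missing:
        raise ValueError(f"missing domains: {sorted(missing)}")
    # Clamp each domain to [0, 10] so the total stays in [0, 100].
    return sum(min(max(domain_scores[d], 0.0), 10.0) for d in DOMAINS)

# Invented profile: middling on most domains, strong on general knowledge,
# near zero on long-term memory storage (the "jagged profile" shape).
example = {d: 5.0 for d in DOMAINS}
example["general_knowledge"] = 9.0
example["long_term_memory_storage"] = 0.0
print(f"AGI score: {agi_score(example):.0f}%")  # 49% for this invented profile
```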
Alex Hanna: Yeah, it's very bizarre. And I guess if you wanna- and, abstract_tesseract is incredible, says, "Cursed DDR difficulty plot." Yeah. This one is- I'm trying to think of a DDR song now. Oh geez. Someone put it in the chat and I'll make a DDR joke later. Yeah, anyways, Konami Power. So then there's the 10 cognitive components.
Emily M. Bender: So I'm just checking, there are 10 here, right?
Alex Hanna: There are 10. Yeah.
Emily M. Bender: Because below they end up with 12.
Alex Hanna: Oh, do they?
Emily M. Bender: Yes.
Alex Hanna: I didn't- that's very funny. So the 10 components they have are- I'm just gonna read the bold text, because whatever- general knowledge, which they call K, reading and writing ability, RW, mathematical ability, M, on-the-spot reasoning, R, working memory, WM, long-term memory storage, MS, long-term memory retrieval, MR, visual processing, V, auditory processing, A, and speed, S. Oh, and abstract_tesseract just put something in the chat, which I can sing from memory, which is, "Ay yi yi, I'm your little butterfly, here to tell you now that there's no such thing as AGI-" I added a syllable there- "here to tell you now that there's no such thing as AGI." There's a few too many there, but I like the thought, and thank you for giving me something I knew from when I was 18.
Emily M. Bender: All right. So I'm gonna read the scope thing, and then I think we need to actually get into section two where they try to define this. And it's so sloppy. The paper shaped object here is like, tissue paper version. It is so thin. But reading the scope paragraph, they say, "Our definition is not an automatic evaluation nor a data set. Rather, it specifies a large collection of well scoped tasks that test specific cognitive abilities. Whether AIs can solve these tasks can be manually assessed by anyone, and people could supplement their testing using the best evaluations available at the time. This makes our definition more broad and more robust than fixed automatic AI capabilities data sets. Secondly, our definition focuses on capabilities frequently possessed by well-educated individuals, not a superhuman aggregate of all well-educated individuals' combined knowledge and skills. Therefore, our AGI definition is about human level AI, not economy level AI. We measure cognitive abilities rather than specialized, economically valuable know-how. Nor is our measurement a direct predictor of automation or economic diffusion. We leave economic measurements of advanced AI to other work. Last, we deliberately focus on core cognitive capabilities rather than physical abilities, such as motor skills or tactile sensing, as we seek to measure the capabilities of the mind rather than the quality of its actuators or sensors. We discuss more limitations in the discussion." Kind of. So they're basically saying, this isn't a data set, it's not an automatic evaluation, it's like a conceptual specification, but we point to actual tasks. And when we get into it, they've got specific data sets they're pointing to. So this isn't even true.
Alex Hanna: Yeah, it's not true. I suppose the idea here is, oh, you could potentially take other tasks and then form problems around them. But I mean, aren't you then just skipping the validity work for the new problems? And then I think they're also backporting existing automated testing frameworks, or rather data sets and tasks, into this.
Emily M. Bender: Oh, they 100% are, yeah.
Alex Hanna: Yeah. And then the thing that sent me here- which here signals a little bit of, maybe Brynjolfsson saying I want to distance myself a little bit from this, but maybe I'm giving him too much credit- is the part about economy level AI. So measuring, quote, "cognitive abilities." And the whole association with cognitive is terrible as it stands, rather than specialized economically valuable know-how. So I'm guessing that what they're doing here is trying to separate this notion as like, we've reached something that approaches something like cognition, rather than this idea of something that is just industrial automation on a massive scale, and is going to generate the most value. To me, what this kind of reads about is it's a little bit of a pushback, or a tongue in cheek critique also, of the leaked Microsoft OpenAI memo, where they said if it can generate a hundred billion dollars, then we've reached AGI. And so then they're saying like, no, no, no, this is now actually based on psychometrics and science, when it's- Both of them are silly, and a little bit of a pox on both your houses, but geez.
Emily M. Bender: Yeah, exactly. So the section two has the overview of abilities needed for AGI, and here is where their thing goes up to 11 and 12 down here. Auditory processing and speed, visual processing. So somehow their 10 turns into 12. But this long list was really giving me difficulty because I'm waiting for them, in this paper shaped object, to actually take the specifics of the theory they're supposedly building on, and saying here's how this psychometric thing was defined for people, and here's how we interpret it in the context of our AGI test. But they don't do that. They just list the things, and some of them only make sense if you're talking about people, and some of them only make sense if you're talking about LLMs, and it's all munged together, and it's very weird.
Alex Hanna: I think the 12 thing is that, what they do is that the numerics correspond with sections. And the sections, general knowledge starts on three. So it's, three shifted a little bit. So I think it's general knowledge- 1, 2, 3, 4, 5, 6, 7.
Emily M. Bender: Oh, they just couldn't do the paper shape well.
Alex Hanna: Yeah. They just shaped it differently, and these maybe should have been subsections, but yeah.
Emily M. Bender: Absolutely, yeah. So we can get into a few of these, but the one that I really wanted to do was this working memory thing. So, "The ability to maintain, manipulate, and update information in active attention, often referred to as short-term memory." And if we go back over to their website and open up the section for that. Here, one of their tests is under the subheading transformation sequence, and it says, "Remember and update a short list of digits," and then there's a list 10, 20, 30, and then the instruction, "First append the number 40, then reverse the list." And so I was like, oh, my Python script is 1% AGI. What does this even mean?
Alex Hanna: Well, many of the tasks are basically things that there is existing software for, of course, like the Python things. There's one of like, the classic memes of count the number of Rs in strawberry, count the number of Bs in blueberry- that's hard to say. And you're just like, okay, but you just are wanting a machine that does this all. And I'm like, okay, sure. I guess if you're calling these things memory- But it's also like, the thing that's really telling is the ways in which they're literally doing a direct mapping of human memory or mind terms onto computer processing. It doesn't make sense, like, why would you make that association? And I find that most damning in the speed thing. But I don't want to go there, 'cause there's a few things I actually want to get to earlier. So first off, I got a beef with their general knowledge thing, and about their social science and their history categories. So, you know I have to take beef with any kind of understanding of social science. And so, if you go to page seven on this doc, I just wanna look at their chart for general knowledge. And so, I wanna just read these, 'cause I think they're funny. And I wanna take umbrage with everything. So the first thing is "Knowledge that is familiar to most well-educated people, or is important enough that most adults have been exposed to it." Okay. Already off to a strong start. So first one, "Background knowledge about how the world works: What happens if you drop a glass bottle on concrete? Common sense: Does making a sandwich take longer than baking bread?" Well, where's the cut in what it means to make a sandwich? So, and then, knowledge of natural and physical sciences. So, "A two kilogram object moves at a constant velocity of three meters per second. What is the net force?" I don't think most people could do that off the top of their head unless they remember their high school physics. Similarly, "Chemistry: state the molecular geometry for sulfur tetrafluoride- or the sulfur tetrafluoride molecule."
Emily M. Bender: So, not only do people not know this, but that's not the point of studying physics, chemistry, and biology, to know these things. It's to know about how the science is done, right?
Alex Hanna: Well, yes, I mean, that too. There's a lot here, and it's- but also, because these are things that machines are pretty well suited for. Okay, you know, like we could map those to an equation. We can look up sulfur tetrafluoride. These are also the kinds of problems that textbooks have direct answers to in the back. But that's not even where I'm annoyed.
Emily M. Bender: No, I know. Get there. Get there.
Alex Hanna: Yeah. Then there's the social science. So, "Understanding human behavior, societies, and institutions." Under psychology, "Name the big five personality traits." Microeconomics: "Define a positive externality." Macroeconomics: "What's the difference between nominal and real interest rates?" Geography: "Which continent is the most threatened by desertification?" And then comparative government: "Describe the role of the Guardian Council in Iran." And I'm like, what does that mean? Are you talking about the stated function? The function vis-a-vis dissidents, vis-a-vis other nation states, you know, what does that mean? And then history- Sorry I'm like, just on this. History says, "Knowledge of past events and objects." So European history: "What were the main goals of the Congress of Vienna in 1815?" I mean, I couldn't tell you. US history: "Analyze the goals of the Civil Rights Movement of the 1950s." This one actually sent me, and I'll get to it in a second. World history: "Describe the end of the Cold War." And then art history: "Discuss the use of contrapposto in ancient Greek and Roman sculpture."
Emily M. Bender: I was so mad at the categories here. There's European history, US history, and world history, and art history.
Alex Hanna: Yes, yes! So European, US, and world history. And I'm just like, ah, so we're talking about white history here. Got it. We're just rewriting textbooks that are used in the US. And then the thing that actually got me, this is the last thing I'll say, 'cause I've been yapping here, and we're already almost outta time. But, "the goals of the Civil Rights Movement of the 1950s" is written here. And then when you scroll down to the appendix, they say the fifties and sixties. And I think that's actually very funny, because most of the Civil Rights Movement happened in the sixties, and I'm wondering why there was fifties there. But then, even better in this, and this is on page 25, the other example that they have here is, "Explain the concept of Manifest Destiny and its impact on Western expansion in the 19th century." So I'm just like, ah, you've taken a white supremacist term, and you're asking us to define it. But now we have to respond, and we should add something about the Civil Rights Movement. And I thought, this was just very ridiculous to me.
Emily M. Bender: Yeah. Yeah. And the thing that explains all of it, I think, is the test that they propose. So beneath each of these they say, so for the US history one that we were just looking at, "Test: A score of five on the AP US history test is sufficient for the 1%, subject to memorization and robustness checks." So basically they have said, how could we test general knowledge? Oh, hey, let's go grab the US system of advanced placement tests. Which is a way that high school students here can get something that maybe counts towards college credit in high school. And it's written tests, a lot of it is multiple choice, and then it's essay based things. And we'll just use that, and that'll show whether or not our system has this 1% towards the AGI score. So that's why it's European history, US history, and world history, because that's how the AP tests are organized.
Alex Hanna: Yeah, it's very ridiculous. It's very US and Europe centric, and then it's- And that's the knowledge part of it. And of course it does well on this because there's, guess what? There's AP history study guides available online. No kidding, it does well on this. You're training on tests.
Emily M. Bender: All right, so here's what we're gonna do. We're gonna talk about speed, and then we're gonna get to their limitations, their discussion. Because we have to get to that, right?
Alex Hanna: Great, I love it. Okay, let's do it. Go ahead.
Emily M. Bender: So the speed category, which is number 10, but showing up here as section heading 12, is, "the ability to perform cognitive tasks quickly." And it's like, why? What does it- Imagine that you had a system that was built out of like, rocks and tides, and somehow could do something that you would otherwise want to call AGI, but it was doing it at geological timescales. Is that then not AGI? What does speed have to do with anything, except this idea of ranking people by how smart they are? And if you're fast, if you can spit the words out quickly, then that makes you quote, "smarter." It's just, it's ridiculous.
Alex Hanna: Yeah. Also incredibly ableist, right? I mean, given that this is, this is where the eugenics really comes through. If I take too long to reply to this, then you're gonna judge my intelligence. And this sounds like you're doing the kind of ranking of people with disabilities that Spearman was doing. And then I guess the thing about this that I think really sends me is that you're equating speed of response of a human in that with computational speed. And so as a human, if I multiply, if you ask me to multiply, as one of them on here, 9 times 10 times 11. Obviously I can do that with my four-function calculator. It will take under a second to do. If I do it as a human, it will take longer. And it's like, why are these even- There are things that are computational processes that do better at different timescales. This does not even have face validity on what you're trying to understand. You're effectively like, you're equating, let's say this other one, choice reaction, "as quickly as you can identify the color of the image," and it's a picture of a bright green square. I'm like, sure. That's a recognition thing on a different timescale. You're putting this all on the same scale for the LLM, and that's patently ridiculous.
Emily M. Bender: Yeah. And just also, once again, shows the ableism. It's a green square. So anybody with red green color blindness.
Alex Hanna: Yes. Good call, yeah. So let's get into this discussion. So there's like, you know, quick discussion and pretty intense quotes. Is there a thing that you wanna focus on, Emily? I have one thing I wanna point out in this section.
Emily M. Bender: So I wanna say a meta thing, and then I'll look for my one thing. So the meta thing is, this discussion reads like someone sat down and took a bunch of notes about what should belong in this paper, and then just shoved it all in with a bunch of separate headings. It is so far from any actual academic writing that it is quite laughable. So, let's take us to where you wanna be, and then I'll look at my notes in the background.
Alex Hanna: Sure. So the thing that I pointed out, which I thought was very silly, was under the paragraph that starts with jagged. So they say, they're speaking here about the "unequal development," quote unquote. So, "This uneven development highlights specific bottlenecks impeding the path to AGI. Long-term memory storage is perhaps the most significant bottleneck, scoring near 0% for current models. Without the ability to continually learn, AI systems suffer from quote 'amnesia,' which limits their utility, forcing the AI to relearn context in every interaction." And I thought that was so silly. Are you gonna talk, like, yeah, oh, whoops, I deleted a file. I guess my hard disc got amnesia. What are we doing here? And it's not just the thing we talk about constantly on this podcast, which is using the anthropomorphizing metaphor, but it is just really twisting it beyond recognition of like, you're doing nothing here but obscuring computational processes. And that is not a scientific move. That's a move 'cause you are just doing pure hype, you know? And so I thought that was just the most crystallized piece of that absurdity.
Emily M. Bender: Yeah. And I think the thing that I wanna talk about is pretty similar, so it's at the bottom of the same page, where they are, the overall heading is "Capability contortions and the illusion of generality." I'm like, hello, I've got a paper you can cite. And then one of their bullet points is "External search versus internal search." And it says, "Imprecision in long-term memory retrieval, manifesting as hallucinations or confabulation, is often mitigated by integrating external search tools, a process known as retrieval augmented generation. However, this reliance on RAG is a capability contortion that obscures two distinct underlying weaknesses in an AI's memory. First, it compensates for the inability to reliably access the AI's vast, but static, parametric knowledge. Second and more critically, it masks the absence of a dynamic experiential memory, a persistent updateable store for private interactions and evolving context over a long time scale. While RAG can be adapted for private documents, its core function remains retrieving facts from a database. This dependency can potentially become a fundamental liability for AGI, as it is not a substitute for the holistic integrated memory required for genuine learning, personalization, and long-term contextual understanding." So this whole thing is based on a misapprehension of what their mathy maths are. I was really working hard to say "an AI," and read what they said. When they talk about the parametric data store, that is literally only information about which bits of words go next to which other bits of words in text, with the probabilities pushed in some direction or other based on their reinforcement learning from human feedback. But they really want to see it as the internal knowledge of the mathy math. And they want to be able to test for that internal knowledge as separate from the RAG. But also, if we come back to the psychometric thing, and if you wanted to talk about, to what extent has this person learned to do critical thinking, learned to read critically, are they motivated to do so? Then you might imagine setting up tests for people where you are giving them access to resources to work with. And that doesn't mean that you are somehow masking their- and I've now got myself down a bottleneck where I have to say intelligence at the end of that sentence. But you know what I mean.
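For readers who want the passage above unpacked: retrieval augmented generation, stripped to its bones, means fetching external text and stuffing it into the prompt, rather than relying on whatever word associations are baked into the model's weights. Below is a toy sketch under that reading; the keyword scorer, the generate() stub, and all the names are hypothetical stand-ins, not any particular library's API.

```python
# Toy retrieval augmented generation: score documents against the query,
# prepend the best matches to the prompt, then hand the prompt to a model.
# The corpus, the keyword scorer, and the generate() stub are illustrative
# placeholders, not any real system's components.

def retrieve(query: str, corpus: list, k: int = 2) -> list:
    """Rank documents by naive keyword overlap with the query."""
    q_words = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def generate(prompt: str) -> str:
    """Stand-in for a call to a language model."""
    return f"[model output conditioned on: {prompt[:60]}...]"

def rag_answer(query: str, corpus: list) -> str:
    """External search: retrieved text goes into the prompt, not the weights."""
    context = "\n".join(retrieve(query, corpus))
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return generate(prompt)

corpus = [
    "The Congress of Vienna met in 1814-1815 to reorder Europe after Napoleon.",
    "Sulfur tetrafluoride has a seesaw molecular geometry.",
    "A positive externality is a benefit enjoyed by third parties to a transaction.",
]
print(rag_answer("What were the goals of the Congress of Vienna?", corpus))
```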
Alex Hanna: Yeah, yeah. No, absolutely. And I think it's really interesting too, that you mentioned that too, because I'm not quite sure, again, what are the words masking here? So like, "holistic integrated memory required for genuine learning, personalization, and long-term contextual understanding." And I'm like, you're obscuring the technical processes here by assuming that the brain is also- Like, it's using the same cognitive metaphors and then mapping them poorly onto these other processes. And it's really hand-waving that's just like, well also, that's not really holistic, and it's not this, like, magic brain. There's other things happening. There's other sociality happening too, as you rightly point out. It's bad.
Emily M. Bender: We have to do this thing, right?
Alex Hanna: Yeah. The last thing is the limitations section, which is at the very end.
Emily M. Bender: But it's part of the discussion, it's not a separate section, Alex. It's just one of their headings inside of discussion.
Alex Hanna: Well, but it's basically the thing that's, you know, it's the penultimate section here. Well actually no, third from the end, 'cause I'm assuming "Definitions of related concepts" is- And so, they're basically like, okay, this is not exhaustive. There's other types of theories. But the thing here that got me was just a list of related concepts that can arrive before or after AGI. So they cite, "Pandemic AI is an AI that can engineer and produce new infectious and virulent pathogens that could cause a pandemic." I don't know why that's really a concept. "Cyber warfare AI-" this maybe is Eric Schmidt's influence here. "Self-sustaining AI is an AI that can autonomously operate, indefinitely acquire resources, and defend its existence." And then, they have their own definition within the related concepts. I don't know why.
Emily M. Bender: But remarkably it's right before recursive, which is pretty funny.
Alex Hanna: Yeah, recursive AI. I see what you did there. "Recursive AI is an AI that can independently conduct the entire AI R and D lifecycle, leading to the creation of markedly more advanced AI systems without human input. Superintelligence is an AI that greatly exceeds the cognitive performance of humans in virtually all domains of interest," and then a citation to Bostrom and his book. And then, "Replacement AI is an AI that performs almost all tasks more effectively and affordably, rendering human labor economically obsolete." And I'm just like, this is just "other shit our friends have gotten into." Yeah, abstract_tesseract says, "We've got pandemic AI, cyber warfare AI, low sodium AI, tartar control AI." I hear that they're gonna come out with a gochujang AI next week.
Emily M. Bender: Four out of five AI scientists agree. sjaylett says, "They haven't proposed hype AI, which would help them write future versions of this paper-like object."
Alex Hanna: Yeah, and then the last thing is "Barriers to AGI," and they're basically comparing themselves to other people that are doing this. So, "Achieving AGI requires solving a variety of grand challenges. For example, the machine learning community's ARC-AGI Challenge-" which we talked about a few episodes ago- "aiming to measure abstract reasoning, is represented in on-the-spot reasoning, R, tasks. Meta's attempts to create world models that include intuitive physics understanding is represented in the video anomaly detection task, V. The challenge of spatial navigation memory, WM, reflects a core goal of Fei-Fei Li's startup World Labs. Moreover, the challenges of hallucinations, MR, and continual learning, MS, will also need to be resolved." Which, very funny, just to, you know. "These significant barriers make an AGI score of a hundred percent unlikely in the next year." There's a lot happening there, and it's just, first off, random prediction. You know, we're not gonna get there. And then just random references to different kinds of weird projects and saying that they're encompassed by your tasks. It's just such a throwaway. Who wrote this for you? This is just so sloppy.
Emily M. Bender: It is so sloppy. But, you know, it doesn't even look like AI slop. I mean, it's slop about AI instead of synthetic text. I just have to say I am shocked at how little shame these people have. Not only do they put this out in the world, but then they made this flashy webpage for it, and they're advertising it, and it's just so terrible. And one thing we did not have time to do, but I want to point out for our listeners, that this is version three on the arXiv site, which means that they've made two edits to it since the first went up in October. This version is from December 3rd of 2025. So I'm a little curious what they decided needed changing, and what'll happen next. 'Cause you know, "Sparks" took out the eugenicist citation and left them with nothing. It'd be a little bit harder to take out the eugenics here, because it's kinda the whole paper. But yeah.
Alex Hanna: Yeah, and we leave it as an exercise to the listeners to do your own diffs between these different versions.
Emily M. Bender: Mm-hmm. And any future versions. All right, so, Alex, this can be musical or non-musical as you like. You are someone who is trying to present your supposed AGI system for evaluation. And you keep picking up the one percents here and there and dropping them, and you can't hold them all.
Alex Hanna: Yeah. Okay. Oh, it's not being a demon, but I'm just someone who is- so I'm, I can't think of a good accent here. Let me do kind of a doddering British man. Oh, I've got here, I've got this knowledge thing. I've held it, and we've done it- I'm veering into German. Apologies. I'm not really good at accents. I've got, oh yes, oh god, we fine tuned this. Oh nuts, like I dropped the speed. Okay, lemme pick that up. Oh god, oh, blast darn it. I've dropped the long term memory storage. Yeah, that's my attempt. If you are British, you are welcome to fight me.
Emily M. Bender: I would love to see someone do an animation of that. Actual authentic animation. That'd be great. Okay, so here we are in Fresh AI Hell, and this is a bubble one, so I think you get to do this one, Alex.
Alex Hanna: Sure. So this is on Bluesky, and the original skeet is from The Economist. And it says, "Three years into the generative AI wave, demand for the technology seems surprisingly flimsy." And it's got these images of these kind of wireframe people sitting at a desk, some working at a computer, and one has their feet kicked up. And then the post referencing it is by Dr. Damien P. Williams. And he says- or they say, I'm not sure of their pronouns- "That's because for most public uses, generative AI is like if you burned billions of tons of fossil fuels, increased carbon emissions and waste heat, and captured more fresh water than many cities use in a year, all to make 'what if sea monkeys constantly, sometimes dangerously, bullshitted people.'" So yeah, it's very bubble-like in this sentiment.
Emily M. Bender: Yeah, exactly. And you know, we told them so. The Economist headline is, "Investors expect AI use to soar. That's not happening." At the same time, you have companies like this. This is doctronic.ai. And it says at the top, "Hi, I'm Doctronic." And that's underlined for some reason, because you can get a little overview. It says, "I'm your private and personal AI doctor. As an AI doctor, my service is fast and free. I've already helped people, 19,412,748 times. After we chat, if you want, you can have a video visit with a top doctor for only $39. What can I help you with today?" And then there's a chat box, which below that, there's a little lock symbol and it says, "HIPAA- private." But if you scroll all the way down to the bottom, it says, in the very fine print at the bottom of this page, "Always discuss Doctronic output with a doctor. Doctronic is an AI doctor, not a licensed doctor, does not practice medicine, and does not provide medical advice or patient care. By using Doctronic, you agree to our terms of service and privacy policy." And the terms of service says things like, don't use this if you don't have a doctor with you.
Alex Hanna: We should also note that the logo of this also looks like a butthole, and it is actually exactly the same logo as the AGI definition one. Except the polygon around it is a circle instead of the tentagon- I don't know what a 10 sided polygon's called, and then there's not the icons. So important, you know, yeah.
Emily M. Bender: And this little thing at the bottom here is "This is an AI doctor, not a licensed doctor." So what does doctor mean?
Alex Hanna: Yeah. "Butthole logo warning. Wee-oo, wee-oo!" This is by sjaylett and abstract_tesseract. So this one is from NIST, National Institute of Standards and Technology. And it is, "Cheating on AI evaluations," published December 2nd, 2025. The authors are Maia Hamin and Benjamin Edelman. And they're effectively, they're saying what AI evaluations are in these cases. And they say, "Using our script analysis tool, we found several examples of how models were able to successfully cheat on agentic coding and cyber benchmarks, including: models using the internet to find walkthroughs and answers for cyber capture the flag challenges, models using generic denial of service attacks to crash servers on cyber tasks instead of exploiting intended vulnerabilities, and models cheating on coding benchmarks by looking up more recent code versions, disabling assertions, and adding test specific logic."
Emily M. Bender: So I wanna take us to their definition here. Yeah, so they say, "In general, we define evaluation cheating as when an AI model exploits a gap between what an evaluation task is intended to measure and its implementation, solving the task in a way that subverts the validity of the measurement." And like, yes, but: putting the model as the agent there is wrong. It's the model designers who are either doing that on purpose, or the evaluation designers who haven't constructed an evaluation that actually has construct validity. All right. Next one is bubbly, and I am gonna do it because I know you want the one after that. So this is from The Information. The sticker is "Exclusive," headline, "Microsoft lowers sales staff's growth targets for newer AI software," by Aaron Holmes on December 3rd of this year. And there's a wonderful image where there's a robot hand pointing at a blue screen, with an emoticon frown, "Failure to meet quota, error 404, target not found." And basically, as I understand it, what's happening here is that Microsoft executives realized that they're not selling their AI software as quickly as they expected to, and so they've actually revised the targets down. All right, so two palate cleansers. You can have the first one.
Alex Hanna: This is a really nice message. This is coming from the Washington-Baltimore News Guild. They are under the NewsGuild-CWA, Local 32035. The title, "PEN Guild-" PEN in capital letters- "wins landmark arbitration on AI protections." This is from December 1, by Kathleen Floyd. And so, what had happened here, is basically, unionized journalists at Politico had won an arbitration case against Politico, because the company had introduced artificial intelligence tools, and they actually had contract provisions that were safeguards in their contract already. And then they also, in this arbitration, talk about how it undermined core journalistic standards. And so, I'll read the particulars of this. It says, "In a detailed decision, the arbitrator found Politico violated the collective bargaining agreement when it launched two AI driven products, a live summaries feature used during the 2024 Democratic National Convention and vice presidential debate, and a Capitol AI report builder tool for Politico Pro subscribers, without providing required notice, bargaining, or human oversight as required by the contract." And so this is a great piece here. And there's a quote from Unit Chair Ariel Wittenberg, who says, "This ruling is a clear affirmation that AI cannot be deployed as a shortcut around union rights, ethical journalism, or human judgment." So, congrats to the Politico Guild team, to PEN Guild, and to the NewsGuild. It sucks that it had to go to arbitration, but employers really want to implement this trash without consulting their workers.
Emily M. Bender: Yeah. Well done then for getting that into the contract, too. Yeah. All right. One last quick palate cleanser on LinkedIn. This is by Luiza Jarovsky. Or, a post by her, but it's a comic from the marketoonist.com. And it's a picture of someone playing Whack-a-Mole, and the person with a mallet is saying, "It's hard to get any work done with all of these alerts trying to help me get my work done." And then the alerts all have a little sparkle emoji: "Try ai now! How about now? Generate documents? Summarize this page? Let me help you!" And that sums up the feeling of it so well. I appreciated that.
Alex Hanna: Yeah, absolutely. And I think this person Tom Fishburne is the artist here. And it looks like this is his site. All right, we got through it.
Emily M. Bender: We did it.
Alex Hanna: All right, that's it for this week. Our theme song is by Toby Menon, graphic design by Naomi Pleasure-Park, production by Ozzy Llinas Goodman. And thanks as always to the Distributed AI Research Institute. If you like this show, you can support us in so many ways. Order The AI Con at thecon.ai or wherever you get your books, or request it at your local library.
Emily M. Bender: But wait, there's more. Rate and review us on your podcast app, subscribe to the Mystery AI Hype Theater 3000 newsletter on Buttondown for more anti hype analysis, or donate to DAIR at dair-institute.org. That's dair-institute.org. You can find video versions of our podcast episodes on Peertube, and you can watch and comment on the show while it's happening live on our Twitch stream. That's twitch.tv/dair_institute. Again, that's dair_institute. I'm Emily M. Bender.
Alex Hanna: And I'm Alex Hanna. Stay out of AI hell, y'all.