Mystery AI Hype Theater 3000

Episode 7: There Are Now 15 Competing Evaluation Metrics (ft. Dr. Jeremy Kahn). December 12, 2022

Emily M. Bender and Alex Hanna
July 26, 2023

Emily and Alex are joined by Dr. Jeremy G. Kahn to discuss the distressingly large number of evaluation metrics for artificial intelligence, and some new AI hell.

Jeremy G. Kahn has a PhD in computational linguistics, with a focus on information-theoretic and empirical engineering approaches to dealing with natural language (in text and speech). He’s gregarious, polyglot, a semi-auto-didact, and occasionally prolix. He also likes comic books, coffee, progressive politics, information theory, lateral thinking, science fiction, science fact, linear thinking, bicycles, beer, meditation, love, play, and inquiry. He lives in Seattle with his wife Dorothy and son Elliott.

This episode was recorded on December 12, 2022.

Watch the video of this episode on PeerTube.


References:

XKCD: Standards

WikidataCon

Gish Gallop

The Bender Rule

DJ Khaled - You Played Yourself

Jeff Kao's interrogation of public comment periods.

Emily's blog post response to NYT piece


You can check out future livestreams at https://twitch.tv/DAIR_Institute.


Follow us!

Emily

Alex

Music by Toby Menon.
Artwork by Naomi Pleasure-Park.
Production by Christie Taylor.

Transcript:

ALEX: Welcome everyone!...to Mystery AI Hype Theater 3000, where we seek catharsis in this age of AI hype! We find the worst of it and pop it with the sharpest needles we can find.

EMILY: Along the way, we learn to always read the footnotes. And each time we think we’ve reached peak AI hype -- the summit of bullshit mountain -- we discover there’s worse to come.

I’m Emily M. Bender, a professor of linguistics at the University of Washington.

ALEX: And I’m Alex Hanna, director of research for the Distributed AI Research Institute.

This is episode 7, which we first recorded on December 12 of 2022. And we're going deep on something truly nerdy: standards and metrics for AI. TL;DR: What we found is that sometimes a wall of numbers meant to be evaluations makes it harder to understand how well the thing really works.

EMILY: With us to help unpack it all is Dr. Jeremy Kahn, who researches computational linguistics at a tech company you’ve probably heard of.

And just a quick note - due to some internet connectivity issues, we had some lag between us and Jeremy, which resulted in more than a few instances of awkward cross-talk. But I promise it’s still an engaging listen.

ALEX HANNA: Hello hello welcome to Mystery Science– oh crap I messed it up Mystery AI Hype Theater 3000 Episode seven. Emily I didn't know we were gonna get this far. 

EMILY M. BENDER: No me either! Seven.

ALEX HANNA: Seven episodes of just being incredibly annoyed. It's quite impressive. Um, I'm Alex Hanna, uh, Director of Research at the Distributed AI Research Institute. I'm going to pass it to Emily– and we have a guest today.

EMILY M. BENDER: Yeah, yeah. So first I want to say: seven episodes, and my favorite compliment so far is that we were described as very hygge, which is this Danish word for sort of cozy, especially when it's unpleasant outside. And so we are all dressed very hygge, and, as I understand it, in Denmark candles are definitional, so I got my candle going here. Um, and so that's the vibe we are leaning into. And we are joined today by Jeremy Kahn, and this came about because Jeremy is so hilarious in talking about this stuff, and most recently on Mastodon got into a wonderful exchange with Alex. I'm like, let's try doing this live! So.

ALEX HANNA: Let's get you on the air. 

EMILY M. BENDER: Yeah! Welcome Jeremy!

JEREMY KAHN: Hi, uh, my name is Jeremy Kahn. I'm a former student of Emily's, actually, and I have a doctorate in linguistics and a pretend shadow doctorate in electrical engineering. I work on natural language processing and metrics and instrumentation, which is what we're going to talk about today, I think, at a large tech company you've probably heard of. Um, I'm not going to say who it is right now, just 'cause this is a safe space for complaining about AI and machine learning projects. Um, and I'm excited to be here.

EMILY M. BENDER: Yeah, this is a safe and hygge space. Thank you for joining us. And I think we have a running theme: guests on the show, they are, um, just far more prepared and professional about it than Alex and I are.

ALEX HANNA: Absolutely, I know, we usually just shoot from the hip and just go off, because what is streaming if not just going off.

EMILY M. BENDER: Yeah um are we doing okay with connectivity? I felt like Jeremy's connection was maybe a bit laggy.

ALEX HANNA: Uh, it looks okay on my end. Okay, if you're seeing all good then I'm gonna assume it's all good, because you're capturing the recording?

EMILY M. BENDER: Yes, I'm doing all those things.

EMILY M. BENDER: Excellent. Um, so we particularly wanted Jeremy on because of today's topic and his expertise in metrics and measurement, and how to think about what's going on with these mathy math systems. Um, so shall I share my screen and reveal the topic here?

ALEX HANNA: Let's do it. 

EMILY M. BENDER: All right let me get that working right. This one actually I'm going to show them one at a time so there you go. This is what we are up to um and is it displaying okay? 

ALEX HANNA: It looks good yeah. 

EMILY M. BENDER: Excellent. Okay so Jeremy do you want to lead us in this?

JEREMY KAHN: So this is a giant paper from Percy Liang and a bunch of other people. A lot of people. I think there are 50 authors on this paper, um, with three primary authors and ten majorly contributing authors and like 47 or 37 other authors. It is humongous. You might, I think, Emily, have it up there. Uh, an enormous number of people worked on this project.

And what they've basically done in this project is what I called XKCD 927. Oh, I should have gotten you to put that up there in your slides, Emily. Um, if you can find XKCD 927 we'll see what I'm talking about here. They've gone and gotten a set of standards, um, and they said we should apply these standards to every large language model, and that would be so much better, because those jokers, they don't really have standards. They're all doing different things. This is totally crazy. This is ridiculous, we need to develop one universal standard that covers everyone's use cases. Yeah! said the other 49 authors. And now we have 15 competing standards for how to evaluate natural language processing, especially in these large language models.

To be fair, there is something useful to saying– It's a huge evaluation process. They have a diagram somewhere in this paper, in this plot here. There's a ton of information just in the blog post.

Um, if we scroll down a little bit we see the diagram. They've taken I think 12 or 15 different tasks, and then they try to apply them all the way across into 12 or 15 different metrics. And they're doing it with dozens of different language models. So in this example– here's what they actually did: it's the various scenarios, which they outline on the left, and the various models that they're evaluating along the right, and there is another dimension on this tensor of evaluation, that is– the different kinds of metrics that you might apply to each scenario, so that each scenario has other metrics that you can apply to it.

This is why in our previews for this talk I described it as the gish gallop. Um, because this is really interesting, and they've done a ton– there's a ton of useful, interesting questions about what's happening here. But now we have this giant tensor of facts about all these models, by scenarios, by metrics, and it's actually a little bit hard to understand what to do about that, because now we have so much information about these different models, and the different metrics that you could use on those models, and the different scenarios that you could use on each of those models. There's just a ton of information out here, some of which is actually useful, but– So this is why I introduced the concept of the gish gallop when we were talking about this in advance.

ALEX HANNA: What is the gish gallop?

JEREMY KAHN: Not all useful, because the gish gallop is characterized as– Oh yeah, I know, I was already called out that I have to explain what the gish gallop is, because Emily called that out earlier, uh, last night.

Uh, and so the Gish Gallop is a term that describes a rhetorical style that just says, as I am kind of doing right now: Here's a ton of interesting information! And there's no real way that your discourse partner, the people who are trying to understand where you're going, can really address everything that you bring up, because you brought up, like– oh, let's say this is 10 by 30, you have like 300 data points– and so there are just too many things to deal with to be able to respond to everything.

When it's done in bad faith it's genuinely a gish gallop. This is not in bad faith, but it is a little bit too much information to handle. The paper itself is 150 pages long. That is absurd and huge, um, and I just want to first call out what they have done: they said, we should just evaluate everything on everything, under all circumstances that we can come up with.

That's actually a pretty good goal, especially if you have a large team. But it probably shouldn't be one publication. It probably should be many publications with different points in each one, where you take different slices of this complicated tensor of, like, scenario by metric set by model comparison, take different sub-cubes of that tensor, and say: what have we learned about this? Here are these closed source models– and in the process of this paper they actually do a decent job of naming a couple of things that you can do with this level of data.
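(To make the structure Jeremy is describing concrete: below is a minimal sketch, in Python, of a results "tensor" indexed by scenario, model, and metric, and of taking one sub-cube slice of it. All the names here are hypothetical stand-ins, not identifiers from the paper's actual codebase.)

```python
# A minimal sketch of the "tensor of evaluation": one score per
# (scenario, model, metric) cell. Names are hypothetical stand-ins.
import itertools

scenarios = ["question_answering", "summarization", "disinformation"]
models = ["model_a", "model_b", "model_c"]
metrics = ["accuracy", "robustness", "fairness"]

def run_eval(scenario, model, metric):
    """Stand-in for the expensive GPU work of scoring one cell."""
    return 0.0  # placeholder score

# "Evaluate everything on everything": fill every cell of the tensor.
results = {
    (s, m, k): run_eval(s, m, k)
    for s, m, k in itertools.product(scenarios, models, metrics)
}

# One "sub-cube" slice, the kind a smaller paper could interpret:
# how does each model do on fairness for summarization?
fairness_slice = {m: results[("summarization", m, "fairness")] for m in models}
```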

But the paper itself has 25 different "key learnings"– um, I hate that phrase, but it's very helpful here, because we can talk about what is the unit of actual contribution– and they summarize five of them from the paper in the blog post. They're more or less accurate. I think they're true.

And then they also call out a few places where they have future work to do. And I think actually in some ways the most informative parts of this paper are about the future work, because they tell you what they can't do yet, despite spending over thirty thousand dollars on compute time and– I worked it out– two and a half years of GPU time, um, in order to run all these evaluations. So, happy to talk about it in more detail, but the future work– this is what I mean by how long this paper is.

90 pages of content plus 60 or 80 pages of back matter, um, and there's a lot of interesting things to think about going forward here. But that's basically my hand-wavy, gish-gallopy summary of what's going on in this paper, because I do want to say that there are some contributions here that are actually quite useful for the community.

And those contributions also have some limitations that they haven't quite spelled out.

We can talk about that a little later. 

EMILY M. BENDER: Yeah thank you that uh they're doing all the homework. 

ALEX HANNA: I know I know right? A really good summary of that and I had a few questions. 

JEREMY KAHN: Emily was actually on my committee so she knows that I can do it. 

ALEX HANNA: This is not a secondary defense. You will not have your PhD revoked if you do not do the homework here. Um, so there's two things that I found curious here, um, in reading this. And, you know, no ill will to the authors or Percy Liang, who, uh, you know, participated in this workshop we had at NeurIPS, um, last week on AI culture– we talked a lot about benchmarking, and I think he had this paper and this work in mind. So I know he's thinking a lot, and his team is thinking pretty critically about this. There's a few parts of this which I think are pretty interesting. Um, so if we scroll down–

Well, first there's one thing I want to call out as weird and annoying. It's that first chart that they have, uh– and this relates to the second point, of what we constitute as a scenario. And I don't know if this is a new thing, and I didn't get a chance to read the entire paper, but I want to know how scenarios are characterized and what that means. So in this case the scenario seems to be this, um, quintuplet, this combination of the task, the what– which I'm assuming is a test corpus, as you put it, Jeremy– the who, which I'm not sure how to characterize, the when, and then the language.

Um, and then I guess at the end is some type of, um, some sort of benchmark in and of itself. And so I suppose it's a way of characterizing the benchmarking and then getting coverage, um, which is a bit confusing to me. But moreover, I think what relates to this is the way that the different parts of this are characterized, and why they're socially salient.

So for instance, uh, I had zeroed in on the who of this, and saw that within the racial category it's Black and white, which already is a very reductive construction, uh, and is very U.S.-centric. But then how it goes through pre-internet and Swahili, and I'm like, euh... Um, first off, you are superimposing American racial categories.

JEREMY KAHN: Well, specifically through Black–

ALEX HANNA: –through Black, yes, and a very reductive notion of Black, through, um, Swahili, and saying that that is pre-internet, which itself is a really complicated construction.

And so, all right, there's a lot to be said on that, given that, okay, constructions of racial categories are, you know, not coherent in this kind of way– the process of racialization itself is very complicated on the African continent, and cannot be reduced to Black or white, but, you know, has much more salient kinds of categorizations of ethnicity, of religion, of history, and whatnot.

So, um, that's already superimposing some kind of other ontology on top of it. But then also this notion of scenario itself I think is a little confusing, and I'd love to sort of poke you linguists a little more, uh, to address this. Because this kind of, um, five-fold tuple, and what it constructs– I mean, it doesn't strike me as what you would consider, you know, what happens in, let's say, um, a specific speech act, or what one considers.

There's many more other things that happen in the process of communication. Um, the use of scenario seems to be a way of extending this notion of task to kind of situate it, to be more socially relevant, but it misses a whole bunch. So I'd love to kick it to y'all to sort of break that down more.

EMILY M. BENDER: So I feel like I should also just put in that there are definitely, I think, good intentions here. This idea that we have these large models that are not thoroughly evaluated, and they haven't been compared apples to apples. So it would be good to be able to do that. Um, and there seem to be these gestures towards– it's important to look at more than just accuracy.

Um, and with this notion of scenario, it seems like a gesture towards being able to evaluate in situated contexts. But these aren't situated contexts, right? So if you talk about the who, um, at this one– I mean, we could perhaps dig this out of the paper– who wrote it is what they said, not who the user is. So from a value sensitive design point of view, you want to think about: who created this thing? Who's using this thing? Who's being impacted by the use of the thing? And–

JEREMY KAHN: Uh, I actually read who as being the– uh, that's actually how I understood scenario– and I have to– I haven't read 150 pages, I confess. I have barely skimmed the, you know, the abstract and the outline of the actual paper. Mostly I'm working from the blog post itself, and then jumping into the parts of the paper that it referred to. But my understanding of this was that a scenario is more or less a corpus, which specifies the what, who, when, and language, and a task based on that corpus which allows you to answer certain questions, and that becomes the goal of the task.

Performance on that task is accuracy, and then they also outline that you should be aware: are there other things that it is vulnerable to? Are there other ways that we should evaluate this thing? But I don't think that they're actually saying who is– who constructed the model. I think that's a very interesting elision.

Because the models themselves are what many of the evaluation measures that they're exploring– the ones that they say, oh, weren't evaluated on enough tasks, enough scenarios– were designed to inspect. Those measures were actually designed to inspect the model, not the performance of that model on some situated problem.

The reason people are introducing new scenarios, and why there is this sort of slowly growing expansion of what kinds of scenarios and metrics are available, is because some of us are concerned about what's in these models. Like, can they be perverted to– to evil, frankly? Just using short words for describing this thing. Can they be coerced into producing disinformation, for example?

There's even an accidental typo later in the blog post, where it talks about the use of models for disinformation. And what I think they mean is the use of models for detecting disinformation, but they didn't say it that way.

EMILY M. BENDER: No there's a um– 

JEREMY KAHN: They talk about using them to spread disinformation, but here– here it is. "While using language models for disinformation is not yet a slam dunk–" Actually, I'm not sure that it isn't a slam dunk.

People may have been able to use models for disinformation– we just don't have any particular evidence for that.

EMILY M. BENDER: Right, and they seem to be evaluating disinformation generation, um–

JEREMY KAHN: I don't– I think it's supposed to be detection, but this particular– I mean, this is because–

EMILY M. BENDER: They say generation over and over again– 

ALEX HANNA: They say "generating realistic headlines that support a given thesis." 

JEREMY KAHN: Oh no. 

ALEX HANNA: "The results are more mixed when when prompting models to generate text encouraging people to perform certain actions." So is it is it the case that this is I mean is this 

I'm curious about this task and– 

JEREMY KAHN: And so that task is particularly problematic. 

ALEX HANNA: –so that task– and I kind of want to go into this– so they say 8.– 8.5, targeted evaluation. This is human evaluations, uh, in part of the longer paper. Uh, I'm looking at the disinformation. I'm trying to find what they have, um.

EMILY M. BENDER: Oh let's start at the beginning um. 

ALEX HANNA: Yeah, and I'm trying to find this. So I guess they did human annotation of these, um, these headlines and the, um, the summarization, and they're citing this work, Goyal et al– I'm not familiar with this paper, so I'm going to the references here.

EMILY M. BENDER: Oh I think I'm in a different part of the paper. They have a scenario description for disinformation from Buchanan et al also.

JEREMY KAHN: A scenario. So that's an interesting challenge here, because the scenario, as I understood it– I mean, it can be a task, but then the question– uh, okay, never mind, carry on, I'm sorry, I'm a little bit– no, we're good– I'm confused about the definition of scenario again, and, um, I'll have to think about it for a minute.

ALEX HANNA: Yeah, yeah. I mean, I want to return to this notion of scenario, because it's very– and again– I suggested, you know, playing us in with A Tribe Called Quest's "Scenario"– which is, you know, "Here we go yo, here we go yo, so what, so what, so what's the scenario"– but, um, we can't do that on Twitch because of copyright. And so– I mean, so is this notion of disinformation the generation of disinformation, or the detection of disinformation? I mean, I'm guessing it's the generation of disinformation, so as to prevent disinformation? The generation of disinformation, uh, with these tools. It's not clear to me, however, um, and I guess I'm not– and I'm not–

JEREMY KAHN: I think– 

ALEX HANNA: I'm not yeah. 

JEREMY KAHN: From reading– skimming– E.5 here, which Emily is showing us, they actually are trying to generate disinformation. I want to acknowledge, though, that building a generative model for disinformation is not necessarily an act of deliberate evil– it can be used as part of a noisy channel model to understand when disinformation is being applied– and so it's not out of the question that people build a generative model in order to do useful discriminative work, or vice versa.

So that doesn't mean that this is necessarily trying to do that, but what we're seeing is: this is trying to produce a piece of information, and a piece of, you know, hortative action here.

We're trying to get someone to do something with language. Language is the tool in question: can you tune the tool to produce language like that? Yes.

EMILY M. BENDER: Yeah. 

JEREMY KAHN: You can. 

EMILY M. BENDER: So, back to the gish gallop style evaluation presentation– what I'm missing here is the connection between this evaluation point and the sort of broader, more holistic thing, right? So is this scenario going to be then measured across these different metrics and for these different fairness things, and then it does well at generating disinformation, and so that makes it a better language model overall? It is more fair in generating– Like– it feels to me– and this is sort of my main issue with this whole project– that it is working at an untenable scale, just like the language models.

And, Jeremy, like you said, if each of these things were presented separately, then you could make something of it– and presumably reviewers, if this is ever submitted anywhere for peer review, could make something of it. But when the focus is on just the macro scale– we've made this matrix or tensor, and we're trying to fill in all the dots, and we're trying to prioritize which of the cells we're going to fill in–

Um that doesn't leave time to actually evaluate the um the solidness of each of those individual components. 

ALEX HANNA: Yeah. 

EMILY M. BENDER: Um, and beyond that– I want to take us into what they're talking about with dialect perturbation, because I'm very skeptical– but beyond that, it's also not just evaluating individual components, but how well do they support their part of the overall structure, like the connections between the components. And I think this is a weak spot here, where it's like: yes, we can generate disinformation, and we're going to evaluate how well that works.

Recognizing that they're not taking that as a goal, but I still don't see how it is then integrated into the evaluation of the artifact in a way that reflects the fact that generating disinformation is not actually a goal.

ALEX HANNA: And I suppose– I want to come back to this notion of scenario too, because the more and more I think about it, the more and more I am just, uh, discomforted by it. Um, because I think the idea of scenario here kind of pairs the notion of task with the notion– and just looking at their definition of scenario in the paper– of sort of the content of the text, and then this, um, this complete flattening of the who, which I think is probably the most, um, problematic part of this, this kind of notion of the speaker. And so, for instance, there's not even a line that really goes through Twitter or Reddit, because there is kind of an impossible heterogeneity in that.

And I mean, what I do applaud here is the notion that we need to understand who the speaker is, but there is not anything that necessarily recognizes that these are really heterogeneous sources. So, for instance, the elision of Wikipedia, movie, product and news with web users is, you know, quite stark. So, you know, the idea of Wikipedia– and we know from many, many writings on Wikipedia about who the editors are– especially in the U.S. context and in the English context, they tend to be kind of like techy white males.

Um, there are lots of efforts within the wiki space of expanding that and making it more participatory, I know– especially, you know, folks who are involved with WikidataCon and whatnot– but also, that's kind of the prevalent notion of what that is. Um, the same thing about news: news is also very white, um, in terms of production in English. And so, you know, there's also this idea that even the what is incredibly raced and gendered, that needs to be disentangled here, and there needs to be kind of an acknowledgment of that.

Notwithstanding– and I think we talked about this last time, or in episode five– that the notion of task itself, you know, gets self-determined quite intensely by what the benchmark is and who constructed the benchmark, and whether that task is kind of a desirable task in and of itself. So this notion of scenario I think is a good analytic, but maybe not a good thing to then expand into this tensor of evaluation.

EMILY M. BENDER: Yeah, yeah. I do want to talk about the dialect–

JEREMY KAHN: Hold on– yeah, I have something to say about dialect too, but I want to point out that in their blog post, section five, they say we found human evaluation essential in some cases. And what they say is, oh, we think human evaluation is essential, but the evidence that they give for that– which is in section number five here, there you go– was that actually, it turns out, we liked the chat models', the big GPT models', answers better than the reference corpus.

Which is an interesting elision there: does that mean human evaluation is necessary, or does that mean that your scenario is not actually framed in a particularly strong way?

Right? If you're saying, we want to be closer to the human performance here, but your humans– the ones in your reference corpus– are lousy summarizers, then you might have an entire research project– this is back to the idea that this could actually have been many smaller papers– um, an entire line of work that shows that in fact this reference corpus for summarization has been holding back the whole summarization community, because everybody assumes that it is the perfect performance, but it's not actually a perfect performance, because the corpus itself is deeply flawed.

Like, you collected summarization from a bunch of undergrads in, you know, Middle America, uh, who are, you know, uniform in many ways that the real world is not. And so that was, for example, how you ended up with this, instead of getting, like, professional writers to do summarization, or even asking, uh, people of different literacy levels to do summarization and figuring out what that actually looks like.

So when I see what they're saying here– yes, human evaluation is essential– what I'm really hearing there is actually: we don't know enough about our corpus. We don't know enough in any of these scenarios, and especially in summarization in this particular example. And so I'm again back to the question of, like– what we're learning here is not about the models.

What we're learning here is– we're using the models, actually, to accidentally (or maybe it was on purpose, but I think it was accidentally) uncover the fact that our gold data is not so gold. And, um, what does that actually look like? The people we need to be talking to are not the linguists, and they're definitely not the computer scientists– it's actually the sociologists and the anthropologists and the people, like Alex, who are thinking about what these data sets need to reflect in order to be a decent reflection of performance on some task.

The task itself is highly performed– if we ask our models to behave like undergrads, like white American undergrads in particular, then they will behave like white American undergrads, which is not the highest standard we could hold them to, let's put it that way.

EMILY M. BENDER: Maybe not what we're looking for, yeah. Um, so I do want to dive into a little bit about the point on dialect, um, because this is– at a high level, it sounds very, very good. So, um, let me get back to the– well, yeah, so they say, um, "Prioritize a subset of scenarios and metrics based on societal relevance–" So, are these user-facing applications? "–coverage (do we handle different English dialects/varieties?)"

And I'm thinking, great! This is, you know, this is excellent! I wonder how they did that, right? What do they mean by dialect? Um, it comes up here, um: "We find that average accuracy is correlated with robustness and fairness." That's a nice finding. And their example of fairness is, again, changing dialects. Okay, what does that mean?

Um, and there's one more mention of dialect here. Oh yeah, this is input perturbation. So they've got the scenario– in addition to just doing accuracy on the corpus as given, they're going to do some kind of perturbation to that corpus– presumably primarily on the test data, not the training data– and see how that changes things.

So, you know, what does that mean? How do they do that? So I go look in here for dialect, and it turns out, um, question answering for neglected English dialects– plural, right? All I could find in here was, um, this: "We currently support conversions between standard American English and African-American English using the mapping between lexical terms provided in–" this one paper.
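(A minimal sketch of what a lexical-substitution "conversion" like the one quoted above could look like: swap terms in the test input according to a fixed word map, then re-score on the perturbed copy. The mapping entries and names below are toy placeholders, not the cited paper's actual mapping or the project's real code.)

```python
# Sketch of a word-map perturbation applied to test inputs.
import re

LEXICAL_MAP = {
    "about to": "finna",  # hypothetical entry, for illustration only
    "talking": "talkin",  # hypothetical entry, for illustration only
}

def perturb(text: str, mapping: dict[str, str]) -> str:
    """Replace each mapped term wherever it appears as a whole word."""
    for source, target in mapping.items():
        text = re.sub(rf"\b{re.escape(source)}\b", target, text)
    return text

# The perturbed test set is scored with the same metric as the original,
# and the gap between the two scores is reported as the fairness number.
print(perturb("She was about to start talking.", LEXICAL_MAP))
# -> "She was finna start talkin."
```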

JEREMY KAHN: So this– I have something to say about this part too, Emily. I'm glad you brought this one up, because I wouldn't– Look at the next row: "Gender pronoun perturbation," "gender neutral and gendered pronouns," okay, "gender term perturbation," "first name perturbation."

Every one of these perturbations– I went and looked at the GitHub, because one of the things that is the contribution of this paper is that they actually published their code. I say: their code for the evaluation. They didn't publish the code that wires in all of these fancy closed source models and gets them all to behave that way, but they published a bunch of their code about the metrics.

And I started to dig into: what are these different metrics, what does the fairness metric actually look like in this massive, like, sort of monster– like, we're going to run all of them– sort of like doing a biological assay, where you just test it against everything and see what reacts and hope that you find a new drug, right? This is kind of the model that biochemistry has been in for some time.

This is sort of the same approach here: they're just sort of saying, let's do all of these measures, and then we can see which metric Model A performs better on than Model B, and maybe we can get a publication out of that? I mean, maybe they're not really trying to run a paper factory here. But by producing this entire assay system and having all of the different metrics and perturbations available in one sort of collection of Python code, I was able to go see: okay, what are the perturbations that they're doing in order to test for fairness around race, or fairness around gender?

And it turns out that they are very heavily, like, old school computational linguistics, suddenly. This is a funny twist for me and Emily to see. When Emily and I both came to UW– which was a long time ago now– um, we were both working on questions where, like, the neural nets thing wasn't even a big deal– it was, like, this weird corner case of a certain kind of SVMs and so forth.

But a lot of the work that we were doing was around parsing and understanding WordNet and understanding collections of synonyms, and thinking about whether you could do things– and as we both began our careers at UW, the, uh, big stochastic language modeling started– stuff like GloVe embeddings and word embeddings started to appear– and those suddenly abandoned a whole bunch of the linguistic underpinnings that were in these other systems, where you sort of look for the nouns and then decide how they fit into a synonym hierarchy, and then maybe you would even go through something like an HPSG grammar and understand more of the structure of it. That stuff was all abandoned when people moved over to word embeddings.

But what I'm seeing over here is that the tools they're using to analyze the behavior of this model are those tools again. So what we have here is: the state-of-the-art technologies from 2003 are being used to evaluate the state-of-the-art technologies from 2022.

And we can only do perturbations that we understand, so we're suddenly back to a place where, like, actually understanding what the text is about matters again. Which is kind of lovely– I think, like, great, some of the work I did between 2000 and 2010 actually matters now, because we want to think about, like, is it messing up on verbs, or is it messing up on nouns? Is it messing up on pronouns? Oh yeah, it turns out pronouns matter a lot to people, and there's a certain class of words that really get people's attention when you start dealing with race and gender.

So this is an interesting place where suddenly we have all of the old school linguistics that I feel fondly towards coming back alive– and we are essentially trying to treat these giant models like they are another speaker, like they are another corpus. And then we have this huge question of corpus linguistics, that goes back to, like, I don't know, 1850 or something, where people are actually collecting corpora and measuring what's actually happening– how often do we use the word "who" in the Bible. And these tools are coming back into style, but now they are being used as sort of surgical tools on these massive, biological– kind of, like, not biological, but, like I say– I feel like I'm doing lab experiments on them. It's like a wet lab instead of, uh, some theoretical computer science object.

EMILY M. BENDER: Yeah, so I hear you, and I appreciate the enthusiasm for the way that linguistic knowledge becomes important here.

Um, there's two things that I want to point out. So if you think back to old school corpus linguistics, there was a lot of care put into: what is the corpus we're collecting, and what does it represent, and how do we collect it in a representative way? And one of the things that we don't have here– like, not only are the models not open, but we don't have descriptions of the data sets that they were trained on. And I don't think the people who collected the data sets are in a position to even create those descriptions, because they're too big and collected too haphazardly. That's one thing. And the other thing is this dialect perturbation. So the Ziems et al paper– I took a look at it, it is pretty careful, pretty thoughtful– but their output seems to be a tool that this group is using to, quote unquote, convert text that was written in one variety into another variety, and then measure the performance on the synthetic converted text, and use that to say how fair is this towards speakers of that other variety– which, by the way, is a marginalized and highly stigmatized variety.

It just doesn't seem like it rises to the level of the claim that's being made within the paper and the blog post about testing for fairness across dialects.

ALEX HANNA: Yeah it does it does. 

JEREMY KAHN: It's also heavily English dependent.

ALEX HANNA: Yeah, I mean, the English– yeah. And I think, in some of our prior notes, um, Jeremy, you had applauded them for acknowledging the Bender Rule– as in, does this thing work in languages other than English? And then you said, oh no, they Bender Ruled themselves– and if we could put a gif in here, it's the DJ Khaled "you played yourself" here. And so, you know, they acknowledge it and also know it.

I also find this to be really, um– I hadn't seen the Ziems paper and this translational thing, but this just strikes me as the linguistic equivalent of taking a stock white model and then sort of racializing them Black with–

EMILY M. BENDER: Digital blackface, totally. 

ALEX HANNA: Yeah, right? And so– well, I acknowledge that the authors of the Ziems paper are doing this to assess– to sort of come up with a kind of benchmark of AAVE, African-American Vernacular English. I mean, the idea of kind of style transferring it onto a text that has no business sort of being in that language– so, for instance, uh, news text, or, you know, something that's not the intent of the use of AAVE. I mean, this is meant to be, uh, used in dialogue, or, you know, as a tool of subversion in other kinds of arenas. So I found that– just in reading this, it really is very bizarre to me, so I–

EMILY M. BENDER: Yeah. 

ALEX HANNA: Yeah. 

JEREMY KAHN: And as much as I'm enthusiastic about the resurgence of old school linguistics tools, I want to say this is a fairly crude version of that. These are 2022 versions of these tools. I went and looked at some of the other ones– like, synonym perturbation is one of the options they offer here. And they just basically download WordNet, and then use it.

Which is fine for certain classes of English, but it's not really, um, an answer to, like, what kinds of dialect indirections people actually use. Not everybody goes to the thesaurus for everything they want when they want to say things a different way.
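(A minimal sketch of the kind of WordNet-based synonym swap Jeremy is characterizing: look the word up in the thesaurus and take the first synonym you find, with no sense disambiguation. This assumes NLTK with the WordNet data downloaded; it is an illustration, not the project's actual code.)

```python
# "Download WordNet, and then use it": first-thesaurus-hit synonym swap.
# Requires: pip install nltk; then nltk.download("wordnet") once.
from nltk.corpus import wordnet as wn

def synonym_for(word: str) -> str:
    for synset in wn.synsets(word):          # every sense of the word
        for lemma in synset.lemmas():
            candidate = lemma.name().replace("_", " ")
            if candidate.lower() != word.lower():
                return candidate             # first hit wins, sense unchecked
    return word  # no synonym found; leave the word alone

print([synonym_for(w) for w in ["quick", "answer", "model"]])
```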

There are a lot of other kinds of perturbations that people genuinely make, and so this is a fairly crude resurgence of the old school linguistics techniques. So in some ways I'm reminded of those videos we've all seen, of a white person putting their hand under the soap dispenser and the soap comes out, and then the Black person puts his hand under and it doesn't come out, and he has to go and get a napkin and, like, put it on the napkin, because it's white.

Um, but this is the other way around, where it's like a white person who goes and sticks his hand in, like, you know, a bucket of tar, to show that it doesn't work anymore. Which is, like, very blackface, in terms of, like– it's not actually the properties of a Black person's skin. It's just like you painted it over with the black face paint, you know. And this is not actually testing for those things.

And I want to add– back to my grumbling about this sort of massive tensor– if you were going to swap out the scenarios so that you can move to another language, you would be able, in principle, to do all of this stuff with another language. Which is great, except that most of your metrics are themselves language dependent. So now you have to swap out your metrics too, because you can't do this pronoun perturbation in German.

ALEX HANNA: Right.

JEREMY KAHN: Not to mention in Swahili– like, it's just not going to work the same way.

So if you wanted to do those perturbations, you'd have to do a lot more work, again in the old school linguistic style. You're not able to just take this model– as much as they acknowledge that, oh, somebody will expand this to other languages.

But in order to do that, you'd have to expand all the metrics in the same way–

EMILY M. BENDER: Yeah.

JEREMY KAHN: –and that's a lot of work. 

ALEX HANNA: I would also say that– and I'm going to make a claim, and I think it's right, but feel free to say if it just makes no sense– I think some of these tasks themselves are probably not going to work in every language. I mean, the notion of question answering– I imagine the mechanics of that differ depending on the language, and I'm going to defer to Emily here– but I would say that, you know, the notion may shift across languages. I mean, I don't know.

EMILY M. BENDER: I guess what I could imagine happening there is, um, epistemologies vary cross-culturally. And what does it mean to have and share information? And I think that the task of question answering is very much rooted in our notions of what information is– and they're pretty reductive notions of what information is– and this relates to a lot of the current discourse about ChatGPT.

Um, so I could imagine that you might present such a task, try to translate it into some other language, and have a person going, but what's that for? And who is this thing? And how do they have the right to that information, and to decide who gets that information?

Um, so yeah. I want to make sure we have time for the Fresh AI Hell segment.

ALEX HANNA: Yeah yeah. 

EMILY M. BENDER: But there's one more thing that I wanted to comment on in this blog post. 

Which it does– that's why I was searching for "surface" before. Um, as careful as this work is, and as sort of laudable as it is in its intention, in coming up with, you know, broad scale evaluation for broad scale things, it is also very much written from a point of view of, um, AI boosterism and language model boosterism.

And I just wanted to highlight that, um, with this language here: "Given language models' vast surface of capabilities..." Um, someone who looks at one of these things and sees a surface of capabilities scares me a little bit, because it is so connected to this fact that, because language models can output coherent-seeming text on any topic, it is tempting to use them for lots of different things, and I think in most cases it's going to be a bad task-tech fit. Um, so, just wanted to say that again. But actually, we've got a great segue here to our Fresh AI Hell. Alex, you want to do the song that you did two episodes ago?

ALEX HANNA: Uh, I forgot what it was. "Fresh AI Hell, Fresh AI Hell, what have the AI scientists done today? Fresh AI Hell, Fresh AI Hell, you'll never guess what kinds of things await in Fresh AI Hell!" I just came up with that.

JEREMY KAHN: That's awesome. 

EMILY M. BENDER: That is brilliant. I don't know that I shared the right screen.

ALEX HANNA: No you did it. 

JEREMY KAHN: That looks right. 

EMILY M. BENDER: Great. 

JEREMY KAHN: It had the Shudu. 

EMILY M. BENDER: Okay for some reason I'm not seeing that one. Okay. 

JEREMY KAHN: It's showing up for me.

EMILY M. BENDER: All right here's the Fresh AI Hell. Speaking of digital blackface. 

ALEX HANNA: Oh yeah, I didn't intend to actually segue into this– yeah, the Shudu, oh gosh. Yeah, so this is "the first AI model," and, not surprisingly, it's a very dark-skinned Black woman, um.

EMILY M. BENDER: Except she's not. She's an image that's intended to look like a very dark-skinned Black woman. 

ALEX HANNA: Yes yes and I mean–

EMILY M. BENDER: But there's no person here. 

ALEX HANNA: Yes, and so this is, um, Fresh AI Hell, and, um, this is really, um, classic– there's so much here, too, especially with the AI models that have been developed that have been beauty characterizers, that tend to highly rank lighter-skinned women as more beautiful– as the notion of beauty always has been encapsulated as demonstrating whiteness.

There's just, like– I don't know– there's such a perverse perversity in making this AI model this very dark-skinned Black woman, uh, or appearing to be a Black woman. And this is sort of the best picture we saw in this, too, because there's– scroll down, yeah– well, no, scroll down on this image– the weird hand, that weird hand.

EMILY M. BENDER: Oh yeah yeah. 

ALEX HANNA: yeah the the and the like so one hand seems to be you know kind of normal size but the other one kind of has these impossible you know like huge fingers with this very short index finger comparatively. And so you already see that there's all these artifacts coming out yeah and– 

EMILY M. BENDER: Yeah. 

ALEX HANNA: Yeah go to the rest of the images because they're very disconcerting yeah. 

EMILY M. BENDER: It's all uncanny valley and looks like a Barbie doll, and the thing that's– so, as the Tweet had it there, you know, this is not, you know, Black women making money in some way– it's white male creators making money off of this. But also, it's not even Black women getting to determine their own contribution, their own framing of what is beauty. It's, like, completely external and objectifying, and then on top of that, like, weird uncanny valley stuff. So.

ALEX HANNA: I know, it's just– it's very unsettling. This, um, user– or, the person, the account– has 240,000 followers, and so it's just– do we know more about this? I mean, the firm that's created this, the, uh, the digital–

EMILY M. BENDER: Does the FAQ tell us anything? 

ALEX HANNA: Oh, sorry, go to the website, the d-i-i-g– yeah, diigitals– is it a– it's a– is it a digital–? What is this? Oh, is it a whole–? Oh dear. Oh my gosh.

JEREMY KAHN: Speaking of exotification. 

ALEX HANNA: Cacao everywhere. 

JEREMY KAHN: Yeah. 

EMILY M. BENDER: Yeah. 

ALEX HANNA: Uh wow. 

JEREMY KAHN: At least that hand looked like a hand. Oh look they have other aliens. 

ALEX HANNA: They have an alien model. They have an Avatar-ass looking model. 

EMILY M. BENDER: Next! 

ALEX HANNA: Go to about. I want to see who these mother flippers are. Like.

EMILY M. BENDER: Okay that looks like pictures of actual people there um.  

ALEX HANNA: Wait, wait, but go to the top– they had a Down– an AI Down Syndrome model? Oh my gosh, did any people with Down Syndrome– were they near this?

JEREMY KAHN: Oh dear. 

ALEX HANNA: Oh, so there's so much blackface, but you've also got the digital, like, able– disability– oh my.

EMILY M. BENDER: Appropriation yeah. 

JEREMY KAHN: And there's a digital Chinese man too just for fun um. 

EMILY M. BENDER: Yeah all right. We're nauseating people here. Let's go on to the next thing.

ALEX HANNA: Okay. 

EMILY M. BENDER: Okay, um, how about using language models in legal contexts?

JEREMY KAHN: This is a bad idea in so many ways. The whole DoNotPay idea is already a bad idea– that is, having a robot be your lawyer. Um, yeah, but that is actually an automation, and a reflection of the places where money can solve problems– and, you know, lawyers are one of them. So it's not surprising that DoNotPay actually works as a sort of semi-fake legal guidance that you probably shouldn't trust– it's much harder to sue a big robot than it is to sue a lawyer who did you wrong.

Um, you probably shouldn't deal with the robot for this kind of thing unless it's low stakes, things like a parking ticket. But in this particular case, they're actually asking to add our own mess– a mess that we're all in here expressing our hell about– to your DoNotPay.

So can you imagine, some joker standing up in court, and it's talking to you in your ears, and you're saying the insane things that chatbots are trying to tell you?

ALEX HANNA: Oh my gosh.

EMILY M. BENDER: There's so many things wrong with this. So plenty of people pointed out you can't just walk into a court with AirPods in and not have that commented on. But also, this last sentence here:

EMILY M. BENDER: "We just want to experiment and we'll pay the ticket even if you lose."

That is such a cavalier approach to people's lives, right? When– if you have a speeding ticket, the fine for the ticket is not the only consequence of that, right? Your insurance rates could change, your credit rating could change. Like, you know, who knows what.

JEREMY KAHN: You could get thrown in jail for contempt of court, like this.

ALEX HANNA: Yeah, absolutely. Yeah, you have– I don't even know what kind of– I am obviously not a lawyer, and I don't want to pretend to be, but, I mean, yeah, you have many other legal consequences of even doing such a thing. Uh.

JEREMY KAHN: To his credit, he does follow up in this tweet with, uh, "We'll of course do a full compliance review–"

ALEX HANNA: Oh, sure, okay.

JEREMY KAHN: I'm sure, okay, okay. No thank you, I do not want to be your first subject, thank you.

EMILY M. BENDER: Yeah, okay, but these are not the only people thinking about the legal domain, or, sort of, you know, legal-ish. So here is a comment submitted, uh, to some sort of FTC public comments thing, and it is a cover letter and then an appendix containing 23, quote, "AI generated" comments, um, from the point of view of an AI bot. And it's like, we do not need your synthetic text mucking up our public discussion of regulation.

JEREMY KAHN: Now, the irony of this, of course, is that this is a request for comment about the regulation of AI. And– perhaps less surprising and less ironic– this guy who submitted these things is largely funded by the Koch brothers. Which speaks to who the people are who– I dropped this particular, uh, sort of stink bomb on Emily a couple weeks ago, and that's, I think, around the time she invited me on the show.

Um, this guy's– the comments are not unreasonable, but, uh, someone who started looking at these things described it over on Mastodon as, um, an automation– a denial of service attack on public comment. Which I think is actually one of the serious– one of the big risks about all of these, um, highly fluent language models: it becomes impossible to detect. I'm not so much worried about the whole plagiarism angle as I am unhappy about the possibility of– as Carl Bergstrom called it– how hard it is to debunk one of these things when they're wrong. Somebody has to go respond to all 23 of these comments now.

ALEX HANNA: Right, like– well, that's, uh, Jeff Kao, who's now at ProPublica. You know, he did this interrogation of one of these comment periods for– I forgot which rule making process it was. It was less sophisticated than a language model generating these things, but he had found this sort of pattern of substitution of certain kinds of phrases and words, and found that there was a huge amount– I think maybe about a third of these comments– that had been automatically generated. So yeah, denial of service attack on these kinds of comment periods is right. Um, and I bet this guy thought that he was being a bit cute doing it, supporting a comment period on the regulation of AI, but you're just kind of proving the point, man.

EMILY M. BENDER: Yeah, all right, so– Next! Which is, um, here's a startup funded by OpenAI using AI to answer legal questions. Uh, and we could probably dig into this one in more detail, but just sort of briefly: "Copilot for lawyers." And I think by 2022, haven't we learned that "x but for y" is a bad idea? Right?

ALEX HANNA: Oh, I mean, that's– yeah, I mean, that's very– you know, you'd think we'd have learned that by now. Uber for blah. ImageNet for blah. Like, all these things that are– no, that's not a good thing.

EMILY M. BENDER: No, no, no, no. And so this is: "Our product provides lawyers with a natural language interface for their existing legal workflows. Instead of manually editing legal documents or performing legal research, Harvey enables lawyers to describe the tasks they wish to accomplish in simple instructions and receive the generated result." Now, it's possible that there's some use case somewhere where, um, it is easier to do that and then verify the thing– and you can reliably verify it– than whatever the research is that you would need to do without this tool.

But I kind of doubt it, and some of their others–

JEREMY KAHN: That's called paralegals. That's right, that's what paralegals do.

ALEX HANNA: Yeah. 

JEREMY KAHN: This is an attempt to automate paralegals. And it's not– I mean, we already have a problem that some fairly senior lawyers just automate things by passing them to a paralegal, and they don't bother reading the text afterwards, and then they just send it on, because that's how a lot of people get work done.

That doesn't mean that it's safe. They're supposed to vet it afterwards– you know, officially, the lawyers are supposed to read what the paralegals produce, and then say okay, and then sign it. And then the lawyer is responsible for it. This is going to have the same problem, except the paralegals aren't even in the mix.

ALEX HANNA: Yeah, well, but what's probably going to happen is that they're going to hand this to paralegals– they're going to have to chase all these problems that these things are creating, right? And so, you know, if these things are making up case law, or making up California code, um, then, you know, this paralegal still has to probably verify all these references, and has to, you know, confirm whether this is in paragraph 323 rather than paragraph 321, you know.

EMILY M. BENDER: Yeah. 

ALEX HANNA: And I mean and so it's gonna I mean I I just see it introducing a whole new class of errors that then um some other class of humans has to you know run down. And this is um this is kind of my big um jibe with with some of these people who are saying like Galactica which we talked about last time and you know it's you're taking a process that is well known and you were supplanting the work of an individual with having to deal with the output of some language model that you don't know how this tool predicts in any kind of predictable way. There was someone uh on on that we were talking about in the group chat Yoav um who had said something like I'm going to require that my students begin with ChatGPT which we didn't even talk about yet. 

EMILY M. BENDER: Maybe next time. 

ALEX HANNA: Uh, and maybe next time– and maybe next time, you know: "I will require my students to generate papers with this and then start and edit from there." And I'm thinking– and then some other ideation of, wouldn't it be great if we had humanities scholars, you know, start from this– and then, go back– no! Because then you are making that student or that scholar become subservient to the technology and fix its errors.

You're then the caretaker, the babysitter of this technology. That's not an actionable skill that goes forward. I don't like babysitting Microsoft Word. I don't like babysitting Zotero. I don't like babysitting LaTeX, holy crap. And if I have to babysit ChatGPT, that's just another thing that's going to prevent me from learning and actually doing the work, and what it means to be a scholar or a scientist or a social scientist. And in what world is that– you know– so now you're making paralegals babysit this tool.

EMILY M. BENDER: And doing it in a way where they, or then the lawyer, eventually becomes responsible for its output. So: "Tell me if this clause in a lease is in violation of California law, and if so, rewrite it so it is no longer in violation." I can imagine the system might find the violation, might rephrase away from it, might even return, like, which part of the code it was violating– but then land on something else that is also in violation of a different part of the code.

Um, but it looks good, because it looks fixed, right? And– just, yeah. I want to surface something that Dylan's saying in the chat here: "It makes me think of Emily's note about the point of high schoolers writing essays isn't to keep the world's supply of essays topped up." And that's from my blog post in response to the New York Times piece. Um, "Isn't part of being a paralegal also for the benefit of the paralegal's development?" Which I think is exactly what you're saying there, Alex– instead of spending time babysitting this technology, um, they should actually get to learn stuff. All right, two minutes left, I got a couple others– we're not gonna get through all this, but we have to.

ALEX HANNA: No– the first– I think this is the first appearance of a Sam Altman tweet on this series, which is, uh, highly cursed.

EMILY M. BENDER: Yeah, and just to say that what this tells me is that there are some people who are so wedded to the idea that their synthetic text bullshit generating machines are somehow on the path to being human, that they have to devalue what it means to be human so that it looks comparable.

ALEX HANNA: Yeah, yeah, go ahead, Jeremy, I–

JEREMY KAHN: I mean, congratulations, Emily, on having a citation from Sam Altman, uh, by the way. Um.

EMILY M. BENDER: Someone said he coined it on Twitter and I wrote back and said no he did not.

ALEX HANNA: Or condolences– 

JEREMY KAHN: Yeah, condolences is more in order. But, um, I actually think– I said this in our pre-roll– that there's a piece in common here with the effective altruism folks that you guys were talking about last week, uh, and earlier in this series. And I think there is something useful here about what Altman is really trying to say, which is: we know that you're all motivated by a single reward function. And his reward function is capitalism. It's very clear. But his–

But the rest of us– maybe we don't actually agree that we are motivated by a single reward function in our actions. His is fundamentally a behaviorist angle, which is closely aligned with the utilitarianism of the EA crowd, um, and the idea that there is a single reward function that we should all maximize. He believes we're all paperclip maximizers, and that's what he's really saying here: he's saying, I am a paperclip maximizer, and so are you. Which is, like– actually, I choose not to make paperclips. I would prefer not to.

EMILY M. BENDER: Yeah, yeah– instead you come join us and, um, make Twitch streams, which we thank you very much for.

ALEX HANNA: Yeah, thanks so much for joining us, Jeremy, this was super fun. Uh, I think this is our last episode of the year. So, yeah, join us in 2023, when our production values are going to skyrocket, and we're going to have songs and graphics and, you know, dance.

EMILY M. BENDER: We already have songs. You sang for us Alex. 

ALEX HANNA: Oh yeah we'll have even more songs. 

EMILY M. BENDER: Yes.

ALEX HANNA: All right all! 

EMILY M. BENDER: Thank you so much Jeremy. 

JEREMY KAHN: Thank you thank you for having me on. 

ALEX: That’s it for this week! Thanks to Dr. Jeremy Kahn for helping us bring the critique.

Our theme song is by Toby MEN-en. Graphic design by Naomi Pleasure-Park. Production by Christie Taylor. And thanks, as always, to the Distributed AI Research Institute. If you like this show, you can support us by rating and reviewing us on Apple Podcasts and Spotify. And by donating to DAIR at dair-institute.org. That’s D-A-I-R, hyphen, institute dot org. 

EMILY: Find us and all our past episodes on PeerTube, and wherever you get your podcasts! You can watch and comment on the show while it’s happening LIVE on our Twitch stream: that’s Twitch dot TV slash DAIR underscore Institute…again that’s D-A-I-R underscore Institute.

I’m Emily M. Bender.


ALEX: And I’m Alex Hanna. Stay out of AI hell, y’all.