Across Acoustics
Should AI tell you how to talk?
With the development of automatic speech recognition has come a new type of technology, designed to give the user advice on how to speak better. In this episode, we talk with Nicole Holliday (University of California, Berkeley) about some of the issues that can arise with the use of these technologies, from their nebulous definitions of "good communication" to the impact they could have at businesses that use these technologies to evaluate employees.
Associated paper: Nicole R. Holliday. "Socially prescriptive speech technologies: Linguistic, technical, and ethical issues." J. Acoust. Soc. Am. 158, 4361–4369 (2025). https://doi.org/10.1121/10.0039685.
Read more from The Journal of the Acoustical Society of America (JASA).
Learn more about Acoustical Society of America Publications.
Music Credit: Min 2019 by minwbu from Pixabay.
ASA Publications (00:26)
It seems like AI is showing up everywhere these days. Now, not only can it recognize what you're saying, it can give you advice on how to improve how you communicate. But, as with many new technologies, there are some major caveats. Today, I'm talking with Nicole Holliday about some of the issues that arise with these technologies, which she wrote about in the recent JASA forum piece, “Socially prescriptive speech technologies: Linguistic, technical, and ethical issues.” Thanks for taking the time to speak with me today, Nicole. How are you?
Nicole Holliday (00:52)
Thank you so much, Kat. It's great to be here. I'm good. I'm excited to talk about this dystopian technology.
ASA Publications (00:59)
I know, right? Who doesn't love a good dystopian technology discussion. So, first tell us a bit about your research background.
Nicole Holliday (01:06)
Yeah, so my title is Acting Associate Professor of Linguistics at UC Berkeley. I'm a sociolinguist, but really a sociophonetician, which means that my research is at the intersection of language and society, but also the speech signal itself, so sounds, and sort of how we make social meaning out of them. Obviously, I wasn't going to grad school to study AI. I got my PhD in 2016, so we weren't even thinking about that yet.
So my earlier research really focused a lot on the idea of performing and constructing a social identity through kind of micro-level linguistic variation. So this is where the phonetics comes in, right? So we're thinking about the tone of voice, the quality of the voice, whether it's like breathy like this or creaky like this, and all of these things have social meaning. So when you hear somebody even just say a word like “hello,” you build a sort of social identity for them based on your previous experience of the world. And that is really interesting, fascinating sort of sociologically, but also it has implications for how we move about the world, right? So because when we make these judgments and we attach social ideas to them, that opens the door for there to be things like power differentials, stereotypes, discrimination, right? So that's sort of the practical implication of this line of research. And a lot of my previous work focused on how people are able to get a picture of someone's, not only kind of broader social identity, but their race, their age, their gender, right? Like what is all of the social information that we're bringing, when we hear very small kind of microvariations in pitch.
ASA Publications (02:39)
That's so interesting. It's like, there’s so much embedded, or there's so much that people can assume based on tiny bits of acoustic information, I guess, basically.
Nicole Holliday (02:49)
Yeah, I'm teaching sociophonetics this semester, which is my favorite class to teach. And I asked the students the other day, I was like, what if somebody releases all of their T's and D's? So what if they say like, "Oh, I write with my lefT hanD." Like, what kind of person is that? And they all were like, "That's pretentious." And I was like, "Yeah, but did anyone teach you that? Why do you think that?" It's just about how you produce the letter T or the sound T or the sound D. And then you've attached a whole personality to it. We do this, and a lot of times we're uncomfortable with the idea that we do this. It's not bad or good, it just is. It's problematic when we jump to stereotypes or discrimination based on, you know, what that T sounded like.
ASA Publications (03:35)
That totally makes sense. Okay, so to jump into our dystopian AI future. So what are socially prescriptive speech technologies and why do people use them?
Nicole Holliday (03:45)
Yeah, so I made up this term. I'm still sort of working on it. I've also thought about calling them “evaluative speech AI.” But I think basically that's the thing, right? They are technologies that are designed to tell you how you sound to other people or prescribe, sort of instruct you, how to speak in an idealized way.
This is a limitation, right? Because what is ideal for any particular situation is context dependent, it's speaker dependent, it depends on what your goals are, right? And so just at the face of it, we already have a challenge for these technologies, which is that speech and human interaction are super dynamic, and systems are not necessarily able to be as dynamic as humans are, sort of intaking all of the real-time information that we do in a sociological context. So the point of these technologies, or the way that they're advertised is, you, person, individual, consumer, think that there's something wrong with the way that you talk. You think you talk too fast. You think that you say too many ums, whatever that is, right? There's something you're uncomfortable with about your speech. This technology is going to give you feedback on that.
The bigger issue to my mind is that it's not actually typically consumer facing where we see these technologies being primarily implemented. They are advertised to management to sort of police their employees for idealized speech. So it's not you saying, "Oh, I'd like to talk slower." I mean, maybe you could, right? It's your boss saying, "This technology says you talk too fast. Fix it." And then that has implications for you in your workplace, or if you're a student, or whatever. It's sort of a very top-down kind of process that these technologies use to tell you that you're not talking right.
ASA Publications (05:33)
Right. So what are some of the issues that come up with automatic speech recognition technologies in general? And what are the issues that arise with socially prescriptive speech technologies in particular?
Nicole Holliday (05:44)
Yeah, so I think we have to start with sort of understanding how these technologies work. And a lot of people have no idea how large language models work, or specifically with speech AI. So hopefully this will be a little bit demystifying.
When you are speaking, the technology is running an automatic speech recognition system that's taking all of the words that you're using and then sort of auto-transcribing them, right? This is already a challenge for people that use a lot of linguistic variation. So every LLM does better when it has more training data that looks like its use case data. So, if you have a large language model that was totally trained on speech from 30-year-old white men who live in San Francisco, then that automatic speech recognition is going to perform best when it hears a 30-year-old white man from San Francisco. That same technology, if it's hearing my 83-year-old black grandma from Ohio, might not do as well. And so this is what we see across the board.
There are thousands of papers about how automatic speech recognition underperforms for people from marginalized backgrounds, in part because they're not represented in the training data, but also in part because they're not seen as an idealized consumer. Who do the companies imagine is using this technology? It's not my grandma. It's the Bay Area guy, right? And so they're not incentivized to make them work for more people. And it's literally a technical problem. When these systems are confronted with more variation, that's more paths that they have to learn to deal with different types of things that can change in the speech. So one issue with socially prescriptive speech technologies is already this underlying automatic speech recognition problem, this ASR bias, where it might not even be adequately transcribing, you know, my grandma's speech the same as it would for a Bay Area white guy. So we start there, right? It might be misrecognizing the speech.
But even say that it's not, automatic speech recognition has improved across the board as of late, and it continues to improve. So maybe even that's not the problem, right? It works pretty well for me. There are culturally specific communication patterns that these systems are not necessarily taking into account, right? So the reason that I call them "socially prescriptive" is that they uphold an idealized version of speech. So very simple example, they look at a lot of different parameters. They'll give you scores on things like engagement, sentiment, charisma, bias, whether you're talking at one person more than another person. There's lots and lots of things that they'll give you scores on. One of them that's kind of easy to understand is speech rate.
Okay, so what happens with speech rate? People are afraid that they talk too fast or afraid that they talk too slow. If you and I are in a conversation, and it's a successful communication, we should actually be converging towards each other. So if I started fast, for me, right at 200 words per minute, and you started at 150 words per minute, if we're successfully communicating, we're gonna naturally converge to 175. So you'll come up a little bit and I'll come down a little bit.
These technologies are not necessarily dynamically adapting to that. So it's going to tell me, even if I go from 200 to 175 words per minute, I'm still talking too fast, even though now you've come up to match me and we're having a successful communication. The system says, "Your ideal range should be X." (And I'm going to say X because it doesn't even actually specify an ideal range.) But say that it idealizes 150. If I'm starting at 200 and coming down to 175, and that's actually appropriate, it's still gonna tell me I'm talking too fast because it wants 150, because that's how it was programmed. It's not taking into account any of the context of the person I'm talking to, what they're doing, and what I'm doing.
There's another issue here, which is our speech rate is constrained by what we're talking about. So if we're talking about something super straightforward, we're gonna talk really fast. We assume that our interlocutor, the person we're talking to, is following us because it's just like I'm telling you what I had for breakfast or whatever. If I'm telling you about some really complex linguistic concept, I'm going to talk slower to give you time to catch up. And so a system that is not taking into account the interaction between you and I, the context of our speech and also what we're talking about, and the way that we're converging over time, is actually not modeling successful communication the same way that humans do in our own interactions.
ASA Publications (10:11)
And it would be really, really weird if like your grandma started talking like a white guy from San Francisco.
Nicole Holliday (10:17)
Right, exactly. Also, this is another point that you're bringing up too, right? When we have repeated conversations with the same person, we adjust our expectations for how they're supposed to sound. In general, we expect that older people talk a little bit slower, and we give them time to do that. It's going to sound a little uncanny if my grandma's getting all of this feedback from this technology saying, you're talking too slow. Leave her alone, right? And this is one thing that really does personally bother me about this technology, which is there's no space for the idiolect, for the individual person's way of speaking. There's only space to conform to the ideal that the model was built to judge.
ASA Publications (11:01)
So you're missing out on personality or, not even just personality, like if a person for some reason needs to speak a different way, you know? So what was the goal for your work here?
Nicole Holliday (11:12)
Yeah, I have some experimental work on these technologies, but I have a theoretical kind of problem with them, which you can already tell from the way that I've been talking about them. Like, I have experimental work that's under review right now showing that indeed they are biased. They give lower scores for things like sentiment and engagement to black speakers and sometimes to people who speak English as a second language. Which is unsurprising given that they're targeting sort of an idealized model, right? They were certainly not trained on the same number of people who are second language English speakers as first language English speakers. That's not necessarily good business for the companies putting these out. So I already know that to be true from my experimental work.
But the point of this piece was to say there's actually a theoretical problem, right? Even if the systems aren't kind of systematically biased against a particular racial group, or against a particular language group, there's still a danger here in the way that they can be weaponized against workers and against students, but also in the way that they are, by design, supposed to homogenize our speech. For sociolinguists, we're interested in variation. Variation contains social information. Variation tells us about our world. Variation contains the history and culture of entire groups of people. Right? We don't want to get rid of variation. So technologies that say there is one and only one right way to speak are antithetical to the project of language and society. Right? They are constraining because they're trying to get everybody to sound exactly like their idealized model.
And not for nothing, who is the idealized model based on? It's not me. It's not my grandma, right? They're never gonna say this, right? Because these are large language models. And so in fact, if you ask people who program these systems, they are usually not aware of all of the input data, right? So if we take something like ChatGPT, we know that it was trained on Reddit. Reddit is huge, right? Who are the humans on Reddit? Yes, they are diverse in age and race and gender in some ways, but we can't exactly tell you how. We can't say it was 50% women in that training data. We don't know anything about those speakers. And so I think that this is not something malicious on the part of people that are designing SPSTs. They don't know who's in the data. But because of what we've seen for issues with other large language models, we know that there tends to be an over-representation of people from groups that contain hegemonic power. So this is why I don't think that it's designed keeping my grandma in mind.
ASA Publications (13:47)
Right, right. So let's go more into these, like the metrics and models that are used by the SPST systems. How exactly do they score speakers?
Nicole Holliday (13:56)
Yeah, so in this paper I talk about two different systems that I'm classifying as socially prescriptive speech technologies. One of them is called the Zoom Revenue Accelerator, ZRA. If you're on a Zoom call, a corporate Zoom call, this could be running in the background, and your manager could be getting your scores and you would never know. The more commercial Zoom version now has AI Companion. And AI Companion doesn't quite do this yet, but it could. All of those meeting summaries and stuff are sort of a step in the direction of this. The next step is the sort of speech feedback coaching thing that you see in the ZRA. The other system is called Read AI, and you can just go buy that, download it. It integrates with Microsoft Teams, so if you use Teams, it's there, but you can also get it for yourself, pay for it. I don't know what the cost is, but it's not too much. Now and then it will add itself to your Zoom calls and then give you this kind of feedback. So the type of things that it gives you: I'll talk about Read just to keep it simple, because they're kind of similar. Read AI gives you an overall kind of "goodness" score, just an overall meeting score. That's confusing. It gives you a score for sentiment. It gives you a score for engagement, and those are sort of the top-line ones.
And what's a little confusing about this is you might be thinking in your mind, like, “What is engagement?” And they give you a definition, but it still feels kind of, you know, kind of, Ohhh… So what Read says about this is, “Engagement is a measure of attendees’ level of involvement and interest in a meeting.” It's like, okay, is the other person paying attention and interacting, I guess is the question. So then when you go back to their documentation and you say, okay, well, how is this measured? They say, “We measure engagement through a combination of facial and verbal cues, as well as the talk time of attendees throughout the meeting to assess whether engagement levels are high, medium, or low in real time.”
So if I, um, ah-- People listening out there, Kat and I are doing this on a video call, right? If I were running Read AI in the background of this conversation with Kat, it would tell me that Kat is not engaged at all. Why? Because I'm talking a lot. But it's appropriate for me to be talking a lot because it's an interview about my paper.
ASA Publications (16:10)
Right, right, exactly.
Nicole Holliday (16:12)
And so it's not that Kat is not engaged, right? I think you're engaged. You're not talking.
ASA Publications (16:18)
I feel engaged.
Nicole Holliday (16:19)
Thank you. You're just not talking as much because that's not the format of this particular call.
Another thing that could be a complication here that I worry about a lot is it says that it's looking at facial and verbal cues. Okay. We have other research showing that laptop eye tracking is not particularly reliable. So if it's trying to see if your eyes are looking at the camera, a lot of times when people are on Zoom calls, their eyes are not looking at the camera. Maybe they're taking notes. Maybe they're multitasking. Maybe they're writing on paper. This is particularly a concern for neurodivergent people, who might not do sort of standard ways of eye contact or who might be using a fidget or something else, right? Because that manages their neurodivergence in this kind of difficult call environment.
Okay, so here's the problem. If you're a person who is not looking at your camera for whatever reason, I'm getting a low score on engagement because you're not looking at your camera because you're not jumping in. Now I'm being punished. So if my manager is looking at this to see if I'm doing a good job at work, they're gonna say, “Well, nobody's ever engaged on calls with Nicole.” Well, maybe they're just not looking at the camera.
ASA Publications (17:29)
Right, right, yeah.
Nicole Holliday (17:30)
Who knows?
ASA Publications (17:31)
Now, when I was thinking in terms of, like, talking about students using this, if a student is talking to other students, you know, they're not necessarily gonna be engaged. You never know how engaged your audience is gonna be. It's not necessarily the speaker's fault if the audience isn't engaged, either.
Nicole Holliday (17:49)
Exactly. Yeah. Also, like... Okay, so there's an even more basic question. Is your communicative effectiveness best measured by how engaged your audience is? Which I think is what you're getting at. Like, I could be an effective communicator, and, man, everybody's tired today. Like, whatever, you know? Anything could be going on. That's not a direct metric of how good I am at my job, whatever my job might be.
ASA Publications (18:01)
Right, and that's also not something you can easily correct for, right? Like, how do I become more engaging? Okay, they aren't listening to me or they're not looking at the camera enough. What do you do to fix that even, you know?
Nicole Holliday (18:29)
Also, I've taught classes before where, man, everybody was there under duress. It's like a gen ed. They have to take it to graduate, whatever. And so, like, I could be tap dancing the whole time and being maximally engaging. These students are just not here for it sometimes. And that's not for lack of effort. I'm trying really hard.
ASA Publications (18:47)
Right, right, right.
Nicole Holliday (18:50)
There's also the one with sentiment. They say it measures how positively or negatively attendees are feeling throughout a meeting. Now, first of all, how do we measure feelings from looking at somebody on a Zoom call? But it again says that it analyzes facial and verbal cues from attendees to examine whether they're expressing positive, neutral, or negative reactions. Okay, so now we're looking at culturally specific facial expressions that are not necessarily easily measured by a laptop camera either. And again, not something that we know is the same for all people, based on their cultural background, based on neurodivergence, based on all that kind of stuff. The other thing that I've thought about is there are situations in which your sentiment should be negative, and that is communicative effectiveness. So if we're in a call, and you're a student or whatever, and you're really struggling, you know, failed a midterm or whatever, my sentiment towards you should not necessarily be, "Well, that's okay! You're gonna be great next time!"
Right? That’s a lot. It might be appropriate for me to say, “I know that that's really disappointing and that you're struggling,” and we might be having a heart to heart and you can tell me about what's going on in your other classes or why you didn't do well here or why you deserve another chance or whatever. That's gonna get a low sentiment score. But it doesn't mean that it's not an effective communication. It means that everybody's sad, because you failed a test. And it's okay, we're allowed to be sad in a conversation. It doesn't mean we're not talking well.
ASA Publications (20:25)
Right, right, right, totally. So you kind of got into this a little bit when you were talking about the Reddit stuff, but what are the limitations of using large language models for scoring and giving feedback?
Nicole Holliday (20:35)
Yeah, if we imagine that there is no interaction of language and culture, that there are no idiolects, no individually specific ways of talking, then this is a great idea. Right? If we think that culture doesn't matter, that individual differences don't matter, that there is some one way of speaking that is right all the time, then we can model that really easily.
The issue is that the amount of information that an LLM would need to have to do this effectively, to do this in the way that a human does this, is way more information than our current LLMs have. And especially when we're getting into the realm of talking about speech, right? Speech just takes up a lot more space. It's a lot more data than text. And so people are really amazed by, you know, ChatGPT or Claude or whatever. Yeah, it's got a lot more data. Speech is much harder to process than text. And so the training data sets for something like an SPST are gonna be smaller to start with. And so then you have a bigger risk of them not having a representative sample, not being culturally sensitive to these kinds of group-level differences, right?
There's also just the interpersonal, right? So if you and I have talked several times before, and it's a student-professor relationship, let's just say. If we are on a Zoom call, and this system is running, I can know something because of the last time I talked to you that this system does not have access to. So if you failed the test last week and now we need to have another conversation and you failed the test this week, that's something I know about the world that the system does not know about the world that impacts how I should interact with you appropriately.
We talk about this as a type of sociolinguistic competence. So when you learn a language, you don't just learn, okay, here are the sounds, here are the words, here's the syntax, how I put the words together. You learn how to use that language in the world. So super simple example. If you're an American English speaker and you're in a customer service interaction, you know, you're at a cafe, Starbucks, whatever, and the barista says, "How are you?" We all know that the answer to that question is not how you actually are. The answer to that question is "Fine," "Good," "How are you?" That's not real, right? It's not the meanings of the words that tell you that. That's you being a person in the world.
And so what worries me about these technologies is they don't actually have enough data to be able to model how we know how to interact in the world. Because that's way too much data for them to expect to have. They don't, they're not embodied, right? They're not seeing anything about your body language. They don't know the whole history of our interaction. They don't know the pedagogical theory I have for the class that I'm teaching that you're taking, nothing, right? But you and I, the humans involved, do have that. And so sort of reducing the effectiveness of our communication to these things that we can measure by looking at people's facial expressions, setting aside the culturally specific differences for a moment, we just erase so much of what humans know when we communicate with each other.
ASA Publications (23:54)
Right, right. Because we're not just using, ultimately, like this is about acoustics, but we're not just using acoustic cues, right, to judge things or like you said, facial expressions or any of these other things. There's context.
Nicole Holliday (24:07)
Well, this is what I was saying before, right? Like about the meaning of the T, the released T that my students jumped to calling pretentious. That could mean pretentious in some situations, but in that same question that I asked my class, there was one student that said, "Well, maybe it's not pretentious. Maybe it's just that English is their second language and their first language is a language where they release T." And I said, "Yeah, you know that. How do you know that? You know that because you're a person in the world who has spoken with thousands and thousands of other people in the world and created a model for what social meaning is." These systems just don't have access to that much information.
ASA Publications (24:42)
So another problem with SPSTs that you mentioned is the issue of idealized speech. What parameters are used for this ideal speech?
Nicole Holliday (24:51)
Yeah, so I told you before there's these top scores, like overall score, sentiment, engagement. There's some other ones that are a little bit more specific. So there's charisma, bias, talking pace, filler words, non-inclusive terms, interruptions, impact bias, impact charisma, and questions asked. And I'm not going to take apart all of them. But there's a few that I think are kind of illustrative, and they are ones that I unpack a little bit more in the paper.
So I'll give you an example. Zoom Revenue Accelerator talks about this kind of thing as pause or patience, and Read AI does it in a different way. But let's take this particular metric. Patience is how long the interval is between when you stop talking and the next person starts talking. So it's a pause duration kind of thing. We know from a lot of sociolinguistic research for like 60 years that pause duration is culturally specific.
So there's a pretty famous paper by the linguist Deborah Tannen where she is a Jewish New Yorker, and she has Thanksgiving at her house with some other Jewish New Yorkers and with some California WASPs, some white Anglo-Saxon Protestants. So everybody comes to her house. This is in the 70s. And she's recording because she's a sociolinguist, you know, kind of doing this ethnography type thing. And there's sort of a weird social interaction that she's capturing here. And everybody walks away from that Thanksgiving dinner thinking that everybody else is rude. And she's like, what's happening?
Okay, so she asks the Californians, like, “Why didn't you get along, you know, with other guests or whatever?” And they say, “Well, we couldn't get a word in edgewise. Like we couldn't, they never paused to let us talk. Like, you know, we just couldn't say anything. They just talked over us all the time.” She asked the Jewish New Yorkers, “Why did you think the Californians were rude?” “Well, they didn't volunteer any information. They didn't jump in the conversation. They didn't want to talk with us.”
Ohhhh… What happened? The Californians were waiting for a pause to start talking. And the New Yorkers tolerate a higher degree of conversational overlap. So the pauses were intolerable for the New Yorkers, so they just kept talking. And the Californians were like, but these people never stop talking. Neither of these is bad. They're just different ways of being. So when you have a technology that says, well, this is exactly how long you should pause to let someone else talk, that's culturally specific. If it's a group of Jewish New Yorkers that have this kind of communicative style, no one should ever pause. If you have like people like the California WASPs, it takes a while.
And we see this cross culturally. English speakers who learn Japanese frequently comment on this. Japanese speakers allow for very long pauses compared to English speakers. Pauses that are two, three, four times longer than what you expect for American English speakers. Because it's just a fact of the language. So if you are using this technology in a business context and your person that you're talking to is Japanese, maybe the appropriate pause length is longer than it would be if it were somebody that was from your same cultural background, if you're like a white American. So that's the kind of thing that makes me feel kind of uncomfortable here, right?
And that's in addition to the technological limitations. So I've been talking about, like, imagine that everything is working perfectly. How many times have we been on Zoom calls where there was just a little bit of lag? And we measure pauses in milliseconds. So what feels like a long pause might be, oh, their internet's just not perfect right at this moment.
ASA Publications (28:39)
So what does research actually say about group level differences regarding these parameters?
Nicole Holliday (28:44)
Yeah, so I gave you the example of pause length. Speech rate, as I mentioned before, also varies by what you're talking about, but also sometimes by age, right, as well as how comfortable you are with your interlocutor. This is also the same for fillers. So a lot of students, when I've taught about this technology or showed them this technology, have said, “Well, I know that I say too many ums, uhs, and likes.”
These are called filled pauses in the linguistic literature, fillers. They're not bad. I think in the public imagination, everybody's had it sort of beaten out of them by some high school debate coach or something, or their mean uncle, that you should never say like or um or uh! You should. It's actually pro-social, especially if you're talking about something that's difficult, to pause and say um and uh and like sometimes so that your listener can catch up with what you're saying.
ASA Publications (29:39)
Well, there you go. I like that.
Nicole Holliday (29:40)
Yeah. There's something really kind of interesting in the documentation. Hold on, let me see if I can find it. So the Zoom Revenue Accelerator talks about fillers and it says, it never tells us what words it's counting as fillers, by the way. So it says that there's a recommended range of filler words between 0.6 and three per minute. So the Zoom Revenue Accelerator will give you a good score if you use between 0.6 and three fillers per minute. Go ahead and count every minute. Make sure that you use no more than three. I'm sure that that's not cognitively demanding at all. So there actually is something a little pernicious about this that I haven't mentioned yet, which is if you're trying to please this algorithm, it makes you really self-conscious about your speech, which then can actually have the opposite effect. Right? If I'm trying to count how many pauses I'm using in a minute, it is taking attention away from my ability to do the successful communication or focus on the message of what I'm saying, because I'm only focused on the style. Right? Now my attention's away from the content.
But what we know from the linguistic literature is these types of filled pauses, the ums, uhs, and likes, have pragmatic functions, right? So they can show how the discourse is structured, so what you're doing. It can also allow you to repair yourself. So if you started to say something and you're like, wait, that's not quite right. It can be useful, we can do it on purpose. It's not necessarily a marker of disfluency. It's sometimes a way to hold the floor. So if you ask me a question and I'm not totally sure where I'm going with it, but I don't want you to interrupt me again, or I don't want you to think that I've paused too long, then I might say, well, that's a pretty good question, right? And we do this all the time. And I'm not saying that there are no people who rely on a lot of filled pauses in a way that's not necessarily communicatively ideal. Yeah, if you're using 50 per minute, that's probably going to register as a lot for most people. I'm not saying there's no situation in which you might want to use 50 a minute, but that's probably outside the realm of ideal for communicative effectiveness. But 3, 6, 10, 15, right? This is not something that is easily quantifiable, because it depends so much on what's happening. And in particular, the kind of people that get policed for using filled pauses are disproportionately young people. So my students are always like, yeah, my parents said I pause too much, or my teacher, or whatever. They're learning to do public speaking. They're learning to talk about complex things. We should let them say some ums and likes. It might be different.
ASA Publications (32:17)
Right, right, exactly.
Nicole Holliday (32:23)
It might be interpreted differently by me, the audience if it's one of my students versus whether it's one of my colleagues. And that's what I'm saying about sociolinguistic competence. Like, we know this, because we're people in the world, that we don't expect students to sound exactly the same as full professors. And there's just no space in here in these metrics for that kind of difference.
ASA Publications (32:41)
Right, totally. So clearly the parameters do not account for group-level variations in speech, as we have been discussing. What kind of biases have been showing up in the SPSTs' feedback as a result?
Nicole Holliday (32:54)
Yeah, so I mentioned that I have this other experimental work that shows that black speakers get lower scores for sentiment and engagement. I have an earlier paper with Paul Reed from the University of Alabama where we look at another technology that does this, an older one called the Amazon Halo. This paper came out earlier this year in PLOS One. It's a wearable device that, you know, tracks your steps and sleep and all that kind of stuff.
And they discontinued it now, but it had this tone of voice evaluation, and we found a pretty strong gender and race bias in its scores. Other researchers, Young et al. (2023), found the same thing with the gender bias. So the Halo would measure these kinds of things as well, and it would just always give the women lower scores. It would also give you adjectives. So it gave more uniquely negative adjectives for the black speakers and the women than it did for the white men. So the white men had no unique adjectives, probably because they were more well represented in the training data. But the descriptive feedback that it gave to our female speakers was more likely to say that they were anxious, irritated, insecure, these kinds of things.
ASA Publications (34:05)
Mmm.
Nicole Holliday (34:06)
Yeah, it sucks. So zoom out, right? To be clear, these technologies, the Zoom Revenue Accelerator and Read AI, do not do this. They don't give you these kinds of adjectives the way that the Halo did. And the Halo was discontinued for reasons probably including this. But I don't know that there aren't other SPSTs out there doing that type of thing. There probably are.
All right, so say you're a manager and you manage a team of 10 people, and you have an SPST that operates like the Halo did. It's telling you that your female employees sound anxious and irritated, and it's never saying that for your male employees. Okay, well then who gets dinged on their performance evaluations? Who doesn't get the promotion? Who doesn't get the raise? Who gets fired?
And if you zoom out just a little bit more because of the way that the law works, this is really frustrating. You're not allowed to discriminate against people on the basis of gender, right? That's illegal because of the Civil Rights Act of 1964.
But, if you say, well, no, our technology has these quote unquote objective metrics, these people are pausing too much, these people don't sound charismatic, they're getting lower scores on sentiment, and you fire the bottom half of your employees that have these low scores that were given by the SPSTs, you've fired your women. Or you've fired all your black people. You've fired everybody who speaks English as a second language, which is also covered, because national origin is a protected category under the Civil Rights Act as well, which means that you can't discriminate against people who speak English as a second language as long as it's not impeding their job, right? So this is bad. And if you were a nefarious employer who didn't want to have people who spoke English as a second language on your team, then the SPST gives you cover. You can say, "No, no, I wasn't discriminating based on national origin. Look at these seemingly objective scores showing that they're worse at communicating." That's the nightmare. That's the thing that keeps me up at night.
ASA Publications (36:18)
Right, that would be incredibly terrible. Yeah. What would SPST developers need to do to overcome these biases? Is it even feasible?
Nicole Holliday (36:27)
I think that, so first of all, I don't think we need this technology. I think that people should trust themselves that sociolinguistic variation is really complex in a way that doesn't need to be modeled by these systems. We are able to improve our own communication on our own, right? These systems, I don't think, provide feedback that is reliable and realistic and actionable. They're too limited.
That said, the smaller the task is that they're trying to do, the better they can perform. So if we have a model where they say, “This model was trained on customer service calls where every person was a white American in a bank,” right? That's gonna work better than these off-the-shelf ones that I've talked about already that are like, “We're gonna help you improve your calls.” Well, a call between a professor and a student is different than a call center interaction, and that's different than a real estate transaction, and that's… So what we're looking for, what is kind of ideal speech in each of those situations, varies so much that that creates an additional technical challenge. So one way to make these, I think, a little bit more reliable would be to make them domain specific and demographic specific. So if they were just to say like, “Look, this was not trained on people who speak English as a second language. So we can't stand behind the reliability of the scores for those kind of speakers. They're not in the model.” That would be great. They're never going to say that because they usually don't know who's in the training data. There's too many people in the training data.
ASA Publications (38:00)
Right, right.
Nicole Holliday (38:01)
But everybody that's looked at this post-hoc, right, the academics who are trying to deal with automatic speech recognition bias, can see in the bias who tends to not be represented in this training data. So I would say, like, if they really do want to overcome that, being domain specific is one thing, but also not over-promising. It's not going to necessarily make you a better speaker. If these technologies were marketed to say, “Hey, are there specific things that you personally want to work on in your speech?” Maybe you're a person who, for whatever reason, would like to talk slower. I'm a fast talker, but I don't care. Maybe I do care, right? If I'm a person that cares, if I've gotten complaints from students, maybe I care. I could say to this kind of system, “All right, I want to practice really understanding what it's like to talk at 160 words per minute instead of 200 words per minute.” And I can use that as a tool for myself without it being held against me in my workplace or used to discriminate against me or trying to homogenize my speech towards some kind of idealized standard.
I think particularly for people who are learning a second language, that's where these technologies could be a little bit useful. Not necessarily in, like, what they're doing today, right? Like charisma? Yuck. Like, don't measure my charisma! I hate this, right?
ASA Publications (39:19)
Yeah. Right. Right.
Nicole Holliday (39:21)
But if we have like that example where Japanese speakers tend to pause longer, and you need to learn that rhythm because you're learning Japanese, maybe there's some use there. But they're over-advertising, kind of over-promising, and doing this thing that is really not allowing for there to be any variation in people speaking their first language. And that's kind of the problem that I see.
ASA Publications (39:45)
Makes sense. So how do you foresee SPSTs impacting how people think about and use language?
Nicole Holliday (39:51)
A lot of people ask all the time, what is this doing to our language use? Like, what is TikTok doing to our language use? Unfortunately, it's really hard to catch change in progress at the beginning, right? We're usually looking back and saying, "Oh, the language changed because of this…" Really, because of how our brains work, we are much more influenced by the people that we interact with on an everyday basis, the people that we talk to. So I'm not panicked about this changing the way that everybody talks out in the world.
I am panicked about this kind of technology being weaponized against people to further entrench this thing that we call "standard language ideology." And standard language ideology is really pervasive in the United States. This is the idea that there is one right way to speak, and that if you don't speak that way, either it is a failure of your intelligence or it's a moral failing. You don't sound perfect, and so you're not smart, or you're lazy. But that "perfect," this idea of ideal speech, and this was true before SPSTs ever existed, says way more about social power than it does about the language itself. And this is true for every human being in the entire world.
This is my favorite story that I tell, maybe on every podcast I've ever been on. Day one of Intro to Linguistics, I'm going to tell you this story. So we get Martians, imagine, they come to Earth from Mars. Great, our alien friends! They're not here to harm us. They just want to know who's in power. So they can land anywhere on this Earth, anywhere, and ask two questions and understand the social hierarchy. Who speaks the best here, and who speaks the worst here? And the answer to who speaks the best is always the people that have the most power, usually the most money, usually the dominant gender, the dominant religion, the dominant, you know, everything else. And who speaks the worst? The people that don't have those things.
So these models entrench those ideologies in a sort of automated way that we can't negotiate with, because when we talk to other people, we can negotiate. You know, we can learn about them, we can change our minds, we can do all this kind of stuff. With these models, it's this way is the best, achieve this way, and you will be an effective communicator. This is a promise. This is the way that they're advertised. But that best way is not neutral. That best way is informed by the entire history of whatever place we're in. So in the United States, that best way is informed by the fact that we had slavery, by the fact that we have a lot of anti-immigrant sentiment, by the fact that women still make less in the workplace, right? Like by all of these things that we can objectively see because they're about money and power and they're all enacted in our ideologies about language.
ASA Publications (42:50)
Right. So do you have any other closing thoughts?
Nicole Holliday (42:53)
I think I was on my soapbox there, right? So, if you've made it this far into the podcast and haven't been like, "man, this woman is wild," I want to be really clear that I'm not against AI, like, at all, really. I'm really against AI pretending to do sociological things that it can't do, because it doesn't have sociological awareness. And right here, we're at the intersection of acoustics and social information, and sociological information is so complex that I don't think we should outsource it to tools that don't have the sort of richness of context and experience that humans do to be able to interpret language through a social lens.
ASA Publications (43:34)
AI is not part of society, therefore it cannot judge how we interact within society.
Nicole Holliday (43:40)
You said it better than I could have, right? Like it doesn't have... And the way that we talk is informed by our identity, by our social identity, by all of our experiences, right? Our parents, our teachers, the community we live in. These systems, these SPSTs and LLMs in general, don't have parents and teachers and communities that they live in. And this is why I'm saying they're sort of devoid of this rich context. So I don't know why we would outsource the responsibility for effective communication to systems that don't know what it's like to be a student, to be an employee, to be a parent, to be a kid, right? They just don't have that.
ASA Publications (44:21)
Well, I won't lie, it is a little scary to think about the potential repercussions of how these tools will show up and what they will do. It's a little concerning that we don't entirely know how they work and how little their prescriptions are based on actual research. Thank you again for this fascinating discussion and I wish you the best of luck in your continued research.
Nicole Holliday (44:42)
Thank you so much, Kat. I really appreciate the opportunity to tell everybody about this stuff.