RCSLT - Royal College of Speech and Language Therapists

What's happening at the juxtaposition of AI and speech & language therapy?

The Royal College of Speech and Language Therapists Season 6 Episode 2

Welcome to the fourth episode in our RCSLT AI series. In this episode we chat with Dr Richard Cave, Project Manager SLT at the MND Association and Consultant Speech and Language Therapist, about his work in the field of Artificial Intelligence. Richard has backgrounds in both computing and speech and language therapy, and he brings these together in his work on voice banking and AI, working with Google, the MND Association and, most recently, in his PhD research at UCL.

We cover:

  • Richard's journey with AI and speech and language therapy.
  • What are the opportunities with AI?
  • What are the things to be cautious about? How can we ensure people are not left out?
  • Why speech and language therapists need to be in this space.


Interviewees:

Dr Richard Cave, Project Manager SLT at the MND Association and Consultant Speech and Language Therapist.

Resources:
Centre for Digital Language Inclusion: https://www.cdl-inclusion.com/


RCSLT Artificial Intelligence resources: https://www.rcslt.org/members/delivering-quality-services/artificial-intelligence-resources/



Please be aware that the views expressed are those of the guests and not the RCSLT.

Please do take a few moments to respond to our podcast survey: uk.surveymonkey.com/r/LG5HC3R


Transcript Date: 17 February 2025

 

Speaker Key: 

HOST:                         JACQUES STRAUSS

RICHARD:                  RICHARD CAVE 

 

 

MUSIC PLAYS: 0:00:00-0:00:05 

 

HOST:                         0:00:05 Welcome to another RCSLT podcast. My name is Jacques Strauss. And this episode is part of our series in which we do a deeper dive into AI and what it means for speech and language therapy as a profession. 

 

Today, we are speaking to Richard Cave, whose working life started in IT before he made a career change to speech and language therapy and then pursued a PhD looking at the question of AI and SLT, which he’s just completed. Richard is also Director at the Centre for Digital Language Inclusion, an organisation that looks into how automatic speech recognition, or ASR, could benefit non-English speaking people in African countries. 

 

I started by asking Richard to introduce himself.

 

RICHARD:                  0:00:46 I’m Richard Cave. I’m a speech and language therapist. I’m a… now I’m a researcher at UCL and I work at the Global Disability Innovation Hub here at UCL, and I am a co-Director of the Centre for Digital Language Inclusion, which is all about AI-driven ways to help people with impaired speech be better understood in whatever language they choose to speak. 

 

HOST:                         0:01:19 Can you tell us how you became involved in AI? 

 

RICHARD:                  0:01:22 I was seconded to the MND Association to help them roll out voice banking nationally. And while I was there, I was copied in on an email from a Google employee volunteering at the MND Association. And she said, have you heard about this AI-driven speech recognition technology that Google is building for people whose speech is not easily understood? It’s going to caption their words live on a phone and it’s going to be free. And the person that she wrote to said, okay, thanks. And I wrote… It’s like, what was going on? What is this? And that changed my life, that email. Because what I did was I basically hassled and hassled, and eventually I got to meet the Google team over in California who were building this thing. And I kept in touch with them, and eventually I got hired by them part-time, and I worked with them for four years to help them build this free speech recognition technology that works on a phone. 

 

It was a very interesting place, because I wasn’t hired to agree with what they said about AI. I was hired to advocate and just point out the stuff which they didn’t know. And more than that, I was hired with this AI project to actually do co-design – that means actually bringing people living with MND or cerebral palsy or Parkinson’s or head and neck cancer, or all the other conditions where speech has changed, into the office, or onto Zoom calls, or actually often the Google employees would go and visit them. 

 

And so, the idea is that this incredibly talented, incredibly young team in Google who were doing the AI, who were building for our communities, we helped them to understand who those communities are. 

 

I mentioned the value of speech therapy and how others… It’s like, our knowledge to AI companies is worth gold. We know so much that those companies just don’t know. And we have the connections and the networks and the clinical skills; they don’t have those. And yet, at the same time they are building for our clients. They are building AI stuff and, actually, what they want is a successful project. And actually, what we want is stuff that’s useful for the people that we work with. So actually, we’re speaking slightly different languages. But one of the things is that for people that are not easy to understand, or where speech is changing as, unfortunately, almost always happens for people with motor neurone disease – not only do few people understand them, but they often say less. 

 

It’s a fact. The evidence is there that for people, as they advance with motor neurone disease or these other progressive conditions, social isolation is a thing. And actually it goes beyond that: often in advanced MND, sometimes the only social engagement people have outside the home is to see the doctor at the hospital. 

 

And for the longest time, I thought, there’s got to be something here. And so, with this speech recognition tool, I could see that for the people I’m working with – who are feeling socially isolated, who people don’t understand as much, but who want to keep talking – this tool could fit that need in a way that works for them. 

 

And that was why I suddenly got super interested in this AI stuff. It just so happens that the team at Google didn’t really know that much about MND. And so, the very first thing I did was, with permission, I drove them around London and they met my clients, and they had a conversation. And it’s just like, this is who you’re writing for. And so that was how I got very interested. And once I got into Google there were, like, ten other projects they were all working on, with all sorts of interesting stuff. 

 

I was honoured to meet the first person in a northern town in Ghana… the first person in her family to go to university, and the first person with cerebral palsy accepted at that university. And these kinds of tools, the AI speech recognition tools, are helping her get through. 

 

HOST:                         0:06:30 That’s a wonderful story, and certainly inspiring. Can you tell us a little more about your PhD research? 

 

RICHARD:                  0:06:38 So, my research was on this AI speech recognition tool, and I worked part-time on it and I finished a couple of months ago. And I followed three couples, where one person in each couple had MND. They used the speech recognition tool. They trained it up for their own speech – it’s called Project Relate – and I videoed them using it over 12 months to see how it worked in an everyday conversational setting, talking with somebody that they talk to every single day and in their homes. So, it’s a kind of ‘in the wild’ study. 

 

I wanted to assess what difference it made, if any. I measured it in three ways. The first was from the device, because the AI tool scores itself and provides forecasts of how successful it will say the interaction is. 

 

The second was that I used something called conversation analysis, where I just studied the interaction itself in detail, to understand what parts of it worked, where there was a misunderstanding, or ‘trouble’ as this is called, and how that trouble was fixed or repaired, and whether this technology, this AI technology, was part of that or whether it wasn’t. 

 

And the third area was I asked them. So, it was a thematic analysis of semi-structured interviews. The interesting thing was that the way Project Relate measures itself, it estimated its performance as much higher than it really was. In a way, that’s not too surprising, because this tool is about conversation, and you can’t measure the success of conversation from a phone, because we talk to people and people talk back. We can’t just get that from a phone, but maybe because it’s easiest to do, from the technology perspective, they tried to measure success using a measure called word error rate from the phone. And so, I wrote at length about how that’s not the right thing. 
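
For context on the metric Richard critiques here: word error rate (WER) counts the word substitutions, deletions and insertions needed to turn an ASR caption into what was actually said, divided by the number of words spoken. The Python sketch below shows only the standard textbook calculation; it makes no assumptions about how Project Relate computes its own scores, which isn’t described in this episode.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words,
    computed with a word-level Levenshtein (edit) distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: the caption drops one word ("a") and mishears another ("tea" becomes "three")
print(word_error_rate("i would like a cup of tea", "i would like cup of three"))  # 2 edits / 7 words, about 0.29

As the sketch shows, WER only compares two strings of words. It says nothing about whether the conversation partner actually understood, or how a misunderstanding was repaired, which is why a phone-side score can overestimate success in real conversation.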

 

But from a conversation analysis perspective, even though I saw that many of the conversations often had breakdown, misunderstanding, because one person’s speech was very difficult to understand, actually, often, there was quite a lot of success in repair. And it reminded me that, actually, people use context and prior knowledge and gestures and all the other glorious things that make up communication all at the same time. And so, it’s not just the speech, it’s everything, and that can help people to understand each other even when it’s very difficult to understand. And people that know each other well are often good at repairing when things are misunderstood. 

 

And the third area, the interviews, really shone a light on how they thought about the tool. And I found that the most surprising of all: there were some themes that came out of that. And the first was about responsibility. So, everybody could see when the captioning was incorrect. But the problem was, for people that are not tech-savvy about AI, about how these tools work, it was how they allocated the responsibility: at least in one couple, maybe even two, when the captions went wrong, they appeared to blame themselves. And they had no awareness. They didn’t display an awareness of what was the technology’s fault and what was a different issue. 

 

On more than one occasion they allocated responsibility for the failure to themselves. And so, what that ended up being was that this tool that was built for the very best purposes and intentions, using AI and all the rest of it, ended up, at times, psychologically reinforcing where people are, as opposed to supporting where they want to be, where they want to get to in terms of communication. So, it was giving negative cycle support. 

 

And that’s why from this it’s so important for us to talk about AI in terms of what it can do and what it can’t do. Because both need to be given equal importance for us to understand how it works in daily life.

 

HOST:                         0:11:45 So, what are you working on now? 

 

RICHARD:                  0:11:50 There’s a bunch of areas. I mentioned the Centre for Digital Language Inclusion, and I think that is a massive initiative; we’ve got a lot of funding for that. The aim is nothing less than to ensure that we include people with disability and changed speech in AI, so that, whatever language they speak, AI can help them be better understood in daily life, where they are, whatever they’re doing, so they can just get on with whatever they want to do. That was launched last week, so that is pretty much the next 18 months of my life. 

 

But I’ve also been working as part of a working party on brain computer interface technology, BCI. We know that brain computer interface is a technology that’s already here, and it’s a way that you can collect signals directly from the brain and convert them into words or actions. So, the argument is that if you think what you want to say, you can type those words, or those words can be spoken via text-to-speech with a synthetic voice. 

 

And for us as speech therapists, now is the time. Now is the time for us to get involved with this, because in 2025, right now, there are plans for trials here in the UK on BCI technology for people living with MND. I’ve been working with various researchers that are planning on doing that to just recognise that this technology… we need to understand better how this technology is going to be used in practice, in the lived experience of the people using it. How is it going to help? How can we assess how it will help somebody who just wanted to have a chat with their partner? Or if they want to control their environment, how are we going to do it? Can we actually test that in their environment with their partner? And if they want to say to their partner, I love you, is that an easy thing to do? Is it not? 

 

I want to take it out of the lab and into the lived experience. And I think for us as speech and language therapists, we can help different teams actually make that happen. From a recently published survey of people living with MND that I helped with, around BCI technology, many people with MND think it’s a very important technology for them, but they have a lot of questions on how it’s actually going to work, where I live, in my house. I’m completely up for that. 

 

HOST:                         0:14:55 So, this is really interesting. The question is, how do speech and language therapists get involved? 

 

RICHARD:                  0:15:00 This is the ongoing challenge. All I can say is, I got involved by being annoying and getting in touch and offering feedback to these products and projects. And I think there are a few things you can do: if you want to be involved, or if you see something that you’ve got an opinion about, get in touch with them. Get in touch with them. But I think there is scope here for us as an organisation, at the Royal College, to think more broadly about this. Because AI is here, and AI is affecting our communities, us as a profession, and our clients. And it’s in one of two ways: either we are being involved, or we are being thought of and things are being decided on our behalf, or we’re not being thought of and things are being decided anyway. In a way, we don’t really have a choice. We need to get involved somehow. 

 

I’m more than happy to help with my links at Google and all the rest of it, but I think we need to talk about that. 

 

HOST:                         0:16:20 What in your research has been, I guess, surprising or changed your thinking? 

 

RICHARD:                  0:16:25 I want to talk about voice banking and AI-driven voice banking. So, voice banking is a way to create a synthetic version of a person’s own voice, how they sound. And in the last few months there’s this AI-driven stuff from ElevenLabs and others. And that’s just been introduced for the communities that I work with, because ElevenLabs have made it all free for people with MND around the world. 

 

The interesting thing there is that with very limited data, a very limited number of recordings, the level of authenticity is… and I’m jaded here… stunning. 

 

HOST:                         0:17:14 This is a good point for us to pause and listen to some clones of Richard’s voice. The clip I’m about to play took around ten minutes of training, so ten minutes of reasonable audio. 

 

AI ‘RICHARD’:            0:17:25 So, this is a clone of my voice. As you can tell, it sounds pretty natural and provides a good approximation of what I sound like. 

 

HOST:                         0:17:34 But this clip only took 30 seconds to train. 

 

AI ‘RICHARD’:            0:17:38 So, this is a clone of my voice. As you can tell, it sounds pretty natural and provides a good approximation of what I sound like. 

 

HOST:                         0:17:46 Now, I used the cloned voice for the whole episode with Professor Annalu Waller – check your feed if you’re interested – and that one took around an hour to train, but all of them are pretty good.

 

RICHARD:                  0:17:58 It’s stunning. And what I found is that I’ve had feedback from people who say that not only does the voice being generated from their device feel more like them – and this is all about identity, of course – but that the people around them appear to be talking to them more because of these voices. 

 

This technology is so new that there hasn’t really been any research on it. At the MND Association we’ve only been using it for about three months. I just need to find something that helps with social isolation and getting people back to where they would like to be. But by the same token, [inaudible 0:18:43] authentic voice will do nothing for speed. So, it might sound like me from my text-to-speech device, but it’ll still be 20 words a minute using eye gaze, or whatever it is. 

 

One of the things is that, as things get more authentic sounding, we have to remind everybody that voice banking is not talking banking, even though it might sound a bit like it at times. Talking banking requires talking at the speed of conversation, and we’re just not there.

 

HOST:                         0:19:16 I think this comes back to the point that we were discussing with Professor Annalu Waller: the voice cloning gives us an authentic-sounding voice with fairly natural intonation, and the generative AI can help with some of the interface and speed problems, provided it maintains the personal history and doesn’t usurp the personality of the person using the technology. Would that be a fair summary? 

 

RICHARD:                  0:19:42 Yes. Yes, it is. It’s a fair summary, making sure our eyes are completely open about what’s actually happening. The only arbiter for how this will actually work is people actually using it in their homes. And so, I think that is a really interesting way forward, and actually, that’s something that we should be exploring, marrying these technologies. 

 

And the other area that I’ve been looking at is avatars. So, we’re going beyond synthetic voice to synthetic appearance. Avatars these days are hyper-realistic. I mean, really amazingly good. For people living with MND, they’re being made available to some folks in the US, and there are teams in Portugal working on it. And so, when you combine those hyper-realistic avatars with hyper-realistic voice, and perhaps the ChatGPT stuff, then they’re argued to represent an incredible likeness of who they are. 

 

But the question I have on avatars… well, I have so many. The first thing is, how and where are they actually used? Would they only be used on a Zoom call, or would you use them out and about, shopping? How would it be used? The thing is that researchers have already been looking at that and putting out ideas for how they can be used. One of them I saw had a mount on a wheelchair with a big screen on the mount right in front of the person’s face, and the avatar was blocking out the person’s face. When you looked at the wheelchair, you saw the TV screen with the avatar, and the person was… And I was thinking, stop right there. 

 

HOST:                         0:21:56 That sounds horribly grim. 

 

RICHARD:                  0:21:59 From a developer perspective, I get why… from an engineering perspective… I kind of get why they did that. And actually, perhaps for some people that’s completely fine; I just don’t know. But there’s a wider discussion here, I think. When we talk about avatars and voice and sentences, it’s like there seems to be a hierarchy of acceptability in what people are trying to fix or cure. 

 

One thing from my perspective is that the reason why I became a speech therapist is because I stammer, I stutter, and I always have. Speech therapy came and helped me when I was young. I still stutter and stammer, and that’s just who I am. I just think about it differently. But my voice bank is fluent. And I think, well, do I really want something to represent me that is fluent? Because actually, the way I talk is the way I talk, and that’s me. And actually, I’m quite happy with the way I bumble and stop and pause, because that’s just me. And so, do we want AI to create these better versions of me and others? 

 

HOST:                         0:23:15 That’s a really good point. And for anyone who is interested, we have a wonderful episode on stammering and how we are changing our attitudes to differences in speech and language that are not necessarily deficiencies. 

 

So, Richard, a couple of weeks ago Keir Starmer made a big announcement about how the UK was making a big bet on AI. And I think it’s important to note this was before the release of DeepSeek, the Chinese large language model which is certainly changing the way we think about AI. But what are your thoughts on the government’s announcement? 

 

RICHARD:                  0:23:48 I think this is a tremendous opportunity. I think this is the right thing. And also, you mentioned how we can get involved, and now we have a good reason to get involved, because Keir Starmer has said, we are going to do this, we are going to move ahead, we’re going to be a centre of excellence for AI. Now, let’s make AI work for our community – not some of our community, all of our community. And he’s given us the reason, the excuse, as an organisation but also as clinicians, to get in touch. It’s like, okay, I’m going to write to my MP and say, actually, I really like this, I have got something to offer here, to actually be part of the conversation. 

 

Sitting here at UCL at the GDI Hub, we’re one of the leading voices on AI and disability, and we advise on the international stage as well, and we’re doing AI-led solutions. So, it’s like, we can do it, we just need the opportunity. And personally, I think the UK Government made the right decision, because there’s all this AI going on globally, and it has the potential to make a positive difference to the people that we serve and also to our profession as well. We shouldn’t be afraid of AI. It’s time to make it the way it should be. 

 

HOST:                         0:25:26 A very big thank you to Richard for his time. As always, do see the show notes for links that may be of interest. Please do share, rate and like the podcast so that we can continue to advocate for speech and language therapists and their service users in the UK and beyond. 

 

Until next time, keep well. 

 

MUSIC PLAYS: 0:25:46

END OF TRANSCRIPT