Disruption Works Chit Chat

Voicebot jargon, TTS and all that jazz in conversation

November 16, 2022 Disruption Works Season 2 Episode 4

Talking about all the acronyms and craft of conversation and where all this jargon fits in with the design and build of a voicebot. We discuss the voice artist role in a voicebot and how that adds the human touch.

Find out more at https://www.disruptionworks.co.uk/voice-bots 

Our latest series of podcasts concentrates on voice and how it is going to impact the next few years, with tips along the way. Find out more about voicebots here, and if you have any subjects that you would like us to discuss, then email info@disruptionworks.co.uk with the subject Podcast and we will see what we can do ;-)

0:0:7.180 --> 0:0:15.920
 Steve Tomkinson
 Hello and welcome to another edition of Disruption Works Chit Chat with Steve and Dave. Hi, Dave. How are you doing today?

0:0:16.330 --> 0:0:19.900
 David Antwis
 All right, Steve, I'm doing OK. Thanks very much despite the weather.

0:0:20.420 --> 0:0:27.110
 Steve Tomkinson
 Yeah, well, I got, well, we're applied today. So we're not in the damp basement that you're usually in.

0:0:27.650 --> 0:0:32.440
 David Antwis
 The wind isn't too bad, so I I didn't. I didn't feel afraid coming up. Yeah.

0:0:31.910 --> 0:0:33.160
 Steve Tomkinson
 You you thought you dressed it?

0:0:33.600 --> 0:0:33.860
 David Antwis
 Yeah.

0:0:35.340 --> 0:0:36.270
 Steve Tomkinson
 You might as well.

0:0:37.60 --> 0:0:38.130
 Steve Tomkinson
 I'm sorry.

0:0:37.750 --> 0:0:40.980
 David Antwis
 Lucky you got pictures on the wall though, cause the view isn't much for the clouds.

0:0:41.30 --> 0:0:43.590
 Steve Tomkinson
 Well, absolutely. Yeah. That's a great all day, all night.

0:0:44.390 --> 0:0:47.640
 Steve Tomkinson
 Um, well, I suppose today,

0:0:48.880 --> 0:0:51.950
 Steve Tomkinson
 I what I wanted to talk to you about is.

0:1:5.400 --> 0:1:6.30
 David Antwis
 Yes.

0:0:54.10 --> 0:1:10.280
 Steve Tomkinson
 Conversation and the voice applications. There's a lot of jargon going around, so this was a bit of an explainer around the nuances, because you're very technical, and a lot of people look at these things as just a technical application.

0:1:10.810 --> 0:1:15.320
 Steve Tomkinson
 And and you were asking about, well, how does a?

0:1:16.190 --> 0:1:44.580
 Steve Tomkinson
 conversation get put together outside of, you know, normal chatbot work or, you know, SMS campaigns or whatever else it may be, because voice is a different animal. It's quite a different animal, a very, very powerful one, but it's different altogether. So I suppose really it was kind of the glossary of terms that we've got and kind of understanding those, and that was the thing that you were looking at.

0:1:50.850 --> 0:1:51.130
 Steve Tomkinson
 Yeah.

0:1:56.210 --> 0:1:56.950
 Steve Tomkinson
 Yeah, yeah.

0:2:0.240 --> 0:2:0.840
 Steve Tomkinson
 Yeah.

0:2:6.110 --> 0:2:6.560
 Steve Tomkinson
 Yeah.

0:2:9.880 --> 0:2:10.690
 Steve Tomkinson
 Yeah, yeah.

0:2:11.680 --> 0:2:12.80
 Steve Tomkinson
 Yeah.

0:1:44.300 --> 0:2:14.390
 David Antwis
 Yes, well, it's a different set of skills to that which I'm used to using. If I press a button, I know that I'm gonna get speech out of the end, generated by the computer. But you come from promotional videos, so I suppose it's kind of natural to you to use humans for your products, and you're putting humans into my products, so I don't know how to do that and I don't really know why in some cases.

0:2:15.420 --> 0:2:31.550
 Steve Tomkinson
 Well, I suppose so. And again, there's lots of technical acronyms around, like TTS, STT, NLU, all that type of stuff, and we use it too much. You know, we're jargon heavy. It is jargon heavy, let's face it.

0:2:31.730 --> 0:2:32.240
 Steve Tomkinson
 You know.

0:2:31.810 --> 0:2:33.240
 David Antwis
 It what does that stand for?

0:2:34.170 --> 0:2:34.640
 David Antwis
 That's.

0:2:34.940 --> 0:2:35.650
 Steve Tomkinson
 Exactly.

0:2:36.830 --> 0:3:5.160
 Steve Tomkinson
 But I suppose that's the difference. The difference is that when you're doing a voice design, so we do chatbots a lot and we have text responses, and that's pretty straightforward, and we make them very conversational. The text responses in chatbots, whether it's based on a big knowledge base, and that's all AI and machine learning driven, you know, so there's a fair bit of tech in the background there, but the responses can be text and that's very easy to control.

0:3:6.120 --> 0:3:31.220
 Steve Tomkinson
 But voice is different with the input and output. So if we're trying to understand what somebody is saying, then that literally is NLU, which is natural language understanding. So if you're coming across that, that's literally what the computer has to understand. So it doesn't know what the other person said, you know, and that is not necessarily bang on every time.

0:3:31.630 --> 0:3:37.260
 Steve Tomkinson
 And, you know, how many times have we shouted at Alexa to try and get Alexa to do exactly what we want?

0:3:38.460 --> 0:3:38.800
 David Antwis
 Yeah.

0:3:38.210 --> 0:3:48.980
 Steve Tomkinson
 Goodnight. That's not what we want, you know. You know, the number of Alexas now that only respond to shouting is probably significant.

0:3:50.220 --> 0:3:55.30
 Steve Tomkinson
 You've trained your own Alexa to respond to shouting only.

0:4:2.50 --> 0:4:2.410
 Steve Tomkinson
 Yeah.

0:3:55.700 --> 0:4:5.800
 David Antwis
 But that's the core of the technology, isn't it? The understanding what we're saying and getting meaning from us, that's.

0:4:8.70 --> 0:4:8.280
 David Antwis
 Yeah.

0:4:5.150 --> 0:4:22.160
 Steve Tomkinson
 Yeah, well, absolutely. And that's where tech has really gone on, you know, because we've now got massive databases for understanding language, and one of the biggest and most successful is Google. They've got more data than anybody else, actually much more than Amazon, which most people are kind of surprised by.

0:4:22.660 --> 0:4:32.880
 Steve Tomkinson
 Umm, but it means that because that technology has gone on and improved so significantly, we can now have a conversation that we understand.

0:4:33.520 --> 0:4:35.370
 Steve Tomkinson
 Umm. And it makes.

0:4:35.630 --> 0:4:36.330
 David Antwis
 That's true.

0:4:36.350 --> 0:4:40.300
 Steve Tomkinson
 It makes it all available, so it makes the whole part of our.

0:4:41.480 --> 0:5:8.750
 Steve Tomkinson
 voicebot story available, because that's the starting point. If we can't understand what somebody is saying in natural speech, you know, then we can't respond. No matter what the accent, no matter what the dialect, if we can understand those people, then it means that we can then respond. Because what we then get is a text stream that can be understood with the programming that's behind that.

0:5:9.880 --> 0:5:12.960
 David Antwis
 Right. So that's makes sense.

0:5:13.890 --> 0:5:14.350
 Steve Tomkinson
 Yeah.

0:5:13.740 --> 0:5:16.990
 David Antwis
 Text stream you you're on to the next one there, I think.

0:5:16.950 --> 0:5:24.180
 Steve Tomkinson
 Yeah, well, that's right. So that's TTS, which is, I mean, STT, which is speech to text.

0:5:27.70 --> 0:5:27.320
 David Antwis
 That's.

0:5:24.840 --> 0:5:31.960
 Steve Tomkinson
 Yeah, let's get it right. I'm just getting it back to front. We're supposed to be the experts on this.

0:5:33.480 --> 0:5:36.80
 Steve Tomkinson
 But the point is you then get a text

0:5:36.540 --> 0:6:6.630
 Steve Tomkinson
 answer from what somebody said, and then you've got the layer in which we design conversations, and we understand what the use case usually is for somebody. So at the moment we're doing one for debt collection, for instance. We know that there's a certain amount of vocabulary and language that's going to give us the context of what somebody said, what they're trying to do. They're usually trying to pay a bill or they're trying to negotiate a

0:6:9.950 --> 0:6:10.180
 David Antwis
 Yeah.

0:6:22.440 --> 0:6:22.680
 David Antwis
 Yeah.

0:6:6.720 --> 0:6:37.130
 Steve Tomkinson
 payment or, you know, whatever that may be. There's a finite amount of language in that, so it makes it quite easy for us to then design the conversation around it. But we have to understand what they've said first and then get that text out, then our platform will understand what happens with that. So then we've got natural language programming, which is NLP. So then the NLP takes over and goes, right, this person is trying to pay a bill,

0:6:37.290 --> 0:6:43.880
 Steve Tomkinson
 but they can't pay yet and they want to negotiate payment terms, because that's what they've just said. Yeah.
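To make the "finite amount of language" idea concrete, here is a minimal sketch of spotting a caller's intent from the text that comes out of speech to text. The intent names and trigger phrases are invented for illustration and are not taken from the Disruption Works platform; a production system would more likely use a trained language model than plain phrase matching.

```python
from typing import Dict, Set

# Hypothetical intents and trigger phrases for a debt-collection use case.
# More specific intents come first so they win when several phrases match.
INTENT_KEYWORDS: Dict[str, Set[str]] = {
    "negotiate_terms": {"can't pay", "cannot pay", "payment plan", "more time"},
    "pay_bill": {"pay my bill", "make a payment", "pay now", "pay the balance"},
}


def detect_intent(transcript: str) -> str:
    """Return the first intent whose trigger phrase appears in the transcript."""
    text = transcript.lower()
    for intent, phrases in INTENT_KEYWORDS.items():
        if any(phrase in text for phrase in phrases):
            return intent
    return "unknown"


if __name__ == "__main__":
    print(detect_intent("I want to pay my bill but I can't pay yet"))
```

Because the use case is narrow, a small vocabulary like this already separates "pay now" from "can't pay yet", which is the distinction described in the conversation.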

0:6:52.340 --> 0:6:52.970
 Steve Tomkinson
 Yeah, yeah.

0:6:58.140 --> 0:6:58.390
 Steve Tomkinson
 Yeah.

0:6:43.780 --> 0:7:4.620
 David Antwis
 So we've gone from understanding the utterances, the speech, and getting words from it, which you could look up in a dictionary. So we've got what they've said, and that's your speech to text. Next, we're using NLP, natural language processing or programming.

0:7:3.420 --> 0:7:5.490
 Steve Tomkinson
 Programming, yeah.

0:7:5.770 --> 0:7:8.220
 David Antwis
 And you're then

0:7:14.240 --> 0:7:14.870
 Steve Tomkinson
 Yep.

0:7:10.740 --> 0:7:15.650
 David Antwis
 Interpreting what they've said from from the text.

0:7:15.910 --> 0:7:39.260
 Steve Tomkinson
 Yeah. So again, then there's the context of how that works. So then the context is programmed. So that's when we're into interacting with APIs into systems. We've got the data for that person, we know who they are, we've maybe done the right-person identification for them. So we've asked them who they are, what their name is, all that type of stuff.

0:7:40.0 --> 0:7:40.220
 David Antwis
 Yep.

0:7:40.380 --> 0:7:41.370
 Steve Tomkinson
 And we've.

0:7:42.180 --> 0:7:44.140
 Steve Tomkinson
 We've done that, uh.

0:7:45.450 --> 0:7:54.460
 Steve Tomkinson
 We we have that part of the process, then that's the the whole, uh kind of interaction that you would be doing as a human agent.
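As a rough illustration of the context step, the sketch below holds what is known about the caller, checks their identity, and then picks the next step for a given intent. The field names, the postcode check and the step names are all assumptions made for this example, and the account data would in practice come from an API call rather than being passed in by hand.

```python
from dataclasses import dataclass


@dataclass
class CallContext:
    """Data assumed to have been fetched for the person being called."""
    expected_name: str
    expected_postcode: str
    verified: bool = False


def _normalise(value: str) -> str:
    # Ignore case and spacing so "AB1 2CD" matches "ab12cd".
    return "".join(value.lower().split())


def verify_caller(ctx: CallContext, spoken_name: str, spoken_postcode: str) -> bool:
    """Right-person check: compare what the caller said with the record."""
    ctx.verified = (
        _normalise(spoken_name) == _normalise(ctx.expected_name)
        and _normalise(spoken_postcode) == _normalise(ctx.expected_postcode)
    )
    return ctx.verified


def next_step(ctx: CallContext, intent: str) -> str:
    """Choose the next step in the journey from the context and the intent."""
    if not ctx.verified:
        return "ask_identity"
    if intent == "negotiate_terms":
        return "offer_payment_plan"
    if intent == "pay_bill":
        return "take_payment"
    return "ask_to_rephrase"
```

The point of the design is that nothing account-specific happens until the identity check has passed, much as a human agent would work through the same process.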

0:7:56.140 --> 0:7:56.610
 David Antwis
 OK.

0:7:57.0 --> 0:7:58.890
 Steve Tomkinson
 So the responses.

0:7:59.850 --> 0:8:30.60
 Steve Tomkinson
 are then the output, which is the voice that somebody would hear, and that's, you know, then the text to speech. So then we're at TTS, which is text to speech, and there are loads of text to speech engines. So you can have different voices, male, female, accents. You could have an Irish voice. You can have an English voice. You can have a mid, mid English voice. You can have an American voice. You can have different languages depending on what you want to do. And you have different languages based on who you're talking to.
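Choosing a voice by language, accent and gender can be pictured as a simple lookup. The sketch below is only an illustration: the voice identifiers are made up and do not correspond to any real text to speech engine, and the fallback order is an assumption.

```python
from typing import Dict, Tuple

# Hypothetical voice catalogue keyed by (locale, gender).
VOICES: Dict[Tuple[str, str], str] = {
    ("en-GB", "female"): "example-en-gb-female",
    ("en-GB", "male"): "example-en-gb-male",
    ("en-IE", "female"): "example-en-ie-female",
    ("en-US", "male"): "example-en-us-male",
}
DEFAULT_VOICE = "example-en-gb-female"


def pick_voice(locale: str, gender: str = "female") -> str:
    """Prefer an exact match, then any voice in the locale, then the default."""
    exact = VOICES.get((locale, gender))
    if exact:
        return exact
    for (voice_locale, _), name in VOICES.items():
        if voice_locale == locale:
            return name
    return DEFAULT_VOICE
```

For example, `pick_voice("en-IE")` would select the Irish voice, while an unsupported locale falls back to the default English one.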

0:8:30.900 --> 0:8:37.420
 Steve Tomkinson
 So, you know, all those things are possible now and there's a wide range of those things and they're

0:8:39.400 --> 0:8:42.190
 Steve Tomkinson
 Nuanced in a way, but they're not totally.

0:8:43.530 --> 0:8:44.360
 Steve Tomkinson
 Humanistic.

0:9:0.500 --> 0:9:1.690
 David Antwis
 Yes, yes.

0:8:45.150 --> 0:9:6.20
 Steve Tomkinson
 They're pretty good, you know? You've talked to Alexa, and Alexa is OK and she's, you know, got quite a nice response. But she's usually pretty set, and it can sound a little bit robotic. "This is what I found from the Internet", and they read out the Wikipedia entry or whatever it is. And that's what comes out.

0:9:3.400 --> 0:9:7.170
 David Antwis
 Yes, it doesn't really know what it's saying, yeah.

0:9:6.810 --> 0:9:13.620
 Steve Tomkinson
 Yeah, yeah, yeah, it's it's just reading out. So that's text to speech. When you're hearing that and using Alexa's own.

0:9:14.840 --> 0:9:15.550
 Steve Tomkinson
 Speech engine.

0:9:16.910 --> 0:9:21.340
 Steve Tomkinson
 So and you know, we can leverage that stuff, but.

0:9:22.270 --> 0:9:26.530
 Steve Tomkinson
 What we also do, which is a nuance, is we do the

0:9:49.180 --> 0:9:49.440
 David Antwis
 Yeah.

0:9:27.480 --> 0:9:55.310
 Steve Tomkinson
 side speech. So little bits of interruption within a conversation, within a conversation flow, and we have to use the likes of Tom, who's on our team, to then design a conversation that sounds a lot more natural. So it's not just the "this is your answer" that we just talked about with Alexa, you know, "tell me what was the biggest selling David Bowie track", you know, or whatever it is.

0:9:56.150 --> 0:10:2.340
 Steve Tomkinson
 And it will just read something out, but what we'll have is, again, "Well, that's interesting. You're interested in that."

0:10:7.960 --> 0:10:8.400
 David Antwis
 Right.

0:10:3.130 --> 0:10:18.860
 Steve Tomkinson
 You know, and you'll have side speech that takes you around the journey. So if we're negotiating a debt, like we are with this latest project, then the negotiating of the debt is going, "Well, OK, let's see what we can do for you." All that's not

0:10:19.660 --> 0:10:31.970
 Steve Tomkinson
 necessary speech, but it's part of conversation. It's part of our natural conversation flow. "Right, just give me a second. Let me have a little look at that." Like you would as a human agent, but it's the computer saying that stuff.
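Side speech can be sketched as a short phrase placed in front of the real answer, chosen according to where the caller is in the journey. The stage names below are invented for this example; the phrases are the ones quoted in the conversation.

```python
import random
from typing import Dict, List, Optional

# Hypothetical stages mapped to the side-speech phrases quoted above.
SIDE_SPEECH: Dict[str, List[str]] = {
    "lookup": ["Right, just give me a second.", "Let me have a little look at that."],
    "negotiate": ["OK, let's see what we can do for you."],
}


def with_side_speech(stage: str, answer: str,
                     rng: Optional[random.Random] = None) -> str:
    """Prefix the answer with a side-speech phrase when the stage has one."""
    chooser = rng or random  # both offer a choice() method
    options = SIDE_SPEECH.get(stage)
    if not options:
        return answer
    return f"{chooser.choice(options)} {answer}"
```

The answer itself is unchanged; only the conversational wrapper around it varies, which is why this kind of speech is not strictly necessary but makes the flow feel natural.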

0:10:33.140 --> 0:10:40.30
 Steve Tomkinson
 The next layer on that is to go one step above TTS, which is then a voice artist overlay.

0:10:41.70 --> 0:11:3.990
 Steve Tomkinson
 So our platform allows us to use voice artists so that the speech can be actually on-brand. So you can have a branded speech, but also you can have very, very natural speech, because it is actually a human voice you're hearing, but it's part of the conversation flow. So we'll have all different utterances. We'll have lots of little side pieces of

0:11:5.990 --> 0:11:24.240
 Steve Tomkinson
 voice that supports the conversation and the context we're trying to do, and that's a whole new layer that makes it sound, well, we never try and tell somebody that they're talking to a human, but what they do is they forget. That's the whole aim: they forget they're talking to a computer.

0:11:24.920 --> 0:11:27.520
 David Antwis
 That's the art of it, isn't it? Yeah, yeah.

0:11:26.380 --> 0:11:32.20
 Steve Tomkinson
 Yeah, yeah, yeah. So we've started with understanding faultlessly.

0:11:32.960 --> 0:11:45.250
 Steve Tomkinson
 Understanding what they're saying, and then we go, we know what you're talking about, because we're doing a debt collection journey or we're doing an appointment reminder journey or we're doing a, you know, whatever it may be.

0:11:47.20 --> 0:11:47.780
 David Antwis
 Yes.

0:11:46.660 --> 0:12:0.480
 Steve Tomkinson
 We understand the context, so we don't have to have some sort of broad-ranging, um, Alexa-style intelligence going on with this, because that's not why you're phoning up. You're phoning up because you wanna pay a parking ticket.

0:12:1.540 --> 0:12:4.230
 Steve Tomkinson
 So why were we gonna say, you know?

0:12:6.890 --> 0:12:9.30
 David Antwis
 What's your favorite colour? Yeah.

0:12:4.430 --> 0:12:16.150
 Steve Tomkinson
 Uh, you know what was the? What was the best? Yeah, that's right. Or the best selling David Bowie album. You're not gonna say those things. That isn't part of the conversation. So you don't need to have support.

0:12:17.310 --> 0:12:37.750
 Steve Tomkinson
 But what you can do is you can have a very natural conversation, take the payment and do everything and have a full self-service journey where a human isn't involved, but it didn't feel like there wasn't a human involved and that's then that is the art. And so the art of that conversation starts with the tech and there's a lot of tech in there.
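Putting the pieces together, one turn of a voicebot conversation runs through the stages discussed: speech to text, understanding, deciding the response, and text to speech. The sketch below uses stand-in functions for every stage (the "audio" is just encoded text), so it shows only the shape of the pipeline and not any real speech engine.

```python
def speech_to_text(audio: bytes) -> str:
    # Stand-in for an STT engine: pretend the audio bytes are UTF-8 text.
    return audio.decode("utf-8")


def understand(text: str) -> str:
    # Stand-in for NLU/NLP: a narrow use case needs only a narrow vocabulary.
    return "pay_bill" if "pay" in text.lower() else "unknown"


def decide_response(intent: str) -> str:
    # Stand-in for the designed conversation: controlled, scripted replies.
    responses = {"pay_bill": "Of course, let's take that payment now."}
    return responses.get(intent, "Sorry, could you say that again?")


def text_to_speech(text: str) -> bytes:
    # Stand-in for a TTS engine or a voice-artist recording.
    return text.encode("utf-8")


def handle_turn(audio: bytes) -> bytes:
    """Run one caller utterance through the whole pipeline."""
    text = speech_to_text(audio)
    intent = understand(text)
    reply = decide_response(intent)
    return text_to_speech(reply)


if __name__ == "__main__":
    print(handle_turn(b"I want to pay my parking ticket").decode("utf-8"))
```

An off-topic question simply falls through to the "say that again" reply, which reflects the point that a parking-ticket journey does not need broad, assistant-style intelligence.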

0:12:39.90 --> 0:12:44.520
 Steve Tomkinson
 But it also is finishes with the design and the.

0:12:45.500 --> 0:12:48.430
 Steve Tomkinson
 the important parts of the conversation flow are.

0:12:49.440 --> 0:13:2.360
 Steve Tomkinson
 They are intrinsic in this. You know. You can't. You can't make it work without understanding how a conversation flows. Not from a chat bot. Level of understanding from a voice and conversational understanding.

0:13:3.40 --> 0:13:4.820
 David Antwis
 Yes, you get so.

0:13:4.130 --> 0:13:5.570
 Steve Tomkinson
 You know, that's totally different.

0:13:9.250 --> 0:13:9.830
 Steve Tomkinson
 Yeah.

0:13:6.590 --> 0:13:14.820
 David Antwis
 Becomes awkward and disjointed. But the special place for the voice artist, though, is that where

0:13:15.820 --> 0:13:17.890
 David Antwis
 You want icing on the cake.

0:13:18.920 --> 0:13:19.970
 David Antwis
 In the conversation.

0:13:20.670 --> 0:13:45.40
 Steve Tomkinson
 Yeah, well, my, the, the ones that we're doing that are big, uh, you can have a a fully TTS uh chat uh voice bot and that works fine and it's alright. The quality of the voice platforms now that we that we use that are TTS voices they are really quite good. So that's all.

0:13:46.60 --> 0:14:0.50
 Steve Tomkinson
 really doable. We even have custom TTS as an option, which is, you know, you have your own voice artist doing all the work that Google has already done or Alexa has already done, or Microsoft have done, to do an English

0:14:2.170 --> 0:14:2.920
 Steve Tomkinson
 Wavenet.

0:14:3.880 --> 0:14:5.940
 Steve Tomkinson
 Voice, you know that we can use.

0:14:6.550 --> 0:14:6.920
 David Antwis
 Right.

0:14:7.300 --> 0:14:29.180
 Steve Tomkinson
 So you can have that as well, but the voice artist overlay gives us the option of making it sound perfectly normal, perfect voice, because it actually is a recording of a person doing the responses we would expect to respond. You know, whether it's slightly randomized and it's not the same response every time or whatever it may be.
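One way to picture the voice artist overlay is as a library of recorded clips keyed by response, with a random pick among the takes so the caller does not hear the identical clip every time. The file names and response keys below are hypothetical, and the fallback to text to speech when no recording exists is an assumption for this sketch rather than a description of the actual platform.

```python
import random
from typing import Dict, List, Optional, Tuple

# Hypothetical library of voice-artist recordings, several takes per response.
RECORDINGS: Dict[str, List[str]] = {
    "greeting": ["greeting_01.wav", "greeting_02.wav"],
    "hold_on": ["hold_on_01.wav"],
}


def choose_output(response_key: str, fallback_text: str,
                  rng: Optional[random.Random] = None) -> Tuple[str, str]:
    """Return ("recording", filename) if a clip exists, else ("tts", text)."""
    chooser = rng or random  # both offer a choice() method
    clips = RECORDINGS.get(response_key)
    if clips:
        return ("recording", chooser.choice(clips))
    return ("tts", fallback_text)
```

Here `choose_output("greeting", "Hello")` plays one of the two recorded greetings, while a response with no recording, such as a specific balance, is handed to the TTS engine instead.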

0:14:36.30 --> 0:14:36.300
 David Antwis
 Yeah.

0:14:30.600 --> 0:14:59.690
 Steve Tomkinson
 The voice is actually a human voice, so it's not made up of parts. It is actual. Yeah. So that is hugely powerful. It still gets the same effect. It still does exactly the same job, but the artistry around getting that right is really powerful. You know, there's a lot of people trying to do that automatically, but it's still not bob on, you know. It takes a little while for that to get right.

0:15:9.130 --> 0:15:9.380
 Steve Tomkinson
 No.

0:15:0.610 --> 0:15:18.990
 David Antwis
 What you were saying earlier, that we don't pretend to be humans and our bots don't pretend to be humans. But the idea is, if we can make the customer forget they're talking to a bot, then you've achieved perfection there, and

0:15:18.680 --> 0:15:19.20
 Steve Tomkinson
 Yeah.

0:15:19.750 --> 0:15:22.180
 David Antwis
 Whilst text to speech.

0:15:23.100 --> 0:15:24.40
 David Antwis
 Works pretty well.

0:15:24.590 --> 0:15:24.970
 Steve Tomkinson
 Yeah.

0:15:24.640 --> 0:15:25.30
 David Antwis
 Umm.

0:15:33.90 --> 0:15:33.700
 Steve Tomkinson
 Yeah.

0:15:27.400 --> 0:15:36.280
 David Antwis
 Having the voice artist in it, it just makes it easier to forget, I suppose to while you're listening you you can forget, yeah.

0:15:36.170 --> 0:15:49.380
 Steve Tomkinson
 No, that's right, because you're using all the power of a computer to understand what's being said, and then the responses are always going to have some sort of control, because you want to make sure you're compliant and you're right and everything like that, but like

0:15:50.600 --> 0:15:51.880
 Steve Tomkinson
 a contact centre

0:15:52.680 --> 0:15:56.930
 Steve Tomkinson
 agent needs to stick to their script, not

0:15:57.440 --> 0:16:12.840
 Steve Tomkinson
 go off-piste somewhere. They're gonna have to go through a process, because they do go through a process. And if you listen to enough contact centre recordings, you'll hear that they all pretty much say the same thing.

0:16:14.240 --> 0:16:15.50
 Steve Tomkinson
 Just with different.

0:16:18.710 --> 0:16:19.520
 David Antwis
 Yeah, yeah.

0:16:15.880 --> 0:16:22.530
 Steve Tomkinson
 Voices because they're different people, you know? And and this is exactly the same scenario. You end up with.

0:16:23.770 --> 0:16:41.490
 Steve Tomkinson
 the exact same service every single time, same quality, never a bad day, and also really slick, you know, so efficient because it's all connected up. You know, it's connected to all the data and things like that. So it works really well.

0:16:42.810 --> 0:17:11.920
 Steve Tomkinson
 So I suppose really all I wanted to do was explain that, so that, you know, if you keep coming across all this jargon and glossary of terms, you kind of start getting to understand what those mean, but also what they mean in context. So how that builds up a conversation, outside of just going, "Right, yeah, it's just a bit of tech." Well, yeah, OK, to an extent. But it's, you know, it's a powerful tool that can be

0:17:12.690 --> 0:17:15.790
 Steve Tomkinson
 really, really well used. You know, if you know what you're doing.

0:17:16.460 --> 0:17:17.760
 Steve Tomkinson
 Which we do, fortunately.

0:17:18.660 --> 0:17:19.110
 David Antwis
 Yay.

0:17:19.760 --> 0:17:20.470
 Steve Tomkinson
 Yeah.

0:17:22.230 --> 0:17:31.350
 Steve Tomkinson
 Alright. Well, thanks for bearing with my monologue there, Dave. And, you know, I hope people found that useful. Anyway, it's.

0:17:32.670 --> 0:17:42.460
 Steve Tomkinson
 There's so much jargon around that I just felt it was necessary to explain that part, really, you know, and I chose this podcast to do it.

0:17:44.20 --> 0:17:58.890
 Steve Tomkinson
 If you've got any subjects that you want us to discuss, then, you know, please send those emails in and we'll be happy to pick up those subjects and have a chat about them. So thanks, everybody, and I'll speak to you next time.

0:17:59.480 --> 0:18:0.870
 David Antwis
 Yeah. Thank you. Goodbye.

0:18:0.940 --> 0:18:2.110
 Steve Tomkinson
 Cheers, Dave. Funny.