Infinite Machine Learning: Artificial Intelligence | Startups | Technology

Modifying Speech Accents In Real Time With AI

February 19, 2024 Prateek Joshi

Ofer Ronen is the CEO of Tomato.ai, an AI platform to soften speech accents of people as they speak. He was previously at Google where he built Contact Center AI products. Prior to this, he was the cofounder and CEO of Pulse.io and Sendori.

(00:26) Accent Modification in Real Time
(02:21) Training Data for AI Models
(04:14) Overview of Tomato AI
(05:36) Challenges in Modifying Speech Accents
(07:17) Role of Phonetics and Linguistics
(08:19) Handling Intonation and Subtleties in Different Languages
(09:36) Advances in AI for Speech Modification
(11:21) Addressing Societal Bias and Ethical Considerations
(13:08) Expanding Accent Modification to Different Accents
(14:20) Deployment on Mobile Devices and Computational Requirements
(16:02) Disclosure Policy for Accent Modification
(17:59) Future Applications of Speech Modification
(19:48) Ripe Areas for Innovation in Contact Center AI
(22:28) Impact of Generative AI on Speech Modification
(24:53) Demonstrating Success in Contact Center AI
(27:06) Future of AI and Speech Modification
(28:37) Rapid Fire Round

Ofer's favorite book: Life 3.0 (Author: Max Tegmark)

--------
Where to find Prateek Joshi:

Newsletter: https://prateekjoshi.substack.com 
Website: https://prateekj.com 
LinkedIn: https://www.linkedin.com/in/prateek-joshi-91047b19 
Twitter: https://twitter.com/prateekvjoshi 

Prateek Joshi (00:01.662)
Ofer, thank you so much for joining me today.

Ofer (00:04.794)
Hi Prateek, yeah, thanks so much for inviting me.

Prateek Joshi (00:07.929)
All right, let's dive straight into it. AI can modify a person's accent in real time, which is pretty amazing. If you go back a decade, it was a hard problem, but now it's happening. So can you explain how it works in practice?

Ofer (00:27.566)
So, yeah, at Tomato AI, we change someone's accent. We make it softer, is what we're calling it. The accented person still sounds like themselves, but not as heavily accented. To pull that off requires generating a voice in real time that sounds like the person, but not as heavily accented. And that's challenging.

It's hard enough to generate voices offline when there are no real-time constraints, but to pull this off, we have to do it in real time. Also, the field of what's called accent conversion has a pretty small community of researchers. There's a big community of people in speech AI who do transcription. There's a smaller community that does speech generation, and then there's an even smaller community that

tries to do that in real time, and an even smaller one that tries to do the accent work. And so we had to do a lot of our own research, new R&D, to address this issue. It's a speech-to-speech model, so we never go to text. There are three reasons why you don't go from voice to text and then back to voice, which is maybe the first way people might think of doing this.

Those reasons are: first, you don't want to introduce transcription errors. Second, latency could be too great if you're relying on transcription and speech generation in series. And third, you lose elements of what's called the prosody, or the melody, of the voice. If you can keep that, it really helps with naturalness. So that's a bit of a preview; there's a lot more to it.
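
To make the trade-off concrete, here is a minimal Python sketch of the two architectures Ofer contrasts. The asr, tts, and converter objects are hypothetical placeholders, not Tomato.ai's actual components; the comments mark where each drawback of the cascaded approach enters.

```python
# A minimal sketch contrasting the two architectures described above.
# All model objects here are hypothetical placeholders.

def cascaded_pipeline(audio_chunk, asr, tts):
    """Speech -> text -> speech: the 'first instinct' approach."""
    text = asr.transcribe(audio_chunk)  # (1) transcription errors enter here
    return tts.synthesize(text)         # (2) two models in series: latency adds up
                                        # (3) text is a lossy bottleneck: prosody is gone

def speech_to_speech(audio_chunk, converter):
    """Direct speech -> speech accent conversion: no text bottleneck."""
    # One model, one latency budget; speaker identity and prosody can be
    # carried through because the audio is never collapsed to text.
    return converter.convert(audio_chunk)
```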

Prateek Joshi (02:21.993)
That's amazing. And actually, that would be the first instinct when somebody is asked, hey, how do you think it happens? They're like, hey, you convert speech to text, and maybe you generate the voice and the words in a different accent. But that's a roundabout approach. So to do speech to speech, to build these AI models, where does the training data come from? And how is it used in this context?

Ofer (02:50.638)
So we've been at it now for a year and seven months, and we've raised $10 million, which helps, because there's a good amount of money you spend on data, but also a good amount on training. The data can be original calls with accented agents. We focused on Indian and Filipino accents initially, where there are a lot of offshore call centers.

And so we've gotten datasets of actual agents speaking. We've also had simulated calls recorded: we generate transcripts using ChatGPT and then have people act out the scenarios. And a third way is studio-quality voice actor recordings, which have

no background noise and are the highest quality of voice recording. We mix and match and use different data for different approaches. So yeah, we've done it all.
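
A rough sketch of how the three data sources Ofer lists might be tagged and mixed during training. The paths, weights, and sampling scheme are illustrative assumptions, not Tomato.ai's pipeline.

```python
# Hypothetical tagging of the three data sources and weighted sampling
# between them during training. Paths and weights are made up.

import random
from dataclasses import dataclass

@dataclass
class SpeechDataset:
    name: str
    path: str      # placeholder location
    weight: float  # sampling weight in the training mix

TRAINING_MIX = [
    SpeechDataset("real_agent_calls",    "data/calls/",  0.5),  # recorded production calls
    SpeechDataset("simulated_calls",     "data/sim/",    0.3),  # ChatGPT-scripted scenarios
    SpeechDataset("studio_voice_actors", "data/studio/", 0.2),  # clean, noise-free recordings
]

def sample_source() -> SpeechDataset:
    """Pick a dataset for the next training batch, weighted by mix ratio."""
    weights = [d.weight for d in TRAINING_MIX]
    return random.choices(TRAINING_MIX, weights=weights, k=1)[0]
```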

Prateek Joshi (04:04.109)
Right, and maybe it's a good stopping point to quickly talk about Tomato AI. You're the founder of the company. Can you quickly explain what it does and for whom?

Ofer (04:15.894)
Yeah, so Tomato AI is a company that has been around for a year and seven months, and we've raised $10 million. The goal is to help when you're calling support or getting a sales call from an agent with an accent. First, unfortunately, people have biases against people with accents, so on a sales call you might hang up,

even though it might be something valuable. But there's also the issue of understanding what's being said. When someone pronounces words in a way you're not used to, it's hard to keep up with the conversation. So our solution is like a magic button that makes someone who's heavily accented sound more like what you're expecting to hear. We can Americanize, for example, an Indian accent or a Filipino accent.

Our goal, though, is not to completely replace the original accent, just to turn it down a bit. So we call it softening, accent softening, and we keep the person's voice and even their prosody. Only the accent is modified a bit, and that goes a long way.

Prateek Joshi (05:36.737)
Right. And if you look at the tech stack to make something like this happen, what is the biggest challenge when it comes to accurately modifying speech accents in real time? Is it the AI models? Is it understanding the nuances of various languages? Is it keeping the speaker's intonation intact? What is the biggest challenge here?

Ofer (05:46.455)
Mm-hmm.

Ofer (06:04.39)
Yeah, you know, there are different dimensions that you care about, and we run what are called MOS scores, mean opinion scores. It's a rating of one to five across different dimensions. So we might have MTurkers, Mechanical Turk workers, rate the original versus the enhanced speech for naturalness, for accent level, accentedness,

Ofer (06:34.438)
and for acoustic quality, like, are there clicks and noises introduced by the speech generation? There are different things like that we measure, and that's how we know we're getting better with our models. Beyond that, you also have to be very careful to create a low-latency solution. The latency needs to be under one second for sure, and the less

added latency there is, the bigger the market opportunity. It's a continuum: the further under a second the latency is, the more broadly this can be used in the market.
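
For readers unfamiliar with MOS evaluation, here is a minimal sketch of how crowd ratings on the dimensions Ofer mentions roll up into a mean opinion score per model version. All ratings below are invented for illustration.

```python
# A minimal sketch of MOS-style evaluation: raters score clips 1-5 on
# each dimension, and you track the mean (with a rough confidence
# interval) for every model version.

import statistics

ratings = {
    # dimension -> scores from hypothetical Mechanical Turk raters
    "naturalness":      [4, 5, 3, 4, 4, 5, 4],
    "accentedness":     [2, 3, 2, 2, 3, 2, 2],  # lower = softer accent
    "acoustic_quality": [4, 4, 5, 4, 3, 4, 4],  # clicks/artifacts penalized
}

for dim, scores in ratings.items():
    mos = statistics.mean(scores)
    sem = statistics.stdev(scores) / len(scores) ** 0.5  # standard error of the mean
    print(f"{dim:>16}: MOS = {mos:.2f} +/- {1.96 * sem:.2f}")
```

Comparing these per-dimension means between the original and enhanced audio is how a team would know a new model version is actually better, not just different.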

Prateek Joshi (07:17.273)
And for an AI system like this, what role do things like phonetics and linguistics play? Do you need that, or does an AI system not need it as much as humans do?

Ofer (07:33.69)
Yeah, the work is done at the phoneme level and we've definitely engaged linguists from the Philippines and India at times for various things. So it is a part of it and it has to do with certain approaches to enhance the output. And so yes, it is a factor that matters here.

Prateek Joshi (07:56.921)
Right. And earlier you talked about retaining the speaker's intonation and how they talk, right? And obviously you soften the accent in real time. So how do current AI models handle these subtleties like intonation and stress and pitch and rhythm in different languages?

Ofer (08:16.524)
Enough.

Ofer (08:20.418)
So, yeah, each language has its unique prosody or rhythm, and that can make it hard to follow if it's not what you're used to. Orthogonal to that is the accent, how the words are pronounced, and you can play with those different dimensions. But if you can carry forward the prosody,

you retain some of the individuality; the person still sounds largely like themselves. And so there's some art that goes into this work. This is not all science, and you're trying to find the right balance of what to change to maximize trust and understanding, those two dimensions, across as many calls and as many accented individuals as possible.

Prateek Joshi (09:15.125)
Right, right. And if you look at the developments in AI that have enabled this to happen, maybe if you look at the last two to three years, what advances in AI have been pivotal in making something like this happen?

Ofer (09:36.014)
So we rely heavily on the transformer architecture. We two co-founders came from Google, and so we have that in our roots. Each of us sold startups to Google; I sold two, my co-founder sold one. But anyway, the transformer, but also RNN architectures, a mix of those. Those

newer developments are really helpful. Then, in the speech space, there's a lot of research on how to generate voices that sound natural. We mix various things out there and add our own research on top. Two years ago, it would have been tougher. We also rely on the latest and greatest hardware out of

NVIDIA, and that helps accelerate the work as well. The strength of a startup is how quickly you can run experiments and iterate, and if you can run experiments on the latest hardware, it really makes it possible to move faster as a startup and burn through less money.
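
As a loose illustration of the transformer-plus-RNN mix Ofer alludes to, here is a short PyTorch sketch. This is not Tomato.ai's architecture; every dimension, layer count, and design choice below is an assumption made purely for illustration.

```python
# A hypothetical hybrid encoder over mel-spectrogram frames: transformer
# layers for context within a chunk, a GRU for streaming-friendly state.

import torch
import torch.nn as nn

class HybridSpeechEncoder(nn.Module):
    def __init__(self, n_mels: int = 80, d_model: int = 256):
        super().__init__()
        self.proj = nn.Linear(n_mels, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, time, n_mels) -> (batch, time, d_model)
        x = self.proj(mel)
        x = self.transformer(x)  # attention over the current audio chunk
        x, _ = self.rnn(x)       # recurrent state suits low-latency streaming
        return x
```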

Prateek Joshi (10:52.909)
Right. Earlier, you talked about societal bias when it comes to accents. It's just part of life; people in different regions talk in different ways. And the good side of doing this is that customer support will be that much better, because you're calling a company to solve your problem, and if you can understand the other person easily, then everyone wins.

On the other side, some people, a small group, might complain that modifying accents might reinforce that societal bias. So how do you address this question? What are your views on it?

Ofer (11:36.306)
Yeah, this is a really important point. So first of all, as you said, both sides of the conversation benefit. The customer, their internet's out, they're freaking out and they're calling. On top of being already triggered, they can't understand what is being said,

and they come into it with all their biases. You can turn all that down by making the voice of the speaker, in this case with a foreign accent, easier to follow. So the customer benefits, and the agent as well. They have this tough job of getting all these frustrated people on the call, there's abuse, and they have to repeat themselves. This reduces all that.

Now, we're very sensitive to cultural aspects. For that reason, we've shifted away from the terms accent neutralization or conversion, terms that precede us in the industry and that we think are harsh. Instead, it's accent softening: keeping the voice of the speaker, their identity, their cultural identity, just not as heavily accented. It's also in line with current practices, where call centers

have courses where agents can learn how to pronounce words. It's in line with that, and it's a gentler approach.

Prateek Joshi (13:08.101)
Right. And that's a fair point. And also does it work in multiple directions? Meaning if the customer is in Australia and the support agent is in the US, for example, does it modify the American accent to the Australian accent?

Ofer (13:21.835)
Yeah.

Ofer (13:25.538)
Yeah, just last night I was at a VC networking event talking to a Nextdoor Dash employee who had exactly that. He was managing Australia, and they were using Indian agents, and you could imagine that both sides would struggle with each other's accent. That's what we're going to do eventually. Right now we're starting with softening the agent side of the conversation; it's a more fixed set of accents.

Prateek Joshi (13:38.361)
Heh.

Ofer (13:52.594)
And as we support more accents, we'll open it up to the other side.

Prateek Joshi (13:56.769)
Yeah, I think it's a good way to start any product: you pick a narrow, specific problem, you solve that, and then it can go in all directions. Right. Now, moving to more applications, maybe I want to use it on my mobile phone locally. I don't want latency, so I want it to happen right here. So what are the computational requirements

Ofer (14:20.884)
Yeah.

Prateek Joshi (14:25.017)
for deploying something like this on mobile devices? Is there a trade-off in accuracy or latency that we may have to live with? Or is it going to be solved anytime soon? What do you think?

Ofer (14:38.347)
So our initial use case is agents in call centers, and we want to have the most powerful model that gives all these capabilities. Our solution is zero-shot: there's no pre-training, it works out of the gate on any new voice, keeps your voice, softens it. There's a lot we're packing in. And to make that work, we host it near the agent, but in the cloud.

It's a measured average round trip of 30 milliseconds in the markets we're targeting, so minimal network latency. We're not yet at a point where we're working on miniaturizing models to fit them on device, and you would have to make compromises to do that. You'd probably have to make everyone sound like a voice actor; everyone would sound the same.

And those are the kinds of compromises many people would not go for. So I would say over time, as we develop this further, we could do it on device, but not yet.

Prateek Joshi (15:46.294)
And if a company is using this for their agents to make it better and nicer for customers, what should their disclosure policy be? Like do customers need to know? Would they care? What's your view on this?

Ofer (16:02.134)
Yeah, it's a fair question. I think that as we have more and more agents using it, we'll get a sense for how organic it is and whether it comes up at all. When it works perfectly, it's seamless and you can just have the conversation, and people just appreciate it. So I think there are some learnings to be had around,

oh yeah, it might sound a little bit off, and this is why, but it's better with this on. We hope agents won't have to make those disclaimers, but we'll learn as we go what's appropriate. I don't know that we'll enforce anything specific around it; I think each company can decide what makes sense for their use case.

Prateek Joshi (16:53.813)
Right. Because as the customer, if you contact a big airline and you're talking to a chatbot, you need to know, because you'll set your expectations appropriately. But this is not that. It's just AI-enhanced humans. So it's an interesting thing. Yeah.

Ofer (17:04.085)
Yes.

Ofer (17:12.21)
Yeah, I think the best practices there will evolve over time. I was at Google when the issue of self-identifying whether you're a bot or not came up, because Duplex had an impressive demo. And Google, rightfully so, fell on the side of self-identifying as a virtual agent.

Prateek Joshi (17:36.153)
And going beyond accent modification to speech modification in general, and you're talking to a lot of companies. They have to work with a lot of speech in their business. How else can AI modify speech? Or rather, what other use cases would be interesting to your potential customers right now?

Ofer (17:59.31)
So what we imagine is a world where you A/B test continuously: what's the best voice to use in every conversation to get the best outcome? Eventually, we believe this will be table stakes. You wouldn't dare make a call without optimizing the voice of the speaker, in sales situations for sure, but also in support situations.

Ofer (18:30.806)
As an anecdote, for example, they found that men from Texas respond really well to a woman's voice from the Northeast. You'd think they would respond to a Texan voice, but certain pairings and matchings keep people longer on the call, and that's the goal. So you could tap into that and do it in a scalable way.
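
One way to implement the continuous A/B testing Ofer imagines is a simple bandit over voice profiles. This sketch is hypothetical: the profile names, the reward signal (caller stayed on the line), and the epsilon-greedy strategy are all stand-ins for whatever a production optimizer would use.

```python
# A hypothetical sketch of continuous voice A/B testing as an
# epsilon-greedy bandit over voice profiles.

import random

profiles = ["northeast_female", "texas_male", "midwest_neutral"]
stats = {p: {"calls": 0, "wins": 0} for p in profiles}

def choose_profile(epsilon: float = 0.1) -> str:
    """Mostly exploit the best-performing voice, sometimes explore."""
    if random.random() < epsilon or all(s["calls"] == 0 for s in stats.values()):
        return random.choice(profiles)
    return max(profiles, key=lambda p: stats[p]["wins"] / max(stats[p]["calls"], 1))

def record_outcome(profile: str, stayed_on_call: bool) -> None:
    stats[profile]["calls"] += 1
    stats[profile]["wins"] += int(stayed_on_call)
```

Over many calls, traffic drifts toward whichever pairing keeps callers engaged, which is exactly the Texas-caller, Northeast-voice effect described above, discovered automatically rather than by anecdote.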

Prateek Joshi (18:54.041)
That is actually very interesting. You could almost use these signals to route a particular customer to the best agent because your goal is to resolve it as quickly as possible and leave them as happy as possible. So we can pick the right combo. Like as you said, for some reason the data shows someone from Texas responds well to this particular accent. So yeah, let's have that.

Ofer (19:16.742)
Yeah. So that already exists, smart routing to whoever can solve the problem best. But the next level is that anyone can be optimized to sound the way that would work best for that caller, and that's using AI. So AI would augment any speaker's voice. Yeah.

Prateek Joshi (19:32.649)
Yeah, yeah, yeah.

Yeah.

Yeah, exactly. So if you know that a Northeast accent works well, then anybody can just talk and it'll be AI-enhanced to sound like a Northeast person. Right. Amazing. And for startups who are diving into contact center AI,

Ofer (19:48.074)
Yeah, it'd be seamless to the speaker. They don't need to know all the optimization happening in the background. Yeah.

Prateek Joshi (20:02.581)
what areas are ripe for innovation right now? Outside of speech, obviously there's a lot happening in speech, what else do customers need?

Ofer (20:06.062)
Yeah.

Ofer (20:10.706)
Yeah, so this is my third contact center startup. I did analytics for bots at Google; we had a million bots we did analytics for. Then I built a way of creating virtual agents from data: we modeled millions of past conversations between humans to create more robust chat and voice virtual agents. And then I also worked on agent assist, using the same intents and work you do with virtual agents.

You can repurpose that for agent assist, to help the agent on a turn-by-turn basis. Generally speaking, there are these categories of AI-infused solutions, and for anyone starting to implement AI in the contact center, my recommendation is to go for the quick wins first to build up momentum and trust, and then get more budget and more time for the more complex solutions. The quick wins are things like

call summaries, where ChatGPT-type technology, the latest and greatest LLMs, is increasingly doing pretty well out of the box at summarizing the call without a lot of customization. Similarly with speech AI work, the kind that we're doing: not a lot of customization. Any new voice with an Indian accent or Filipino accent can then be optimized, in our case, toward less accent.

So once you get the quick wins under your belt, you can move on to the more complex and time-consuming virtual agent and agent assist projects. There's lots of complexity there: APIs you have to build that respond in real time, not just the AI. There are other aspects of those projects that make them hard. So yeah, that's a rough roadmap.
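
As a sketch of the call-summary "quick win," here is a minimal example assuming the OpenAI Python client; the model name and prompt are placeholders, and any LLM vendor would work similarly.

```python
# A minimal sketch of out-of-the-box call summarization with an LLM.
# Assumes the OpenAI Python client and an OPENAI_API_KEY in the
# environment; model name and prompt are illustrative placeholders.

from openai import OpenAI

client = OpenAI()

def summarize_call(transcript: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "Summarize this support call: the issue, the "
                        "resolution, and any follow-ups. Three bullets max."},
            {"role": "user", "content": transcript},
        ],
    )
    return response.choices[0].message.content
```

The point Ofer makes is that this class of task needs almost no customization, which is why it belongs early in the roadmap, before the real-time virtual agent work.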

Prateek Joshi (21:59.237)
All right, and as AI models get better at understanding speech, converting that to text, and also generating responses, do you see that as a threat to your business because you're relying on human agents doing all the talking, and the more they talk, the more business you'll have? So how do you see this generative AI trend impacting the future of speech modification?

Ofer (22:28.418)
So I think I have a bit of a contrarian view here. A lot of people think AI is going to cause all this automation and humans will be out of a job. But I see the major opportunity with AI as being a superpower for humans. I see agents becoming that much more productive, but also more agents coming online. So for example,

in our case, and this is related to the future-of-work thread that a lot of people talk about, all these people in all corners of the earth who are capable but hard to understand can all of a sudden come online if AI can make them more intelligible. And then agents can have AI accelerate the work they do, give them answers, all those things at their fingertips.

So I see human-in-the-loop and human superpowers as the more immediate opportunity with AI. Now, automation, I've been there. I've done some of the largest call center automation projects; one was 200 million calls a year, another 100 million calls a year. And what I've learned is that it's so hard to make gains on automation, and it's because of unintuitive things.

When people call major companies, they often press 0000; they don't even give you a chance to respond. And they want to tell stories, they don't know how to describe their problem, they want to negotiate a deal, they want to talk to someone. So you have to overcome habits. It's not only about great AI, it's also about how you change people's habits and get them to even give you a chance as an automation tool.

Prateek Joshi (24:25.525)
Right. Actually, that's a good point. And as a technology that sells to contact centers, how do you demonstrate success? Meaning, in a very quantitative way, what does the customer need to see after three months of operation? Is it time saved? Is it happiness? What do you measure to show success?

Ofer (24:53.346)
So there are two classes of agents that we help: the sales agents, we call them outbound, and the inbound agents that do support. Each of those has different metrics that you can measure the lift for in a pretty short time. If we start with the sales agents, it's revenue and close rates. A lot of times, the offshore agents that we work with

are getting warm leads and transferring them onshore for closing. So are you getting more of those? Sometimes they even measure how long you can keep the person on the phone; if you can keep them on for 15 seconds or more, that matters. So that's sales. If today you have people, let's say in Pakistan, calling and 98% hang up, and we can make it 97%, that's a 50% bump in their business.

Okay, so that's the sales side. On the support side, the metrics that we've seen matter are CSAT and NPS, like how happy are customers? Also handle time: how long are the calls, and are they getting shorter once intelligibility goes up? And the last one is callbacks. Are there fewer callbacks? Callbacks are very expensive, when someone calls you back about the same thing, so you try to limit those.
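
The "50% bump" arithmetic is worth spelling out: cutting hang-ups from 98% to 97% moves connected calls from 2% to 3%, a 50% relative lift. A few lines of Python make the calculation explicit; the numbers are Ofer's illustrative ones.

```python
# Why a one-point drop in hang-ups is a 50% lift in connected calls.

def relative_lift(hangup_before: float, hangup_after: float) -> float:
    connected_before = 1 - hangup_before  # 1 - 0.98 = 0.02
    connected_after = 1 - hangup_after    # 1 - 0.97 = 0.03
    return (connected_after - connected_before) / connected_before

print(f"{relative_lift(0.98, 0.97):.0%}")  # -> 50%
```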

Prateek Joshi (26:19.249)
It's actually good to have these metrics, because a good number of listeners are aspiring founders or are already building a startup. And it's good to know that these things matter, meaning you have to show how your product is making your customer's life better in a very objective way. It's not enough that they're feeling good about it, right? You have to show the numbers.

Obviously, it's not your first rodeo, so you've done the homework here. This is really good. All right, I have one last question before we go to the rapid fire round, and it's about what's coming next. When it comes to AI and speech in the next 12 months, what is going to happen?

Ofer (27:06.526)
Yeah, so one of the places where we want to go is to optimize not just the way the voice sounds so the listener responds best, but also the words, so you could swap out words and localize them. In India they say "needful."

We don't know what that means here; "much needed," let's say, I think is what it means. What if you could swap it out when that's said? And beyond that, what if you could also adjust grammar? The combination of adjusting the language and the voice heard is powerful, and that gets even more of the workforce in all corners of the world online.
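
A toy sketch of the word-localization idea: swap region-specific phrases for equivalents the listener expects. The phrase table is a made-up example; a real system would need context-aware rewriting rather than regex substitution.

```python
# A toy phrase-localization pass over an utterance. The mapping and the
# example sentence are illustrative, not a real product's lexicon.

import re

INDIAN_TO_US = {
    r"\bdo the needful\b": "do what's needed",
    r"\bprepone\b": "move up",
    r"\brevert\b(?! to)": "reply",
}

def localize(utterance: str) -> str:
    for pattern, replacement in INDIAN_TO_US.items():
        utterance = re.sub(pattern, replacement, utterance, flags=re.IGNORECASE)
    return utterance

print(localize("Please do the needful and revert by Monday."))
# -> "Please do what's needed and reply by Monday."
```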

Prateek Joshi (28:00.229)
Right, right. Actually, that could be a great next step: not just modifying the accent, but the speech itself, to make it more intelligible. And yeah, you're right, different people express the same idea using different words, which make no sense outside of that geography. Because I've sampled both

India and the US, I can see how some phrases make no sense in the other land, and vice versa. So it's very interesting. All right, with that, we are at the rapid fire round. I'll ask a series of questions and would love to hear your answers in 15 seconds or less. You ready? All right, question number one: what's your favorite book?

Ofer (28:37.07)
I'm ready.

Ofer (28:41.614)
So recently I've been enjoying the book Life 3.0 by Max Tegmark. It really makes you think about where we're going with AI. As humans, we can develop what he's calling our software, our brains, our ideas. But that's Life 2.0. Life 3.0 is where you can also develop your body, your physical presence, and enhance that over time.

And so there'll be beings that continuously improve not just their software, but their hardware as well.

Prateek Joshi (29:17.101)
Amazing. Next question. What has been an important but overlooked AI trend in the last 12 months?

Ofer (29:25.89)
So I'm an investor in a company called Figure AI. They're building a robot that's going to be running operations inside warehouses. There's all this need, for example, in the fourth quarter, to have more people on the ground in the warehouses, and there's also an issue of

performance; you can get so much more out of a robot doing those jobs. That, combined with the latest LLM models, will be really powerful, and it's not too far away. These robots are pretty impressive, and Figure AI is just one of a few companies like that.

Prateek Joshi (30:08.349)
Yeah, yeah, I mean, they're doing some amazing things. All right, next question. What's the one thing about speech modification that most people don't get?

Ofer (30:21.462)
So I think one thing that people don't realize is how much pain we each feel every day. With COVID, we got to be more online, more remote, working with more people around the world. And we just suffer through trying to understand all those different ways people speak, the accents and all that. But we're not even fully aware that we're suffering so much.

Once this is solved, it'd be hard to go back to the days when we were suffering so much. So yeah, I'm really excited to get this out there and get more people benefiting in that sense.

Prateek Joshi (31:04.758)
What separates great AI products from the good ones?

Ofer (31:10.594)
This is where I think it doesn't matter that it's AI. What separates a great product from another is how enthusiastic people are and whether word of mouth is working. That's a good litmus test: is every person telling two people, and so on, creating a viral effect? It doesn't matter if it's AI or not, really. Don't get confused. It's about

how enthusiastic people are.

Prateek Joshi (31:42.977)
Next question, what have you changed your mind on recently?

Ofer (31:49.506)
So I heard that cold plunges can be good for longevity and also work better than coffee in terms of waking you up, but I wasn't quite ready to do it. In the last two weeks, though, I've been doing cold plunges in my pool, 53 degrees, and I can attest that it's great. It really gives you a great feeling in your body for many hours.

Prateek Joshi (32:11.452)
Um...

Prateek Joshi (32:17.669)
Amazing. All right, next question. What is your wildest AI prediction for the next 12 months?

Ofer (32:27.662)
So I believe that in the next 12 months, there'll be this acceleration of the workforce coming online that otherwise wasn't there. All of a sudden, there'll be millions of people available for work at lower price points, because of the law of supply and demand. We're not even seeing it yet, but it's going to happen. And I'm not just talking about the work we do; there's a slew of AI companies enabling this.

For example, I'm now working with someone in India whose English is pretty poor, but I've asked him to run everything through ChatGPT before sending it to me. So we can work better now than we would have otherwise.

Prateek Joshi (33:12.077)
Right, our final question. What's your number one advice to founders who are starting out today?

Ofer (33:21.006)
So my number one advice is that, like anything else, it helps to have a world-class coach. If there's anything you're trying to figure out, the slow way is to learn on your own and make mistakes on your own. The fast way, the shortcut, is to find someone that's done it, get close to them, and get their help accelerating.

Prateek Joshi (33:45.957)
Amazing, Ofer. This has been a fantastic episode. Obviously, you're a repeat founder, so many learnings here, and it's always fun to see what we can take from your previous startups and what new things you're learning. So thank you so much for coming on the show and sharing your insights.

Ofer (34:07.847)
Yeah, thank you for having me. Happy to be here.