Infinite Machine Learning: Artificial Intelligence | Startups | Technology

Voice AI Agents

February 12, 2024 Prateek Joshi

Itamar Arel is the CEO of Tenyx, where they are building intelligent voice AI agents. He previously founded Apprente and Binatix, and was a visiting associate professor at Stanford.

(00:22) State of Play in Voice AI
(02:21) Modern Voice Agents
(04:01) Building Voice AI Models
(05:48) The Role of Developers
(08:28) Introduction to Tenyx
(09:52) Innovative Use Cases of Voice AI
(15:20) Challenges in Voice AI
(19:18) Scaling Voice AI Across Languages
(21:45) Combating Misinformation in Voice AI
(24:17) Voice AI and Human Creativity
(26:49) Sensitive Use Cases of Voice AI
(28:48) Voice AI in the Military
(30:32) AI-Generated Voice and Disclosure
(32:28) Future of Voice AI
(34:04) Rapid Fire Round

Itamar's favorite book: Consciousness and the Brain (Author: Stanislas Dehaene)

--------
Where to find Prateek Joshi:

Newsletter: https://prateekjoshi.substack.com 
Website: https://prateekj.com 
LinkedIn: https://www.linkedin.com/in/prateek-joshi-91047b19 
Twitter: https://twitter.com/prateekvjoshi 

Prateek Joshi (00:01.465)
Itamar, thank you so much for joining me today.

Itamar Arel (00:04.354)
Thanks for having me.

Prateek Joshi (00:06.201)
Let's get right into it. Voice AI, so much is happening in this field. To set the baseline, can you talk about the state of play in voice AI? And also, what can voice agents do well today?

Itamar Arel (00:24.042)
Yeah, it's a great question. So first and foremost, I think what we need to appreciate is that we speak very differently than we write. When we speak in a natural setting, we use poor grammar and broken English and ums and ahs. Yet we humans understand each other perfectly. In fact, six- or seven-year-olds can do that almost seamlessly. It's been traditionally challenging to build machines that are as robust, right? But the

revolution, I would say, in machine learning in general, not just the introduction of large language models, but the advances in speech recognition and text-to-speech technology, has finally paved the way to build agents that are human-like, that really create the kind of conversational experience we have with humans, and that's pretty exciting as a space. Right now, on the enterprise side, the traditional solution is still based on something called IVRs.

These are somewhat outdated solutions that you're familiar with if you call your favorite airline or shipping company. Historically, they tend to be brittle and very limited in understanding free language, particularly when it's natural speech, which, again, has poor grammar, broken English, and things of that nature. So it's exciting. I think we're at an inflection point, where the next few years are going to

see the introduction of really robust and human-like voice AI agents in the marketplace.

Prateek Joshi (01:53.821)
Yeah. And historically, the average customer hasn't had a great experience calling an airline: you're routed to an automated voice agent and it usually doesn't work. It's broken. You have to wait to get to a real person. But now that's changing. So, in simple terms, can you explain how a modern voice agent works in 2024?

Itamar Arel (02:22.462)
Yeah, so first, to your comment, you're absolutely right. The current solutions have been around more or less for 20 years. I mean, they've advanced, but these IVRs have been around for 20 years and still result in 80 to 90 percent of people, sometimes early in the conversation, asking to talk to a person, to escalate to a representative. And I'm sure we can all relate to that experience. That's about to change. We really do feel that in the not-so-distant future it may be

flipped around: 80 or 90 percent of the calls will be pleasantly handled by a machine, and maybe 10 to 20 percent of the calls, the ones that tend to be more complicated, this long tail of conversations, will require human handling. Like I said, the bottom line is that current solutions really didn't deliver, certainly not on the cost-reduction value proposition, if 90 percent of the people escalate immediately or fairly quickly to a person.

And maybe before I directly answer your question: there are other benefits beyond cost reduction, like just improving the customer experience. You may want to ask ten things, which human agents can understandably get annoyed by; machines, of course, don't have that problem. You, as the service provider, may want to do some A/B testing to try different responses to different questions, or run advanced analytics on the conversations. Again, these are all things that are fairly challenging to do with humans

and far easier to do with voice AI, with machines. So, you know, that's sort of where things are, that's the goal. I'm sorry, I forgot what you were asking about.

Prateek Joshi (04:00.267)
Oh, how does it work in practice?

Itamar Arel (04:01.87)
So how does it work? The modern pipelines have basically three fundamental components. One, of course, is the speech-to-text or ASR, the speech recognition side of the system, which takes audio in and fundamentally transcribes it, though modern solutions tend to add additional signals beyond the transcription. You may want to know, for example, the pitch or the speed of speech, or anything that reveals something about the emotional

state of the speaker. That can be very informative in responding appropriately to the customer. Then that information is passed to a natural language understanding subsystem which, coarsely speaking, is charged with extracting the intent and maybe the slots of information that the customer is conveying or asking about. And then, finally, there's a natural

Itamar Arel (05:01.334)
language generation component, deciding how to respond to the customer after understanding what he or she was saying. And at the end, of course, there's the text-to-speech synthesis, the text-to-speech engine, which, interestingly enough, is also becoming a very crowded space, with very human-like speech synthesis systems that are sometimes difficult to discern from real humans. So, at a very high level, that's really the modern architecture.
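
To make the three stages concrete, here is a minimal sketch of a single turn through the pipeline described above. The dataclass and function bodies are illustrative stubs, not any vendor's actual API; a production system would stream audio and run the stages concurrently to keep latency down.

```python
from dataclasses import dataclass

@dataclass
class AsrResult:
    text: str             # the transcription itself
    pitch_hz: float       # extra signal: average pitch
    words_per_sec: float  # extra signal: rough speaking rate

def transcribe(audio: bytes) -> AsrResult:
    # Stub: a real system calls a streaming ASR service here.
    return AsrResult("i'd like to change my flight", 180.0, 2.5)

def decide_response(asr: AsrResult, history: list[str]) -> str:
    # Stub for NLU + dialog policy: extract intent/slots, pick a reply.
    return "Sure, I can help with that. What's your confirmation number?"

def synthesize(text: str) -> bytes:
    # Stub: a real system calls a TTS engine here.
    return text.encode("utf-8")

def handle_turn(audio: bytes, history: list[str]) -> bytes:
    asr = transcribe(audio)                # 1. speech -> text (+ signals)
    reply = decide_response(asr, history)  # 2. understand, decide what to say
    history += [asr.text, reply]
    return synthesize(reply)               # 3. text -> speech
```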

Prateek Joshi (05:30.149)
Right. And if you're a developer of such a technology, how do you build a model from scratch? Just assume you're building a model: you have to collect the data, build the model, train it. So how does a developer do it?

Itamar Arel (05:50.006)
Yeah, that's an interesting question, because nowadays, with LLMs and open source large language models being so common and high-performing, to be honest, it's tempting to think, you know, maybe I can just get a speech recognition system, an ASR, and a TTS, and slap on Mistral or whatever your favorite open source large language model is, and maybe with some quick prompt engineering I can have an agent that can book

hotels or flights and so forth. The reality is, and that's really the answer to your question, it's far more complex than that. Partly that has to do with voice being, again, very different from the way we write or text. And obviously there are latency constraints: with chat, with text, maybe you can sometimes wait a few seconds. With voice, of course, that's not really an option. You have to be below some low latency threshold, maybe a second or so, to feel

responsive and natural. You need to appropriately determine when the customer paused because they're done saying what they had to say, and now you should process it and respond, versus when they paused mid-thought because they were looking something up or thinking about how to complete the response, which could take a few seconds. Again, it's almost an AGI-complete problem. It's the kind of thing a six- or seven-year-old naturally knows how to do;

mimicking that in machines has been challenging. So there are all these challenges having to do with building good real-world models. Of course, the other side of it is that these solutions have to adhere to strict business logic and rules. They cannot deviate, or talk about the competition, or talk about things that are truly outside the scope of the conversation. So things like hallucinations, the usual critique leveled at large language models,

are a real concern in this context. It's one thing for ChatGPT to help you write Twinkle, Twinkle, Little Star in the style of Shakespeare and do that really well. But if you're building a customer service agent that really does need to provide service, not be offensive or otherwise harmful, of course, but also not offer a restaurant or a flight that doesn't exist, then making these things super reliable so they stick to the business logic and rules

Itamar Arel (08:08.958)
is really critical. And that's all to say that there's a lot that has to go into building these systems in the real world, in a real business setting, if that makes sense.
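
As a toy illustration of the endpoint prediction problem described above, here is a hedged sketch that combines silence duration with a crude check of whether the utterance sounds finished. Real endpointers are learned acoustic-plus-semantic models; the filler-word list and millisecond thresholds here are invented for illustration.

```python
# Words that, when trailing, suggest the speaker is mid-thought (illustrative).
FILLER_WORDS = {"um", "uh", "so", "and", "but", "because", "is", "to"}

def looks_complete(transcript: str) -> bool:
    words = transcript.lower().rstrip(" .?!").split()
    # A trailing filler or conjunction suggests the speaker isn't finished.
    return bool(words) and words[-1] not in FILLER_WORDS

def should_respond(silence_ms: float, transcript: str) -> bool:
    if looks_complete(transcript):
        return silence_ms > 500   # respond quickly after a complete utterance
    return silence_ms > 2500      # tolerate long pauses mid-thought

print(should_respond(800, "I'd like to change my flight"))  # True
print(should_respond(800, "My confirmation number is"))     # False: still talking
```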

Prateek Joshi (08:18.597)
Right. And I think it's a good stopping point to talk about Tenyx. You're the founder of the company. Can you explain what Tenyx does?

Itamar Arel (08:30.122)
Yeah, and maybe to do that, I want to give a 30-second background on myself, because my experience ties into Tenyx. The core team of Tenyx, myself included, were previously the founding team of a company called Apprente, which was building voice AI agents to automate the order-taking process in drive-throughs. So think Starbucks, McDonald's, and all these other chains we know and love. We actually ended up being acquired by McDonald's Corporation in October 2019 to build this solution

for the drive-through, which was a wonderful ride. But all of that was a short number of years ago and predated large language models. It's amazing to think about what you had to do in order to get to 80, 90% automation before these more modern engines were available. And that was really the trigger for Tenyx. We've always been

big believers that voice is the most natural way to communicate. As long as you can build machines that can understand us robustly, again, with the poor grammar and broken English and so forth, then the opportunity is huge on the enterprise side, particularly call centers or contact centers, but even beyond that. So I started Tenyx about two years ago, as the large language model revolution was about to happen. This was before ChatGPT, or at least before it became almost a household term.

But yeah, we felt that the pieces of the puzzle on the technology side were ready to build customer service automation solutions that can address many, many verticals, not just food ordering, but hotel reservations, flights, financial services, and so forth. There's finally a path to build an architecture that can be applied to many verticals effectively, and of course we found that very exciting.

And so we started building. You know, we put together the team, raised some money, and then basically began building the solution, which, as I alluded to, involves things like endpoint prediction, mitigating hallucinations and out-of-context conversations, as well as fine-tuning. Maybe that's worth mentioning, just a word or two. Fine-tuning is a way of taking large language models and making them slightly better, more customized to a particular domain. Which in

Itamar Arel (10:51.522)
the enterprise use case makes a lot of sense, right? So you take an open source, very large language model, and then you also gather maybe 300, 400 prototypical conversations in some domain, say a hotel reservation agent, just as a classical customer service use case, and you fine-tune, you continue to train the model, in a sense, in an effort to have it be better in that domain. But of course, what

people realize, as we did over a year and a half ago when we started building our first prototype, is that fine-tuning using the straightforward or conventional methods doesn't come for free. As you get better in the domain of interest, you tend to see these forgetting effects, these distortions in other, sometimes related, areas. So knowledge is being lost; suddenly some reasoning capabilities that were there get messed up. It's a little bit of a game of whack-a-mole, because you're changing the weights of the network and it's not

clear what you changed that caused the loss of some capability that was critical before. And so we have a research team here, and we took that on as a challenge, because we realized that fine-tuning that mitigates forgetting is really critical. We developed this technology, this new mathematical approach, really translated into a set of algorithms, that does exactly that: it allows you to fine-tune in a specific domain while making strict guarantees that

99.99% of the LLM magic that was there before is retained, so no loss of knowledge or reasoning and so forth. The other interesting piece is that we even saw a deterioration of RLHF protection: you fine-tune in a specific, benign domain, and suddenly you may get responses that sound harmful or biased or racist, which of course is unacceptable in any setting, let alone a business setting.
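
Tenyx's forgetting-mitigation algorithms are proprietary and not described in detail here, so the following is only a sketch of a well-known baseline for the same problem: fine-tune on domain data while adding a KL-divergence penalty that anchors the model's token distributions to a frozen copy of the base model. The model name, learning rate, and kl_weight are illustrative assumptions, not Tenyx's settings.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"  # illustrative; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
tok.pad_token = tok.pad_token or tok.eos_token

model = AutoModelForCausalLM.from_pretrained(model_name)       # gets updated
ref = AutoModelForCausalLM.from_pretrained(model_name).eval()  # frozen anchor
for p in ref.parameters():
    p.requires_grad_(False)

opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
kl_weight = 0.1  # strength of the anchor; an illustrative assumption

def train_step(domain_texts: list[str]) -> float:
    batch = tok(domain_texts, return_tensors="pt", padding=True, truncation=True)
    out = model(**batch, labels=batch["input_ids"])  # task loss on domain data
    with torch.no_grad():
        ref_logits = ref(**batch).logits
    # Penalize divergence from the base model's token distributions,
    # which discourages drift ("forgetting") outside the new domain.
    kl = F.kl_div(
        F.log_softmax(out.logits, dim=-1),
        F.softmax(ref_logits, dim=-1),
        reduction="batchmean",
    )
    loss = out.loss + kl_weight * kl
    loss.backward()
    opt.step()
    opt.zero_grad()
    return loss.item()
```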

Prateek Joshi (12:44.577)
Right, right. That's actually great, and it's a good segue into my next question. Earlier, you talked about the use cases of voice AI. Historically, if you ask an average customer, hey, where is voice AI used? They're familiar with airlines using it, or calling to make a reservation, or customer support, or maybe drive-through ordering. What are some of the

more innovative or newer use cases you're seeing in the market where voice AI can play a big role?

Itamar Arel (13:20.874)
Yeah, absolutely. We're on the precipice of a very exciting era, I think. So, to agree with your point completely: on the enterprise side, with call centers in the US today, there's over $100 billion a year being spent on voice customer service automation. So that's the elephant in the room, in some ways. But switching to the more consumer-focused side of things, you can envision voice enabling many functions in our day-to-day life.

Some of that has been attempted, of course, by Siri and Alexa and the next generation of these personal assistants, so far with, you could argue, a limited sphere of influence. But I believe, many of us believe, that that's going to change radically. True personal assistants that can continuously listen to you and take actions on your behalf to achieve some goals are definitely a possibility in the not-so-distant future. And again, voice

is the most natural way for us to communicate. As good as we've gotten at texting, it's still far more natural, far quicker, for us to just say what we want to say and convey information that way. These assistants will be much more integrated with other services in our lives, I think. And yeah, this next generation of personal assistants, for lack of a better term,

will be sort of transformational, I think, in our day-to-day life. You could be walking down the street and say, you know, how many calories are in this dish that I'm looking at? I know there are some early use cases, early solutions, in that context. But again, integrating voice, having you talk to this virtual assistant, if you will, and having it talk back to you in an almost human-like fashion, I think is really going to create this shift into a new world of

new possibilities. And so I think that's the more exciting stuff that will be happening soon on the consumer side.

Prateek Joshi (15:20.197)
Right. And when you look at voice AI tools, it could be an enterprise product, a consumer product, hardware, software. Voice is very natural; we use it every day. But as a technology, it's not as ubiquitous as it should be. So between today's reality and that vision, what are the biggest challenges that need to be solved

so that voice AI becomes absolutely universal, so that every average person just uses voice for most of their interactions with technology?

Itamar Arel (15:59.594)
Yeah, a lot of the challenges, some of which I touched on, go beyond robustness to poor grammar, broken English, and so forth. We tend to interrupt each other often, not intentionally, of course, well, sometimes intentionally, but sometimes not. And yet we're able to recover the conversation, either by saying, I'm sorry, go ahead, or just by listening to parts of the statement and really modulating the response.

Those pieces, which go beyond just LLMs, there's a lot more to build around that, are going to be the next wave of improvements that will create systems that are a lot less brittle. With IVRs, as is notoriously noted, the conversation has to be perfectly choreographed and you read your numbers very slowly. To give you a classic example: reading phone numbers. I'm always surprised that with existing IVR solutions you have to read your phone number

very slowly when you're prompted for it, and even then it doesn't always get it. Whereas if you talk to a friend and say, well, my phone number is 555-333-2222, they just get it. A lot of those improvements, some have already happened, honestly, in the last two or three years, others are about to happen, and I think it's going to usher in a new era of these conversational AI agents that are going to transform our lives. You asked a little bit about data.

Itamar Arel (17:26.198)
You know, surprisingly, the requirement for data has been very low. So low data volumes are typically needed, especially when automating customer service calls, for example. Imagine calling an airline to book a flight or change a flight. You can imagine that tens of calls, certainly a few hundred prototypical calls, can give a strong sense of where 90% of the callers might be, right? So leveraging the power, if you will, of LLMs together with a

fairly modest amount of in-domain conversations can likely deliver a solution that automates 80 or 90% of the calls, and that would already be a huge improvement relative to what exists today. And then, of course, we didn't explicitly talk about this: RAG, as in extracting information from a knowledge base, is going to play a key role, right? Because if you are an airline, or if you're a hotel, and

somebody asks about a particular hotel, whether it has a pool, or what the distance to the airport is, and so forth, you can imagine that kind of information being dynamically maintained and, through a RAG mechanism, extracted seamlessly. And you don't need to train the model further to have a virtual agent that is very informed, usually far more informed than a person could be, and up to date on the latest details you want to provide.
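
As a concrete illustration of the RAG mechanism described above, here is a minimal sketch: hotel facts live in a small knowledge base, the most relevant ones are retrieved by embedding similarity, and they are prepended to the prompt rather than trained into the model. The embed() function is a stub standing in for a real sentence-embedding model, and the facts are invented.

```python
import numpy as np

KNOWLEDGE_BASE = [
    "The Downtown Hotel has an outdoor pool, open 7am-10pm.",
    "The Downtown Hotel is 12 km from the airport, about 20 minutes by taxi.",
    "Checkout time is 11am; late checkout can be requested at the front desk.",
]

def embed(text: str) -> np.ndarray:
    # Stub: swap in a real embedding model (e.g. a sentence-transformer).
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(384)

def retrieve(question: str, k: int = 2) -> list[str]:
    q = embed(question)
    def score(doc: str) -> float:
        d = embed(doc)  # cosine similarity between question and fact
        return float(np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d)))
    return sorted(KNOWLEDGE_BASE, key=score, reverse=True)[:k]

def build_prompt(question: str) -> str:
    facts = "\n".join(retrieve(question))
    return f"Answer using only these facts:\n{facts}\n\nCustomer: {question}\nAgent:"

print(build_prompt("Does the hotel have a pool?"))
```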

Prateek Joshi (18:52.529)
All right. And if you look at all the languages around the world: there are so many languages, each language has so many dialects, and those have so many accents. There are so many little variations across all these people. So how can we make a voice AI system work across all of them,

versus focusing on a specific group that speaks a specific language, like English, in a specific way, like, I don't know, how people talk in the US, or maybe on the West Coast? How do you scale that across the entire world?

Itamar Arel (19:32.514)
That's another place where I feel the last two or three years have seen tremendous improvements. There are now multiple companies, both big and small, including the usual suspects of Microsoft, Amazon, and Google, as well as smaller startups, of course, that are delivering close to human-level transcription accuracy, certainly in universal, domain-free settings. And that's sometimes for tens, if not hundreds, of languages, 200 languages or more. Same for

accents. It's of course never a fully solved problem, but getting to a sometimes superhuman word error rate across a good number of languages is definitely attainable. We at Tenyx of course support English, but have experimented with Spanish and French and the other predominant languages. And it's amazing: even a native speaker of one of these languages will often listen to a segment, something somebody says on the phone, and

as a human it's hard to make out what the person was saying, yet the ASR, the transcription, gets it pretty much correct. Like other players in this space, we partner with such ASR providers; you don't want to reinvent the wheel. Some of them have trained on millions of hours. One of them told us they've trained on 2.5 million hours of conversations. It's the very rich datasets these ASRs have been trained on that are the primary reason

accents, dialects, a lot of things are now being captured at an oftentimes superhuman level. Which, again, does not mean it's 100% solved, but it's gotten to a point where it's comparable to native speakers.
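
For reference, the word error rate (WER) behind claims like "superhuman accuracy" is simply the word-level edit distance between a reference transcript and the ASR hypothesis, normalized by the reference length. A minimal implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Dynamic-programming edit distance over words
    # (substitutions, insertions, deletions all cost 1).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("my phone number is five five five", "my phone number is 555"))  # ~0.43
```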

Prateek Joshi (21:13.437)
Right. And that actually brings us to another important question. As voice agents get really, really good, the flip side is that, like any tool, they can be used for good and bad. In this case, they can be used to spread misinformation or manipulate public opinion on a specific issue. So how are we combating

that specific use of voice AI? Can we do anything about it? If so, what can be done?

Itamar Arel (21:45.206)
This is certainly one of the major challenges of our space and our time. I'm not sure the community has converged on a solution. The particular approach we've taken is a multi-guardrail scheme for mitigating misinformation and hallucinations, as well as out-of-context conversations that veer off topic or outside the sphere of conversations typical to a domain. That basically means you need to incorporate both

classical rule-based schemes, keywords and phrases and the templates you look for, coupled, of course, with deep ML-based solutions, such as ones pertaining to anomaly detection. You can imagine: if there were a way to look at the embedding space and figure out, in that fairly high-dimensional space, that the current output to be produced or input being received deviates, that it's unlikely

to be derived from the distribution of calls you typically expect, then that additional layer, this multi-guardrail approach, we think, is the more practical way of driving these harmful, misinformation-laden, or out-of-context conversations as close as you can to zero. They can never be completely zero, and hacks are always out there, but we think we've made significant progress. And my personal

prediction is that, at least in the business enterprise context, which tends to be a restricted domain, so 99.99% of the people that call an airline usually call to book a flight, change a flight, or ask about timetables, they don't usually talk about Shakespeare, within a restricted domain, I predict that we as a community will get very good at mitigating these negative

misinformation-type comments or outputs. So we'll see, but that's definitely a major challenge.
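
The guardrail internals aren't spelled out in the conversation, so here is only a hedged sketch of the general two-layer idea: a classical keyword rule layer plus an embedding-space anomaly check against the distribution of expected in-domain calls. The blocklist, the embed() stub, and the slack factor are all invented for illustration.

```python
import numpy as np

BLOCKLIST = ("competitor", "politics", "medical advice")  # rule layer (illustrative)

def embed(text: str) -> np.ndarray:
    # Stub standing in for a real embedding model.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(384)

# Fit the "expected call" region from historical in-domain utterances.
domain_utterances = ["I'd like to book a flight", "Can I change my seat?"]
domain_vecs = np.stack([embed(t) for t in domain_utterances])
centroid = domain_vecs.mean(axis=0)
# Radius: how far known-good examples sit from the centroid.
radius = max(np.linalg.norm(v - centroid) for v in domain_vecs)

def allowed(candidate_response: str) -> bool:
    # Layer 1: classical keyword/phrase rules.
    if any(term in candidate_response.lower() for term in BLOCKLIST):
        return False
    # Layer 2: anomaly detection -- flag outputs far from the domain centroid.
    dist = np.linalg.norm(embed(candidate_response) - centroid)
    return dist <= 1.25 * radius  # the slack factor is an arbitrary choice
```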

Prateek Joshi (23:48.309)
Right. I want to talk a little bit about human creativity, because we are talking about voice, and throughout history music has been a huge part of how humans have expressed creativity. It's a form of art, and singers have played a big role; in some shape or form, they've been part of our lives. So now that we have insanely good voice AI agents, will they surpass

human creativity on this specific aspect, meaning creating the perfect voice, saying the perfect things, sounding like the world's best singer? Is that a thing that can happen? Should it happen in the future?

Itamar Arel (24:32.91)
Well, can it, and should it? You know, those are separate questions. Sorry, are you referring just to the context of speech? Because you mentioned music and singing there. Are you venturing into the musical side of things?

Prateek Joshi (24:46.869)
Oh, no, I'm just saying: will voice AI agents end up being creative as well? Because right now they're serving a tactical purpose. Like, hey, I'm going to call to book a flight; you understand me; you give me what I want. But this is a more artistic endeavor, using voice AI to do something else.

Itamar Arel (25:05.494)
Yeah, yeah. The short answer is yes. I think we're already seeing a fair amount of creativity, or the ability, within a given domain, to answer in a non-robotic, non-templatized way. So certainly there are early indications that these agents are going to feel almost human-like in their engagements. The other aspect of that is emotions and emotional state, both of the customer and of the agent.

It's another interesting development of the past few years. Just to be a little more specific about that: most ASR and speech processing service providers offer, in addition to the transcription, non-textual information such as pitch and prosody, things that reveal more about the state, maybe the emotional state, of the speaker. And

as a result, that allows the agent to respond more appropriately, in a more human-like way, which I think indirectly addresses what you're saying. Changes in volume, changes in speech pace. The next generation of voice AI agents will undoubtedly leverage these additional audio-derived signals, if you will, to improve their understanding of a customer and thus make the response more creative, more human-like. No doubt that's where things are going.
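
To ground this, here is a small sketch of extracting such non-textual signals with the open source librosa library: a pitch track, frame-level energy, and a rough voicing ratio. How these features map to an emotional state is a separate modeling step, and this particular feature set is only illustrative.

```python
import librosa
import numpy as np

def prosody_features(path: str) -> dict:
    y, sr = librosa.load(path, sr=16000)
    # Fundamental frequency (pitch) track; NaN where a frame is unvoiced.
    f0, voiced, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
    )
    rms = librosa.feature.rms(y=y)[0]  # frame-level energy (loudness proxy)
    return {
        "mean_pitch_hz": float(np.nanmean(f0)),
        "pitch_variability": float(np.nanstd(f0)),  # flat vs. animated speech
        "mean_energy": float(rms.mean()),
        "voiced_ratio": float(np.mean(voiced)),     # rough pace/fluency proxy
    }
```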

Prateek Joshi (26:23.901)
All right, let's talk about the application of voice AI in more sensitive areas like childcare or elder care, where it needs to be held to a higher threshold of what's acceptable. So when it comes to applying voice AI across all the use cases, should there be limits on

what we should use it for? Or is it, hey, for these use cases we shouldn't do it, because it's too sensitive or too risky?

Itamar Arel (26:59.402)
That's also an interesting question. Maybe a corollary to it is whether some of these domains might end up being regulated if they involve children, or even adults or older people. Yeah, I think it all goes back to the point we spoke about recently, which is: how do you make sure this thing is safe? How do you make sure that whatever it talks about, whatever outputs it produces,

tend to be, or rather, are guaranteed to be, within the sphere of topics or conversation content that we want to make sure they stick to. Some of these are technological challenges, and some are societal decisions we have to make. I think solutions that address children, for example, are likely to happen for sure, but it might take a while, because they're going to rely on

the safety measures and technology pieces that we're going to be putting together over the next few years. So I would say the lower-hanging fruit, if you will, are personal-assistant kinds of functions, or customer service, where within a restricted domain it's a bit easier to make sure the system is safe. Down the road, these things will grow and address all demographics and many other use cases.

Prateek Joshi (28:23.973)
Right. That's actually a good way of looking at it. And while we're on the topic, do you see use cases for voice AI in, say, the military? Again, obviously there are ways to use it, but is that practical? If so, where can it be used in a reliable way?

Itamar Arel (28:48.47)
Yeah, in a military context, I imagine military voice solutions have a very challenging setting, as you can imagine, in which case some of the challenges may not be so much in the space where we operate, but more about how you get speech recognition and transcription right when the channel is very noisy and there are variations in speech patterns and so forth.

But by the same token, the military, like any other government agency, tends to have customer service, quote unquote, functions, where I definitely see these kinds of technologies playing a part. So, you know, out of the many AI use cases that people are rightfully concerned about, I tend to think these voice AI solutions are probably less

scary, for sure less concerning, although, again, you always have to make sure they stick to the script and adhere to business logic, in the military sense and so forth. So yeah, it will be interesting to see how that plays out. We have not had any military use cases of our solutions, but it will be interesting to see how that space unfolds.

Prateek Joshi (30:04.597)
Right. And there has been debate, there are people on both sides of what I'm about to ask. When you look at AI-generated voice, some people are saying, hey, make it as natural as possible; let it be out there in the wild along with human voice recordings. And some people are saying, hey, why don't you watermark it in some way, so that if needed, we can identify it in the future and recognize, hey, this is

AI-generated voice. So where do you stand on that?

Itamar Arel (30:37.534)
So I can share where most of our customers stand on that, which I think answers your question. It will probably be surprising to some of our listeners that most enterprise companies do not want to misrepresent a virtual agent as a real agent, right? First, yes, it's an inherent misrepresentation if you do so. But also, these solutions to date, they're good. They sound very human. They tend to be more and more robust, for sure,

but usually you can tell by the third or fourth or fifth turn that you're talking to a machine. And that's still okay: if the machine understands you robustly, answers correctly, saves you time, and gets you on your way, that's great. But predominantly, I think, in the near future at least, companies are going to say, hey, welcome to X Airlines, this is your virtual agent, how can I help you? So just disclosing upfront that this is a

virtual agent is key. Of course, people want to know; I don't think people want to be conned or tricked. And our customers, of course, equally don't want to misrepresent. They do want this agent to sound human and robustly understand us, but I think for the foreseeable future there'll be this full disclosure that this is a virtual agent, and that's fine. That's probably how it should be. And it addresses

Prateek Joshi (31:40.989)
Hehehe

Itamar Arel (32:00.13)
some of the concerns you raised about misinformation and disinformation: am I talking to a machine, what's real, what's not? My sense is that for the foreseeable future, companies are going to insist on this full disclosure, and it's probably the healthy thing to do.

Prateek Joshi (32:14.725)
Right. I have one last question before we go to the rapid-fire round. If you look at voice AI, the market, in the next 12 months: what's coming next, and what are you excited about?

Itamar Arel (32:28.746)
Yeah, you know, I personally, and I'm sure many can relate to this, do call my airline to change a flight, or a shipping company to ask about a package and so forth, and it's all too common for me to press zero, zero, let me talk to a person, and then it could take 40 minutes until I actually talk to someone, right? In the case of airlines, if there's a storm system or something that otherwise impedes travel, that's super common. And I think it's going to be exciting over the next,

probably, 12 to 24 months, maybe 36, to see a huge shift where you no longer wait, certainly not 40 minutes. You get greeted by a solution that feels almost human and can answer your questions, get you service, move your flight, tell you where your package is, whatever it is. And I think that's been long overdue, in a way. The government, you touched a little bit on the military, but government is known to be understaffed and

burdened. If you have family members, older members, who have ever called Medicare or Social Security, you know they're notorious for being understaffed; it's just the nature of their space. And so you could be three hours on hold, listening to soothing music until somebody answers the phone, and that's less than desirable. So I think it'll be exciting to see, three years from now, that that doesn't happen anymore, right? You just get your service, the system understands you. I know older people, especially, will

bless it. So yeah, that's exciting. We're happy to be part of that revolution.

Prateek Joshi (34:04.533)
And the soothing music, I don't know who it's supposed to soothe. Usually, I think, it tends to aggravate people more. But that's a topic for another day. All right, with that, we're at the rapid-fire round. I'll ask a series of questions and would love to hear your answers in 15 seconds or less. You ready? All right. Question number one: what's your favorite book?

Itamar Arel (34:20.878)
Sure.

Itamar Arel (34:27.254)
There's a book called Consciousness and the Brain by Stanislas Dehaene. He's a famous French neuroscientist. It's a wonderful book about the brain, of course. He's a neuroscientist, but he writes for the layman, and it has a lot of cool stories about how the brain works, things that, for me, were enlightening. So: Consciousness and the Brain.

Prateek Joshi (34:52.181)
Amazing. All right, next question. What has been an important but overlooked AI trend in the last 12 months?

Itamar Arel (35:00.026)
Oh, I think it's the emerging work on replacing feed-forward architectures like transformers with recurrent ones. I'm a big believer in recurrent neural networks. Obviously the brain employs a lot of recurrence and feedback loops; it's just that there have traditionally been several key challenges to having that transition, that advancement, take place. But I predict RNNs will replace transformers in the not-so-distant future, with strong benefits in speed and scale and other things.

Prateek Joshi (35:28.741)
Interesting. Yeah, that's an interesting trend. Next question. What's the one thing about voice AI that most people don't get?

Itamar Arel (35:39.254)
Oh, that getting it to work well is almost an AGI-complete problem. Things like endpoint prediction: did we pause because the sentence is complete, or not? We use endless idioms and fragments of phrases, and as humans we do that a thousand times a day. It's far more challenging for machines to understand and figure all that out, but I think they will.

Prateek Joshi (36:03.461)
What separates great AI products from the good ones?

Itamar Arel (36:08.418)
I think a good AI product always has an accepted, or certainly an expected, level of performance. Below that, it doesn't matter how cool the technology is: the customers won't buy it, and you'll potentially go out in flames. So for the good ones, you've got to make sure you're confident about your accuracy level, automation rate, detection rate, whatever it is you're shooting for, and that you meet and exceed it. It's not always trivial to ascertain at

the onset of the journey, but it's key. AI needs to work.

Prateek Joshi (36:40.361)
All right, what have you changed your mind on recently?

Itamar Arel (36:46.122)
Well, I think, like many other people, I've come to realize that while LLMs may not necessarily lead us directly to human-level AGI and beyond, they nonetheless exhibit real intelligence in a lot of ways when it comes to solving problems that we humans attribute intelligence to. And so that's been a shift in my view relative to the beginning of LLMs.

Prateek Joshi (37:11.025)
What's your biggest or wildest AI prediction for the next 12 months?

Itamar Arel (37:17.91)
I think full multimodal models that process sound, video, and language concurrently, so as to deliver ever more impressive and impactful results. I'm not the only one who predicts this, but I think we're halfway there, and we'll get fully there; that's going to be the next big wave. Beyond that, we need several additional breakthroughs on our path to building thinking machines.

Some have to do with richly modeling the world with which agents interact, and perhaps with deeper, goal-driven action selection, the kind that aims to impact the environment in some desirable way. We do that as humans, as mammals. But again, I'm optimistic that many of these things will come before long.

Prateek Joshi (38:04.421)
Final question, what's your number one advice to founders who are starting out today?

Itamar Arel (38:11.042)
Oh, okay, well, that has to be: as soon as you have a thesis about a product or service you want to offer, make sure you talk to as many prospective customers and potential partners as you can prior to launching your venture. We have a tendency to drink our own Kool-Aid, to be in love with our technology, and to not really realize that, yeah,

you can never talk enough, or too often, to customers, and they'll help you understand whether what you're building really answers a need and has real value. Indirectly, they will also help you position yourself against the competition, which is important. So there's a lot to learn from talking to customers and potential partners. It's a key role for any founder, in my humble opinion.

Prateek Joshi (38:59.317)
Amazing. Itamar, it's been such a brilliant discussion. I think voice AI is at a cusp because of LLMs, and I'm really excited about its future. So thank you so much for coming on the show and sharing your insights.

Itamar Arel (39:13.558)
Yeah, thank you for having me. Thanks so much.