
Infinite Curiosity Pod with Prateek Joshi
The best place to find out how AI builders build. The host Prateek Joshi interviews world-class AI founders and VCs on this podcast. You can visit prateekj.com to learn more about the host.
Voice-to-Voice Foundation Models
Alan Cowen is the cofounder and CEO of Hume, a company building voice-to-voice foundation models. They recently raised their $50M Series B from Union Square Ventures, Nat Friedman, Daniel Gross, and others.
Alan's favorite book: 1984 (Author: George Orwell)
(00:01) Introduction
(00:06) Defining Voice-to-Voice Foundation Models
(01:26) Historical Context: Handling Voice and Speech Understanding
(03:54) Emotion Detection in Voice AI Models
(04:33) Training Models to Recognize Human Emotion in Speech
(07:19) Cultural Variations in Emotional Expressions
(09:00) Semantic Space Theory in Emotion Recognition
(12:11) Limitations of Basic Emotion Categories
(15:50) Recognizing Blended Emotional States
(20:15) Objectivity in Emotion Science
(24:37) Practical Aspects of Deploying Voice AI Systems
(28:17) Real-Time System Constraints and Latency
(31:30) Advancements in Voice AI Models
(32:54) Rapid-Fire Round
--------
Where to find Prateek Joshi:
Newsletter: https://prateekjoshi.substack.com
Website: https://prateekj.com
LinkedIn: https://www.linkedin.com/in/prateek-joshi-91047b19
Twitter: https://twitter.com/prateekvjoshi
Prateek Joshi (00:01.856)
Alan, thank you so much for joining me today.
Alan Cowen (00:04.898)
Thanks for having me.
Prateek Joshi (00:06.91)
Let's start with the basics. Let's start with defining voice-to-voice foundation models. So for someone who's walking into this fresh, can you explain what that is?
Alan Cowen (00:20.588)
Yeah, so you have large language models, which everyone sort of knows what they are at this point. They take in text and they output text. Voice foundation models take in voice and output voice. And the reason that's different than having a large language model connected to something that separately turns the text into voice is that voice has a lot of information that should be determined intelligently and not based on a
model that's disconnected from the thinking. So the voice should be intimately involved in the thinking. And actually, we think with a voice that has emotionality to it, and it influences the patterns of our thoughts. And we then use our voices to express a lot of things that are not present in the words that we're saying. And we also listen to the other person, and a lot of what we infer about their preferences, about their beliefs, about their personalities comes from the way they're saying words and not just the words they're saying. So all of that should be wound up in the intelligence, in the AI.
Prateek Joshi (01:26.164)
Now, historically, when we first started figuring out how to handle voice, how to understand speech, the way it was done is: you speak into a microphone, then it gets converted to text. Then you use NLP to understand the sentence. Then you construct a response, convert that to an audio signal, and that goes to a speaker.
It's a circuitous, long route, right? And now we're completely short-circuiting the whole thing, input-wise and output-wise. So as we go from the old setup to this one, what are the advantages, and also what are we giving up in this trade-off?
Alan Cowen (02:09.89)
So the advantages are that you have a model that really understands what it's saying, sounds like it understands what it's saying, and is able to infer more from your query. Our interactions with AI are sort of rate-limited by the interface. And if it's just a text box, we're going to be able to convey a quarter as much information in the same amount of time as with voice, not just because typing takes four times longer than speaking. For many people that's not even true, some people are very fast at typing, but that's the minority. Even if you're really fast at typing, you're not going to be able to explain your preferences as well with text in the same amount of time. You'd have to say, "I'm actually also frustrated by this," rather than just conveying frustration in your voice, or conveying relaxation or energy or whatever it is you want to convey. And that's with every single word. So those are the advantages.
The disadvantage is, you know, you might incur some latency cost because the model is processing more information per second, basically. And so to offset that, you just want to use a model that is fast enough that you don't really care about the latency cost. And at a certain point, the richness of the information you get from the interaction is well worth the latency you incur. And then you can use that model to call slower models if you need them. So the model that's doing the orchestrating should understand your voice. And then if it needs a slower model, where latency is really a consideration because it's so slow and so advanced, then it calls that slower model.
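[Editor's note: a minimal sketch of the contrast described above, with every function a hypothetical placeholder rather than any vendor's API. The point is only that the cascaded route discards prosody between stages, while the end-to-end route keeps voice in the loop and can still delegate to a slower expert model.]

```python
# Illustrative sketch only: cascaded pipeline vs. end-to-end voice model.
# All functions are placeholder stubs, not a real API.

def asr_transcribe(audio: bytes) -> str:
    return "transcribed words"          # prosody is discarded at this step

def llm_respond(text: str) -> str:
    return f"reply to: {text}"          # reasoning sees words only

def tts_synthesize(text: str) -> bytes:
    return text.encode()                # tone chosen without conversational context

def voice_model_step(audio: bytes, context: str = "") -> tuple[bytes, bool, str]:
    # one model: audio in, audio out; tone of voice informs the reply
    return b"reply audio", False, ""

def slow_expert_llm(query: str) -> str:
    return "expert answer"              # higher latency, more capable

def cascaded_pipeline(audio: bytes) -> bytes:
    """Legacy route: speech -> text -> NLP -> text -> speech."""
    return tts_synthesize(llm_respond(asr_transcribe(audio)))

def voice_to_voice(audio: bytes) -> bytes:
    """End-to-end route; the voice model orchestrates and can call a slower model."""
    reply, needs_expert, query = voice_model_step(audio)
    if needs_expert:
        reply, _, _ = voice_model_step(audio, context=slow_expert_llm(query))
    return reply
```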
Prateek Joshi (03:54.414)
Now, the ability of AI to understand the emotion at that point in time: I think it's critical because, in a simple use case like customer support, I'm calling because I'm angry or frustrated, because something didn't work, or my order is not here, or they sent the wrong thing. And I think detecting that is an important component. So how do you train an AI model to learn to understand human emotion in speech? Can you explain in plain English how an AI model does that, just like a human does?
Alan Cowen (04:33.536)
Yeah, so there are really a few components to it. The first component is just being able to actually perceive the voice modulations that occur during words, the non-linguistic, non-lexical voice modulations. That includes prosody, which means the tune, rhythm, and timbre of speech. And it also includes vocal bursts, laughs, sighs, grunts, et cetera, which are actually pretty common, and things like ums and ahs and interjections and so forth. To train a model that does that, you just need a lot of recordings of those, and you need them to be labeled really well, ideally by the person who's actually forming those sounds, which is what Hume does. We collect data from people around the world forming these sounds naturally and also reacting to them and rating them. And we try to understand what they mean to the people making them. If you can predict with 100% fidelity what they mean to the people making them, irrespective of the person's gender, et cetera, then you have a representation of the underlying physical movements that generate those sounds, essentially, because that has to be latent in there. And that's the measurement process. Then there's interpretation. This is sort of analogous to language transcription and language understanding, where transcription is just the measurement part of language and understanding is when you take those words, put them in context, and try to extract meaning from them. It's the same with the nonverbal voice modulations. We take models and train them on tons and tons of conversation data in order to understand what those voice modulations mean. That data includes things like customer service calls, and it includes creative content, and it includes a lot of other stuff, humor and arguments between people and so forth, and data from many cultures too. And the model has to be able to figure out, in a given context, whether the person who's listening to somebody speak is going to laugh. That's one example, right?
Alan Cowen (06:56.514)
If the model thinks they're going to laugh, maybe it's something that's funny, right? Or there's also polite laughter and other kinds of laughter, which the model can distinguish based on the acoustic patterns in the laughter. And that's just one example, right? What's going to make a customer service experience frustrating is another thing the model has to learn, in order to know whether somebody's going to sigh in frustration.
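[Editor's note: a hedged sketch of the "measurement" step described above: predicting speakers' own ratings of what they expressed from non-lexical acoustic features. The data is synthetic and the feature set hypothetical; this shows only the shape of the problem (multi-dimensional self-report targets, held-out predictive fit), not Hume's actual training setup.]

```python
# Minimal, hypothetical sketch: predict self-reported emotion ratings
# from acoustic (non-lexical) features. Synthetic data throughout.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_clips, n_features, n_emotion_dims = 1000, 64, 20

# X: acoustic features per clip (pitch contour stats, energy, timbre, etc.)
X = rng.normal(size=(n_clips, n_features))
# Y: each speaker's own ratings of what they expressed, on ~20 dimensions
true_map = rng.normal(size=(n_features, n_emotion_dims))
Y = X @ true_map + rng.normal(scale=2.0, size=(n_clips, n_emotion_dims))

model = Ridge(alpha=1.0)
# Held-out R^2: how much of the speakers' own ratings can we predict?
# The "interpretation" step (context, culture, conversation) comes later.
print(cross_val_score(model, X, Y, cv=5, scoring="r2").mean())
```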
Prateek Joshi (07:19.0)
Right. And that's actually a good segue into my next question. If you go across different cultures, different countries, people express the same emotion non-verbally in different ways, going from the US to England, India to Japan. It's just different, even though they're conveying the exact same emotion. So part one: how do you train your model to deal with this variation? And maybe part B is: where does the data come from to train your models?
Alan Cowen (07:51.094)
Yeah, so the model has to be able to first encode expressions irrespective of who's forming them. And then it has to then take the context that it has, including data relevant to the culture of the person, like their accents, including the content of the speech, what they're talking about, and adjust its interpretation of voice modulations based on that context. That's really important, right? And so you have to have data for many cultures to do that.
At Hume we collect data, and we use data that we collect experimentally. So we get lots and lots of people, at this point millions of participants, to engage in conversations with each other, to act things out, and to react to emotional content and so forth. And then we have them rating themselves and rating their voices and rating how they feel. That's all very important data, particularly for the measurement side and also for post-training the model that is trained for speech understanding. And then we also collect a lot of public data and we use that as well, like every other AI company. Public data is a very important part of training these models.
Prateek Joshi (09:00.426)
Right. Let's move to semantic space theory. It's amazing, and there are so many questions here. But to get started, can you explain what semantic space theory is?
Alan Cowen (09:15.746)
Semantic space theory is a new way of looking at emotion. So traditional views of emotion posit a certain set of emotional experiences that underlie expressions. And then through confirmatory studies, they try to draw those mappings and say, okay, this angry expression is recognized as being the signal of anger across different cultures. And they do this with small scale surveys. Semantic space theory works in the opposite direction.
It defines the space of possible ways of conceptualizing emotional behaviors in terms of their dimensionality, their distribution within a space, and how you conceptualize them with words. And then it derives those properties from data. So we take lots and lots of data, including, for example, judgments of expressions and the expressions people form in different contexts and across different videos.
So all these different associations, including between voice and facial expression, all have some signal in them, right? In other words, some degree to which they're preserved across different people, and some degree of what we would call noise, where we don't have enough context to be able to explain it. Anything where you have an isolated signal but not enough context to explain it, there's noise, basically unexplainable variance, there.
And we try to map out the distribution of underlying states that can explain those associations that are reliable, that have signal in them. And that's what semantic space theory is. And semantic space theory studies, because they're data-driven, have addressed a lot of unknowns in the space. So we didn't know how many emotions there were essentially, meaning how many distinct kinds of behavior were associated reliably with different emotional states.
which is another way of looking at that. And in order to even define the distribution of emotions in a space, we had to come up with some mathematical assumptions and axioms that we could work with and come up with some new statistical methods to derive, for example, what is a reliable signal dimension in this space. And that's how we derive our findings. And we found that there's over 20 different dimensions of
Alan Cowen (11:39.122)
emotional behavior in the voice and over 30 in facial expression. And when you combine them, there's some non-redundancy there. So there's a lot of different emotional behaviors associated reliably with distinct emotional states. And we also look across cultures to see how well that's preserved across cultures. we measure how many dimensions are preserved and what are the differences, where are the similarities. And so we've done these large scale studies for the first time. And so we have what we would
consider the most exhaustive taxonomy of emotion.
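[Editor's note: a rough, synthetic-data illustration of the kind of question raised above: how many dimensions of emotion judgments replicate across independent groups of raters? The split-half-plus-PCA procedure below is a crude stand-in for the statistical methods the actual studies develop; all numbers are made up.]

```python
# Synthetic illustration: count judgment dimensions that replicate across
# two independent groups of raters. Crude proxy for the real statistics.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
n_stimuli, n_categories, true_dims = 600, 30, 12

latent = rng.normal(size=(n_stimuli, true_dims))        # shared signal per stimulus
loadings = rng.normal(size=(true_dims, n_categories))

def rater_group():
    # each group's mean judgments = shared signal + that group's own noise
    return latent @ loadings + rng.normal(scale=3.0, size=(n_stimuli, n_categories))

half_a, half_b = rater_group(), rater_group()
train, test = slice(0, 300), slice(300, None)

# Derive candidate dimensions from group A on half the stimuli, then check
# whether group B varies the same way along them on held-out stimuli.
pca = PCA(n_components=n_categories).fit(half_a[train])
A, B = pca.transform(half_a[test]), pca.transform(half_b[test])
r = [np.corrcoef(A[:, i], B[:, i])[0, 1] for i in range(n_categories)]
reliable = sum(ri > 0.3 for ri in r)   # arbitrary cutoff; real work uses proper tests
print(f"recovered ~{reliable} reliable dimensions (ground truth: {true_dims})")
```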
Prateek Joshi (12:11.546)
You have talked about the basic six categories and, more importantly, how these categories are limiting when we have to understand the complexity of human emotions. So to start with, can you talk about these basic six categories, and also why it's important to look past these discrete categories and view emotions as a high-dimensional vector?
Versus, hey, let's categorize into one of six and move on.
Alan Cowen (12:41.058)
Yeah, so my PhD advisor's postdoc advisor, Paul Ekman, came up with these six categories of emotion to study in the 1960s, kind of off the top of his head: anger, fear, disgust, happiness, sadness, and surprise. Notice there's only one positive emotion there, happiness, which is pretty bad.
These are pretty extreme categories of emotion that you don't see every day, but these are the ones that he wanted to test, because he wanted to start with something really obvious: things that people might be able to see across cultures and agree that these are different facial expressions, for example, mapped to these categories. So that's why he started with those. For some reason they've stuck. What he actually did was take really stereotyped pictures of people forming facial expressions that we in the US and
Alan Cowen (13:31.924)
in Western countries for the most part, recognized as belonging to these categories. And he took them to remote cultures. Well, first he took them to like just not so remote cultures like Japan and Asian cultures. And then he took them to actually remote cultures like Papua New Guinea, where there was less contact with the West. And he just asked them, what do these mean to you? Can you map these expressions to these different words, for example?
and found there was some reliability in the mapping across cultures. For some reason, those six emotional states stuck for a really long time, even to the point where people were arguing, including Ekman, that these are the six basic emotions. In other words, these are the six emotions that people are born with, and everything else is an offshoot of these six with a lot of cultural accents and a lot of variability across people. But there was no data backing that, right? The important thing to understand is that this was never a data-driven theory. And people argued, based on very little data, that actually there were fewer than six emotions, that really it was just valence and arousal, meaning how unpleasant or pleasant something is and how calm or excited somebody is.
And those were supposed to be the determinants of all emotional states. That was an even more reductive taxonomy. And the reason these reductive taxonomies stuck is that nobody really gathered enough data to support a more nuanced taxonomy of emotion. Because if you have 100 samples, two dimensions are going to explain 90% of the variance a lot of the time, or six categories are going to explain 90% of the variance.
And a hundred samples, especially if you collected them on the basis of comparing two dimensions or six categories, is not the same as looking at data at large, in millions of videos, and just asking how many expressions people form and what their reliable associations are with different contexts. That didn't happen until we published our Nature study in 2020.
Prateek Joshi (15:50.722)
Right. Another topic that comes to mind here is blended emotional states. Meaning you could be surprised and happy, or you could be surprised and annoyed, or you could be surprised and scared. So it's not like humans go, okay, I'm in this specific state now, let me move on to the next state. No, it's not that. Usually these states blend; they're continuous. So can you talk about how semantic space theory deals with blended emotional states? Or more importantly, just recognizing blended emotional states and getting that data out: how is that useful in a modern AI system?
Alan Cowen (16:33.204)
Yeah, so, you know, the traditional way of looking at it is you have discrete categories and blends between them, but that's not really a mathematical approach. The mathematical approach is to say, all right, there's some finite set of dimensions along which these different categories are represented, or you can predict reliable variance in people's evaluations in terms of these categories by mapping things along these dimensions. And another way of looking at it is like, how many variables do you need to represent this space?
If you only have six discrete categories, technically you can do it with one variable; it's just a number from one to six, right? Once you have blends among these categories, you might need more than one variable if those blends are continuous. To the extent that the space is discrete, you don't; if there's continuity in the space, you do. And so the way we formalize this question is to ask: how many dimensions do you need to explain the reliable variance in people's judgments?
Prateek Joshi (17:05.294)
All right. Right, right, right.
Alan Cowen (17:27.266)
And then we ask, okay, let's look at how things are conceptualized along these dimensions and the distribution of states along these dimensions. Meaning, if there's a facial expression, for example, that's exactly halfway in between disgust and anger, is that viewed as one or the other? Or is it viewed as 50% disgust and 50% anger? And yeah, it's viewed as 50% disgust and 50% anger.
Right? So it turns out there's a continuous gradient there. And then, how many distinct patterns of expression do you need to explain the inferences people make when they see facial expressions? It turns out you need over 30 different dimensions. So there's the dimensionality, the distribution, and then the conceptualization of those dimensions. And the conceptualization is the question of
Prateek Joshi (17:56.462)
Right.
All right.
Alan Cowen (18:23.342)
What is the most precise and reliable way people talk about facial expressions? And how well are these different conceptualizations preserved across cultures? What is the kind of conceptualization of facial expressions that's most well-preserved? So it could be that you see a facial expression in the US, and you infer anger, and you also infer that the person's feeling negative, and you also infer that the person is high arousal, and maybe that they're seeing something novel.
or are they making a judgment of unfairness? There are a lot of roughly equivalent things you can ask, but then you can also ask: how well are those different judgments preserved across different people? And it turns out, to cut to the chase, that a lot of people agree the expression is angry, fewer people agree that it's negative, actually, both within and across cultures, and even fewer people agree about the appraisal the person made of their environment, because
how different appraisals map to different emotions really varies a lot across cultures. So it does turn out that it's useful to have these words that represent feelings people have. And the feelings are really what's well preserved across cultures. The appraisals and the valence and the arousal and all of that, those are more loosely correlated with those feelings. I mean, obviously, in order to define feelings, because we can't observe them directly,
they're internal to us, we have to make these comparisons and metaphors and talk about appraisals, talk about valence, talk about arousal. But at the end of the day, it really seems like these feelings exist, right? We know intuitively that they exist, but the science bears it out, because the things that these feelings are correlated with aren't as well preserved across cultures as the general feelings themselves, in terms of their association with specific behaviors like facial expression.
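[Editor's note: a toy illustration of the representational point in this exchange: a forced-choice label throws away the blend, while a continuous vector over emotion dimensions keeps it. The six-category vector is just for readability; the spaces discussed above have 20-plus dimensions for the voice and 30-plus for the face.]

```python
# Toy example: one-of-N label vs. a continuous blend over emotion dimensions.
import numpy as np

categories = ["anger", "disgust", "fear", "happiness", "sadness", "surprise"]

forced_choice = "anger"                    # blend information is lost here

blend = np.zeros(len(categories))          # continuous representation keeps it
blend[categories.index("anger")] = 0.5     # judged as 50% anger...
blend[categories.index("disgust")] = 0.5   # ...and 50% disgust, per the example above
print(dict(zip(categories, blend.round(2))))
```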
Prateek Joshi (20:15.872)
Now, when the average person hears the word emotion, it means something subjective: you feel it, you have a subjective rating, you move on. But there's emotion science, the modern study of emotion, and we now know that it's not that subjective; you can actually apply scientific methods to understand emotion. So can you explain how modern emotion science dissects emotion down to measurable components? What tools are we using to understand emotion today?
Alan Cowen (20:56.106)
Yeah. So I think this is where things have really changed in the past few years, and our current approach is different from most emotion scientists' traditional approach. Our approach is very data-driven. We take tons and tons of data. We measure facial expressions. We measure the voice. We ask people, in really huge controlled experiments with thousands of people, what they're feeling, and also measure their facial expressions and voices at the same time, and have people in conversations with each other in semi-controlled environments and so forth. That is totally new. And then we use machine learning to derive the underlying measures that predict different kinds of states and different self-reports and so forth. That's totally new. The way people used to do things was very different. I'll give you a short history of emotion science, basically. Modern emotion science in many ways started with Charles Darwin.
Prateek Joshi (21:48.643)
Yeah.
Alan Cowen (21:53.162)
He was the first person to, well, maybe David Hume, we're named after David Hume, he was a philosopher. So maybe David Hume first, but he didn't really have data. He just posited that emotions are kind of the driver of cognition. And that became very important philosophically, because they are. And then Charles Darwin came along, and he actually has a book, The Expression of the Emotions in Man and Animals, where he went and documented,
Prateek Joshi (21:59.842)
Right. Right.
Alan Cowen (22:21.802)
across different species and across humans, the different expressions people form and the different situations in which they form them. Of course, this wasn't a statistical analysis with data. It was more, you know, he went and documented things and kind of sketched them out, which is what Darwin did. And then, in terms of when statistics was introduced:
the foundation of modern statistics was really built for psychological inquiry, and so it made its way to emotion science, or affective science, as well. Now in emotion science, much like in psychology, there was a movement to sort of take out the subjective and focus on overt behavior. In psychology at large, this was called behaviorism, with reinforcement learning and the idea that behaviors could be reinforced.
Prateek Joshi (23:07.758)
All right.
Alan Cowen (23:18.624)
So it kind of tried to take emotions out of the equation. And the same thing happened with emotion science, which became called affective science. I think the term affective science really stems from this: we're going to take out emotion and not talk about emotions; we're going to talk about affect. And affect is meant to be something more observable. It really isn't, but that was kind of the point of it. And then affective science ended up being very reductive, just like behaviorism.
Prateek Joshi (23:40.342)
Right.
Alan Cowen (23:49.058)
And part of the reason that became popular is that there were limitations to the data you could collect, so you couldn't support high-dimensional theories of behavior. So the more reductive theories, which were often based on valence and arousal, were just more easily testable. And now we have much bigger data, and we can explore bigger latent spaces with a lot more dimensions.
Alan Cowen (24:18.102)
That's when you can go back to studying feelings, right? And feelings, of course, can't be studied directly, but you can look at all of the correlates of feelings: language, facial expression, vocal bursts, what we would call emotion-related behaviors. Because that's what emotion science studies: the latent dimensions underlying emotion-related behaviors.
Prateek Joshi (24:37.518)
Amazing. Love the brief history here; there's so much to it. And the funny thing is, the word emotion is almost universal. Everyone knows it. But what drives it, the way we understand it, is not as widely known. So I think that's great. Thanks for the brief history. All right. Now I want to move to the practical aspects of deploying a system. Because at the end of the day,
the average customer doesn't care about all of it. They just want the system to work. They have a thing, they want to talk into it, and it has to work 100% of the time, all the time. That's the average consumer for you. So when you deploy the system in the wild, can you talk about the biggest bottlenecks? Let's start with latency, given that the stack comprises so many things. What part gives you the most heartache today?
Alan Cowen (25:30.988)
So, you know, traditionally the latency was driven by having many different models that you had to run. You had a text-to-speech model, and you had a speech-to-text model, and then you had a language processing model. Now you have LLMs, but even that was subdivided into different kinds of tasks. And you can make them all as fast as you possibly can, but then the orchestration of it was a challenge and, you know, still to some extent continues to be a challenge. But
what we've started to do is put all of those into one model. And when you put all of those into one model, you can actually eliminate a lot of the redundancy between these different models. So for example, even like a transcription model has to understand language because it has to understand not just how sounds map to words, but also how a sound that can map to multiple words is disambiguated by context, right? And so if you say like,
Snoop Dogg, you know there are two Gs. That's a very simple one. So at the very least, you have to understand the difference between dog in isolation and Snoop Dogg. But that's a really simple example where it's just kind of a bigram that you need to know. It obviously gets a lot more complicated when you have complex words that can have multiple meanings. And also, sometimes there's ambiguity in the sound, and
Prateek Joshi (26:31.662)
Alright, alright.
Alan Cowen (26:55.82)
humans are very predictive in how they understand things. So we use a lot of top-down prediction to understand speech, and that's sort of implicit in how we understand it. So we end up doing a lot of language understanding to do that, and the most accurate transcription models actually have language understanding in them. They basically have little LLMs inside. And then the text-to-speech is obviously the same exact problem, but in reverse. It's a problem of how this is pronounced, and it's one-to-many.
But in the middle you have language understanding, and the language understanding model also has to understand language. So all three of these models have to have all these redundant weights, right? When you put them into one model, you can get away with that model having less redundancy. At the end of the day, it's still a bigger model than all three combined used to be, because it also has a high-level understanding of all of these things.
Prateek Joshi (27:34.659)
Right.
Alan Cowen (27:49.154)
Maybe we kind of skipped a generation of small models understanding these things, but now we're like, all right, now we're going to have the thing actually have intelligence too and come up with what to say. And that turns out to be a problem where you need bigger models, but the efficiencies you get by putting all of these together result in lower latency, even with these big models, than you used to have with a bunch of small models.
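[Editor's note: a toy sketch of the point about transcription needing language understanding: the same sound can map to different written words, and context decides. The bigram scores below are invented and stand in for the "little LLM" inside a strong ASR model.]

```python
# Toy example: context-based disambiguation of acoustically ambiguous words.
# Scores are made up for illustration.

bigram_logprob = {
    ("snoop", "dogg"): -0.5,
    ("snoop", "dog"):  -6.0,
    ("the", "dog"):    -1.0,
    ("the", "dogg"):   -9.0,
}

def rescore(prev_word: str, candidates: list[str]) -> str:
    """Pick the spelling the acoustic model can't decide on its own."""
    return max(candidates, key=lambda w: bigram_logprob.get((prev_word, w), -20.0))

print(rescore("snoop", ["dog", "dogg"]))   # -> "dogg"
print(rescore("the",   ["dog", "dogg"]))   # -> "dog"
```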
Prateek Joshi (28:17.26)
Assuming the models algorithmically are not the bottleneck, then between compute and bandwidth and maybe something else, in a real-time system: I'm talking into my phone. Let's say I'm in the middle of nowhere, or I'm in the middle of downtown, and I'm talking into my phone. Let's say algorithmically we have squeezed everything we can. Now, what's the next
Alan Cowen (28:30.114)
Mm-hmm.
Prateek Joshi (28:44.206)
thing to attack. Is it compute? Is it networking? Is it something else that we need to squeeze performance out of?
Alan Cowen (28:51.522)
So I think we're still reducing the latency, that's going to be a thing, and we have to make it more reliable and so forth. But because we've put everything into one model, it's actually a much more powerful model that can do many more things. It can form really emotional expressions, it can do different accents, it can play different characters. And so a lot of the next step is to make that a lot more controllable,
and give developers more freedom to control those things and more leverage to make the character what they want it to be, and have it adjust to the user in a contextual way that makes the user's experience as good as it possibly can be. I mean, it used to be that the challenge was just saying the words, right? And we sort of solved that, right?
Prateek Joshi (29:47.332)
Right.
Alan Cowen (29:47.346)
And we're introducing more flexibility to the model. It can make more mistakes because of that, but it also unlocks this whole new realm where you can actually have conversational intelligence that's fluid, that understands the voice. So a person's on the phone, and the model can understand whether they're talking loudly or quietly, and whether it should talk loudly back or quietly. Very simple. How fast are they talking? Are they in a hurry?
The model should be able to speed up its language and give more concise responses when they're in a hurry, and they shouldn't have to ask for that. It should just happen. And the model should know what its role is and basically understand how to modulate its voice as a result of that. So if it's in a customer service position, it should be very polite, and the customer is always right. If it's in a different kind of situation, say life coaching, maybe it's supposed to give advice so that the customer will change their behaviors, and not believe the customer is always right.
Right? And there's different language it would generate, but it should also generate different tones of voice based on that. A life coach is not polite in the same way that a customer service agent is. A life coach sounds different. I mean, when you ask people what they want for a life coach, they always say Morgan Freeman. That's not actually true, but you don't necessarily want Morgan Freeman on customer service, right? It's a totally different kind of personality that you want.
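[Editor's note: a hypothetical sketch, not Hume's actual API, of the kind of control surface being described: the developer fixes the role, and delivery (loudness, pace, tone) adapts to what the model hears from the user rather than waiting to be asked.]

```python
# Hypothetical control surface for voice delivery; names are illustrative only.
from dataclasses import dataclass

@dataclass
class VoicePolicy:
    role: str             # e.g. "customer_service" or "life_coach"
    match_loudness: bool  # talk quietly back if the user is speaking quietly
    hurry_aware: bool     # speed up and shorten replies if the user sounds rushed

def delivery_plan(policy: VoicePolicy, user_is_quiet: bool, user_in_hurry: bool) -> dict:
    # Choose delivery settings from the role plus what was heard in the user's voice.
    return {
        "loudness": "soft" if (policy.match_loudness and user_is_quiet) else "normal",
        "rate": "fast" if (policy.hurry_aware and user_in_hurry) else "normal",
        "tone": "deferential" if policy.role == "customer_service" else "direct",
    }

print(delivery_plan(VoicePolicy("life_coach", True, True),
                    user_is_quiet=False, user_in_hurry=True))
```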
Prateek Joshi (30:44.683)
All right, all right.
Prateek Joshi (31:05.76)
Right, right.
Prateek Joshi (31:14.478)
I have one final question before we go to the rapid fire round. Now, there's so many developments happening in AI, specifically in the subfield of voice AI. What advancements are you most excited about?
Alan Cowen (31:30.484)
In the field of voice AI, I'm most excited about any advancements that increase the amount of information we can convey to the AI model and how much it can give back, because at the end of the day, that's going to constrain our ability to control AI, get what we want out of it, and actually have volition in the world. I mean, imagine if you had no ability to communicate with the AI; then you would be like a mouse in a cage, and the AI is basically just optimizing its environment
based on assumptions about what you want. But we're human, we have this ability to communicate, and the throughput of that is entirely controlled by how well it can understand the voice, and eventually facial expression. We have facial expression understanding as well, and other indicators, so that we can actively communicate and have volition in our lives. And that's going to be built into everything. So I'm excited for people to move away from "let's build a text box into our customer service" and instead say,
"let's build a voice into our app," and make it so that with as little input as possible, it can decide this is what the customer probably wants, just bring up a confirmation window, and then we can do things faster. The AI can take action faster and understand our preferences much faster. I'm excited about the world where people start building that into everything. It is going to be built into everything.
Prateek Joshi (32:54.616)
With that, we're at the rapid-fire round. I'll ask a series of questions and would love to hear your answers in 15 seconds or less. You ready? All right, question number one. What's your favorite book?
Alan Cowen (33:02.754)
Sounds good, yeah.
Alan Cowen (33:08.33)
Ooh, that's a tough one. I think 1984 is a good one, and I think it captures really well what could go wrong with AI too, in terms of centralization and taking volition away from people.
Prateek Joshi (33:21.27)
Yeah, I love that book. So good. All right, next question. What has been an important but overlooked AI trend in the last 12 months?
Alan Cowen (33:31.914)
I think people talk a lot about AI agents. I don't think people are talking enough about AI interfaces. Agents come after interfaces, in my view. Agents come when you really understand what the user wants and you don't even have to ask them anymore. But the first step is to actually embed an interface into the app for what the user would normally do, and then have the AI learn from that interface. And eventually the AI can take more actions on its own without user input. So interfaces are going to come first, and that's sort of like
the human-in-the-loop version, where you're watching the AI take actions and you're able to correct it actively.
Prateek Joshi (34:09.1)
What's the one thing about voice to voice AI models that most people don't get?
Alan Cowen (34:15.51)
I think people think of it as an isolated thing. Like, ChatGPT Voice is an isolated interface. You don't see anything happening; it's not meant for you to be looking at your phone while you're using it. But I don't think that's how most voice interfaces will work. I think most voice interfaces will also have a visual component. And that's unlocked by having a voice-to-voice interaction, because you don't need to go into a text box, type, and get back text responses. You can actually just be watching what the thing is doing while you're talking to it.
Prateek Joshi (34:45.942)
What separates great AI products from the merely good ones?
Alan Cowen (34:50.966)
I think a great AI product makes the user feel like they have more control. There are good ones where you get a delightful experience at first: you put in a prompt and you get back a video, you know. And I think the problem with that is, where do you go from there? What's the next step after that? It's delightful at first, but it's sort of the end of the experience. I think a great AI product gives you continuous control and the ability to really cater that experience to your specific needs at a given time.
Prateek Joshi (35:22.678)
What have you changed your mind on recently?
Alan Cowen (35:27.362)
Honestly, I don't know if this is recent, but in the last few years I was more worried about traditional conceptions of AI risk than I am now. Some people have gone in the opposite direction. And the reason for that is I've realized that intelligence is not magic. It's not something where you can just code in intelligence. Intelligence is just learning; that's all intelligence is. And the other thing I've realized is that learning is really constrained by data and compute.
And so we're not going to suddenly get superintelligence. It's going to happen through a scaling process that we can observe and test, where you put more data in, or you put more compute in, or both, and the thing learns more. And all you can do to control the efficiency of learning is make the algorithm more efficient. But I think the learning algorithms are already somewhat efficient; they may be 20 to 25% of the way there for most tasks.
And at best things are going to learn four times faster, but we're still going to be constrained by the fact that we need trillions and trillions of tokens and we need a huge amount of flops, right, in order to actually get the thing to learn. And so we're not going to have runaway intelligence in a traditional sense.
Prateek Joshi (36:39.042)
What's your wildest AI prediction for the next 12 months?
Alan Cowen (36:43.83)
My wildest AI prediction for the next 12 months, and maybe this is my myopic view, is that we're going to have voice interfaces in a lot of different products. ChatGPT and text interfaces were the first revolution. But once people realize what voice interfaces are capable of doing in conjunction with a graphical user interface, we're going to see them take over.
Prateek Joshi (37:12.302)
All right, final question. What's your number one advice to founders who are starting out today?
Alan Cowen (37:19.458)
So I think, well, one piece of advice is that any company you're going to found today is in some sense an AI company, if it's scalable, if it is a technology company. There are lots of other plays you can make, but if you're going to start a tech company, I think in some sense it is an AI company at this point. And you should focus on what is going to be possible in six months with the models, and really understand where the models are going. That's sort of an essential thing to do,
because otherwise you're going to be building to fill gaps that exist now but won't exist in six months, and it'll take you six months to build it. And by the time you build it, the gap won't exist anymore. So you have to be much more predictive of macro trends than you used to have to be.
Prateek Joshi (38:06.518)
Yeah, that's fantastic advice. So, Alan, first of all, thank you so much for sharing all this insight. Love the topic, love everything about it; these are hard-earned insights. So thank you so much for coming onto the show and sharing it.
Alan Cowen (38:21.418)
Of course, yeah. Thanks for having me.