Infinite Curiosity Pod with Prateek Joshi

Digital Replicas That Can Have Real Conversations

Prateek Joshi

Hassaan Raza is the cofounder and CEO of Tavus, a video API platform for digital twins. They've raised more than $28M in funding from investors such as Sequoia and Scale VP.

Hassaan's favorite book: Go Like Hell (Author: A. J. Baime)

(00:01) Introduction
(00:38) Overview of AI in video generation
(01:44) AI models used in video generation
(03:35) Capturing intricate facial movements in real-time
(06:46) Data capture and 3D modeling from basic video input
(09:01) Explanation of neural radiance fields and Gaussian splatting
(10:14) Capturing facial expressions for video generation
(15:22) Temporal coherence in video generation
(18:05) Challenges in conversational video, including lip-syncing and emotion alignment
(20:38) Inference challenges in conversational video
(22:47) Bottlenecks in the pipeline: LLMs and time-to-first-token
(26:58) Multimodal models and trade-offs
(27:36) Advice for founders running API businesses
(30:04) Pitfalls to avoid in API businesses
(32:15) Technological breakthroughs in AI
(34:10) Rapid-fire round

--------
Where to find Prateek Joshi:

Newsletter: https://prateekjoshi.substack.com 
Website: https://prateekj.com 
LinkedIn: https://www.linkedin.com/in/prateek-joshi-91047b19 
Twitter: https://twitter.com/prateekvjoshi 

Prateek Joshi (00:01)
Hassaan, thank you so much for joining me today.

Hassaan (00:04)
Yeah, thanks for having me. I'm super excited.

Prateek Joshi (00:07)
Let's start with the fundamentals. Video generation, obviously, has taken the world by storm. People are excited. People are doing so many things that, to me, would have been completely impossible 10 years ago unless you knew some deep Adobe secrets. You couldn't do anything with video, but now it's amazing. So can you walk us through how AI is being used to create these videos?

What are the basics that we need to know?

Hassaan (00:38)
Yeah, yeah, great question. So I think there are a few different domains, right? There's, of course, novel video creation, especially through large vision models and large image models, where now, essentially, you can have an idea, something completely crazy, and you can get a very, very good representation that essentially

helps you actually take your idea and put it to paper, really, which is amazing. So that's one element. And then there's another element, which is the one that we've more specifically focused on, which is the ability to extend your likeness and to create a replica of yourself so that you only have to record once. And then you can create millions of videos without having to record them. And I think it's a very different, targeted use case where

we're focusing on extending your likeness. I think those are the two different spheres that we see, and both are incredibly compelling. If you're someone who's interested in video, or if you want to get into video, there's been no better time than now.

Prateek Joshi (01:44)
All right, let's talk about the underlying models that are used. Obviously, there are so many different models people have tried. There are variations, but in your experience of building this and deploying this, what models have worked well in terms of creating just good, high-quality videos that don't obviously fail at basic things like breaking the

laws of physics or having seven eyes? So what models have worked reasonably well?

Hassaan (02:16)
Yeah, yeah, yeah. So on the novel video generation side, recently I've been really impressed with Kling. It's been a really, really impressive model that I've been playing around with a lot. Of course, you know, anytime I get insight into Sora and the outputs there, those are amazing, and I'm super, super excited about that. But then, you know, on the...

on the sort of replica side of things and being able to create videos through your likeness. That's also a very, very difficult problem in making sure that it looks like you and sounds like you and still looks very, very realistic and authentic. And so what's really interesting is that the approaches for both of them are different right now. Creating these novel videos

focuses on these very large diffusion-type models, versus creating a replica of you, or an avatar as some might call it, which is based on neural rendering, Gaussian splatting, and NeRFs. It's creating a 3D representation of you, and so there are very, very different ways to accomplish those today. And I think in the future, some of those will merge together.

Prateek Joshi (03:35)
You mentioned likeness a couple of times. So let's talk about that for a minute. How are today's systems capturing intricate facial movements in real time? And also, what technologies are involved to make sure that it looks smooth to the average user?

Hassaan (03:57)
Yeah, yeah, great, great question. So at a high level, whenever we're extending your likeness and creating a replica of you, the input requirement, for us at least at Tavus, is about two minutes of video of you talking. And we want you to talk about something interesting, because typically the videos that you're putting back out are going to be something where you want to seem excited. And we'll get into some of the expression stuff in a minute, but the idea is, from that two minutes of training video,

what is happening is that the models are learning; they're essentially creating a representation between what you're saying, the audio itself, and the expressions, how you're moving your face, how you're moving your lips. And once that representation is created, it's also creating a 3D model of you, using either NeRFs or Gaussian splatting, to actually create a 3D morphable model that

can then be manipulated. And then finally, it's actually taking in the textures and putting them into that space as well. On the output side, it's almost reversed. What you do is you first generate audio, and the voice cloning models have gotten really, really good these days. The idea is, from a Tavus perspective, once we have the audio generated, we do audio-to-expression. So we transfer audio to

expression parameters that match your likeness, your face. We put those expressions into this 3D morphable model, which then manipulates your face in a 3D space. And then we're able to apply textures back on top. And then we can run a model that basically ensures that it all pieces together accurately. And so obviously there are a lot of sort of...

loss functions that determine if likeness is preserved and if textures are preserved. So that's a high-level view of how it works.
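To make that output pipeline concrete, here is a rough sketch of the flow Hassaan just described: generated audio goes to expression parameters, the expression parameters drive a 3D morphable model, and textures are re-applied on top. Every function name, shape, and model in this snippet is a hypothetical placeholder, not Tavus's actual stack.

```python
import numpy as np

def audio_to_expression(audio_waveform: np.ndarray) -> np.ndarray:
    """Hypothetical audio-to-expression model: maps a waveform to a sequence
    of per-frame expression parameters (blendshape-like coefficients)."""
    num_frames = len(audio_waveform) // 640   # e.g. 16 kHz audio rendered at 25 fps
    return np.zeros((num_frames, 64))         # 64 expression coefficients per frame (placeholder)

def drive_morphable_model(neutral_mesh: np.ndarray,
                          expression_basis: np.ndarray,
                          expression_params: np.ndarray) -> np.ndarray:
    """Deform a neutral 3D face mesh with per-frame expression parameters.
    Classic 3DMM-style linear model: vertices = neutral + basis @ params."""
    # neutral_mesh: (V, 3), expression_basis: (V, 3, K), expression_params: (T, K)
    return neutral_mesh[None] + np.einsum("vdk,tk->tvd", expression_basis, expression_params)

def render_with_textures(meshes: np.ndarray, texture: np.ndarray) -> list:
    """Placeholder for the neural-rendering step that re-applies the person's
    textures to each deformed mesh and produces RGB frames."""
    return [texture for _ in meshes]   # a real renderer goes here

# Sketch of the whole output path for one utterance
audio = np.random.randn(16000 * 4)               # 4 seconds of generated (cloned) speech
params = audio_to_expression(audio)              # 1. audio -> expression parameters
neutral = np.zeros((5000, 3))                    # 2. identity-specific neutral mesh (from enrollment video)
basis = np.zeros((5000, 3, 64))                  #    identity-specific expression basis
meshes = drive_morphable_model(neutral, basis, params)
frames = render_with_textures(meshes, texture=np.zeros((512, 512, 3)))  # 3. textures back on top
```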

Prateek Joshi (05:57)
That's amazing. As a user, as you said, I come in, I talk for two minutes, and you capture everything you need to know so that I can create videos of myself saying a bunch of different things, with a bunch of different expressions, in different contexts, so that I don't need to keep re-recording the same video. Which is fantastic for scale, because I can do so many things now. Now, you mentioned 3D modeling. So can you talk about how

the data capture happens, especially when you're given a pretty okay camera? Most people have okay cameras, like a simple 2D capture camera, and from that you have to extract a whole bunch of information. So can you talk about the modality of data capture, and also what are all the things you have to do to extract information from a basic video?

Hassaan (06:46)
Yeah, absolutely. So a lot of the focus at Tavus, and I think in the industry, has been how can you make it really, really easy to create a replica of yourself, an avatar of yourself? Because the idea of going to a studio where they have many, many cameras around you and you're doing all sorts of stuff is prohibitive, right? But also just a very time-consuming and very stressful sort of situation.

Prateek Joshi (07:10)
Right.

Hassaan (07:15)
And so the models, especially as we've seen with NeRFs and Gaussian splatting, what has happened is there's a translation layer where you can actually take an essentially 2D video, so just a webcam video, and it's been trained enough to be able to then create a 3D representation through essentially estimation. And so it can create a really, really high-quality model of you. And the idea is that as you move around, so you know,

we ask you to talk naturally, and as you're moving around, it's capturing more and more of your data. And so certainly it probably won't get the back of your head perfectly right, but it can do a pretty good estimation. But if you do turn around, which we don't recommend in the video, then it can actually create that in its entirety. So NeRFs specifically, and Gaussian splats, these are very, very good at taking a 2D video and turning it into a 3D space.

And from our end, you know, that's all that we require. We try to make it super easy for the end user. So we say, hey, you don't even need two minutes, even a minute will suffice. And what we're doing then is we have a model that is very, very good at taking in that data. It's trained on a corpus of data on, you know, the human form and human faces and everything. And it's able to create a really, really robust representation of you in a 3D space. And so...

We want to make it easy, and we think that in the future we want to make it even easier. Not even a minute required, very, very little time required. So I think it's going to get much, much easier over the course of the next year.

Prateek Joshi (08:52)
You mentioned the terms NeRF and Gaussian splatting a couple of times. So maybe quickly, for people who may not know, can you explain what they mean?

Hassaan (09:01)
Yeah, so NeRFs and Gaussian splats are essentially, you know, NeRF stands for neural radiance field. And so they're essentially a methodology to create a 3D form, right? And, you know, Gaussian splatting is the newer way of doing it. NeRFs were the first thing, which is, essentially, you can create a 3D form from 2D images.

And so it's a way to create a representation. If you ever see a clay-model-style render, that's typically something that can be created using one of these methodologies.
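For the technically curious, the core of a NeRF boils down to one compositing step: a network predicts a density and a color at sample points along each camera ray, and the pixel is the alpha-composited sum of those samples. A minimal, framework-free sketch of that step, not tied to any particular NeRF library:

```python
import numpy as np

def composite_ray(densities: np.ndarray, colors: np.ndarray, deltas: np.ndarray) -> np.ndarray:
    """Standard NeRF-style volume rendering along one ray.

    densities: (N,) non-negative density sigma_i at each sample
    colors:    (N, 3) RGB predicted at each sample
    deltas:    (N,) distance between consecutive samples
    Returns the composited RGB value for the ray.
    """
    alphas = 1.0 - np.exp(-densities * deltas)                               # opacity of each segment
    transmittance = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))   # light surviving to sample i
    weights = transmittance * alphas                                         # contribution of each sample
    return (weights[:, None] * colors).sum(axis=0)

# Example: 64 samples along one ray through a (hypothetical) trained field
rgb = composite_ray(np.random.rand(64), np.random.rand(64, 3), np.full(64, 0.05))
```

Gaussian splatting, by contrast, represents the scene as a set of explicit 3D Gaussians that are rasterized directly rather than queried point by point, which is largely where its speed advantage comes from.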

Prateek Joshi (09:43)
Coming back to the likeness and also how realistic it looks, one of the key parts of this is facial expressions. The facial movements are important, but facial expressions are also very important because people notice them. So can you talk about how you capture facial expressions for video generation? And also, how does it differ from something simpler like emotion detection

versus creating a full video that looks very realistic?

Hassaan (10:14)
Totally. This is something that we also think a lot about at Tavus, which is that there is this trade-off, actually, between making it easy for you to record and also having a lot of optionality and easy customization of the output, right? And what I mean by that is,

is it that you want it to be as close to what you look like when you're surprised? Or do you want it to output surprise without you having to sit there recording a surprised expression? That's often a trade-off that we think about at Tavus. But the idea is that people care a lot about keeping their likeness intact.

They want the video output to look a lot like what it would look like if they actually said this. What's really important is that we all move our faces in a different way for different expressions. Whenever we talk about surprise, well, Hassaan's surprise takes a very different form because my face has a very different structure, and how I actually express it, how my mouth moves, how my cheeks move, how my eyebrows move,

are all very, very unique compared to someone else. And so in keeping that likeness, if you want to keep it as close to reality as possible, you want to try to have that training data. You want to try to have someone come in with a surprised pose or a sad pose. Like, I probably look sad very differently than you do, right? But on the other hand,

if you have enough data, you can do an estimation of what someone might look like whenever they're sad or happy. It's just about controllability, right? Like, where do you land? Do you care more about the ease of being able to say, hey, say this in a happy tone, and it can just estimate what you look like happy? Or do you want to look exactly like you would in real life?

Prateek Joshi (12:19)
Yeah, that's actually such a good point. It goes into the realm of product design in some sense. It's not just about the modeling accuracy. Sometimes a very, very accurate model doesn't translate to what the user wants or what humans perceive, right? So it's amazing. Okay. Now, you mentioned earlier how you capture the two-minute video and then you analyze that and then you

create the output. Now, in the final step, when audio comes out first and then everything else follows, can you talk about how facial expressions are translated into the final video? And also, there are subtle nuances that you may have to capture. So how do you ensure the quality, if you will, of the facial expressions? And how do you benchmark?

Hassaan (13:13)
Yeah, yeah, great question. So on the benchmarking end, I think it's all about evals, right? And creating robust evals that compare to ground truth. And ground truth in this context is, especially if you have the original training data, you know how someone is saying something and how their face is moving, and you can then compare that against generated versions and see how close those things are. That's a great, very simple eval to say,

OK, this model is performing at a high likeness and high similarity index compared to ground truth. On the side of the representation, of taking in this data and then being able to generate it again for novel audio, the idea is, during training time, as you're moving your face, as you're moving your eyebrows, all those things are converted into expression parameters. And it's sort of on a

linear basis, and it's converted into expression parameters that the model is learning, and it's specifically fine-tuned for you, right? So every time you create a replica, there's a sort of fine-tuned model being created for you, with fine-tuned weights just for you. And it's not just saying, hey, repeat this expression over and over again. It's actually creating a relationship between the words being said, the audio being said, the actual tagged expressions for that,

and then how you actually manipulate your face. And then for novel audio, once it comes in, the same thing sort of happens, which is: given novel audio, you deconstruct it and you say, this is what this audio looks like, and these are the expressions that it captures. Now let's actually create new expression parameters that match this waveform more accurately, right? And then the eval, again, is comparing it to ground truth, like...

How similar is this? Does this deform the face at all? There are deformity measurements you can do. All of those come together to ensure that it's a very, very high-quality representation of you.
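A stripped-down version of the ground-truth eval described here might look like the following: re-generate a clip whose real footage you already have, extract landmarks or expression parameters from both versions, and compare them frame by frame. The landmark extractor is a hypothetical stand-in for whatever face tracker you'd use in practice.

```python
import numpy as np

def extract_landmarks(frames: np.ndarray) -> np.ndarray:
    """Hypothetical landmark/expression extractor: (T, H, W, 3) frames ->
    (T, 68, 2) facial landmarks. In practice this is a pretrained face tracker."""
    return np.zeros((len(frames), 68, 2))

def likeness_eval(real_frames: np.ndarray, generated_frames: np.ndarray) -> dict:
    """Compare a generated clip against its ground-truth (held-out) footage."""
    real = extract_landmarks(real_frames)
    gen = extract_landmarks(generated_frames)
    per_frame_err = np.linalg.norm(real - gen, axis=-1).mean(axis=-1)  # mean landmark distance per frame
    return {
        "mean_landmark_error": float(per_frame_err.mean()),
        "worst_frame_error": float(per_frame_err.max()),   # catches visible deformities
    }
```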

Prateek Joshi (15:22)
And I want to spend a minute on the differences between just image generation and a video that you have to produce. In many cases, for a still image, you can do your thing, the image comes out. It's doable. But in video, you have to worry about things like consistency and temporal coherence between frames. So how do you do it? How do you ensure consistency

and temporal coherence between frames here.

Hassaan (15:52)
Yeah. Yeah, so it's a good point. And it's one of the reasons why, thus far, the models that we have in production today don't use a diffusion-type approach, just because diffusion models today have a speed or a temporal consistency issue. In our case, without diving way too much into the details, the use of

creating this 3D morphable model means that essentially, because we actually have a model of you in this 3D form, and we have this audio to expression model, we can accurately convert waveforms to expression in a way that's temporally consistent, and then the textures are put on top of that. So because of the architecture itself, we don't necessarily have to worry about temporal consistency as much, especially

given that, with neural rendering techniques now, you can essentially use that to ensure there is temporal consistency frame to frame. And so it's just a very, very different way of doing it, right? And that's why, getting into diffusion models and how they relate to our industry, I do think that in the future these will sort of combine and there will be a convergence. But for the time being, you know, diffusion models inherently

will struggle with this temporal consistency issue, which is very, very evident whenever you're trying to create a video that looks like a human.
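If you want to put a number on the temporal-consistency issue being described, one simple proxy is frame-to-frame jitter of tracked facial landmarks in the generated clip; below is a sketch of that metric, under the assumption that you already have a landmark tracker producing `(T, 68, 2)` arrays.

```python
import numpy as np

def temporal_jitter(landmarks: np.ndarray) -> float:
    """landmarks: (T, 68, 2) tracked facial landmarks for a generated clip.
    Returns the mean frame-to-frame landmark displacement; lower = smoother."""
    deltas = np.diff(landmarks, axis=0)                  # (T-1, 68, 2) motion between consecutive frames
    return float(np.linalg.norm(deltas, axis=-1).mean())

print(temporal_jitter(np.zeros((50, 68, 2))))  # 0.0 for a perfectly static clip
```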

Prateek Joshi (17:30)
Going into conversational video, you want your likeness to hopefully, eventually, have a real conversation, whether it's a customer support agent or something else. So in the videos that you generate, how do you manage things like lip-syncing and emotion alignment when the output comes out? And also, what are all the things you have to sync

in the first place? I notice it's not just video and audio, there are other things. So what do you have to sync?

Hassaan (18:05)
Yeah. Yeah. So for conversational video, it's like you take video generation and you say, all right, let's just 10x this problem. With video generation, if it's not fully real time, you're okay. Now you have to do everything, but you have to do it in real time. And there is a little bit more forgiveness, because it's only one frame that they'll see; they can't replay it too many times over and over again. But nonetheless, the models that we create for video generation are actually the same models that power

conversational video for us as well. And so similar problems, the same problems, apply. But the difference is you have a lot less time to do all those things, and that's where the complication comes in, right? You have a lot less time to do any form of audio generation. You have less time because, in about 500 milliseconds, you have to transcribe the audio coming in, figure out what that audio is,

use an LLM or some sort of language model to figure out what the output needs to be, then create audio from that, then put it into this real-time replica, and get the first frame out in 500 milliseconds, right? And so all of those things have to be orchestrated very, very closely together. So I think the thing about conversational video is that for a high-quality experience, all of these modules have to be very, very tightly coupled together, right?

You essentially have to really optimize for inference here. And so it's a very different problem set. We often have to worry about very different things than video generation, even if it's the same model powering it.
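Spelling out that orchestration helps: each stage has to hand off to the next inside a single latency budget. The sketch below uses made-up stage functions and timings purely to show the shape of the loop; none of these correspond to a real API.

```python
import time

def transcribe(audio_chunk):        # ASR: incoming user speech -> text
    return "hello there"

def generate_reply(text):           # LLM: text in -> first tokens of the reply out
    return "Hi! How can I help?"

def synthesize_speech(text):        # TTS / voice clone: reply text -> audio
    return b"\x00" * 16000

def render_first_frame(audio):      # replica renderer: audio -> first video frame
    return "frame_0"

def respond(audio_chunk, budget_ms: float = 500.0) -> str:
    """Run one conversational turn and check it against the latency budget."""
    start = time.perf_counter()
    text_in = transcribe(audio_chunk)
    reply = generate_reply(text_in)          # in practice the dominant cost (time to first token)
    speech = synthesize_speech(reply)
    frame = render_first_frame(speech)
    elapsed_ms = (time.perf_counter() - start) * 1000
    assert elapsed_ms < budget_ms, f"missed budget: {elapsed_ms:.0f} ms"
    return frame

first_frame = respond(b"...user audio bytes...")
```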

Prateek Joshi (19:47)
Yeah, it's a great point. It's like you take all of the problems of video generation, you 10x it, and you get to face conversational video. Okay. Yeah.

Hassaan (19:54)
Exactly, exactly. The good thing is that video conferencing over WebRTC is limited in resolution, so you can't see everything in 4K. So there are some wins there.

Prateek Joshi (20:07)
Right, right, 100%. Now you mentioned inference, which I think is a great next topic to dive into. Inference becomes a lot more relevant in a setup like conversational video. So just in general, during the inference stage, for video generation or real-time conversational video, what challenges

do you have to deal with when it comes to inference, just in general?

Hassaan (20:38)
Yeah, on conversational video specifically, I mean, this is also true for video gen, but on conversational, I think the biggest thing you have to worry about is being able to generate frames faster than real time, essentially. And so let's say that we output at 25 frames per second. You have to be generating at over

30, 35 frames per second. And what that requires is a pretty powerful piece of hardware, right? It's not something you can just run on A10s; it requires A100s. And then there's also so much memory optimization you have to do, fitting all these models together, because it all has to be tightly coupled.

You're sort of running this all locally. And so all those things together create an infrastructure challenge for sure, an architecture challenge. But then also these models have to be hyper-optimized, right? The difference with video gen is that it can take a little bit longer, and you can say, hey, I'm going to take a little bit longer because I want higher quality, and it doesn't really matter. Like, what's 300 milliseconds, right? It doesn't really matter.

But for conversational, 50 milliseconds is a massive add-on. We're measuring 50-millisecond differences. We're measuring 10-millisecond differences whenever we do something. So it's a very different game. But the biggest, most important thing is being able to generate frames faster than real time.
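"Faster than real time" has a simple operational meaning: the rate at which frames are generated must exceed the playback frame rate, with headroom. A tiny check along these lines (the 25 and 35 fps figures are just the ones mentioned above):

```python
def real_time_factor(frames_generated: int, wall_clock_seconds: float,
                     playback_fps: float = 25.0) -> float:
    """>1.0 means frames are generated faster than they are played back;
    <1.0 means the live stream stalls."""
    generation_fps = frames_generated / wall_clock_seconds
    return generation_fps / playback_fps

# e.g. 35 frames generated in 1 second against 25 fps playback
assert real_time_factor(35, 1.0) > 1.2   # keep some headroom, as discussed above
```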

Prateek Joshi (22:19)
And earlier you mentioned the pipeline in conversational video, where audio goes in, there's transcription, understanding it, then creating a text response, then creating audio, and it goes on. So during that process, what is the biggest bottleneck? Or rather, if you had to arrange them in descending order of, this takes the most time

versus the second most, what are the top three?

Hassaan (22:47)
Yeah. The easy answer is the LLM, especially if you're using your own LLM. You know, at Tavus we've optimized every other step to be really, really fast. And so in, let's say, 600 milliseconds, probably half of that is just the LLM, and that's with a fast LLM. Most people's LLMs are really, really slow.

And there is more focus now, but historically there hasn't been a massive focus on what makes for a really fast LLM for this use case. People talk about tokens per second and, sure, that's really great, but what really, really matters is time to first token, right? Because if your time to first token is long, then it doesn't matter how fast you can generate the rest of it; it's going to take a really long time for that first one to come back. And so the LLM by far

is the worst offender today, and it's why we often have to work with our customers really deeply to say, hey, we need to figure out how to optimize this, because we don't want to wait two seconds for a response.
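Time to first token is also straightforward to measure if the LLM exposes a streaming interface. The pattern looks roughly like this; `stream_tokens` is a hypothetical stand-in for whatever streaming client you actually use.

```python
import time
from typing import Iterator

def stream_tokens(prompt: str) -> Iterator[str]:
    """Stand-in for a streaming LLM call; yields tokens as they are produced."""
    for tok in ["Hi", "!", " How", " can", " I", " help", "?"]:
        time.sleep(0.02)   # pretend per-token decode latency
        yield tok

def measure_ttft(prompt: str) -> tuple:
    """Returns (time_to_first_token_ms, total_ms) for one streamed response."""
    start = time.perf_counter()
    first_ms = None
    for i, _tok in enumerate(stream_tokens(prompt)):
        if i == 0:
            first_ms = (time.perf_counter() - start) * 1000   # the number that matters for latency
    total_ms = (time.perf_counter() - start) * 1000           # what tokens/sec benchmarks emphasize
    return first_ms, total_ms

ttft_ms, total_ms = measure_ttft("How do I reset my password?")
```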

Prateek Joshi (23:52)
It's very interesting, because if you conduct a survey and ask the average person, hey, there's a thing that has text, audio, and video, what do you think takes the most time in this pipeline? They'll say video, because historically video has needed so much compute. But in this case, the LLM is the biggest blocker, which is very, very interesting. Now, is the solution to that a very custom, very small, use-case-specific trained LLM?

Hassaan (24:10)
Absolutely.

Prateek Joshi (24:22)
What needs to be done here to remove that blocker?

Hassaan (24:26)
There are a few things, right? One is absolutely smaller, more domain-specific LLMs that can really optimize for time to first token on optimized compute. I think the challenge that comes up is that we're providing this technology to developers, and developers already have their own LLMs that are trained on corpora of information. And so us providing a very, very small LLM, which we do,

doesn't always work, because people have very custom use cases, and so their LLMs end up being a little slower. But I think the actual long-term solution to this is that these pieces don't need to exist separately; they actually need to exist together. And it's something that we work on at Tavus. We already see this, we're seeing it with voice-to-voice models, and there aren't many that are really that amazing yet. But if you can actually

reduce the number of conversions that you have to do from one modality to the next, that really optimizes the latency and the time it takes to generate an output.

Prateek Joshi (25:32)
Interesting. And if you do build a multimodal model like that, on one hand the speed increases, but there must be a trade-off. What are you giving up to achieve that speed? Do you have to buy bigger hardware, more compute? What are you giving up to get it?

Hassaan (25:50)
Controllability is what you're giving up primarily today, at least. Right now, with these different modules, you have a lot of control over each of these modules. And they can be trained in a very specific way. Whenever you try to combine all these together, you lose the degree of controllability that you had before. There's a loss in converting from one to another, but there's also control, where you can intercept and say,

okay, that doesn't look quite right, let's do this before we transfer to the next step. And so that controllability is what you really lose. And I think that's going to be part of the challenge, right? Especially because, if we talk about hallucinations, if an LLM hallucinates text, well, there's Llama Guard and all these guards that you can put on top that check for those outputs. Now, if you remove that representation that converts from text to audio,

well, now it's all being done in this latent space, and so it becomes a little bit harder to maintain controllability. So that's what you lose in the process, which is going to be a trade-off for sure.
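The controllability being traded away here is essentially the ability to intercept between stages. With separate modules, you can check or rewrite the LLM's text before it ever reaches the voice and video models; in a fused speech-to-speech model that seam largely disappears. A schematic example, with a placeholder guard rather than any specific guard model:

```python
def text_guard(reply: str) -> str:
    """Placeholder content check run between the LLM and the TTS/render stages.
    With separate modules this interception point exists; in a fused
    speech-to-speech model it largely does not."""
    banned = ["confidential", "ssn"]
    if any(word in reply.lower() for word in banned):
        return "I'm sorry, I can't share that."
    return reply

def pipeline_turn(user_text: str, llm, tts, renderer):
    reply = llm(user_text)
    reply = text_guard(reply)      # controllability: fix or block before the next stage
    audio = tts(reply)
    return renderer(audio)

# Toy usage with stand-in stages
frame = pipeline_turn("What's my balance?",
                      llm=lambda t: "Your SSN is ...",
                      tts=lambda t: t.encode(),
                      renderer=lambda a: "frame_0")
```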

Prateek Joshi (26:58)
Moving on from the product to the business that you're running, more broadly a video API business. Historically, APIs have been a phenomenal category of business to run. People consume it; the more they consume, the more they pay. It's a great business model. Having done this, what are some of the things you'd advise other founders who are running API businesses, not just video, but any API business?

What are some of the things that they should do? And also more importantly, what are your learnings on things that don't work or people shouldn't do when they're running an API business?

Hassaan (27:36)
Yeah, I think focusing on the developer experience is the most important thing that you can do. And on that note, considering that the developer experience is your product, if you're an API company, docs aren't a separate thing that you do. Docs are the product, actually, and creating a really good experience is your product. So if you're starting an API company, I think that's the most important thing. I think that just

having an API, even if it's the most amazing API that everyone wants, and just putting it out there, sure, that might work. But ultimately, if you really want people to build deeply, especially as we've seen a lot more concern recently around platform risk, where people don't want to rely on OpenAI for everything, building this really great developer experience is paramount to building a long-term API company that stands the test of time.

Prateek Joshi (28:36)
Right. That's a great point. Also, any pitfalls to avoid? Through experimentation, maybe you've discovered things. What have you tried that didn't work?

Hassaan (28:49)
Don't overwhelm people, right? I'd say two things. One, don't overwhelm the customer, even if they're developers, by just throwing a bunch of stuff at them and saying, create something. That often results in it being very difficult to create things.

You do have to have a little bit of incremental disclosure, and you have to showcase, hey, this is how you can build this thing, and then this thing, and then this thing. Because otherwise, the assumption that someone else will understand everything that's in your head and be able to build what you're thinking is a common pitfall that, I mean, we've fallen into in the past, because we're like, let's just release everything. But it's like, okay, actually, let's take the time to prescribe a little bit how you can use these different things.

So that's one, and then the most important thing is focusing on reliability and stability, right? There's this concept of moving fast, and at Tavus we always talk about moving fast, but what's often not considered is that there are certain things you can do to set yourself up to move faster, even if it means being a little bit more thoughtful right now. So those are the two things I would say.

Prateek Joshi (30:04)
Right. Yeah. No, that's amazing. Especially the first point you mentioned, such a good point, because many times I see young founders, first-time founders, they build dev tools and they have some basic documentation. And if things are stalling and not working, they think the solution is to write much longer documentation, add a lot more details. Paradoxically, the funny thing is, if things are not working, you go in the other direction. It's like what

the Collison brothers did for Stripe. They showed up at the door and said, the solution is we'll do it for you, white-glove service. Forget the documentation; we'll sit next to you and do it for you until it's installed. The solution was not creating more documentation, because at that point in the early days, people don't really care about you or the product. So it's a great point you mentioned here.

Hassaan (30:42)
Great example.

And I think that building with your customers is so, so important, right? I think often the thought is, it's a SaaS business, you just give it to people, you don't have to give any support because Google doesn't give support, so why do I have to give support? Well, just give them a Discord channel and people will discuss. But actually, on the point of Stripe, that is still true today. Helping people build, helping people figure out what to build,

Prateek Joshi (31:08)
Alright. Alright, alright, alright.

Hassaan (31:22)
really being their partner. At Tavus, what we like to say is, our customers consider us an extension of their team. And that's the best thing that can happen for a company, whenever your customer feels like you're just part of their team.

Prateek Joshi (31:36)
Right. And funnily enough, even for a company like Stripe, very customer-obsessed, their documentation, I think, is the world's best. I mean, for a long time, that was the gold standard of how you document a dev tool: look at Stripe. And they're the ones who are providing the best white-glove service too, on top of the world's best documentation. So that's fantastic. All right, I have one final question before we go to the rapid-fire round. Today, there are so many

breakthroughs happening, so many things in AI. What technological breakthroughs in AI are you most excited about?

Hassaan (32:15)
Yeah, that's a great question. I'm really excited for the next generation of multimodal models that are coming up from open source. Specifically, I'm really bullish on the Meta Llama team. I think they are single-handedly accelerating progress in our industry a crazy amount. So I'm super excited. That's the thing I'm most bullish about.

Prateek Joshi (32:38)
Amazing, I agree. That was a fantastic change of direction. It's landing really well for Meta, the company, going from VR to now being the champions of open-source AI, leading the good fight. All right, that's amazing.

Hassaan (32:55)
Exactly. You know, there was so much questioning. People were like, Zuck doesn't know what he's doing. Why would he open source it? How will he ever make money on this? In the investor relations calls, people were like, will we ever make money? I don't know, maybe not. Now it's become clear that it is completely transforming the industry, right? What a good bet.

Prateek Joshi (33:17)
Yeah, amazing. And also, I recently read something where people are wondering, what's the ROI? As you mentioned, they're asking big companies like Facebook, what's the ROI on all the investments you're making? Where's the money? And forget the ROI; the fact that they're building and deploying these models, the tiny incremental improvements in ad targeting they've already made with this, has already returned the capital.

Hassaan (33:43)
A hundred percent.

Prateek Joshi (33:46)
People are not looking at that, the internal improvements from all these insane AI capabilities they're using to target better. They've already gotten their money back, and now they're just doing it to, you know, extend their advantage. So yeah, 100%. All right, great. With that, we are at the rapid-fire round. I'll ask a series of questions and would love to hear your answers in 15 seconds or less. You ready? All right, question number one: what's your favorite book?

Hassaan (34:10)
Yep, let's do it.

Go Like Hell by A. J. Baime. It's a Ford versus Ferrari Le Mans battle book. Very good.

Prateek Joshi (34:21)
Amazing. I love that whole story, the movie too. All right, next question. What has been an important but overlooked AI trend the last 12 months?

Hassaan (34:35)
Yeah, we talked about this briefly, but how specialized small models on very optimized compute have made multimodal conversational AI incredibly immersive and low-latency.

Prateek Joshi (34:45)
What's the one thing about video generation that most people don't get?

Hassaan (34:51)
The complexity and nuance in getting a realistic video. We see people always go for high-end studio production in front of a green screen, and it just looks really bad. We always say the webcam wins. It looks authentic. It's what people expect.

Prateek Joshi (35:05)
What separates a great AI product from a merely good one?

Hassaan (35:09)
The obsession over the small details, like focusing on the user experience, especially for those who are new to using AI, right? That's the most important thing, I think.

Prateek Joshi (35:19)
What have you changed your mind on recently?

Hassaan (35:23)
AI therapy. I was pretty against it at first, but now I think it will allow people to be way more vulnerable and honest and give way more people access to therapy.

Prateek Joshi (35:34)
What's your wildest AI prediction for the next 12 months?

Hassaan (35:38)
We'll be talking to AI digital twins every day for various tasks and it will be completely normalized.

Prateek Joshi (35:43)
Amazing. All right, final question. What's your number one advice to founders who are starting out today?

Hassaan (35:51)
Don't fall into the hype trap. Focus on building a great product with a strong foundation that you have conviction in, rather than trying to go viral. Heads down and build; everything else is a distraction.

Prateek Joshi (36:05)
Perfect. Hassaan, this has been a brilliant discussion. Loved the details on both the product, building the AI, and also the business side of it: what does it take to run a real company? Because product design, users, customer obsession, all these things come to mind. So thank you so much for coming on the podcast and sharing your insights.

Hassaan (36:25)
It was a lot of fun. Yep, thank you so much for having me.