
Infinite Curiosity Pod with Prateek Joshi
The best place to find out how AI builders build. The host Prateek Joshi interviews world-class AI founders and VCs on this podcast. You can visit prateekj.com to learn more about the host.
Infinite Curiosity Pod with Prateek Joshi
Behind the Scenes of AI Video | Amit Jain, founder of Luma AI
Amit Jain is the cofounder and CEO of Luma AI, an AI platform to create realistic-looking videos. They're backed by Andreessen Horowitz, Amazon, and AMD. He was previously at Apple, where he worked on Vision Pro.
Amit's favorite book: Skunk Works (Authors: Ben R. Rich, Leo Janos)
(00:01) Introduction
(00:26) How Can AI Models Generate Realistic Videos
(02:38) The Current Landscape of Video Generation
(06:16) Teaching AI Models the Laws of Physics
(09:47) Founding Luma: Deciding on the First Product Version
(12:56) Validating Market Need & User Feedback
(16:42) Key Learnings from Ray 1 Before Launching Ray 2
(21:24) Growth Hacks That Moved the Needle
(24:27) From Zero Users to Today: The Growth Journey
(27:53) Building a Community Around Luma
(30:57) Luma’s Technology Stack & AI Infrastructure
(36:42) Biggest Technical Challenges in Building Luma
(39:24) The Future of Video Generation & AI's Role
(41:54) Rapid Fire Round
--------
Where to find Amit Jain:
LinkedIn: https://www.linkedin.com/in/gravicle/
--------
Where to find Prateek Joshi:
Newsletter: https://prateekjoshi.substack.com
Website: https://prateekj.com
LinkedIn: https://www.linkedin.com/in/prateek-joshi-91047b19
X: https://x.com/prateekvjoshi
Prateek Joshi (00:01.436)
Amit, thank you so much for joining me today.
Amit Jain (00:04.844)
Hey, Prateek. Thank you so much for having me here. Excited to chat.
Prateek Joshi (00:08.636)
Let's start with the basics. Now, people have heard of AI-generated video; we've all used it, we've seen it. Can you explain what it takes to teach an AI model to generate one of those realistic-looking videos?
Amit Jain (00:26.904)
Yeah, absolutely. So the core technology behind video generation and image generation is diffusion models. People are very much used to thinking about LLMs, which are large language models. Diffusion models are closely related to language models. They're transformer models. They are very large. They train on large amounts of data. But they also differ from language models significantly. Language models generate one piece of text at a time,
and then they keep typing, as you see. Diffusion models, on the other hand,
generate basically a whole video at a time. And the way you teach them is you show them billions of videos, and the objective works like this. You have a video in your training set. You completely noise it. You corrupt it so much that it's actually very difficult for a human to understand what is happening in the video. And the objective function for the model is to then reconstruct that source video from this extremely noisy state.
So this is the bottleneck. This is the challenge that we present the model with. And how well it is able to reconstruct the source video is this idea of diffusion loss. So generating video is basically this process in reverse. And we train these models with text and video pairs, so it understands, like, OK, if I'm saying, bunny
jumping in a field. Just like how language models learn what the meaning of words is, video models jointly learn what words are and what they mean in the visual space. So now when you go and type in, like, hey, bunny jumping in a field, it will be able to produce many, many, many examples of a bunny jumping in a field. You could also go and say, like, bunny jumping on the moon. And because it has learned the moon,
Amit Jain (02:29.526)
And because it has learned what a bunny jumping looks like, it is able to actually produce this extremely unrealistic scenario that was never in the training data.
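For readers who want to see the objective Amit describes in code, here is a minimal PyTorch sketch of a denoising diffusion loss for video. It is an illustration only, not Luma's code: the TinyVideoDenoiser module, the tensor shapes, and the noise schedule are assumptions made for the example, and real systems condition on text embeddings and use far larger transformer backbones.

```python
# Minimal, illustrative sketch of the denoising objective described above.
# Module, shapes, and schedule are hypothetical -- not Luma's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVideoDenoiser(nn.Module):
    """Toy stand-in for a large video diffusion transformer."""
    def __init__(self, channels=3):
        super().__init__()
        # A single 3D convolution over (time, height, width); real models
        # are large transformers conditioned on text embeddings.
        self.net = nn.Conv3d(channels, channels, kernel_size=3, padding=1)

    def forward(self, noisy_video, t):
        # t (the noise level) would normally be embedded and injected;
        # it is ignored here to keep the sketch short.
        return self.net(noisy_video)

def diffusion_loss(model, video, alpha_bars):
    """video: (batch, channels, frames, height, width), values in [-1, 1]."""
    batch = video.shape[0]
    # Pick a random noise level per training example.
    t = torch.randint(0, len(alpha_bars), (batch,))
    a = alpha_bars[t].view(batch, 1, 1, 1, 1)
    noise = torch.randn_like(video)
    # Corrupt the clean video: heavily noised at large t.
    noisy = a.sqrt() * video + (1 - a).sqrt() * noise
    # The model's job is to recover what was destroyed (here: predict the noise).
    pred = model(noisy, t)
    return F.mse_loss(pred, noise)

# Usage with random tensors standing in for a real training batch.
alpha_bars = torch.linspace(0.999, 0.01, steps=1000)
model = TinyVideoDenoiser()
loss = diffusion_loss(model, torch.randn(2, 3, 8, 32, 32), alpha_bars)
loss.backward()
```

Generation then runs this in reverse: start from pure noise and repeatedly denoise, which is the "process in reverse" Amit mentions.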
Prateek Joshi (02:38.994)
Amazing. Now, of all the tools available in the market, different tools have different capabilities. How would you describe the current landscape of video generation? Like what are the current tools good at and where are the gaps?
Amit Jain (02:56.43)
Right. So the way I think about the video generation landscape is pretty much in terms of language model generations. So let me give you the analogy here. With GPT-2, OpenAI for the first time showed that you could generate one or two coherent sentences. But if you asked GPT-2 to generate any length of prose, it would completely lose track of what it was talking about,
because the model was very small, didn't have much capacity, and also the architecture was pretty primitive. Then came GPT-3, and it was able to generate not only prose, but actually somewhat long-form content, keep track of things, make sure it is not hallucinating a lot. It was still hallucinating. And now with GPT-4, you can actually generate opuses, and it's OK. Video models, up until actually 2024, or the end of 2024, this year or last year, were in the GPT-2 era.
They could generate mostly two to five second clips, but the more important thing about that is that only one action could actually happen in the clip. And it would happen very, very slowly, because the models really don't have the capacity to represent more. To be able to generate coherent, real-looking videos, you need to generate a lot of information. Like, for language models, you know, 100 tokens per second is good, right? We generate
thousands and thousands of tokens per frame. So video models produce an immense amount of data, and they have to keep track of, like, if my hand is moving, what happens at this time? What happens at this time? What happens at this time? What happens in the environment around it? All these kinds of things. So they need a large amount of representation capacity. They need to be scaled. So pretty much every model before 2025 was exactly like that.
With the launch of Ray 2, so we just released a new video generation model, it's called Ray 2, the feedback and impression from pretty much the entire community is that this is the GPT-3 moment for video models. We scaled the model with about 10 times more compute and about 17 to 18 times more data to achieve this capability. And what you see in Ray 2 is this really fast, natural motion. Videos look like I just recorded them on my phone,
Amit Jain (05:17.79)
or someone actually made it in cinema. So this is kind of the division of the landscape. So you have companies like Runway and OpenAI's Sora, which are these older-generation models. They're kind of okay. You can think of using them for some demos, or maybe sometimes for B-roll footage, stock footage, things like that. And then there's models like Ray 2, which you can actually start using in production. You can use them for actually making clips you will see in a show someday.
Prateek Joshi (05:49.116)
Now, a while back, there was a period of time when these videos were taking off, you know, videos where the gymnast seems to just ignore physics, right? And in those videos, even though mathematically the error rate was low, the laws of physics still need to be obeyed. So how do you teach physics to these AI models? Sometimes, not sometimes, every single time, you
Amit Jain (06:01.326)
That's right.
Prateek Joshi (06:16.358)
optimize for the loss function but you have to obey the laws of physics.
Amit Jain (06:19.822)
That's right. That's right. Yeah, this is a very, very interesting problem. My background is in physics and math academically. So for me, this problem is particularly interesting, because the logical conclusion of what we are doing here is being able to model the world. That's what these are. They're world models, ultimately. We want to have a replica of the universe that is outside the model, inside the model. So of course, they need to be very good at physics.
They need to be very good at following the laws of nature, and of course violating them whenever you ask them to. And it's a challenging problem. So if you look at the previous generation of models, as you're saying, the gymnast problem: for those who are not familiar with it, when you asked a video generation model to generate a ballerina doing ballet, or a gymnast doing some fast moves on a beam, things like that, the model just completely mangled it up.
This is the finger problem from image model land. And why that happens is, there's nothing really special about this problem. What is happening here is that these are movements which are, first of all, not seen regularly, so they're not present in the data that much. But two, these are fast movements. And
Prateek Joshi (07:24.724)
That's right.
Amit Jain (07:44.642)
The model needs to now learn, in the span of two to three seconds, how to actually produce those movements of very soft bodies, right? Like, you know, these are not rigid things that move in a very deterministic way. Hands can bend, and things like that. So it needs to really learn how to represent that. Ray 2 actually makes quite a good attempt at solving it. Ballerinas are pretty much fully solved. Very fast gymnasts are not there yet.
Prateek Joshi (08:10.324)
Yeah.
Amit Jain (08:12.952)
So learning physics, now let's actually come to that. It's a problem that is very similar to the hallucination problem in language models. When language models hallucinate, it means they are violating the narrative that was established before. So let's say you set up the narrative of these two characters or these three characters: these are their personalities, this is what they do, and this is what happened to them. And now let's say the model hallucinates five paragraphs down about their identity or something like that. It's because it forgot.
Or it's because just the representation capacity is not there. So first, trivial answer to your question is that we need to scale. We need to scale video models until they have enough representation capacity to solve this problem. But there's more nuance to it, which is data. You need to actually teach the model and show it enough examples with great corresponding captions that describe those actions. So the model learns this is correct, this is incorrect.
And the third answer to that question is reinforcement learning, which we are exploring and starting to use quite a bit, where we want to teach our models what realistic physics is and reward them when they do the right thing and penalize them when they don't. But...
The overall outcome of this is that when you scale these models, show them the right amount of data and give them the right direction, they are bound to learn physics because they're modeling the world. So that is going to happen. We just need to scale.
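Amit mentions rewarding the model for realistic physics. Luma has not published how they do this, but a simple reward-weighted fine-tuning step, with a hypothetical physics_score() standing in for a learned reward model, might look roughly like the sketch below.

```python
# Illustrative reward-weighted fine-tuning step for physical plausibility.
# Not Luma's recipe: the toy denoiser, physics_score, and the weighting
# scheme are assumptions made for the example.
import torch
import torch.nn as nn

denoiser = nn.Conv3d(3, 3, kernel_size=3, padding=1)   # toy stand-in model
optimizer = torch.optim.AdamW(denoiser.parameters(), lr=1e-4)

def physics_score(videos):
    # Hypothetical reward model: would rate how physically plausible the
    # motion in each clip is. Random scores here so the sketch runs end to end.
    return torch.rand(videos.shape[0])

def reward_weighted_step(sampled_videos):
    """sampled_videos: (batch, channels, frames, height, width) clips
    previously generated by the model itself."""
    # Favor clips the reward model considers plausible.
    weights = torch.softmax(physics_score(sampled_videos), dim=0)
    noise = torch.randn_like(sampled_videos)
    noisy = 0.7 * sampled_videos + 0.3 * noise   # one fixed noise level, for brevity
    pred = denoiser(noisy)
    # Per-sample denoising error; the weighting pulls the model harder toward
    # reconstructing (assigning likelihood to) the physically plausible clips.
    per_sample = ((pred - noise) ** 2).flatten(1).mean(dim=1)
    loss = (weights * per_sample).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

reward_weighted_step(torch.randn(4, 3, 8, 16, 16))
```

Subtracting a baseline (an advantage) would additionally penalize implausible samples rather than merely down-weighting them; this is just one of several possible variants.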
Prateek Joshi (09:47.289)
Now, let's talk about the point at which you founded Luma. So obviously you have a background in this, and then you decided to found the company, Luma AI. So how did you decide, hey, these are all the things that need to go into the very first version of the product, the one you gave to your close friends for beta testing? Like, what went into that initial version?
Amit Jain (10:15.35)
Right. You know, that's always a hard question or the answer is always a hard thing to figure out.
The first question for us was to figure out how to train these models. Video model training is wickedly difficult. If you think about it, for language models, even if you're training on 15, 20 trillion tokens, at most that's about 300 terabytes of data. And I mean, that can fit on any node, any computer. But then when you start to think about video models, we currently train on tens of petabytes of data.
And that's like, you know, after filtering and doing everything. So it's a very hard problem. Our first goal was, can we build a model that at least gives us enough understanding of what people want?
And it's a new market and it's a new world, right? Because people are used to making videos with cameras and with the physical world. What these models do is they change that process, which happens in the physical realm, in real life, to being entirely digital. So we wanted to understand: how would people use it? What would they make out of it? How would they want to control it? All those kinds of things. So when we released our first model, which was Ray 1, our goal was basically like,
we are not going to solve all the scaling problems. We are not going to solve all the data problems, things like that. We want to get this out the door and make it good enough that people can find it useful. So that was our focus. So we built it over the period of about five months or so from scratch, from having nothing. And Ray 1 was the first video model that people had ever seen in the world. At the time, there were image models
Amit Jain (12:04.14)
that people were using, you know, these companies like Pika and others were trying to animate images. They were very weak facsimiles of video models. So when Ray 1 came out, actually, we used to call it Dream Machine back then, it caught on fire. It was on Good Morning America. It was in news stories, everything like that, because people had not seen that you could generate video, right? Of course, we didn't engineer for that, because how do you do that? But what we decided was like,
can we get something out that teaches us what we need to focus on in the scaled-up version? Because, you know, this was very interesting to think about. Luma is a research company, but research always needs direction. It's so open-ended. You could do a million things. So you need to figure out what to actually do. And that was our main motivation.
Prateek Joshi (12:56.212)
Amazing. Actually, I would love to explore that more, because many of the listeners, they're either tinkering, they're building, they're founders, they're in the early stage. And especially the launch moment, it's easy to screw it up. So going in that direction, how did you go about validating the need? Meaning, what questions did you ask your early users? What worked? What didn't? So how did you go about narrowing the scope to the one thing that works?
Amit Jain (13:26.816)
Absolutely. I mean, you have to talk to users. You have to talk to people who are trying to do things with it. And we talked to a lot of people, everyone who was experimenting and using existing models, people who were messing with image models, people who were using language models to write scripts, to think about these kind of things. We talked to a lot of people at that time. This is also how we gathered and curated data.
I mean, the advice I have is, honestly, it's okay. I'm an engineer. I have been programming for 20 years now, since I was 12. It's very difficult, honestly, as a first-time founder, to get into the groove of talking to users, because you have to take a very humble approach. Not in a very grand way, but just, you need to be willing to listen and to ask, what are you trying to do with it? And then half the time, my instinct is like, but that's not what you do with this.
That's not correct. But you know, when your product is out, my voice will not travel next to it. People will do with it what people will do with it. So you really have to get good at that. You have to build the muscle. You have to build the channels where you can actually talk to people, and do it in a very natural way where you're able to understand what they're trying to do.
Of course, everyone has heard the thing about 1,000 horses versus building a 1,000 horsepower thing. But when you're making frontier models, nobody's going to tell you exactly what they want to do with it because they've never seen anything like this. It's not like a SaaS product where I can go and find out exactly the set of requirements a company has or someone has. And then I can just fulfill them and I'm done.
It's truly building the future, and 99% of people cannot imagine a future they're not actively working in, right? I mean, they might have all the capabilities on the planet in their mind, but if they're not thinking about it day in and day out, they don't think about that. They can't imagine that. So they ask you, you know, things like, okay, this is what I want to do. But then you imagine, okay, but if you build this, it's going to be completely different. Only after you've talked to them do you learn what those things are,
Amit Jain (15:43.458)
how to make them very good, and those kinds of things. So I think that's the first thing. The second thing in my mind is: release, release early. So many founders have a phobia of releasing, or come from companies which, you know, basically ingrain in them that, no, this needs to be perfect before you release it. Right? Ray 1, when we released it, I mean, it was the best video model when it was released, obviously, but man, did it have flaws.
Like, Ray 2 is not a two times or four times improvement on Ray 1. Ray 2 is about 10 times better, because Ray 1 was also bad in hindsight. But nothing else existed, so we didn't have any comparison. So it was very good to release and get all the feedback that we got, all the traction that we got, everything that comes with a release, and learn, basically.
Prateek Joshi (16:35.774)
There's a famous meme video, like a joke, where the creator makes a nice puzzle with different shapes and different wooden blocks to go into those shapes. And they give it to somebody. And one by one, the person takes every wooden block and puts it through the square hole. And you're like, no, that's not how you're supposed to use it. But the user, they'll do what they'll do. It's very funny. OK, that's great. Now, going back to Ray 1, you released it. And it started working. It's great.
Amit Jain (16:43.982)
yeah.
Amit Jain (16:51.116)
That's right.
Yeah? Yeah? That's right.
Prateek Joshi (17:06.248)
Before you could release the next version, what did you learn before you released Ray 2? What were the learnings? What were the things where, hey, this thing worked, we'll take it to Ray 2, and that didn't, so we're letting it go?
Amit Jain (17:20.844)
Yeah. Right. Okay. So I'll answer what didn't work afterwards, but what we got feedback on: people aggressively want to use it to produce the final thing, is what I mean. In my mind, every other model that exists out there right now is a very good stock video generator. It can make stock videos, like Adobe's products today, like Adobe's video model, I forget the name now,
and some of the other things. They're good models, but ultimately they produce stock videos. That's good. I mean, you now have customized stock video. But, you know, the ambition that the best in our community have is: how can I make something that could only be made by Hollywood, but, you know, with a group of four guys, right?
That is, I think, really, really exciting. Why? Because, you know, it means storytelling becomes basically accessible to everyone. And it actually has an effect across the stack, right? People who could never tell their stories, who needed permission and capital from, you know, various sources, which ultimately had the effect of censoring those stories or changing them or maybe never getting them out, can now do that.
Right? That's thing one. Thing two is people who are very good at this craft, right? Like, you know, people who make movies, people whose work we love to watch. Now, they have the creative freedom to try out a million ideas, right? Because what you have inside these models is a world, and you manipulate that world with text. How insane is that, right? So they have this freedom to actually do that. Like, before we launched the model,
We had very theoretical ideas about this. We had these benchmarks of a cube should stack on top of a square or a sphere should stack on top of a rectangular plate. These kind of things we would do. Of course, with prompt adherence and things like that, we still have those benchmarks. But the biggest thing was like, is this footage good for production? And two, how do we make it controllable? You have the world. Yes, you have text.
Amit Jain (19:42.136)
But how do you make it controllable in a way that you can create a full-form story out of this? It's a very hard problem. It's a very, very hard problem. With language models, it's easier, because your input mechanism is text, the native modality of thought of an LLM is text, and the output is text. So it's just one very nice circle. In video models, that's very different. Your input is text, your output is video. It thinks in video. It thinks in video and text.
And second, the data you're operating on is not this single-dimension text. On the other hand, it is this extremely complex three-dimensional information in pixels and in time. So how do you control it? So we learned a lot about how to control it. And then we learned a lot about the ideas, you know, that we were going to pursue if we had not released. We would have spent, you know, six months doing that, and now we're not doing it at all, because it doesn't matter, people don't care for it.
Prateek Joshi (20:43.134)
Right, that's a good segue into my next question. And this is about growth, more about growth hacks. Meaning, there are always these pivotal moments in a company's life when you look at the graph and you're like, what the hell happened here? Clearly something worked, and growth is not linear. Meaning, when it works, it's not like a controllable linear thing. Something happens and boom, it just goes up, and then you go up again. So.
Can you talk about maybe your top one or two big needle moving growth hacks that you did? And obviously maybe a couple of examples of the things that you tried but didn't work.
Amit Jain (21:24.45)
Good question. I mean, it was not so engineered, to be honest with you. But you know, we made it funny. We made it really funny. Like, honestly, we actually had a choice when we were fine-tuning Ray 1 between making it this very somber, serious model, right, or allowing it to sometimes produce, like, really wacky things.
And very consciously, that was a decision. We're like, you know, fuck it. We should just let it do the insane things that it does. And that went really far, because people had so much fun with it. People still have so much fun with Dream Machine, with Ray 2, with our products, because I think that's where it lies. So if you make something, I mean, it of course doesn't work in every field,
but if you're in a field like this and you can make something that people can have fun with and share with their friends because, you know, this is funny, that's really cool. I quite like that. Okay, there's another thing we do, which is that we don't gate access. We try like hell to reach general availability, to GA our models, on day one. And it's very difficult.
There is no amount of compute in the world that can actually service all the demand. So we have so many mechanisms for rate limits and gating, but also to be able to allocate all our thousands of training GPUs for inference when needed, these kinds of things, so that we can actually make it available to everyone. So it becomes this moment where everyone is experiencing it at the same time. Like Game of Thrones, when that was a thing, everybody would watch it
and then talk about it the next day. And if you didn't watch Game of Thrones, it was impossible to avoid the spoilers. So that's what we tried to do. Could we actually create a moment where people share it with each other? I think it's really fun. We really relished that. We really enjoyed it. I don't know if you want to call it a growth hack; we think of it as a cultural thing.
Prateek Joshi (23:41.244)
I think that's great. I think it's a thing you can do to make more people want to try it. So I think that's great. Talking about, okay, let's go back to day one. Day one, you're ready to go and you have zero users, right? Nobody knows, at that point nobody cares. So from there to today, how did you get this in front of users? And what was your quick five-second pitch to try it? And I say pitch, obviously, it's not you talking, but like,
could be a LinkedIn post, could be Twitter, X, whatever it is, like what do you do to make people try this? And also maybe can you share like where you are today in terms of whatever metric you use, number of users or number of videos created, if you can share the journey from zero to today.
Amit Jain (24:27.33)
Yeah, I don't know if I'm able to share the KPIs we track. But the question is, yeah, the question is great. How do you get in front of people? How do you get people to care about what you're doing? It's not a given that you make it and people will care about it. Honestly, we're lucky because
Prateek Joshi (24:45.82)
Right, right.
Amit Jain (24:54.286)
Luma has such a beautiful community. Honestly, and I don't say that just to do lip service to the community we have. I live and breathe in our Discord community, in our WhatsApp communities, places like this. You know, I talk to our users, and the number of people in these smaller groups is not, you know, five or 10. Our Discord aside, the smaller group, the internal group, is 1,500. We talk to them, like, literally every day,
and I've gotten to know quite a few of them very closely. So I don't know. Of course, your story is yours to tell, but our community tells our story very well, much better than I do, honestly. They're so creative, and they share the things they make. They share their enthusiasm. They share their complaints. They share everything. We just launched Ray 2, right?
Prateek Joshi (25:38.28)
Mm-hmm. Mm-hmm.
Amit Jain (25:52.396)
And honestly, we haven't put a single dollar of marketing spend behind it so far, because I consider that to be antithetical to PMF discovery, because you can totally mask the search for PMF by putting marketing dollars behind it. You're going to get a bump, but is it coming because people want to use it, or is it coming because you just boosted it? So we don't do any of that. And we launched Ray 2, and I think that was our
Prateek Joshi (26:02.28)
Mm-hmm.
Amit Jain (26:21.934)
biggest launch. There were millions of views on Twitter, and there were 13,000 to, I believe, 16,000 posts in the span of a day, day and a half, of people sharing their stuff. You know, I don't have a platform even a tenth of that size. I can't do that. Luma can't do that. It's our community that actually does it. So I'm unbelievably grateful. I think we are very lucky
Prateek Joshi (26:31.582)
Mm-hmm.
Prateek Joshi (26:40.124)
Right.
Right.
Amit Jain (26:49.39)
that our users love us and that they want to share with the world what they can do. It's also a very counter-cultural moment, honestly. Some of them actually run up against so much opposition from their peers, from their friends, from people in their groups who are like, why are you using AI? This is going to take our jobs, or what have you. But they're showing, these folks are showing, what you can do.
So I think that they're also pioneers. I think they're also, they're kind of doing some really interesting things and they tell our story and that's amazing.
Prateek Joshi (27:26.512)
Amazing. And actually, maybe a deeper question: there's a joke around community building, that we started this community not because it's easy but because we thought it would be easy, right? Because community building is so incredibly hard. First of all, how do you get people to give a shit? Like, why would they join your thing? So in the early days, how did you, I guess, maybe the product speaks for itself, so people want to join, but
Amit Jain (27:34.298)
yeah?
Prateek Joshi (27:53.948)
it's no joke to get so many people to join the community. So maybe in 30 seconds, what do you think was the key reason that people joined the community?
Amit Jain (28:04.098)
Yeah. So our main Discord group is now, I think, 120,000 people. Yeah, that's quite big. Or maybe I'm messing up the number. It's somewhere in that ballpark, but I might be missing the exact number. I haven't checked in the last few weeks. How did we get them to join? So, I mean, everyone who's at Luma,
Prateek Joshi (28:11.22)
That's crazy. Yeah.
Amit Jain (28:31.82)
this includes actually some of the best researchers on the planet. Luma's diffusion model team is considered to be the best in the world. This includes people from DeepMind, folks from NVIDIA, people coming from UC Berkeley who invented 3D generative AI, who worked on most of the groundbreaking stuff we take for granted today. But they're here because this is what they want to do, and they're just insanely passionate.
So we engage with people. We share our work, and we don't share our work under these heavy pretenses and marketing and sales pitches. We are just nerds about this. We are extreme nerds. Like, you know, this is physics simulation for me, and I am an insane nerd for that. So, you know, you find your groups, honestly. This is what the internet is beautiful for. You will always find your groups. It doesn't matter how disturbing the thing you're into is, or how boring, or how insane.
Prateek Joshi (29:28.279)
boy.
Amit Jain (29:30.402)
So we very early on found our groups, right? People who cared about this, people who saw this, people who thought this was coming. And of course, now AI is everywhere, and every artist, every creator, nearly every movie director, nearly everyone working in marketing is paying attention. So, you know, now building the community for us is not very difficult, but in the early days, yeah,
we just shared, and we found small groups, invited new people, they invited new people, and I think it just grew from that.
Prateek Joshi (30:02.388)
That's amazing, because consumer AI is having a moment. There was a bit of a lull for a while, but now I think it's a very interesting moment. And I think it's very, very hard to get users or consumers to care about anything, so anything that gets them to try it helps. And communities are a fantastic mechanism to get people to care and share and hopefully bring more people in. All right, so the next
topic I'm going to talk about is the technology stack that you use internally to ship a product like Luma. So many of the listeners, there's so many AI tools, there's so much choice available now. So whatever you can share, to build a product like Luma, with real production, you have users, everything, like going from sign up and generating the video and so on and so forth. So what is your technology stack?
Amit Jain (30:57.806)
Yeah, I can talk about everything from training up to, like, you know, serving and everything like that. So we currently train on, you know, quite a few thousand GPUs. Maybe I shouldn't say the exact number. The number is higher than five, but less than 10, there you go. So we train on a huge compute cluster of H100s that are, you know, hosted with our partners at Amazon, or AWS.
Prateek Joshi (31:13.268)
Yeah.
Amit Jain (31:27.342)
We are a PyTorch shop, so pretty much everything is written in Python and PyTorch. We also write quite a bit of CUDA to get the kind of performance we need from these cards; they're NVIDIA cards. We also work very closely with our partners at AMD. So we have started to experiment and also now deploy on those cards, and they have proven to be quite capable for us. So the MI300X series. Again, we use PyTorch with that. On the training side, in addition to that, of course, there's a myriad of tools like Weights & Biases and those things that we use for experiment monitoring. We have quite a few tools we've built in-house for data processing and those kinds of things. Kubernetes tends to make an appearance everywhere, whether it's in training or data processing or serving, everything like that.
We have a very, very good infra team that is able to pull off miracles. By the way, for a very small team, running thousands of GPUs in production and in training is not trivial. GPUs fail at a very alarming rate. Very, very alarming rate. And making a reliable service is absolutely difficult. So yeah, we also have a lot of monitoring tools that keep an eye on the performance of these cards, how much power they're able to draw,
Prateek Joshi (32:38.58)
Alright.
Amit Jain (32:52.11)
whether there is any link flapping. Okay, so these clusters are interconnected over InfiniBand, so logically it makes it look like this is one computer, but it's very fickle. 8,000, 10,000 of these connected together, it's very, very fickle. So you need great monitoring tools to understand where the link flapping is, and the tools to be able to identify which exact node the problem is originating in, all those sorts of things. So that's on the training side.
In production, we're using, well, of course, we write everything in Python. So, you know, our production code is also written in Python. There are some Go services here and there. It runs in Kubernetes. We have full-on CI/CD, so, you know, deployment is very trivial. It didn't use to be. I remember FTPing the model weights, but, you know, now it's much better. We use Vercel on the front end. We use...
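Going back to the GPU health monitoring Amit describes, here is a toy probe that polls nvidia-smi for power draw and temperature and flags suspicious cards. This is not Luma's tooling; the thresholds and the drain-and-restart policy implied in the comments are assumptions for illustration.

```python
# Toy GPU health probe in the spirit of the monitoring described above.
# Not Luma's tooling; fields and thresholds are illustrative assumptions.
import subprocess

FIELDS = ["index", "power.draw", "temperature.gpu", "utilization.gpu"]

def _num(value):
    # nvidia-smi can report "[N/A]" for some fields on some cards.
    try:
        return float(value)
    except ValueError:
        return float("nan")

def probe_gpus():
    """Return one dict per visible GPU by querying nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=" + ",".join(FIELDS),
         "--format=csv,noheader,nounits"],
        text=True,
    )
    gpus = []
    for line in out.strip().splitlines():
        idx, power, temp, util = [v.strip() for v in line.split(",")]
        gpus.append({"index": int(idx), "power_w": _num(power),
                     "temp_c": _num(temp), "util_pct": _num(util)})
    return gpus

def flag_unhealthy(gpus, min_power_w=80.0, max_temp_c=85.0):
    # A card drawing almost no power mid-training, or running too hot,
    # is a candidate for draining the node and restarting the job.
    return [g for g in gpus
            if g["power_w"] < min_power_w or g["temp_c"] > max_temp_c]

if __name__ == "__main__":
    for gpu in flag_unhealthy(probe_gpus()):
        print(f"GPU {gpu['index']} looks unhealthy: {gpu}")
```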
Prateek Joshi (33:42.531)
Hahaha
Right.
Amit Jain (33:51.918)
Of course, for the iOS apps, we use Swift. I've also done a lot of iOS engineering, so I'm very fond of Swift. What else can I tell you? Let's see. Any other part of the stack that you're particularly curious about?
Prateek Joshi (34:06.098)
Yeah, what does the user, maybe we can focus, maybe just double click on the components that the users interact with.
Amit Jain (34:17.27)
Okay, users. Okay, so the front end, as I was saying, you know, we serve it through Vercel. It's written in Next.js. iOS is written in Swift. You know, we have an internal API that is built using Python. It goes and talks to, you know, I don't believe we use any frameworks there; yeah, I think it is pretty much home-built. Then we have...
Then what happens is basically we have API gateways; these go into AWS. Then honestly, from there on, we have built a queuing system for inference. Now, inference is very difficult. Serving large-scale inference, especially distributed inference, is ridiculously hard. So we're working with some partners so that we don't have to do it, but currently we do. So when users submit a prompt, it goes into a queue.
There's a lot of work that happens on the language side to understand what the person is trying to say, what do we need to generate, all those kind of things. Then basically we have this batching system that is able to find where the GPUs are free in multiple clusters, goes to the one that is the fastest to serve this particular user. Again, for this we built an in-house queue that is able to span TCP as well as internal networks.
For inference, we use, of course, you know, TensorRT and some similar frameworks for inference optimization. Then model optimization, there's, yeah, I don't know, there are many, many, many things we have. We use quantization heavily. We use ways of, you know, batching the model and creating different tiers so that requests have different timings when they come into play. So, yeah, it has gotten very complex at this point.
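To make the queuing and batching idea concrete, here is a heavily simplified asyncio sketch of a prompt queue with micro-batching and least-loaded cluster selection. Names like Cluster, MAX_BATCH, and the routing policy are illustrative assumptions; the real system has priority tiers, spill-over onto training GPUs, and cross-network routing that this sketch ignores.

```python
# Simplified sketch of a prompt queue with micro-batching across clusters.
# Illustrative only; not Luma's serving stack.
import asyncio
import time

MAX_BATCH = 4        # cap on requests grouped into one inference call
MAX_WAIT_S = 0.05    # how long to wait to fill a batch

class Cluster:
    def __init__(self, name):
        self.name, self.in_flight = name, 0
    async def generate(self, prompts):
        self.in_flight += len(prompts)
        try:
            await asyncio.sleep(0.1)                 # stand-in for model inference
            return [f"video for: {p}" for p in prompts]
        finally:
            self.in_flight -= len(prompts)

async def dispatcher(queue, clusters):
    while True:
        batch = [await queue.get()]                  # block for the first request
        deadline = time.monotonic() + MAX_WAIT_S
        while len(batch) < MAX_BATCH and time.monotonic() < deadline:
            try:
                batch.append(queue.get_nowait())     # opportunistically fill the batch
            except asyncio.QueueEmpty:
                await asyncio.sleep(0.005)
        target = min(clusters, key=lambda c: c.in_flight)   # least-loaded cluster
        results = await target.generate([prompt for prompt, _ in batch])
        for (_, fut), video in zip(batch, results):
            fut.set_result(video)                    # hand each caller its result

async def submit(queue, prompt):
    fut = asyncio.get_running_loop().create_future()
    await queue.put((prompt, fut))
    return await fut

async def main():
    queue = asyncio.Queue()
    clusters = [Cluster("us-east"), Cluster("us-west")]
    worker = asyncio.create_task(dispatcher(queue, clusters))
    videos = await asyncio.gather(*(submit(queue, f"prompt {i}") for i in range(6)))
    print(videos)
    worker.cancel()

asyncio.run(main())
```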
Prateek Joshi (36:07.091)
Yeah.
Amit Jain (36:07.714)
But it's very, very interesting. This is the only way to serve this many users at scale.
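On the quantization Amit mentions, the snippet below shows generic PyTorch dynamic quantization on a toy MLP, just to illustrate the idea of storing weights in int8 to cut memory and speed up serving. It is not how Luma quantizes its video models, which would more likely involve custom kernels or TensorRT.

```python
# Generic dynamic-quantization example (stock PyTorch, CPU); illustrative only.
import torch
import torch.nn as nn

# A toy float32 module standing in for some serving-side component.
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))

# Store Linear weights as int8; activations stay floating point and are
# quantized on the fly at inference time.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 1024)
print(quantized(x).shape)   # same interface, smaller weights
```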
Prateek Joshi (36:13.54)
That's incredible. I can imagine, first of all, video generation, that's complex enough by itself. And then on top of that, add all of the headaches and hassles that come with serving so many users, who don't care if you're having a bottleneck in your infra. They're like, here's my prompt, I need the response now. So it's incredible. So maybe one last question here. Of all the things you had to solve,
Amit Jain (36:31.596)
Of course not!
Prateek Joshi (36:42.204)
of all the problems you had to solve, what was the biggest technological challenge you had to overcome to make Luma work for this many users?
Amit Jain (36:53.27)
I mean, the primary challenge is, of course, building really high-quality video models, frontier models, right? You know, nobody has done it, so you can't just hire for it, right? Building at the frontier is ridiculously hard. This involves research, this involves, you know, all the work in production and infra, and lots of experimentation. So of course, in the process, you build IP.
Prateek Joshi (37:00.724)
Mm-hmm.
Amit Jain (37:22.998)
In the process, you build really insane capabilities. You build a team that is able to invent. That's really huge. But of course, it's never easy. So once you have shown someone that this can be done, then people come and follow. That's fine. But building something on the frontier, I think that is generally very difficult. We have had to figure out, how do you build these models? How do you train them very efficiently? How do you make sure that the things they generate are exactly what users want?
In language model land, RLHF is very well understood and known, and people know how to do that. In our world, this is not so well understood. And how do you actually align user preferences with this Wild West that is the latent space of a video generation model? It's a tough problem. It's honestly a tough problem. Training on this much data and at this scale for diffusion models is also a very tough problem. Diffusion models are different enough from language models that the
same learnings don't apply. Like, you know, training them stably, training them correctly, that has always been a difficult thing. Second is imagining the product as well. We pay immense attention to the product, because our goal is not to just, you know, fling the model over the fence and let people figure it out. Our goal is to actually make it so that this is how video is produced in the world, that this is an indispensable tool
for anyone who's thinking about telling a story. So to get there, you have to think very deeply about what the future looks like, what the future capabilities of the model are, how you build the product, these kinds of things. We have a very, very good design and product team. Some of my friends from Apple joined us during our journey, and they put immense attention and care into thinking about how we make a good product. Of course, once you put it out, you get more feedback and those kinds of things, but that is really hard.
Making a good product is absurdly difficult. So yeah, model and product.
Prateek Joshi (39:24.306)
Yeah, all right. I have one final question before we go to the rapid fire round. And this is about the future of video generation. In the next two years, perhaps, how do you envision the video generation sector evolving? And what role does Luma play in that?
Amit Jain (39:28.226)
Yeah. Awesome.
Amit Jain (39:47.062)
Yeah. So video is currently thought of as just being for generating videos. But video is basically on the critical path to AGI. And why I say this is because currently we're mostly operating in text land, and we're able to reason through math and things like that. But the moment you want AI to affect the physical world,
you need to understand what the physical world is, what the rules of the physical world are, whether that is for robotics, whether that is in creative fields, whether that is just for understanding what the hell is happening. So video, and scaling video pre-training, is on the critical path to AGI. We are doing video because, one, of course, we believe these models are how the majority of things people will watch in the future will be made. This is how production is going to be done, because it's a 10,000 to 100,000x delta in cost of production.
Prateek Joshi (40:40.308)
Alright.
Amit Jain (40:40.492)
Whenever that happens, that's nuts, right? But more importantly, the path includes video, audio, language altogether, and actually making something that looks like the human brain. Because the human brain doesn't just operate on language. Even if you're, you know, blind and deaf from birth, you still have, you know, the millions of years of understanding of physical space and things like that, that you still use in your brain. So to actually create...
general intelligence that is able to operate for every human, that is able to operate and do better, this is what we need to do. So you're going to see very rapid progress in the next year and two years of people moving beyond language models, with Luma leading the charge on building the next generation of AI. So yeah, video is on the critical path to AGI.
Prateek Joshi (41:29.812)
Amazing. It's a great point, because many people think of video as just video generation. But it's a path to actually, if you want to build something that can touch and operate in the physical world, this is one path towards it. That's a fantastic vision. All right. With that, we're at the rapid fire round. I'll ask a series of questions and would love to hear your answers in 15 seconds or less. You ready? All right. Question number one: what's your favorite book?
Amit Jain (41:54.446)
Okay, yeah.
Amit Jain (41:59.822)
Skunk Works.
Prateek Joshi (42:02.132)
Amazing. Love that book. All right. Yeah, actually, maybe part B to the question: why do you like it?
Amit Jain (42:03.256)
Have you read it?
Amit Jain (42:11.118)
It's this story of doing exceptional things, like Skunk Works at Lockheed Martin. They made the spy plane, the stealth plane, they made the first hypersonic plane. They made impossible things like the B-2 bomber, right? These kinds of things. That's what we're doing at Luma. Skunk Works is just unbelievable.
Prateek Joshi (42:32.658)
Yeah, I think it's what it stands for. It inspires a certain path and this nature of how you build stuff. That's great. All right, next question. What has been an important but overlooked technology trend in the last 12 months?
Amit Jain (42:36.162)
Yes.
Amit Jain (42:51.81)
Honestly, for all the hype video gets, it's actually more overlooked than it should be. It should actually have more attention on it. Yeah. Yeah.
Prateek Joshi (42:57.968)
Right, right, good point. What company do you admire the most, and why?
Amit Jain (43:03.757)
SpaceX. Again, for the same reason: making the impossible possible. My purpose in life is to unlock humanity's tech tree, if you think about the tech tree from the Civ games, right? And as for a company that embodies that, I can't think of a better example than SpaceX. And one day, Luma. Yeah.
Prateek Joshi (43:23.346)
What's the one thing about video generation that most people don't get?
Amit Jain (43:30.774)
This thing that video is not just about video; video is on the path to AGI. This is a learning that people are going to internalize very soon. But until then, we're going to have a lot of fun.
Prateek Joshi (43:43.806)
I think a great, great takeaway from this episode is that video generation is not just about video generation. It's much, much more. All right, next question. What separates great products from the merely good ones?
Amit Jain (44:00.398)
That's a very good question. I think it's care and taste. That you can see that somebody thought through what you will do with it. I think it's care and taste. Yeah.
Prateek Joshi (44:13.682)
Right. What have you changed your mind on recently?
Amit Jain (44:20.034)
You know, that's a very good question.
Okay, I think China, recently with DeepSeek, I think that's very interesting. There are a lot of learnings for everyone at the moment on, you know, what is happening there, the kind of things that are coming out of the Chinese market. Yeah, China.
Prateek Joshi (44:31.548)
Alright.
Right.
Prateek Joshi (44:44.936)
What's your wildest prediction for the next 12 months?
Amit Jain (44:50.254)
For the next 12 months only? So, okay, I believe this very strongly given the roadmap that we see internally in the lab. People will be surprised how fast the adoption of video models is going to be, and how quickly they won't be able to tell if the thing they're watching came from a generative model or not. They already are, you know. Ray 2 passes the Turing test pretty well. Like, you know, people can't tell if this is generated.
That's why our launch campaign was: is it Ray 2, or real? People will be surprised how much of everything that they watch is going to come from video models.
Prateek Joshi (45:29.556)
All right, final question, and this is a two-part question. What's your number one advice to founders starting today? And part B is, what's your advice to founders who are building in video generation?
Amit Jain (45:43.774)
Right. So for people who are starting today, I mean, this is not just about today. In general, my advice is: of course, what you do and how you execute matters, but it also matters what kind of people you're doing this with. So find the right investors, find the right backers, you know, not just the people who are willing to give you money, but people who are actually intelligent, people who are able to see what you're doing,
right? And as your vision changes, as your goals evolve, people who can be next to you, who have been founders, who understand, you know, what it is actually like to found a company. It's absurd. It's just a very unique experience in humanity's existence to start something new. And people have been doing it for millennia. But of course, every time it's hard. So find the right people.
Really, that goes way further than anything else. This applies to the employees you hire, the co-founders you choose, but also very much to the investors you have. For founders who are starting in video generation: think very differently. It's not just that, of course, existing workflows are going to change significantly, but come play with the Luma API. It's wildly popular.
And you're going to see like, you know, there's so many things that people don't think video can do. Now video can do because it's not a physical thing anymore. You can do everything in bits. So yeah, like, you know, there's a lot of fertile ground.
Prateek Joshi (47:17.736)
Amit, this has been a brilliant, brilliant discussion. Love the depth of your knowledge and also just what it takes to actually ship a product because many people get stuck in the research phase and it's hard to take that thing and ship it. And obviously it's so much pressure when people actually start using your product. So thank you so much for coming onto the show and sharing it.
Amit Jain (47:39.47)
Absolutely. Thank you for having me. It was a great discussion.