Infinite Machine Learning: Artificial Intelligence | Startups | Technology

Multimodal Video Understanding

December 07, 2023 Prateek Joshi

Jae Lee is the co-founder and CEO of Twelve Labs, where they are building video understanding infrastructure to help developers build programs that can see, hear, and understand the world. He was previously the Lead Data Scientist at the Ministry of National Defense in South Korea. He has a bachelor's degree in computer science from UC Berkeley.

In this episode, we cover a range of topics including:
- What is multimodal video understanding
- State of play in multimodal video
- The founding of Twelve Labs
- The launch of Pegasus-1
- Four core principles: Efficient Long-form Video Processing, Multimodal Understanding, Video-native Embeddings, Deep Alignment between Video and Language Embeddings
- Differences between multimodal vs traditional video analysis
- In what ways can malicious actors misuse this technology?
- The future of multimodal video understanding

Jae's favorite books:
- Deep Learning (Authors: Ian Goodfellow, Yoshua Bengio, Aaron Courville)
- The Giving Tree (Author: Shel Silverstein)

--------
Where to find Prateek Joshi:

Newsletter: https://prateekjoshi.substack.com 
Website: https://prateekj.com 
LinkedIn: https://www.linkedin.com/in/prateek-joshi-91047b19 
Twitter: https://twitter.com/prateekvjoshi 

Jae Lee (Twelve Labs) (00:04.91)
Thanks for having me, Prateek.

Jae Lee (Twelve Labs) (00:17.718)
Yeah, I think video understanding could be quite vague, right? What does it mean to understand video? The unique take that Twelve Labs brings is that it's really about leveraging all of the modalities that are present in video, like visuals, audio, conversations, background voices, and using all of that information to map human language to whatever is happening within the video content. So we believe that building AI that can efficiently map human language onto video content and generate really powerful embeddings from it brings us multiple emergent properties, like being able to search things, summarize things, and solve a lot of really hard downstream video understanding tasks.

Jae Lee (Twelve Labs) (01:27.243)
Yeah.

Jae Lee (Twelve Labs) (01:34.667)
Yeah.

Yeah, so I think a lot of different industries have been hearing this, maybe as a marketing term, but video AI, or AI that can understand videos like humans, that messaging has been around for maybe 10, 20 years. But at the end of the day, the technology wasn't truly there. At least for Twelve Labs, we're getting interest from many, many different industries, from law enforcement to sports media, contextual advertising, and e-learning, given how video has really become humanity's go-to medium for storing data.

I think the real-world use cases I'm really excited about are the ones boosting creativity and basic quality of life, such as healthcare, like patient monitoring. Given my Korean background: Korea is going through this terrible population collapse, and a lot of senior citizens are lonely, right? So video understanding can really help not only with patient monitoring, but also senior care and whatnot. Those are some futuristic, but also really near-term, use cases that I'm really excited about. Then there's the other set of use cases that's more revenue-driven and things like that. But at the core of it, I'm more excited about boosting that base quality of life.

Jae Lee (Twelve Labs) (03:26.518)
Yeah.

Jae Lee (Twelve Labs) (03:33.758)
Yeah, yeah. So given the current technology, not many. I hope more players come into video understanding, but the video understanding problem has many times been reframed into something like image understanding or speech understanding, which makes sense given the explosive growth in image foundation models and speech foundation models.

But we're trying to take a video-first approach, where we think the world needs something like CLIP for video, a video-native embedding model. With text, prompting has basically become almost a kind of feature engineering. So hopefully in the next 12 to 18 months, we can provide this text interface where not only can you tackle really high-level video understanding tasks, like visual question answering or summarization, but also really low-level perception tasks, like: can you put a bounding box on a person that's wearing a Twelve Labs hoodie? And be able to do that in a single interface.

Jae Lee (Twelve Labs) (05:08.318)
Yeah, so Twelve Labs is an AI research and product company based here in the San Francisco Bay Area. We build large video-language models, and we serve them to enterprises and developers that are building really creative video-centric products and are looking to add really powerful semantic video search, classification, summarization, or any video-to-text operations.
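
As a rough sketch of the kind of video-to-text operations described here, a client for such a service could look something like the following. The endpoint paths, parameters, and response fields are hypothetical placeholders for illustration only, not the actual Twelve Labs API:

```python
# Hypothetical client sketch for semantic video search and summarization.
# Endpoints, fields, and keys are illustrative placeholders, NOT a real API.
import requests

BASE_URL = "https://api.example-video-understanding.com/v1"  # placeholder URL
HEADERS = {"x-api-key": "YOUR_API_KEY"}                       # placeholder auth

def semantic_search(index_id: str, query: str) -> list[dict]:
    """Find video segments whose visual/audio content matches a natural-language query."""
    resp = requests.post(
        f"{BASE_URL}/search",
        headers=HEADERS,
        json={"index_id": index_id, "query": query, "options": ["visual", "conversation"]},
    )
    resp.raise_for_status()
    # e.g. [{"video_id": "...", "start": 12.0, "end": 31.5, "score": 0.87}, ...]
    return resp.json()["results"]

def summarize(video_id: str) -> str:
    """Generate a text summary of a video (a video-to-text operation)."""
    resp = requests.post(
        f"{BASE_URL}/summarize",
        headers=HEADERS,
        json={"video_id": video_id, "type": "summary"},
    )
    resp.raise_for_status()
    return resp.json()["summary"]

if __name__ == "__main__":
    hits = semantic_search("demo-index", "a person wearing a company hoodie walks on stage")
    print(hits[:3])
```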

Jae Lee (Twelve Labs) (05:42.998)
Yeah.

Jae Lee (Twelve Labs) (05:53.034)
Yeah, yeah. Thank you for loving the naming. We were thinking, do we call it small model, big model? And the funny thing is, the first thing ever captured on video was a horse running. So what is the most iconic stallion? It's Napoleon's stallion. So we named our embedding model Marengo. And then hopefully with Pegasus, people feel that, oh my god, video understanding is really being solved. So we wanted a legendary horse, Pegasus, hence the naming.

But yeah, the idea behind Twelve Labs has always been being that horizontal video understanding infrastructure. With our Marengo series of models, which is a video-native embedding model, we were able to tackle retrieval-based tasks, like search and classify, but that's only half of the story, right? And we've been getting a lot of feedback from our customers: is there any way we can leverage some of the understanding that Marengo achieved to generate text, and kind of eradicate the need to work with 30, 40 different vendors that provide separate metadata for video? So that's where Pegasus came about. And we're really excited about the two models, because if we improve one of them, the other automatically gets better. So we're really excited about that small flywheel we created.

Jae Lee (Twelve Labs) (07:40.918)
Mm.

Jae Lee (Twelve Labs) (07:47.074)
Yeah.

Jae Lee (Twelve Labs) (07:57.454)
Mm-hmm.

Jae Lee (Twelve Labs) (08:27.722)
Yeah, yeah. So I think we can go into long-form video processing. The four core principles, to reiterate, are efficient long-form video processing, multimodal understanding, video-native embeddings, and deep alignment between video embeddings and the language model. For efficient long-form video processing, it's really about the usefulness of the model we're building. Usually the videos we deal with aren't 15 seconds long, right? So we had to build a lot of technology from the ground up, from video pre-processing to the inference pipeline, basically everything centered around the notion that the kind of data we're going to be processing is going to be massive, probably around two hours long per video. So that's where that ethos came from.

For multimodal understanding, I think this is what distinguishes Twelve Labs from the traditional foundation model companies: we are a multimodal company from inception. If you think about it, a lot of the traditional foundation model companies have largely been language model companies, so they take the language model approach, which makes sense given the current technology. But we're seeing a lot of alignment problems with adding different modalities. The model gets too big, and it's just a pain to figure out, once you have this humongous language model, do you just attach a CLIP model, and how do you then align those embeddings? So from inception, Twelve Labs was thinking really carefully about how we architect our model so that it can take in visual, audio, and speech information at the same time.

And that's where we think something like CLIP, whose reception was incredible, at least for the research community, and now it's making a lot of impact in industry, would be needed for video. And it's probably not going to be frame-based, right? It's going to be some higher-level abstraction unit.

Jae Lee (Twelve Labs) (10:53.13)
Maybe we call them scene boundaries, right? So instead of creating embeddings per frame, we divvy up long-form videos into scene boundaries, and these boundaries get embedded. And for a lot of customers dealing with petabytes worth of data, that kind of makes sense: you don't really want to deal with millions of frame embeddings, which doesn't make sense.

And then the alignment between video-native embeddings and the language model. This is also part of the thesis behind Twelve Labs: developers and enterprises will always want choices. I don't think many people want a single company achieving everything, solving all of the data modality problems and being able to answer all of the questions under the sun. So we are excited about a future where there are choices for different language models, and choices for different foundation models for different modalities. And we think the LLM, it's improving, but it's a great reasoning engine, and hopefully it becomes a really great reasoning engine in the future, with Twelve Labs serving as, or at least providing, a set of eyes for LLMs. That's where the alignment comes from: we should be compatible with any open-source or off-the-shelf language model.
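
One common way to make video-native embeddings act as "a set of eyes" for an off-the-shelf LLM is to project each scene embedding into the language model's hidden size and prepend the projected vectors as soft tokens. The sketch below shows that generic pattern only; it is not Twelve Labs' actual architecture, and the dimensions are made-up assumptions:

```python
# Generic video-to-LLM alignment sketch: a small adapter projects scene-level
# embeddings into the LLM's hidden space so they can be prepended to text tokens.
import torch
import torch.nn as nn

VIDEO_DIM = 1024   # assumed size of a scene-level video embedding
LLM_DIM = 4096     # assumed hidden size of the language model

class VideoToLLMAdapter(nn.Module):
    def __init__(self, video_dim: int = VIDEO_DIM, llm_dim: int = LLM_DIM):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(video_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, scene_embeddings: torch.Tensor, text_embeddings: torch.Tensor) -> torch.Tensor:
        # scene_embeddings: (batch, num_scenes, video_dim) from a video encoder
        # text_embeddings:  (batch, num_text_tokens, llm_dim) from the LLM's embedding layer
        video_tokens = self.proj(scene_embeddings)                 # (batch, num_scenes, llm_dim)
        return torch.cat([video_tokens, text_embeddings], dim=1)   # sequence fed to the (frozen) LLM

adapter = VideoToLLMAdapter()
scenes = torch.randn(1, 12, VIDEO_DIM)   # 12 scene embeddings from a long video
text = torch.randn(1, 32, LLM_DIM)       # embedded prompt tokens
inputs = adapter(scenes, text)           # shape: (1, 44, 4096)
```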

Jae Lee (Twelve Labs) (12:46.178)
Yeah.

Jae Lee (Twelve Labs) (13:56.278)
Yeah, I think there still needs to be a lot of technical innovation for this technology to be widely adopted, almost to the point of ChatGPT-level performance. I mean, TikTok and YouTube hold a lot of video content, but now enterprises have a lot of video too, from internal Zoom meetings all being recorded, to law enforcement. We live in a day where, as a country, we're trying to protect some of these law enforcement officers, with body cams capturing whatever they're doing, to defense, where we've proven that video has been quite effective in some use cases, to healthcare. So it's easy to think it's just YouTube and TikTok, but underneath, given how 80% of the world's data has become video, there's just so much more under the surface than we really think. And the beauty of building Twelve Labs is that people come inbound with the kinds of use cases we've never even imagined. So yeah, I think there's a lot of video data that we haven't seen.

And the second part here is that there's the whole technical integration and the performance, but the smaller the model the better, if we can keep the performance. So if we can expand this kind of technology into edge devices and real-time use cases, I think it could be game-changing as well.

Jae Lee (Twelve Labs) (16:12.426)
We're generating videos right now.

Jae Lee (Twelve Labs) (17:05.886)
Yeah, yeah, I think we've been getting a lot of requests about satellite imagery as well. And if history is any indication, if we have a really powerful base model that has basically accumulated humanity's knowledge stored in video, then fine-tuning the model to understand satellite footage is something that's definitely doable. And I think it's going to happen in the very near future. Yeah.

Jae Lee (Twelve Labs) (17:45.858)
Yeah.

Yeah. So my mind is racing, right? With the right partnerships we might be able to train on, say, weather-forecasting news footage, and all of that gets connected with satellite footage measuring the clouds and their movement, and the model being able to soak that in and formulate understanding could be powerful. And I think the technology is there to make it happen.

Jae Lee (Twelve Labs) (19:05.75)
Right. This is something that we talk about as a team. It's great that LLMs have taken the world by storm, and everyone is thinking, or at least calculating, how many tokens they're generating and what the price is, right? So everyone is thinking in token-based metrics. And I think there's probably

Jae Lee (Twelve Labs) (19:35.414)
the same kind of story for video, and we can probably translate some of our pricing to token-based, if that becomes the general standard for how we measure the context window or the pricing. But at least for the current Twelve Labs customers, at the end of the day, it's really about how many hours of footage they have, right?

So we try to make it as easy as possible for these folks to understand. The technology we're building is overwhelming as it is, and if we start introducing more jargon into the scene, I don't think that helps this technology get adopted. So uniquely, at least for Twelve Labs, it's about the number of hours of content we've summarized for you, or indexed for you to make it searchable, and things like that.
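
For intuition, here is a back-of-the-envelope conversion between the hours-based framing customers use and the token-based framing common for LLM pricing. Every number below is an assumption for illustration, not Twelve Labs' actual pricing or tokenization:

```python
# Illustrative hours <-> tokens conversion; all constants are assumptions.
SEGMENTS_PER_MINUTE = 6          # assume one scene-level segment every ~10 seconds
TOKENS_PER_SEGMENT = 256         # assume each segment serializes to ~256 tokens
PRICE_PER_MILLION_TOKENS = 1.00  # assumed price in USD

def hours_to_tokens(hours: float) -> int:
    """Approximate token count for a given number of hours of footage."""
    return int(hours * 60 * SEGMENTS_PER_MINUTE * TOKENS_PER_SEGMENT)

def price_per_hour(hours: float = 1.0) -> float:
    """Equivalent per-hour price under the assumed token pricing."""
    return hours_to_tokens(hours) / 1_000_000 * PRICE_PER_MILLION_TOKENS

print(hours_to_tokens(2.0))  # tokens for a 2-hour video under these assumptions
print(price_per_hour())      # per-hour price under these assumptions
```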

Jae Lee (Twelve Labs) (21:01.355)
Yeah.

Jae Lee (Twelve Labs) (21:17.203)
Mm-hmm.

Jae Lee (Twelve Labs) (21:21.918)
Yeah, yeah. So I think for videos, what's interesting is that a lot of text is pretty coherent, right? It's quite natural: document in, embedding out. But for videos, even for the kind of conversation we're having, there are natural boundaries, or if you look at exciting movies, there are a lot of different, unrelated scenes. How do you then generate embeddings off of such fast-changing scenes? That actually degrades the performance, or at least the power of the embeddings, if you confuse the scenes.

So for Twelve Labs, what we've done is, we think there are these generic boundaries between scenes, and that's probably what customers want: to have a holistic understanding of the whole thing, but be able to really go deep into different scenes. This is something we as a team had to think about. It involves interfaces and abstractions: okay, there's a video in, how many embeddings out? And is it a heuristic? Do we do a 15-second window, a 30-second window? But that's not very helpful, so the unit there is the scene boundary. We try our very best to localize scenes and context. That's where the video-native embedding piece comes from.

And if you just create embeddings based on frames, it really misses out on capturing the spatio-temporal context. So that's why we do our very best to minimize looking at redundant frames: within the same boundaries we've identified, we remove certain redundant frames, and that becomes the basic unit that we give to our model to embed.
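
A highly simplified illustration of that scene-boundary idea: split a video where consecutive sampled frames change sharply, and drop near-duplicate frames within a segment so each scene becomes one compact unit to embed. This is not Twelve Labs' actual pipeline; the color-histogram heuristic and thresholds are placeholders:

```python
# Toy scene segmentation: histogram-based boundary detection + redundant-frame removal.
import cv2

def color_hist(frame):
    # 8x8x8 BGR color histogram, L2-normalized, as a cheap frame descriptor
    hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8], [0, 256] * 3)
    return cv2.normalize(hist, hist).flatten()

def segment_into_scenes(path, boundary_thresh=0.6, dedup_thresh=0.98, sample_every=15):
    cap = cv2.VideoCapture(path)
    scenes, current, prev_hist = [], [], None
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % sample_every == 0:          # sample ~2 fps for a 30 fps video
            hist = color_hist(frame)
            if prev_hist is not None:
                sim = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL)
                if sim < boundary_thresh:    # sharp change -> start a new scene
                    scenes.append(current)
                    current = []
                elif sim > dedup_thresh:     # near-duplicate frame -> skip it
                    idx += 1
                    continue
            current.append(frame)            # kept frames form the unit to embed
            prev_hist = hist
        idx += 1
    if current:
        scenes.append(current)
    cap.release()
    return scenes                            # each scene -> one video-native embedding
```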

Jae Lee (Twelve Labs) (23:55.632)
Yeah.

Jae Lee (Twelve Labs) (24:00.363)
Yeah.

Jae Lee (Twelve Labs) (24:18.2)
That's a great question. I think, first, for embeddings: video embeddings will ultimately give all developers programmatic access to whatever is happening within video content. So that is powerful in itself. But this goes back to me first learning about autoencoders like 10 years ago. Autoencoders were basically this new way of compressing data that captures all of the semantic information. What's tricky here is that a lot of compression algorithms are rule-based, so it's really easy to recreate the content; you know everything about how they work. Embeddings are fundamentally harder to manipulate in terms of knowing exactly what's going to happen. So I think it has potential to serve as a basic, almost like an encoding technology. But we still have miles to go, because, with the current rate of technology improving, what happens if a much better embedding model comes out? What happens to all of the videos that you've encoded? So I think we'll see. When that moment for video understanding comes, when it seems like the video understanding problem is finally solved, then I think we can create standards around it and start working on neural encoding, or neural compression, or something like that. Right? Yeah.
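
To make the "programmatic access" point concrete: once video scenes and text queries live in the same embedding space, many downstream tasks reduce to vector math. In this minimal sketch the embeddings are random stand-ins for whatever multimodal embedding model would actually produce them:

```python
# Tiny retrieval sketch over scene embeddings; vectors are random placeholders.
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

rng = np.random.default_rng(0)
scene_embeddings = rng.normal(size=(200, 768))   # stand-in for embed_video(scenes)
query_embedding = rng.normal(size=(1, 768))      # stand-in for embed_text("goal celebration")

scores = cosine_sim(query_embedding, scene_embeddings)[0]
top_scenes = np.argsort(scores)[::-1][:5]         # indices of best-matching scenes
print(top_scenes, scores[top_scenes])
```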

Jae Lee (Twelve Labs) (26:26.466)
Yeah.

Jae Lee (Twelve Labs) (26:40.398)
Mm-hmm.

Jae Lee (Twelve Labs) (26:56.834)
For sure. Yeah, and it also has to be multi-purpose, I think. You need to be able to recreate the content from these embeddings and to search for things really easily. I feel like we're talking about Silicon Valley's Pied Piper technology. But I think that's where we're headed, and embeddings are a good segue into that future. Hopefully we don't end up in a Skynet scenario.

Yeah.

Jae Lee (Twelve Labs) (28:18.686)
Yeah, I mean, it's an incredibly powerful piece of technology, right? You can fine-tune the models, maybe put some adapters on top, and create really awful privacy-encroaching technology that can track everybody and produce awfully detailed reports on what people are doing. I think that is very concerning. That's why we try to be incredibly, I don't want to use the word picky, but we work with customers that really, deeply, truly care about ethics and privacy. That's why Twelve Labs doesn't include facial recognition features in our model. The most specific you can get is probably describing what that person is wearing. We're not trying to look at genders or ethnicities or facial features. Because the model is powerful, it is able to infer the gender and the appearance of a person, but that is never the thing that we focus on.

So, responsible use. And, as creators of this technology, I hope many more scientists come into this space. There are just so many use cases that are creativity-driven and really exciting, right? So hopefully we see more of this technology being used in that realm rather than, I don't know, military or people-tracking or things like that.

Jae Lee (Twelve Labs) (31:09.975)
Yeah, I think, oof, what is unlikely to happen? Wow, that's a really hard question. I don't think anyone foresaw what was going to happen at the end of 2022 and in 2023, to be honest. But at least for video understanding, I think in the next two years we will have proven out the fundamental technology, where there's a single interface you can interact with and basically solve many of the video-related downstream tasks. I think I've mentioned this, right? Being able to summarize, to do a lot of high-level video question answering, down to really low-level perception tasks. We've seen that with text: what we used to consider specific fields, like named entity recognition or sentiment analysis, are all being solved by a single model, and it's able to do a lot of higher-level understanding tasks as well. I think we're going to see that for video, and maybe for other complex multimedia data like 3D and things like that. So that's the first part: nailing the fundamental research and scaling.

And we're seeing this trend from LLMs as well. We've scaled, made the models really large, but now the world is focused on: wow, we have business use cases for this, but the models are too big. How do we distill the knowledge into smaller models and make them more user-friendly? I think we're going to go through that phase, and hopefully, by the time half a decade has passed, NVIDIA has made some serious progress in edge devices and chips, and we've done our part in quantizing these models for real-time use cases. So that's the future I see in the next five years: sophisticated models being able to train not only on text, but on all sorts of data.

Jae Lee (Twelve Labs) (33:35.554)
You know, I think we've kind of already run out of text data, but text in tandem with video adds a lot more color to things. It captures cultural things, how different countries think about things. So yeah.
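
On the distill-and-quantize point a couple of answers back, a minimal example of post-training dynamic quantization in PyTorch looks like this. The toy model is a stand-in; a real video-language model would need far more careful distillation and calibration than this:

```python
# Minimal dynamic-quantization sketch: int8 weights for Linear layers on CPU.
import torch
import torch.nn as nn

model = nn.Sequential(            # toy stand-in for a much larger model
    nn.Linear(768, 3072),
    nn.ReLU(),
    nn.Linear(3072, 768),
).eval()

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 768)
with torch.no_grad():
    print(quantized(x).shape)     # same interface, smaller weights, faster CPU inference
```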

Jae Lee (Twelve Labs) (34:08.812)
Yeah.

Jae Lee (Twelve Labs) (34:25.783)
Yeah.

Jae Lee (Twelve Labs) (34:33.066)
Yeah, there are a couple of things that go through my mind. I hope the Transformer isn't the last innovation that we see, so I hope fundamental research on model architectures continues. And I think that's where academia can also shine, right? Most research now involves scale, but there's still a very significant role that academia can play in producing really novel architecture research that might be better than Transformers. So I hope that continues and we see more of that come out.

I'm also a bit concerned on that front, though. The whole Transformer thing picked up because of the support of developers and the software built around it, and now we see NVIDIA adding a Transformer Engine to the H100. So this novel architecture research has to be miles better to even become relevant, because of all the community that's been built. If it's only mediocrely better, then it's probably better to just stick with Transformers.

And I hope to see a lot of companies enter the NPU market and build silicon. I think there's just a huge shortage of chips, and that's something we, as humanity, need to figure out. There are a lot of amazing scientists and engineers with great insights and skill sets who could build something novel and amazing, but we also live in an unfair world where only a handful of people get their hands on large amounts of AI chips. I hope, as humanity, we figure out a way to increase the number of chips available.

Jae Lee (Twelve Labs) (36:48.498)
Yeah, yeah. I mean, I'm running this company and I'm always thinking about where we get more compute. But I think that's a serious issue. Yeah.

Jae Lee (Twelve Labs) (37:13.13)
Yeah.

Jae Lee (Twelve Labs) (37:17.422)
Right. And our scientists at Twelve Labs, they're like, I'll work for compute. Because they want to be able to build and work on really large-scale model building, and there's only a handful of companies that can offer that. It's an invaluable experience for these scientists, and it's sad to see that only a handful get their hands on it. Yeah.

Jae Lee (Twelve Labs) (38:03.487)
Yeah, let's do it.

Jae Lee (Twelve Labs) (38:10.774)
So many favorite books, but I think the one that I stayed up all night reading was probably Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville. But The Giving Tree is probably the one that shaped who I am. Yeah, yeah, yeah.

Jae Lee (Twelve Labs) (38:41.654)
Yeah, I think the executive order was concerning for me. The trend we've overemphasized is model sizes and things like that, and that's kind of like comparing the number of lines of code to determine how powerful a piece of software is. Hopefully, instead of focusing on creating demos, we focus more on use cases: what kinds of use cases we see trending that could be potentially malicious, and regulate those. Yeah.

Jae Lee (Twelve Labs) (39:31.434)
Yeah. Video understanding, it feels like: oh, if AI can understand videos, it basically understands everything. So it's really good at high-level stuff, but people tend to get shocked that it can't detect things really well. So there's still that gap, right? That's why we focus on bridging that gap, to be able to do everything from high-level understanding down to perception.

Jae Lee (Twelve Labs) (40:01.486)
Having a thesis and a purpose of existence: why that product matters, not only for the near future, but for the greater good.

Jae Lee (Twelve Labs) (40:20.95)
Yeah, perspective matters. I live in the AI world, but I try to get to know the broader world better.

Jae Lee (Twelve Labs) (40:34.43)
Edge devices, robotics: bringing this high-level understanding to robots.

Jae Lee (Twelve Labs) (40:49.855)
Go Elon. Yeah.

Jae Lee (Twelve Labs) (40:59.946)
If you already have a team, look at your team. Are they the ones that you would die for?

Jae Lee (Twelve Labs) (41:29.006)
Thanks for having me, I had a blast, and hopefully there's many more to come from Twelve Labs. Thank you for the support.