Infinite Curiosity Pod with Prateek Joshi

Decoding Metadata in AI

Prateek Joshi

Cody Coleman is the cofounder and CEO of Coactive AI, a multimodal AI platform to accelerate metadata generation. They recently raised their $30M Series B co-led by Cherryrock Capital and Emerson Collective, along with participation from Greycroft, Andreessen Horowitz, and Bessemer Venture Partners. He has a master's degree from MIT and a PhD from Stanford.

Cody's favorite books:
- The Inner Game of Tennis (Author: W. Timothy Gallwey)
- Getting More (Author: Stuart Diamond)

(00:01) Introduction: Setting the Stage for Metadata
(00:21) What is Metadata? Structure in Unstructured Data
(01:37) Metadata in Real-World Visual Data Analysis
(03:01) Automating Metadata Generation: Challenges and Approaches
(06:57) Introduction to Multimodal AI: What and Why
(11:25) Managing Trade-Offs in Multimodal AI Systems
(13:31) Labeling Challenges in Multimodal Datasets
(16:23) Characteristics of an Ideal Metadata Language
(18:22) Linking Metadata Quality to Model Effectiveness
(20:56) Measuring Efficiency of Metadata Extraction Engines
(22:55) Role of Synthetic Data in Metadata and AI
(25:30) Evolution of Data Labeling and Future of Metadata
(27:27) Exciting Technological Advancements in AI
(29:29) Rapid Fire Round

--------
Where to find Prateek Joshi:

Newsletter: https://prateekjoshi.substack.com 
Website: https://prateekj.com 
LinkedIn: https://www.linkedin.com/in/prateek-joshi-91047b19 
Twitter: https://twitter.com/prateekvjoshi 

Prateek Joshi (00:01.958)
Cody, thank you so much for joining me today.

Cody (00:05.506)
Prateek, thanks for having me. Super excited to be here.

Prateek Joshi (00:09.368)
Let's start with the basics. We'll be discussing metadata quite a bit today. So can you define what metadata is?

Cody (00:21.664)
Yeah. So metadata brings structure to unstructured data, so things like images, video, audio, and even text, such that a computer can actually understand what's in that data. Now, historically, when we think about computers, they haven't been able to intuitively understand pixels or sounds like we do. So humans have had to do this tedious and painstaking work in order to bridge that gap, actually creating metadata

so that machines can understand what is in the pixels and the sounds that we hear. And especially in the case of images and videos, most companies rely on humans actually having to watch that content and manually add metadata to it in order to highlight different attributes within that video so that they can actually use that in their downstream processes, which, as you can imagine, is an incredibly slow and expensive process.

Prateek Joshi (01:15.26)
When it comes to analyzing visual data, obviously metadata can play a key role. So can you talk about, in the real world, in practice, what role does metadata play in analyzing visual data? And then maybe expand that to all types of data, be it text, video, LIDAR, and so on.

Cody (01:37.026)
Yeah, super great question. So, you know, I'll give an example. When we think about, you know, the rise of e-commerce, we're increasingly making purchasing decisions based off of images and video. But search systems haven't really adapted to actually understand those pixels directly. So what actually has to happen is that you have to have metadata added to those, you know, product photos and to those images

that is then used by search systems to be able to display the right products that you're looking for. Now, for example, one retailer that we're working with, if you typed in something like athleisure wear, which, like, you and I and most people would understand what athleisure is, and this retailer has a ton of that, you type that into the search bar on their website, and, like, four results came up.

And the reason for that is that they had tagged all of their products as activewear, as that specific word or tag. So if you search for something that's even a little bit off, like athleisure rather than activewear, you're not going to get any of those results. And again, that's because these existing systems haven't really adapted to understand the pixels or video themselves. So we have to rely on that metadata for machines to understand, for search systems to understand, and for analytics to understand as well.
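
To make that failure mode concrete, here is a minimal toy sketch in Python. It is not Coactive's implementation; the catalogue, the tags, and the hand-made embedding vectors are all invented for illustration. Exact tag matching returns nothing for "athleisure" when products are tagged "activewear", while embedding similarity still surfaces the semantically close items.

```python
import numpy as np

# Hypothetical catalogue: each product carries a single literal tag.
products = [
    {"name": "jogger pants", "tag": "activewear"},
    {"name": "yoga top",     "tag": "activewear"},
    {"name": "denim jacket", "tag": "outerwear"},
]

# Exact keyword search: "athleisure" != "activewear", so nothing comes back.
keyword_hits = [p for p in products if p["tag"] == "athleisure"]
print(keyword_hits)  # []

# Toy embeddings (hand-made 3-d vectors purely for illustration; a real system
# would get these from a learned text/image embedding model).
embeddings = {
    "athleisure": np.array([0.90, 0.80, 0.10]),
    "activewear": np.array([0.85, 0.75, 0.15]),
    "outerwear":  np.array([0.10, 0.20, 0.90]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Semantic search: rank products by similarity between the query embedding
# and the embedding of their tag.
query = embeddings["athleisure"]
ranked = sorted(products,
                key=lambda p: cosine(query, embeddings[p["tag"]]),
                reverse=True)
print([p["name"] for p in ranked[:2]])  # the activewear items rank first
```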

Prateek Joshi (03:01.586)
All right, and when it comes to generating this metadata, can you talk about, when you go in and you have access to a whole bunch of data, now, how do you generate the metadata and how do you automate the process of generating this metadata?

Cody (03:19.02)
Yeah, so actually taking a step back, it really depends on kind of the use case that people have. What we find is that we can actually enable people to do, you know, kind of the initial first step of search, in a lot of cases using just multimodal AI and multimodal embeddings. But the challenge that we find, especially when we work with enterprises, is that really they care about that last mile of customization.

Or when you think about these models, they might have been trained at some point in the past and they don't have the most recent information that comes up. So metadata comes in to actually be able to cover that last mile of customization for a specific enterprise. So for example, when you think about a retailer, being able to actually understand their specific styles, their specific headings of what is a...

even simple things like, you know, athleisure, as we were talking about, which a model might not understand. Or in the case of media companies, you know, you might be able to get some basic search out of the box, but getting that last mile of customization for, you know, new specific athletes and actors, that's where metadata can come in, in order to bridge that gap and be able to provide, you know, a customized search experience for that.

But then, search is just one use case that we see with these enterprises, where they have kind of multiple different things. For example, thinking about trust and safety or brand safety, one of our customers is Fandom, the world's largest fan-generated content and entertainment platform, where they have about 350 million users. They're constantly getting user-uploaded images: tens of thousands of images every single day, millions of images every month,

tens of millions of images every single year. And previously, what they had to do is they had to have, like, a human being look at every single image that was uploaded to their platform to see if it, you know, satisfied their community guidelines. With Coactive, they were able to codify those community guidelines and automatically generate metadata to say, you know, does this contain excessive gore or sexualized content that might not be suitable for the specific community? So.

Cody (05:34.854)
For example, when you think about the community for Game of Thrones, you'll probably have very different community guidelines than the community for Puppy Patrol. So by being able to actually create that custom metadata, they could go from having to rely on humans to do that first kind of very tedious pass on every single asset that was uploaded to their platform, which took them days to do, to something where that first pass could be done in 250 milliseconds.

And then it would actually provide the information about what content is actually even worth reviewing, what's even questionable, because only a small percentage of traffic even has that problem. And then there's even more advanced use cases around analytics, where people are trying to understand what content's resonating, either from audience engagement or for advertising and things like that. Where when you think about doing analytics today, it's SQL, where you have this structured view.

And ultimately, when you think about these image and video assets that actually capture what is in that content, it largely hasn't been able to be taken into those analyses and those processes. And that's where generating metadata can give you a consistent view of your content that you can join with other information, whether it be ratings data or click-through rates, to actually create better experiences for users.
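
As a rough illustration of that joining step, here is a hedged pandas sketch. The table layout, the column names (asset_id, attribute, ctr), and the numbers are hypothetical stand-ins rather than a real schema; the point is just that generated metadata gives unstructured assets a structured view you can join with engagement data.

```python
import pandas as pd

# Hypothetical generated metadata: one row per (asset, attribute) pair.
metadata = pd.DataFrame({
    "asset_id":  ["a1", "a1", "a2", "a3"],
    "attribute": ["athleisure", "outdoor", "athleisure", "denim"],
})

# Hypothetical engagement data keyed by the same asset ids.
engagement = pd.DataFrame({
    "asset_id": ["a1", "a2", "a3"],
    "ctr":      [0.042, 0.051, 0.019],
})

# Join the content view (metadata) with the structured analytics view, then
# ask which visual attributes correlate with higher click-through rates.
joined = metadata.merge(engagement, on="asset_id")
print(joined.groupby("attribute")["ctr"].mean().sort_values(ascending=False))
```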

Prateek Joshi (06:57.222)
Going to the topic of multimodal AI, and in the use case you're describing, it's very common for people to type what's inside an image, and they expect the result to show up. Meaning, if I just describe it in words, and not even accurately or the way it's been tagged, I'm going to type athleisure or, I don't know, red pants with frills at the bottom. Like, whatever it is, I'm going to describe it, and I expect it to show up. And this is a very interesting entry point.

So to start with, how would you define multimodal AI, and why has it become more important today? Obviously search is one reason, but can you just talk about why it's becoming more important here?

Cody (07:44.003)
Yeah, so, you know, to the first question, what is multimodal AI and what is multimodal? So multimodal, you know, breaking it down, is that we have different modalities of data out there. You know, structured data is one modality. We have things like text, images, video, audio. Those are all other modalities. And when we think about multimodal, it's being able to actually kind of integrate multiple different modalities into kind of the same model or the same kind of AI system.

And you can do that in a variety of different ways. Like I like to think about inputs and outputs and the modalities of those inputs and the modalities of those outputs. So for example, when you think about, you know, text to image use cases, the input there is text and the output is a generated image. That's multimodal. When you think about the input being one modality and the output being another modality, you can also have, you know, question answering systems. You think about multiple modalities coming in in terms of

text and images, and then out comes a textual answer. That's multimodal, because you again have multiple modalities between your inputs and outputs. The reason that it's become so important is that, like, when you think about everyday life in the world that we live in, it's inherently, you know, unstructured and multimodal. You know, right now, when you think about this conversation that we're having, this is actually over, you know, a video call, so we have

multiple modalities in terms of the visuals on screen. We have the modalities of my voice coming across. The sound waves are actually being captured by my AirPods in this case. And then you can actually end up distilling that down into other output modalities. So you might have like a text description of it. You might have metadata structured output to say like, hey, where is one question kind of ending?

Where is there a pause? Where are these kind of filler words and things like that? Where you have another kind of more structured output for some downstream system. Maybe to cut out... kind of, I'm notorious for saying 'you know' a lot. You could actually, and I just did it there, you could imagine using a kind of structured output to detect those words and be able to take that out automatically. And it's becoming so important because just when we think about,

Cody (10:07.566)
You know, when I think about kind of the amazing promise of AI right now, it's that it used to be that, as human beings, we would see the world through vision, and machines would see the world through bytes. But now those worlds are finally blurring together, because machines can see the world like we do, because of AI. And we're creating more and more of this content than ever. You know, 80% of internet traffic is video data today.

It's how we work, it's how we shop, as I said with the rise of e-commerce, and it's how we communicate. We've gone from things like text to communicating with videos with loved ones, sharing things from Instagram or from TikTok or from YouTube to actually communicate.

Prateek Joshi (10:51.738)
And if you think about the harmony between the multiple types of modalities you're dealing with. So on one hand, if you treat them as separate modalities, then there's a lot of control. You can do a lot of fine-tuning on how you treat each modality. But on the other hand, with multimodal AI, there's more harmony, there's more speed. So how do you think about the trade-off between treating all of them as separate modalities versus

a single multimodal AI system.

Cody (11:25.71)
Yeah, I think it depends on the use case that you're trying to do. You know, sometimes it's actually important to know which modality a piece of information is coming from. You know, for example, we see this a lot with people that are doing, you know, editorial use cases where they're trying to find, you know, a clip of some celebrity. And, you know, maybe, you know, just coming up with an example, like maybe you want to find, you know, clips of Taylor Swift.

If you just go in and search in some of these systems and you don't care about which modality whatsoever, you could get a reporter that's talking about Taylor Swift without Taylor Swift ever being on screen. Versus if you actually know that like, hey, I want to search the visual modality to make sure that I actually get an image of Taylor Swift for like whatever posts that I'm doing. You know, that information is actually really, really important for that downstream use case. In this case, like an editorial use case or for a production use case.

So you'll need to know the different modalities kind of depending on what you're trying to do and the different use case there. You know, another thing, if you think of trust and safety, it's a different thing to talk about, you know, guns or, you know, violence through audio. It's different to talk about it and to have that in the audio signal versus that visually being on screen. So you can see where these different modalities, and being able to understand them, actually matters. Now, the nice thing is that with multimodal AI, you can actually kind of

either do this ahead of time, in terms of, like, having the separate signals, or blend it together, depending on your specific use case. So it really depends on kind of the workflow and the use case that you have, whether you end up blending those modalities together or whether you actually process them separately. And we see that with our customers all the time, where again, sometimes they want to search across audio and video. Other times they want to know specifically, like, hey, I want to find

this person talking about, maybe a spectator talking about a basketball game, for example. Being able to kind of tease that out can actually be the difference of whether or not that's a solved problem for them and actually provides value or not.
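
A toy way to picture that choice in Python: keep per-modality retrieval scores around, then decide at query time whether to blend them or restrict to a single modality. The clip names, scores, and modality labels below are invented for illustration, not real retrieval output.

```python
# Hypothetical retrieval results for the query "Taylor Swift"; each hit keeps
# track of which modality actually matched.
hits = [
    {"clip": "news_segment_17", "modality": "audio",  "score": 0.91},  # reporter mentions her
    {"clip": "concert_cut_03",  "modality": "visual", "score": 0.88},  # she is on screen
    {"clip": "podcast_ep_42",   "modality": "audio",  "score": 0.74},
]

# Blended search: ignore the source modality and rank purely by score.
blended = sorted(hits, key=lambda h: h["score"], reverse=True)

# Modality-aware search: an editorial workflow that needs her on screen
# filters to visual matches before ranking.
visual_only = sorted((h for h in hits if h["modality"] == "visual"),
                     key=lambda h: h["score"], reverse=True)

print(blended[0]["clip"])      # news_segment_17 (an audio-only mention)
print(visual_only[0]["clip"])  # concert_cut_03 (actually on screen)
```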

Prateek Joshi (13:31.164)
How do you think about the labeling challenge in a multimodal context? So if you're dealing with a multimodal data set, what are the challenges compared to just labeling a bunch of images separately versus labeling a bunch of video or audio separately?

Cody (13:49.356)
Yeah, I mean, the problem is scale. You know, when you think about it, we're creating more content than ever because we have made it so cheap and easy to capture visual content. You know, it used to be that you had to have some serious upper body strength in order to, like, you know, carry around a video camera. But now each and every one of us has, you know, a camera in our pockets. And as a result of that, it's like, you know, you and I, we probably have thousands of photos on our phones right now.

It's become so cheap and easy to do it. We take probably like, you know, tens of photos, hundreds of photos every single day. Yeah. Despite it being so cheap and easy to create visual content, it's really difficult to do anything meaningful with images and video at scale. That's why, you know, metadata comes in and, you know, I've had the problem, you know, when I think about kind of going through this process of like, you know, maybe I'm preparing for a talk or for a post on LinkedIn.

you know, sitting there scrolling through all of my photos to find that, you know, right moment that I'm looking for in order to convey that point. And that's just me as a consumer. You know, imagine what it's like for businesses where, instead of thousands of photos, they have millions, millions of images and videos. Now that's really scale. And that's precisely the challenge that we solve at Coactive, because we want to make it easy to search and analyze those visual assets.

Because if you're, like, a retailer, for instance, whether you sell clothing or food, you have millions of photos of your products. You're creating hundreds of thousands of photos each quarter to capture the latest lineup of trends. And trying to sift through all of that to find a photo or video that you're looking for for a marketing campaign is difficult, to say the least. And in fact, often retailers give up and just do another photo shoot. A major food company that we talked to had, like, over 4,000 photos of strawberries.

Prateek Joshi (15:37.234)
You

Cody (15:42.862)
strawberries. And it is, yeah, it was crazy. And it wasn't because they wanted to find, like, you know, the good side of the strawberry or anything like that; it was because they couldn't find the other 3,999 photos of strawberries they already had.

Prateek Joshi (15:44.081)
Alright.

Prateek Joshi (15:55.762)
Right. When you think about metadata, in some ways it can act as a common language for different modalities within a multimodal model because everything can be mapped to this language that can be understood. So what are the characteristics of this language? Or maybe in an ideal world, what are all the things you would expect from a good language?

Cody (16:23.342)
Yeah, you know, what you would expect from a good language is consistency and coverage over all of your assets, such that, you know, across systems, across people, you can have a standard definition so that you can communicate effectively. The reality of it today is that, when we work with enterprises, oftentimes metadata is incomplete, inaccurate, or just plain lacking.

So either they have, like, you know, this one person that has been at the company for 20 years, has seen every single piece of content, and that's kind of an encyclopedia, such that they know, you know, exactly where the content is. Or they've created a very rigorous kind of taxonomy, which has very specific meanings that only a small subset of people actually know how to, you know, read, whether it be, like, an archive or a library team, in order to get consistency. Or they just don't, and they don't even know what

content they have, they don't know what assets they have, and they just kind of give up. And you just basically have all these assets that, rather than being an asset, are more of a tax or liability on these companies. That's where, when you think about metadata and you think about generating metadata, the benefit there is that you're going to have consistency and coverage by definition.

Prateek Joshi (17:42.674)
You know, metadata, it can be good, it can be bad. Like, the quality can vary so much. So how do you think about the effectiveness of metadata extraction? And also, how does it map to the effectiveness of the model that gets trained?

Cody (18:03.758)
All right, say that last bit again.

Prateek Joshi (18:05.958)
The effectiveness of the model, the AI model that has to use this metadata to run. So I guess the quality of the metadata and the AI model, how do you map the relationship between the two?

Cody (18:22.55)
Yeah, that's a really good question. You know, when I think about, like, the quality of the metadata, it's important to note that, you know, metadata and information is not static in a sense. We constantly have things that are changing dynamically, you know, whether it be for different use cases or, you know, different trends. So, you know, one example, when we think about fashion: you know, after the Barbie movie came out, Barbiecore became a thing. It was a fashion trend to wear pink all the time.

But you need to be able to actually reflect that within populations. Also, metadata can be different depending on geographical regions. You know, football here in America is different than football everywhere else in the world, for example. And you want to be able to have metadata that actually captures, like, the specific audience that you're talking about, because otherwise you're going to have potentially ambiguous kind of results in your model. Also, when you think about, kind of, you know, models today, especially these kind of large foundation models,

They have a really good understanding, a broad-based understanding, of these words and things like that. And usually, in order to improve performance, it's not about the quantity of data anymore, it's about the quality of data and finding the specific images, the specific videos that you need in order to improve the model's performance. So being able to actually define specifically what you're looking for, in terms of the data to improve your model's performance, is really, really critical. And that's where metadata comes in,

being able to find those things that you're looking for in order to improve model performance, to be able to capture edge cases, to be able to even filter out problematic content, especially when you think about these models and the fact that they're trained on massive quantities of data, you can actually even use metadata to filter out potentially personally identifiable content, potentially toxic content, and things like that as well. So it comes in both in terms of...

the initial input, to make sure that you have a good swath of data going into the model initially, as well as, as you continually improve the model and find its weak spots, making sure that you're finding the right type of content in order to improve the model performance. And that's where metadata can help you be targeted about what are the right assets for you to use in order to improve.
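
A minimal sketch of that curation idea, assuming per-asset metadata has already been generated by some extraction step; the attribute names, flags, and thresholds are hypothetical. The point is simply that metadata lets you slice the training pool toward a weak spot while screening out PII and toxic content.

```python
import pandas as pd

# Hypothetical per-asset metadata produced by some extraction engine.
assets = pd.DataFrame({
    "asset_id":       ["a1", "a2", "a3", "a4"],
    "attribute":      ["barbiecore", "barbiecore", "streetwear", "barbiecore"],
    "contains_pii":   [False, True, False, False],
    "toxicity_score": [0.02, 0.01, 0.85, 0.03],
})

# Target an edge case the model is weak on (say, the "barbiecore" trend),
# while filtering out PII and high-toxicity content before fine-tuning.
train_slice = assets[
    (assets["attribute"] == "barbiecore")
    & (~assets["contains_pii"])
    & (assets["toxicity_score"] < 0.5)
]
print(train_slice["asset_id"].tolist())  # ['a1', 'a4']
```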

Prateek Joshi (20:34.022)
And the engine that extracts the metadata, are there standard, agreed-upon methods to measure how good the engine is? Or maybe you work on it and somebody on the team says, hey, I came up with this thing, it's going to make the engine better. So how do you measure the efficiency or accuracy of the engine that extracts the metadata?

Cody (20:56.514)
Yeah, so super good question. know, one of the nice things about our approach is that we can do this kind of agile fine tuning where with a few examples, you can define what you're looking for in terms of, define what you specifically mean by an attribute or a specific piece of metadata or tag. And the great thing about that is that, you know, from there, you know, in our process of agile fine tuning, we can actually very quickly say, what are the edge cases?

You know, what are the things that are kind of borderline, of whether or not, you know, this is, you know, a piece of metadata. For example, you might think of a bird. You could provide a few examples of a bird and get to pretty quick, like, accuracy on a bird, thanks to, like, you know, foundation models being few-shot learners. But there's edge cases, you know, such as, like, a rotisserie chicken: is a rotisserie chicken a bird? And it really depends on, like, your specific use case and who you are. To Costco,

you know, or to, you know, a large retailer, a rotisserie chicken might be a bird. But to, like, an animal conservatory, you know, a rotisserie chicken is not a bird. So by being able to actually very quickly find those edge cases and do this kind of agile process of fine-tuning, you can actually then very quickly regenerate metadata and do it in a much more agile way, versus kind of the prior way, where, you know, metadata creation and data set creation was

much more of a waterfall method. You would go collect all the data, you would toss that over the fence to specific annotators that usually don't have the context about the problem or the use case that you're trying to solve. They would annotate everything, then you would get all that data back, and then you would try to train a model on it and realize that it would have edge cases. And the fact that you can really focus in on a few examples makes it much more observable to understand the quality and also the limitations of any metadata that you generate.
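
One way to picture that agile loop, as a toy sketch rather than Coactive's actual method: build a few-shot concept prototype from example embeddings, score new assets against it, and route only the borderline scores (the rotisserie-chicken cases) to a human, whose answers feed the next round of examples. The 2-d vectors and thresholds are invented purely for illustration.

```python
import numpy as np

# Toy 2-d embeddings; a real system would use a multimodal foundation model.
examples_of_bird = np.array([[0.90, 0.10], [0.80, 0.20], [0.85, 0.15]])
prototype = examples_of_bird.mean(axis=0)          # few-shot concept "bird"

candidates = {
    "sparrow_photo":      np.array([0.88, 0.12]),
    "rotisserie_chicken": np.array([0.55, 0.45]),  # the edge case
    "toaster_photo":      np.array([0.05, 0.95]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

for name, emb in candidates.items():
    score = cosine(emb, prototype)
    if score > 0.95:
        verdict = "auto-tag: bird"
    elif score < 0.80:
        verdict = "auto-tag: not bird"
    else:
        verdict = "borderline -> send to human review"  # drives the next round of examples
    print(f"{name}: {score:.2f} ({verdict})")
```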

Prateek Joshi (22:55.822)
Does synthetic data play a role in this? And if so, in what ways would that fit into this?

Cody (23:05.774)
That's a really good question. Give me a moment. I haven't actually thought a ton about synthetic data in terms of how it fits into this.

But immediately, I think there's kind of two parts. One, when you think about generating synthetic data,

It kind of exacerbates the problem and the need for metadata to filter through and find what are the right assets that you need. You know, we're seeing this kind of more than ever, because when you think about content creation, you know, we've made it way cheaper, because every single one of us has a phone in our pocket. But now, with generative AI and with synthetic data, you know, if it was a tidal wave before, it's a tsunami now in the amount of content and data that we can create. And as a result, more than ever, you need to actually be able to, like,

not rely on manual metadata entry; you need to be able to generate metadata at scale in order to understand what's actually in that content. So the problem only gets worse when you think about the volume or quantity of data that's being generated when you introduce synthetic or generative techniques as well. And then also, being able to actually provide input into, like, what are you generating? Finding that right subset of your data such that you're generating

you know, content that handles those edge cases. Being able to focus in on that kind of specific space and understand what is it that you need to generate to improve performance. Maybe you're thinking of an image generation model, and, like, if you don't know... you know, maybe generating a realistic version of a flamingo or something like that is actually really problematic. Well, you need to actually know what outputs are kind of problematic, and you need to be able to find kind of reference points

Cody (24:55.983)
and those edge cases, to be able to actually improve performance and to be able to generate synthetic data that makes it even better.

Prateek Joshi (25:05.522)
Right. And as models get more and more complex, we need a lot of data, a lot of structured, a lot of unstructured data. So how do you see the future of data labeling evolving? And also, how might metadata and multimodal AI shape this evolution?

Cody (25:30.542)
Yeah, super great question. I mean, you know, taking a step back, we live in kind of an unprecedented time right now, where, you know, it kind of reminds me of the 70s or 80s, when we had, you know, Moore's law, where the amount of compute and transistors on a chip was doubling every, you know, 18 to 24 months. When we think about, kind of, you know, the amount of floating point operations in training a model, it's greatly outpacing what we saw with Moore's law, where we see these models doubling not every 18 to 24 months,

but every 3.4 months, if you look at the AI and Compute blog post from OpenAI. As a result of that, when you think about metadata and these definitions, the thing that you can guarantee is that these models are going to change. You want to actually be modular in your design when you think about generating metadata, generating these definitions, where you can separate

you know, the specific taxonomy or this kind of semantics of your specific business from the base foundation model. Because the only thing that's certain is that that's going to change and that's going to improve over time. So when we think about kind of labeling and when we think about kind of our process of, you know, defining these like domain specific concepts or dynamic tags, we're intentional about making sure that the data set and the labels for those like tags and for that taxonomy are separate from the base foundation model.

so that we can actually very quickly and agilely fine tune any new model and swap it out almost like how you would swap out tires on a car.
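
For a quick back-of-the-envelope on that comparison: a 3.4-month doubling time compounds to roughly an 11-12x increase per year, versus about 1.4-1.6x per year for an 18-to-24-month Moore's-law doubling.

```python
# Growth factor per year for a given doubling time in months: 2 ** (12 / T).
def yearly_growth(doubling_months: float) -> float:
    return 2 ** (12 / doubling_months)

print(f"AI training compute (3.4-month doubling): {yearly_growth(3.4):.1f}x per year")
print(f"Moore's law (18-month doubling):          {yearly_growth(18):.2f}x per year")
print(f"Moore's law (24-month doubling):          {yearly_growth(24):.2f}x per year")
# roughly 11.5x per year versus 1.4-1.6x per year
```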

Prateek Joshi (27:07.634)
Right, that's a great analogy actually. So, right, one final question before we go to the rapid fire round. Given all the things that are happening in AI, too many things are happening, they're all moving very fast, what technological advancements are you most excited about today?

Cody (27:27.24)
I mean, I'm really excited about multimodal AI and the potential of embeddings, because when you think about it, enterprises have been stuck in this process of tag, load, search, where they've had to tag their raw assets, either through human or machine annotations, load those assets into their systems, and then search and analyze based off of that. But now, with multimodal AI and multimodal embeddings,

we're able to actually flip that process on its head, where we can load and index raw images and videos, understand the pixels and audio directly, and make it searchable with no metadata, no tags required. So the bulk of the work, when you think about just even the most basic things, like is something a cat or a dog, you don't have to worry about that. Instead, tagging and metadata generation can be focused on that last mile, to go from

you know, 70% performance to 90, 95% performance. So it dramatically reduces the amount of time and energy that's involved in this process and ultimately gives us kind of a completely new set of superpowers. So I'm really excited to see how those superpowers are put into practice. When we look at the teams that we work with, now they're able to find clips in a matter of seconds and not hours, whether it be for...

you know, marketing purposes, post-production, editing, content licensing. Or they can detect content, almost develop like a spidey sense, to detect content that violates editorial guidelines or community guidelines and trust and safety. And then, ultimately, looking at content and assets actually going from being a tax on these organizations to being an asset that generates new revenue streams, generates revenue streams for creators and for these companies,

as now AI models have to consume this data because we've saturated what's publicly available in terms of what's on the internet. And now it's really getting into these more focused, domain-specific data sets, especially when we think about image and video assets.
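
As a closing sketch of that "load and index first, tag later" flip: a minimal in-memory index over embeddings, queried directly with a text embedding and no tags attached. The class, the toy vectors, and the file names are all hypothetical; a real system would plug in a multimodal embedding model and a proper vector store.

```python
import numpy as np

class ToyIndex:
    """Minimal in-memory index over (hypothetical) multimodal embeddings."""

    def __init__(self):
        self.ids, self.vecs = [], []

    def load(self, asset_id, embedding):
        # "Load and index" the raw asset directly; no tags attached.
        self.ids.append(asset_id)
        self.vecs.append(embedding / np.linalg.norm(embedding))

    def search(self, query_embedding, k=2):
        q = query_embedding / np.linalg.norm(query_embedding)
        scores = np.stack(self.vecs) @ q
        return [self.ids[i] for i in np.argsort(-scores)[:k]]

# Toy vectors stand in for what an image/text embedding model would produce.
index = ToyIndex()
index.load("cat_video.mp4",  np.array([0.9, 0.1, 0.0]))
index.load("dog_video.mp4",  np.array([0.1, 0.9, 0.0]))
index.load("city_drone.mp4", np.array([0.0, 0.1, 0.9]))

text_query_for_cat = np.array([0.85, 0.15, 0.05])  # pretend text-encoder output
print(index.search(text_query_for_cat, k=1))        # ['cat_video.mp4']
```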

Prateek Joshi (29:29.35)
Alright, with that, we are at the rapid fire round. I will ask a series of questions and would love to hear your answers in 15 seconds or less. You ready?

Cody (29:39.873)
Okay, I'm ready.

Prateek Joshi (29:40.434)
All right. Question number... that's fantastic. I think all these episodes... yeah, anyway. Question number one: what's your favorite book?

Cody (29:42.872)
Take a deep breath.

Cody (29:53.678)
So this is a really tough question, and I'm going to give you two books. So I'm going to cheat a little bit here. The Inner Game of Tennis and Getting More by Stuart Diamond are my two favorite books.

Prateek Joshi (30:07.6)
Love it. We love books on this podcast, so more books is always better. All right. Next question. What has been an important but overlooked AI trend in the last 12 months?

Cody (30:19.566)
I think active learning and data-centric AI more broadly because AI is all about data. It's all about the data that you bring into the system.

Prateek Joshi (30:32.238)
What's the one thing about metadata generation that most people don't get?

Cody (30:40.91)
I don't think most people get why it's so important to enterprises. When you think about it, enterprises have so many existing workflows and applications that are dependent on metadata that ripping and replacing all of that is a non-starter in many cases.

Prateek Joshi (30:59.866)
What separates a great AI product from a merely good one?

Cody (31:06.286)
Partnering with customers, especially when we think about the enterprise, because AI is such a massive shift that it's all about change management. And the only way that you can do that is by listening and learning alongside your customers.

Prateek Joshi (31:23.45)
What have you changed your mind on recently?

Cody (31:28.075)
Vacations. You know, it's just so important to unplug and recharge in order to be able to think clearly and to be able to execute effectively, especially in such a rapidly moving space.

Prateek Joshi (31:42.17)
What's your wildest AI prediction for the next 12 months?

Cody (31:47.344)
Oh man, that's another tough question. Because I feel like people often overestimate what will happen in one year and underestimate what will happen in five years. I think in five years, we will see the equivalent of Toy Story, but from generative AI filmmaking.

Prateek Joshi (32:07.044)
Amazing. Yeah, that could be a very big leap. All right. Final question. What's your number one advice to founders who are starting out today?

Cody (32:18.222)
It's a team sport, from your employees to your board. So it just pays to be super thoughtful about that.

Prateek Joshi (32:28.338)
Amazing. Cody, this has been a great episode. Love the examples, the analogies, the depth of the conversation. So thank you again for coming onto the show and sharing your insights.

Cody (32:39.192)
Yeah, thank you for having me.