Infinite Machine Learning: Artificial Intelligence | Startups | Technology

Algorithmic Data Curation

February 26, 2024 Prateek Joshi

Ari Morcos is the cofounder and CEO of Datology, an automated data curation platform. He was previously an AI research scientist at Meta and DeepMind. He has a PhD in neuroscience from Harvard.

(00:07) Data Curation and its Importance
(03:29) Assessing Data Quality
(06:50) Challenges in Data Curation
(13:27) Types of Data to Remove
(19:33) Relationship Between Data Size and Model Size
(23:22) Choosing the Optimal Subset of Data
(26:23) The Future of Data Curation
(31:29) Impact on Data Management Service Providers
(36:19) Rapid Fire Round

Ari's favorite books:
- The Making of the Atomic Bomb (Author: Richard Rhodes)
- The Cosmere Series (Author: Brandon Sanderson)

--------
Where to find Prateek Joshi:

Newsletter: https://prateekjoshi.substack.com 
Website: https://prateekj.com 
LinkedIn: https://www.linkedin.com/in/prateek-joshi-91047b19 
Twitter: https://twitter.com/prateekvjoshi 

Prateek Joshi (00:01.572)
Ari, thank you so much for joining me today.

Ari (00:04.854)
Thank you for having me, really excited to be here.

Prateek Joshi (00:07.803)
Let's get right into it. Can you explain the premise of data curation as it relates to the LLM world we're living in?

Ari (00:21.238)
Yeah, absolutely. So, ultimately, models are what they eat. If you show them good data, they're going to be good. If you show them bad data, they're going to be bad, right? If I train a model on 4chan, it'll sound like 4chan. If I train a model on high-quality data sources, it'll be high-quality. I mean, this doesn't just apply to LLMs, but it applies to really any AI model. And the reason that this has become increasingly critical...

over the last couple of years is that, as the field has shifted from these small supervised data sets like ImageNet, for example, which is on the order of a million data points, to these massive unsupervised data sets that now underlie all the technology that we use for these big AI models, which are now trillions of tokens. We're scraping the entire internet, we're using these massive uncurated data sources, and the average quality of the data sources that we put into our models

has gone down dramatically. And similarly, because we have such massive data sets, the amount of redundancy, the amount of unnecessary data that's present in these data sets has gone up substantially as a fraction of the total data set. So data curation is just absolutely critical not only to make your model as strong as possible, but also to make your model train efficiently. So there has been a lot of work on neural scaling laws. These are things like

Chinchilla and this is the work that the Anthropic folks did when they were at OpenAI before they went and founded Anthropic, showing that there's a predictable relationship between the performance of a model and the amount of data you show that model. And that's really nice because it enables us to train a model at a small scale and then predict what it's going to look like at a large scale. The problem with it is that it's really, really slow. So every time you 10x the data size, you get a diminishing performance increase in your model.

So we hit diminishing returns quite badly. And that's fine if you're starting from a model that costs $1,000 to train. But if you're starting from a model that costs seven or eight figures to train, then 10x-ing the data to get a smaller and smaller improvement just isn't sustainable. And I think this is why you see these predictions that GPT-N will cost $100 billion or a trillion dollars, because people are naively extrapolating out these curves. But it turns out

Ari (02:46.046)
that the reason you see these scaling laws is because all data is not created equal. And as the size of your data set increases, the odds that the next data point is going to teach you something go down. In fact, they go down at the exact same rate that we see the scaling laws go down. So if instead we can make it so that every successive data point is always teaching the model something, then the model learns way, way faster.

So data curation is critical not only for getting a high quality model, but also enabling us to continue improving and scaling models. We have to use data better. We can't just naively use all the data without thinking about its quality.
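
For reference, the shape of the scaling laws being described here is usually written as a power law. The following is a generic sketch in the spirit of the Chinchilla-style formulations, with illustrative symbols rather than anything quoted in the episode:

```latex
% Generic power-law sketch of neural scaling laws (illustrative notation):
% N = model parameters, D = training tokens, E = irreducible loss,
% A, B, \alpha, \beta = fitted constants.
L(N, D) \;\approx\; E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
% With a data exponent \beta around 0.3, multiplying D by 10 shrinks the data
% term only by a factor of 10^{-\beta} \approx 0.5, i.e. the diminishing
% returns from naive scaling described above.
```

The argument in this episode is that curation effectively improves that data term by making each additional token more informative, rather than accepting the rate you get from random sampling.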

Prateek Joshi (03:29.835)
That's an amazing setup for the discussion here, because many times what people think is, hey, let's just build a giant data set, more data is better, and that'll build a model. But take an extreme example: if you just take 10 words and copy-paste those 10 words a trillion times, that's a massive data set. But...

it's not going to build a great model. Why? There are only 10 distinct words in the dataset. Again, it's an extreme example, but the point is that each new word after the 10th isn't really teaching you anything. So if you look at the process of building the dataset, as a first pass, what should a developer do to know how good this

data set is? Is there a metric, a set of measures you can use, at least to get a sense of where the baseline is?

Ari (04:31.922)
Yeah, so I think the first thing that's really important to understand about data and data curation is that this is a frontier research problem. This is not an area where there are widely accepted techniques that we just know work well. You know, relative to the amount of effort that the research community has put into, say, architectures, so making better versions of transformers, and now you see state space models and all these various things, there has been just

almost no effort put into improving the quality of data sets. And part of that is cultural. Part of that has to do with the fact that in supervised learning, you had some floor on the guarantee of the quality of the data set because a human has looked at every data point to label it. But when we shift to unlabeled data, there is no such guarantee. And we're also getting to data set sizes that are large enough that there's just no way a human can hold the entire data set in their head and make these sorts of inferences.

So, actually, to your question of what should a developer do: I think there's a lot of challenge in identifying the right approach here. And ultimately, what we've found time and time again is that humans are actually not very good at this. We like to think that we're good at this, and you know, this is I think another corollary of Rich Sutton's classic bitter lesson, which is that, you know, humans...

want to come up with these elegant inductive biases for how we should train models. But in reality, all that really matters is more compute and more data. And the same thing applies to data curation, where we like to think we know what makes data good for a model. But in a lot of cases, it's something that is done much more effectively algorithmically. So I don't think there actually is a pat answer. And I think that something that's really critical for people to understand is that this is really hard.

This is not the sort of thing where there's an off-the-shelf tool that's really going to solve this problem. And if you look at kind of the off-the-shelf tools that exist, they tend to do some version of exact or fuzzy deduplication using min hash or something like this. And these approaches work, but they're really just barely scratching the surface of what is possible with real data curation. And ultimately, that's something that has to be done at scale algorithmically.
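
For readers who want a concrete picture of the "exact or fuzzy deduplication using MinHash" mentioned above, here is a minimal, self-contained Python sketch. It is illustrative only; the shingle size, number of hash functions, and threshold are assumptions, not a description of any particular tool.

```python
import hashlib
from itertools import combinations

def shingles(text, k=5):
    """Character k-grams of a lightly normalized string."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}

def minhash_signature(text, num_perm=64):
    """One minimum hash value per seeded hash function."""
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles(text))
        for seed in range(num_perm)
    ]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching slots approximates Jaccard similarity of the shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

docs = ["The quick brown fox jumps over the lazy dog.",
        "the quick brown fox jumps over the lazy dog!!",
        "An entirely unrelated sentence about data curation."]
sigs = [minhash_signature(d) for d in docs]
for i, j in combinations(range(len(docs)), 2):
    if estimated_jaccard(sigs[i], sigs[j]) > 0.8:  # near-duplicate threshold (assumed)
        print(f"docs {i} and {j} look like fuzzy duplicates")
```

Because this kind of matching operates purely on surface text or pixels, it is exactly the approach that misses the semantic duplicates discussed next.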

Prateek Joshi (06:50.707)
Right, amazing. I think you put it nicely. Like you have to start doing it algorithmically because historically, like if you go maybe a few years back, you build a nice image data set, you get a bunch of humans to look at it, label it, certify it, you build a model, you ship that model and hopefully, I don't know, every n weeks or months, you'll update the model, again, with human supervision. Now that entire premise is uprooted because there's a deluge of data.

There's no way we can label a trillion words. It's just not practical, and even if you could do it, it's super expensive. So now you have to curate the data algorithmically, which brings me to the next question: in an ideal world, what should that algorithm do, or even look for? Let's say I come to you, maybe you're guiding a customer, and they say, here's a hundred gigs of data.

What should the algorithm look for in that data as a starting point?

Ari (07:54.922)
Yeah, so this is something that I've worked on extensively in the past and is what led me to found Datology. But roughly, there are three kinds of data that we think you want to filter or remove from a data set. And I want to note, actually, that data curation and filtering is just one part of the process of how you use the data best for your model. There are also things like the way you order the data,

the way you augment the data, the way you batch the data. There are all these other factors, in addition to just which data points you show the model, that are also really high leverage and can make a huge impact on the resulting performance and efficiency of the model you train. But let's just focus for a second on which data you want, the filtering question. We think of there being roughly three kinds of data that you may want to remove from a data set. The first is what we call semantic duplicates.

Semantic duplicates are data points which are fundamentally identical, but which, for a variety of reasons usually to do with post-processing, end up looking really different at the raw data level, even though to a human they're obviously the same. So as an example of this, imagine that you end up with two different images of the same product that were uploaded to two different e-commerce sites.

Each e-commerce site processes it a little differently. Maybe they crop it, they downsample it, maybe they rotate the image a little bit, maybe they change the saturation of the colors. Fundamentally, this is the same piece of information. This is the same image. And to a human, it's obvious that this is the same piece of information. But if you now use something like MinHash deduplication, it's never gonna catch that this is the same information, because on a pixel level, this is fundamentally a different set of values.
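
To make that concrete, here is a minimal sketch of what catching semantic duplicates in an embedding space could look like, assuming the embeddings have already been computed by some pretrained encoder. The greedy strategy and the 0.95 cosine threshold are illustrative assumptions, not a description of Datology's method.

```python
import numpy as np

def dedup_by_embedding(embeddings, threshold=0.95):
    """Greedy semantic dedup: keep a point only if its cosine similarity to
    everything already kept stays below the threshold. `embeddings` is an
    (n, d) array from a pretrained encoder (assumed to exist upstream)."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept = []
    for i, vec in enumerate(normed):
        if not kept or float(np.max(normed[kept] @ vec)) < threshold:
            kept.append(i)
    return kept  # indices of the points to retain

# toy usage: three near-identical vectors and one distinct one
toy = np.array([[1.0, 0.0], [0.99, 0.01], [1.0, 0.001], [0.0, 1.0]])
print(dedup_by_embedding(toy))  # -> [0, 3]
```

This scan is O(n^2) and only meant to show the idea; at web scale you would rely on approximate nearest-neighbor search rather than a dense similarity comparison.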

So this is kind of the lowest-hanging fruit that you can remove from data sets. And as you'd imagine, in web-scraped data sets and in the sorts of data sets that you see in practice in businesses, there's just a massive quantity of semantic duplicates. So that's the first category, and I think the easiest category to handle. The second category is semantic redundancy. These would be data points that are not the same, but which contain similar information.

Ari (10:13.45)
So for example, imagine two different pictures of two different golden retrievers in two different parks. These are different images, but they convey similar information. Or in the text domain, you might imagine two different summaries of the same book written by two different people. The challenge with redundancy, and this is one of the things that makes this really hard, is you need redundancy. For semantic duplicates, you generally don't want any of them. You just wanna find one example of it.

But for redundancy, we need redundancy, because redundancy captures the variance and the natural variability in a concept that's present in the real world. Otherwise, we'd have a model that can only ever understand a golden retriever if it's presented in exactly the right way. So we need to have redundancy, but we don't need infinite redundancy. And that's where a lot of the challenge is. You have to automatically, at scale, identify what the concepts are, and then for each concept, how complicated it is, such that

I know how much redundancy I need in order to understand that. If it's a very simple concept, say something like elephants, elephants are pretty stereotyped, right? There are Asian elephants and there are African elephants. African elephants are just bigger versions of Asian elephants, but they're all gray, they're all wrinkly, they all have big floppy ears, they all have trunks. They're all pretty similar. I don't need that much redundancy in order to understand what an elephant is. But if I want to understand dogs, well, there's tons of variants in dogs.

You have tons of variability, dogs of all different shapes and sizes and colorings and all these sorts of things. I need a lot more redundancy to understand dogs. So if I were to keep the right amount of data for dogs for elephants as well, then I would end up with a lot of data for elephants that I don't really need, and that is ultimately just a waste of time and compute to be looking at. And because models never converge, efficiency ends up equaling performance.

Alternatively, if I were to use the right amount of data for elephants for dogs, well, then I wouldn't have nearly enough redundancy for dogs, and I wouldn't fully understand that concept. So semantic redundancy is one that's really challenging, because you need to take into account not only the conceptual complexity of each concept, but also the relationships between all of these data points. And when you're talking about the pairwise relationships between billions or trillions of data points, it's just something that humans can't do.
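
As a toy illustration of the "identify the concepts, then allocate redundancy by complexity" idea, the sketch below clusters an embedded data set and spends a fixed keep budget unevenly across clusters. Using per-cluster spread as the complexity proxy is purely an assumption for illustration, not how Datology measures conceptual complexity.

```python
import numpy as np
from sklearn.cluster import KMeans

def prune_redundancy(embeddings, n_concepts=50, keep_fraction=0.3, seed=0):
    """Cluster embeddings into rough 'concepts', then keep more samples from
    high-variance clusters (dogs) than from tight, stereotyped ones (elephants).
    Assumes many more points than clusters. Returns indices into `embeddings`."""
    rng = np.random.default_rng(seed)
    km = KMeans(n_clusters=n_concepts, n_init=10, random_state=seed).fit(embeddings)
    # per-cluster standard deviation as a crude proxy for conceptual complexity
    spreads = np.array([
        embeddings[km.labels_ == c].std() if np.any(km.labels_ == c) else 0.0
        for c in range(n_concepts)
    ])
    weights = spreads / spreads.sum()
    budget = int(keep_fraction * len(embeddings))
    kept = []
    for c in range(n_concepts):
        members = np.where(km.labels_ == c)[0]
        if len(members) == 0:
            continue
        n_keep = min(len(members), max(1, int(round(weights[c] * budget))))
        kept.extend(rng.choice(members, size=n_keep, replace=False).tolist())
    return np.array(sorted(kept))
```

In practice the per-concept keep rate would need to be validated against model performance rather than set by a heuristic like this.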

Ari (12:25.878)
And then the last category of data is in many ways the most interesting, which is bad data. So this would be the equivalent of mislabeled examples in supervised learning. But the challenge is we don't have labels. So what does actually a mislabeled example look like in unsupervised learning? I think there are some areas where we do have a notion of formal correctness, for example, code, where obviously any code that doesn't compile is not something that we would wanna show to a code gen model.

But if I want a model that's going to debug code, maybe it does need to see some code that doesn't compile. So there can be challenges based on what you want to do with the model. And the thing about bad data that's really interesting is that there's been a lot of work showing that one bad data point is not offset by one good data point. You need many times more good data in order to offset the negative impact of just one individual piece of bad data. So bad data can really throw a model astray.

And being able to identify that at scale is really critical for making your model as performant as possible.

Prateek Joshi (13:27.963)
That's an amazing level of depth in understanding how the three pillars work here. I want to dig in on one specific thing you mentioned. If you ask a human, dog versus elephant, it's easy to know: hey, there are so many varieties of dogs, so you need more data to capture that. Elephant, not so much. But when the concept becomes more...

esoteric, or it's buried in some data set, you don't want to rely on a human telling you, hey, dogs need more data than elephants. So algorithmically, how do you assess the conceptual complexity of a thing?

Ari (14:11.666)
Yeah, so this is the billion-dollar question in many ways. Or I would say one component of the billion-dollar question. But it is definitely a real challenge, and it is an open research problem. It's something that we're working quite hard on at Datology, to figure out the right ways to do this so that we can estimate this at scale. Because not only do you need to estimate, okay, for the dog concept, how complex is that concept?

I also need to estimate what the concepts are in the first place. I'm not given labels, like class labels saying, hey, there are dogs and elephants. I'm just given a bunch of data: find the concepts, figure out the right way to partition them, figure out the right way to assess the conceptual complexity, and all of that. I will say that one tool we have found to be very useful for all of this sort of data curation is using other models as embedding spaces to try to work this sort of thing out. Because when you...

Prateek Joshi (14:44.599)
Alright.

Ari (15:07.23)
embed a data set, what you get is the model's view of how all the data are related to one another. And that's something that enables you to then start to make these sorts of inferences using algorithms that take advantage of the geometry of that embedding space.

Prateek Joshi (15:27.263)
Actually, this is a good stopping point. For listeners who don't know, can you explain what Datology does?

Ari (15:40.434)
Yeah, so Datology is basically building fully automated data curation as a service for anyone who wants to train their own models. So what you do is we get deployed to a customer's own infrastructure, either through a VPC or on-prem. You point us at your data source, whatever that is, whether that's in Blob storage or S3 or whatever it is, we take in that data, we do a bunch of initial preprocessing that's specific to the modality of the type of data.

We then take all that data, embed it all, and apply a lot of the approaches we've just been discussing in that embedding space. And one nice thing about that is, because we're operating in embedding spaces, we don't actually care what type of data we're processing. So whether it's images or text or video or audio or financial data or genomic data, whatever it is, we can handle and process it. Because ultimately, once you go into an embedding space, every data point is just a vector.

We then take that embedding space and filter the data. We take those filtered data and sequence them appropriately, because the order in which you present data, as I mentioned, is really important. We take those ordered, filtered data and batch them appropriately, because the way you batch your data matters a lot. And then we output a data loader, which just plugs into a customer's existing training infrastructure and makes their models train way faster, to way higher performance, and also helps them train smaller models to the same performance.
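
Stitching those steps together, a toy version of the "point us at your data, get back a data loader" flow might look like the sketch below. It reuses the illustrative helpers from earlier, the easy-to-hard ordering heuristic is an assumption, and none of it reflects Datology's actual implementation.

```python
import numpy as np
from torch.utils.data import DataLoader, Subset

def curated_loader(dataset, embeddings, batch_size=32):
    """Filter in embedding space, order the survivors, and hand back a standard
    PyTorch DataLoader that plugs into existing training code unchanged."""
    kept = np.array(dedup_by_embedding(embeddings))      # drop semantic duplicates
    kept = kept[prune_redundancy(embeddings[kept])]      # thin out semantic redundancy
    # crude curriculum: present points near the global centroid ("easy") first
    dists = np.linalg.norm(embeddings[kept] - embeddings[kept].mean(axis=0), axis=1)
    ordered = kept[np.argsort(dists)]
    return DataLoader(Subset(dataset, ordered.tolist()),
                      batch_size=batch_size, shuffle=False)
```

The key property the sketch tries to mirror is the interface: whatever happens inside, the output is an ordinary data loader.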

Prateek Joshi (17:05.027)
Amazing. That's actually a good segue into what would have been my next question: how you handle data as you go from one modality to the next. In this case, you mentioned text or images or genomics data, and you operate in a space where it's all vectors to you, so it doesn't really stop you and you don't need to customize. But as you go from one sector

to the next, let's say healthcare to financial services, do your customers ask you for different things? What do you have to do to make sure your product works for that sector?

Ari (17:47.486)
Yeah, so there are roughly four buckets of data curation approaches that you can imagine. The first bucket is approaches that are both task agnostic and modality agnostic. So I don't care what type of data it is, and I also don't care what you're going to do with those data. Then you have a bucket of approaches that are task dependent but modality agnostic, a bucket that's modality dependent but task agnostic,

and then a bucket that's both modality dependent and task dependent. There are approaches that will fit into each of these four buckets. We are convinced that there's actually a ton that you can do without knowing anything about the modality or anything about the downstream task. And that's where we're generally focused first, because we are focused on the gains that are going to be the most general purpose and generalizable.

Now that said, if we do know something about the problem and the way that a model is going to be used, you might want to adjust the curation approach subtly. And that's absolutely something that we will be doing, especially for modalities that are extremely common, like text and images and audio and video. But it turns out there's a whole bunch you can do even if you don't know anything about what happens downstream. Just as one example, no matter what I want to do with the data, if I have duplicated data, that's never a good thing.

There's a lot more than just that, but that is one that's really easy to understand.

Prateek Joshi (19:12.235)
You know, there's an interesting trend, or maybe when you talk to a whole bunch of people, people think that data size is kind of directly proportional to model size, which is a misconception, meaning they think that, hey, I'll just keep adding more data, so correspondingly model

will get bigger and bigger. Well, no, you can take a giant data set and build a very, very small model. It won't be a good model, but you can build an extremely small one. So how should people think about the relationship between data size and model size? And as you go from small to medium to large, how does that work out?

Ari (19:57.406)
Yes, this is a great question. And I think what you're alluding to is fundamentally the Chinchilla paper, which really shows that there's this relationship between the size of the model that you train and how much data you should use. I'm friends with a number of the folks who worked on that paper, and I think it has been tremendously impactful and raises a number of really interesting questions. However, I think there are a number of challenges with thinking about Chinchilla. First and foremost, it doesn't take into account data quality.

So Chinchilla and all these scaling laws assume that you're randomly sampling your data without any regard to which data are good or bad. So it's assuming that kind of all data are created equal. And once you change that assumption, all of the math changes that underlies these sorts of approaches. Two, it doesn't take into account the cost of inference downstream. So it's really saying, given a fixed compute budget, what's the best...

trade-off between model size and data size. But you also have to take into account how often you're going to deploy that model and how much you're going to use it. The more you're going to use a model for inference, the more important it is for that model to be smaller, because the cost of inference scales with the number of times it's going to be used. And as we see these AI products go from tens of millions of users to hundreds of millions of users to billions of users,

the relative cost of inference to training is gonna go up and up and up. And that's gonna incentivize companies to wanna deploy smaller and smaller models. This has actually led folks like Andrej Karpathy to call this the Chinchilla trap. So I actually think that we're gonna see a huge push towards companies training smaller models that are more specialized and can do one thing really, really well. Because if you also think about the big problems we see with LLMs right now,

one of the biggest ones is hallucinations. And that is fundamentally a byproduct of models being general purpose. If we train smaller specialized models, one, we're going to save on inference cost, but we're also going to increase reliability tremendously. But in order to do that, you have to train your models on the right data. So as you go towards smaller and smaller models, the importance of data quality goes up. And this is something that we've seen from the Phi models from Microsoft

Ari (22:21.254)
that it's pushed out, and Microsoft actually just launched a new internal group that's going to work on small language models. That's going to make a huge impact here. I think another thing that's really important to mention is that language models are one piece of this pie. Language models are of course very exciting, which is why there's been so much attention diverted towards them, but they're really just the beginning of where we're going to see AI being deployed. It's going to be deployed well beyond just language, to every other modality

and every other business use case.
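
To put rough numbers behind the training-versus-inference trade-off described above, a standard back-of-envelope FLOP count for a dense transformer (a textbook approximation, not a figure from the episode) is:

```latex
% Approximate compute for a model with N parameters:
%   training on D tokens      ~ 6 N D          FLOPs
%   serving T_served tokens   ~ 2 N T_served   FLOPs
C_{\text{total}} \;\approx\; \underbrace{6\,N D}_{\text{training}}
                 \;+\; \underbrace{2\,N\,T_{\text{served}}}_{\text{inference}}
```

As a product goes from millions to billions of users, the served-token term grows until it dominates, which is the pressure toward smaller N and the reason the "Chinchilla trap" framing argues against sizing models purely for training-compute optimality.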

Prateek Joshi (22:51.859)
Right. I want to shift the conversation to the actual mechanism with which an algorithm chooses a subset, the most optimal subset for you, the customer, to build a small language model. It's actually a two-part question. One, if the data set is big, let's say there's 100 billion words, obviously picking the global optimum

is a frontier research problem; it's not guaranteed. So how do you assess that the subset you chose is good enough? Is there a threshold? Is there a test where you can say, okay, I did a bunch of tests, I chose this subset for you, and you'll get the max out of it? And part B, is it even possible to get the global optimum here, the best possible subset?

Ari (23:46.25)
Yeah, so I think, so first off, let's talk about how you evaluate a model and the quality of a data curation approach. Ultimately, it's going to depend on what you want to do with that model. So typically, we use these academic benchmarks as ways to evaluate models. And I think that one thing that is becoming more and more appreciated is how fraught evaluations of language models in particular are. We use these sort of academic reasoning benchmarks like MMLU and stuff like that.

But if I make a model 5% better on MMLU, it's not clear how that actually helps with whatever you want to do with the model downstream. So first, you kind of have to pick some way that you're evaluating the model that you are happy with, and that is relevant to the use case that you care about. Ultimately, I think in businesses, we'll see most evaluations shift to A-B testing with whatever the actual relevant business metric is. But once you have that evaluation, ultimately, the question is,

how much data does it take you to get to that performance threshold, whatever that criterion is, right? And ultimately, because of the way we train models, data is equivalent to time, is equivalent to compute cost, is equivalent to dollars. So any reduction in that reduces all of these factors. And if you're doing data curation effectively, what you want to be able to see is that you can get to that same threshold performance much, much faster.

Alternatively, you want to be able to show that you can get to much better performance in the same compute budget, or even do something in between those two compute budgets where you're getting both a performance gain and an efficiency gain at the same time. And that's one of the things that's so exciting to me and unique about data curation is that if you do it well at scale, there is so much headroom here. We are doing this so suboptimally by default that you can provide both performance and efficiency simultaneously.

which in many cases is largely unheard of in ML, where almost always, if I want a better model, I have to pay more for it, and if I want a cheaper model, I have to take some hit in performance. But if you choose your data correctly, you can actually get both.
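
One hedged way to turn "how much data does it take to reach the threshold" into a number is to invert the power-law sketch from earlier for the data term. Every constant below is a made-up placeholder, and the "curated" exponent is purely an assumption used to show the mechanics.

```python
def tokens_to_target(target_loss, E=1.7, B=400.0, beta=0.28):
    """Invert L(D) = E + B / D**beta for D, holding model size fixed.
    E, B, beta are illustrative placeholders, not fitted values."""
    assert target_loss > E, "targets at or below the irreducible loss are unreachable"
    return (B / (target_loss - E)) ** (1.0 / beta)

baseline = tokens_to_target(2.1)            # random sampling of the raw data
curated = tokens_to_target(2.1, beta=0.30)  # assumed steeper curve after curation
print(f"tokens to target: {baseline:.3g} vs {curated:.3g} "
      f"({baseline / curated:.1f}x less data to the same loss)")
```

The point is only that a modest change in how much each token teaches the model compounds into a large difference in compute-to-threshold, which is the "performance and efficiency at the same time" claim in concrete terms.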

Prateek Joshi (25:55.463)
That would be fantastic. And maybe it's a good segue into my next question, looking forward. You've been close to this; you're actually in the heart of it, building a product to make this happen. Can you talk about where the world is and where we can expect to be in terms of data curation? What can we expect from this group of products in 12 months

and maybe 24 months.

Ari (26:26.398)
Yeah, so I think the first thing and the thing that's most important is a shift from approaches that rely on the end user to curate the data to approaches that just make it work automatically. And that's something that, you know, we are extremely focused on at Datology. There are a number of approaches that, you know, ask the end user to curate the data themselves, which have a number of issues for the modern era of self supervised learning. In many cases, I think they tend to be built for supervised learning.

You know, when you have this sort of scale, anything where you have a human in the loop is just going to be prohibitive. You need something that can just do it automatically. And when you're dealing with this sort of scale, you need to be able to hold the whole data set in your head, to compare thousands and millions and billions of data points to one another, which is just beyond the capabilities of humans. So one thing that I think we'll really see is a shift in the way people approach this problem

from thinking about this as something where, hey, we offload this cost to the ML developer and just make it easier for them to do it themselves, to instead taking the entire cost out of their hands and saying, we're just gonna do this for you automatically, and make it so that you have a better model that trains faster coming out of this. And I think that's gonna be really critical not only for making it easier to train models, but also for reducing the barriers to entry to train models.

You know, modern AI is going to be the most transformative technology in human history. It's going to be massive. And in order for that to happen, we need to have more than just a couple of groups training models. But right now, if you're an ML developer at a thousand-person company, you're not going to have the expertise and the resources to know how to do all of this. So taking it and making it so that it just works

is going to dramatically reduce not only the barrier of expertise but also of cost, and make it much more viable for everybody to benefit from this technology, and help disseminate this incredible potential of AI throughout the economy so that it can help everybody.

Prateek Joshi (28:39.187)
Right. And if you see the flow of services and dollars, I wanna get your thoughts on this approach. One approach could be companies buy dirty data and then they use a tool to clean it up and curate it themselves. That's path A. Path B is a bunch of third party providers show up and say like, hey, look, we'll take all the dirty data. We'll do the cleaning and companies...

will just buy clean data. So which of these two paths is more likely? Will they coexist, or will one eat up the other?

Ari (29:15.51)
Well, I think those two paths are missing, perhaps, a pretty important factor, which is that most businesses have their own data. And their own data is really their moat, right? That's their differentiator. And you see things like folks who are training big foundation models asking businesses to share their data with them. And in many ways, I think you're asking businesses to voluntarily give up their moat for free.

Their data is their most valuable commodity. So I think what you need is to be able to operate on a business's own data. They've been collecting petabytes of data in the course of business for many, many years, which are very informative about their customer base and the particular way that their customers use their products. And what we need to do is make it accessible and easy for businesses to leverage that data to train their models, which maybe will be...

augmented with some open source data. But most public data is quite different from what you see inside of businesses. And I think if you want to train a general-purpose language model, then the internet is OK. But even if you take the internet, it's a very, very biased representation of the world. It's extremely biased towards the developed world and towards the Western world and English and all these sorts of things. There are a lot of problems with it.

And ultimately, when you think about things like bias and fairness, they come down to a data problem. So it's also really critical that we use our data sets correctly so we make sure that these technologies don't only benefit the people who are already represented by these data sets, but also can benefit everybody else as well.

Prateek Joshi (31:00.067)
There are many companies that offer data management products in some shape or form, data lakes, data warehouses, and so on and so forth. And to them, more messy data is better because they can just charge their customers a lot. Now, Datology, you're coming in and you are selling this customer the promise that, hey, you don't have to spend that much on managing that much data because you don't need a bunch of data. So...

Are you, in some way, eating into the data management service providers? Are you eating into the money they're making?

Ari (31:40.318)
Yeah, that's a great question. I don't think so. I don't think there's really anything that puts us necessarily at odds with them, at least any time in the medium term, because we're really focused on how you best use your data for training AI models. That's really our focus: the intersection between data and AI. And ultimately, at Datology, we want to own that intersection. But...

The data lakes and the data warehouses are storing data for many other purposes as well. And for some of these other purposes, you may want to keep all that data around. And I think another important factor here is that, as I mentioned earlier, the approach to curation you take might change depending on what you want to use the model for. So you might end up taking a different subset of the data for purpose A than you would for purpose B, which would still make you want to keep that data around.

So while I think that eventually you can imagine some sorts of optimizations where maybe you don't store all the data, at least for the foreseeable future, I think companies are going to want to keep all the data around and then just identify what is the best subset of this data to use for training.

Prateek Joshi (32:53.639)
Yeah, maybe they'll keep all of the data, but maybe they'll buy fewer GPU hours. Maybe you're eating into that. Meaning, with Datology, you don't need to spend all of that money on buying AI compute, because now you can do the same thing with less. So I think that's a great way of putting it, and it's good to see where this plays a role. That's fantastic. All right, before we go to the rapid fire round, I have one last question.

Ari (33:01.355)
Yes.

Prateek Joshi (33:23.163)
In five years, what is the big technological breakthrough that you want to see? It hasn't happened yet, but you want it to happen. What's the next big thing that can happen here?

Ari (33:38.77)
Yeah, so I mentioned this at the beginning, but really the impact of self-supervised learning, of the ability to train without labels, just cannot be overstated. It is such a dramatic paradigm shift in the way that we view models, and I think that there's actually been a fundamental misattribution in the public discourse that I've seen to transformers as being the kind of critical piece in the AI pie rather than self-supervised learning. I would argue...

strongly that if we had never figured out transformers, we would still roughly be in the same place now as we are. If we had never figured out self-supervised learning, we would not be anywhere close to where we are. ChatGPT is totally impossible without self-supervised learning. But so far, I think that we have really just seen self-supervised learning applied to language primarily. And that's where you've seen the most use cases and where the vast majority of the excitement around this has come from.

However, I think that we will see that most of the value of self-supervised learning and AI is gonna come on areas beyond language, right? Ways to accelerate science through helping us to do better drug discovery, helping us to do better material discovery so that we can solve carbon fixation and solve fusion and all of these major problems that are facing humanity, which fundamentally aren't text problems. So one thing that I really hope to see, and I think we will see within the next five years, I think it'll actually even happen faster.

is the proliferation of these techniques to all the other modalities and all the other types of data that we have in the world, to help improve all the other prediction problems that are just core to our daily lives. Right, like imagine the impact on weather prediction if we can get that working really, really effectively, and all the downstream implications that would have. So that's something that I'm just

personally really excited about. Also, as a scientist, I was trained as a neuroscientist originally, so seeing ways that we can use AI to help all the various sciences is, I think, very exciting. And I hope that Datology helps to advance that future by making it easier for these folks to train their own models.

Prateek Joshi (35:49.651)
Amazing, and 100%, I would love for that future to happen as well, going beyond text. As you said, there are so many extremely meaningful things for humanity that can happen in areas like biology, material science, energy, extremely pressing things. So yeah, I'd love to see that happen. All right, with that, we're at the rapid fire round. I'll ask a series of questions and would love to hear your answers in 15 seconds or less.

You ready?

Ari (36:20.598)
Yep, let's go.

Prateek Joshi (36:21.887)
Alright, question number one. What's your favorite book?

Ari (36:26.266)
So for nonfiction, The Making of the Atomic Bomb by Richard Rhodes, which is just a spectacular book describing the origins of the Manhattan Project. And if you liked Oppenheimer, read this book for sure. And then on the fiction side, anyone who knows me will know that I'm very obsessed with an author named Brandon Sanderson, who has written a series of novels called The Cosmere, which I just absolutely love.

Prateek Joshi (36:47.683)
Amazing. Next question. What has been an important but overlooked AI trend in the last 12 months?

Ari (36:56.738)
Well, so I have a personal bias here, but of course, the importance of data. You know, this is really the sine qua non: without good data, you don't have good models. And I think we're seeing increasing awareness that this is really critical, but it's one that I expect we will see more and more people talking about and discussing over the coming years.

Prateek Joshi (37:20.007)
What's the one thing about data curation that most people don't get?

Ari (37:25.954)
Humans are bad at it. I think that's the biggest thing. It's hard to internalize, because we as humans want to think that we're good at stuff. But it's another form of the bitter lesson: this is another area where machines are better than we are, and we have to build the tools and think about the way we approach this with that in mind.

Prateek Joshi (37:49.351)
Right. What separates great AI products from the good ones?

Ari (37:55.134)
I think products that don't require a tremendous amount of expertise from the end user. If you need a PhD in ML to use a product effectively, it's going to have inherently very limited impact. And I think we're seeing this trend, but I hope to see more and more products that help anyone harness this technology, not just those who have spent many, many years of their life obsessively studying it.

Prateek Joshi (38:19.075)
All right, next question. What have you changed your mind on recently?

Ari (38:25.583)
Yeah, one thing that I was surprised about actually starting Datology is that I expected to need to do a lot more education around the importance of data quality. But I think garbage in, garbage out is such an old adage in computer science that people really get that it's important. And I think more than anything, it's about showing that this is possible to do automatically.

Prateek Joshi (38:39.156)
Hahaha.

Prateek Joshi (38:50.515)
Right. What's your biggest or wildest AI prediction for the next 12 months?

Ari (38:57.43)
Well, I would say it's kind of the same as what you asked about what I'd hope to see over the next five years. I think in the next 12 months, we're gonna see a lot more discourse and a lot more attention shifting from just LLMs to all the other applications of AI, and we're gonna start to really see the ways that this is gonna impact every aspect of our daily lives.

Prateek Joshi (39:17.139)
All right, final question. What's your number one advice to founders who are starting out today?

Ari (39:25.398)
I think that the biggest thing that I have learned on my journey so far is you want to be self-aware of where your strengths and weaknesses are and surround yourself with people who complement your weaknesses. Nobody knows everything. You're going to make a ton of mistakes. This is something that almost every founder I spoke to when I started, the first thing they said is, Ari, you're going to make a ton of mistakes and that's okay. And the important thing is that you surround yourself with people who can help you.

with the areas that you're new to, so that you can turn those weaknesses into your own strengths and just keep on getting better and better at what you do.

Prateek Joshi (40:04.983)
Phenomenal. Ari, this has been such a brilliant discussion. I love the clarity of your thoughts and the crispness of the way you present them. So thank you so much for coming on to the show and sharing your insight.

Ari (40:19.21)
Well, thank you so much for having me. It's been a real pleasure.