
What's New In Data
A podcast by Striim (pronounced 'Stream') that covers the latest trends and news in data, cloud computing, data streaming, and analytics.
Scaling Databases in the AI Era: Insights from Andy Pavlo (Carnegie Mellon University)
Join us for a deep dive into the world of databases with CMU professor Andy Pavlo. We discuss everything from OLTP vs. OLAP and the challenges of distributed databases to why cloud-native databases require a fundamentally different approach than legacy systems, along with modern vector databases, RAG, embeddings, text-to-SQL, and industry trends.
You can follow Andy's work on:
What's New In Data is a data thought leadership series hosted by John Kutay, who leads data and products at Striim. What's New In Data hosts industry practitioners to discuss the latest trends, common real-world data patterns, and analytics success stories.
Hey everybody, welcome back to a new season of What's New in Data. Today we have a very special guest, Andy Pavlo. If you're in the world of databases, you probably already know his name. Andy is a professor of computer science and databaseology at Carnegie Mellon University. Andy has built a massive following, with thousands of people looking to him for his insights and his flagship database course at Carnegie Mellon, which is fully open to the public online. What makes Andy's work so exciting is his ability to combine academia, industry, and a little bit of Wu-Tang Clan to make learning about databases gratifying, enjoyable, and practical. Let's get right into it.
Speaker 2:Andy, super excited to have you on the show today. How are you doing? Hey, John, thanks for having me. Working on my databases. It's going to be a good time. Always a good time talking about databases. Andy, I introduced you, but feel free to tell the listeners your story as well.
Speaker 3:Yeah, so I am, as you said, an associate professor with indefinite tenure of databases in the computer science department at Carnegie Mellon, and this is my, I think, 11th year. I've been there since 2013. And prior to that, I did my PhD in databases at Brown University with Mike Stonebraker and Stan Zdonik.
Speaker 2:Excellent, yeah, and, like I said, many people follow you. One of the very interesting things you do is this very public course. It's officially a CMU database course, but you open it up to the public: Intro to Database Systems, CMU 15-445/645. It's generous how you've made all the courseware, all the lectures, and all the projects available; you can even submit projects. I went through some of them myself. Why did you decide to make it so generously accessible?
Speaker 3:Yeah, how things started was, when I started at CMU, I was pretty much the only database professor there. There was another professor, Christos Faloutsos, but he's in more of a data mining, graph mining world. And when you start off as a new professor you're like, oh gosh, how am I going to get tenure? Because I'm competing with MIT, where there's Mike Stonebraker and Sam Madden, and competing with Stanford and the Berkeley people. So I'm like, all right, how do I get people to pay attention to what we're doing? And we decided to just put everything on the internet. It's advertising for the Carnegie Mellon Database Group. And it's one thing to put the videos online, but we also put all the course materials online.
Speaker 3:The thought process there was that not everyone can get into CMU and, first of all, not everyone can pay for CMU. CMU is very expensive. It's not like they're paying me a lot of money, but it's a lot. So not everyone can go to CMU, and why not just put everything on the internet for anybody who wants it? I don't know why more people don't do this. And the great thing about being at a private school is I didn't ask permission, I just did it, and no one cares. Everyone's happy with it. So I'm glad to help as many people as I can.
Speaker 3:I will say the positive side of this has been, one, it forces me to be more professional.
Speaker 3:I say fewer crazy things because I know it's going to be on the video. But from a pedagogical standpoint it's been helpful for my teaching and lecturing, because I'll watch the previous years, from one year to the next, and I see where students ask questions, and I say, okay, they must not be understanding something that I'm explaining, or the demonstration or the diagrams aren't clear. So from year to year I can look and see where students are confused and I make it better. So that's been wonderful. And the great thing also, I would say the personal satisfaction, has been that people who aren't at CMU, non-CMU students, send me emails and say, hey, because of the course I was able to get this job, or I got this internship. It's helped their career, without them paying CMU or me any money. That's fantastic. I like those emails, I appreciate people sending them, so it's been nothing but positive.
Speaker 2:Yeah, absolutely, and I want to say it's developed its own cult following. It has a Discord channel that's super active. Your videos are super engaging. They all start out like some sort of Tarantino movie, you in a beanie looking like you just ran away from a crime scene, and that really sets the tone going into the lecture. And then you'll go in and start talking about concurrency theory or log-structured indexes and things like that.
Speaker 3:I will say, for this fall 2024, the phone call that I started the first lecture with, from the Carnegie Mellon police, that phone call is real. People ask, oh, did I fake it? That is a real CMU phone call, and it's obviously not because I robbed a bank or something like that. Basically, I started getting these emails over the summer saying, hey, your CMU voicemail box is full, please go delete things. I'm like, okay, I didn't know I had a voicemail box, because I've never hooked up the phone in my office. I can tell you why in a second.
Speaker 3:So I find where this website is, and I realize I have 10 years of voicemails I've never even looked at, right. And so I'm going through, listening to them, to see if there's anything I should keep. And sure enough, there are two phone calls from the CMU police, and the one that's on the YouTube video is because some woman, not affiliated with the university, not a student, not faculty, not staff, nothing, called the CMU police saying that I had a photo on GitHub of me looking very threatening, and they were asking the police to have me take it down. It was something stupid like that. That phone call is the CMU police saying, hey, this lady called and said you have a video or photo of you with a knife from undergrad, and we're asking you to take it down. So yes, there have been shenanigans, but not all of it is completely made up.
Speaker 2:Absolutely. Someone really has to bring this mentality into the database world. I think you've successfully done it and I think that's why the community is so engaged. You have thousands of people following you for this course and the Discord is super active. I went through the course myself. I couldn't get through.
Speaker 2:It's quite a bit of dense material. It's definitely reminded me of my college OS class Shout out to the pintos class taught by freck benson over at usfcs, but just nights and weekends in the lab. But the amazing thing is, even though it's very dense, very challenging, you have people in you know discord who are chatting about going through the, the exercises. People are sharing hints and also very generous in the community. For instance, I just shared the test output for the Hyperlog project and someone said, oh, you just have a little off by one error and I said okay, that's it. It was just my indexing was off. But it's very hard to get people to come around something that's so complex and work and collaborate and ultimately solve these problems. My personal recommendation is definitely try to go in with a group, but if you do, the community is awesome. Definitely a shout out to you for cultivating that that awesome.
Speaker 3:That's awesome. I would also say that Discord channel is not affiliated with me. That's some person in India who said, I'm going to make a Discord channel, and sent me an email. I'm like, okay, great, go for it. I don't monitor it, so it's all organic, outside of CMU. So give credit to those guys for establishing that community. I had nothing to do with it, but I'm happy to see that somebody did it.
Speaker 2:Yeah, it's incredible. Definitely, if you feel like the material is really challenging and you just need some help, there's a lot of active communication there through every single project. But yeah, it's even cooler to hear that it started on its own; it speaks to what's become a very viral course. The other thing I want to ask you: is a strong background in Wu-Tang Clan an official prerequisite of the course, or just recommended?
Speaker 3:It's recommended. In previous years, not this past year because of other troubles, but in previous years we've had a course DJ. If you watch the fall 2023 lectures, there's somebody sitting next to me, a full-time DJ, and he mixes and does beats before and after the lectures. We did that because there's a bit of time setting up before the class starts, while students are walking in, and it's always this dead awkward silence. I'm like, all right, let's get somebody to play music during this time. So the Wu-Tang stuff came along because, one, I like the Wu-Tang Clan, but also we had a course DJ. We plan on having one again in the fall of 2025. That's the number one complaint we got this semester, that we didn't have a course DJ.
Speaker 3:In previous semesters I've done Easter eggs on the final exam. For example, one year, if a student could list all the members of the Wu-Tang Clan that were on the original 36 Chambers album, including the one member who was actually in jail and couldn't be on the album, then I gave them a 100% score on the final exam. So they had to list all nine members, but no one could do it. One guy got close, but he misspelled Ghostface Killah and Masta Killa. He missed the H on.
Speaker 2:Oh, automatic fail.
Speaker 3:Right, yeah. I said it's two A's; it had to be the exact names and everything. So yeah, we've done fun things like that before.
Speaker 2:Yeah, it's a really fun class. It's really engaging to go through. Sometimes the material can get dense once you're in the weeds of things, but it'll be funny to look up and see RZA and GZA on the board and think, okay, is that some sort of acronym for an indexing scheme? Oh wait, no, that's from the bootlegs.
Speaker 3:Let me comment on the density. The course that I teach is like the OS class you mentioned. The OS classes at universities are expected to be very dense and down in the weeds of what the kernel of the OS is actually doing, and no one had really done that for databases. The standard database course at most universities is: here's what the relational model is, here's what the normal forms are, and maybe that's the first half of the semester, on how to do data modeling, and the second half will be: here's what transactions are, here's what the concurrency control rules are.
Speaker 3:And when I first started at CMU, that's the course I used to teach. It was a different number and I used to co-teach it with the other professor I mentioned, and honestly I didn't like it, because I just don't think it set up students for the things they actually needed to know if they were going to go into industry and work on the internals of database systems. Nobody does normal forms. You don't need to know things like Armstrong's axioms. So I threw all of that out. Also, at the time, when you searched for easy computer science courses at Carnegie Mellon, this old database course used to come up as number one, so I was like, I can't have that, I've got to fix that. So we threw it all away and I basically decided: here's the course that I would want to have taken if I were an undergrad, covering all the things I think are important.
Speaker 3:And that's how it evolved. And, yes, I understand that it can be a bit down in the weeds. There's so much more material I could cover; there's not enough time in 14 or 15 weeks. But the expectation is not that everyone is going to retain everything the course covers. The idea is that when you go off into the real world, whether you're working on database systems or even just using a database system, which pretty much everybody has to these days in an application domain, you want to understand what the system is actually trying to do for you when you give it data or run a query, and you want to understand its behavior. So yes, it's dense, but it's not meant to be "here's everything you need to remember when you go out into the world." It's enough to give you the background and say, okay, here's how things actually work.
Speaker 2:Yeah, absolutely, it's an incredible class. People who take courses like operating systems, or a course where you build the internals of a database, or who get into computer architecture, always say those are the most gratifying courses for computer science students in a lot of cases. So it's really cool to see the database lens on that, where the database is also a client of the OS and kind of restricted by the OS in a lot of senses. We'll get into that as well. But it's a super fun course.
Speaker 2:I took a first stab at it, got through the first project and half of the buffer pool project, and thought, okay, I need to block out more time for this. I would probably have to take a sabbatical to really do the course justice. But I totally recommend it for a lot of folks, especially those with a software engineering background who want to get more into database engineering. My brother's a CS student, and I said, just go look at this class, just go snoop on this class, since they don't offer the equivalent at your school, and he was very interested in it as well. So I do want to get into some of the materials in the lectures. Obviously, vector databases are heavily associated with AI in industry. It's one of the use cases for taking LLMs and tying them to the database. In your course, you categorize vector databases with other forms of indexing, so I want to hear your thought process behind that.
Speaker 3:Yeah. So I guess we should first explain what a vector database is. A vector database, at least as it's been defined in the last couple of years, is a database system that supports taking vectors, or embeddings. Think of a big array of floating point numbers generated by a transformer. You take your text document, you run it through this transformer, and it converts it into a vector of floating point numbers, or decimals, and then you build an index on those vectors so that you can do quick, approximate nearest neighbor searches. So the idea is that you take all your documents or all your data, you generate the vectors for them, you load them into this index, and now when a query comes along and says, find me all the records that are similar to this other record, where similarity can be based on the semantic meaning or a bunch of other factors that your transformer takes into consideration, you do a nearest neighbor search and find the vectors that are closest to the vector you're searching for. And so you can break it down to the components of what it takes to actually do that approximate nearest neighbor search.
Speaker 3:The first thing is that you store the data you're going to build vectors on, and in all of the major vector databases this is typically JSON. You take your JSON and identify which fields you want to turn into vectors. So right away that says, okay, you need something to store some kind of structured data, whether it's JSON or relational data, it doesn't matter. And then the approximate nearest neighbor search occurs on this other data structure, the vector index, that can do the nearest neighbor search. So again, if you take away the AI veneer on top of all of this, underneath it's just a data store and an index, right? It could be a relational database plus a B+tree index or a hash index. And that approximate nearest neighbor structure, that vector index, is exactly what I'm saying it is: an index. So there's nothing so radical or completely different about the idea of doing nearest neighbor search on vector data that it has to be a completely brand-new database system architecture.
Speaker 3:And look at when vector databases became hot, in 2022, when ChatGPT became sort of a household name. Within one year, all the major relational database vendors had added vector indexes, right? They didn't have to rewrite everything. They just added this new auxiliary index, just as if it were a new B+tree or a hash table. So it basically is just that index. Now, there are some changes you have to make in the query optimizer: do you search the index first, or do you do it afterward when you want to do additional filtering? Maybe you change the execution engine in terms of how you actually invoke that index, because you're not getting exact results, and that may affect filtering later on. But at a high level, the query processing approach on a vector index is about the same as it would be on a relational database or a JSON database. So in my opinion, it is just an index.
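To make the "it's just an index" point concrete, here is a minimal, illustrative Python sketch (not from the episode). The base records live in an ordinary table, and the vector "index" is just an auxiliary array of embeddings queried with a brute-force nearest-neighbor search; real systems use approximate structures like HNSW, and the embed() function here is a hypothetical stand-in for whatever transformer you would run documents through.

```python
# Records live in a plain table; the vector index is an auxiliary structure
# alongside it, the same architectural role a B+tree or hash index plays.
import numpy as np

records = [
    {"id": 1, "text": "postgres adds a vector index"},
    {"id": 2, "text": "wu-tang clan course dj"},
    {"id": 3, "text": "columnar storage for analytics"},
]

def embed(text: str) -> np.ndarray:
    # Hypothetical placeholder for a transformer: deterministic-ish per text,
    # returns a small vector of floats.
    rng = np.random.default_rng(abs(hash(text)) % (2 ** 32))
    return rng.standard_normal(8)

# Build the "index": one embedding per record.
index = np.stack([embed(r["text"]) for r in records])

def nearest(query: str, k: int = 2):
    q = embed(query)
    dists = np.linalg.norm(index - q, axis=1)   # L2 distance to every vector
    return [records[i] for i in np.argsort(dists)[:k]]

print(nearest("vector search in a relational database"))
```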
Speaker 2:Yeah, very cool to hear that perspective, and it definitely makes sense, especially if you look at all the parts that were adjacent to vector databases in your lecture. We talked about Lucene, for example. Lucene isn't necessarily new either, right? It's over a decade old, it's been around for a while, and you've seen approximate nearest neighbor search as a way to store and retrieve data for a while now. But yes, it seems like now it's viewed as one of the use cases for tying AI and databases together in the storage layer. We'll get more into that, especially when we get into your paper, What Goes Around Comes Around... And Around, but first I just want to dive into the technical side of it.
Speaker 3:Yeah, I would say one thing also: the vector databases are going to do a better job around that approximate nearest neighbor use case than, I think, the relational database systems in some respects. One of the things they do better is integration with all the various AI tools or LLM tools and that sort of ecosystem, like LangChain and those kinds of things. They're going to do a better job of integrating with those. I know that Oracle has put out some stuff that can connect to OpenAI and other services more easily, but I think the vector databases will do a better job at that. Again, though, the core architecture is essentially the same.
Speaker 2:Yeah, it's always interesting to see the architectures these companies have. There might not be anything groundbreaking or new in some of it, but they've built the database and the go-to-market around AI: how do we integrate it, how do we make it easy for developers to operate, how do the people actually doing AI in a company work, and how do we make this really adoptable by them? I think those are the kinds of ambiguous things that companies super focused on delivering something might prioritize over, let's say, an Oracle. And by Oracle I don't want to downplay them, because of course there are a lot of impressive things they've done and they're very focused on this area. But I would not say it's a repeat of, let's say, NoSQL 10 years ago, where MongoDB, Cassandra, and others were able to come into the market unopposed by the traditional vendors. A lot of interesting things will happen there, for sure.
Speaker 3:I think the JSON story with Mongo is interesting, because Mongo became a popular system in 2009, 2010, maybe even earlier, 2008. But it still took a few years before the relational incumbents added JSON support. I think Postgres added it in 2012, the SQL standard added it in 2016, and Oracle added at least initial support maybe 2013, 2014. So it took about three or four years before the relational databases said, oh, this is a thing, let's add it to SQL, let's add it to our database systems. Which is funny, because they already had support for XML, and going from XML to JSON is not that radically different, so they could have repurposed some of that code, but it took a while to happen.
Speaker 3:Whereas if you look at the vector search stuff, again, like I said, within a year of ChatGPT making this a thing people want to do, doing RAG or doing these vector lookups in specialized databases, it took only a year for them all to add it. And that tells me that adding a vector index and approximate nearest neighbor search to an existing system was either easy to do, because again it's just another index you plug in, or people perceived this as a very important use case that everyone needs to have, the future they don't want to be left behind on, so they put all their resources at it right away. It's probably a combination of the two. Within a year of ChatGPT blowing up, all the major databases had vector support, whereas with JSON it took multiple years.
Speaker 2:Yeah, absolutely. That's a great observation, just seeing the signals out there. Okay, Larry Ellison probably wasn't super excited about JSON, but he is going to dinner with Elon Musk and Jensen Huang and talking about AI and trying to build his own nuclear-powered data center, so I don't think he's going to ignore this or let new vendors come into their territory here. It just has everyone's attention, and of course the big hyperscalers are also throwing a lot of IP at this. I can see why Mongo, some of the NoSQL databases, and some of the more developer-oriented databases might have been able to take a lead in the early 2010s; maybe the large, established vendors said, that's just not our market, it's not super interesting to us.
Speaker 2:That doesn't seem like the case with AI. It seems like with AI, everyone's going full throttle into it. It'll be interesting to see what happens there. We don't even have an established, repeatable architecture yet for real-world AI deployments. There are some examples, but I think a lot of the work is still pending to see what becomes resilient. I also want to ask you about some of the other parts of your lectures. Without getting into too many specifics, I want to compare, at a high level, columnar versus row-based storage and some of the trade-offs there. Could you walk us through some of the deeper technical reasons why an OLAP system might prioritize columnar storage and compression, whereas an OLTP system, a transactional engine, might focus on row-oriented layouts?
Speaker 3:Yeah, at this point this is established conventional wisdom, so I don't think I'm breaking new ground here. But if you think of the original relational databases from the 1970s and 1980s, they would store everything in rows, meaning all the values or attributes for a single record or tuple would be adjacent to each other. For my account information, you'd have my first name, last name, mailing address, all of it contiguously laid out one after another, and the next record wouldn't start until the current record finished. That's how the data would actually be organized in memory or on disk, and for OLTP workloads, or operational workloads, this makes sense, because the queries are typically looking for single entities in the database: go get Andy's record, go get John's record. So when the query wants to go get that data, you want to land at some location, read things contiguously, and bring them back to the application. So in that world, storing things in a row store makes sense.
Speaker 3:And for analytics, you're not looking at single records, you're looking at the aggregation of all the data you've collected. You're extrapolating new knowledge from the collection of data you've accumulated. So in that world you're looking across multiple records, and typically you're looking at a subset of the columns: find me all the records within this zip code. I don't care about your first name, your last name, and so forth; I don't care about your birthday. Those columns are all unnecessary for the query. I'm looking at just one single column. So people basically figured out in the 2000s, although the idea goes back to the 1970s, that if you store the data contiguously within a single column, breaking out the composite attributes of every tuple and storing all the values within a column one after another, then that is better for these queries, because now you can jump to one location on disk or in memory and read contiguous data just for that one column, and you're not polluting memory or doing extra I/O to fetch data you don't actually need for the query. If I don't need the first-name information about all your customers, why bring that into memory, why fetch it from S3 or from disk, for the query? So that's the major distinction between the two. The workload dictates how you actually want to store the data.
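To make the row versus column contrast concrete, here is a small illustrative Python sketch (not from the lecture): the row layout keeps every attribute of a tuple together, which suits point lookups, while the column layout keeps every value of one attribute together, so a scan of one column never touches the others.

```python
# Row layout: each record is one contiguous unit (good for point lookups).
rows = [
    {"id": 1, "name": "Andy", "zip": "15213", "balance": 100},
    {"id": 2, "name": "John", "zip": "94105", "balance": 250},
    {"id": 3, "name": "Ana",  "zip": "15213", "balance": 75},
]

# OLTP-style access: fetch one whole record by key.
andys_account = next(r for r in rows if r["id"] == 1)

# Column layout: each attribute stored contiguously on its own.
columns = {key: [r[key] for r in rows] for key in rows[0]}

# OLAP-style access: "find me all the records within this zip code" only
# scans the zip column; names and balances are never read.
matches = sum(1 for z in columns["zip"] if z == "15213")
print(andys_account, matches)
```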
Speaker 3:But it turns out that with the columnar organization, there's a bunch of other advantages you can exploit. Because all the data within a single column is contiguous, you can start using compression and other encoding techniques to reduce the size of the data on disk, and then, when you bring it into memory, there are other optimizations you can do to keep things in low-level CPU caches and let you rip through the data and process it more efficiently.
Speaker 3:An obvious example: say I have a column with someone's sex, and to keep it simple say it's male and female, I understand it's more complicated than that, but just say it's two values, male and female. Instead of storing male, female, male, female over and over again, if I sort the column based on the value, I can have all the males first, followed by all the females, and now I just need to record: it's males, and I have a million of them, followed by females, and I have a million of them. Now I'm taking what would have been two million values and storing them as just two entries that say, here's the value and here's how many times it repeats. That's the idea behind run-length encoding. There are other optimizations you can do, but if things are stored contiguously within the column, there's a whole bunch of other ways to make the database system run faster, and that's why the modern columnar systems just crush anything that was done in the 1990s or earlier.
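Here is a toy run-length encoder in Python matching the example above: a sorted column of two million values collapses to two (value, run length) pairs. This is only a sketch of the encoding idea, not how any particular engine implements it.

```python
# Run-length encoding of a sorted (or clustered) column.
from itertools import groupby

def rle_encode(column):
    # groupby() groups consecutive equal values; count each run's length.
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]

def rle_decode(runs):
    return [value for value, length in runs for _ in range(length)]

column = ["male"] * 1_000_000 + ["female"] * 1_000_000
encoded = rle_encode(column)
print(encoded)                      # [('male', 1000000), ('female', 1000000)]
assert rle_decode(encoded) == column
```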
Speaker 2:Yeah. And do you see opportunities for convergence here? There's a lot written about HTAP, hybrid transactional/analytical processing workloads. Or will Stonebraker's quote, "one size does not fit all," still ring true for the foreseeable future?
Speaker 3:So, hybrid transactional/analytical processing. This is a Gartner term. It's an amalgamation of OLTP, online transaction processing, which I think dates to the 80s, and OLAP, online analytical processing, which dates to the 90s; that one he attributes to Jim Gray, the Turing Award winner in databases. So HTAP is the idea that I can have a single system support both categories of workloads. On paper, this clearly makes sense, right? Why provision one database for your transactional workloads and another database for your data warehouse or your analytics? It would be great if a single system could just do everything. In practice, though, the challenge is that the stakeholders at an organization are typically separate, meaning the people who run the transactional system are not the same people who run analytics. And if you come along with a system that does a half-assed job at both of them, the stakeholders for the transactional stuff say, why would I want this handicapped system that can do both? I only care about transactions, give me the best transactional database you have. And likewise, the data warehouse people say, I don't care about transactions, give me the best analytical system you have. So selling an HTAP system can be challenging, because the people writing checks for the software often have different needs and different desires. I will say, though, it makes sense to have some minor support for analytics on the transactional side of things. I don't think it makes sense, or it would be very difficult, to sell a full-fledged transactional database system bolted on top of your data warehouse. Snowflake has their hybrid tables, but that's certainly not meant to replace a full-fledged system like an Oracle RAC or, say, an Aurora. It can do some transactional stuff, which makes sense for certain workloads, but you wouldn't want to run your full-fledged front-end application on top of that. On the flip side, if you add some support for analytics on the operational database system, I think that makes sense, because now you can do some minor analysis on the data as it comes in, rather than having to wait for it to stream out to your data warehouse, get processed, do the analytics, and send the results back. But you certainly wouldn't want to run heavyweight analytical jobs on the transactional system all the time, because that's going to slow things down.
Speaker 3:Now, there are systems that take a hybrid approach, which I think is the right way to do it. Oracle does this; it's the fractured mirrors idea, where you basically have the row store, all new data comes in and gets stored on the row store side, and then they maintain a copy of the data in a column store format. So now when a query shows up, you can figure out, okay, does this query need to touch the analytical side, and therefore run on the column store, or does it need to run on the row store side? I think that's probably the right way to go. But trying to build a full-fledged column store that can do transactions at the same speed a row store can, I think that's challenging.
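Here is a rough Python sketch of the fractured-mirror idea described above, purely as an illustration of the architecture and not how any particular vendor implements it: new data lands in a row-store copy for fast point access, a columnar mirror is maintained alongside it, and a tiny router sends point lookups to one copy and aggregates to the other.

```python
# Illustrative only: a dual-format "fractured mirror" store with query routing.
class FracturedMirror:
    def __init__(self):
        self.row_store = {}        # primary key -> full record (OLTP side)
        self.column_store = {}     # attribute   -> list of values (OLAP side)

    def insert(self, record):
        self.row_store[record["id"]] = record
        # In a real system the columnar mirror would be refreshed asynchronously.
        for key, value in record.items():
            self.column_store.setdefault(key, []).append(value)

    def get(self, pk):
        # Point lookup: route to the row store.
        return self.row_store[pk]

    def avg(self, attribute):
        # Aggregate: route to the column store.
        values = self.column_store[attribute]
        return sum(values) / len(values)

db = FracturedMirror()
db.insert({"id": 1, "balance": 100})
db.insert({"id": 2, "balance": 300})
print(db.get(1), db.avg("balance"))
```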
Speaker 2:And one of the interesting things you're bringing up is that they're each better at different things, right? The column engine is generally better for these large-scale analytics workloads, where people are trying to get an approximate sense of some metrics in their business, whereas for the transactional workloads you're thinking, okay, this is your ATM system, your travel reservation system, your health records; all these things are very operational. What's an example of some of the actual technical trade-offs going on there?
Speaker 3:Of, like, why you don't want to run analytics on top of the row store? Or...
Speaker 2:Just in terms of, you know, what is a transactional system prioritizing that an analytical system can't do very well, for instance.
Speaker 3:Oh, so a transactional system wants to run transactions, and in that world it's all about minimizing latency, usually the P99: being able to execute queries and return responses to the application as quickly as possible. And when you run transactions through the concurrency control protocol, where the database system is trying to make it look like each transaction is running by itself while it's actually interleaving multiple queries and transactions at the same time, you want things to finish as quickly as possible, because transactions are holding locks or doing a bunch of bookkeeping on the inside, and that can slow down other transactions. So if your transaction runs slower, there's this butterfly effect where it causes the transactions behind you to get slow.
Speaker 3:And so, as I mentioned, the fractured mirrors or split row store/column store approach is a way to do isolation and say: for all my transactions, I'm working against just the row data and I can run those as fast as possible, and maybe the slower analytics run on the side, on the column store data. And if I build my system a certain way, the two don't have to communicate about who's reading what data at what time, and I can just let the row store run as fast as possible. So the priorities for what you care about from the application's perspective in the transactional system are just different from the column store system, and that causes you to make different system design choices.
Speaker 2:Yeah, absolutely. And in the market you always see this attempt to unify everything: okay, we're going to unify transactions and analytics, we're going to unify batch and streaming. But the consensus from people who have done this for a long time is that there are a lot of trade-offs going on there; it's hard to actually do both of those things well. This again comes back to that quote from Michael Stonebraker, which is that one size does not fit all. And yeah, it's one of those interesting things: when you look at it at a high, intuitive level, you think, why wouldn't someone want just one storage engine that does everything for me? From a layman's perspective, you'd think that would be a killer product for the market.
Speaker 3:But when you work on this stuff, there are a lot of technical trade-offs going on when you make a plan as well. The reason I don't think the HTAP stuff has really taken off is, as I was saying in the beginning, oftentimes not technical, right? Think about building a new application from the very beginning, whether you're a startup or inside of a company. What happens? Well, in the very beginning you don't have any data, and you're building an application to interact with the outside world, or some other outside thing, and to collect new data. You'll make a web app, you'll make a new application that can ingest data from the outside world. At the very beginning you have nothing; you need to get data. So you're building essentially a transactional database application, because you're ingesting things from the outside world, you're updating things, updating state. So you start with a transactional database system like a Postgres or whatever, and now you start ingesting data.
Speaker 3:Then, typically, the path of growth for these applications is not that one day you have one user and the next day you have a billion users. That's rare; it happens, but it's usually not the case. So it's a gradual increase in usage, new features get added, new data comes in, and the database gradually grows over time. And then at some point you say, okay, now we do have data, let's run analytics on it. At that point you either try to run analytics on the data you have, and maybe it does okay but not great, or you say, I don't want to touch this operational database, this transactional database, right now; let me start offloading the data to a data warehouse, a Snowflake or whatever, and run analytics on that. So over time these two things grow up separately, and the HTAP market is basically trying to say, okay, these things grew up together but separately, as siblings; let's now sell you something that sits in the middle and can do both.
Speaker 3:And it's very hard to sell a database system to replace a working and functional transactional database system, because the risk is so high. You're saying: you're running Postgres, you're running Oracle RAC or whatever on the transactional database; get rid of that and replace it with my new database. But if that new database fails, then you're screwed, because now you can't take in new data, you can't run orders, you can't take payments. Whereas on the data warehouse side, you can still keep the old data warehouse up. You just make copies of the data into whatever the new product is, and if it fails, no big deal, because you're still ingesting data in your transactional database and you can always roll back to the other data warehouse.
Speaker 3:So the barrier to entry to replace a data warehouse is much lower than it is to replace a transactional database. And if you're trying to say, I have an HTAP database, you're still going to face that same barrier, where people don't want to fix what's not broken. It's very hard to get people to replace a transactional database once it's grown up and been around for years. There's a reason why IBM still makes a ton of money on IMS, the database they built for the moon mission in the 1960s: people don't want to break stuff that's running just fine, and they don't want to change it.
Speaker 2:Yeah, that's such a great point. Practically, it'd be very hard to take that to market, because you have your CTOs and teams who've already built applications running on top of the transactional database, which could have been in place for, let's say, a year, or for 40 or 50 years. And then you have this new CIO, or chief data and analytics officer, whatever you want to call it, who wants to come in and say, hey, I want to unify analytics across all my systems. The transactional system is just one of my sources; I also have my marketing data, I have my third-party data, just get that all into the analytics warehouse. And then I'm going to have who knows, hundreds to thousands of data scientists and analysts running queries on that thing, right? From that perspective, you can see why it would be hard to say, okay, hey, transactional database owner, let's onboard all these analytical users onto your system. That would be its own kind of logistical nightmare and technical nightmare; how would you even do that without impacting the production operational applications? It's definitely very cool when you see some of the new technology coming in.
Speaker 2:Take, for instance, DuckDB. I know that's in some of the exercises in your course as well. I did like that part of the SQL exercises where you run some queries in SQLite, put a timer on them, and then run the same queries in DuckDB, and it's just automatically faster. And for me personally, I use both DuckDB and a data warehouse. The product I work on does change data capture and data streaming, so we'll always stream data out from our transactional database into our warehouse with Striim, but there are some in-situ analytical workloads that we want to put on top of the operational database. DuckDB has been awesome for that, because in that same application that's managing the database, I can just instantiate DuckDB and it'll do these super fast analytical queries, and that just solves our problem. It's not super scalable, but it's fast and it works.
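A rough Python version of the exercise described above: run the same aggregate query in SQLite (row-oriented) and in DuckDB (columnar, vectorized), both in-process, and compare the timings. This assumes the duckdb package is installed (pip install duckdb); the actual speedup depends entirely on the data, the query, and the machine.

```python
# Compare one aggregation query across two in-process engines.
import sqlite3
import time
import duckdb

N = 1_000_000
query = "SELECT grp, AVG(val) FROM t GROUP BY grp"

# SQLite: load a synthetic table, then time the aggregation (load not timed).
lite = sqlite3.connect(":memory:")
lite.execute("CREATE TABLE t (id INTEGER, grp INTEGER, val REAL)")
lite.executemany("INSERT INTO t VALUES (?, ?, ?)",
                 ((i, i % 100, float(i % 7)) for i in range(N)))
start = time.perf_counter()
lite.execute(query).fetchall()
print("sqlite :", round(time.perf_counter() - start, 4), "s")

# DuckDB: build the same table (generated in SQL), then time the same query.
duck = duckdb.connect()
duck.execute(f"""
    CREATE TABLE t AS
    SELECT i AS id, i % 100 AS grp, CAST(i % 7 AS DOUBLE) AS val
    FROM range({N}) r(i)
""")
start = time.perf_counter()
duck.execute(query).fetchall()
print("duckdb :", round(time.perf_counter() - start, 4), "s")
```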
Speaker 3:Yes, DuckDB is fantastic. I think what Mark and Hannes have built is phenomenal. I'm jealous, not in an "I hate them" way, but they built a database system, I like database systems, and they got people to actually start using it. They had the idea from the very beginning that this thing needs to be super portable and have a low barrier to entry, and I think they crushed it.
Speaker 2:Yeah, and the out-of-the-box performance is always awesome. Engineers love publishing mini benchmarks of their internal workloads, and it's always a big win when I can go show my manager: hey look, we sped things up by 5 to 10x by throwing this technology in there. Of course, my management are all incredibly smart people and had a ton of questions, and I said, okay, we're not solving that problem, we just solved this problem and increased the performance there. There's always some excitement around performance gains. So, we talked a bit about Michael Stonebraker; I mentioned his name and you did as well, in the context of "one size does not fit all." But you recently co-published a paper with him called What Goes Around Comes Around... And Around. I want to hear the story behind that paper.
Speaker 3:Yeah. We should preface this by saying Mike wrote another paper with Joe Hellerstein at Berkeley in 2006, called What Goes Around Comes Around, and our paper in 2024 is the 20-year follow-up to it. I guess, for background, for people who don't know: Mike Stonebraker won the Turing Award for databases in 2014. He's the inventor of Postgres, and prior to that, Ingres. He's been involved in the database scene for a long time. He's brilliant.
Speaker 3:So he wrote a paper in 2006, and basically it was a historical summary of all the data models, going back to the hierarchical and network data models, which predate the relational data model, in the 1960s. It basically shows all the attempts to build something better than the relational model, why they didn't pan out, and why the relational model, or, as he puts it, the object-relational model, which is technically what Postgres' data model is (it just means a relational data model that's extensible), is superior. And then, I think it was during the pandemic, around 2020, I saw a Hacker News post where someone was like, hey, I don't know why people are using relational databases, we should just use graph databases for everything. And I was like, oh man, in all my proselytizing for the relational data model, I feel like this is basically Mike's thesis of "what goes around comes around," where people just keep reinventing the wheel over and over again. This person asking why we aren't using graph databases is just unaware that people tried that with this thing called CODASYL, the network data model, in the 1970s, and it didn't work, for reasons X, Y, Z. So I felt we should put out another paper as a follow-up and say: here's everything people have tried in the past, in case you don't know your history, and here's why the relational data model is going to be the best, or the better, choice.
Speaker 3:And so we wrote the follow-up paper that came out this year, and it's basically an updated analysis of the data models that have come out since the original 2006 paper. We do discuss key-value stores, which predate that; key-value stores are probably late 80s, early 90s. But we cover array and matrix data models, the vector data model, the document data model, the graph data model, and then things like MapReduce systems that don't really have a data model, and text search systems that don't have a data model. We discuss those as well.
Speaker 3:So that's the first half of the paper. The second half of the paper, which we haven't talked about as much, is the developments and advancements in database systems since the early 2000s, and the main takeaway from that part of our analysis is that most of the improvements and advancements have been in the context of relational database systems: cloud architectures, hardware acceleration, the column stores we talked about before. There was also this group of systems called NewSQL that do high-performance distributed transactions. So we basically talk about all the things people have tried, what worked, and what didn't work.
Speaker 2:Yeah, and it's really important that you put out papers like this. I don't want to get too deep into it, but new database companies are really good at marketing. In a lot of cases they do have a lot of value, but sometimes they'll make claims that are a little hard to defend when you look at them objectively.
Speaker 3:It happens quite often, right? I hate to shit on blockchain databases, but blockchain, that's another one where people would say, why would you ever want to store your things in a relational database, just use a blockchain. No, it's stupid, it's slow, it's wrong, and here's why. So yeah, instead of having a Google News Alert or something to find out every time people say, hey, relational databases are stupid, SQL is stupid, here's a better thing, and then having me go reply to them, we just write this paper and people can point at it. And then in 20 years people will forget we wrote it and we'll have to write the next one.
Speaker 2:Yeah, amazing. And I'm sure it won't be hard, because I think 98% of what you wrote in this year's paper will still be relevant in 10 or 15 years as well, and will probably just need to be restated and republished. I did want to get some of your perspectives on data lakes and data lakehouses. You did mention that they've emerged as a challenge to the monolithic data warehouse. There are, of course, all these conversations going around Iceberg and Delta, and of course Databricks bought Tabular, the Iceberg company, for a lot of money.
Speaker 2:Yeah, for a lot of money. And now AWS has created S3 Tables, which manage Iceberg tables for you. From their perspective, AWS is probably seeing all these people managing tables on top of their object layer in S3, and they're probably saying, hey, why don't we just do that? I want to get your perspective there. Do you think it's a resilient technology that a lot of companies can go ahead and adopt, or do you think there are some challenges ahead?
Speaker 3:Yeah, I guess we first want to define what a lakehouse, or data lake, is. The way people ran data warehouses before data lakes was that you would provision a very expensive machine, or a large number of machines, and that would be this monolithic architecture where you put all the data for the organization. The database system was basically the gatekeeper. If you wanted to store data in your data warehouse to do analytics on, you had to define a schema, you had to have the hardware or get permission to put it in, and then you would insert it into the database. And the database does what we call managed storage, where it is responsible for ingesting the data, organizing the actual physical bits that get stored on disk, and maintaining that information. With the data lake architecture, the idea is that instead of having everyone go through central control to put data into this data warehouse, you allow anybody to just write a bunch of files to S3, or whatever your object store is. And then, if someone wants to do analytics on this data, they don't have to go through the data warehouse; they can just grab the data off of S3 and process it there. These files, worst-case scenario, are just a bunch of JSON or CSV, text-formatted data. But things like Parquet and ORC are binary-encoded columnar formats, laid out as if the data were being organized by the database system, and there are libraries to generate these files yourself and just write them out.
Speaker 3:So the idea of a data lake is, instead of having everyone go through a monolithic data store, you just put things out in S3, and then you use a catalog service; Databricks has Unity, there's HCatalog, there's now Snowflake's Polaris. That's how you find the data in the data lake, and then a query engine provides a SQL interface to run queries as if the data were in the managed storage of a data warehouse, when actually it's just on S3. Dremio and Databricks are examples of this. So that's the background there.
Speaker 3:Where Iceberg comes in, and there's also Hudi, and, I think there's another one I'm forgetting, Delta Lake from Databricks: what those are is, instead of having people generate the Parquet files in their application code and shove them into S3, you now have an interface where you can do inserts, updates, and deletes on what are essentially relational tables on top of files. They look like tables to you, but underneath the covers it's just JSON and Parquet. Iceberg, or these different middleware layers, are responsible for collecting the information, running transactions for you if you want to update things, and compacting and coalescing everything into Parquet files that get written out to S3. And then they also provide the catalog service, so you know which files correspond to which tables and so forth.
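As a small illustration of the disaggregated flow described above (illustrative only, not from the episode): one process writes a columnar Parquet file to shared storage, here the local filesystem standing in for S3, and a completely separate engine queries the file in place. Table and catalog layers like Iceberg add transactions and metadata on top of files like this. Assumes the pyarrow and duckdb packages are installed.

```python
# "Producer" side: write a columnar file without going through a warehouse.
import pyarrow as pa
import pyarrow.parquet as pq
import duckdb

events = pa.table({
    "user_id": [1, 2, 1, 3],
    "event":   ["login", "login", "purchase", "login"],
    "amount":  [0.0, 0.0, 42.5, 0.0],
})
pq.write_table(events, "events.parquet")

# "Consumer" side: a separate query engine reads the file directly.
print(duckdb.sql("""
    SELECT event, COUNT(*) AS n, SUM(amount) AS total
    FROM 'events.parquet'
    GROUP BY event
""").fetchall())
```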
Speaker 3:As you mentioned, Iceberg is the standard now. Iceberg came out of Netflix, Hudi came out of, I think, Uber, and HCatalog came out of the Hive project, which came out of Facebook. People have coalesced on Iceberg as the standard. And, as you mentioned, Amazon now supports it natively. They had support for native Parquet querying with S3 Select as well; they haven't fully removed it, but you can't get it now if you're a new customer, while existing customers can still do basic predicate pushdown, or filtering, on Parquet files directly inside of S3. So Iceberg looks like it's going to become the standard. The backstory is that Snowflake had been incrementally adding support for Iceberg, I think since 2021, and they were in talks to buy Tabular, I think for $600 million, either early this year or late last year. And then Databricks came in, threw $2 billion in their face, and stole them, on the day of the Snowflake Summit CEO keynote where he was announcing their Iceberg support and Polaris.
Speaker 3:The Polaris catalog. And then the next week, at the Databricks Summit, they announced they were open-sourcing the Unity catalog. But with Polaris, I think it's written in Rust or C++; it's not written in Go or Java, and it has become an Apache project, so it's not just Snowflake building it now. Dremio's involved, a bunch of other companies are involved. So yeah, basically Iceberg is now the standard. Under the covers it's, again, as I said, JSON files and Parquet, but the interface to the catalog and the transactional updates, people have coalesced around that, I think.
Speaker 3:Do I think this is how people are going to build things long-term? Absolutely, yes. The idea that you're going to have a monolithic data warehouse where some administrator has complete control over who puts things in and takes things out, you're still always going to have that kind of governance, but I think the disaggregated architecture of something like Iceberg is the way to go. It just makes sense, because why spend so many engineering resources building a storage layer? If you're trying to build a system, just rely on S3. It's infinite storage, infinite in quotes; obviously there's a finite limit at Amazon, but you'll run out of money before you can fill up S3 on Amazon or GCS or whatever it is. I think this is the right way to build a modern OLAP system. I think it's here to stay; I don't see it changing anytime soon.
Speaker 2:Do you trust that Iceberg is the best table format this time?
Speaker 3:When you say Iceberg is the best table format: underneath the covers, it's Parquet. Do I think Parquet is the best format? No. Parquet was designed in 2012, 2013, in a totally different world, where disk was always considered the slowest thing and therefore you used heavyweight compression to minimize the amount of disk I/O you had to do. The hardware has changed enough that network and disk are actually pretty fast, and the CPU has become the bottleneck. We have a paper that came out in VLDB last year that did an analysis of Parquet and ORC and basically showed a bunch of assumptions they made back in the day that don't make sense now, and we've been pushing for a new format based on this, which I can talk about. Microsoft put out a paper around the same time as ours that corroborated our results, finding the same things about Parquet and ORC, because they're old. So we have a line of research trying to build a new file format to replace Parquet and ORC. We're not so much interested in the encodings for the different columns, like the run-length encoding for column-store data I mentioned before; there are implementations of that, and they're going to evolve over time.
Speaker 3:I'm more interested in the scaffolding around a file: how you define what the file should look like and how you store the metadata that says, here's how the data is actually encoded. The reason this matters is that if you want to add a new encoding scheme to Parquet right now, you can't do it; you'd have to go modify the original Parquet code. This also makes your data not portable, because if my application isn't using my modified version of the Parquet reader, and there are a bunch of different implementations in different languages, then I can't read this newly encoded data. So we're interested in building a new file spec that allows for portability and extensibility, so that we don't have to reinvent the wheel every 10 years with a better version of Parquet. We can build one file spec, similar to POSIX: one spec that evolves over the years but is standardized. That's what I'm interested in building, that specification, not "here's whatever we want for the exact hardware we have right now."
Speaker 2:Yeah, that's exciting for sure, and it seems like every four or five years the architectures shift a little bit. People standardize on one approach, the vendors hyper-optimize their selling around that approach, then things magically become too expensive again, and people find a way to make architectures cheaper and more distributed, with more composable components. So yes, coming up with a new file format that's more efficient and accessible, I think that's also a very exciting area, and it will be cool to see how that gets plugged into some of these systems.
Speaker 3:Can I give you a preview of how we're going to handle future-proofing this thing? Please do.
Speaker 3:I would also say, we have our format, and we were actually in discussions with the Velox guys; they put out a file format called Nimble. We had discussions with NVIDIA about some stuff. We had a larger collaboration in the works, but for legal reasons it all fell apart. But Facebook, or Meta, has their Nimble file format, LanceDB has their format, there's another company out of New York City called SpiralDB that has Vortex, and the DuckDB and CWI team has their own file format. So there are a bunch of them going around.
Speaker 3:As I'm saying, I don't want to build a competing format. It's more about, again, the scaffolding. And the way we handle extensibility and portability is that you actually embed a WASM binary that can decode the data in the file inside the file itself. So now, 10 years later, if I have some file that was created 10 years ago and I don't have the code to process it natively, I can process it using the WASM that's embedded inside of it, and this allows you to future-proof the architecture. To be clear, we didn't invent this idea. It came from Wes McKinney, who has been working with us, the man behind pandas and Apache Arrow, but he actually got the idea from Mark and Hannes at DuckDB. So our new file spec that we hope to put out this year will incorporate this WASM piece.
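As a toy illustration of the self-describing idea above (a conceptual sketch only): the file carries its own decoder, so a reader that has never seen the encoding can still read the data. The real proposal embeds a WASM binary; this sketch embeds Python source purely to show the layout, which you would not do with untrusted files.

```python
# A self-describing file: header holds the decoder, body holds encoded runs.
import json

def write_file(path, values):
    decoder_src = (
        "def decode(runs):\n"
        "    return [v for v, n in runs for _ in range(n)]\n"
    )
    runs, prev = [], None
    for v in values:                      # simple run-length encoding
        if prev is not None and v == prev[0]:
            prev[1] += 1
        else:
            prev = [v, 1]
            runs.append(prev)
    with open(path, "w") as f:
        json.dump({"decoder": decoder_src, "data": runs}, f)

def read_file(path):
    with open(path) as f:
        blob = json.load(f)
    ns = {}
    exec(blob["decoder"], ns)             # load the embedded decoder
    return ns["decode"](blob["data"])

write_file("demo.col", ["male"] * 5 + ["female"] * 3)
print(read_file("demo.col"))
```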
Speaker 2:Okay, excellent, definitely looking forward to that. Again, a lot of excellent ideas floating around right now in the data industry; it's really fun to follow along with. Let's say you are a software engineer working in data, or even all the way up to the CIO level. What's your advice for staying educated and keeping your technical skills resilient in this industry? Emphasis on resilient, meaning you could learn one technology, but that technology could become out of date in 10 years. At this rate, everything's going to be out of date in six months. But I'd love to hear your advice on staying resilient in this market.
Speaker 3:So this gets into what it means in the modern era to have a degree in computer science, right, because I'm in that business of selling education. Without touting Carnegie Mellon too much, the other top schools, I think, make a similar attempt at this. It isn't about learning whatever the hot thing on Hacker News is and how to use it, and so forth. It's about first principles and the fundamentals of the concepts needed in data processing, data analytics, data analysis in general. As long as you understand the fundamentals of what a transaction is actually trying to do, what it means to have atomicity, consistency, isolation and durability, then no matter how the hardware evolves or the workload evolves or the use cases evolve, you can always map whatever those new things are back to those fundamentals. So I would say for developers, understanding the fundamentals is key. At the higher level, like for a CIO, how do you make sense of whatever the buzzwords are? The challenge there is that it isn't always for technical reasons that you choose one vendor versus another. Oftentimes it comes down to: do I already have my credit card with this company, so they can sell me another product, and it's easier because I don't have to go through procurement to get access to it. I would say that at the manager level, being skeptical about the claims people make about their products matters; it's always better to be more skeptical than less skeptical. As you said, when people come along and say, hey, here's my brand new database system or technology that can change the world and do everything you need, you really need to understand what your use case is, understand what this company is bringing to the table, and then understand, okay, here's the actual architecture of what they're doing and how it's implemented, to figure out whether your use case is even going to make sense or not.
Speaker 3:The example I always like to use in my classes is this episode at Uber, I think in 2016. They were originally running on MySQL, and someone said, hey, let's switch to Postgres. So they switched to Postgres, and because of their application's workload patterns, that was actually a terrible choice. The way Postgres does multi-versioning and so forth was absolutely the worst thing you could do for Uber.
Speaker 3:So then they had to go switch back from Postgres to MySQL. Had someone understood what the application is actually trying to do, what the queries want to do, what the data actually looks like, what the access pattern is, and then understood the fundamentals of how Postgres does multi-versioning versus how MySQL does multi-versioning, they could have mapped that and said, does this make sense or not? So I think it pays to understand what the internals are for certain things. Then, as new things come along, it's very unlikely that someone's going to invent something crazy and brand new that no one's ever thought about before, so it's about getting past all the marketing BS to figure out, okay, what are they actually doing?
Speaker 2:And then understanding how that maps to your needs. Absolutely, and I think that's great advice for the data engineer, the software engineer and the CIO: ultimately, understand the first principles and be a little skeptical of vendor claims, because even when things seem very new, or companies raise a ton of money or something like that, it's rarely something completely novel that no one's thought of before. Maybe they found a better way to commercialize it than others. But yeah, there's always a bit of work to do to actually test this stuff.
Speaker 3:Can I give one example without naming names? Yeah, sure. I got an email from the CEO of a database company that you've heard of, but I can't say who they are. They sent me an email and said, hey look, I watched your lectures on distributed architectures, shared disk, shared nothing. We think we have a new architecture that doesn't fit in any of these. We think it's brand new. We want to tell you all about it. So I got on the call, talked to them, and sure enough, it was just shared disk, separating compute from storage. So again, without understanding the history and the background and the fundamentals of these things, people will make claims that wouldn't pass the sniff test for anybody who does know these things.
Speaker 2:Yeah, absolutely. Especially building change data capture and data integration, I'm constantly straddling all these different new-age technologies and replicating from them. The established vendor storage engines have certainly solved the scale issue for companies that actually need scale, and I think there's a lot of great technology coming in to make this stuff more practical for enterprises to implement. Now the other side of this, of course, is all the innovation happening in AI and machine learning, and you've spoken on this in your paper as well. Which architectures and capabilities would you say are absolutely critical for a CIO or CTO to build into their strategy over the next three to five years to actually realize the potential of AI?
Speaker 3:In terms of database-type architectures, or machine learning architectures, or both?
Speaker 2:Yeah, that's a good question too.
Speaker 3:The answer is it doesn't matter, right, because here's what matters: if your data is dirty and total crap, then who cares whether you're running on NVIDIA's latest GPU or whatever. If your data is super dirty and messy, it's garbage in, garbage out. So I would say, regardless, if you're going to do AI stuff anyway, putting the right controls and mechanisms in place to make sure that people can't give you crap data is super important. And it's hard to justify, because it's like, hey, we think we might need this in three years, so let's do a bunch of stuff to make sure our data is clean now. Yeah, that sucks. But I would say that's the most important thing. If your data is completely useless garbage, no AI magic is going to make it better for you.
Speaker 2:And when you say useless and garbage, let's say you have a big enterprise database: hundreds, maybe even thousands of these normalized tables with column names that no one can really make sense of, and tons of foreign keys to join just to get any data out. And people are saying, okay, we're going to throw vector databases at this problem. The first thing I think is, okay, try getting a human analyst to go through all these tables and make sense of them without reading a lot of internal documentation. Do you actually see AI solving that problem?
Speaker 3:It will help. Certainly entity resolution, the idea of realizing that John Kutay versus J. Kutay are the same person, that's an old problem, and absolutely I think LLMs or AI tools can help with this. But like I'm saying, if it's W. Kutay versus J. Kutay and someone made a typo, then you're screwed, it doesn't matter. That's a contrived example, but there are others. The other example people always give is, oh, someone put an email address instead of a phone number. Sure, that one you can check for as well.
Speaker 3:It's the nuances of really large databases and what the semantic meaning is, the latent or implicit meaning of, say, what it means to have this column empty versus null. All of that usually lives in the application domain and oftentimes isn't documented. If you can prevent that from happening in the first place, easier said than done if your database has been running for 20 years, I think that sets people up to better leverage whatever new AI stuff comes along in the future.
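As a small illustration of the kind of upfront controls being described, here is a hedged sketch of column-level checks applied at ingest time. The rules, field names, and sample values are invented for the example.

```python
import re

# Simple format checks applied before a row is accepted into the warehouse.
PHONE_RE = re.compile(r"^\+?[0-9][0-9\-\s]{6,14}$")
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate_contact_row(row: dict) -> list:
    """Return a list of problems; an empty list means the row passes."""
    problems = []
    phone = row.get("phone")
    if phone and EMAIL_RE.match(phone):
        problems.append("phone column contains an email address")
    if phone and not PHONE_RE.match(phone):
        problems.append("phone does not look like a phone number")
    email = row.get("email")
    if email and not EMAIL_RE.match(email):
        problems.append("email is malformed")
    return problems

# Example: an email accidentally entered in the phone column is caught at write time.
print(validate_contact_row({"phone": "jkutay@example.com", "email": ""}))
```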
Speaker 2:Yeah, that's a great point. For cleaning up messy data there are all these master data management tools out there, and a lot of people claim to have great solutions for it. Ultimately, when you see it happen in large companies, someone just has to grind through that work, and I think there are some opportunities for LLMs to help solve those problems. The other approach to this, aside from vector databases, is text-to-SQL. They're not totally mutually exclusive, but text-to-SQL is this idea that you take a natural language query and convert it into a deterministic SQL query that goes in and retrieves the exact data you're looking for. Do you see that as a strong alternative to vector databases, or just a different approach that could work well with them?
Speaker 3:I think it's independent of whether the underlying database is relational versus vector versus JSON. But as you said, the idea of going from natural language to a structured query language, typically SQL, first of all, as is often the case in databases, it's not a new idea. People have been trying this since the 1970s; LLMs are just a modern incarnation of it. I think it's a good idea. It doesn't replace SQL entirely. It's good for quick, one-off things.
Speaker 3:The challenge is, if I don't know SQL, I also don't know what the answer is; if I knew the answer, I wouldn't run the query. So taking my natural language, generating a SQL query, getting back a result, and then knowing whether it's actually correct or not, that's a challenge right there. There are some benchmarks on existing data sets that try to figure out how good you are at this. I think the challenge is going to be that English, or whatever natural language you're using, is imprecise, whereas ideally something like SQL would be precise, although it isn't always. The idea that you're going from an imprecise language to a precise language, and then trying to tweak and contort the natural language when the result isn't exactly what you want, that's a big challenge.
Speaker 3:Where I see these natural language converters being useful is as the first attempt: give me a SQL query, but then show it in a more structured form or interface that I can tweak by clicking buttons in a dashboard or something like that. So a first pass, but then put it into a form that makes it easier to edit. That's where I see the future of this being. For one-off things the results are pretty stunning, but obviously you wouldn't write your application this way. You wouldn't build a web interface or a website where natural language generates the queries, because if the LLM model changes, then the query changes, and then everything breaks. What do you do? But for one-off analytics, I think this makes sense.
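Here is a hedged sketch of that "first pass, then verify" workflow. The call_llm helper is a stand-in for whatever model API you actually use, and the schema, prompt, and canned query are invented for illustration.

```python
import sqlite3

def call_llm(prompt: str) -> str:
    """Stand-in for a real model call; here it just returns a canned query."""
    return "SELECT country, COUNT(*) AS riders FROM users GROUP BY country;"

def text_to_sql_first_pass(question: str, conn: sqlite3.Connection) -> str:
    # Give the model the table definitions so it has something to ground on.
    schema = "\n".join(r[0] for r in conn.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table'"))
    candidate = call_llm(f"Schema:\n{schema}\n\nWrite SQL for: {question}")
    # Sanity-check the generated SQL by asking the engine to plan it,
    # without executing it against the data.
    conn.execute(f"EXPLAIN QUERY PLAN {candidate}")
    return candidate  # hand this to a human or a structured editor to tweak

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, country TEXT)")
print(text_to_sql_first_pass("How many riders per country?", conn))
```

The key design point matches the discussion above: the generated query is treated as a draft to be inspected and edited, not as something an application trusts blindly.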
Speaker 2:Yeah, absolutely, and I think this also comes back to the differences between operational, transactional workloads, analytical workloads and search workloads. Ultimately it comes down to smart people designing their applications the right way, rather than assuming that AI is going to, I don't want to say idiot-proof everything, but make everything really accessible to natural-language business users who don't know how to write SQL. Like you said, that's been a challenge for a long time.
Speaker 3:But go beyond natural language, though. If you're taking English or whatever natural language you want and going to SQL, why not go from whatever query language you have now to SQL, or from one SQL dialect to whatever query language you want? You can think of these things as being an integration bridge that allows for portability in a way we didn't have before. Now the challenge, of course, is that there are nuances and semantics of certain operations that make sense in one data system versus another. Today it's very hard to get off Oracle because the SQL syntax is just a little bit different from CodeCast and other systems, so you can see LLMs as reducing the barrier to switching and changing things around. That's another avenue of direction I think is interesting as well.
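One way to make the dialect-portability idea concrete, using the rule-based sqlglot library rather than an LLM (a different mechanism from what is described above, but the same goal), is a transpile step like this sketch; the query and table are made up for the example.

```python
import sqlglot

# Oracle-flavored SQL using the vendor-specific NVL function.
oracle_sql = "SELECT employee_id, NVL(bonus, 0) AS bonus FROM employees"

# Rewrite into the Postgres dialect. Syntax is translated mechanically,
# but engine-specific semantics still need human review, as noted above.
print(sqlglot.transpile(oracle_sql, read="oracle", write="postgres")[0])
```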
Speaker 2:Yeah, absolutely. I've seen migration tools that are doing a really good job with the SQL conversion, and that is a bit more of a, you could call it, predictable workload. Andy, I really want to thank you for being on this podcast. You're always very generous with your insights. I encourage everyone to follow along with Andy's work. Block out some time, maybe three or four months, to take his database class if you're really adventurous and willing to do the work. It's really fun stuff, and even if you just go through the lectures, there's so much to learn. Andy, where should people follow along with you? It seems like social media. Where are you most active these days?
Speaker 3:I'm trying to get off Twitter. I do have an account there, but we post most of the news stuff on Bluesky now. There's also the YouTube channel. Everything we do is always public; everything's on YouTube. The course next semester will be a special topics course on query optimization, and the lectures will be on YouTube. We don't do any advertising for it; it's just, hey, here's the course, and people find it.
Speaker 2:You can follow Andy on Bluesky. We'll have his Twitter, or X, handle there as well, and a link out to his YouTube course. Andy Pavlo, thank you so much for joining this episode of What's New in Data, and thank you to the listeners for tuning in. Hey, John, thanks for having me, it's always fun.
Speaker 3:Bye.