The Test Set by Posit

Deeply Unsexy: SQL's Redemption Arc — with Tristan Handy

Posit, PBC Season 1 Episode 19



dbt Labs CEO Tristan Handy drops into The Test Set to map the fault lines between the data science world and the enterprise data world — and explain why analytics engineers are basically pissed-off data analysts who decided to organize the bookshelf. We get into SQL's glow-up, the community magic of dbt Slack, what AI agents mean for data warehouses, and why everyone's building iOS apps with Claude now.

What's inside:

  • What analytics engineers *actually* do
  • SQL's journey from deeply unsexy to indispensable
  • How dbt turned source control into a source of truth
  • Building a tech community without the RTFM energy
  • AI agents on your data lake: permissions get personal
  • Will LLMs kill the open-source package ecosystem?
  • Edible gardening, welding dreams, and digital dysphoria

Welcome to the Test Set. Here we talk with some of the brightest thinkers and tinkerers in statistical analysis, scientific computing, and machine learning. Digging into what makes them tick, plus the insights, experiments, and OMG moments that shape the field. In this episode, Hadley Wickham and I sit down with Tristan Handy, CEO of dbt Labs, to discuss why SQL, once seen as part of the old world of data, has become central to how modern data teams work. So, basically, roughly a decade ago, a tool in the tidyverse called dbplyr enabled our users to query databases without ever really touching SQL. But around the same time, another tool emerged, dbt, and it took almost the opposite approach, making it easier to go hands on and write SQL directly while bringing in critical ideas from software engineering like version control and automated testing. In this conversation, we explored the emergence of tools like dbplyr and dbt, who they're for, and why SQL has ended up at the center of it all. Alright. Hey, everybody. Welcome to the Test Set. I'm Michael Chow, and I'm joined here by Tristan Handy, CEO of dbt Labs, and my cohost, Hadley Wickham, who's chief scientist at Posit. Tristan, I'm so excited to have you on. I have to admit I'm dbt pilled myself, and as a, like, Python and R user, I've really battled to explain why I'm obsessed with dbt. So I feel like I'm honored to have you here to, I think, really, like, rep dbt and get people hyped about what dbt's cooking. Hell yeah. I appreciate you having me. Yeah. I was wondering, maybe to start out, I know we did, like, a quick demo online of putting up, like, a dbt project, and Hadley was there. I thought maybe to kick it off, Hadley, you could try to explain what your sense of dbt is?
Like, I think the way I think about dbt is it's sort of trying to put SQL on, like, an equal footing with other programming languages, i.e., like, giving you tools for version control, for writing, like, reusable functions or scripts or macros, whatever you wanna call them, like, doing documentation and testing. That's the thing that seems to kinda resonate most with me, but yeah, I'd love to hear, like, how you think of it, if that is, like, at all correct. Yeah. Yeah. I think that's correct. But so this is fascinating because all of us here are data people, and yet we come from two very different trees inside the data community. You folks are in the, like, R and Python world, and I very much am in the, I wouldn't even call it the SQL world, I would call it more of the, like, enterprise data world, which, certainly, like, SQL is a common part of, but also data pipeline and data transformation tools. Like, so Informatica is a part of this world, where it's, like, not really a part of your world at all, I think. Yeah. I don't even really know what Informatica is. So Yeah. Yeah. And so part of the world that we come from includes, you know, quote, unquote, data modeling, which is like Kimball or Data Vault or, you know, choose your religious approach here. And it almost always makes sense in the context of a somewhat larger organization. So certainly, the largest organizations in the world use a ton of R and Python, but a lot of times, R and Python are these generalist tool sets where you can do data ingestion, you can do data transformation, you can do analysis, you can do all this stuff. dbt tends to be used for, now people torture it to use it for all sorts of different things, but really, it's for data transformation.
And so the goal is, like, somehow, people in your organization loaded a bunch of data into a data lake, a data warehouse, whatever, and then you are kind of picking up the baton and saying, like, well, now I need to turn that data into useful datasets. And I may not even be the consumer of that dataset, but I know that there are, you know, a handful to dozens to maybe hundreds of consumers of this dataset that I will create, and I care about very specific tooling to make that process extremely robust. But it doesn't do other stuff. It doesn't build charts or graphs. It doesn't move data, this kind of stuff. Yeah. I think that's a really interesting perspective, because the sort of data science model I've always had is, like, you as the data scientist trying to push to own as much of the stack as possible. Like, you know, hopefully, you can get some of your data from a nice, you know, SQL database somewhere. But often, you might have to do some scraping or, you know, pull together fifteen random Excel sheets. You're gonna have to do the tidying. You're gonna have to do the exploration, the visualization. And then finally, like, you're gonna communicate results to someone else. So it sounds like one of the big distinctions is that, like, in your world, that's not one person who's kind of operating across that whole stack. Someone's doing ingest, and that's almost outside of the realm of dbt, and then you've got someone who's working with the data that's already been ingested. Maybe it's in seventeen different tables, and now you wanna create, like, a nice dataset that, like, someone else is gonna use to create a dashboard or get fed into some other kind of report. Yeah. I remember, gosh, this is way back in the day.
I don't even remember who wrote this, but maybe it was somebody at Stitch Fix who wrote a blog post about the full stack data scientist. And I remember reading that and being like, oh my gosh, I could not disagree more with this approach. But it makes sense in the context that you operate in. So, like, I think the data scientist is a role that, like, kind of arose out of experimental design and validation. Like, you wanted to, like, try to run an experiment on the world, and maybe if that experiment had some validity to it, the organization that you worked with could see, like, significant upside from whatever the thing was that you were trying to prove. In the world that I operate in, there are the internal operations of a company, and those operations just require some data to do the things that they do. You know? It could be a health care provider. It could be, like, literally anything. You may be trying to learn something in the process, but you may just, like, need to produce a customer three sixty dashboard so sales reps can, like, do their renewal calls. And when you operate in this kind of environment, actually, the, quote, unquote, full stack data science model is kinda terrible, because the organization may have to do this business process for multiple decades. And, certainly, like, individual people will come and go over that time period, and your process needs to be, like, very standardized and, you know, transferable between humans. Yes. Yeah. It sounds much more like business as usual, when, almost, like, ideally, the data scientist is trying to, like, disrupt that. Right?
Like, the data scientist is trying to figure out something that you need to do to change your business, like, to make a material difference, not just to, like, understand where you are right now and do all that kind of reporting that people just need to know. Yeah. And a lot of times, these two worlds are complementary. You know, it is, I think, very common that data scientists will use datasets that have been curated by dbt, but then kind of build from there. And then I think it's also very common that features that get engineered in Python and R notebooks can kinda get ported upstream into data pipelines, whether they're built in dbt or Airflow or whatever. Yeah. The other thing that took me a surprisingly long time to kinda figure out is just getting everyone to agree on, like, what a metric is. It turns out to be, like, surprisingly complicated, because there's all these, like, little wrinkles. When you look at it at a high level, it seems pretty obvious. But when you start to talk about, like, you know, if you just wanna know, like, the number of customers, like, what is a customer? Like, how do you count that? There's, like, a lot of fine grained decisions, and, like, different parts of the organization have, like, different priorities when they're thinking about what a customer is. There's some, like, kind of technological problem, like, let's just calculate that number once in one place. But, like, a much bigger kinda sociological problem of, like, how do we get people to actually agree, like, what does this number mean? Yeah. And the thing that I think dbt has done a reasonably good job of over the past decade, so the first commit to the open source repo was in early twenty sixteen, so we're almost ten years in.
That, by and large, there were localized versions of the truth that existed at organizations ten years ago. Now, some organizations were very good at this kind of thing. You know, famously, companies like Facebook and Airbnb and Spotify, like, they invested in their kind of global data infrastructure. But at many companies, the, like, versions of the truth were much more localized, and dbt introduced the idea that you should push these versions of the truth kind of centrally and govern them in source control, and that rather than taking your ball and going home if you didn't like the centralized version of the truth, you would kind of argue it out in public and try to come to some consensus that worked for everybody. Now, obviously, that's imperfect, but I think that we've made, like, honestly, a reasonable amount of progress along that continuum. It does. Yeah. It does feel like that has really changed. Like, I think when I was doing my PhD, like, fifteen years ago, you know, I was working with scientists, and their data comes in, like, every form you can imagine. Like, it's Excel spreadsheets. It's an API. It's a database. It's CSV files, like, whatever. And so, like, absolutely a big part of the role of, like, a, you know, applied statistician or data scientist was just to figure out how do you get all that data into, like, one nice clean representation. And I remember kind of thinking at that time, like, this is just so tough in science. It must be so nice to, like, work in an industry job where you can just, like, query, like, the single source of truth and just get it.
And clearly, that, like, was not the case fifteen years ago. But it does feel like we've gotten, like, much, much closer to that now, that in many organizations, there's, like, a decent chance that the data you need is kinda, like, nicely prepared somewhere. I would say one thing that really opened my eyes to dbt, that some of this reminds me of, is I didn't realize before getting into dbt how capable warehouses were of, like, even unpacking nested data. Like, for me, that was a big missing piece when people were like, I just use dbt on, like, my raw ingested data. Like, I couldn't really imagine, until I realized that it can, like, unnest JSON and, like, pluck things out, just how gnarly the data can be that, like, an analytics engineer using dbt can take and kind of, like, clean up and model. And then I also didn't realize the level of obsession that analytics engineers have over the, like, forms the data can take. Like, should we count things by day? Should we do, like, periodic snapshots, you know, where we're, like, capturing our data every day so you can quickly get it? Some of that stuff really caught me off guard. I have to admit, I was at the dbt 2019 holiday party, and, oh my, I had zero idea what dbt was. So we've come a long way. That's funny. That's when we first moved into our office in Spring Garden. It's a beautiful office. Back then, we invented this term analytics engineer, which I think is pretty descriptive, although I don't know that there's some great textbook description of it. Some people say that an analytics engineer is a pissed off data analyst. And, you know, as technology changes, the capabilities of these different roles change as well; as warehouses get more sophisticated, what analytics engineers can do increases.
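For readers outside the warehouse world, the JSON-unnesting mentioned a moment ago can be sketched in plain SQL. This toy uses Python's built-in sqlite3 module, whose bundled SQLite generally ships the JSON functions, as a stand-in for a warehouse's UNNEST or LATERAL FLATTEN; the table and fields here are hypothetical, not from any real pipeline.

```python
import json
import sqlite3

# Hypothetical raw landing table: one JSON blob per row, with nested line items,
# the kind of gnarly ingested data an analytics engineer might start from.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (payload TEXT)")
conn.execute(
    "INSERT INTO raw_orders VALUES (?)",
    (json.dumps({"order_id": 1,
                 "items": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}]}),),
)

# Unnest the items array into one row per line item, plucking fields out of
# the JSON with path expressions via SQLite's json_each table-valued function.
rows = conn.execute("""
    SELECT json_extract(payload, '$.order_id') AS order_id,
           json_extract(item.value, '$.sku')   AS sku,
           json_extract(item.value, '$.qty')   AS qty
    FROM raw_orders,
         json_each(payload, '$.items') AS item
""").fetchall()
print(rows)  # [(1, 'A', 2), (1, 'B', 1)]
```

Warehouse dialects spell this differently (Snowflake's LATERAL FLATTEN, BigQuery's UNNEST, DuckDB's unnest), but the shape of the query is the same.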
The way that I like to think about an analytics engineer is somebody who fundamentally enjoys the process of curation as opposed to exploration. So, like, I don't know, you can both tell me, but I think when I talk to data scientists, the thing that makes me different from them is that I want to make sure all the books on the shelf are in the right order so that when somebody comes up and tries to find one, they can find it. And the thing that a data scientist often wants to do is, like, read every page in a book and draw new conclusions from it. And, like, sometimes I do that too, but that's actually not the thing that I take the most pleasure in. Yeah. It's really interesting because, yeah, I do think, like, what a data scientist is, you know, has changed over the last fifteen years. Because I do believe, like, it is also the job of the data scientist to make sure the books are all, like, shelved correctly and you can find what you're looking for easily. I don't know how true that is, like, you know, as the field of data has, you know, exploded, the specialization of each role has obviously increased. But, yeah, I don't know. Like, I get equal pleasure in the curation and the exploration, like, tidying things up and then using that. But, yeah, I don't know how, like, common that is amongst other data scientists. Not to take us down a rabbit hole, we don't have to stay there, but I'm legitimately excited about the changes that AI makes in the world of the analytics engineer, and it gets to this distinction a little bit. You know, you can think of any large kind of enterprise data system as being some complex environment where you have to think about entropy. And so, you know, things are always breaking because the world is changing, and the data system is trying to describe the world, and so it has to change too.
And, you know, do you have enough people to keep up with that change and, you know, push back against the entropy and all this stuff? And so I think we've actually got a fighting chance with the, like, significant speed increase where you can create new pipelines, and also, agents are really excellent at creating guardrails around your pipelines to make sure that they know when things are broken and how to fix them and all this stuff. And then, you know, AI also increases the ability of, you know, analytics engineers are always thinking about their users, which oftentimes are people who wanna use the data in unpredictable ways and, you know, slice it and dice it, oftentimes in tools like Tableau or Power BI or Excel. And the ability of those users to traverse the datasets that analytics engineers create is, like, significantly higher than ever. It might be worth even circling back and, like, defining a little bit what an analytics engineer is to you and kinda what they do. I think we kinda gestured at it talking about dbt, but I find it such a neat role. It might be worth almost just really trying to lay it out for folks. Sure. So in the version of the world that I have in my head, there are really four main roles that are kind of on a continuum. And this is, again, like, in my kind of enterprise data systems world and less in the, like, more experiment driven data science world. The spectrum kinda goes from, at the leftmost, the most technical, the deepest in the stack, the data platform engineer. And then you move to the data engineer, and then you move to the analytics engineer, and then you move to the data analyst. And so the data platform engineer kind of builds and maintains the infrastructure that everything else is built on.
Then the data engineer builds pipelines, but these pipelines might be the most technical pipelines. Maybe they require a high degree of focus on performance or uptime or something like that. But their primary focuses are technical and not business facing. And then the analytics engineer is also building pipelines, but they're starting to build the pipelines that, like, really contain a lot of semantic meaning. My favorite example of this is I, one time, had to build pipelines for, I think they were essentially, like, an Instacart competitor, and they needed to calculate the cost of goods sold for shipments that they sent out. And it turns out that, like, when you ask for a bundle of green onions, calculating the cost of goods sold on top of a bundle of green onions is, like, very challenging. And in order to build that data pipeline, you actually have to be close enough with your business counterparts that you, like, understand all these rules in your brain. And data engineers, like, honestly, if you make them think about cost of goods sold for green onions too much, they're gonna quit and go get another job. So analytics engineers are this hybrid role where you have a lot of business context, and you also have enough technical context to be dangerous. Then you go to the data analyst, but they are often not centralized. They are often embedded in local teams, and they go to stand ups with, you know, the marketing team or the finance team or whatever. They, like, live and breathe every day inside of that business unit. Do you think it's fair to say, like, the pipelines that the analytics engineer creates are, like, primarily, like, internally focused? Like, you know, sales and marketing look at them, but not the customers of your business. I think that is often true. Yeah. That is more often true than not.
Every once in a while, we will find folks that break that rule, but I think that is typically true. Yeah. I'm curious. So, like, ten years ago, as you were sort of building out dbt and some of the big ideas behind this role of analytics engineer, I'm curious, like, how did you see the shape of an analytics engineer? Yeah. That's an interesting question. The big sociotechnical problem that I saw ten years ago was that you had a bunch of cloud, kind of early adopter, tech company type organizations that were moving their data systems to the cloud. And for the first time, they were hiring data teams. And for those data teams, you had to generally choose an archetype of human that you wanted to be on your data team. And a lot of times, the ethos back then was that you hired very quantitative people for your data teams. And this was, you know, in the heyday of the term data scientist, and, you know, everybody wanted to, you know, interview all data hires based on, like, you know, helping you choose algorithms and, you know, all of this stuff. Whereas then the data scientists you did hire would, you know, build Airflow pipelines and say, like, WTF is going on here. So, you know, a lot of the people that I ran into in the industry back then had basically no experience at all building production software systems. But they were, like, quantitatively very strong. And so they ended up being really good at helping companies ask and answer really sophisticated questions. But then twelve to twenty four months into their tenure, a lot of times stuff started getting bad. You know, like, unscalable processes, you know, infrastructure that didn't scale, all of this. And it was because these folks, you know, had master's degrees in econometrics as opposed to, you know, ten years of software experience.
And so what we were trying to do with the role was not say, like, hey, you idiots, go learn Scala. We were like, look, let's find a way to bridge these communities, the, like, software engineering and data engineering world with this, like, more quantitative world. And, honestly, like, the first big experiment where we saw this happen was with the mattress company Casper back in twenty sixteen. The team was headed by a gentleman named Scott Breidenother. And Scott's background before this was, like, he was an econ, like, an economist consultant or something like that. I can't remember. Some, like, boutique consulting firm where he did, like, economic data. But he had, like, never run a data infrastructure before. They, like, got ahold of dbt, and he trained his team of a dozen data people to use it, and, like, that's how they kinda scaled their data infrastructure. Is that a time when dbt was kinda, like, the thing you were using as you were consulting? Yes. This was the early days. Yeah. That's fun. That's a fun case of, like, building something in open source and seeing someone kinda, like, put it to the test. That was the moment where, I mean, we thought that we were building an internal productivity tool, and that was the moment where we were like, oh, maybe other people want to use this also. Yeah. Yeah. It's so cool. One thing I'm curious about too, from the time, is I heard you mention, I think, in one of your podcasts with Benn Stancil, something like, in twenty fifteen, nobody wanted to put SQL on their resume. Yeah. Like, deeply unsexy. Could you say a little bit about that? Because I'm so curious. Like, I feel like dbt's brought SQL into the mainstream. There were these two different worlds in data back in twenty fifteen. There was kind of the new stuff, and there was the old stuff. And the old stuff felt really clunky.
This is like Teradata and Hadoop or No. No. I would put Hadoop Hadoop was the new stuff. Yeah. Yeah. But, like, all of the enterprise stuff was pretty deeply unappealing because it honestly hadn't moved that much in the past decade. And then you had the Hadoop world, which, you know, in twenty fifteen, Spark was really pretty nascent and, you know, really just shooting up the charts in terms of, like, number of stars on GitHub. And, like, people with those skills were getting slurped up by, you know, FAANG companies. And so what you wanted on your resume as a data person was, you know, Pig and Hive and Spark and Impala and, you know, all of this. And if somebody told you that they spent most of their time in SQL, that probably meant they were a part of this old world; they probably were, like, you know, living deep in the belly of, like, some, you know, fifty thousand person enterprise doing nothing that interesting or new. But there were, like, two things that I think changed there. One is that, like, you were talking about before, cloud data warehouses became super powerful and prevalent. Then the second thing is that SQL itself became much more expressive. And so whereas previously, SQL was a language where, you know, basically, you could do some simple aggregation, by the time that I was, you know, beginning to do this work in pure SQL, we actually couldn't find a use case for a data transformation that we couldn't express in SQL. The most complex data transformation that I needed to express in SQL was outlier detection on time series data, where I was detecting lift from TV ads. And that was a little hairy, but I was able to use window functions to do it, and it worked great. I'm so glad we're getting here because this gets into, like, the thing that I am most interested in these days.
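The window-function approach Tristan describes can be illustrated with a minimal sketch. This is not his actual pipeline, just a crude trailing-average lift detector over made-up daily data, run through Python's built-in sqlite3, which supports standard SQL window functions; table name and threshold are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_visits (day INTEGER, visits REAL)")
data = [(d, 100.0) for d in range(1, 11)]
data[6] = (7, 300.0)  # a spike, as if a TV ad aired on day 7
conn.executemany("INSERT INTO daily_visits VALUES (?, ?)", data)

# Flag days whose visits far exceed the trailing three-day average,
# using only window functions: AVG ... OVER with a preceding-rows frame.
rows = conn.execute("""
    WITH w AS (
        SELECT day, visits,
               AVG(visits) OVER (
                   ORDER BY day
                   ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING
               ) AS baseline
        FROM daily_visits
    )
    SELECT day FROM w
    WHERE baseline IS NOT NULL AND visits > 1.5 * baseline
""").fetchall()
print(rows)  # [(7,)]
```

The same query runs essentially unchanged on any modern warehouse, which is the point of the anecdote: what once needed imperative code now fits in declarative SQL.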
So my favorite part of the R ecosystem is dplyr, which is, like, I think, a similar thing, you know, taking the ergonomics of R and allowing it to compile down into SQL. I think we need more delineation between the ergonomics of how you wanna express your logic and the execution environment it's executed in. And so one of the things that we are spending a lot of time on right now is, dbt has a new engine powering it called Fusion, and Fusion starts from not just AST parsing but actually, like, full logical plan creation. And what you can do when you can actually go all the way down to the logical plan emulation layer is you can then reconstruct the actual symbols in the SQL in another engine. And so all of a sudden, and we're not fully there yet, but, like, the technology is a part of Fusion, we will soon be able to say, okay, SQL that you've written for one engine, we can just, like, cross compile and provably correctly run it on another engine. But then that's not that far away from saying, like, also, I wanna have a Python front end or an R front end, and it all is just a part of the same, like, you know, logical plan. Yeah. Yeah. And I think the thing that's really interesting too, coming back to AI again, like, that is an area where, like, yes, you can take a, you know, an R script or a SQL script and give it to an LLM and say, translate it. And, you know, ninety percent of the time, it does a good job. But the thing that I've also found really interesting, I've been working on dbplyr lately. That's the dplyr back end that translates to SQL. Like, that's also a really good application for LLMs, because it can generate so much of that translation code. But now it's deterministic. Like, the LLM isn't doing the translation. It's generating that code. And then I can unit test it.
My velocity there, like, increases so much, and I know it's correct because there's all the unit tests. There's no, like, stochasticity from the LLM anymore. And that just seems, like, really cool. Yeah. A hundred percent. I think that, like, there are these, like, weird fault lines within any kind of software community that divide on language, and that's just because of, like, the way that our brains interface with language. Like, it, like, takes mental space to learn a language, to, like, keep it in your brain and all this stuff. And so we've got this stupid thing where, like, people are Python users or R users or SQL users or whatever. Like, it's all just, like, different ergonomic ways to express the same ideas, and I love the idea that we're getting closer to a place where, like, we've got a universal Babelfish. It is. I will say too, like, I worked on, like, a little bit of porting dplyr and dbplyr to Python and all that translation stuff. But what's so interesting to me now, in, like, twenty twenty six, is that, similar to, like, unnesting JSON, I've been amazed at, like, how expressive SQL is in a lot of databases, like DuckDB. I used to bring up the example, like, six, seven years ago of, like, why dbplyr? And it was like, well, if I wanna select every column except one, like, it's a huge nightmare in most databases. But now, like, you have DuckDB with, like, EXCLUDE or EXCEPT, I always forget which; they have, like, all these ways to select things and operate. So it's interesting, to your point, of, like, Fusion being able to translate, like, down to the AST level. It's funny. Like, seven years ago, I feel like I would have wanted, like, a very dataframe-like Python tool translating to SQL.
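The dbplyr pattern Hadley describes can be sketched very minimally: an LLM might help write translation functions like these, but the translation itself stays deterministic code that ordinary unit tests can pin down. The function names and the dplyr-flavored verbs here are hypothetical illustrations, not dbplyr's real API.

```python
# A toy expression-to-SQL compiler: each helper deterministically emits a
# SQL fragment, and the "summarise" verb assembles them into a query string.

def mean(col: str) -> str:
    """Translate a mean() call into its SQL aggregate."""
    return f"AVG({col})"

def n() -> str:
    """Translate a row-count call into its SQL aggregate."""
    return "COUNT(*)"

def summarise(table: str, **aggs: str) -> str:
    """Build a SELECT with one aliased aggregate per keyword argument."""
    select_list = ", ".join(f"{expr} AS {name}" for name, expr in aggs.items())
    return f"SELECT {select_list} FROM {table}"

sql = summarise("flights", avg_delay=mean("dep_delay"), flights=n())
print(sql)  # SELECT AVG(dep_delay) AS avg_delay, COUNT(*) AS flights FROM flights

# Because the output is deterministic, plain assertions work as unit tests;
# no LLM sits in the translation path at runtime.
assert sql == "SELECT AVG(dep_delay) AS avg_delay, COUNT(*) AS flights FROM flights"
```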
But now I could see, like, something like you described sitting a lot closer to SQL, like an R wrapper around a more expressive dialect. To me, dbt started out a little bit like Rails, where SQL is HTML, and nobody really wants to sit there and hand write HTML. That's, like, not an efficient use of time. So when you use something like Rails, you kind of go up a level of abstraction. And, similarly, that same thing happens with dbt. You know, long before the engines themselves started doing things like exclude or except for the select list, like, you could implement that as a function in dbt, and then, using the macro capabilities, you could just have it. As we continue up the layers of abstraction, it just enables people to forget the implementation details. So, like, here's a thing that goes on right now. There are companies that have spent, like, literally millions of human hours writing Spark pipelines, and they also, in different teams, have spent millions of human hours writing stored procedures. And these things fundamentally do the same stuff. Like, they are not different. They just, for whatever reason, have been built in different technologies, and so these companies maintain separate data infrastructures for these two things. And then at some kind of final stage, they kind of make them all available to each other. But that's not that sensible. And if you could just say, like, well, you've got a translation layer that can, you know, read and operate on these things regardless of how they were originally expressed, then all those walls kinda fall away. It also sort of echoes the story of, like, Hadoop to Spark to now, like, I think, yeah, people just write SQL. Right? Like, in the early days, you know, SQL databases couldn't handle the level of, like, they didn't know how to split up jobs across, you know, hundreds of machines.
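The "select everything except" idea can be sketched as a tiny macro-style helper that expands to explicit SQL before the engine ever sees the query, much as dbt's Jinja macros expand to plain SQL; the table and column names here are hypothetical.

```python
# A sketch of a star-except macro: instead of relying on the engine to
# support EXCLUDE/EXCEPT, expand the select list yourself at compile time.

def star_except(columns: list[str], exclude: set[str]) -> str:
    """Expand to an explicit select list, minus the excluded columns."""
    kept = [c for c in columns if c not in exclude]
    return ", ".join(kept)

# Pretend these came from the warehouse's information schema.
cols = ["id", "email", "created_at", "pii_ssn"]
query = f"SELECT {star_except(cols, {'pii_ssn'})} FROM customers"
print(query)  # SELECT id, email, created_at FROM customers
```

Because the expansion happens before execution, the generated SQL runs on any engine, old or new, which is the abstraction point being made about dbt macros.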
And you were kind of forced as, like, a data scientist or data engineer to do this yourself and, like, explicitly manage all that computation. And, like, I'm sure some people enjoy doing that, but for most people, it was just, like, something in the way of doing your actual job. And, like, over time, you know, as the databases have become more capable, like, all of that just gets swept away into the background. And I'm sure, of course, like, the people who were so good at writing those Hadoop jobs originally could, like, shard them more efficiently and do all sorts of things to make them, like, faster than, like, whatever the database is doing. Maybe. Maybe the optimizers are so much better than humans now. But, like, that's just sort of all gone away, and, you know, you can now, once again, operate at this very high level. If I'm just gonna write a SQL query, I'm not gonna care about the fact that, like, behind the scenes, this job's gonna get split up into a thousand different little subjobs and sent all over the place, and the results are aggregated. It's pretty cool. Yeah. For sure. I think one maybe, like, little tangent that might be a nice change of pace is, like, I do feel like, too, dbt has this incredible community. Like, I know we've talked a lot about, like, dbt, the tool, and, you know, Fusion, but one thing that struck me is, like, the dbt Slack is so hopping, and there's, like, so much going on. Yeah. I'm really curious to hear, like, what do you think went into kinda creating such a nice community? I'm not trying to, like, butter you up, but dbt Slack's so happening, and Coalesce is so bumping. I'm curious, like, is it just, like, analytics engineers are wild people? Or, like, what do you think makes the dbt community so nice? So I actually don't know that much about the R community.
Is there, like, a place that people gather to talk about best practices and stuff like that, or is it just so widely used that there's a million different separated communities? Is there a place where R users get together in person? So I think there was, like, one place online, which was Twitter. Like, that's where people shared knowledge. The R community has largely abandoned Twitter for fairly obvious reasons, and there's, like, less of a central online place now. But apart from that, there are quite a few regional conferences. You know, Posit has a conference. But these are all conferences on the order of, like, you know, hundreds to maybe fifteen hundred, two thousand people, scattered all over the place. So there's not, like, one central R event in person. Yeah. And I will say there's, like, a big hex sticker crowd. Like, in R, people love hex stickers for packages, and there's a real frenzy to pick them up at conferences when they get dropped. Oh, neat. Yeah. So, like, if you create a package in R, to be a real package, it has to have, like, a sticker associated with it. And then at conferences, you're kinda, like, trading. You trade stickers with people. And I think that's one of the things that's pretty unique about the R community. I love that. Would you be offended if we started doing stickers for package maintainers? Okay. And that's a neat thing. The other thing I've like, I've used stickers internally to get stuff done. Like, it's amazing what people will do for you if you offer them, like, a limited edition sticker. That's one of my favorites. It's hard.
You know, for probably anybody who has kind of been the seed crystal for a reasonably large community, it's kind of an emergent phenomenon, and you never know exactly what the things are that made it happen. I would say that the biggest trait, the one we were talking about before with analytics engineers, is that they often previously were data analysts, and they were leveling up in their careers. And oftentimes, the most common emotion as they went through that was a sense of overwhelm and a sense of impostor syndrome. And so many technical communities have a very like, the people who run the communities are highly technical, and there's a sense of, like, RTFM. Like, don't ask this question until you've researched to the ends of the universe, and only then bother me. And we just acknowledged the fact that, for many of the people that were starting to use dbt in twenty sixteen through twenty twenty, this stuff just felt overwhelming, and they just needed some support along the way. And so we kind of seeded the community to be helpful and supportive and friendly, and we were very serious about moderating out any behaviors that conflicted with that. And that kind of creates a virtuous cycle: when people feel like they've been supported in their journey, they wanna then support other people in their journeys. It is really interesting as the community matures. I mean, the usage of dbt is still growing something like fifty percent year over year in terms of actual usage of the product. But that used to be, like, tripling every year, and so that means that, on average, any given community member's time in the product is longer. And so the ways that conversations happen, and where they happen, and all this stuff is different than it was five years ago. Yeah.
It's like I think there are interesting parallels with the R community, because, like, ten or fifteen years ago, R was, like, for people with PhDs in statistics, by people with PhDs in statistics. And you'd go on the R-help mailing list and ask a stupid question, and someone would tell you what an a******* idiot you are, basically. And so, like, that was something I wanted to do the opposite of, basically. And, you know, as we observed these technology transitions from, like, mailing lists to Stack Overflow, from Stack Overflow to Twitter, at every point we had this opportunity to kind of reinvent the community a little bit and move towards, like, a more friendly and welcoming environment. And I think that was just a tremendous net benefit to the R community as a whole. And I think we're also lucky because the R community tends to be more diverse, because there's people coming from all branches of science. I mean, you know, diverse both in their backgrounds and the applications. And cultivating that feeling of being welcomed and, as you said, that virtuous cycle, like, I felt really welcomed as a community member, and now I'm gonna pay that forward and welcome the next generation of people, it really led to a pretty remarkable transformation in the R community. Yeah. I was saying, like, very positive things about AI and how I'm hopeful for its impact on analytics engineers. But I am a little up in the air as far as AI's impact on community formation. So the funny thing about communities is that they not only help people get things done, but they also build social capital.
And so, you know, there were really meaningful social relationships built in the early days of the dbt community, when I was a super, super active member. But they happened as a part of the asking and answering of kind of boring technical questions. And now I would never ask those types of technical questions to a community in Slack. I would go to Claude or whatever, and it would give me immediate answers that were probably of as high or higher quality. And the other thing, you know, we still have to see how it plays out, is that even open source feels just a little bit less necessary now. Obviously, open source on the order of, like, R or dbt, or something obvious like Linux, that stuff's not going away. But there's also, like, this entire package ecosystem. You know, I spent a lot of time curating a package called dbt-utils, which was macros to do, like, useful utilities, and now you could just ask Claude, like, hey, make me a macro that does this thing, and maybe maybe the yeah. So I don't know. I worry about that stuff a little bit, but then I feel like a grumpy old man. But I do yeah. I feel similar. Like, the other thing I worry about is we're the incumbents. Right? We're the people who created the software that, if you ask Claude, it knows how to do. Like, that's all in the training data. But if you're, like, a young person like, the way I, you know, promoted ggplot2 and dplyr was, someone would ask a question on the Internet, and I'd be there. Like, hey, I'm gonna both answer your question, and I'm gonna be friendly. And, like, yeah. So there's that, like, you're learning something new. You're not gonna get that from a chatbot. And there's that interpersonal relationship, which is also, like, gone now.
So, like, yeah, it's pretty clear that's gonna have profound implications for how these communities form and what people get out of them. I mean, maybe it frees us to, you know, focus more on communities for the sake of community, not just to solve your R or SQL programming problems. But I don't know. Like, yeah, I worry. I worry about that loss of connection. Yeah. I wonder how much like, I remember searching a lot too and really appreciating finding a blog post, or finding out someone kinda did a dive into what I was looking for. And I feel like I remember some of those to this day. Like, Hilary Parker, one person in the R community way back, wrote about making an R package, and somehow, even though it was over a decade ago, it's, like, burned in my mind. I do wonder whether people will still feel as encouraged to I mean, maybe they'll write up just as much, but I know dbt also really distinguished itself through just so many great blog posts and deep dives. Yeah. There are posts that we you know, there was this post we wrote in, like, twenty seventeen or twenty eighteen, which was, like, how we configure Snowflake for our clients with dbt. And that was used to configure so many Snowflake instances. You know? And certainly, you can still write that type of content as much as you ever could, but I think the economic incentives for it are changing very quickly. Which, in some ways, is not entirely bad. Like, that's also what led to all of this content farming, just creating tons of pretty useless content, so whenever you search for something, you know, you get someone selling ads to you.
Like, I'm not sad to see that go, but, like, the blogs that people poured so much heart and soul into. And then I think the other thing I'm nostalgic for is, you'd read someone's really cool technical blog, you're like, oh, that's awesome, and then you go and follow them on Twitter. And, as well as the technical side, you also get, like, some snapshot of their personality and the other stuff that's interesting. With the R ecosystem I could you know, we didn't have to get into it, but, like, I could tell you what their collective thoughts were on elections, U.S. elections. Because all of this stuff bled together once you followed somebody on Twitter. Yeah. For better or worse. Yeah. Yeah. Right. Did everyone go to Bluesky? Is that where the community is now, or is it somewhere else? I mean, not everyone, but I think it feels like there's enough of a community, there's a strong enough nucleus there, that you can go and interact, and people interact back, and, like, you know, it's enjoyable in sort of the same ways as early Twitter. Like, you know, I tried Mastodon, and it never got the same. And I'm on LinkedIn, which I kind of hate everything about, but people use it, and, like, I've gotten better, you know, feedback there than other places. LinkedIn was not on my bingo card as the number one social site that I use, and I'm still troubled that that's the case. But I do feel like somehow it's so tame that I'm like, this is fine as a social place. But I do miss, like, reading posts by, like, an account that claims to be a raccoon digging through garbage, or, like But actually, yeah, I'm getting that from, like, Bluesky now.
Like, just, like, weird personas where you're like, this is clearly such a totally different person from me, and I get to experience a little bit of that. I'm curious if either of you have fun personal projects going on right now in data. I feel like with coding agents, I have been doing more personal projects than ever. My current thing is that I'm trying to create my own iOS app that pulls data out of HealthKit, using the, like, highly hard-to-access SDK, so that I can get my health data into a Postgres database and screw around with it. I love this. I just have to say, like, four years ago this sentence would have sounded insane as a project, but somehow this project makes so much sense to me now. Like, even if you've never done an iOS app, you could have a good time. Make an iOS app. Yeah. I mean, I also did an iOS app. But it's just a talk timer for, like, if you're giving a talk at a conference or you're chairing a stage. Like, I've always had in my head this kinda, like, platonic ideal of what I want from a timer. And I've looked at and tried so many different apps, and they're always, like, bloated or ugly or full of ads. I'm like, oh, wow, actually, I can just create this now. And I did. And it was, like, so fulfilling to create this thing in Swift, which I'd never used before. And, like, it works, and I like it. And yeah. So I haven't really tackled many, like, data projects, but other little software projects. Yeah. It's definitely a fun time to be a software engineer. I think I mentioned this last, maybe the last episode we had, but I've been doing a lot more Cantonese studies. So my dad speaks Cantonese, and it's a tough language because it's not written traditionally. So, like, it's rare to have, like, transcribed Cantonese.
It's, like, only spoken, and then people learn to write and read Mandarin. But it's so easy today to get, like, Whisper to transcribe Cantonese videos that it's kind of mind blowing to have Whisper transcribing material, and then agents are really good at speaking or writing, like, writing out the Cantonese. And so it's been really nice to be able to generate sentences and have, like, a tutor that can take what vocab I've studied and kinda, like, remix it. But I will say the nicest thing about this activity for me has been like, right now, I have so much focus on using things like Claude Code to generate. I find language study has been really nice for almost, like, getting back in touch with picking up a skill and, kinda, like, fluency. Like, that's felt good in the way that, like, coding fast felt good before too. Like, just being able to produce words and read and understand somehow, like, feels nice. Yeah. It's been an interesting kinda switch up. Wait. How far along is this iOS app? Like, did you run into any hitches, or did it get off the ground pretty easy? What was the It's gone okay so far. It's Is this Claude Code? Or what's the Yeah. I'm using Claude Code. Nice. It's not complete yet. I have a late night tonight. Yeah. Actually, as we're recording, I have Claude working in the background. Right? Nice. I feel that constantly. You're like, I might as well have you working while I'm doing other stuff. Last week, I gave a talk, like, an internal talk about using Claude Code. And, like, someone during, in the Zoom, like, polled the audience of, like, how many people are watching this talk while Claude Code is doing something in the background. And it was, like, a big percentage of the people watching. So yeah. Jeez. I feel like, yeah, pre meeting, I'm always trying to get something teed up. Yeah.
I actually realized like, this is definitely toxic, but I realized, for a while, I was thinking of meetings as not being, like, real work. Like, whenever I was in a meeting, it didn't count as work. And, like, one of the reasons that I found Claude Code so appealing is it made the meeting feel like it was work, because Claude Code was working for me in the background. Tristan, have you fired Claude Code at, like, dbt projects? Like, have you had any reckoning moments where you just turn it loose on dbt? Like, what's that been like? I mean, it's shockingly good. As it is with everything else. As you mentioned before, Hadley, there's enough dbt code in the training data that it just knows how to do that. And now we built an MCP server, shipped that in, I think, April of last year. That has seen very rapid adoption, and that now allows Claude to pretty straightforwardly execute stuff and also, like, validate its own code. So, yeah, it's good. We did, I think it was maybe two months ago or something, where we were able to go from zero to a pretty sophisticated project in the space of one hour. And so it felt like kind of a moment. It does make you think. Like, I really like the Jaffle Shop DuckDB demo, just to fire up and kinda walk people through. But it would be wild to have some scenario to just show people Claude Code, like, chewing through, producing kind of the whole project? Or We have, like, an internal thing we call Demobot, which our sales team uses. So when they, you know, go to talk to a customer, instead of just showing them, like, oh, here's a generic demo on New York bicycle share data.
We have Demobot, which is, like, a very simple Claude script that's just: create a sample dataset for this industry, make a dashboard, make an API, make a report. And even though, you know, the data is completely made up, it's so much more compelling to see something related to your industry. People really like stuff that's customized just for them, and it's easier now than ever to do that. Hundred percent. I love that. I'm curious I know you had some AI predictions about BI and analytics engineering that you wrote about last year, or analysts working in BI tools. I'm really curious: in twenty twenty six, do you have anything you're keeping an eye on? Do you wanna make some, like, famously bad predictions for us? So, like, in a year's time, it can be, like, Tristan only thought we needed one LLM for the entire United States. I think that this is the year that Apache Iceberg goes from a topic of conversation at CIO dinners to actually implemented in the wild. I think that AI is going to be layered on top of things that we already do. It is not going to be the death of dashboards or, you know, any such catastrophic things. Yeah. Those are the two things that I mostly expect from this year. I do think that agents everyone's saying this is the year of agents. I think that the reason agents have come for coding first and best is that, one, they're developed by software engineers, and so it's easy to automate your own work. But two, a lot of times, software is the least stateful thing, and so it's actually easier to kinda dummy up data and still do real work and that kind of stuff. And so it always takes a little bit longer in the world of data, because state is just harder.
But I think we're starting to resolve some of the permission stuff, all of these types of things that allow agents to safely get at the, like, massive repositories of structured data that companies have. I think I don't know if this is related, but I think I heard, in one of your recent interviews, you talking about what agents should be able to view, and, as we roll agents out, say, onto a data warehouse or a data lake, that challenge of what they should have access to. Is that what you're talking about? Yeah. I think that, you know, there's a lot of companies that make tools, whether you're talking about Salesforce or Workday or whatever, like, kind of purpose-built applications. And most of these companies want you to build agents within the context of that piece of software. And you can do that, and certainly, if you do it there, there's certain advantages. Probably your agent can have a lot more context around what it's operating on, and maybe it can also take action as well. But there's real advantages to building agents in kind of a more horizontal, generic way on top of your data lake. Because then they can access any data, not just the data in that one application, and they're a lot more flexible, etcetera. But when you're gonna put an agent on top of a data lake, well, you have to think about an agent just like a human. You're not just gonna say, like, go have at it. Like, you know how to read Parquet. Do whatever. So you need to make sure that you give it access to data in the same way that you would give a human employee access to certain specific data and not other data. And I think we're just starting to get there in terms of how to think about doing that.
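The "give the agent an auth profile like an employee's" idea can be sketched very simply. This is a toy illustration, not any vendor's API: the names here (`Principal`, `authorize`, the table names) are made up for the example, and real systems would layer this over the lake's own catalog and grant model.

```python
from dataclasses import dataclass, field

@dataclass
class Principal:
    """A human or agent identity with a scoped table allow-list."""
    name: str
    allowed_tables: set[str] = field(default_factory=set)

def authorize(principal: Principal, tables_needed: set[str]) -> bool:
    """Permit a query only if every table it touches is in the allow-list,
    the same check you'd apply to a human employee's credentials."""
    return tables_needed <= principal.allowed_tables

# A finance agent can read billing tables but not HR tables.
finance_agent = Principal("finance-agent", {"invoices", "payments"})
print(authorize(finance_agent, {"invoices"}))                # True
print(authorize(finance_agent, {"invoices", "salaries"}))    # False
```

The design point from the conversation is just that the agent is a principal like any other, rather than a superuser that "knows how to read Parquet", so the existing access-control machinery can apply to it unchanged.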
It's so interesting to think about, whether it's, like, someone coding in, like, a BI tool or Python and accessing data, or someone, like, even in a spreadsheet, or firing off, like, a Slack message that then kicks off some kinda research or job. It seems like there's a lot of interfaces into this kind of stuff. People have a lot of expectations of agents. They think about them as, like, an automated version of a human. Whereas previously, we would have a service token, and that service token could, like, control things within one certain application. But in an agent workflow, now you expect it to interact with, you know, four to ten different tools. And so all of a sudden, it's got an auth profile that looks kinda like mine as an employee. You've gotta map it to, like, a bunch of different applications, so it's not trivial. Yeah. It's a tough one. Hadley, do you have any AI thoughts as we close out? Predictions? We could get a whole round of predictions, you know, in here. I don't know. It just feels so hard to make predictions right now. I think we're gonna see a lot of change. Like, some of it we can anticipate; some of it's just gonna be surprising, like, second-order effects. So, to me, when I think about AI, it's all about, like, being nimble, like, continuing to experiment and try stuff out, and accepting that, like, whatever I believe today might be wrong in six months' time. But still, compared to six months ago, I don't know, I feel, like, more optimistic, I guess. Like, I still feel like software engineering is valuable and useful, and that so many of the skills we know still continue to be useful. And now I'm starting to think, well, what does this mean for, like, data scientists?
Like, what are the skills that you need to apply, even though you're maybe no longer, like, handwriting all of the code? So I have no spicy predictions. But Interesting. Maybe just quickly, I know you mentioned a fun fact is that you're working on becoming an edible gardener. Is that right? Yeah, I am How's gardener life? Well, I mean, it's very cold right now, so there's not a lot going on there. It's, like, waiting with dreams of spring. Is that? Yes. I now have six fruit trees planted. I planted them mid-season last year. This coming year will be the first growing season. I have a big deer fence around everything. I built a bunch of beds. So, you know, in a year, if we talk again, I will tell you if I was successful or not. Is this also kinda, like, your backup plan? Like, if AI does take your job, at least you can, like, still eat bushels of fruit. I know that you're partially kidding, but I do the more that I go down the road of, you know, AI and everything that's happening right now, there's, like, a digital dysphoria that makes me more and more want to get my fingernails dirty. And so this scratches that itch. Yeah. I'm, like, partially kidding and partially not, I guess. Like, I am and I think we're gonna like, me and my husband are gonna take, like, a welding course later this year Oh, cool. Because that's, like, a fun Legit. Like, I've looked at doing some metalworking stuff, but you need a lot of equipment, or, like, access to somebody else's workshop. I couldn't figure out how to make that happen. Yeah. We just discovered there's, like, a cool maker space that's pretty close to us that does a bunch of stuff like that. So Do you have any welding dreams? Like, what's your what are you gonna No. I have, like, no I have no need to weld. I don't even know what to do with these skills, but it just seems like a fun thing to learn. So That's a fun one.
I feel like I'm excited for the day where, Tristan, you're, like, welding the deer fences. You know? He's like Yeah. If I need anything, you can come out here. I'll shoot you a text. You know where to go. We're gonna be welding things to protect us from the clankers that are coming. Tristan, thanks so much for coming on. I mean, honestly, I think it's a dream to be able to talk about dbt in this space. And, like you mentioned, it is kinda like two worlds, and I think it's been so helpful to hear about the, like, similarities and differences between these worlds. And I'm just such a big fan of dbt and all the work y'all are doing, so really appreciate you coming on in. Thanks so much. It's been a lot of fun. The Test Set is a production of Posit PBC, an open source and enterprise tooling data science software company. This episode was produced in collaboration with creative studio Adjy. For more episodes, visit thetestset.co or find us on your favorite podcast platform.