
What's New In Data
A podcast by Striim (pronounced 'Stream') that covers the latest trends and news in data, cloud computing, data streaming, and analytics.
How Jacopo Tagliabue is Cutting Data Pipeline Latency with Fast Functions
What if your data pipeline could run 10x faster without the overhead? Jacopo Tagliabue, CTO of Bauplan and NYU adjunct professor, is pushing the boundaries of data infrastructure with lightweight DAGs, Apache Arrow, and a radically different take on functions as a service. In this episode, he breaks down the tech stack behind Bauplan and why the future of scalable data pipelines is all about speed, modularity, and zero-copy design.
What's New In Data is a data thought leadership series hosted by John Kutay, who leads data and products at Striim. What's New In Data hosts industry practitioners to discuss the latest trends, common patterns for real-world data architectures, and analytics success stories.
Hey everybody, thank you for tuning in to today's episode of What's New in Data. I'm super excited about our guest. We have Jacopo Tagliabue. Jacopo, how are you doing today?

Very well, thanks so much for having me. Thanks so much, everybody, for tuning in. And, you know, 10 out of 10 on the pronunciation of my name, so amazing work there.

Of course. Names are important, especially the pronunciation. And you just launched a product — a really exciting announcement. As of the week of recording this, you announced your $7.5 million seed round, and there's a lot of excitement about what you're working on. But first, Jacopo, tell the listeners about yourself.

Yeah, so I'm Jacopo, the CTO and co-founder of Bauplan. I'm also an adjunct professor of machine learning systems at New York University, which is mostly notable because it's the only job I've ever had that my parents understand. And I've been building data and AI systems for the best part of the last 10 years, at every possible scale — literally from garage scale to IPO scale and everything in between — which makes me an expert, because I've made all the possible mistakes that somebody can make along the way.

Yeah, and that's really how you learn. I always say, to own something is to break it first and then put it back together. The work you're doing is pretty exciting. You also published a great paper, and I wanted to get into some of the topics there. First, it would be great to dive into some of the conceptual ideas that would be useful for the listeners, the first being fast functions as a service. Can you break that down for us?

Yes, of course. So Bauplan is basically the combination of two ideas. One is functions as a service for data, and one is git for data, which should be more familiar, because everybody knows what git is. Functions as a service is a bigger paradigm of computation that can be traced back to AWS Lambda, which is an incredible service. It's had a lot of success and it's been replicated in other clouds, like Azure Functions, and there's a bunch of open source projects doing the same, like OpenWhisk and Knative, just to mention a few.

And what is the idea? Imagine the history of the cloud. Initially you have virtual machines, so somebody else provisions the hardware for you and you do the rest. And then you have containers, so somebody else provisions the hardware and the machine. Sorry, I have a gigantic cat that is throwing stuff down — that's the sound you hear.

Very, very cute. Great cameo from the cat, who obviously wanted to chime in on functions as a service.

Exactly. So with containers, somebody else provisions the virtual machine and you just do the thing on top — imagine stacking these layers. The last layer of abstraction is serverless functions: somebody provisions the hardware, the virtual machine, the container, and you just literally send a function, typically to AWS Lambda. And that is awesome for all the use cases in which you don't want to think about the underlying infrastructure, maintenance, scalability and all of that. You just want to write your code. Which, at least as a user, I believe is a very powerful abstraction, because I want to deal with the business logic and with wrangling my data. I don't want to deal with the Kubernetes or containers kind of bullshit.
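To make the "you just send a function" contract concrete, here is a minimal AWS Lambda-style handler in Python. The platform owns provisioning and scaling; the event shape and logic here are invented for illustration.

```python
# Minimal serverless function: the platform invokes handler(event, context);
# you never touch the hardware, VM, or container underneath it.
import json

def handler(event, context):
    # event: the request payload; context: runtime metadata injected by the platform
    name = event.get("name", "world")
    return {"statusCode": 200, "body": json.dumps({"message": f"hello {name}"})}
```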
Yeah, and that can really slow teams down, right? When they have to think about the infrastructure, how to provision it, how to deploy it just in time for certain workloads, or continuously have it running. So I want to get into: how does this differentiate from something like a Lambda or Azure Functions?

Yeah, so — sorry, one sec — originally the use cases for Lambdas, which have been enormously successful, were microservices. Tiny responses, like hundreds of milliseconds, that you may want to scale horizontally. Let's say you have an endpoint that produces some analytics result, like a pixel for advertising. That would be a fantastic case for Lambdas. Actually, my previous company did that in 2018; we were probably among the first people on the planet to try it. It's an amazing use case: a hundred lines of code, AWS spins it up, gets the result back in a hundred milliseconds, and if you have more requests in parallel, they just scale it for you. That's amazing.

But it's exactly the opposite of what you need for data. Data functions are bulky. They run for minutes, sometimes for hours. And the typical data pipeline will not get you back something after one function; it's a combination of functions. They build a DAG — imagine a chain of functions. And so the question becomes how you pass data between one function and the next, more than how you get a JSON back from a hundred-millisecond request. So while the ergonomics of serverless Lambda and serverless Bauplan are similar — you build a function, we do everything else — the actual underlying system is completely different, which is why we had to build our own runtime to solve this problem: we couldn't reuse any of the existing solutions.

So Bauplan combines the programming model and the runtime, which is very unique. What is it that makes it specifically very good for data pipelines?

Yeah, it was kind of a weird choice. In the very beginning we started testing open source implementations of functions, like OpenWhisk — we know a bunch of people that actually had good fortune with that — Knative and all of that. But they were all limited in one way or the other, because the things they were good at were use cases we didn't care about, and the things we needed just weren't represented there. So we decided to build our own. We always say we built Bauplan because we couldn't add data to Lambdas, and on the other side, we weren't really able to add a runtime to an orchestrator. There are very famous, popular Python packages for doing DAGs — Airflow, Prefect, Dagster and all of that — but those are frameworks: they are a way in which you express your thoughts. They don't really run the compute for you, they don't optimize the compute for you.

So Bauplan is the combination of these two things. We co-designed a very lightweight framework for DAGs — you put a bunch of decorators on top of your functions and that's kind of it, so very lightweight compared to Airflow, for example — and then we designed the runtime that leverages this way of expressing things to run as fast as possible, as efficiently as possible, as cheaply as possible.
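A minimal sketch of the "lightweight DAG via decorators" idea: functions declare their parents through parameter names, and a tiny runner executes them in dependency order. The names (`node`, `run`) and the runner are invented for illustration — this is not Bauplan's actual API.

```python
# Decorator-first DAG sketch: one decorator per function, no orchestrator.
import inspect
import pandas as pd

_REGISTRY = {}

def node(func):
    # Registering is the whole "framework": parameters name the parent nodes.
    _REGISTRY[func.__name__] = func
    return func

def run(target, inputs):
    cache = dict(inputs)  # seed with source data
    def resolve(name):
        if name not in cache:
            fn = _REGISTRY[name]
            parents = inspect.signature(fn).parameters
            cache[name] = fn(**{p: resolve(p) for p in parents})
        return cache[name]
    return resolve(target)

@node
def cleaned(raw: pd.DataFrame) -> pd.DataFrame:
    return raw.dropna()

@node
def top_products(cleaned: pd.DataFrame) -> pd.DataFrame:
    return cleaned.groupby("product", as_index=False)["sales"].sum()

raw = pd.DataFrame({"product": ["a", "a", "b"], "sales": [1.0, 2.0, None]})
print(run("top_products", {"raw": raw}))
```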
And looking back, what was kind of a crazy idea 18 months ago now kind of makes sense, because working in Bauplan today, it wouldn't be as nice if we had decided not to do one of the two.

Yeah, because like you said, you have the compute engine with a really convenient Python programming model, right? And when you combine those two things, it definitely allows developers to move much faster, just having everything included.

Yeah. The operating design assumption of Bauplan is: if I can't explain this to my students at NYU, there's something wrong. You know what I mean? My students are great along many dimensions, but they're not senior software engineers. They know what a package is, but they may not know what Docker is. They know what a table is, but they definitely don't know what Apache Iceberg is. They know what an endpoint is, but they don't know what Apache Arrow Flight is. So the platform exposes all of this in a way that a student at NYU can understand, and it does all the hard work of making the infrastructure work behind the scenes.

That's very cool. You named some frameworks there — let's get into one of them. I know you're using Apache Arrow, which is a popular in-memory columnar format. It's used by Spark, it's used by pandas; many frameworks rely on it. How is Bauplan leveraging it?

So there are a few things there. The first thing is — remember what I said before, when you consider... sorry, the cat has a lot of things to say about Arrow. This is incredible; the cat didn't say anything the entire day before this.

Incredible timing.

Yeah. So the problem with pipelines is as follows. When you run a microservice, you activate a function, and that's independent of all the other functions. That's what makes microservices so great, right? But that's not what you really do with data. Data pipelines are a combination of functions. Imagine a machine learning model: you want to clean the data, you want to aggregate the data, you want to maybe produce three different datasets to then serve to an OpenAI endpoint, right? And this is a DAG. It means that the result of one function needs to go into the input of the function after it. If you do this naively — with XCom in Airflow, or with Lambdas — you need to go and serialize back to object storage all the time. You need to take these hundreds of millions of rows, potentially, pay the price of serialization, compression, whatever, write to S3, and then the next function needs to do the same process in reverse.

Now imagine a world in which we don't pay any of this. If we don't have to go to S3, because we don't actually need to save this intermediate result, we can co-locate the two functions, and Arrow allows you to do zero-copy between them. The parent function outputs a table, and then all the children can read it without copying the data, which is amazing. Imagine when you have 10 fan-out functions: the total amount of memory you use is one table instead of 10, which would be the naive case.

Or, if you are across machines, across hosts, these functions can communicate through Arrow Flight, which is the same Arrow format that we use in memory, but sent over the network. The good thing about that is that you're not going to pay the serialization cost, and since the functions are sitting in EC2, you get the bandwidth of the cloud. It takes less time to send a table over Arrow Flight between two machines co-located in AWS than to read the same file from disk in Parquet into the memory of a system. That's how much of a tax serialization actually is when you do this type of thing. So going back: imagine doing this for a DAG with 15 functions. The difference between the naive way and the Bauplan way may pile up to be 10x faster at the end of the day, because of these choices.
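Here is a small PyArrow illustration of the zero-copy point for the co-located case: a parent materializes its output once as an Arrow IPC file, and any number of children memory-map the same bytes without copying or deserializing them (Arrow Flight plays the analogous role across hosts). File name and data are made up.

```python
# Parent writes once; ten fan-out children share one physical copy.
import pyarrow as pa
import pyarrow.ipc as ipc

# Parent function: materialize its output table once, in Arrow IPC format.
table = pa.table({"user_id": [1, 2, 3], "spend": [9.5, 3.0, 7.25]})
with pa.OSFile("parent_output.arrow", "wb") as sink:
    with ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# Child functions: memory-map the file; read_all() on a memory map is
# zero-copy, so no serialization or duplication is paid per child.
for _ in range(10):
    with pa.memory_map("parent_output.arrow") as source:
        child_view = ipc.open_file(source).read_all()
        assert child_view.num_rows == 3
```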
Yeah, and that definitely makes sense, because if you have a DAG and all these dependent processing components, it doesn't make sense to write the data back to S3 between each intermediate step and then ingest it again and process it. So it sounds like you're efficiently using Apache Arrow — a very efficient, all-in-memory format — as the data transfer between each individual component in the DAG.

Yeah. We bet on a few open formats when we started this, the other one being Iceberg. The conceptual distinction is: in memory and over the wire, it's Apache Arrow; on disk — meaning, of course, object storage — it's basically Parquet and Apache Iceberg. And it's been a very good bet. If I look back 18 months, these two ecosystems were already growing, but not nearly as much as they are today. Now Apache Iceberg is the de facto standard for lakehouse formats — I would say at least the most popular of the three major ones — and Apache Arrow is definitely catching on in many more languages and even more databases. There are a lot of companies now building on Arrow as this standard format for interchanging data between systems. So those bets have paid off massively, especially for a small company like us — we can't reinvent everything. We have the spare parts manifesto that we published at the very beginning of this company, about how you can build a lakehouse with so few resources. And the answer is that you just have to be a bit clever in reusing stuff that is already there, and deciding what you actually want to build yourself.

There are a lot of great components out there — open source components, serialization formats, table formats, Parquet — and I think the industry has matured to the point where people know what works and what doesn't, just from a practical perspective. So it seems like you really did combine a lot of the best elements into a product that's, honestly, super ergonomic for data engineers and developers. You wrote a paper about Bauplan — "Zero-copy, scale-up FaaS for data pipelines." Great paper; I recommend reading it, and we'll have a link in the show notes for the listeners. But there's one point you mention there that I latched onto, which is that it feels local when you're developing with Bauplan. Tell me about that.
So there's this thing, the developer loop, that software engineers are used to: you have a small piece of code, you can test it locally. Imagine a Python developer running a pytest with some mock data — it's very fast. You can do TDD, you can do all sorts of fancy things, when the time between the thinking and the actual proof that things are working is very short, right?

Now take this to data. You may have, I dunno, a Spark cluster. You have to wait five minutes for it to be available. When there's an error, the stack trace is 35 lines of Scala that you may or may not know how to read — we've all been there with PySpark. And then, when things go well and you need to move to another version of pandas, now you have to redeploy the entire cluster, hope it doesn't break, and do all this again. So out of an hour of work, you code for maybe 25 minutes, and the rest of it is waiting for things to happen in the cloud — which I think has been the default cloud experience for data people in the last 10 years.

So, what can we do to compress this time to seconds, to be as close as possible to your local development feedback loop — or even faster, in the case of Bauplan? That has been a guiding principle for us: it needs to feel local. Meaning, for example, if there's a print statement in any function you write, you're gonna see it in your terminal, like it was a print statement in your normal code. And it needs to be very fast. Like adding a dependency to Bauplan: let's say you move from pandas 1.5 to pandas 2.2, and then you do bauplan run again — we are 15 times faster than Lambdas and seven times faster than Snowpark at that particular operation. Which, again, may not sound like a lot, but if you count how many times you install or try new packages when developing a pipeline, it piles up to quite some time. And we're very happy with the idea that people don't have to wait too long to find out whether what they did makes sense or not.
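The tight loop being contrasted with cluster round-trips looks like ordinary Python testing: a pipeline step that is just a function can be exercised with pytest and mock data in milliseconds. A generic example, not Bauplan-specific:

```python
# A pipeline step as a plain function: trivially testable with mock data.
import pandas as pd

def dedupe_orders(orders: pd.DataFrame) -> pd.DataFrame:
    # Keep only the latest row per order_id.
    return orders.sort_values("ts").drop_duplicates("order_id", keep="last")

def test_dedupe_orders():
    mock = pd.DataFrame({"order_id": [1, 1, 2], "ts": [1, 2, 1], "amount": [5, 7, 3]})
    out = dedupe_orders(mock)
    assert sorted(out["order_id"]) == [1, 2]
    assert out.loc[out["order_id"] == 1, "amount"].item() == 7  # latest wins
```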
Yeah. And you're seeing more data processing engines emphasize the local experience — DuckDB is very popular, there's Apache DataFusion — and there's definitely momentum behind it. It totally makes sense, because we might as well leverage the power of our laptops if we're doing some client-side filtering or ad hoc analysis of data. But it's also just more portable, and it means not all your compute and information retrieval has to go through some managed service in the cloud — though if you want that, at a certain scale, it's also useful. So it's really giving developers the best of both worlds.

Yeah, we're very grateful to the community — good friends with many people in DataFusion; we use our own forked version, and we have some blog posts on that for the engineering geeks — so we're very grateful for the work they put out. The choice that we made with Bauplan is: once you work with data at a certain scale — tens of millions, hundreds of millions of rows — the gigantic bottleneck you're going to hit is bandwidth, and it's very hard to beat S3 to EC2. So a lot of our work in making the cloud feel local actually results in a faster feedback loop than just spinning up activity on your machine, just because there's such a difference between your bandwidth and the one in the cloud; it's very hard to be competitive with that. But on the other side, we didn't want the cloud experience you get with EMR — yeah, there's a cluster somewhere, it's mostly me waiting for something to happen, and it kind of looks like a black box. Finding the right balance between these two forces has been probably one of the hardest things we did in the first year. It's not really just engineering; it sits between engineering, product, and API — the intersection of human-machine design and the actual tech. These are details that are surprisingly hard to get right.

Yeah. And since you brought that up, I just came across your blog post, "Blending DuckDB and Iceberg for Optimal Cloud OLAP" — lessons learned crafting a serverless lakehouse from spare parts, like you mentioned. I'd love to get into what you talk about there, and how you combined parts of DuckDB for the compute and Iceberg as your storage.

Yeah. So now — again, the community has evolved a lot in the last 18 months — there are a few extensions that have been proposed to run DuckDB on Iceberg, but when we started, none of that was a thing. I think DuckDB was at 0.6 or something like that, and we're at something like 1.2 now. But what is the philosophical idea of Bauplan? Again, Bauplan is functions as a service, and these functions do not exist until you ask for them — including SQL queries. One of the weirdest and most distinctive things about Bauplan is that you can open your terminal — if you have pip installed Bauplan before — and you can do bauplan query: select, I dunno, sex, count from the Titanic dataset, group by sex. You press a button, and it will actually run a SQL query, materialize and visualize the result in your CLI, like it's a warehouse.

But what is the magic trick? There's no warehouse, actually. There's nothing before you press enter. When you press enter, the SQL gets parsed by our custom parser. We understand that Titanic is one of your Iceberg catalog tables. We go from the Titanic Iceberg table to the underlying Parquet files of the snapshot that we need. We launch a function with DuckDB as an in-memory engine that processes this data and then sends you back the results over Arrow Flight. All of this happens on demand. Once Arrow Flight sends you the data, the function dies, because nothing in Bauplan exists for longer than the span it needs for compute — it needs to leave room for other functions to run. So even when you run a query in Bauplan, it feels like you have a warehouse, but what actually happens is there's an ephemeral DuckDB engine that just spins up, answers the query, and dies, and a new one will spin up next time, if that makes sense. And to build all of this — the combination of fast Iceberg, S3 and so on — at that time we had to bring our own parts, which is the journey we detail in the blog post, for the community to learn from as well.

Absolutely. And is every component in your DAG its own instance of DuckDB?

Every time. Every function in the DAG is a fully separate containerized function, including the SQL ones.
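A toy version of the "there is no warehouse" trick, using real DuckDB but with an invented local path standing in for the Parquet files resolved from the Iceberg snapshot: an in-memory engine spins up for one query, hands back Arrow, and dies.

```python
# Ephemeral engine: exists only for the lifetime of one query.
import duckdb

def ephemeral_query(sql: str):
    con = duckdb.connect()               # fresh in-memory DuckDB, no server anywhere
    try:
        return con.execute(sql).arrow()  # results come back as an Arrow table
    finally:
        con.close()                      # the "warehouse" is gone again

# Path is illustrative; in the flow described above these files would be
# resolved from the Iceberg snapshot behind the 'titanic' table.
print(ephemeral_query(
    "SELECT sex, count(*) AS n FROM 'titanic/*.parquet' GROUP BY sex"
))
```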
So you can chain SQL and Python together, and you can even do weird things like: I want to run three functions in Python 3.10, two functions in Python 3.12, and three different versions of pandas. This will all work, because functions are really ephemeral and separate on the environment side. Again, think about Spark: you can't run half of your Spark cluster on pandas 2.2 and half on pandas 1.5. The idea of using functions as a smaller compute block gives you the flexibility to pick and choose. So if you want to update some of your functions to pandas 2 but leave some of them on pandas 1, be my guest — the DAG will work either way.
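A hypothetical decorator sketch of per-function environments in the spirit of what's described here; the names and arguments are illustrative, not Bauplan's actual API. The stub only records the requested environment, which a runtime could then build per node.

```python
# Each DAG node pins its own interpreter and package versions -- something a
# shared Spark cluster cannot do. `python_env` is an invented stand-in.
def python_env(interpreter: str, pip: dict):
    def wrap(func):
        func.env = {"interpreter": interpreter, "pip": pip}
        return func
    return wrap

@python_env(interpreter="3.10", pip={"pandas": "1.5.3"})
def legacy_features(orders):
    ...  # would run under Python 3.10 with pandas 1.x semantics

@python_env(interpreter="3.12", pip={"pandas": "2.2.2"})
def new_features(orders):
    ...  # same DAG, different interpreter and pandas major version

print(legacy_features.env, new_features.env)
```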
Yeah, that's definitely a super interesting design. I'm glad we've covered some of the fundamentals here — how the DAGs work and what the underlying components are. Tell me how it scales. What is your methodology for increasing both the throughput and the availability of your workloads?

Yeah, that's a very good question. The first thing is our hypothesis when going to market, based on years in this space; we call it the reasonable scale hypothesis. It goes back to the machine learning days, and it was based on our observation that many people wanted to use distributed systems — to do training, in that particular instance — but it turns out that datasets didn't grow as much as compute in the same amount of time. When Spark was started, a typical EC2 machine was smaller than an AWS Lambda today — that's a fun fact you can Google — which of course is not the world we live in today. Now a Lambda has 10 gigabytes of memory, and you can find EC2 instances with a terabyte of RAM, literally off the shelf. So the system we designed is based on the idea that a function runs on one host and one host only — meaning you can have a DAG of 15 functions, and some of those functions may run on different machines, but any given function is basically a single process. And thanks to our friends at DuckDB, DataFusion, Polars, and all the great work the community has been doing in single-node processing, this would not have been possible 10 years ago.

So that's the first point. How do we scale? We scale with our own scheduler. When you submit a job to Bauplan, it goes to a control plane that is linked to your account as an enterprise — all of our workers are single-tenant, SOC 2 compliant, private-linked, you name it, so everybody's happy that the data doesn't go outside. The control plane checks if there's availability among the available workers. If yes, it sends what we call the physical plan — the actual thing you need to run to make the DAG work — to the worker, and the worker has its own binary, which is our custom runtime, to spin up the containers and do all the jazz we described.

Then of course it may be that there's no available space, and one of two things happens. One is priority-based scheduling. If you're submitting a query, that's typically considered a synchronous operation, because there's a human on the other side. So it may be that we want to pause an existing DAG between functions to serve the query, and then resume the DAG. Or we spin up new workers, in which case you just augment the total capacity at your disposal — that's the easier way to schedule, but it's really our last resort. Our attempt is always to fit within the given capacity, provisioning as many workloads as we can, based on the observation — I think the Borg paper by Google made it, Mesos as well; a bunch of people in the systems community have observed this before — that you get much higher utilization by combining synchronous and asynchronous workloads, instead of having two separate paths for batch and real-time. Nobody will ever fill the total real-time capacity, which leaves you rather a lot of spare room for things that are less time-sensitive to be processed efficiently. That's one of our key system intuitions in building Bauplan.

And are there ways you can implement, let's say, partitions of data, or sharding data across different components of the DAG? How does that work?

Right now, you cannot. Right now, functions are literally bound to the host on which they get scheduled. You could imagine, though, for, let's say, an embarrassingly parallel operation — say you have a dataset and there's an operation you want to do on every month of this year — it would be a trivial thing to add a new decorator on the platform that says shard by month, and then the child function, instead of running once, runs one copy for each of the 12 months. And then you're back to the general problem of scheduling.

So the good thing and the bad thing about this is that, since everything is a function in our lakehouse, there's always one mental model to learn. The user just needs to pile up functions, and we as developers just need to schedule functions efficiently. Which, again, is very different from your typical lakehouse, where you have notebooks, clusters, warehouses, orchestrators — to do the job of a lakehouse, most people cobble together five or ten different systems, a dozen AWS tools. Bauplan is kind of weird: the idea is that a lakehouse is just basically a combination of functions with different priorities. Some are very important, like when you run a query; some can wait a bit — if a pipeline typically runs in an hour, five minutes really doesn't matter in most cases. So that, again, is the intuition. We pay the price of some scheduling penalty for simplicity and uniformity in how the system is actually run.
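A heavily simplified sketch of that intuition: one shared queue where synchronous queries (a human waiting) outrank batch DAG steps, rather than two separately provisioned pools. Priorities and task names are invented.

```python
# Mixed sync/async scheduling toy: the interactive query always runs first,
# and batch pipeline steps soak up the spare capacity.
import heapq

QUERY, PIPELINE = 0, 1  # lower number = more urgent

work = []
heapq.heappush(work, (PIPELINE, "dag-step: clean_orders"))
heapq.heappush(work, (PIPELINE, "dag-step: aggregate_orders"))
heapq.heappush(work, (QUERY, "interactive: SELECT ... FROM titanic"))

while work:
    priority, task = heapq.heappop(work)
    print("running", task)  # the query preempts; batch steps fill the gaps
```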
Yeah, that's super impressive. It's clearly a very flexible framework, but scalable at the same time. I'd love to get into the core use cases you're solving. We all know that data processing can be pretty general in the types of problems it can solve, but what do you see as the key popular use cases to begin with?

So our customers right now are mostly exploiting the platform in three different ways. One is write-audit-publish. The pattern, for people who are not familiar with it, is this whole idea in the data lake that instead of adding rows to your table and then figuring out later that there were bad rows, you use branches, like you would with git. You import your new data into these branches, which are separate from the main system that everybody sees; you do a quality check; and then, if the quality check passes, you merge, and now it becomes the new production data. If you're familiar with CI/CD, it's exactly the same thing, but for data. Bauplan allows you to do this with two lines of code, and it's super seamless, so that's been very popular — see the sketch after this list of use cases.

The second is general ETL pipelines for analytics. Boring these days, because everybody wants to do AI, but it's been a staple for our customers: moving data from point A to point B to power a dashboard for executives. Of course there are many other tools you can use, but it's hard to find one as simple and straightforward to use as Bauplan. So that's been a very popular use case.

And finally — unfortunately, or fortunately — the whole GenAI app wave is taking everybody by storm, so there's always some new GenAI use case our customers want to leverage. The typical use case at the intersection of data pipelines and AI is data enrichment. You get data from a stream — let's say you're an e-commerce company, just to make an example. You get catalog data and you want to produce a new description of your products based on some marketing idea, like "market this to a Gen Z person living in Williamsburg." And for this, GenAI is actually very good. So what our customers do is import whatever package they want — for, I dunno, OpenAI or Bedrock or whatever — and then, inside the data pipeline, they insert GenAI capabilities, leveraging the rest of the components that Bauplan gives you: branching, versioning, Iceberg compatibility and all of that.

So these are, I would say, the three main things: engineering people doing write-audit-publish, analytics people doing ETL, and AI and machine learning people doing AI enrichment.
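To make the first pattern concrete, here is a toy, in-memory rendition of write-audit-publish with branches. Bauplan's real CLI/SDK is different (and shorter); this just shows the shape of the pattern.

```python
# Write-audit-publish toy: new data lands on a branch, gets checked, and only
# reaches "main" if the audit passes.
import pandas as pd

tables = {"main": {}}  # branch name -> {table name -> DataFrame}

def create_branch(name, from_ref="main"):
    tables[name] = dict(tables[from_ref])  # cheap "branch" of the catalog

def merge(branch, into="main"):
    tables[into].update(tables[branch])

# WRITE: imported rows are invisible to readers of main.
create_branch("import_jun")
incoming = pd.DataFrame({"order_id": [1, 2], "amount": [10.0, None]})
tables["import_jun"]["orders"] = incoming

# AUDIT: the quality check runs against the branch, not production.
if tables["import_jun"]["orders"]["amount"].isna().any():
    print("audit failed: nulls in amount; main is untouched")
else:
    merge("import_jun")  # PUBLISH: the branch becomes production data
```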
And then you get all these tables, which are versioned for you in object storage, automatically, by Bauplan. And then you can build, I dunno, a Streamlit app for your colleagues in Python and say: hey, these are three different models, three different branches — which one do you like the most? And when they say "this one," you know what you do? bauplan merge, and that branch becomes the production branch. Again, you start to reproduce the pull request flow, but for GenAI, which I think is very powerful, especially when you can do it with three lines of code.

Yeah, very cool. And people are writing DAGs today using things like Airflow or dbt. What would you say to those teams? Would you say migrate, or use Bauplan for separate use cases? How should they think about Bauplan as a choice for some of their development?

So I think the answer is slightly different between the two examples. If you take dbt — which is very good, especially for SQL people — it's mostly, let's call it, a templating engine for DAG-based transformations. And we — Bauplan, the community — owe a lot to some of the very good ideas the dbt team had back in the day. If you want to keep the code in dbt but run it on Bauplan, that's fine; transpiling that code to our platform — we haven't had a customer with that need yet, but it's obviously something that can be done. So if somebody has a preference for the dbt style of syntax over the Bauplan one, I don't think that's a particular issue.

For orchestration, though, the answer is a bit more subtle. We don't orchestrate things — meaning that if you want to run a bauplan run every day at 11:00 PM, we don't do that; you have to do it yourself. That said, it is one line of Python that you can import into any orchestrator of your choice. That was a design choice we made explicitly, because nobody wants to change orchestrator. I dunno if you've seen the same thing in the market, John, but in my experience, when somebody has an orchestrator, they don't want to adopt a new one just to adopt a new data tool. So the idea was: keep your orchestrator. We are not gonna run this every day at 11 — you are gonna run this every day at 11 — but I'm gonna make it so trivial to run from Airflow, Prefect, or Dagster that you can't say no.

So if you already have Airflow, my suggestion would be: run the business logic of your data in Bauplan, 'cause it's a million times better than passing data through XCom or having a Kubernetes cluster to orchestrate — there's no comparison between the two — but keep Airflow for backfilling, observability, triggering and all of that. The difference is that your Airflow DAGs become one-liners. The best orchestrator code I've ever seen in my life is where every task is one line: the code calls an external service, and the orchestrator just does, let's say, the outer loop of this.
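Here is what that division of labor can look like with Airflow's TaskFlow API (Airflow 2.x): the orchestrator owns the schedule, retries, and backfills, and the task body is a single call into the engine. `bauplan_run` below is a hypothetical stand-in, not the real client.

```python
# "Keep your orchestrator" pattern: business logic lives in the external
# engine; the Airflow task body is one line.
from datetime import datetime
from airflow.decorators import dag, task

def bauplan_run(project: str) -> str:
    # Stand-in for the engine's one-line SDK call.
    print(f"submitting {project} to the external runtime")
    return "job-123"

@dag(schedule="0 23 * * *", start_date=datetime(2024, 1, 1), catchup=False)
def nightly_pipeline():
    @task
    def run_sales_pipeline() -> str:
        return bauplan_run("sales_pipeline")  # the whole task: one line of logic
    run_sales_pipeline()

nightly_pipeline()
```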
Which I think is a very powerful thing, because it leaves orchestrators to do what they do best — backfilling, triggering on time or on events — and it leaves pipeline and other tools (the example still stands for Databricks, for instance) to do what they do best, which is process data. I think when people put business logic inside an orchestrator, that's where things get not so good, because now you are constrained by the orchestrator's runtime and syntax for something that should be a bit more general. So I don't think that's a very good pattern. I dunno if that's what you've seen too.

Yeah, that definitely answers the question. A lot of the subtlety is in the design and the separation of concerns for the data pipelines teams are building. It ultimately comes down to what their requirements are, what type of capacity they need, and how they're even ingesting the data — whether it's streaming data, or data coming from an API they can't poll more than once a minute — and then going from there and architecting something holistic.

Now, I wanted to ask you: every time folks work on a big project, like a big software engineering release, there are always some surprises, right? So I'd love to ask: what was a moment during development, or testing, or when you took it to users for the first time, that something surprised you and ultimately shaped the final design?

Surprised me in a good way or in a bad way?

Let's do both.

There are probably both, yeah — there are a few things I'm definitely happy to elaborate on. One thing I would say, though, is that our first company was about natural language processing, when it was still not cool and still hard to do, and we built data pipelines for a long time. So the first real customer for this product was a younger version of myself — slightly slimmer, no gray hair, slightly funnier. When we started developing this, we asked ourselves what product we would have liked to have at our disposal when we built our first company, instead of the stack we ended up with because that's what was available at the time — which was Airflow, EMR and Glue, if people are curious; I can talk about that for hours and hours. So in the very beginning, the design choices were mostly based on our taste, for good or bad. And then we started experimenting, I think after about six months of building, with our first users.

The first surprise: way more people want to do Python than SQL when building data pipelines. The tool is still SQL-capable, but initially we were expecting a half split between users doing SQL in pipelines and users doing Python. It turns out that, at least for most of the people we talk to, most of our customers, that's not true. The vast majority of our customers use Python packages to do transformations, and they use SQL just for the last layer — to visualize the data, look at it, stream it out, whatever. So some of the things we built early on didn't turn out to be so useful, and I think the platform is now rebalanced in that sense: a lot of attention goes to Python workloads, and SQL is there, but it takes a bit of a backseat. Even the documentation and the examples are now like 85% Python-heavy. I dunno if it's gonna stay that way, but a lot of the pull we're getting from people building ETL for OpenAI and stuff like that keeps pulling us in that direction.
So there are things we didn't anticipate — and, based on my own experience, I love both, I love SQL and Python — so that was a bit of a change of mind.

The thing that surprised me positively, though we had a hunch about it, is the entire branching concept. People in SQL are very used to the idea that when you run a bunch of stuff in Snowflake or whatever and it fails, it didn't run. It's the concept of transactions, right? The database enforces it for you, and it gives you peace of mind: either it's okay and somebody downstream can read it, or it fails for whatever reason — and sure, that's not a good thing, but you can look into it. None of this exists in the Python pipeline world. If you build your entire business logic inside Airflow, or S3, or XCom, there's no such concept. And so a very powerful thing we did with this concept of git for data is that we empower arbitrary pipelines — in whatever language you want, SQL, Python, mixed together — to be transactional. So if today you run a DAG of 10 or 15 nodes in Bauplan, it either completes, and you're gonna find the resulting tables in your branch, or it doesn't. There's not gonna be a quantum state, a superposition where half the tables are wrong and half are correct: you either see the entire end result or you don't. And that has been incredibly well received.

The git-for-data idea — that you can use Bauplan like you use git: bauplan branch, bauplan checkout, bauplan merge — this is the most popular thing on the platform right now. We do tens of thousands of merges a week, between manual usage and, of course, scripted machine usage. So that has been a surprise in the positive sense. I think people find it very novel, kind of a fresh take on this whole concept of running pipelines, but with some safety nets.

And this supports — like you said, it's like git for data in a way — is this versioning for the DAG runs and the data itself, or is it more for the logic?

You know, the cool thing about having the framework and the runtime together is that we can version the two things at the same time. So data and code are intrinsically tied — sorry, is that a word? — they cannot be dissolved, basically. Every time you do a bauplan run, you run a pipeline and you get back an ID. That ID uniquely and immutably identifies two things: the commit corresponding to the state of your data at the specific moment your pipeline started, and the code that you shipped to the cloud to run. What is the consequence? If a month later you come to me, John, and say, "I had a bug in the pipeline, can we debug it?" — I don't have to go to CloudWatch, I don't have to open 40,000 tools, I don't even have to go to git. You know what I do? I do bauplan rerun and pass the job ID that you gave me, and I'm guaranteed to run from the same exact state of data, with the same exact code, that you ran a month ago. So I can reproduce forever everything that happens on the platform. And that is only possible because we keep track of both things at the same time.
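The reproducibility contract reduces to a simple idea: a run ID that immutably pins both the data snapshot and the code ref. A toy illustration of the idea — not Bauplan's implementation, and the refs are made up:

```python
# Every run ID pins (code ref, data snapshot); a rerun replays the exact pair.
import uuid

RUNS = {}

def run(code_ref: str, data_snapshot: str) -> str:
    job_id = str(uuid.uuid4())
    RUNS[job_id] = {"code": code_ref, "data": data_snapshot}  # immutable pair
    print(f"ran {code_ref} on {data_snapshot} -> {job_id}")
    return job_id

def rerun(job_id: str) -> None:
    pinned = RUNS[job_id]  # no CloudWatch archaeology: the ID is the lineage
    print(f"replaying {pinned['code']} on {pinned['data']}")

job = run("git:abc123", "iceberg-snapshot:9912")
rerun(job)
```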
Yeah, that's super impressive and very well thought out, and it really gives a lot of flexibility and room for collaboration and experimentation with these pipelines. Another thing I wanted to ask you about: you have a strategy for handling dependency isolation between functions in the same DAG. Like you said, you could hypothetically run different versions of Python. How do you manage the environment provisioning for that?

Yeah, that has been a nice trick we learned from our advisor, Professor Tyler at the University of Wisconsin-Madison, with whom we collaborate. So imagine what you need to run a Python process. There's gonna be a Python interpreter, and there's gonna be a bunch of packages, which at the end of the day are a folder next to the Python process, installed with pip. Okay? So the normal way people build a container is basically: a requirements file with a list of packages; you install them one after the other; then you run the process. What's good about this? It gives you isolation — you can run two functions with different containers. What's bad about it? You need to install all the things, all the time, because these two containers don't know about each other. That's the naive way.

So what did we do instead? When you ask Bauplan to run pandas, Bauplan first gets the dependency graph — the transitive closure of all the packages you need for pandas — and then installs each of them in parallel, separately, on the underlying host machine. So when you need to run a container, we do not do docker build, we don't do pip install. We build a container and we mount, read-only, the packages that you need, in the environment that you need. This means we don't pay minutes of penalty to download packages or duplicate them — there's at most one copy of each package for each version — and starting a container is basically just mounting folders, which is why our containers start in a hundred milliseconds and not in ten minutes. This way you get to reuse basically everything that somebody on your team has already used, without paying the penalty. And since people who work at the same company tend to install the same packages, the cache hits are a lot — the cache is always very warm, because everybody's installing basically the same packages — which makes all of this work very well.

The consequence is that our lifecycle management for functions is straightforward: they go up, they run, and they die. Why? Because they're so fast to go up — since we don't pay any of the real price of building containers — that it's much easier for us to reason about them if we just kill them all the time, instead of managing the lifecycle of the container. So that's been the trick there.
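A sketch of that environment trick: install each package==version once into a shared host cache, then assemble a function's environment by symlinking (standing in here for read-only mounts), rather than pip-installing per container. All paths and names are illustrative.

```python
# One host-wide copy per package==version; per-function environments are
# assembled by linking, so "building" an environment takes milliseconds.
import tempfile
from pathlib import Path

ROOT = Path(tempfile.mkdtemp())   # stand-in for the worker's filesystem
CACHE = ROOT / "pkg-cache"        # at most one copy per package==version
ENVS = ROOT / "envs"              # per-function mount points

def ensure_cached(pkg: str, version: str) -> Path:
    slot = CACHE / f"{pkg}-{version}"
    if not slot.exists():
        slot.mkdir(parents=True)  # real system: pip install --target=<slot>, once
        print(f"installed {pkg}=={version} host-wide")
    return slot

def build_env(func_name: str, deps: dict) -> Path:
    env = ENVS / func_name
    env.mkdir(parents=True, exist_ok=True)
    for pkg, ver in deps.items():
        link = env / pkg
        if not link.exists():
            link.symlink_to(ensure_cached(pkg, ver))  # mount, don't reinstall
    return env  # point the interpreter here: startup in milliseconds, not minutes

print(build_env("clean_orders", {"pandas": "2.2.2", "pyarrow": "16.1.0"}))
```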
Yeah, that's one of the things that's very impressive about Bauplan. There are a lot of great — you can call them technical tricks, you can call them just really great design choices — that ultimately lead to a really great developer experience and performance. I'd love to hear from you: where can people continue to follow along with your work?

I mean, I'm very easy to reach, and as you've probably guessed by now, I like to talk a lot, so I'm very social. I'm happy to grab a coffee, real or virtual, with anybody who wants to geek out on any of these topics — LinkedIn, email, whatever they want. If you want to know in particular what Bauplan is doing, we have a newsletter where we send out our blog posts on the engineering stuff and our research papers. And of course, if you want to try the product, you can just sign up for the free sandbox. It's free, you can try it on public datasets, and we're always happy to get feedback. We're in private beta, so you need to request an invite and then we invite you. But again, we're in the process of being very open with what we're doing, now that we have production use cases we feel very confident about. So any feedback — even negative, especially negative — is very welcome; that's the valuable feedback.

And especially if you're a founder — especially a young founder, a first-time founder — somebody did that for me with my first company, and I'm happy to give back: to listen and give you unsolicited feedback on your company, or fundraising, or whatever. Seriously, if you're a founder, especially in the data and AI space, and you want to talk to somebody that's a bit older than you, or you think I can be helpful in any way, please do reach out, 'cause I'm really happy to help other entrepreneurs.

That's great — a great opportunity for the entrepreneurs in the AI space to work with someone who teaches at NYU and has a lot of experience in this space. Jacopo, thank you so much for joining this episode of What's New in Data. At the time we're recording this, we're in mid-April, so I'll see you next week at Data Council. We talked about bringing our tennis rackets — you can continue to work on your backhand volley a bit. I'll definitely target it.

And I'm ready. Yes, I'm ready to be destroyed by you, John. At tennis, no? You're just setting me up for a surprise loss.

Yeah, sure. No, it'll be good times at Data Council. And you're giving a talk at Data Council as well, right?

Well, Ciro, my co-founder and CEO, is gonna give a talk on Python on the lakehouse. So it's gonna be more of a talk on, let's say, the data science ergonomics of the platform, and a bit less of the inner infrastructure stuff we went into today. But we'll be around for the entire three days, so if you want to get a coffee or a drink on our dime — or on our investors' dime — please reach out. We're super happy to meet interesting people, and again, we always love to talk to other founders and startup folks.

Awesome. Well, I'll post that on LinkedIn today, even though this episode won't be out before Data Council — all in good timing either way — and I look forward to seeing you out there.

Of course. Thanks so much for having me, thanks everybody for tuning in, and I look forward to playing tennis with you, John, and everybody else who wants to challenge me at that.

Exactly — I mean, you're basically inviting it, 'cause your LinkedIn tagline says you're working on the backhand volley. So there's really no other choice than to hit to your backhand when you're at the net. It's a really great opportunity to work on it. That's one of my weaker shots as well — maybe I'll put that in my tagline too.
You wouldn't guess how few unsolicited requests from recruiters I got when I moved my tagline from "AI, data" — whatever it was before, AI director or whatever — to "improving my backhand volley." The number of requests I get from recruiters dropped like 98%. So at least there was something good that came from the tagline.

There you go. Well, awesome. I look forward to seeing you next week, and thank you for joining us on this episode. And thank you to all the listeners who tuned in. Thanks again.