Pybites Podcast

#197: Polars with Jeroen Janssens and Thijs Nieuwdorp

β€’ Julian Sequeira & Bob Belderbos β€’ Episode 197

Polars is changing the game in Python data processing – and fast. In this episode, we chat with Jeroen Janssens and Thijs Nieuwdorp, authors of Python Polars: The Definitive Guide, about how this DataFrame library is revolutionising workflows. 

From its origins at Dutch firm Xomnia to GPU-powered speed boosts and a behind-the-scenes look at writing their 500-page book, this episode is packed with insights on why Polars is winning over data teams. 

Check out Python Polars: The Definitive Guide - https://polarsguide.com/

Find Jeroen and Thijs online:

___

πŸ’‘πŸ§‘β€πŸ’»Level up your Python skills in just 6 weeks with our hands-on, mentor-led cohort program. Build and ship real apps while gaining confidence and accountability in a supportive community. Join an upcoming Pybites Developer Cohort today! πŸŒŸβœ…

___

If you found this podcast helpful, please consider following us!

Start Here with Pybites: https://pybit.es

Developer Mindset Newsletter: https://pybit.es/newsletter πŸ’‘

Pybites Books: https://pybitesbooks.com/

Bob LinkedIn: https://www.linkedin.com/in/bbelderbos/

Julian LinkedIn: https://www.linkedin.com/in/juliansequeira/

Twitter: https://x.com/pybites

Apple Podcasts: https://podcasts.apple.com/us/podcast/pybites-podcast/id1545551340

Spotify: https://open.spotify.com/show/1sJnriPKKVgPIX7UU9PIN1

Jeroen:

So imagine these consultants coming in like, hey, at Xomnia we have this product called Polars and we really think that we should use this. We were really recommending our own product, so you can imagine that they were hesitant to adopt it. But thanks to Thijs's benchmark, which demonstrated that we could bring a computation down from 30 seconds to one second, we were able to convince them.

Julian:

Hello and welcome to the Pybites podcast, where we talk about Python, career and mindset.

Bob:

We're your hosts. I'm Julian Sequeira and I am Bob Belderbos. If you're looking to improve your Python, your career and learn the mindset for success, this is the podcast for you. Let's get started. Hello and welcome back to the Pybites podcast. I'm Bob Belderbos and I'm here with Jeroen Janssens and Thijs Nieuwdorp. Hey guys, welcome to the show.

Thijs:

Thanks for having us. We're good, definitely. It's, I think, one of the first times we're doing a podcast in the middle of the day instead of in the evening on American time, so it's very comfortable.

Bob:

I'm still fresh, it's a great time for me as well. We're all in the same time zone, so thanks for meeting in the morning. And yeah, I'm excited to talk today about Polars and your Definitive Guide O'Reilly book. That's quite an achievement. So yeah, we'll dive into the book, the process, Polars as a framework, you guys, what you do. I have a set of questions lined up so we can dive straight into it. But maybe first of all, do you want to briefly introduce yourselves to the Pybites audience? I guess we can start with Jeroen and then we go to you, Thijs.

Jeroen:

Yeah, so Jeroen Janssens. I am a developer relations engineer at Posit and I have a background in machine learning. I used to be a data scientist for a while, had my own company giving training in data science-related topics, and now I'm at Posit. That's me in a nutshell, from a career perspective that is.

Bob:

Nice, and you also authored another book I really liked back in the day. Well, yeah, I think it's quite a few years ago: Data Science at the Command Line, right?

Jeroen:

Yeah, that's right. The first edition came out in 2014, and in 2021 I wrote the second edition. And I still hadn't learned by then that writing books is just very hard and very time consuming. And so we did it again.

Bob:

And yet here we are, this time with Thijs, which maybe is an explanation why you co-authored it this time, right? Oh yes, definitely. All right, cool. Thanks. Thijs?

Thijs:

Yes, I'm Thijs. I am 32 years old since last Monday, so it's already seeping in. I work at Xomnia. I mean, officially my title is data scientist, but by now it's also a little bit machine learning engineer, software engineer. Whatever needs code, I'll type it for you. And, of course, author.

Bob:

Awesome, great. So yeah, let's start off with the backstory. How did you two decide to write Python Polars: The Definitive Guide? How did that happen?

Jeroen:

How did it happen? So I used to work at Xomnia as well. Thijs and I used to be colleagues, and Ritchie Vink, the creator of Polars, also used to work at Xomnia. So the three of us used to be colleagues. And when I had just started I wasn't assigned to any specific project yet.

Jeroen:

I had some time on my hands and I learned about Ritchie and his pet project at the time, Polars. So I decided to give it a go and I was immediately hooked. I know that Thijs had the same feeling. And I immediately thought: okay, this definitely deserves a book. How can we accomplish this? So first I tried to convince Ritchie: hey, Ritchie, you have to write a book. But Ritchie absolutely didn't want to do that. He just wanted to focus on Polars itself and the company that he had just started, Polars Inc. And I think that, in the end, that's for the better, because Ritchie has such a talent for programming.

Jeroen:

I had a meeting with my editor at O'Reilly anyway, so I decided to just ask: hey, have you heard about this thing called Polars? He wasn't really familiar with it, but they had already received four proposals for writing a book about Polars, which they all rejected. And then I thought: all right, I'm not the only one with this idea. Maybe I should just do this, I should write this book. But I really wasn't feeling like doing this on my own again, and so I reached out to Thijs. And why Thijs? People still ask. I'm now convinced that I definitely made the right decision. No worries there.

Jeroen:

So Thijs and I were not only colleagues; after a while we were also assigned to the same project, and at this project, at the client, we got to experiment with Polars. We can talk a little later about how that happened; that's a story in itself. But you know, I casually asked Thijs whether he likes writing.

Jeroen:

I saw that he had written a blog post or two, and then I just asked him: hey, do you want to do this together? It'll only take a year and, you know, we'll become rich and famous. Well, it turned out it actually took us two years to write this book, and we're still waiting on the money and the fame. But hey, here we are, we have actually completed the book after nearly two years. So yeah, very happy that we have written the book and that we have done this together. I'm certain that I wouldn't have been able to do this just by myself.

Bob:

Yeah, nice. Yeah, it's exhaustive. How many pages?

Jeroen:

A little over 500 pages. Yeah, yeah.

Bob:

Cool and you said you both got really hooked from the start.

Thijs:

Why was that? So for me, the way I got into it, it was very early Polars still, and it was just Ritchie typing away at it. I hopped onto it just to explore what it was. Of course there was Ritchie raving about what he'd been up to, and that made me interested in exploring some new packages. I had been using pandas, because it was the go-to at the time, and I wanted to try something new.

Thijs:

I think at one point there was an innovation week that we had at Alliander, where at the end of every quarter, together with the planning for the next quarter, you don't have to pick up any work but get to invest in exploring some new things. I had toyed around with Polars already, but just small examples: loading in a CSV, doing some simple stuff, just to see how it goes. And that was already so fast that I was looking for a place to start implementing it a bit more seriously. In that innovation week I was able to apply it in a little bit of the pipeline at Alliander, and brought that specific part of the pipeline from 30 seconds to just a little over one second of runtime. That proved to me that there was definitely something to gain from using Polars, and ever since I've been using it practically.

Jeroen:

And that also allowed us to convince the team at Alliander, our client, to start using Polars. Because Thijs and I were excited, right, but you've got to imagine how the rest of the team must have felt when we came to them. Sorry, let me take a step back. We were facing a problem. We had this giant code base, essentially a large ETL pipeline: lots of data sets, lots of computations, and the end result was written in both Python and R. Things were just not implemented as well as they should be. We were really running into performance problems.

Jeroen:

This process used over 700 gigabytes of RAM.

Jeroen:

Then at some point management came to us and said: all right, the computation that you now do once every six months, we want you to do once a week. And not just for this little subset of the electrical grid, because that's what Alliander does.

Jeroen:

They take care of the electrical grid for a third of the Netherlands. But we want you to do this computation for all our networks. So we had a serious problem and we really needed to do something. So imagine these consultants coming in like: hey, at Xomnia we have this product called Polars and we really think that we should use this. We were really recommending our own product, so you can imagine that they were really hesitant to adopt this package, which wasn't even at version 1.0 at the time. But thanks to Thijs's benchmark, which demonstrated that we could bring a computation down from 30 seconds to one second, we were able to convince them to move to Polars. And eventually we did rewrite the entire pipeline from R and pandas. So not everything can be blamed on pandas here, but we rewrote everything to Python and Polars.

Bob:

Nice. And for people new to Polars: you can code purely in Python, right? You don't have to go into the Rust. Yeah, exactly.

Thijs:

If you want to get your hands dirty in some Rust, you can always dive deeper into it, of course. But generally: Polars itself is written in Rust at its core, the whole engine that executes all of the stuff that you're typing runs in Rust, but it's got a very neat Python layer around it that makes it very usable in Python.

Jeroen:

Yeah, and there are also language bindings for R and JavaScript and a couple of others, but the Python one is definitely the most mature, which is also why we decided to focus on just that API for the book.

Bob:

Awesome. Yeah, I want to ask, of course, a question about why it's so fast, but maybe before that: it's also very ergonomic, right? I think, Jeroen, you said somewhere "no more brackets", referring to pandas. Is that part of why the interface is so nice? I want to go into the speed of things in the next question, but maybe we can talk a bit about why the interface is so nice.

Jeroen:

Yeah, that's a good point, and Thijs, you may be able to fill in some other things later. A common saying is that you come for the speed, but you stay for the API, right? And it's a combination of things. At the heart of Polars are expressions: these small recipes that you build up and that you can then use, or even reuse, at various stages. So whether you're selecting columns, filtering on a specific value or grouping, all of these DataFrame methods employ expressions in some way.

Jeroen:

There is definitely a learning curve, right? Especially when you're already used to pandas, there are some things that you really need to unlearn first. But once you get used to it... I always have to think about the grammar of graphics, ggplot in R or plotnine in Python, right? Again, there is a learning curve, but once you get it, you have to consult the documentation a lot less. So that's how I see this. Thijs, maybe you have anything to add: why is Polars so much fun to work with?

Thijs:

I think for me it's that if you read the code, it's way more understandable what happens. In pandas you have a lot of iterations, you make a lot of slices, selections, and you reset indexes. For some reason the index is this hidden state under the hood that's being used for operations, but you don't have much of a grip on it. That is a more imperative way of programming: you tell the computer exactly what kind of operations to run. Whereas Polars generally has a more declarative API: you define what you want the result to be, and that is way more natural. So even if you read Polars code line by line, it's more understandable what the outcome of that query is going to be, and that's for me the game changer.

Bob:

That's awesome. Because, yeah, I'm not using pandas often, so I might go back after a year and then I'm always like: is this square brackets? Is this loc, iloc? I have a hard time getting that into my muscle memory. Now, of course, AI is a great help there. So yeah, maybe we can talk about speed.

Thijs:

What makes Polars so fast? There are a couple of factors to it. One of the things you see popping up a lot these days is that anything written in Rust is instantly fast. It does help, especially compared to when you run stuff in Python, where, at least for now, you still have the global interpreter lock, which means it just runs in one thread on one CPU core. And I don't know when the last time was that I saw a computer that only has one core; I'm not sure if I grew up with any PC that had that. So always at least dual core or more. And that's kind of a waste, right, that you have a CPU sitting in your system that has four, eight, sixteen cores and just one of them is doing the work. That's something that Rust enables very easily: to safely work in multiple threads at the same time, to chop up the work and process it in parallel over these cores. That can gain you a lot of speed.

Thijs:

But what may give you an even bigger performance boost is the fact that operations are done in a lazy fashion. You can pick which kind of API you use: there's an eager and a lazy API. Eager is much like how pandas runs: you perform a command, it instantly gets executed, and line by line it just runs everything you tell it to. The lazy way of doing things is that first you build practically a recipe of all the things you're going to do, and then at the end you say: okay, now do it.

Thijs:

And because the computer knows exactly which columns you're going to hit, which kind of data you want to filter, it can optimize by pushing a lot of those filters all the way to the beginning, where you read. So if I read a file from disk and it sees later on that I only use two columns, it just reads those two columns, and all the other columns don't even get put into RAM; they don't go through all the operations with the others. That saves a lot of work and can give you something like a 10 times performance gain, just because you do exactly what you need to do. I think those two things are the big two in terms of performance. I'm not sure if I missed anything there.

Bob:

Cool. I also read that Polars is built on top of Apache Arrow and that it has something to do with columnar memory format optimization. Is that already in the buckets you mentioned, or is that something different still?

Jeroen:

It's actually a good point.

Thijs:

Yeah, I think it's slightly different. It does have a performance impact; it's hard to research how much, because Polars only uses Arrow, so you can't swap to a different way of handling things. But what Arrow is, is a specification of how you save your data types into memory, developed by Wes McKinney et al.

Thijs:

While he was building pandas, he found that he heavily relied on NumPy and the way NumPy saves data into memory. But NumPy is heavily focused on matrix operations and more numerical operations. So, for example, if you had string data or categorical data, it wasn't optimized for that. At the time he had to resort to saving these as Python objects in memory, with a heavy performance blow; especially string stuff in Python is just quite slow. And out of that came Arrow: a way to save your data into memory in a fashion that is optimal for query engines and analytical queries, so it is easy to retrieve without getting too many cache misses, without having to put a lot of time into reading the wrong parts of memory.

Bob:

Awesome, very interesting. And when we talk about memory, then Rust is just a way better candidate, right? Because you know the stack and the heap, and you can really optimize managing your memory better, where in Python...

Thijs:

...you have the garbage collector. In Rust you have a lot more grip on it, exactly.

Bob:

Cool, cool. Well, let's maybe shift gears a little bit and talk about real-world use cases where you think Polars really shines. I mean, we've already seen the amazing story about the pipeline that kind of triggered this whole thing, and that was a necessity. But maybe some other practical thing, because I can imagine a lot of people are just happy with pandas; they're used to it. Why should I embrace yet another tool, in spite of the speed and the beautiful API? So maybe giving one or two real-world use cases helps convince people to switch to Polars.

Jeroen:

Aren't those two reasons enough, right? A package that's arguably easier to work with, which brings significant speed increases.

Thijs:

And sustainability with it, because I recently saw that someone did a study comparing different DataFrame libraries on power efficiency, and Polars came out on top as well.

Jeroen:

But of course, learning something new is always an investment, right? So when you have a tight deadline and you just know your way around pandas, or what have you, then of course it might be better to just stick with that. You have to invest some time in learning Polars, or any new package for that matter.

Julian:

A quick break from the episode to talk about a product that we've had going for years now. This is the Pybites Platform. Bob, what's it all about?

Bob:

Now with AI, I think there's a bit of a sentiment that we're eroding our skills because AI writes so much code for us. But I actually went back to the platform the other day, solved 10 Bites, and I'm still sure of my skills, because it's good to be limited in your resources. You really have to write the code, it really makes you think about the code. It's really helpful.

Julian:

Definitely helpful, as long as you don't use AI to solve the problems. If you do, you're just cheating. But in reality, this is an amazing tool to help you keep fresh with Python, keep your skills strong, keep you sharp, so that when you are on a live stream, like Bob over here, you can solve exercises live with however many people watching you code at the exact same time. So please check out pybitesplatform.com. It is the coding platform that beats all other coding platforms and will keep you sharper than you could ever have imagined. Check it out now: pybitesplatform.com. And back to the episode.

Bob:

There was another collaboration I read about: you guys worked with Dell and NVIDIA on benchmarking Polars on GPU. Do you want to talk about that a little bit? What were some of the surprising or exciting results?

Thijs:

Yeah, for sure. So I think it was Jeroen that got contacted by NVIDIA. NVIDIA was very much on a marketing wave, because they've been working on several packages that are aimed at the data science market. They've got their own cuDF, which is a DataFrame library that runs on the NVIDIA CUDA platform, so it uses your GPU instead of your CPU, and for certain types of operations it helps a lot to have the GPU available. If you compare the CPU and the GPU: the CPU has just a relatively small collection of cores, but they're very advanced, so they can do very advanced types of computations and take big steps through the work you have to do. A GPU, by comparison, easily has sometimes a thousand times as many computation cores, but they're relatively dumb, so they take very tiny steps. If, however, you're able to rewrite operations to use those smaller steps, in queries that require a lot of computation, for example joins or some types of string operations, then it executes way faster on a GPU.

Thijs:

So, to take a step back: NVIDIA approached us to say, hey, we've created an engine that runs Polars queries on the GPU using that CUDA platform. And they came to us because we're writing the book, and we offered to write an appendix in return for being able to run our own benchmarks. NVIDIA had some marketing material, but they used a certain type of benchmark where they picked a server- or data-center-quality GPU, which is not something I have in my PC at home, that's for sure. So we offered: we can write something on it, but we want some hardware to test ourselves that is more aimed at consumers, something you run at home. And Dell also sponsored some hardware, like the actual desktop around the GPU.

Thijs:

Besides the GPU itself, we tested different graphics cards to see how performance scales with computation capabilities. We started with an A1000, which has a certain number of CUDA cores, all the way up to the more advanced ones with many more cores, to see how it scales with the performance of the GPU. And as we benchmarked, it depends on the query. In the worst case, with a very small query, it could be a little bit slower, because data has to be moved to the GPU first, then executed and moved back, and that comes with a slight overhead. But generally you see anything from a little performance hit, because of that overhead, all the way up to a 13 times performance boost that you gain just from running things on the GPU. And as the data grows bigger, that advantage grows too.

Bob:

Interesting, cool. So maybe that's already leading into the next question I had: how did writing this book influence your consulting work, and how did real-world challenges shape what made it into the book? Because I think this shows how practical you guys are, and how real-world experience has gone into the book, right? It's not just a book like: I just played with Polars and used the documentation, which is great, right? But it seems you extended it to the real world, so that's really cool. Anything to add to that?

Jeroen:

There was definitely a symbiotic relationship between us writing the book and rewriting this pipeline into actual Polars. I think that our client might be the first instance that I know of where Polars was actually put into production, and, last time I heard anything about it, it was still running smoothly. Diving deep into the language, because that's something a book forces you to do, you look at the language, at the package, the API, with a different perspective than when you're just using it. So this really gave us a deep understanding of Polars itself, and also of things that were missing.

Jeroen:

So us writing the book also shaped Polars itself, because we had very close connections with the Polars team, right? Nothing big, but a couple of little things here and there where we were able to submit an issue or a pull request. And because we were still new to it, we understood: okay, these are concepts that require a bit more explanation, or these are things that, say, a pandas user may find confusing. And not only the two of us, but also our colleagues: they also had to adopt Polars at a certain point. So in the very beginning of the book we describe two personas, and here's a little secret: they're actually based on two of our team members. As we were writing, we always had those two in mind: okay, is this at a level that they would understand? One of them is more into the data analysis aspect of things, and the other associates themselves more with being a data engineer. So that also helped shape the book.

Bob:

Nice. Yeah, I really like how you have shaped that with the personas. I guess it makes it more engaging, and it gives contrast as well: you have these different types of users that might use Polars, right?

Jeroen:

Yeah, and that's one of the main difficulties of writing: knowing your audience. Especially when you're writing over 500 pages, you want to be consistent. You constantly have to make choices about the scope of things, how you order things, at what level you're going to explain something. And then it really helps to have those two personas in the back of your head: oh wait, we really should explain this; or no, we can assume that they know this.

Bob:

Yeah, it's the same with our coaching: you really have to see where people are. Do I need to give more explanation, or can I just assume that they know something, right? Which brings me to the book-writing process. What was the hardest part, you think?

Thijs:

Technically? Finding the right tooling. When you're writing a technical book, one of the things you do is a lot of code examples, and afterwards you need the output of those code examples to be properly represented in the book. And there's a lot of different tools out there. I think we started with Jupytext, which transfers Jupyter notebooks into a text form. I'm not sure if it transfers all the way to Markdown, but it's something like that, a pipeline to transfer to that. And we ran into some issues there, with some functionality that didn't quite work for what we needed.

Thijs:

I think we looked at Quarto for a little bit, and there as well we constantly found these tiny things that created some friction. Up until the point where Jeroen, over the course of one weekend, just wrote his own tooling and called it updog, which led me to fall for the classic joke: what's updog? And then, of course... I think you can explain better what updog does.

Jeroen:

So eventually I was like: okay, I'm going to write this myself. I think that many authors go through this process. I talked to Wes McKinney, the creator of pandas and author of the book Python for Data Analysis, which is in its third edition now. When he started writing the first edition, O'Reilly was still using XML, and I heard him say that he wrote his own tooling to parse the XML, transfer it to Jupyter, and then get the results back in. And that's what we essentially did as well.

Jeroen:

When you're writing a technical book that has lots of code, you want things to be consistent. If you manually copy and paste output from your code snippets, then inevitably you will introduce inconsistencies; there is no way around it. So you need some way of automating this. We were writing in AsciiDoc, which is just another text format, like Markdown but a little bit more advanced, you could say. The tool updog extracts the code, turns it into a Jupyter notebook, and because Jupyter has a Python API, you can start a kernel from your own code, send it code cells and then extract the output. The output was immediately pasted back into the same AsciiDoc file, which was a huge advantage, because at a certain point others at O'Reilly want to work on that same AsciiDoc file; they want to make edits too.

Jeroen:

It's one of the reasons why O'Reilly books have such high quality: there are many people involved. It's not just like: all right, here's our text, just print it. So in the end, of course, there's some work involved in getting this script up and running, making it robust, and also making sure that when your code outputs an image, whether that's a PNG or some interactive HTML visualization from Altair, the default plotting backend of Polars, it is also correctly inserted back into the book. So yeah, some work, but now we do have a book that works. Oh yeah, that's another interesting point: as we were writing the book, Polars of course kept advancing as well. We started writing before it was 1.0, and there were constantly things changing. Which, I should also mention, was not so much a problem for the code that we were producing at our client.
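updog itself is their own tool and drives a real Jupyter kernel against AsciiDoc; purely to illustrate the run-and-capture idea, here is a stdlib-only sketch that executes a snippet and returns whatever it printed, ready to be written back next to the source listing:

```python
import io
from contextlib import redirect_stdout

def run_snippet(source: str) -> str:
    """Execute a code snippet and return everything it printed."""
    buf = io.StringIO()
    namespace = {}
    with redirect_stdout(buf):
        exec(source, namespace)  # real tooling would use a Jupyter kernel
    return buf.getvalue()

snippet = "x = 2 + 2\nprint('x =', x)"
captured = run_snippet(snippet)
print(captured)  # -> "x = 4\n", ready to paste under the listing
```

Because the captured output is regenerated from the source on every run, the listing and its printed result can never drift apart.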

Jeroen:

Right, there weren't that many breaking changes. But you do want to be able to sleep well at night, knowing that when the Polars team releases a new version, and they tend to do that every other week, a very high pace, your book is still correct. And so far we haven't received any issues that the code was incorrect, or that the code did not correspond with the output, or vice versa. So this was definitely worth it.

Bob:

Wow, that is challenging. It makes me rethink doing a technical book.

Jeroen:

Especially if there's a lot of code, right? Yeah, you should think twice before you invest such an amount of time.

Bob:

Yeah, the tooling is critical, right? Because they come up with new versions and your code is going to break, so you need some sort of regression framework that you can just constantly run.

Thijs:

Well, you don't need to have it, but it's definitely a good quality-of-life improvement if you have it, because having to go over everything at the end, or manually all the time, is horrible.

Bob:

Yeah, sorry, sorry.

Jeroen:

You just said the words "regression testing", right? Thijs has a good story about this.

Thijs:

Yeah. So, as I was benchmarking, one of the things I wanted to see is how Polars did at the same benchmark over time, because I was just curious: what kind of fixes did they make to the engine, and what kind of impact did those updates have? Generally the trend was downwards, except for one version where it just popped up and introduced, I think, a 20 percent performance loss, because it took longer. It's something I'll give a talk about at PyData. It led me down a whole rabbit hole of shell scripts and, mostly, git bisect, which is a command that allows you to track down which specific commit in a repo caused a certain regression.

Thijs:

The problem for me was that I had to pull the Polars repo itself. What I did at first was run the benchmark with uv, pulling in a released version of Polars, but every version has many commits in between, right? So to track down the exact commit, I had to clone the Polars repo and run git bisect from there, install the version I compiled from that repo into the virtual environment of the benchmark repo, and then run the benchmark there to see whether it got better or not.

Thijs:

Because that's how git bisect run works: it just picks a commit and then runs your script, which for me was: compile, run the benchmark and, based on the output, say whether it was a good or a bad commit, and that way track it down. Ultimately, after taping everything together, with a little bit of GPT magic in there, I was able to pull it off and found the commit. When I talked about it with the Polars team, they refactored the code, which magically solved it, because they couldn't quite pinpoint what exactly it was. It wasn't a clear mistake; they just moved some code around and that apparently broke something. But it was fixed in the release they put out afterwards.

Jeroen:

Yeah, the two of us are actually going to talk about this at PyData Amsterdam in September.

Bob:

Yeah, okay, cool. Well, we're coming up on time, and we'll link to the book; I think there's a free chapter on the website as well, if people want to read the first part. Before getting into the last question about books, reading and what you're learning, maybe we can do one more Polars question. Maybe it has been answered already, but it could still add value, and that's expressions, right, which are often described as recipes. Maybe just the mindset shift developers should adopt to really grasp Polars, or some general tips, apart from reading your book of course, on what you would advise people new to Polars and how to become proficient with it.

Jeroen:

All right. So, if I get your question right, it's: how do you advise newcomers to Polars, how should they approach this? Yeah, I think a lot of things in Polars look familiar. There are a whole bunch of functions to get data in from various sources, CSV, Excel, various databases, into a Polars DataFrame, or a LazyFrame if you want to go the lazy route. And on the flip side there's writing your DataFrame to disk, or to some database, or what have you. Those are not too different from what you might be used to. But in between, the data wrangling, selecting columns, filtering rows, sorting rows, aggregations, the methods themselves are easy enough: there's select, there's filter, there's sort, there's group_by and a couple of others. In between those, you have to use expressions, and that is, I think, one of the most important concepts that might be new to you: you are building up a chain of instructions.

Jeroen:

So you often start by referencing an existing column, and then you apply zero or more operations to that. Say, okay, I have an existing column a, and I want to sum it, when it's part of an aggregation, or I want to apply some string operations to it. There are over a thousand methods, organized under various namespaces, related to these expressions, and becoming familiar with all of these methods and how they're organized is going to be the biggest hurdle at first. After a while, once you become more proficient and want to do things in a lazy way, you will perhaps run into other hurdles, like: okay, I cannot do everything in lazy mode, I need to be caching things here and there. That's a little bit more advanced, but when you're just starting out, I would say that getting a good grasp of expressions is vital. We initially set out to write a single chapter about expressions, but it turned out to be three, because they play such a central role in Polars and there is so much to learn about them that we needed close to a hundred pages.

Bob:

Quick break for a note from our sponsor this week, which is PyBytes.

Julian:

PyBytes, that's us. I'm here, Bob. What are we talking about this week?

Bob:

Well, we have a new coaching program: PDC, or PyBytes Developer Cohort. We thought it was never going to happen, because we have been doing one-to-one coaching for five years, but now we can do group coaching as well. We're going to build a real-world app over six weeks in an exciting cohort, learn with one of our PyBytes coaches the whole journey, and also work together and learn together.

Julian:

And yeah, no more tutorial paralysis: build, build, build. It's wonderful; it's not something you want to miss out on, so please check out the link below: pybitescoaching.com. This is a program that bridges real building with a cohort environment, learning with other people, building with engineers. It's a wonderful thing. Check it out now at pybitescoaching.com, and back to the episode. Cool.

Bob:

So my last question is: what's next? I mean, the book is done. Are you still learning more Polars, or contributing, or are you learning other things? And, or, what book are you reading?

Thijs:

I think for me this got me inspired enough to try Rust. The whole movement I see is that the programming world, the data world, is getting a bit more polyglot: you have Python as kind of a glue scripting language to get stuff done quickly, and Rust for the more performance-critical components of your code. So that's something I want to dive into. And on a more personal note, I found a cool project that ties into Dutch history, family history even. My grandpa, after the Second World War, went to what was then the Dutch East Indies and wrote letters home about his adventures there. With the recent advances in LLMs and all kinds of natural language processing, I want to see if we can scan all those letters, I think there are two or three binders full of them, OCR them, and map out a timeline of those events, to see if I can retrieve some of that family history.

Bob:

That's so cool. Nice, nice. And Jeroen?

Jeroen:

Yeah, I'm really excited about Narwhals, the Narwhals package which I mentioned a moment ago, and I guess I should say a few words about that. With all these different DataFrame implementations in Python, right, there's not just pandas and Polars, there's also Ibis and a couple of others, it becomes really difficult for other package developers to support them all. Think of the many data visualization packages: they would need a lot of code if they wanted to support all of these. It becomes quite tricky. Narwhals is an abstraction on top of all of these DataFrame libraries, so that you, as a package developer, only need to use Narwhals, and then you automatically support pandas and Polars. And I think this is making a very big impact on the PyData ecosystem, because it's unifying many things that are out there.

Jeroen:

And if you look at the packages that are currently using Narwhals: Altair is one, Great Tables is another. Or wait, I think Great Tables was a little bit early to the party, and I think they have their own abstraction layer on top of this. But Bokeh is another one, another data visualization package which uses Narwhals. The number of packages is really growing, and I am really looking forward to plotnine, which is a data visualization package in the spirit of ggplot2, very popular in the R community. I can't wait to add Narwhals to plotnine, so that plotnine doesn't just use pandas but can support Polars natively. Well, I have some other things on my plate right now, but at a certain point I do want to take a stab at that.

Bob:

Interesting, cool, I hadn't heard of that one. So yeah, you both gave us...

Jeroen:

As a user, you don't necessarily need to use Narwhals yourself. It's when you are developing a package that is built on top of a DataFrame library.

Thijs:

I think it's also that Narwhals generally supports only a subset of the most important operations. If you want to get nitty-gritty or very advanced within one library, it doesn't cover all those use cases. It's mostly for package maintainers to do the data manipulation.

Bob:

Gotcha Cool. Well, we'll link that all in the show notes and, of course, the book.

Jeroen:

Have you not interviewed Marco Gorelli?

Bob:

No.

Jeroen:

All right, we'll introduce you to him. You should have him on your show.

Bob:

Okay, yeah, I appreciate that. We always look for more interesting guests and explore more tools and things Python-related. So, more than welcome. Cool.

Bob:

So, yeah, before closing: any final shout-out or piece of advice for our audience before we wrap it up?

Jeroen:

If you are interested in learning more about the book, or about Thijs and myself, you can go to polarsguide.com, where you can download the first chapter for free. See what it's like, see whether Polars is something you'll find worthwhile to explore and invest time in. The site also makes it easy to find myself and Thijs, and if you want to stay in touch with us, or if you have any questions, you'll find ways to reach out to us, which you're very much welcome to do.

Bob:

Nice, yeah, people, go check out the site; there you can reach out to Jeroen and Thijs, it's all linked up there. Jeroen, Thijs, thanks again for joining me today and sharing all these interesting things. Thanks for all the work you do and, yeah, hope to catch you in another episode.

Thijs:

Thank you so much for having us, Bob.

Jeroen:

Thanks, Bob.

Julian:

It's been a pleasure. Hey everyone, thanks for tuning into the PyBytes podcast. I really hope you enjoyed it. A quick message from me and Bob before you go: to get the most out of your experience with PyBytes, including learning more Python, engaging with other developers, learning about our guests, discussing these podcast episodes and much, much more, please join our community at pybites.circle.so. The link is on the screen if you're watching this on YouTube, and it's in the show notes for everyone else. When you join, make sure you introduce yourself and engage with myself, Bob and the many other developers in the community. It's one of the greatest things you can do to expand your knowledge, reach and network as a Python developer. We'll see you in the next episode, and we will see you in the community.