Streaming Audio: Apache Kafka® & Real-Time Data

ksqlDB Fundamentals: How Apache Kafka, SQL, and ksqlDB Work Together ft. Simon Aubury

December 01, 2021 Confluent, original creators of Apache Kafka® Season 1 Episode 188

What is ksqlDB and how does Simon Aubury (Principal Data Engineer, Thoughtworks) use it to track down the plane that wakes his cat Snowy in the morning? Experienced in building real-time applications with ksqlDB since its genesis, Simon provides an introduction to ksqlDB by sharing some of his projects and use cases. 

ksqlDB is a database purpose-built for stream processing applications and lets you build real-time data streaming applications with SQL syntax. ksqlDB reduces the complexity of having to code with Java, making it easier to achieve outcomes through declarative programming, as opposed to procedural programming. 

Before ksqlDB, you could use the producer and consumer APIs to get data in and out of Apache Kafka®; however, when it comes to data enrichment, such as joining, filtering, mapping, and aggregating data, you would have to use the Kafka Streams API—a robust and scalable programming interface influenced by the JVM ecosystem that requires Java programming knowledge. This presented scaling challenges for Simon, who was at a multinational insurance company that needed to stream loads of data from disparate systems with a small team to scale and enrich data for meaningful insights. Simon recalls discovering ksqlDB during a practice fire drill, and he considers it as a memorable moment for turning a challenge into an opportunity.

Leveraging your familiarity with relational databases, ksqlDB abstracts away complex programming that is required for real-time operations both for stream processing and data integration, making it easy to read, write, and process streaming data in real time.

Simon is passionate about ksqlDB and Kafka Streams as well as getting other people inspired by the technology. He’s been using ksqlDB for projects, such as taking a stream of information and enriching it with static data. One of Simon’s first ksqlDB projects used a Raspberry Pi and a software-defined radio to process aircraft movements in real time to determine which plane wakes his cat Snowy up every morning.

Simon highlights additional ksqlDB use cases, including e-commerce checkout interaction to identify where people are dropping out of a sales funnel. 

Tim Berglund:
My friend Simon Aubury has been using ksqlDB from the beginning, from back when it was called just ksql. He's done some really cool projects, blogs, and neat Raspberry Pi things. And also used it, as we say, in anger, in production settings. So I have him on the show today to really talk through that whole history: where ksqlDB has come from, where he's had success using it, the kinds of use cases that have worked, and what kind of use cases haven't worked. So listen in.

Tim Berglund:
Before we get to that though, I need to remind you that Streaming Audio is brought to you by Confluent Developer, that's developer.confluent.io. There are a lot of great educational resources about ksqlDB on Confluent Developer. There are a couple of video courses, all kinds of executable tutorials. I don't even know how many Confluent tutorials are there for ksqlDB, it's a lot. And a lot of other just general Kafka-adjacent educational materials besides. If you check it out and take any of the courses and do the exercises, you'll sign up for Confluent Cloud, and you can use the code PODCAST100 when you sign up to get an extra $100 of free usage. So check out Confluent Developer, check out the ksqlDB materials there. But first, check out this conversation between me and Simon.

Tim Berglund:
Hello, and welcome to another episode of Streaming Audio. I'm your host, Tim Berglund. And I'm joined again by my friend, Simon Aubury. Simon is a principal consultant with Thoughtworks. Simon, welcome back to the show.

Simon Aubury:
Why, thank you, Tim. I'm so excited to join you today and thanks for welcoming me back.

Tim Berglund:
You bet. I think the last time you were on was the first time you were on, and I thought it was the second. I had just remembered you being on before. And I welcomed you back to the show on your first appearance, but this is well and truly not your first appearance.

Simon Aubury:
Oh, well I'm really appreciating being back again for the second or N plus one time. It's fantastic.

Tim Berglund:
It is, appearance N plus one.

Simon Aubury:
Yes.

Tim Berglund:
And that, I think, is all that we need. So, you have been, since the beginning of ksqlDB, you've been a person who does a lot with it. I mean, that's some of my favorite cool demos and blog posts about it and things like that have been by you. And I kind of thought it would be cool to just get your introduction to ksqlDB, your cat's introduction to ksqlDB. If you're listening, 80% of you are listening to the audio, 20% of you are watching the video. You heard the bell jingling, that 20% saw Simon's cat. Now you go on, now you want to watch on YouTube.

Simon Aubury:
No. It's adding a level of intrigue. That is Snowy, the cat, who is also a big fan of ksqlDB.

Tim Berglund:
And is ironically named because Snowy the cat is, in fact, a black cat.

Simon Aubury:
That's right. That's right. Yeah. So-

Tim Berglund:
Yeah. You know what? You don't have to do it now, but I want to tie in, you're not kidding about Snowy liking ksqlDB.

Simon Aubury:
Oh, absolutely not. No, she is an internet sensation, who is deeply aware of the power of event streaming and the power of ksqlDB.

Tim Berglund:
And airplanes and that's-

Simon Aubury:
Absolutely.

Tim Berglund:
We'll get there. First, what's ksqlDB? For those who are somehow brand new to the ecosystem, they need you to explain.

Simon Aubury:
Yeah. Yeah. It's a really good question. And maybe it's worth explaining what the world looked like prior to ksqlDB. So I think we're probably all familiar with the great producer and consumer APIs that get data, both into and out of Kafka. But when we want to do super interesting stuff with the data that we've got resident in our event stream, we're into this world of event streaming and that's fantastic. That's things like joining and filtering and mapping and aggregating and all the good stuff that we want to do when we transform and enrich our data.

Simon Aubury:
So our choices, when we're back in the world of sort of 2016, 2017, was we could utilize the Kafka Streams API. And this is a really, really rock-solid, fantastic programming interface for doing all of the heavy lifting. So if you actually want to do things like join a stream to a table and get some results and add a level of enrichment or do some windowing, these are all sort of great value-add activities for our data stream. But although Kafka Streams is a fantastic API and a really robust and scalable solution, at the end of the day, it's a programming interface, which is great if you're a developer, but it's quite limiting if you're thinking about, well, we want to open up these capabilities to a wider group of people.

Simon Aubury:
And not only is it limited to programmers, it's heavily influenced by the JVM ecosystem. At the end of the day, to really utilize the Kafka Streams API, you've got to be essentially a Java or a Scala developer, which is good. But that's not everyone. So I was really excited when ksql came along in, I'm going to say 2017, but might need to be fact-checked on that.

Tim Berglund:
At Summit that year, there was an announcement, I believe.

Simon Aubury:
Yeah.

Tim Berglund:
And it was a super-duper early preview kind of thing. Not really terribly usable until into the next year, but I think that's right. I'm terrible with dates, but.

Simon Aubury:
Yes.

Tim Berglund:
I'll just sit here and Google it while you're talking in between.

Simon Aubury:
Between you and me, we're going to say around 2017. And the idea of presenting essentially the strength of the Kafka Streams API with a SQL-like dialect, this is super powerful because you're very much into this idea of describing what you want rather than mechanically describing the individual steps. So SQL is sometimes called a declarative programming language, in which you describe what you want to achieve rather than the individual steps. And that's fantastic.

Simon Aubury:
And I think a lot of people, who are coming from a relational database background, were quite familiar with the SQL-like abstractions. And ksqlDB essentially provides some really good strong abstractions, both in the stream processing sense, but also in the data integration sense. So from the stream processing sense, ksql is a great way of abstracting out things like streams and tables and joins. And from the data integration sense, it's a great level of abstraction to achieve some of the data integration that's possible with Kafka Connect. Essentially, getting data in and out of Kafka.

Simon Aubury:
So I got really excited by the opportunity to use all of, essentially the good parts of the Kafka ecosystem, but make it a little more accessible with some declarative interfaces. So I think that's why I got so excited by the potential of using SQL in combination with essentially the good building blocks that were already there in the Kafka ecosystem.

Tim Berglund:
Yes. And by the way, confirmed August 20th, 2017 was the initial announcement. So what were you doing at that time? You're pointing out that it's a way to do stream processing a little bit more easily, right? You don't have to sit down and write Java code. It's a language that we all, it's a declarative language and a language that basically we all know. There was the connect integration. So what were you doing at the time and what got your interest? What was the first thing you built?

Simon Aubury:
Yeah, yeah. So this wasn't just an artificial interest. At the time, I was actually working for a large sort of multinational insurance company. And at that time, we were dealing with onboarding a lot of data. And you can imagine the kind of data that an insurance company deals with: it's customer information from a CRM system, it's payments, it's policy information. But it's also data about weather events and property and landmass and cars and vehicles. So a lot of data from disparate systems.

Simon Aubury:
So I'm not going to pretend it was easy getting that data into our Kafka cluster. But what was a bit of a challenge was adding value to it by joining and matching and filtering and aggregating that data, because we only had so many developers in our enterprise. So this became a bit of a scaling factor. We had all of the ambition to do stuff. And one of the limiting factors we had was, actually how many developers we had at keyboards.

Tim Berglund:
Okay. So developer productivity is kind of the-

Simon Aubury:
Yeah.

Tim Berglund:
Forcing function.

Simon Aubury:
Exactly, not an infinite function. And I actually remember the moment that I discovered ksqlDB, which was actually during a fire evacuation. It was luckily just a drill, but as you can imagine, that's routine when working in an office: occasionally, they test the fire alarm systems. So it was a beautiful spring day. We were out in the middle of Sydney, as we had evacuated the building. And I was talking to some of my colleagues about how we had all of these use cases, but essentially we just didn't have enough programming folk to actually scale it.

Simon Aubury:
And someone actually saw on Twitter the announcements around ksqlDB. And so it was within that 10-minute fire drill that we got very, very excited by the possibilities, and very quickly then went back and started doing essentially a tech spike around the potential here. So, yes it was-

Tim Berglund:
That is memorable.

Simon Aubury:
Yeah, yeah. So that was a moment where I do remember what might be considered the catalyzing moment, where we actually think there is a capability. And what was really good off the back of that was we very quickly got moving on some exploratory use cases, some development use cases, and actually started seeing, hey, this stuff really does work. We do actually get the simplicity of moving forward with sort of a declarative mindset. But we're not throwing away any of the Kafka ecosystem and we're actually building on top of the strength of the Kafka Streams API. So it wasn't like we were sacrificing any of the availability or the distribution or the scalability we'd already invested in. We had just got that through essentially a superpower, which is a high-level abstraction of doing everything with SQL.

Tim Berglund:
Yeah. In effect, writing Kafka Streams applications, but with SQL, and having that scale and perform and everything, kind of like Kafka Streams applications.

Simon Aubury:
Yeah, and that's it. And I do like to think of this as essentially like a multiplier, because I don't know about you Tim, but I don't have infinite time in my day. So having something that makes it easier to get to an outcome by using a high level of abstraction where it's appropriate, just makes it more plausible to get more stuff done.

Tim Berglund:
Absolutely. Absolutely. That's my favorite Grady Booch quote. The history of software engineering is one of increasing levels of abstraction. Something like that. I'm paraphrasing, but I mean that's not untrue. We look back and-

Simon Aubury:
Absolutely.

Tim Berglund:
We're doing, you have basically the same amount of cognitive equipment in each new generation of software developers. We're not meaningfully better than the previous ones, but we need to get more done. And so we have these basically more useful tools, higher levels of abstraction as we go on. Now, you're working for an insurance company. At some point in there, you transitioned to Thoughtworks. And certainly, as a consultant, you don't always get to talk about the stuff you're doing. But you still like to teach, you give talks, you write blog posts. So how did you continue exploring? And maybe there's a tie-in to Snowy the cat.

Simon Aubury:
Oh, absolutely. Absolutely. So yeah, wearing the consultant hat gives us an opportunity to get involved in some really interesting companies, be they transport companies, e-commerce companies, or healthcare. But we can't always share this work. But one of the things I'm super excited about is sometimes to get other people inspired by technology, you just have to show them a self-contained use case. And I got quite passionate about exploring some of the Kafka Streams and ksqlDB opportunities in my own time. And I found one of the best ways of learning is essentially just putting together little demonstration projects. And one of the projects I was initially interested in with ksqlDB was the idea of taking a stream of information and then enriching it with some static data. This was quite a parallel to what I was doing in my day job.

Simon Aubury:
And I found a really useful stream of information, which is real-time updates from aircraft. So any commercial aircraft is constantly transmitting its location, its altitude, its direction. And this is a freely available stream of data, but it's quite coarse-grained. It's just a whole bunch of bytecodes and lats and longs. And the real use from a stream like that comes from enriching it with what the flight is, how many people are on the airplane, and who the flag carrier is for that aircraft body.

Simon Aubury:
So one of my first pet projects around ksql was to actually take a stream of location information and then enrich it with essentially some static data around aircraft type. And lo and behold, that was one of the first opportunities to essentially share some of the capabilities for ksqlDB, but also learn some things on the way. And that was essentially a flight tracking project, which was ultimately to work out which plane, first thing in the morning, was waking Snowy the cat. Because we knew that the cat was getting upset at six a.m. every morning, but we didn't quite know which plane it was. And it turns out it was an Airbus A380 that was flying off to Dubai every morning, and that was the-

Tim Berglund:
Oh. An A380, that'll get you out of bed in the morning.

Simon Aubury:
Exactly. Exactly. So, that is the plane that wakes Snowy the cat each morning.
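
The enrichment pattern Simon describes, a stream of position reports joined to static aircraft data, can be sketched in a few lines of Python. This is only an illustration of the idea, not ksqlDB code and not Simon's actual project: the record shapes, field names (`icao_hex`, `aircraft_type`), and sample values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, Optional


@dataclass(frozen=True)
class PositionEvent:
    """One position report from an aircraft (hypothetical shape)."""
    icao_hex: str      # identifier broadcast by the aircraft
    lat: float
    lon: float
    altitude_ft: int


@dataclass(frozen=True)
class EnrichedEvent:
    """A position report with the aircraft type attached."""
    icao_hex: str
    lat: float
    lon: float
    altitude_ft: int
    aircraft_type: Optional[str]  # None when the aircraft is not in the table


def enrich(events: Iterable[PositionEvent],
           aircraft_types: Dict[str, str]) -> Iterator[EnrichedEvent]:
    """Left-join each streamed event to the static table, one event at a time."""
    for e in events:
        yield EnrichedEvent(e.icao_hex, e.lat, e.lon, e.altitude_ft,
                            aircraft_types.get(e.icao_hex))


# Made-up sample data, for illustration only.
AIRCRAFT_TYPES = {"abc123": "Airbus A380", "def456": "Boeing 737"}
SAMPLE_EVENTS = [
    PositionEvent("abc123", -33.94, 151.18, 2500),
    PositionEvent("zzz999", -33.90, 151.20, 9000),
]

if __name__ == "__main__":
    for enriched in enrich(SAMPLE_EVENTS, AIRCRAFT_TYPES):
        print(enriched)
```

The generator processes events as they arrive rather than collecting them first, which is the property that makes this a stream-to-table join rather than a batch join. In ksqlDB the same intent would be declared as a join between a stream and a table instead of being written as a loop.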

Tim Berglund:
Well, that's [inaudible 00:15:20]. I remember the blog post well. I didn't remember the conclusion, but that was the airplane that did it. And there was, I mean, this is not ksql specifically, but you had a Raspberry Pi and some kind of radio hardware because that information needs to be received and demodulated.

Simon Aubury:
Exactly, exactly. I mean, no project can't be made better with the introduction of a Raspberry Pi. And so yes, to justify some random internet purchasing, a Raspberry Pi was part of it.

Tim Berglund:
So right. It's always better with a Pi. I think you've been an inspiration to a lot of people. Clearly, you weren't the first person to think of it, but we have at least one Pi project going on right now. Robin Moffatt has done them with various kinds of data, sort of lying around his house. And Danica Fine, who's a new developer advocate on my team, is working on a plant watering and data acquisition system. So not exactly high-bandwidth data, but still definitely streaming data. And if there's a Raspberry Pi, nobody cares. It's just fun.

Simon Aubury:
Absolutely. Absolutely. And if nothing else, I'm always inspired by other people's projects. And definitely, Robin is a great motivator for some of the incredible opportunities out there. It's both inspirational and educational just to see what's possible. And again, this is a great demonstration of, we've got this great idea and with a few lines of supporting code and a bit of ksql support, you can achieve some pretty impressive results fairly quickly.

Tim Berglund:
It's pretty cool stuff. So I know specifics aren't a thing, that you can't talk about specific projects. But what kinds of places have you seen ksqlDB land in the wild and where has it pushed you? Where has it seemed like a good fit, not a good fit? If you take us into anti-patterns and places where ksql doesn't work, this is where it does.

Simon Aubury:
Yeah.

Tim Berglund:
Kind of take us into that.

Simon Aubury:
Yeah, sure. So I think a lot of the styles of streaming work will not be all that surprising. But I've had the opportunity to use ksql for e-commerce checkout style interactions, essentially just being able to identify points in a digital channel customer flow where people are dropping out of a sales funnel. And that's quite interesting. In fact, if I can give you a bit of a diversion story, one of the first times I think we used ksqlDB in anger was actually sort of a firefighting incident where I was working at essentially an e-commerce platform, where we were releasing a new capability that essentially was a big bang: turn a big switch and suddenly have a whole bunch of customers using a new sales funnel.

Tim Berglund:
That's [inaudible 00:18:23].

Simon Aubury:
Yeah. I mean, it was all Kafka-based. We'd actually built a lot of capabilities. We had, as you can imagine, a lot of confidence that it was going to grow and scale because we were using best practices around building it on distributed streaming technology. And then we turned the big switch and most customers were flowing through and we were getting some great essentially sales off the back of this.

Simon Aubury:
But we did notice that maybe 5% of our customers weren't actually going all the way through. And our observability platform was indicating that there was something wrong with the identification and payment part of this flow. And because this was Kafka and ksql, we could quickly write some queries and kind of determine that the customers who were failing were customers who had essentially identified themselves with email addresses. But we had made the assumption that all email addresses were lowercase. The IDP, we had assumed, was always going to send us lowercase emails. But it turns out that this particular IDP service allowed people to be quite creative with their use of capitalization. And it turns out that when you discover the-

Tim Berglund:
IDP feed doesn't require them to be lowercase.

Simon Aubury:
No. Absolutely not. Yes, as you can imagine, the next day there were a lot of conversations about looking back at the email spec, but at the time, that was a bit of a diversion. It was very much that the first question is, how many people exhibit this behavior? And the second question is obviously, what can we do about it? And it was really great to have just the ksql interface at our disposal to answer the first question of what proportion of customers have non-lowercase characters in their identity string.

Simon Aubury:
That's obviously a very straightforward thing to just quickly query and, in our case, it was a percentage: not a huge percentage, but a reasonable percentage of customers who exhibited this characteristic. And for the second part, what can we do about it, obviously this is a live system and we need to do something about it quickly. The idea was setting up a stream where we've just lowercased the identity string. Again, that was a very straightforward thing to do in the moment.
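
The two steps in this story, measuring how many identity strings contain uppercase characters and then deriving a lowercased copy of the data, can be sketched in Python as follows. This is a stand-in for the logic only, under the assumption that identities are plain strings; the function names and sample addresses are invented for illustration, and the team's actual ksqlDB queries are not shown in the episode.

```python
from typing import Iterable, Iterator, List


def has_uppercase(identity: str) -> bool:
    """True when lowercasing would change the identity string."""
    return identity != identity.lower()


def share_with_uppercase(identities: List[str]) -> float:
    """Fraction of identity strings that contain at least one uppercase character."""
    if not identities:
        return 0.0
    return sum(has_uppercase(i) for i in identities) / len(identities)


def normalized(identities: Iterable[str]) -> Iterator[str]:
    """Derived stream in which every identity string is lowercased."""
    for identity in identities:
        yield identity.lower()


# Made-up sample identities, for illustration only.
SAMPLE_IDENTITIES = [
    "alice@example.com",
    "Bob@Example.com",
    "carol@example.com",
    "DAVE@example.com",
]

if __name__ == "__main__":
    print(share_with_uppercase(SAMPLE_IDENTITIES))
    print(list(normalized(SAMPLE_IDENTITIES)))
```

The first function answers the "how many people exhibit this behavior" question; the generator corresponds to the quick fix, a derived stream that downstream consumers read instead of the raw one.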

Simon Aubury:
And yeah, it was just a sort of a small example of both the data exploration you can do and a quick bit of development. I wouldn't suggest that this is a typical practice, but when you need to do things quickly. And sometimes you've got pressure to do things quickly because you've got a production system that you actually need to resuscitate. Having that sort of pace and scale and flex, think about all the tools at your disposal. And if ksql is one of those tools, it's great to have that as an option.

Tim Berglund:
Nice. Now, anybody who says there's a tool or architectural paradigm or anything like that, that's always applicable, is, I think, not being straightforward. So, where have you seen ksql feel like it just wasn't a good fit?

Simon Aubury:
Yeah. Yeah. Yeah, absolutely. You shouldn't force-fit everything to an extreme. I think sometimes, the language around SQL might attract the wrong association. I've worked alongside analysts who hear SQL, they think Power BI and they immediately ask, "How do I generate a graph using ksqlDB and show it, render it in a webpage?" And of course, there's a level of education about this tool that fits this outcome. But yes, ksqlDB will not make your BI and analytics community happy because it's not a tool for them.

Simon Aubury:
I think another place where ksqlDB might not be a direct answer is companies moving from traditional RDBMSs and thinking about, well, how are we going to take a million lines of PL/SQL code or T-SQL code in a stored procedure? Sometimes there's a quick compare and contrast of, oh, ksql is streaming. It's extensible, it's got UDFs. And we've got a heritage of doing things in stored procedures. That might be another watchpoint of, don't go down this path. Just because you can, doesn't mean you should. So ksql-

Tim Berglund:
And that would be a dream if you're doing imperative programming. You have a stream processing API in the form of Kafka Streams, and everything's going to be fine there. It's not SQL, but then that kind of stored procedure life gets you away from the, oh, it'll be declarative. And it's just code, it's just in a different place.

Simon Aubury:
Yep. Yep. 100%.

Tim Berglund:
Different places with worse tooling and all that. But anyway, go on.

Simon Aubury:
Yeah, yeah. But I think we should always be supportive of multidisciplinary teams who are using the right and the most appropriate tools. Sometimes that's teams who are well-versed in Python or Kafka Streams or ksql. And I think just having this mindset of picking the fastest or the most appropriate tool and using that to solve the problem, I love that ksql is within that toolbox. So at the end of the day, if you've got to query something, or join something, or push and pull data through Kafka Connect, I just like that ksqlDB is a shortcut for achieving those outcomes, not to the exclusion of everything else, but it's just great to have that tool in the toolbox.

Tim Berglund:
Yeah. Last question. Deployment, maintenance, kind of classical DB ops. You've got some thoughts in this area?

Simon Aubury:
Oh, absolutely. So although ksql is great for this sort of interactive mindset about being able to explore data, to construct a query, construct a stream, we can't forget the important software engineering practices around testing and deployment and capacity planning. None of that is excluded just because you've got great tools at your fingertips. So ksql has some great test runners. I'd always encourage this sort of mindset around wrapping deployment code with test cases.

Simon Aubury:
And if this is sort of familiar to your organization, take a test-driven approach, where you define the test case and then you build the code so it makes your build go green. You can actually embrace this kind of development mindset with ksqlDB. That's something that's great to establish from day one. And of course, capacity planning is super important. So think through your fail-over points and the burst capacity. That's a consideration for the Kafka Streams API, do you have enough capacity there, but also for some of the fail-over and load balancing that you might want to consider. We'd like to think that you'll be on your merry way with being able to support production systems.
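
The test-first loop mentioned here (define the test case, then write the code that makes it pass) can be illustrated with a small Python sketch built around input records and expected output records. This is a generic illustration rather than ksqlDB's own test tooling; the `run_case` helper, the record fields, and the sample values are hypothetical.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]


def run_case(transform: Callable[[Record], Record],
             inputs: List[Record],
             expected: List[Record]) -> bool:
    """Feed input records through the transform and compare with the expected output."""
    actual = [transform(record) for record in inputs]
    return actual == expected


# Step 1: the test case, written before the transformation exists.
INPUTS = [{"customer": "Bob@Example.com", "step": "payment"}]
EXPECTED = [{"customer": "bob@example.com", "step": "payment"}]


# Step 2: the smallest transformation that makes the case pass.
def lowercase_customer(record: Record) -> Record:
    """Return a copy of the record with the customer identity lowercased."""
    return {**record, "customer": record["customer"].lower()}


if __name__ == "__main__":
    print("green" if run_case(lowercase_customer, INPUTS, EXPECTED) else "red")
```

Keeping the inputs and expected outputs as plain data, separate from the transformation, means the same case can be re-run unchanged whenever the transformation is edited or redeployed.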

Simon Aubury:
But at the end of the day, things do go wrong. Don't have a single point of failure with your schema registry. You shouldn't have a single point of failure with your ksqlDB server. So always thinking about both the happy path of what can go right, but planning for the worst in what can go wrong. Both from a recoverability perspective, a fail-over perspective, and a capacity planning perspective.

Tim Berglund:
What's next for you in your... This is a bonus extra question. I just liked that answer so much, I just want to talk a little bit more. What's next for you in the world of ksqlDB? Are you planning anything new looking forward, like wishing it had a feature? And this is your time to talk.

Simon Aubury:
Yeah. I think there's a couple of things that super excite me with ksqlDB. Firstly, just to see the support and the education that's coming along. ksqlDB has always had great written documentation, but it's great to see the community getting together and writing demonstration projects. I know Michael Drogalis is putting together some fantastic video tutorials, just to sort of open the funnel and help people onboard with ksqlDB.

Simon Aubury:
And from almost like, if Simon had his magic wand and could request a beautiful future: I love that ksqlDB is accessible from a CLI and a REST API standpoint, but I think there's room to grow in integrating with other programming languages. So for the listeners, obviously ksqlDB has some great command-line interfaces, a REST API, and an initial, correct me if I'm wrong, I believe it's just a Java client library for interacting with the ksqlDB server.

Tim Berglund:
The only one I'm aware of.

Simon Aubury:
Yes. And I would say that an oft-requested feature from the teams I deal with is, we're a Python community or a .NET community. And we'd like something other than a REST API. So it's good, but moving to greatness would be embracing additional languages and additional constructs like that.

Tim Berglund:
You've said it on air. Hopefully, that gives us momentum.

Simon Aubury:
Anyway, it's always good. I mean, if you're grading students' work, you want to give them an eight out of 10. So they know that they can move into a 10 out of 10.

Tim Berglund:
Indeed. My guest today has been Simon Aubury. Simon, thanks for being a part of Streaming Audio.

Simon Aubury:
Tim, thanks so much for having me today. I really enjoyed the chat.

Tim Berglund:
And there you have it. Thanks for listening to this episode. Now, some important details before you go. Streaming Audio is brought to you by Confluent Developer, that's developer.confluent.io, a website dedicated to helping you learn Kafka, Confluent, and everything in the broader event streaming ecosystem. We've got free video courses, a library of event-driven architecture design patterns, executable tutorials covering ksqlDB, Kafka Streams, and core Kafka APIs. There's even an index of episodes of this podcast. So if you take a course on Confluent Developer, you'll have the chance to use Confluent Cloud. When you sign up, use the code PODCAST100 to get an extra $100 of free Confluent Cloud usage.

Tim Berglund:
Anyway, as always, I hope this podcast was helpful to you. If you want to discuss it or ask a question, you can always reach out to me at TL Berglund on Twitter. That's T-L B-E-R-G-L-U-N-D. Or you can leave a comment on the YouTube video if you're watching and not just listening, or reach out in our community Slack or forum. Both are linked in the show notes. And while you're at it, please subscribe to our YouTube channel, and to this podcast, wherever fine podcasts are sold. And if you subscribe through Apple Podcasts, be sure to leave us a review there. That helps other people discover us, which we think is a good thing. So thanks for your support, and we'll see you next time.