Streaming Audio: Apache Kafka® & Real-Time Data

What Could Go Wrong with a Kafka JDBC Connector?

August 04, 2022 | Confluent, founded by the original creators of Apache Kafka® | Season 1 Episode 227

Java Database Connectivity (JDBC) is the Java API used to connect to a database. The JDBC connector is one of the most popular Kafka connectors, so it's important to prevent issues in your integrations.

In this episode, we'll cover how a JDBC connection works and the common issues that can arise with your database connection.

Why the Kafka JDBC Connector?

When it comes to streaming database events into Apache Kafka®, the JDBC connector is usually the first choice for its flexibility and its ability to support a wide variety of databases without requiring custom code. As an experienced data analyst, Francesco Tisiot (Senior Developer Advocate, Aiven) delves into his experience of building streaming data pipelines with the JDBC source connector and explains what could go wrong. He discusses alternative options available to avoid these problems, including the Debezium source connector for real-time change data capture.

The JDBC connector is a Kafka Connect connector that uses the JDBC API to stream data between databases and Kafka. If you want to stream data from a relational database into Kafka once per day or every two hours, the JDBC connector is a simple, batch-style connector to use. You tell the JDBC connector which query you'd like to execute against the database, and the connector takes the resulting data into Kafka.
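
To make that concrete, here is a minimal sketch (not taken from the episode) of registering a JDBC source connector through the Kafka Connect REST API. The Connect worker URL, database connection details, table, and topic prefix are placeholder assumptions; the property names follow Confluent's JDBC source connector.

    # Minimal sketch: register a JDBC source connector via the Kafka Connect REST API.
    # The worker URL, connection details, table, and topic prefix are placeholders.
    import requests

    CONNECT_URL = "http://localhost:8083"  # hypothetical Kafka Connect worker

    jdbc_source = {
        "name": "jdbc-orders-source",
        "config": {
            "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
            "connection.url": "jdbc:postgresql://db-host:5432/shop",
            "connection.user": "kafka_connect",
            "connection.password": "********",
            # How the connector decides which rows are new since the last poll.
            "mode": "timestamp+incrementing",
            "incrementing.column.name": "id",
            "timestamp.column.name": "updated_at",
            # Batch-style polling: anything that happens between polls is invisible.
            "poll.interval.ms": "30000",
            "table.whitelist": "orders",
            "topic.prefix": "pg_",
        },
    }

    resp = requests.post(f"{CONNECT_URL}/connectors", json=jdbc_source, timeout=10)
    resp.raise_for_status()
    print(resp.json())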

The connector works well with basic, out-of-the-box data types. However, when it comes to database-specific data types, such as geometry or array columns in PostgreSQL, the JDBC connector doesn't represent them well. You might end up with no results in Kafka simply because the column type isn't supported by the connector. Francesco shares other cases where the JDBC connector can go wrong, such as:

  • Infrequent snapshot times
  • Out-of-order events
  • Non-incremental sequences
  • Hard deletes

To help avoid these problems and set up a reliable source of events for your real-time streaming pipeline, Francesco suggests other approaches, such as the Debezium source connector for real-time change data capture. The Debezium connector provides enhanced metadata, timestamps of each operation, access to the full change log, and sequence numbers, letting you speak the language of a DBA.
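
For comparison, a similar sketch (again, not from the episode) of a Debezium PostgreSQL source connector, which reads changes from the write-ahead log instead of polling with queries. Hostnames, credentials, and the slot and publication names are placeholders; the topic.prefix property is called database.server.name in older Debezium releases.

    # Sketch: register a Debezium PostgreSQL source connector for change data capture.
    # Hostnames, credentials, slot and publication names are placeholders;
    # "topic.prefix" is the newer name for what older releases call "database.server.name".
    import requests

    CONNECT_URL = "http://localhost:8083"  # hypothetical Kafka Connect worker

    debezium_source = {
        "name": "debezium-orders-source",
        "config": {
            "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
            "database.hostname": "db-host",
            "database.port": "5432",
            "database.user": "cdc_user",
            "database.password": "********",
            "database.dbname": "shop",
            # Stream changes from the write-ahead log via logical decoding,
            # rather than re-querying the table on a schedule.
            "plugin.name": "pgoutput",
            "slot.name": "debezium_orders",
            "publication.name": "dbz_orders",
            "table.include.list": "public.orders",
            "topic.prefix": "shop",
        },
    }

    resp = requests.post(f"{CONNECT_URL}/connectors", json=debezium_source, timeout=10)
    resp.raise_for_status()
    print(resp.json())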

They also talk about a governance tool that Francesco has been building, and how streaming Game of Thrones sentiment analysis with Kafka led to his current role as a developer advocate.


Kris Jenkins: (00:00)
Hello, you're listening to the Streaming Audio Podcast. And today's discussion is primarily about connecting Apache Kafka into other databases. I asked Francesco Tisiot to come on and tell us about using JDBC connectors, what they're supposed to do, how they're supposed to work, and where they can go wrong. They can lose data if they don't have quite the right model for capturing database changes. So we talk about that. We talk about using Debezium as a better solution for capturing a stream of database changes. And then we also managed to cover a governance tool he's been working on and how hacking on Game of Thrones data got him his current day job. He's an interesting guy, so I hope you enjoy it.

Kris Jenkins: (00:45)
Before we get started, let me tell you that Streaming Audio is brought to you by developer.confluent.io, which is our education site for Kafka. It's got free information about how to build and maintain successful event systems. It's also got plenty of hands-on courses you can take and learn at your own pace. If you take one of those courses, you're going to need a Kafka instance. For that, head to Confluent Cloud. You can get a cluster up and running in minutes, and it will scale all the way up to production sizes. Just add the promo code PODCAST100 to your account and you'll get $100 of extra free credit to use. And with that, let's figure out how to make Kafka play nicely with Postgres, MySQL, Mongo, and more.

Kris Jenkins: (01:34)
My guest today is Francesco Tisiot. Francesco, welcome to the show.

Francesco Tisiot: (01:37)
Thank you very much. I'm really happy to be here with you today.

Kris Jenkins: (01:41)
Good to have you. You're a developer advocate at Aiven or Aiven, Aiven?

Francesco Tisiot: (01:47)
Aiven.

Kris Jenkins: (01:48)
Aiven. Pronunciation is difficult, especially in this day of made-up words on the web. So what's that like? Give me a quick look into your background.

Francesco Tisiot: (02:00)
Okay. First of all, I'm lucky because the pronunciation of Aiven is the same in the Nordics and in Italy, making my life really easy, so that's the thing. What we do is basically open source data platforms as managed services on top of the major clouds. We do Apache Kafka, as you may know. We do Postgres, Apache Flink, MySQL, and OpenSearch. So we take some of the open source products and create a cloud version of them. Well, it's not a cloud version, it's the same software running in the cloud as a managed service. What we do on top of it, if I can add, is we try to give back to the community. So we have an open source team, which is dedicated to building open source products with the community. We have dedicated people working on Kafka, and we have dedicated people working on OpenSearch. We, of course, create our own managed solution, but we try to give back to the communities whose software we take. So it's a win-win situation for both parties.

Kris Jenkins: (03:13)
Do you get involved in that? Where's your role in that stuff?

Francesco Tisiot: (03:19)
Okay, that's a good question and a tricky question at the same time. I'm a software engineer by the paper that I got at home. I write some code, but I don't write a lot of code. With open source, a lot of people think that you can only contribute with code, but I don't think that's true. I believe you can contribute in many different ways. One thing that I enjoy doing, and can do, is, for example, writing about technology. This can take different shapes. For example, you can write about a feature in documentation, or, what I do most of the time, you can write a blog post to spread the knowledge that the feature is available.

Francesco Tisiot: (04:05)
Or the other bit is, I see DevRel as basically trying to reduce the gap between not knowing about the thing and understanding the coolness of the thing. So if you create, for example, a notebook that allows people to start with Python and Kafka, well, you are doing a service to the community and also to the open source tool, in this case, Kafka. I believe I'm more at the stage of not being able to contribute directly to Kafka itself, because I don't have that knowledge, but trying to grow the audience and make it more accessible to people.

Kris Jenkins: (04:47)
Okay, that's interesting. A colleague of mine sees the job as being an enabler of technology.

Francesco Tisiot: (04:54)
Exactly.

Kris Jenkins: (04:55)
Yeah, I can see the parallels there. But before you got into that life, you must have had a large background in databases, and you did a lot of BI stuff. Take us along that journey.

Francesco Tisiot: (05:09)
Okay, and that was a long journey. If you allow me the parallel, it's basically the same long journey that, for a long time, made me wait until the day after for the data from the day before. I was the person doing the BI system. I was responsible for a set of dashboards, which were relying on one or more ETLs that were running overnight. So I was always coming to the office, or connecting to the office in the morning, hoping that the ETL of the previous night had worked successfully, and always having to wait a couple of hours for the previous day's data before being able to analyze the results.

Kris Jenkins: (05:55)
You're the babysitter of the overnight batch, right?

Francesco Tisiot: (05:58)
Well, I wasn't responsible, but my work was dependent on that. I was doing the data analysis, but I was always late. So now, when you see the streaming world coming up, well, that's where my brain started... I believe my brain actually started smiling, because I understood that you could analyze the data in real time, you could create real-time views or real-time SQL that could provide business logic on top of the raw data and create streaming things. This is when I started understanding, okay, I can make an immediate impact. I don't have to work on yesterday's data, I don't need to accept a consistent delay in the data. I can do everything as soon as it happens.

Francesco Tisiot: (06:48)
This is where I started blogging about Kafka. I started blogging about two of my interests, which were Kafka and Game of Thrones, trying to do sentiment analysis on the tweets coming from Twitter via Kafka. I believe at the beginning I did the sentiment analysis with R, then I also did something using, I believe, an automated service from Google that scored the tweets automatically, and I was building dashboards on top of it. My old passion is taking a problem, which in the case of Game of Thrones wasn't really a problem, it was a problem that I created myself, but still, it's a problem.

Kris Jenkins: (07:38)
Those are some of the most fun ones, right?

Francesco Tisiot: (07:39)
Yeah, and trying to come up with a technological solution to solve the problem. For me, literally, Game of Thrones and Kafka changed my life because now I'm a developer advocate at Aiven because Aiven found me after they read one of my blog posts. So it's an interesting world if you are willing to put a little bit of yourself out there.

Kris Jenkins: (08:05)
Yeah, that's true. Before we move on from that, you have to tell me what the results were. What sentiment analysis did you get out of Game of Thrones?

Francesco Tisiot: (08:14)
It was interesting and complex at the same time, because I was using a standard dictionary that didn't work well with the overall theme of Game of Thrones. Because if you write a sentence like "someone killed someone else", that, of course, gets a standard score as a negative sentence. But in Game of Thrones, because it was all about this kind of war between people, well, it's hard to judge if that is a positive or negative sentence.

Francesco Tisiot: (08:50)
But there were some interesting trends, specifically if you were not analyzing only one character at a time but started analyzing a couple of characters at a time. I can't remember the names now, but you could clearly see that a combination of characters had a very positive sentiment associated with them, because in a particular episode they were, if I remember well, escaping from a cruel guy, and this cruel guy had an extremely negative sentiment. So you could see some patterns in the data matching what the show was telling.

Kris Jenkins: (09:29)
That's interesting. It's going to be a while before that becomes a part of Hollywood scriptwriting classes, but maybe one day.

Francesco Tisiot: (09:39)
Yeah, possibly. It's something to think about in the future. You see a lot now of football teams doing real-time analysis on all the biometrics of players, I believe. Is that a world that we will also see in Hollywood movies? Maybe someone will be able to forecast the impact of a certain sentence in a movie, or reshape the sentence to have a better impact. I believe now with data, the luxury you have is that you can analyze and possibly forecast everything, but you can also overanalyze and over-forecast everything. So you have the benefit and the risk with everything.

Kris Jenkins: (10:20)
Yeah, yeah. Maybe I'll go with the perspective that more insight is better, and then hopefully we use it well, right?

Francesco Tisiot: (10:29)
Yeah, exactly. I agree with that.

Kris Jenkins: (10:32)
So from that, you get into working at Aiven. Aiven, got to say that right. My grandfather's Welsh and his name was Ivan, and now that's stuck in my head. So you're living in this world where you are making lots of databases talk to each other, right?

Francesco Tisiot: (10:52)
Yeah, so I believe Kafka, as we all know, is most of the time just a middle layer. What do you do with Kafka? You take the data from point A to point B. And if you take this kind of assumption into the real world, point A or point B most of the time is a database because people have been using databases for the past 30 years, or even more, and they will probably use databases for the next 500 years or even more. So the fact that Kafka needs to talk with a database is something that we just need to accept.

Francesco Tisiot: (11:34)
What I've been trying and checking is how you can integrate Kafka with the database. And specifically, given my history of coming from this kind of ETL mindset, I started digging into what is the default and easiest way of integrating Kafka with a database, which is using the JDBC connector. I mean, I've been using JDBC to extract data from or otherwise play with databases all my career, in one way or the other.

Kris Jenkins: (12:12)
Yeah, absolutely. So if you want to connect to a relational database from a Java-based process, it's obviously your first go-to tool, right?

Francesco Tisiot: (12:21)
Yeah. And it works, it works in some sense. It works if you approach the database, from the Kafka point of view, still with a batch mindset. So if, for example, you have a source database and you want to take the data from the database into Kafka, but you want to do that once per day or every two hours, well, the JDBC connector does exactly what ETL tools have been doing for a long time. You tell the JDBC connector, more or less, which query you want to execute against the database, and the JDBC connector will take that data into Kafka. Problem solved.

Francesco Tisiot: (13:13)
The problem, and that's why I also gave this talk at Kafka Summit, is when you start trying to use a solution that was originally aimed at solving a batch problem in the streaming world. Instead of querying the database with "okay, tell me yesterday's data", where I ask the question at 1:00 or 2:00 AM because I want to be sure that yesterday's data has all landed in the database, which is the perfect batch problem, now, with streaming, I'm asking the database to tell me what was happening five seconds ago, and tell me what's happening now. And when you don't have this time to be sure that the view you have of the database is somehow static, little problems can start arising in the connector between the database and Kafka.

Kris Jenkins: (14:09)
Because there is a tendency with data for it to sort of stabilize after a few hours, right?

Francesco Tisiot: (14:15)
Yeah. Well, that was always my experience. What we were told when we were extracting the data was, okay, you want yesterday's data, but you cannot ask for yesterday's data at one second after midnight, because you will always have some latency somewhere. So you need to wait a couple of hours. You need to wait for all the processes running around to finish, and then you can extract the data. That is, I believe, the standard way. But that doesn't work if you cannot wait that time, or if you are looking at five seconds ago, ten seconds ago. If you're shortening the timeframe between when an event happens and when you want to know about it, which is the concept of streaming basically, then this kind of batch-oriented solution, which is the JDBC approach, has problems.

Francesco Tisiot: (15:13)
The worst part is that from a configuration point of view, everything works great. So you spend your time, because I've been in those shoes, you need to spend a little bit of time trying to understand how to configure the JDBC source connector and how to set all the parameters. The query mode is a very important parameter that tells the connector how to fetch which rows are new compared to the previous poll.

Francesco Tisiot: (15:42)
When you set all these kinds of things, at a certain point you have the connector up and running, and you start having a flow of events happening in the database and the same flow of events happening in Kafka. But then at a certain point, without changing anything, with your connector always running perfectly, you start seeing some differences between the two worlds. It's possible that even if the connector is working perfectly, you see some changes happening in the database which are not reflected in Kafka. And this is because, again, we are using a suboptimal method to access the data.

Francesco Tisiot: (16:28)
One clear example is possibly the polling time. With the JDBC connector, we have to set a polling time, which dictates how often we check for new data, right? So you have, let's say, 30 seconds of polling time. What happens if in the database you have an insert for a particular key, let's say we insert a row for something happening in the city of Milan, and then we delete the same row within the same polling interval? What we do is a query before and a query after. If the change is too fast, we will not be able to track it. So this is where trying to approach the problem of moving all the changes with a query-based mode starts showing its limits: when you have events happening too fast, events that you cannot control.
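
A toy illustration of that failure mode, in plain Python with no real database involved: a query-based connector effectively compares the snapshots it sees at each poll, so a row inserted and deleted within one polling interval never shows up.

    # Toy illustration: a query-based connector only sees the snapshots taken at
    # each poll, so an insert followed by a delete within one polling interval is
    # never captured. No real database or connector involved.
    table = {}                  # key -> row, standing in for the source table
    seen_by_connector = []

    def poll(snapshot):
        """Pretend poll: record whatever rows are visible right now."""
        seen_by_connector.append(dict(snapshot))

    poll(table)                                     # poll at t = 0s

    table["milan"] = {"city": "Milan", "temp": 31}  # insert at t = 5s
    del table["milan"]                              # hard delete at t = 12s

    poll(table)                                     # next poll at t = 30s

    # Both polls saw an empty table: the Milan row was never captured,
    # even though an insert and a delete both happened in between.
    print(seen_by_connector)                        # [{}, {}]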

Francesco Tisiot: (17:24)
For example, there are some advanced modes in the JDBC connector that allow you to track the new rows based on a serial number or based on a timestamp. What happens if those serial numbers or timestamps get out of sync? What happens if a serial number that you thought was always increasing suddenly is not always increasing, because you have a new process with extra logic that doesn't follow the old logic? In those cases, even if everything is perfect from the connector's point of view, and also perfect from the database's point of view, because there is no official inconsistency in the database, you still start seeing two different realities between the database and Kafka.

Kris Jenkins: (18:14)
Because it's based on this snapshotting periodically idea, which doesn't quite fit with streaming, right?

Francesco Tisiot: (18:22)
Exactly, that's the thing. If we want to take the changes of the database into Kafka, then we need to rethink our approach. We cannot query based on a polling interval; we need to look in the database for something that possibly looks like Kafka. And if you think about what Kafka is, Kafka at its base is, boom, a log. Can we find the same log also on the database side? Well, yes, because if you write something to a database, what the database usually does is: you send an insert to the database, the database receives the insert and writes it in its internal state. But then, just to be safe, it writes another copy to a log, just in case. It usually works like that with my mother when we cook, because I'm not a good chef, and my mom probably is not either. So when we receive an order, we make the order in the kitchen, but then we also write a note somewhere, just because, what happens if the kitchen catches fire? You always have the note, and you can go back to the note to redo all the orders that you had in your log.

Francesco Tisiot: (19:48)
This is the standard way that databases have been keeping the state, and recreating the state, with a lot of different techniques. In the talk, I used the case of Postgres. Postgres has a log called the WAL, the write-ahead log, where it writes down, for each operation, what the operation was. So in case the database state catches fire, or in case you need to create a read-only replica, you have this information and you can rebuild the state from a certain point in time.
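
A rough sketch of looking at that change stream directly, assuming a PostgreSQL instance with wal_level=logical and the psycopg2 driver; the DSN and slot name are placeholders. Debezium builds on the same logical decoding mechanism, just via the pgoutput plugin rather than the test_decoding plugin used here for readability.

    # Rough sketch: peek at Postgres's decoded WAL changes through logical decoding.
    # Assumes wal_level=logical and a reachable database; DSN and slot name are
    # placeholders. test_decoding just renders changes as text, which is handy
    # for poking around; Debezium uses pgoutput for the same underlying stream.
    import psycopg2

    conn = psycopg2.connect("dbname=shop user=cdc_user password=*** host=db-host")
    conn.autocommit = True
    cur = conn.cursor()

    cur.execute("SHOW wal_level;")
    print("wal_level =", cur.fetchone()[0])   # needs to be 'logical'

    # Create a logical replication slot that decodes WAL entries into text.
    cur.execute(
        "SELECT * FROM pg_create_logical_replication_slot('demo_slot', 'test_decoding');"
    )

    # ... run some INSERT / UPDATE / DELETE statements in another session ...

    # Peek at the decoded changes without consuming them from the slot.
    cur.execute(
        "SELECT lsn, data FROM pg_logical_slot_peek_changes('demo_slot', NULL, NULL);"
    )
    for lsn, data in cur.fetchall():
        print(lsn, data)   # e.g. table public.orders: INSERT: id[integer]:42 ...

    cur.execute("SELECT pg_drop_replication_slot('demo_slot');")   # clean up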

Kris Jenkins: (20:18)
Yeah, so it's capturing everything that changes. There is something inside the database naturally capturing a change stream, for replication and recovery?

Francesco Tisiot: (20:27)
Yeah, exactly. And that's kind of the interesting information that we want to grab, because that contains a specific line, a specific entry, for each change. It's not going to miss a thing. It's not going to miss a change, because it's the method the database uses to keep everything in sync. So it would be cool if we could take that information instead of querying the database over and over. And we have a solution for that, because there are the Debezium connectors. Debezium is an open-source project that allows you to do just that. It uses the best change data capture options that we have in the database to take the state from the database into Kafka, in a standard Kafka format. And it works with a variety of different databases: it works with Postgres, MySQL, and also with non-relational databases like MongoDB. So it uses the mechanism, which every database has in a slightly different form, to track the changes and import them into Kafka.

Kris Jenkins: (21:44)
Okay, because they almost all have something like that inside, they're all doing that kind of write-ahead log in some format or another. So the essence of Debezium is to tap into that database-specific format.

Francesco Tisiot: (21:59)
Yeah, I believe depending on the source database, some of them will have this kind of log that Debezium will read. Some others, I believe SQL Server, have this concept of a change data table. So when you apply change data capture to a SQL Server table, it creates a new schema with a new table containing all the changes happening to the source table. In that case, you are officially still querying a database, but you are querying a change log table, which you can be sure will not miss an event.
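
A hedged sketch of that SQL Server flavour, assuming change data capture has already been enabled on the database and on a dbo.orders table (so the capture instance is dbo_orders); the connection string is a placeholder and pyodbc is just one way to run the query.

    # Sketch: read a SQL Server CDC change table. Assumes CDC is already enabled
    # on the database and on dbo.orders (capture instance "dbo_orders"); the
    # connection string is a placeholder.
    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 18 for SQL Server};SERVER=db-host;DATABASE=shop;"
        "UID=cdc_reader;PWD=***;TrustServerCertificate=yes"
    )
    cur = conn.cursor()

    cur.execute("""
        DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn('dbo_orders');
        DECLARE @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();
        -- __$operation column: 1 = delete, 2 = insert, 3/4 = update (before/after image)
        SELECT *
        FROM cdc.fn_cdc_get_all_changes_dbo_orders(@from_lsn, @to_lsn, N'all');
    """)
    for row in cur.fetchall():
        print(row)   # every change to dbo.orders, in order, nothing missed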

Kris Jenkins: (22:41)
And is that an append-only change table?

Francesco Tisiot: (22:44)
Yeah. Yes, I believe so. Once you create these kinds of change log tables, you expect the tables to be append-only, sort of the same as a log, immutable and append-only. Then I believe you will...

Kris Jenkins: (23:01)
I-

Francesco Tisiot: (23:01)
Yeah, go on.

Kris Jenkins: (23:02)
Yeah. It makes sense for them to model that as an append-only table, right? Because logically they're very, very similar.

Francesco Tisiot: (23:10)
Yeah. If you do a change, so let's say you update a table or you insert a row, it's an event. If you delete a row, in the change log you are not going to delete the insert event, you are just creating a new event. It's the logic that we have all always been relying on in Kafka. Every event is immutable; if there is a change in the state of the source transaction or whatever, it's a new event for us.

Kris Jenkins: (23:42)
Yeah. It's funny when you first come to things like Kafka, you think this idea of having a log of events is a very new and very weird thing, but actually, a lot of databases are already doing this under the hood and have been doing so for years, right?

Francesco Tisiot: (23:59)
Yeah, exactly. What I see as the beauty of Kafka is that for the database, this was an extra bit that was necessary to keep consistency. For Kafka, it's the main bit it does: we just write one event at a time. And of course, then you have compacted topics and all these other bits and pieces that you create on top. But the basic reality is that Kafka is a log, and we want to have a way of replaying all the history of the log if needed. With Kafka, there has been a lot of talk about whether Kafka is the perfect solution for long-term storage. Maybe yes, maybe no. It gives you the opportunity, because it can act as the middle layer: you can take the data from the database through Kafka and on to other storage if needed, but there is nothing officially stopping you from keeping the data for a long, long time in Kafka as well.

Kris Jenkins: (25:05)
Yeah, yeah. And having an exact permanent record of everything that happened to your data, your audit log, right?

Francesco Tisiot: (25:11)
Yeah, exactly. And the other beauty that I find with Kafka is that I've been working with databases that required you to use a specific tool to write the data in and to read the data out. With Kafka, you are agnostic about who is the producer and who is the consumer. You can write into Kafka with one codebase and read from Kafka with another codebase, another tool. They don't need to talk to each other, they don't need to speak the same language. The only language that they need to agree on is the data format of the events.

Kris Jenkins: (25:49)
Yeah. And you've decoupled writing and reading quite nicely?

Francesco Tisiot: (25:53)
Yeah, coming from closed-down solutions into the open source world, this is also something that I find really eye-opening, because now you can select the best tool to solve a problem. Also, one thing that I've been talking about a lot, in this kind of middle land between databases, Kafka, and other technologies, is that, for example, you may have all your data in a certain database, but then one of your teams needs to do a text search. And let's say that your database doesn't offer any functionality for text search. So what you can do in that case is start your mission of kind of freeing the data: you put Kafka in the middle, you create a source connector on the database, pass the data into Kafka, and put the data into OpenSearch to allow some advanced text search.

Francesco Tisiot: (26:59)
Or on the other side you could say, "Well, no, I will make the team that needs to do the search upskill in my SQL language." So you're creating a barrier, and what I've seen in corporations over a long time is that the more barriers you create, the more you are basically forcing people to go around those barriers. So the other team will probably, at a certain point, do a select star from all the tables that they want, create a CSV export, and use their own tool to do whatever they want. Having Kafka in the middle, and managing Kafka, allows you to take control of what you are doing, and to take control of providing a consistent view of reality to a lot of different teams. So you are not forcing them to export a CSV, and who knows what happens to the CSV? Who knows who manages the CSV? We never saw any kind of company sharing customer data on an unprotected S3 bucket, that never happened, right?

Francesco Tisiot: (28:13)
If you basically force people to find alternative solutions, you open the door to a lot of possible problems. If, on the other side, you are managing a consistent flow of data, well, you are providing value to the team that is using their favorite tools to solve the problem, while at the same time ensuring that all the bits and pieces are secured and protected as you want.

Kris Jenkins: (28:38)
Yeah, yeah. I've definitely worked in banks where the database team really, really didn't want to share connections to the central database for security reasons, for the fact that they might accidentally write something they shouldn't, they might see something they shouldn't. But you're right, making them jump through hurdles to get around that is also a terrible mistake. And the solution is to find a good way to safely rebroadcast that data in a read-only format. And that's where Kafka comes in.

Francesco Tisiot: (29:10)
Exactly, yeah. So I've been in a similar space. I was designing centralized BI tools, and you couldn't evolve too fast, because you have the complexity of managing all these kinds of things, like the access of a certain person to a certain dataset. And if you were not evolving as fast as possible, as fast as the requirements, people were doing exactly the same thing: they were exporting the entire dataset into Excel and then going wild with Excel, which was exactly the same problem. So it's about being able to evolve, being able to provide the data, both accurate data and real-time data, so you also don't have the problem of working with stale data.

Francesco Tisiot: (30:01)
So you are creating these kinds of data flows. You can manage the data flows with streaming tools like KSQL or Apache Flink, and you can change the shape. So if you have different stakeholders that need to see only a portion of the data, only certain fields of the data, you can easily reshape, create different streams, and give this information to all the relevant people in streaming mode. And you have more or less one good, reliable central touchpoint, which is Apache Kafka, that allows you to shape all your different data patterns.

Kris Jenkins: (30:39)
Yeah. This sort of starts touching back on the idea of a data mesh where you're publishing your data as a first-class product with the governance of which fields are accessible, right?

Francesco Tisiot: (30:51)
Yeah, I believe so. One interesting thing to me is that the more I'm in this kind of data world, the more I care about the metadata, because I believe that is the interesting part, which is sometimes overlooked but is really something key that we should always look after. Because when you publish a dataset, the metadata describes to the outside world how that dataset should be used, how the fields make sense, and how a consumer should interpret those fields. Is a field optional? Should a field be considered numerical or not? Is that field an address, or a phone number, or whatever? The metadata is extremely valuable information, and we see with data mesh, and with all the explosion of data that we've had in recent years, that having accurate tracking of data is a key thing. I spent some time on this, and I also built an open source tool that allows you, for example, to track all your data lineage across different tools and map that as a set of nodes and edges. So you can see data lineage, and you can answer all kinds of queries related to data as data assets: who can see my code? Who can see my data? Which fields are accessible to which people? If I change a column here, what is the impact? All these kinds of complex queries, if you start tracking the whole journey of the data, become just a network query, more or less a graph query.

Kris Jenkins: (32:37)
Interesting. What's that tool called? That sounds interesting.

Francesco Tisiot: (32:41)
It's called metadata-parser. There is an open repository under Aiven, which is aiven/metadata-parser, go and check it out. It covers Kafka, so I'm analyzing the Kafka topics and any ACLs you have on top of the Kafka topics, but that's only Kafka. Then you can expand that to other systems: for example, you have a connector taking the data from a database into Kafka, and what the tool does is browse the configuration of the connector, find out which are the sources and which are the targets, and create nodes and edges between all of them. So you now have an automated way of creating a network of all your data assets.
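
Not the actual metadata-parser code, but a toy sketch of the underlying idea: once tables, connectors, topics, and consumers are nodes and data flow is a set of directed edges, lineage and impact questions become plain graph traversals. All names below are made up, and networkx is used purely for illustration.

    # Toy sketch of the lineage-graph idea (not the actual metadata-parser code):
    # tables, connectors, topics, and consumers become nodes, data flow becomes
    # directed edges, and lineage / impact questions become graph traversals.
    import networkx as nx

    g = nx.DiGraph()
    g.add_edge("pg:public.orders", "connect:jdbc-orders-source")
    g.add_edge("connect:jdbc-orders-source", "kafka:pg_orders")
    g.add_edge("kafka:pg_orders", "connect:opensearch-sink")
    g.add_edge("connect:opensearch-sink", "opensearch:orders-index")
    g.add_edge("kafka:pg_orders", "app:billing-consumer")

    # "If I change a column in public.orders, what is impacted downstream?"
    print(nx.descendants(g, "pg:public.orders"))

    # "Where does the data in the OpenSearch index come from?"
    print(nx.ancestors(g, "opensearch:orders-index"))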

Kris Jenkins: (33:31)
Interesting. And that's presumably a really good tool for things like GDPR, but also for change impact analysis.

Francesco Tisiot: (33:39)
Right, yeah. The basic thing is you have a lot of metadata, and you need to make sense of it. And as you said, GDPR reasons, who can see my data, data lineage, impact assessment. I also had people from the security team saying, "What if I create a new ACL? What is the impact of a new ACL?" You create a fake node, you run a bunch of scripts, and you have the reply. If you didn't have that, you had to go back to, "Okay, let's do an audit, let's plan the audit for next week. Let's create a clone environment of what we have and try to do the assessment." With metadata, well, it's all there.

Kris Jenkins: (34:27)
That's nice. I'm going to have to check that out. We'll put a link to that in the show notes definitely.

Francesco Tisiot: (34:34)
Yep, happy to do that.

Kris Jenkins: (34:36)
So before we wrap up, I do want to get back to Debezium because I just want to pick your brains on a couple more points, right? The first is it occurs to me it must be faster and more efficient, right?

Francesco Tisiot: (34:48)
Yes.

Kris Jenkins: (34:49)
Because you're picking up changes as they happen and you're only picking up what's changed.

Francesco Tisiot: (34:54)
Yes, exactly. So with Debezium, whatever your backend technology is, you are only picking up what's happening. This means that you are not going to miss any change, and you probably won't have problems with a change being replayed, or things like that which can happen with custom queries. Then, regarding whether it's fast, yes, it's faster. I believe there is a little note here, depending on which technology you are pointing at, because for example, for Postgres, basically the replication system of Postgres will ping Kafka with all the new changes, so it's almost immediate, right? For other technologies, where you are querying something like the change log table created on top of a table, it's still kind of a per-query basis. So I believe you are able to set the polling interval there, but still, even if you have this kind of minor delay, it's not like you are querying the source table over and over again with a custom query that you built. You are querying a change log table, so the impact of that type of query is much, much smaller, and you can run it much, much more frequently.

Kris Jenkins: (36:18)
Right. So it may be a push, it may be pull, but you're definitely just getting everything that's new?

Francesco Tisiot: (36:23)
Yes, exactly. Everything that is new. One thing: if you're trying the JDBC connector and you have hard deletes, check what happens with hard deletes in your Kafka topic. That might be a surprise that I want to leave for the people listening to this podcast.

Kris Jenkins: (36:41)
Yeah. So if you want one reason to go and check out Debezium, look at the way hard deletes are treated by the JDBC connector?

Francesco Tisiot: (36:49)
Yes, exactly.

Kris Jenkins: (36:51)
Have you got any tips for getting started with Debezium?

Francesco Tisiot: (36:56)
For getting started? So first of all, I would say the documentation is exceptional. There are tons of things to learn for whatever source database you are using. Again, those connectors are different depending on the source database. So if you're coming from Db2, or if you're coming from Postgres, the set of instructions will vary. I would start, depending on your source technology, with the Debezium documentation. That's the first place.

Francesco Tisiot: (37:28)
The second thing: I wrote a blog post about using Debezium with Postgres, in my case. Again, that gives you a little hint about which fields you're going to use and what discussions you're going to have with your DBA if you want to take the data out of Postgres. Because that's the other thing: when you are taking data out of the database, you will probably have to deal with the DBAs, who are the owners of the database itself. The cool thing about Debezium is that you are talking their language. You're using their tools in order to export the data to Kafka, and you are not creating a special case, a special type of query, just for Kafka. You're using exactly the same tools that they have been using to replicate the state to other instances of the same database type. So I would start there, with the documentation. Debezium also has some simple code that you can use to start creating your Kafka connector. Then check out the blog post. Then I believe the Kafka Summit recording might also be available sometime soon.

Kris Jenkins: (38:43)
I would suggest soon if they're not already, depending on the timing of podcast releases. Yep.

Francesco Tisiot: (38:49)
And I would suggest, well, check out the full session. It contains the code that I used to recreate the example for both JDBC and Debezium, and it's also a nice way to get into how Debezium works.

Kris Jenkins: (39:04)
Okay, cool. Well, you've given us two tools to check out and they both sound very interesting. We'll link to both of them in the show notes and to your YouTube talk when it comes up. And in the meantime, thank you very much for joining us, Francesco.

Francesco Tisiot: (39:18)
Thank you very much for having me. It was a pleasure speaking to you.

Kris Jenkins: (39:22)
Cheers, bye for now. And there we leave it. Now for me, that's useful information I can file away for my day job and I hope it is for you as well. But right now I've got a hankering to find a stream of tweets about TV shows and look for patterns, that'd be a fun side project. That probably means that Francesco's very good at his job as a DA because I feel informed and inspired. So thank you, Francesco. If you are similarly informed and inspired by today's episode, now's an excellent time to click like or thumbs up or leave a comment or review on your podcast app or whatever. We always appreciate hearing from you so please drop us a line. And if you want to get in touch about anything on the show, my Twitter handles are in the show notes, along with links to everything we've talked about.

Kris Jenkins: (40:07)
For more information on Kafka itself, head to developer.confluent.io, where you'll find guides on how to use Kafka and how to make it integrate well with other systems like Postgres, Mongo, and others. And when you need to get Kafka up and running, take a look at Confluent Cloud, which is our fully managed Apache Kafka service. You can get started in minutes, and if you add the promo code PODCAST100 to your account, you'll get $100 of extra free credit to use. And with that, it just remains for me to thank Francesco Tisiot for joining us and you for listening. I've been your host, Kris Jenkins, and I will catch you next time.

Intro
Game of Thrones Sentiment Analysis
Kafka Integration with JDBC Connector
JDBC Connector – Polling Time
Change Data Capture with Debezium
Manage Data Flows with ksqlDB
metadata-parser
Tips on Getting Started with Debezium
It's a wrap