Streaming Audio: a Confluent podcast about Apache Kafka
Kafka Streams In Action with Bill Bejeck
September 27, 2018 · Confluent, original creators of Apache Kafka®

Tim Berglund interviews Bill Bejeck about the Streams API and his new book, Kafka Streams In Action. 


Episode Transcript

Speaker 1:0:01Today we've got a fresh new episode of Streaming Audio, a podcast about Kafka, Confluent, and the cloud.

Speaker 2:0:16Welcome back to another edition of Streaming Audio.

Speaker 3:0:19I am your host, Tim Berglund. And in the studio with me today, the virtual studio of course, I have my friend and coworker Bill Bejeck. Bill, welcome. Thank you, great to be here. Awesome to have you here. Now, you just finished a book, Kafka Streams in Action. Yes. And I want to talk about that book, and talk about Kafka Streams a little bit. What was it... I'm also an author of a technical book, so tell me, what was it like doing this?

Speaker 4:0:47Pain and enjoyment all at the same time. Very time consuming, the most time-consuming thing I had ever taken on. I knew... I knew it would be, but I didn't know.

Speaker 5:0:58Yeah, I knew it would be time consuming, but it's one of those things where you don't know until you get into it, and you're just like, how can it take so long to write 200 words? What is wrong with my brain? Well, that, and you write a chapter and you feel reasonably comfortable with it, and then you go through the editing process, and then you go back, and you never realize.

Speaker 6:1:27How unclear you're being. You never realize the assumptions you're making, because the editor comes back and says, what do you mean by this, what do you mean by that? You know, I can infer, but you need to be explicit here. And just that back and forth... writing a rough draft of a chapter is time intensive, but then there's the back and forth, and that's really where the big bulk of the time came from for me.

Speaker 3:1:51And with that, there's code involved. You obviously know the underlying API pretty well, but you have to write code for a book like this to be useful, and that's pretty time consuming too. Yeah. Okay, tell us about the API itself; kind of walk us through it, whether you want to go through the outline or whatever you like. In your view, if there's a listener who's new to the Streams API, first of all, how do you begin explaining it? What is it, and then what are some high points? Just kind of talk to us.

Speaker 6:2:21Well, I guess we'll make the assumption that the person is familiar with Kafka itself, because to understand Kafka Streams you have to have a decent understanding of Kafka itself. It's a library. Kafka Streams is a library that runs on top of Kafka, not literally within Kafka itself. But if I take a step back: with Kafka, you have a producer, and you write data into Kafka. You've got data you want to get picked up; you're picking up event-driven data and writing those events into Kafka, into a topic. Then, to make use of that data, you'd have a consumer that's going to be reading from that topic, pulling those events off, and doing things with them. You might be processing that data and writing it back to yet another topic, or you might be transforming it and writing the data into Elasticsearch so people can search on those events.

Speaker 3:3:24Anything like that. And that consumer, I guess to set you up, doesn't do a lot for you, right? It's like, here are your messages, go do something with them.

Speaker 6:3:34Exactly. I mean, there is one cool thing about the consumer, and this is more a thing of Kafka: you've got the notion of a consumer group. So let's just say you've got a lot of records coming in and you want parallelism, you want to have multiple threads, right? And you have these consumers running in multiple threads; they're part of a consumer group. So if for some reason one of those consumer threads were to die, or something were to happen, you have what's called the rebalance protocol. The group coordinator would recognize, hey, one of these consumers dropped off, and contact the other two. Okay, I'm skipping over details here about partitions, but the work, or the partitions, that were assigned to the consumer that died now get assigned to the ones that are still working.

Speaker 3:4:34Sounds great, right? I mean, it's cool that that happens. Automatically, yeah.

Speaker 6:4:38So you get that, but no, you're right, beyond that you're doing all the work yourself. You set up the consumer and you get your records. Basically it's like, here you go, here are your records, and you're kind of on your own there.
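[Editor's note: a minimal sketch of the "you're on your own" plain-consumer loop being described here; topic and group names are illustrative.]

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class PlainConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "event-processors"); // membership in this group drives rebalancing
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events"));
            while (true) {
                // "Here you go, here are your records": everything past this point is your code
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.println(record.key() + " -> " + record.value());
                }
            }
        }
    }
}
```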

Speaker 3:4:49And then, that rebalance is awesome if you don't have any state in memory, right? If you don't know anything about the last message you got, and you get a partition taken from you and given to another consumer in the consumer group, that's fine, you have no state. But if you're stateful, life is kind of bad for you.

Speaker 6:5:09If you're using the consumer only, right. Exactly: stateful, stateful. So what Kafka Streams does is, it's an abstraction layer. Underneath, depending on how many threads you have, there's a consumer and a producer, but you just point it: when you configure your Kafka Streams application, and I'm speaking at a high level here, one of the key things you give it is the bootstrap broker list that it connects to. Then you create a KStream object off of a source. You give the StreamsBuilder a source: a topic, or multiple topics, or a regular expression to match patterns of topics, and you get back a KStream object. And from there you can do all sorts of operations, you know, map, filter. You can do stateful operations where you keep state, context, about the messages that are coming your way, like grouping, for example, if you want to group by the key or something. Exactly. And it even makes it easy to rekey. Let's just say your messages are coming in and they aren't keyed: you're getting customer data, but it's coming off of a feed somewhere that has the customer purchase data in one object, and you need to extract the customer ID and you want to make that the key. It makes it easy to rekey that record, and then, like you said, you can group by key, and then you can do counts, aggregations, reducing on that.
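[Editor's note: a minimal sketch of the source-filter-rekey flow Bill describes, using String values shaped as "customerId,item,amount" purely to keep the example self-contained; topic names are illustrative.]

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

StreamsBuilder builder = new StreamsBuilder();

// Source: a topic of unkeyed purchase events
KStream<String, String> purchases =
    builder.stream("purchases", Consumed.with(Serdes.String(), Serdes.String()));

purchases
    .filter((key, value) -> !value.isEmpty())             // a stateless operation
    .selectKey((key, value) -> value.split(",")[0])       // rekey by the customer ID
    .to("purchases-by-customer",
        Produced.with(Serdes.String(), Serdes.String())); // write back to Kafka
```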

Speaker 3:6:55And it looks, since this is a podcast we don't look at code, but it looks like, if you've used any collections API in the last five or ten years, or any functional language that has emerged in the last ten years, it's got all the usual suspects; you can name them off: there's map, there's reduce, there's filter. It feels like a functional paradigm, where you're no longer saying the API gives you a key-value pair that you imperatively do steps with, but you're saying, hey, this is a stream, go apply this function in this way to all these messages.

Speaker 6:7:35Exactly. And that's one of the key things I like about it: you get out of the imperative and into the more declarative. You're not saying how to do it, you're saying, do this for each record that gets streamed through. And what's really nice is the operators you have on a KStream, like map and flatMap: they expect single-abstract-method interfaces, so in the Java 8 world that makes it really easy to write these concise lambda functions that you pass in. You know, if anyone remembers pre-Java 8, you'd try and write an anonymous class or an implementation of an interface, and the boilerplate would just kind of explode, and it's not all useful. You know, it's a lot of curly braces.
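[Editor's note: the boilerplate contrast Bill is making, sketched against the Kafka Streams Predicate interface; `orders` stands in for any KStream<String, String> built as in the previous sketch.]

```java
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Predicate;

// Pre-Java 8: an anonymous implementation of the single-method interface
KStream<String, String> bigOrdersVerbose = orders.filter(
    new Predicate<String, String>() {
        @Override
        public boolean test(String key, String value) {
            return value.length() > 10;
        }
    });

// Java 8: the same predicate as a concise lambda
KStream<String, String> bigOrders =
    orders.filter((key, value) -> value.length() > 10);
```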

Speaker 3:8:29I mean, I straightforwardly, simply didn't do it. I would, like, write the code in Groovy or something; I just wouldn't do it in Java, because it was that bad.

Speaker 6:8:37Yeah. And that's what I really like. So, you talk about the functional flavor: functions are first-class objects, and you can just pass those in. So you're kind of taken out of it: you just have your event stream, you know what you want to do to each event, and that's what you start building up. You can build up a simple topology, or you can build up a pretty complex topology, and it doesn't require a lot of code. There's something we can get into later, KSQL, where you don't write any code at all, but I'm getting ahead of myself here. And that's what I like about it: it kind of lets you focus more on your business application logic. Yeah. You're not worried about the little details, right?

Speaker 3:9:27Also, with that functional paradigm where you're basically just passing in lambdas... so, like, in my role, sometimes I'll demo this in front of audiences, right? I have a Streams app that I live-code in front of people. By the way, if you've ever seen it: it takes some shortcuts, and you'll probably look at it like, oh, that's going to create inconsistent results sometimes, but it's not wrong, it's just simplified. Anyway, when you're doing the simple kind of stuff that I'll demo, those will always be just inline lambdas. Of course they don't have to be; you can have proper methods in classes of their own that you pass in. And the nice thing is that this all deals with, or at least can deal with, native types. If you've got this in cooperation with the Confluent Schema Registry, you can have, like, here is my IoT sensor POJO, or whatever this thing is that encapsulates all the stuff I get from my Kafka message, and the return types and the input parameters of these lambdas are all those native types. You can write tests that are kind of trivial. There's no stubbing to do; it's just, here's this thing, here's my type, and I move them in and out.

Speaker 7:10:41Yeah, that's actually the same thing. You mentioned testing, and what I really like is, depending on what you're doing... you know, if you're calling filter, you pass it a predicate. Okay, you could test that predicate; you could have a proper predicate class. But also, Kafka Streams has the notion of passing in processors or transformers, and those are passed in via suppliers. The nice thing there is that you can just use the lambda notation and have it create a new processor each time it's invoked. But that processor, like you said, can be a standalone class, and unit testing is easy because you just test the functionality within your processor. You don't have to worry about a lot of other things.

Speaker 3:11:32You don't have all this API hanging around that you have to mock or anything like that; it's just your thing, yeah.

Speaker 7:11:40And actually, to that end, there is a test driver. Because you want to individually unit test all your individual parts, but it is nice to then run it end to end and test to make sure everybody is playing nicely together. And there is a test driver framework that's nice, because you can give it the inputs you specify and then test against the expected output. But it doesn't require Kafka; it's all done in memory, which is nice because that makes the tests very fast to run.

Speaker 3:12:17What's that called? What would somebody Google if they wanted to get the docs on that?

Speaker 7:12:25The processor topology test driver, I believe. I could look that up.

Speaker 5:12:29Yeah. Like any programmer, what you do is you Google things; you don't remember things. That's right.

Speaker 3:12:35We've lost the ability to remember things. So what we'll do is we'll make sure to put that in the show notes, because that's a potentially super handy thing. When you're writing Kafka Streams apps you're being a Java programmer, and when you're being a Java programmer you have hopefully little excuse for not writing tests, because these frameworks are pretty rich and mature at this point. And one of the things I love about Streams, like I said, is that I don't feel like it's getting in my way, whether I want to test first or just want to have test coverage. Despite it being... you know, if you think of its operational life, it's got thread pools, and there's a producer and a consumer under the covers, and all this infrastructure stuff happening, but as an API it feels like it's trying to help me test my code, which is always a nice thing.
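[Editor's note: the utility discussed here ships today as TopologyTestDriver in the kafka-streams-test-utils artifact; a minimal sketch follows, using the current API rather than the class name as of this recording. No broker is involved.]

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.TestInputTopic;
import org.apache.kafka.streams.TestOutputTopic;
import org.apache.kafka.streams.TopologyTestDriver;

import java.util.Properties;

public class FilterTopologyTest {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("input-topic")
               .filter((key, value) -> value.length() > 3)
               .to("output-topic");

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "test-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "dummy:1234"); // never contacted
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // Drives records through the topology entirely in memory
        try (TopologyTestDriver driver = new TopologyTestDriver(builder.build(), props)) {
            TestInputTopic<String, String> in = driver.createInputTopic(
                "input-topic", new StringSerializer(), new StringSerializer());
            TestOutputTopic<String, String> out = driver.createOutputTopic(
                "output-topic", new StringDeserializer(), new StringDeserializer());

            in.pipeInput("a", "hi");    // filtered out
            in.pipeInput("b", "hello"); // passes the predicate
            System.out.println(out.readKeyValuesToList()); // [KeyValue(b, hello)]
        }
    }
}
```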

Speaker 7:13:32Yeah. No, it's really nice. Like I said before, it gets you out of the little details, and what I really like about Kafka Streams versus other frameworks is, A, you don't need a processing cluster; you can spin this up and just run it on your laptop. And it's all the Kafka protocol, if you will. There's a couple of points; see if I can make sense of this here. So, you're ingesting your data into Kafka; let's just say it's the center of the nervous system of your operation, where you bring your event-driven data.

Speaker 3:14:15You've got your messages in a topic.

Speaker 7:14:17Yeah. And you want to do some work on those. I can't begin to tell you how powerful it is to open up your laptop and write a simple program. Okay, you're not going to be running this against production; no one does that. But let's just say you have replicated environments: you've got a dev Kafka cluster that's bringing in data. To open up your laptop, quickly write a program that hits that cluster, and see the results, just run a simple program right there on the laptop against the real thing, and see right then and there what it does, that's just so powerful. Versus...

Speaker 6:15:01Writing the code... maybe there's a local framework you can run, or you can stub it out, but then you have to ship it out and deploy it, make sure to deploy it to this other framework, and then go in and run it. It's several steps. Whereas with this, like I said, I use IntelliJ, I'm a fan of IntelliJ: you run the code, you run the code locally in IntelliJ.

Speaker 7:15:28Yeah and that's it.

Speaker 6:15:30And it could be as simple as, you know, you're consuming from that topic, you transform it, you write it back out, and you can see it in the output topic. That would just be pretty tough to beat.

Speaker 5:15:42And, what happens if I do this, what am I going to get, I want to explore the data a little bit. You know, you can kind of scratch that out in a public static void main: new StreamsBuilder, and there you are. Yeah. And you're done.
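[Editor's note: a sketch of that "scratch it out in public static void main" exploration app; the broker address and topic name are placeholders for your own dev cluster.]

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

import java.util.Properties;

public class ExploreTopic {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        // Just peek at whatever is flowing through the topic
        builder.<String, String>stream("events")
               .foreach((key, value) -> System.out.println(key + " -> " + value));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "laptop-explorer");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // your dev cluster
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```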

Speaker 8:15:59Now let's dig into that separate processing cluster thing a little bit, and let me frame it. There's Kafka; there's a cluster; there's brokers. Sometimes, in the development case, there's a broker process running natively on my laptop, or maybe I have a few of them in Docker containers, or maybe I've got them in the cloud, or I've got some on-prem thing and a VPN, whatever. There's brokers, there's a Kafka cluster; there's always a cluster.

Speaker 3:16:27So that's there; there's data infrastructure that exists. And conventionally, if you look at other APIs, other projects that have tried to solve the problem of stream processing, they usually start with the assumption that we're also going to need a cluster of computers on which to run the stream processing program. That's the norm. And you said you don't need that with Kafka Streams. So just talk us through it from a deployment kind of standpoint. By the way, this is all very counterintuitive to people just encountering Kafka Streams for the first time; I've found they usually find this mind blowing. So talk us through that: no processing cluster. How does it work, and why did the team make that decision?

Speaker 6:17:15Yeah. Well, the decision for having it be an application versus a processing cluster, that was prior to my joining, but there are some good reasons.

Speaker 7:17:31Yeah, well, my assumption here is it's just for simplicity. Because you have an application, like I said, you spin it up on your laptop to test it out. But then once you have it, it's a jar file, and you just deploy it, kick it off, and run it. So how you deploy it really depends on how much data you have and what your needs are.

Speaker 9:18:01It's very easy. Let's just say your processing needs are light.

Speaker 6:18:11So you take your jar, you start it on a separate machine, and you'll start processing, doing whatever you want it to do: consuming messages, grouping, filtering, aggregating, whatever.

Speaker 7:18:24Yeah, whatever the operations are. And let's just say you want to push your final results out to Kafka, and then you want other consumers.

Speaker 6:18:34It could be Elasticsearch, MongoDB, or whatever, and then you've got the Kafka Connect framework, which can distribute to those other disparate systems. But let's just say you've got heavier processing needs, say you've got a fair number of partitions. I know I'm making an assumption here that your listeners know what a partition is. You've got a fair number of partitions, and it's too much to run on one instance.

Speaker 7:19:07Not too much, but you'll want to parallelize it. You want to take advantage of the fact that, by having multiple partitions, you can process these events in parallel. So you would take that same jar, and let's just say you've got a beefy machine with, like, 32 cores on it. You could spin up several instances of that jar on that one machine, and each one of those instances would have the same application ID.

Speaker 5:19:44So each one has the same application ID. Yeah, yeah.

Speaker 6:19:49Under the covers, to Kafka, to the group coordinator, it just looks like producers and consumers. So it's going to say, okay, you spun up, say, five instances. And we'll just make it simple: you can specify the number of threads you run with in a Kafka Streams application, but for now let's say each one is single-threaded. So you spin up five instances.
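[Editor's note: the configuration that makes this work is just the shared application.id; a sketch, with illustrative names.]

```java
import org.apache.kafka.streams.StreamsConfig;
import java.util.Properties;

Properties props = new Properties();
// Every instance uses the same application.id, so to the group coordinator
// all five instances look like one consumer group over the source topic
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "purchase-counter");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092");
// One processing thread per instance, matching the simplified example here
props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 1);
```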

Speaker 3:20:16So each is single-threaded. Yes, each one has a core and that's all cool, yes.

Speaker 7:20:21So to the group coordinator it looks like five consumers from the same group are requesting access to this topic.

Speaker 10:20:30As far as the cluster is concerned, this is a consumer group, and there are five instances in the consumer group, so it's going to hand out partition assignments, they'll start processing that data, and away you go. Now, what's really nice about this: okay, let's just say you've got ten partitions, and you find that running five instances handles your processing needs. That means each instance is going to get assigned two partitions. But then you find you get data spikes from time to time. Let's just say, and it's a bit of a simplistic example, between 2:00 and 4:00 every day you get kind of a rush, and you'd like to double your capacity. So what you could do is, on a separate, second machine, you spin up another five instances.

Speaker 6:21:25So it's going to automatically rebalance, so that each instance now gets assigned one partition, and you've just doubled your throughput: you've got a topic with ten partitions, so now you've got ten Streams instances running, and each one is processing one partition's worth of data.

Speaker 11:21:44And we're assuming each partition is getting hit equally, with an equal volume of data, and it's pretty heavy. Right.

Speaker 3:21:52It has that operational simplicity here.

Speaker 10:21:56And when that happens, you don't have to stop anything. You just spin up those new five instances, there's a rebalance, and it's automatically taken care of. The new five instances get work; the previous five instances give up half their work.

Speaker 6:22:12They're happy: each one was working on two partitions, and they happily give up one of them, and everybody's processing. Then your peak period is over, and the folks watching the compute usage or the cloud costs say, we want those machines back, whatever. Exactly. Exactly. You kill that machine, your five instances go away, your original five instances remain, another rebalance happens, and they go back to picking up their original two partitions' worth of work. And the cool thing is this is all dynamic. You don't incur any downtime. You can add or subtract operational instances, if you will, dynamically; it grows or shrinks as it needs to. And that's the key thing here. So, I go back to talking about running on your laptop: it's just a jar, you deploy that application somewhere, and that's it. You don't need a dedicated cluster.

Speaker 3:23:19Right. Right. So there, you just said it. What's interesting is, I said, hey, what's this deal with not needing a processing cluster, and you described consumer group rebalancing, which, if I didn't know anything about this, sounds like you just described a processing cluster. So, in other words, summarizing what you just said: Kafka Streams is designed in such a way that you don't need a separate processing cluster, because the Kafka consumer group coordination mechanism already means your application is one of those things, right? Like, you already get horizontal, elastic scalability by being a Kafka consumer. So the whole processing cluster thing is built in. So, I've got some application, and let's say my application does something other than stream processing. I don't know what it is; maybe it's a Spring Boot app, or something that exposes a web front end. And maybe I need to scale that out, maybe I don't, maybe I should just scale it out a little bit, but I've already got some way of deploying that application. I know how to do that. Maybe that deployment mechanism is incredibly cool: when I push commits to master and see tests pass, my Kubernetes cluster automatically upgrades, you know, maybe it's, like, space age. Or maybe I'm deploying a WAR file to Tomcat, because 2004 was a good year.

Speaker 5:24:52[Laughter] You know, this is a family podcast.

Speaker 3:24:59So, like, it doesn't matter how I do the deployment. I can deploy code; that is a given. And Streams doesn't really have an opinion about how you do that. But because you can deploy code, and because consumer groups are what they are, you can scale your code. And so adding stream processing to my Spring Boot app, or whatever kind of Java application I am, certainly adds heat: there's more computation happening in consuming these messages and doing whatever it is it's doing, there's more I/O, which is going to put pressure on my scaling. But look, it's okay: we already know how to scale out, because we're a Kafka consumer group. I'm familiar with this stuff myself, so sometimes I forget to be impressed with the architectural elegance of the thing. But Streams takes this really hard problem, how do we build a distributed processing cluster, and says, well, we kind of had that lying around with Kafka anyway, so we'll just use that one. I love that answer.

Speaker 6:26:05Yeah, I was going to say: Kafka itself is already a distributed system, and Streams just leverages what's already built into Kafka. So all of that's already handled for you. And the one thing I forgot to mention, that I think is powerful and that you touched on, is I was speaking solely in terms of having a Streams application by itself, an application in its own right. But you can easily embed it in an existing application. Like you mentioned, say you have a Spring application; let's just say you're running some sort of application already, and you want to get real-time updates into that application. It's easy to embed Kafka Streams into it.

Speaker 3:26:55It's a library with a single dependency, so you add it to your Gradle or Maven build file.

Speaker 12:27:02And now you have the API, so you're okay.

Speaker 3:27:11Let me cross-examine you on that a little bit. That's a great story, but hey, isn't that assuming, again, a listener who knows something about Kafka, maybe even something about consumer groups? That whole scaling story is true merely by virtue of being a Kafka consumer. And we know, when writing Kafka consumers, I said this before, I think: if you're stateless, the way a consumer group scales is trivial. So if I'm a Streams app and all I'm doing is filtering, or mapping, or mapping values, something stateless like that, that whole scale-out and scale-back scenario you just described sounds kind of trivial. What if I'm stateful? What if I'm grouping, for example, or running an aggregation where I have to have things in memory? Talk to us about what Streams does there.

Speaker 6:28:03Yeah, well, that's the other great point about Kafka Streams. If I could take a second, take a step back, and just look at stream processing in general: there is, at least for me, this first assumption that you have these discrete events that have nothing to do with each other, and you're just processing each of them. But in the real world, that's not always true.

Speaker 3:28:27That's a great world, like a Star Trek episode, one I want to live in. Exactly. Right.

Speaker 6:28:34But usually you need some kind of state. You know, you're tracking stock purchases, how many shares someone bought, or you're looking at events where the same IP keeps making strange requests, and you wonder, what's going on here? So you need to keep track of state; you need to remember. I like to think of it as adding context, or remembering what you've seen before.

Speaker 3:29:01Yeah, have we seen this credit card number authorized more than three times in the last five seconds? That's a stateful question.

Speaker 6:29:10Yes, exactly. So Streams offers state kind of right out of the box. It provides state, and one of the key things about that state is it's local. Under the covers there are different types of state stores: you can have in-memory, or you can have a persistent state store, which uses RocksDB under the covers. But it's local; it's local to that task. So, going back to the example: you get a credit card number, have I seen this before, how many times have I seen it in the past hour? You've got to look it up, and it's local. You're not making another network trip out somewhere to do that. It's local state, right there.

Speaker 6:30:02And now, again talking about the real world: what happens when you crash? Because as much as we hate to admit it, crashes happen; bad things happen, bad things happen to good users. These state stores, even the in-memory state stores, are backed by a changelog topic. So basically, as you're writing to your state store... now, there are some tweaks to this that I'll skip, I'll just give you the high-level view... as you're adding things to the state store, they're being backed up by a changelog topic. So, let's just say, to make it simple, you're running on one machine. You've got one machine, you're doing counts by credit card ID, and you can also window these things.
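[Editor's note: a sketch of the windowed count being described, in the current Kafka Streams API; the topic name and window size are illustrative. The count lives in a local state store that is backed by a changelog topic, which is what the recovery story below relies on.]

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.kstream.Windowed;

import java.time.Duration;

StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> auths = builder.stream(
    "authorization-attempts", // keyed by credit card number
    Consumed.with(Serdes.String(), Serdes.String()));

// Count attempts per card in five-minute windows; state is local (RocksDB by
// default) and every update is also written to the backing changelog topic
KTable<Windowed<String>, Long> attemptsPerCard =
    auths.groupByKey()
         .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
         .count();
```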

Speaker 3:30:55But of course windowing also has lots to it; that's a second podcast of its own. Yeah.

Speaker 6:31:03So you're doing grouping, and you're doing these aggregations, and your computer dies. You're able to bring it back up, but everything on that disk is just gone. So you think, okay, what has happened to my state? Right. Well, when you start back up, your process is going to start back up, and that state store was backed by a changelog topic. It's going to read that changelog topic and get back to the point it was at as of the last commit. Because the changelog topic is just a topic: as you're saving the state of the store, if you will, you've got commits, and it's going to replay, it's going to refill that store up to the last committed offset. So it's always fault tolerant. All your state, even your in-memory state stores, is fault tolerant in that respect.

Speaker 3:31:59So if my last message that I had consumed... because again, the streaming application is consuming and producing, same as any Kafka client; we're adding all kinds of fancy abstraction on top of that, but as far as Kafka knows, there's this consumer, and it's consuming from some input topic of, say, credit card authorization attempts, and it's grouping them by card number, which means, in memory, one way or another, there's, like, a hash table.

Speaker 8:32:28And the key is card number, and the value is, I don't know, a count or something. So there's this in-memory hash table. But as I'm consuming messages from my authorizations topic, I'm going to consume from some offset, like, the last one I read was offset one million forty-eight thousand five hundred seventy-six, and commit that offset to the cluster, saying I have officially consumed 1,048,576 of those. Also, you're saying, it sounds like, more or less atomically and at the same time, the version of the state, that hash table of credit card numbers and counts in memory, I am effectively persisting into a Kafka topic.

Speaker 6:33:12Along with that offset, yeah. When you commit, you're committing, to make it simple, saying that record has traversed your entire topology: okay, I've read this record, it's gone through all the processors. And if one of those processors was stateful, whatever state you added is going to be sent to a changelog topic; it's persisted to a Kafka topic. And then when you start back up, there's a separate consumer under the covers that reads from that and restores up to your last commit.

Speaker 3:33:59Up to the last known offset of what was in the store. And that restoring process, again, is when a new streaming application instance is coming up, or a new partition has been assigned, or whatever. I have new stuff, whether I just came to life or just got a new partition assigned to me: I need to swap that state back in, and that separate consumer pulling from that internal topic of my persisted state has to do that swapping in before I start processing new messages and have really come online. Right?

Speaker 6:34:38Yes, yes. Yeah, and there's something else along with that.

Speaker 10:34:45Because what that implies is, you've got two instances, and instance A is going to take over. Let's say instance B had a stateful operation.

Speaker 6:34:56When instance B dies and A takes over, A is going to have to refill the state store that instance B had; instance A is going to have to replay that changelog. What can save you failover time is the notion of a standby task, which is configurable; it's a configuration item. Let's just say you configured your Streams application to have one standby replica. Then instance A has been keeping a standby replica of the state store that's on instance B. So when instance B dies, A fails over, it takes over the task from B, but the state store is already populated with the state as of when B crashed. So the failover time is much less.
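[Editor's note: the standby replica Bill mentions is a single Streams configuration setting; a sketch with illustrative values.]

```java
import org.apache.kafka.streams.StreamsConfig;
import java.util.Properties;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "purchase-counter");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
// Keep one warm copy of each stateful task's store on another instance,
// so failover does not wait for a full changelog replay (default is 0)
props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);
```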

Speaker 3:35:50Right. So if you need rebalance time to be very small, and you've got, say, latency guarantees even in the presence of failure, you need your 99th percentile to be a certain thing, then you simply invest in the compute resources to have enough standby instances that somebody is standing by all the time, and you don't have to be doing this lengthy rehydrate process. Sometimes people ask me how long that takes, as if there's a closed-form answer.

Speaker 6:36:21Well, you know, you can't say; it depends entirely on how much data is there. It's kind of an obvious answer. There is checkpointing going on, so if you're restarting an application, it's not going to reload the entire changelog topic each time. That happens in the case of a complete failure, where, say, the disk is wiped out or something like that; when it comes back online it's going to have to fully restore. But again, if you're using standbys and an instance dies, then it does not have to fully restore. And the trade-off with the standby is that you're trading rebalance time for disk space. Totally. Which kind of gets into an...

Speaker 3:37:21Operational topic, and a thing that I think requires tuning. Final thing I want to talk about: when I talk to people about Streams, two things trip up developers. One is, what do you mean there's no processing cluster? And I love Kafka Streams' answer to that a lot, so that's easy and fun for me to explain, because I really believe it's the right thing. And I didn't even give my entire rant on why that's such a great idea; I'm going to withhold that. Maybe there'll be another episode where I can just go off on that. But I really love our approach of being an API and using consumer groups to scale. The other thing that's counterintuitive for people, or not so much counterintuitive as just hard to get, is tables, right?

Speaker 8:38:14KStream seems fine: you've got a KStream, and you didn't even have to say what one was. You started talking about KStream and its methods, and it made total sense, right? Everybody's like, yeah, I get it, there's topics and there's messages, and obviously this is just a class that abstracts over that, and I've got this functional API and I can filter and I can group, and you can kind of see it in your head. But now you say KTable, and people go, wait, okay, I don't get it. So give us the five-minute version of KTables.

Speaker 6:38:44Sure. Okay, so real quick: a KStream is an event stream. It's an unbounded event stream, but each event has nothing to do with the others; they're just independent, happy events moving along. A KTable is considered more of an update stream. Going back to the example with credit card numbers: say your credit card numbers are the key. In an event stream, if you got five events with the same credit card number as the key, you don't really care; they're truly independent. But in a KTable, it's considered an update. So you get a credit card number as the key, and the first one comes in, say, with the value A. Then another one comes in with the value B. That second one is going to replace the first one, because it's considered an update. There are no independent events anymore per key. Keys that aren't related to each other are still independent, but per key, they're no longer independent events. A KTable is an update stream.

Speaker 3:39:57And the topic is a changelog describing changes that happened to keys. Not exactly events in the sense of, like, sensor readings or credit card authorizations, but updates to a collection of keys.

Speaker 6:40:12Yeah. And something under the covers that drives that, which is really neat, is the concept of a compacted topic. Because in Kafka you've got topics, and obviously you get events and append them to the end of the log, and the logs keep growing. So eventually, at some point, you have to drop that data, and you just delete the oldest data; there are different retention strategies you can configure, and we won't get into that here, but essentially you're deleting older records, that's the bottom line. You keep appending and deleting the old stuff. A KTable uses a compacted topic, which is unique in that you're not just deleting data by age; you're keeping the latest record for a given key. So, say I got my Tim record yesterday, and let's just say I have aggressive retention to keep that topic small, with stuff removed every few hours. If that was the last record I got for Tim, it would still be there today, because compaction goes through and keeps the latest record for a given key. So it trims the topic that way, but you always have, like you said, a changelog. It's no longer just this log of discrete events; per key, you really only care about the most recent one.
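[Editor's note: compaction is a per-topic setting, cleanup.policy=compact; a sketch of creating such a topic with the AdminClient, with illustrative names and sizing.]

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

import java.util.Collections;
import java.util.Map;
import java.util.Properties;

public class CreateCompactedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            // Compaction keeps the latest record per key instead of deleting
            // by age, which is exactly what a KTable's changelog needs
            NewTopic customers = new NewTopic("customers", 6, (short) 3)
                .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG,
                                TopicConfig.CLEANUP_POLICY_COMPACT));
            admin.createTopics(Collections.singletonList(customers)).all().get();
        }
    }
}
```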

Speaker 3:41:48If you're trying to make it into a table, whatever values of the Tim key existed before, you're explicitly saying you don't care about the history; you just want to know the current state. And the thing keeping storage of this kind of data efficient is this topic log compaction thing.

Speaker 6:42:08Yes. And one thing that's interesting about a KTable is it kind of, I don't know, turns a database on its head, or flips the idea of a database around, because your stream is a table now. It's, you know, the messages are streaming, constantly flowing through.

Speaker 3:42:32But at any given point in time you have the current state. That is exactly the concept: you can turn streams into tables. I mean, in the API there are ways of taking a KStream... well, I guess most obviously, grouping, right? Group, and then apply any kind of aggregation to the group, and the object that's going to be returned by that method is going to be a KTable. And so you've harnessed the stream into a table. And of course, if you've got this compacted changelog topic lying around in your Kafka cluster, you can also say, hey, I would like to have a KTable object backed by that thing, and turn that raw changelog topic data into this beautiful API object that lets you do all kinds of cool things. Like, you could join a stream to it, or join it to another table, or do whatever it is you need to do.
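[Editor's note: a sketch of both directions just described: harnessing a stream into a table via an aggregation, reading a compacted topic directly as a KTable, and joining the stream against it. Topic names are illustrative.]

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

StreamsBuilder builder = new StreamsBuilder();

// An event stream of purchases keyed by customer ID
KStream<String, String> purchases = builder.stream(
    "purchases-by-customer", Consumed.with(Serdes.String(), Serdes.String()));

// Grouping plus an aggregation harnesses the stream into a table
KTable<String, Long> purchaseCounts = purchases.groupByKey().count();

// Or read a compacted topic directly as a table of the current value per key
KTable<String, String> customers = builder.table(
    "customers", Consumed.with(Serdes.String(), Serdes.String()));

// Join the stream to the table to enrich each event with customer data
KStream<String, String> enriched = purchases.join(
    customers, (purchase, customer) -> customer + ":" + purchase);
```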

Speaker 6:43:24Yeah. Now, there's one other thing, can I tell you one of the things I really love about Kafka Streams? We have to make this quick. Totally. I love interactive queries. I mentioned state stores before, and, okay, I won't go too deep into the weeds here, but Kafka Streams creates tasks depending on the layout of your source topics. It could be a task per partition, you would assume.

Speaker 5:44:01I always tell people it's a task per partition. I'm glad that's not wrong. You know, that could be me.

Speaker 6:44:10You know, there are other scenarios where you might not have a task per partition, but when you're explaining it, it's much easier to say: I've got one topic with five partitions, so you're going to have five tasks. Nobody ever got mad about that, right? Yeah. Yeah, exactly. No, let's make it six tasks. Because what we're going to do is say, I want to run this in two separate application instances for some fault tolerance, whether it's on the same box or a different box or whatever. And it's stateful, and you want to see what's going on, you want to get into that state, but you've got two application instances. So you're going to have partitions 0, 1, 2 in one instance and 3, 4, 5 in the other.

Speaker 3:44:56I have a KTable, and I want to get the value of some arbitrary key, which may or may not be in my task, right?

Speaker 6:45:04Yeah, you don't know where it is. So interactive queries... it's an abstraction over the state store notion. It allows you, when those two instances come up, and you have to specify the... I forget the exact config name; we'll put that in the show notes. The instances will know about each other, and then you can do interactive queries, where you can query all of the state stores running under the covers in your Streams application. But to you, you don't know that; you just issue a query. You say, hey, I've got this credit card number, I want to see its count, or all the purchases, whatever stateful operation you've done by this credit card number. Interactive queries allow you to see that regardless of where it lives: it will find it, whatever state store it's in, and you can get those results as they're written.
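[Editor's note: a sketch of querying local state, assuming a running KafkaStreams instance `streams` whose topology materialized a count as a store named "purchase-counts". Routing a query to whichever instance owns the key, via the config Bill alludes to (application.server) plus an RPC layer of your own, is built on top of this.]

```java
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

// Look up the state store by the name it was materialized under
ReadOnlyKeyValueStore<String, Long> store =
    streams.store(StoreQueryParameters.fromNameAndType(
        "purchase-counts", QueryableStoreTypes.keyValueStore()));

// Read the current aggregate for one key straight out of local state;
// no external database is involved
Long count = store.get("4111-1111-1111-1111");
```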

Speaker 3:46:10So essentially, whether it's confined to your task or not.

Speaker 6:46:14Exactly. So it allows you to query that and see the state of your state store, so to speak, as it's happening. Because typically what would happen in the past, in my experience pre-Kafka Streams, was: you get data, and whatever you're doing, you do some processing on it.

Speaker 11:46:44And if people wanted to see those results, each time you processed you'd have to publish them out to a separate database. Exactly. And then there would be a web application that would hit that database table.

Speaker 6:46:56Now you don't need that database. You just do an interactive query and see what's in your state store.

Speaker 11:47:03And if the application instance you hit doesn't have the record for the key you're asking for, the other application instances are queried for you. So you don't need a separate database; you have one, effectively, in the form of a table.

Speaker 3:47:22You do have a little distributed database under the covers there. And, you know, I'm careful about the way I talk about this architecturally: this does not make all relational databases go away. It's not like you're never going to deploy one ever again, but I'm probably going to deploy fewer of them. The need for a big giant database... it kind of starts to get pulled into your Java application a little bit. Like you said, you've turned the database inside out. Now the fundamental piece of infrastructure is this big giant log, and you're just making little tables and little materialized views of that log in your services.

Speaker 5:48:00Through Kafka Streams. Exactly. Well, my guest today has been Bill Bejeck. Bill, thanks for being a part of Streaming Audio. Thank you, everyone.

Speaker 2:48:08It was a lot of fun. You got it. And there you have it. I hope that was helpful to you. If you've got questions, you can ask me at @tlberglund on Twitter. That's T-L-B-E-R-G-L-U-N-D. Or you can leave a comment on any of our YouTube videos; your question might be featured on the next episode of Streaming Audio. And feel free to subscribe to our YouTube channel and this podcast, wherever fine podcasts are sold. And if you subscribe through iTunes, be sure to leave us a review there. That helps other people discover the podcast and just generally helps us get the word out. We appreciate your support. See you next time.
