Crossing the Streams - The Unofficial Apache Pulsar Podcast

Messaging, Streaming, and Events 101: Episode #1 of Crossing the Streams!

October 08, 2022 StreamNative Community Season 1 Episode 1

Tim and Matt welcome you to the first episode of "Crossing the Streams", where we start to talk about event streaming, event-driven architectures, message queueing, Apache projects, and more! Today we talk about some of the basic concepts around messaging and streaming. We try to highlight the subtle differences and use cases. If you are building systems that require real-time analytics or data streaming technologies, or are merely interested in learning, tune in!

Crossing The Streams Episode 1

[00:00:00] Matt Yonkovit: Hello, everyone. Welcome to the first episode of Crossing the Streams. Yes, we are crossing the streams despite what those in the Ghostbusters universe told us not to do. It did save the day in Ghostbusters, and today we're gonna be crossing the streams, talking about event streaming and building event streaming apps, and everything in the Apache Pulsar space. I am one of the co-hosts here, Matt Yonkovit. I am Director of Community at StreamNative, and we are joined by our other co-host, Timothy Spann. Tim, how are you?

[00:00:33] Tim Spann: I'm doing good, I'm afraid in this stream.

[00:00:36] Matt Yonkovit: Oh, look at that. He's going pun-related already, and we just started. So we are here to start this podcast off with a bang. But we are planning to go through and talk to other people who are building apps that require a modern infrastructure that relies on event streaming and an event-driven architecture.

[00:01:00] We're here to talk about people who are using Pulsar, Flink, even Kafka, you know, other things that are in this space. We wanna hear how they're deploying these streaming systems and how it is making their lives easier, and help you figure out what you can do with some of the technologies that are out there. Tim, why don't we start, and you just give us a nice introduction for yourself to let those who are listening to our first episode here know a little bit about yourself.

[00:01:32] Tim Spann: Sure, that makes sense. I've been, uh, around the Pulsar community for a little over a year. Before that, I was an engineer in the field for Cloudera, covering Apache NiFi, Apache Flink, Apache Spark, Apache Kafka. So I was in the streaming space for like six years through them, covering the [00:02:00] intersection where we hit data lakes, databases, data warehouses, Hadoop, all these different, uh, increments of big data and all kinds of weird streaming data.

In-memory, IoT. I've got a big thing for devices, and that, uh, fits in well with getting data into a stream quickly and doing things with it. Before that, I was at Pivotal, where I was doing pretty much the same thing, but also a little bit on Spring and microservices, which has come back full circle now that Spring and Pulsar are actively working together on a couple of projects, which is, uh, pretty exciting for us who've done, uh, Spring and Java for a number of years.

And I'm gonna be out at the Spring conference later this year talking about. Well, we'll see if, uh, we're going full, uh, Ghostbusters outfit or [00:03:00] not.

[00:03:00] Matt Yonkovit: Oh boy.

[00:03:01] Tim Spann: Cluster on my back. Carried around that, probably.

[00:03:04] Matt Yonkovit: That would be cool. We could do a cluster of Raspberry Pis on your...

[00:03:08] Tim Spann: Hmm. Cause I mean, I've got some devices attached to my, uh, jumpsuit already, so we'll see.

[00:03:15] Matt Yonkovit: Okay. Wait, wait, wait. What is that device and why is it attached to your jumpsuit?

[00:03:20] Tim Spann: Well, cuz they got a place to put it on a belt. Uh, this is one of the nice, uh, devices that come from, uh, Adafruit in New York. Uh, Ladyada and the team there have some awesome devices. Obviously now it's a little harder to get stuff cuz of the global supply chain. Not using Pulsar, you know, what are you gonna do?

This one is interesting cuz off the shelf it uses MicroPython to send messages out every couple of seconds. It's got a couple of sensors here for temperature, humidity,

[00:03:53] Matt Yonkovit: So wait, you have a pressure, humidity, temperature sensor on your... [00:04:00]

[00:04:00] Tim Spann: Yeah. I gotta figure out where to put it. And, you know, do I hook it up to the back? You know, maybe that goes in the back on top of the, uh, pack, because I gotta still build out a pack, but what's nice is this is going to Pulsar.

[00:04:15] Matt Yonkovit: Oh, oh, okay. That's cool. But I'm still trying to figure out if you're just trying to make sure you don't get overheated. Was it something like your family's worried that as you go out in that suit you could get overheated?

[00:04:24] Tim Spann: It, this actually is a use case that came up. I don't remember what year that was, a couple of years ago. I was working with, uh, a customer that does utilities and they were working on a combination smart suit, which it probably would've made more sense to be a smart jumpsuit, so they had some sensors in there.

So yeah, part of it is temperature. And then some of the other sensors I've got next to me make more sense: air quality, carbon monoxide levels, methane, other noxious gases. I mean, [00:05:00] heat's important. You don't wanna be overheating, but you definitely don't wanna be breathing in some weird, uh, weird chemicals. That might be more important.

It sends it right to Pulsar, and it goes to central command. You know, obviously you could change a color locally, but you could be busy, not really thinking about what's going on on yourself.

[00:05:19] Matt Yonkovit: So.

[00:05:20] Tim Spann: Someone else can tell you, you know, you get a ping in your, uh, wireless Bluetooth headset that says, Hey Tim, get outta there.

Uh, sensors are going off the... You know, maybe there's a ghost, who knows. Whatever it is, get out of that building. So there is a need to have real-time streaming, even in the physical world.

[00:05:39] Matt Yonkovit: Okay. So this is actually a pretty interesting use case and a pretty cool thing, and you're saying that there is some entity out there that was working on a suit for people who are working maybe in mines, maybe in, you know, sewers, or, you know, petrochemical plants, whatever.

And [00:06:00] so they are building into their helmets, their suits, whatever, the sensors that will report back to a central location. So if they end up passing out, then someone knows to go run and get them.

[00:06:11] Tim Spann: Yeah, or just, obviously with all these things, the thing that's most important with streaming analytics is being able to do predictions. So predict to say, Hey, temperature's here, number of hours in the field is here, location is here, gas levels are here. Uh, you got 20 minutes left before you gotta leave that facility, you know, and we could tell them, give them a timer, and update that in real time.

And they were also combining it with other streams, which is, again, the best part of streams: crossing them. You got the local sensors, but then I also have feeds looking at, you know, local social media to say, Hey, is there some event going on? You know, check, uh, public weather feeds, traffic feeds, you know, transport feeds to say, Hey, there's some [00:07:00] criminal activity nearby.

The highways are jammed. Uh, it's starting to rain. All your sensors are up. Uh, we're cutting our losses. Get outta there now. Yeah. And it's very easy to join them. What's nice is you get all those feeds into Pulsar and then we join them with Flink SQL or any of the, uh, Flink SQL GUI tools that, uh, our friends have out there.
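
To make the idea of joining feeds concrete, here is a minimal sketch in plain Python. It is only a stand-in for the Flink SQL join Tim mentions, not Flink or Pulsar code. The event shapes, the field names (`temp_f`, `gas_ppm`, `raining`), and the thresholds are all hypothetical and chosen purely for illustration.

```python
import heapq

# Hypothetical, hand-made events: (timestamp_seconds, source, payload).
suit_readings = [
    (1, "suit", {"temp_f": 98, "gas_ppm": 5}),
    (4, "suit", {"temp_f": 104, "gas_ppm": 40}),
]
weather_updates = [
    (2, "weather", {"raining": False}),
    (3, "weather", {"raining": True}),
]


def enrich(readings, updates):
    """Attach the most recent weather update to each suit reading."""
    latest_weather = None
    enriched = []
    # heapq.merge interleaves two time-sorted feeds by timestamp.
    for ts, source, payload in heapq.merge(readings, updates, key=lambda event: event[0]):
        if source == "weather":
            latest_weather = payload
        else:
            enriched.append({"ts": ts, **payload, "weather": latest_weather})
    return enriched


def should_evacuate(event, max_temp_f=100, max_gas_ppm=35):
    """Illustrative rule only: any single bad signal triggers the alert."""
    raining = bool(event["weather"] and event["weather"]["raining"])
    return event["temp_f"] > max_temp_f or event["gas_ppm"] > max_gas_ppm or raining


joined = enrich(suit_readings, weather_updates)
alerts = [event["ts"] for event in joined if should_evacuate(event)]
```

In a real deployment the two lists would be continuous topics and the join would run in a stream processor; the sketch only shows the shape of the logic.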

[00:07:26] Matt Yonkovit: We jumped ahead a little bit because that was an interesting use case that I didn't even expect to go into. It was just that when I see flashing lights on Tim's

[00:07:33] Tim Spann: Well, they, they're not that flashy. Those, I should have 'em flash maybe every time it does a reading.

[00:07:38] Matt Yonkovit: Okay, well, either way, when I see that, it's like, what the heck, what's Tim doing? But now I understand. It's to understand if his air quality's okay and to alert that there is proof of life, because Tim never leaves his desk. Um, so that way his family knows he's still alive and stuff.

[00:07:54] Tim Spann: Or if I start, uh, putting out too much hot air, we know.

[00:07:58] Matt Yonkovit: Yes. We, we know [00:08:00] what

[00:08:00] Tim Spann: talking too much hot.

[00:08:02] Matt Yonkovit: Tim's talking too much hot air, but that's a very interesting use case. But here's the thing. I think that people hear that and they go, That's very cool, but they don't actually understand some of the technologies underneath that. And so I figure today, as we start this journey, we're gonna talk with a lot of other people.

Our next episode's gonna have Matteo on it, who is the CTO at StreamNative. And so we'll get into why Pulsar was created and what the use cases were at the beginning. But I think we really need to start with the basics. Right. And from a basics perspective, when we talk about event streaming or we talk about messaging, what is that?

And some people get it. If you've been doing it, if you maybe have written a ton of old-school applications, you probably used MQ in the past. Um, if you are, you know... oh, look at that. He just cringed, you know, for those

[00:08:57] Tim Spann: I.

[00:08:58] Matt Yonkovit: listening. Yeah. Yeah. [00:09:00] You might know what MQ is, you know, and messaging systems and queuing systems.

Um, or you might be designing some applications to take care of, you know, event streaming. But my question is, if you're not in that space, maybe we can help people understand some of these key concepts, and I think that the first one to tackle is, you know, what is the difference between message queues and event streaming?

[00:09:27] Tim Spann: I mean, there's some debate on that. I don't know if you wanna start with your opinion or you wanna read a one-liner someone's put out there. There's some debate about where one starts and the other ends. Like, I think the problem is, with everything, once a new term comes out, people try to adapt the term and their tech to whatever the new thing is.

So some people will say, Oh, messaging. Well, yeah, Kafka does messaging too, and you know, you don't wanna [00:10:00] JMS, you know, they're the same ideas, but yeah, there's some differences if you want to be very precise.

[00:10:10] Matt Yonkovit: So, Tim, I'm gonna be the newb, because I'm a newb here, right? I don't come from this space. You know, there are key differences. One of them obviously is with a queue: you're writing to the queue and then pulling off the queue, and that message disappears. And with an event stream, the message is appended, the log just continually grows, and you just move the pointer along in that list. Is that correct?

[00:10:45] Tim Spann: That is correct for some use.

[00:10:52] Matt Yonkovit: For

[00:10:52] Tim Spann: Well, I think the problem is, yeah, in a pure messaging queue, that's what should happen. It should be like a [00:11:00] real queue: put something on, take it off, and it's gone cause you took it off. But that doesn't meet every use case. So people have added additional functionality, which raises the question: is it still a messaging system?

Is it a hybrid system? A lot of these will retain the message for a certain period of time, or will have ones that are persistent. You know, you can make it durable. Oh, it's a durable message. We keep it for some period of time, or you keep it forever, or you keep it if it's in some state; maybe you gotta do an acknowledgement.

That's when all these things become complex, and it becomes harder because people keep wanting new stuff and want to do different types of apps. So they add additional knobs and dials. So you started off with the Atari joystick and now you're with a console with a thousand buttons. And it looks like, you know, you're in a spacecraft and you're like, [00:12:00] Oh, make sure you do these nine things over here and tweak this and this.

Okay. And then your message comes in, goes out.
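
As a rough illustration of the distinction discussed above, the sketch below contrasts a pure queue (a message is gone once it is taken) with an append-only log (messages are retained and each reader moves its own pointer). The class and method names are hypothetical and do not correspond to Pulsar's or Kafka's APIs; real systems layer retention policies and acknowledgements on top of this.

```python
from collections import deque


class SimpleQueue:
    """Pure queue semantics: taking a message removes it."""

    def __init__(self):
        self._items = deque()

    def put(self, message):
        self._items.append(message)

    def take(self):
        # Remove and return the oldest message, or None if the queue is empty.
        return self._items.popleft() if self._items else None


class SimpleLog:
    """Stream semantics: messages are appended and kept; readers track a position."""

    def __init__(self):
        self._entries = []
        self._cursors = {}  # reader name -> index of the next entry to read

    def append(self, message):
        self._entries.append(message)

    def read(self, reader):
        position = self._cursors.get(reader, 0)
        if position >= len(self._entries):
            return None
        self._cursors[reader] = position + 1  # move this reader's pointer along
        return self._entries[position]


queue = SimpleQueue()
log = SimpleLog()
for reading in ["temp=71", "temp=72"]:
    queue.put(reading)
    log.append(reading)

queue_first = queue.take()   # removed from the queue for everyone
queue_second = queue.take()
queue_third = queue.take()   # nothing left to take
analytics_view = [log.read("analytics"), log.read("analytics")]
audit_first = log.read("audit")  # a second reader still starts from the beginning
```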

[00:12:06] Matt Yonkovit: Okay, but Tim, when would you use a message queue or that kind of pub sub model versus a stream? Like, gimme a couple examples of when you would use one over the other.

[00:12:21] Tim Spann: Well, I mean, certainly there's a lot of traditional apps that are out there that already do it, so you use it for that reason, and some won't work differently. The other is sometimes you wanna use messaging as a way to disconnect systems, which is a best practice in a lot of... you know, when you have two unrelated systems, I wanna send a message between 'em.

I wanna send data between them, or an action. Having it done asynchronously, and having it done in a way that I don't care if either system's running, is a really good practice, because then someone, you know, brings it down, starts it, [00:13:00] and I didn't lose anything. You know, I'm not, like, trying to make an HTTP call and it happens to be down or I can't get there.

This also lets you have security, so it's a layer of isolation. This was huge in the nineties, which was the first time you had the SOAs and, uh, you know, servers sitting in the middle that act as a hub for your data. You get it there and, you know, either you pull it or it gets pushed.

That's a really nice use case. Just having that async, just having that disconnection, and just having someone else in charge of not losing that. You don't have to sit in one VM, you don't have to sit on one machine. It's a way to communicate between apps, between servers, between systems in a very logical, asynchronous way.

That's the basics.

[00:13:56] Matt Yonkovit: And so this lets you fire and forget.

[00:13:59] Tim Spann: Fire and [00:14:00] forget. I sent a message. I did my job.
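
The decoupling described here can be sketched with Python's standard library, using an in-process `queue.Queue` as a stand-in for a broker. This is an assumption-laden toy: a real broker would persist messages across process restarts, and the message names are made up. The point it shows is that the sender returns immediately and nothing is lost while the receiver is not yet running.

```python
import queue
import threading

broker = queue.Queue()  # in-process stand-in for a message broker
processed = []


def send(message):
    """Fire and forget: hand the message to the broker and return immediately."""
    broker.put(message)


def consumer():
    """Drain the broker until a None sentinel says there is no more work."""
    while True:
        message = broker.get()
        if message is None:
            break
        processed.append(message.upper())


# The producer sends while the consumer is still "down".
send("audit-log: transfer 1")
send("audit-log: transfer 2")

# The consumer starts later and still receives everything, in order.
worker = threading.Thread(target=consumer)
worker.start()
broker.put(None)  # sentinel: tell the consumer to stop
worker.join()
```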

[00:14:02] Matt Yonkovit: This is a critical thing for when you are doing high-throughput data transfer between systems and you don't want to block or wait. Right. You don't have to go check the state: was it delivered? Especially with modern queuing systems, like Pulsar, you have that capability.

You can have retries, you can have exception handling, you can have guaranteed delivery. There are things that are part of the system where you don't have to worry about it. So your app can say, Yes, I need to write this transaction to this other system. And this is often used in any app.

Think of this as, you know, we'll use a very classic example from a banking perspective, right? You do some sort of local transaction and you need to send data or an audit log or something to ancillary systems. How do you ensure that, you know, when you upload that, where does it go?

I used to work for a, uh, trucking company like [00:15:00] 25 years ago. God, it's been forever. And they had to have a, you know, a check on who's renting trucks, to cross-check against, like, a watch list, right? To make sure that there wasn't some, like, suspicious terrorist activity or something, or this person wasn't on there.

And so they would send the data out, right? And it was a constant, like, okay, send it out, make sure it gets delivered. But it didn't necessarily have to be. Um, and that's another potential use case there. Now, we talked a little bit about event streaming with your devices, but maybe you can tell people, now that we kind of talked a little bit about that pub sub model, the queuing model, why event streaming?

Why streaming? What's the, the key thing that people get from streaming over that, that, that queuing?

[00:15:49] Tim Spann: Well, I think that's one part of the evolution, and a lot of that was driven by the cloud scale [00:16:00] and, you know, realized initially in things like Kafka: I need to be able to send messages, and things like ordering started to matter, and different delivery guarantees started to matter. You know, I had to get it exactly once, or I had to get it at least once, and things were a continuous feed of data.

You know, messaging was, you know, like email. I send one and maybe I send another one. It's usually not continuous. It certainly could be, especially with IoT, and then you're like, is that queuing or streaming? With streaming it's: I've got logs, and every time a log comes in, send that log event into my, uh, stream, and someone's gonna grab it right away.

Maybe do analytics or SQL or store it somewhere. I want data as it [00:17:00] happens, or, you know, for a Spark use case, a really small micro-batch, where you have a second or a couple seconds of, you know, a couple events together. But generally, as it happens, send it to me, because there's a lot of them. And I just need to get them in order, or I need to get them in something resembling order.

You know, maybe there are keys, I don't care, but get it. Do something with it as it happens, or very close to it.

[00:17:32] Matt Yonkovit: So a big key difference between the queuing or messaging systems and event streaming is the volume of data and how real time, or as close to real time as it gets, it is. Because it's kind of like getting your packages delivered, right? Um, you can get your packages delivered and it can be like, when it gets there, it gets there.

And that could mean that if you order two things at the same time, one [00:18:00] arrives later than another. Um, but if it is guaranteed delivery in a certain amount of time, like two hours, they're gonna have your groceries there within two hours, before your ice cream melts. That's more on that streaming side. Yeah.

[00:18:14] Tim Spann: And I think another thing, and you hate to, uh, put concept together with implementation, but, uh, a key factor that was highly differentiated when streaming came out was its distributed nature, much like Hadoop and federated, you know, file systems and storage. The same idea happened with streaming. When I think of streaming, no one's thinking single node, one server sitting somewhere.

And everything runs through there, maybe a backup. People are thinking, I have a lot of small servers. It scales up and down as needed. You know, these use cases are too big [00:19:00] for one server. You know, this really applies to the same period of, Yeah, it's gonna be hosted in the cloud, or now Kubernetes. It's gonna scale out.

I'm gonna have clusters. I'm gonna have concerns about failover cuz it's running continuously. I can't go, Oh, we're taking the server down for the day or for the weekend. This is running 24/7. It's never slowing down. Maybe it'll speed up, but it has to keep running. So I gotta be able to take servers in and out in real time, and scale up as, you know, more events come in.

That tends to be the philosophical and implementation difference with streaming.

[00:19:45] Matt Yonkovit: Yeah, I mean, it's the difference between maybe taking sips through a straw versus drinking from the fire hose, right? So you need to be able to, you know, scale to the fire-hose-size, you know, drinking vessel, which is quite a bit. And I think that that's [00:20:00] a very important differentiation as well.

Now, we did mention a couple of examples, right? So if we're tracking in real time what sort of things are happening in your sensors, or maybe we're tracking GPS coordinates from people driving around. Uh, a classic application that is used in a lot of examples is, uh, kind of that Uber Eats or GrubHub type example.

You know, you've got food delivery. Where is your food delivery driver? Did the food delivery driver get to the restaurant? Is it in process? And what is the status of that? There's a lot of data flowing back and forth between disparate systems because it's not like there is one central device that sits there. It's a thousand different people's phones that are tracking where your food is.

Right. You know, if I'm a delivery driver and I'm driving towards the restaurant, then my GPS coordinates have to go back up through the app, and then that has to go back out to my phone as a purchaser and tell me where they are. And so we [00:21:00] have to have this kind of two-way flow of data. Um, but I think that one of the other things that's an interesting component is the new application architectures that we're building.

While we have focused on kind of reducing the footprint and creating a lot of microservices, it's made it so there are not just, like, a couple of components that have to share data. Now there's dozens or hundreds of components that all have to pull data from one another and send data back and forth, and that's made the highways between the microservices and the systems even more critical.

That capability to kind of move things back and forth efficiently is, um, a really big problem that needs to be solved. What do you think?

[00:21:49] Tim Spann: Yeah, that's exactly on point there.

[00:21:52] Matt Yonkovit: I think that that's the other, um, interesting thing. You know, when we talk about the streaming side or the pub sub side, people think that [00:22:00] oftentimes they need just streaming, because streaming is the evolution of queuing. Is streaming really, um, the same thing as queuing?

You alluded to it not being before. So I'm interested in your take.

[00:22:15] Tim Spann: Well, I mean, it comes down to what your use case is. I think some people are like, is every communication between something and something else messaging? I mean, you're sending a message, whether you call that message an event, an object, a record, a file, bytes, or a blob. That's going and being messaged somewhere.

Maybe it's going to multiple people. Something that's not common in streaming is, you know, a broadcast message, and that's something that, uh, you do with messaging and with Pulsar. I want to send a blast message out. [00:23:00] I mean, there's sort of ways to implement that in Kafka, but it's heavyweight, cuz you're not supposed to be doing that.

A stream is usually coming in, going to a processor, maybe a couple more consumers of a stream. That could be, you know, heavyweight, slowing down your process. But if things are in order and it's exactly once, that kind of implies only, you know, one consumer, or, you know, a failover one, on the other side. A lot of times I want to send, you know, say I wanna send, like, an email.

I'm blasting it out to a thousand subscribers. Uh, it doesn't work if it only went to one and they get it in order and then they put it in a database and then someone has to go read it, and then they get it. You know, a lot of times I wanna do a broadcast message or, you know, a fan-out, or I wanted to [00:24:00] go into, uh, the queue that just acts as a way to process work.

You know, something comes in, I want someone to work on it. I don't care who, whoever's available next. Scale this out as big as possible. That is a use case that's not going away. I want a lot of people to process this, so it makes sense for it to have something like we have in Pulsar for shared: a shared message comes in, whoever gets it first gets it, and you don't have to worry about duplicates or, you know, some way you've gotta manage that yourself.

Messaging queues have that. If you look at, you know, the, uh, enterprise integration patterns, I don't think all those are going away. And those are huge. I mean, there's a whole language for doing all these complex patterns of how, you know, you connect systems together with these pipelines.

Uh, you need some flexibility [00:25:00] in how you subscribe to data and how you, uh, process it along, how long you keep it. You know, there's a lot of decisions that have to be made. So it's like, is messaging this fixed thing that always looks the same, and event streaming is the new shiny thing and you always do it that way?

Uh, sometimes you gotta do both in the same pipeline. Sometimes parts of the data go one way or another. Maybe you gotta split it out. You know, maybe one group is processing it queue style and one group is getting it in order, cuz I'm taking CDC from a table and I'm gonna put it in another database table, and that better be in order for that table.
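
The three delivery styles in this answer (broadcast or fan-out, a shared work queue, and in-order delivery per table for CDC) can be sketched as plain dispatch functions. This is a simplified model with hypothetical names, not Pulsar client code. Pulsar exposes comparable behavior through its subscription types, such as Shared and Key_Shared, but the round-robin and hashing choices below are illustrative assumptions, not how any broker is guaranteed to route.

```python
from collections import defaultdict
from itertools import cycle
from zlib import crc32

# Hypothetical change events: (table name used as the key, change description).
messages = [
    ("orders", "insert row 1"),
    ("orders", "update row 1"),
    ("users", "insert row 7"),
    ("orders", "delete row 1"),
]
consumers = ["worker-a", "worker-b"]


def fan_out(messages, consumers):
    """Broadcast: every consumer receives every message."""
    return {consumer: [body for _, body in messages] for consumer in consumers}


def shared(messages, consumers):
    """Work queue: each message goes to exactly one consumer, round-robin."""
    delivered = defaultdict(list)
    for (_, body), consumer in zip(messages, cycle(consumers)):
        delivered[consumer].append(body)
    return dict(delivered)


def key_ordered(messages, consumers):
    """Per-key ordering: the same key always lands on the same consumer."""
    delivered = defaultdict(list)
    for key, body in messages:
        consumer = consumers[crc32(key.encode()) % len(consumers)]
        delivered[consumer].append(body)
    return dict(delivered)


broadcast_result = fan_out(messages, consumers)
shared_result = shared(messages, consumers)
ordered_result = key_ordered(messages, consumers)
```

With the shared dispatch, the three `orders` changes are split across both workers, which is fine for independent work items but would break a table replica. With key-based routing they stay together on one worker in their original order.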

[00:25:46] Matt Yonkovit: Yeah. So there's a lot of considerations, for sure.

[00:25:48] Tim Spann: It's, it's, it's not, uh, it's not on or off stream or message.

[00:25:55] Matt Yonkovit: Yeah. So Tim, you actually talk to a lot of people. [00:26:00] Like, you give a lot of conference talks on Pulsar-type topics, on kind of modern architecture design. I'm curious, what are the most frequent questions that people ask you in this space?

[00:26:13] Tim Spann: Well, I mean, the questions that come up are, uh, can't I just do this with Kafka? I mean, cuz Kafka's been around longer. People know it. They may already have it in their system and they're like, Why is there another one? You know? It's sort of like, you know, why do I use Kubernetes when there was already VMs?

You know, that one comes up a lot, because people have Kafka and they're like, Why? What's different? Or people confuse Pulsar with Spark, or Pulsar with Flink, or Pulsar with Kafka Streams. I think there's so much nuance out there. It's very easy to not know when someone says, Oh, stream processing or [00:27:00] streams, and people are like, Okay, data.

Most people get the idea of data's coming in. Maybe it's an event. It's a big stream. It's continuously coming. Okay, but after that, who does what part, and what part of the infrastructure does what, gets confusing.

[00:27:17] Matt Yonkovit: Indeed. So I think in the next episode, what we're gonna have to do is define where Pulsar fits versus Spark. I think we have to go there because you brought it up.

[00:27:29] Tim Spann: Yeah, you do, and we have to define what use cases go where. What part covers what thing? Like, that was a common one. People ask me, like, this use case, should I do this in Apache NiFi, in Spark, in Flink, in Pulsar, in a message, or just in a database? Some things aren't a stream,

[00:27:51] Matt Yonkovit: Uh

[00:27:51] Tim Spann: things are a batch.

[00:27:52] Matt Yonkovit: That's true, that's

[00:27:53] Tim Spann: Some things never change. Never move. I mean,

[00:27:56] Matt Yonkovit: I don't know. I used, um, MySQL back in the day for a queuing system for [00:28:00] several apps that I regret.

[00:28:02] Tim Spann: I mean, if something's small, well, so sometimes it doesn't matter.

[00:28:08] Matt Yonkovit: All right, fair enough. Next time we're gonna have Matteo on, and I think that we are going to have to, you know, get that discussion going around:

When do you use, you know, uh, Pulsar, when do you use NiFi, when do you use Spark? I think that these types of discussions are things that could help people, uh, kind of understand this ecosystem a little more. And we'll also dig into a few other topics around, you know, kind of the evolution of this type of system and why it's needed and why it was built in the first place.

[00:28:40] Tim Spann: Good topics.

[00:28:41] Matt Yonkovit: Yeah. All right, Tim. Well, everyone, uh, we hope you enjoyed episode one of Crossing the Streams. We will be back in another week or two with our second episode, and let us know if there are topics or themes that you want us to cover. Thank you very much. All right. [00:29:00]