
Picture Me Coding
Picture Me Coding is a music podcast about software. Each week your hosts Erik Aker and Mike Mull take on topics in the software world and they are sometimes joined by guests from other fields who arrive with their own burning questions about technology.
Email us at: podcast@picturemecoding.com
Patreon: https://patreon.com/PictureMeCoding
You can also pick up a Picture Me Coding shirt, mug, or stickers at our Threadless shop: https://picturemecoding.threadless.com/designs
Logo and artwork by Jon Whitmire - https://www.whitmirejon.com/
Sailing to Byzantium
This week Mike and Erik tackle Byzantine Fault Tolerance! But what's it all about? Gangsters? Generals? Constantinople? Take a journey with us as we sail off into the dizzying complexity of Byzantine faults.
Links
- Some constraints and tradeoffs in the design of network communications | Proceedings of the fifth ACM symposium on Operating systems principles
- Notes on Data Base Operating Systems
- Reaching Agreement in the Presence of Faults | Journal of the ACM
- The Byzantine Generals Problem - Microsoft Research
- GitHub - JVerwolf/byzantine_generals: An implementation of Leslie Lamport's OM algorithm for the Byzantine Generals Problem
- The Byzantine Generals Strike Again
- The Chinese Generals Problem
- SIFT: Design and Analysis of a Fault-Tolerant Computer for Aircraft Control
- The Generals – Dean Eigenmann
- Practical Byzantine fault tolerance | Proceedings of the third symposium on Operating systems design and implementation
Erik:
Hello, welcome to Picture Me Coding with Erik Aker and Mike Mull. Hi Mike, nice to see you this week.
Mike:
Hey there.
Erik:
Mike, I'm in my backyard today. If you hear ambient noises, birds chirping, planes flying overhead, a lot of wind, it's a gusty, windy, breezy day, but it is gorgeous otherwise. So audio a little different today, but hopefully it'll come through loud and clear.
Mike:
It is a nice day in Southern California.
Erik:
It's gorgeous out. Yeah. I just got back from a swim meet, kids swim meet, and it was pretty beautiful out there in the sun and the pool, kids splashing along, racing each other.
Mike:
So just to be clear, you were spectating, not participating, right?
Erik:
I was parenting, you know, kids sports.
Mike:
I bet you could beat most of those kids.
Erik:
Okay. I'm going to tell you, honestly, I had to do volunteer work at the swim meet today. And part of my volunteer work was doing the timing for a particular lane. The way the timers work is they have some kind of radio frequency. And so when they like push some button to start the race, the timer automatically resets. It's assigned to a lane. All you have to do is push a single button when the kid finishes. So I saw a lot of timings today, and I'm telling you honestly, 90% of the kids I saw swim today swim faster than I do. Impressive. And these are kids 7, 8, up to teenagers. So they're pretty quick little kids, and maybe I'm not that fast. You got some music for me.
Mike:
Yeah, there's been a lot of stuff out, I guess, since we've last talked about music. I feel like I have to give a brief shout out to Bob Mould, who came out with a new album, mostly just because I really like the idea of a 64-year-old dude still rocking pretty hard. But the one I've been listening to a lot is this album called Radio DDR by a band called Sharp Pins. I think it's like a side project of some guy who's in another well-known band. But it's this kind of like jangly guitar, slightly retro power pop that I'm a total sucker for. So a lot of fun to listen to.
Erik:
Do you think DDR means Deutsche Demokratische Republik?
Mike:
That is my hypothesis, yes. I did not bother to look it up, but I think there was sort of, I think Radio DDR is a reference to, you know, sort of a German equivalent of radio for Europe or Voice of America or whatever.
Erik:
So for people who don't know, DDR is how Germans referred to East Germany, the German Democratic Republic. In English, we would call it GDR.
Mike:
Yeah, so, you know, nothing revolutionary here, but it's just a little fun listen, kind of the stuff that I have a fondness for, a weakness for, maybe.
Erik:
Power Pop. You know me, I'm like a huge adult contemporary fan, but not just any adult contemporary. It's got to radically push the genre forward. I don't want to just listen to yacht rock, okay? I want to listen to revolutionary yacht rock. I want it to be yacht rock that is melting my mind. And I think that's how Bon Iver fits in. The new Bon Iver album came out. It's called Sable Fable, and I'm a sucker for these records. I know I've told you this before, but there's this scene at the end of Crocodile Dundee 2 where they're playing the theme music at the end. And he might be dead, but then he comes up over the hill and it's burning and he's injured. And there's this inspirational music as they close the film. And I have this weird feeling where every time I listen to a Bon Iver album, it's like, yes, it's like the end of Crocodile Dundee 2. Some sort of 80s, very wide horizon optimism. And it feels good to listen to. I call it adult contemporary that is radical. Experimental adult contemporary.
Mike:
So what makes it radical? Is it like two synths instead of one?
Erik:
There's some funny samples in there. And the songs are slightly less predictable. It feels like it's of the genre, but it's like a Picasso Cubist version of the genre. We'll just sort of like cut up a bunch of pieces and then put them back together in like a weird tile form. You kind of recognize the components and you kind of recognize the feeling it's producing, but it's also slightly off kilter and a little bit like someone took a hammer to that mirror and just sort of smashed it. And then they're like, okay, I'm done.
Mike:
Is there any Michael McDonald backing tracks or anything?
Erik:
Oh, I don't know. He's a Doobie Brothers guy, right?
Mike:
Yeah.
Erik:
It fits right in there, Steely Dan. I don't know. Maybe I'm insulting it, but it's good stuff, actually. I like it.
Mike:
Yeah, I listened to it a couple of times, too. I think it's sort of intentionally divided up into kind of like two parts. But I liked kind of the first part and was less enthusiastic about the second part.
Erik:
I think it's because I was a kid when the music it's maybe paying homage to came out. It maybe cuts a little closer to home for someone who had to live through it.
Mike:
Yeah, that's right. I was there when the Yacht Rock came out.
Erik:
It's hard to go back to, probably. What do you got for me this week?
Mike:
So we're going to get pretty nerdy this week. So put your thinking caps on. I'm going to start talking about this book that I bought many, many years ago when I was just a rookie. I think I had just started working as an engineer, so this was probably late 80s, early 90s, something like that. And I bought this book by a mathematician named Dennis Shasha called The Puzzling Adventures of Dr. Ecco. It's a pretty fun book. He created this character, Dr. Jacob Ecco, who's kind of like a Sherlock Holmes sort of guy. In the book, he's supposedly like a famous mathematician who invented something new and then sort of dropped out of sight for a while. And then he comes back as sort of like this consulting detective guy, except all of his problems are sort of mathematical computer science problems.
Erik:
I wonder if this book is still in print. You still have it?
Mike:
I still have it. I believe you can still buy them. He did several follow-ons to this as well with the same character. And I believe he still works at the Courant Institute in New York.
Erik:
So this is called The Puzzling Adventures of Dr. Ecco. It's a math book, but these are programming puzzles?
Mike:
Yeah, so the setup is, you know, each chapter is like somebody comes to him with a problem. And, you know, it's like, I need to build these buildings and I only have this much time. Is there some way I can do it? Or, you know, my father left me an inheritance, but it's in this secret cave and I can only go into the cave once. And the problems are kind of weirdly set up. But, you know, when you look at them, you realize they're sort of like computer science or discrete math problems.
Erik:
So you tried to solve these yourself before reading the answers? Is that what you did when you read this book?
Mike:
Yeah. So when I bought the book, I would read the chapters and think about them for a while and sit down and try to figure out the answers. Sometimes I got the right answer. Sometimes I got the wrong answer. But I'd usually come up with something and then go look at the answer in the book. Then I got to the first part of Chapter 6, a section called Knowledge Coordination 1. Should I just read out the problem?
Erik:
Yeah, if you have it there.
Mike:
So the problem is stated this way. There are two allied generals, A and B, whose camps are on opposite sides of a ridge. They can communicate with each other only by carrier pigeon. The pigeons sometimes get lost or killed by birds of prey. The generals must decide whether or not to attack the enemy the following morning. Whatever they decide, either they both attack or neither attacks, because an attack by just one would lead to a disaster.
Erik:
I've heard this problem before.
Mike:
Many have now. I had not when I was a rookie.
Erik:
So when you were reading this book, you'd never heard of this problem?
Mike:
I had not heard of it. I did not know the setup. And I was thinking, there's always a solution. So it sounds, when you first see it, you think, oh, this is really tricky. But I'm sure there's a solution.
Erik:
Wait, so I think I cut you off before you actually, what was the actual question they asked in the book where they like, solve this? What did they ask?
Mike:
Yeah, so the setup is the first general sends his message saying attack at dawn. But for him to attack, he needs to get a confirmation back from General B. And so the question in the book is, is there any sequence of messages back and forth between General A and General B that will lead them to the point where they can confidently attack the next morning?
Erik:
And it's like fire and forget async, right? I send off my pigeon. I'm not synchronously getting a response, right? Is that right?
Mike:
Yeah, that's sort of the way. There's kind of a couple of conditions that are implicit here. One is that there's no time limit.
Erik:
Right, that's the async definition usually in the papers, right? There's no time limit.
Mike:
Even if you had a time limit, it's a little bit hard to figure out. But there's also nothing like, so General A can't say, I'm going to attack at dawn. If you get this message, light a giant signal fire. And if I see your signal fire, then I'll know you got my message.
Erik:
Oh, that's like an instantaneous form of communication using light.
Mike:
Yeah, that would be cool. That would work.
Erik:
You can only use these dumb pigeons.
Mike:
You can only use the dumb pigeons.
Mike:
So I thought, okay, you know, there's just... there's something I'm missing here.
Erik:
Wait, wait, you still didn't phrase the actual question the book asked. Did they actually give you a problem to solve out of this?
Mike:
Yeah, so the problem, which I think I stated, was will any sequence of acknowledgements and counter-acknowledgements lead them to attack?
Erik:
Oh, I'm sorry, I missed that. Okay, so is there some sequence where you can set off a pigeon and then later you get some sort of pigeon? Is there some sequence which you can attack with? Like a two-phase commit type of thing?
Mike:
Yeah, so General A sends a pigeon, General B sends a pigeon back, General A sends a pigeon back to General B saying, yes, I got your pigeon. Is there some point where that ends? And so I figured, you know, he wouldn't have given us the problem if there weren't a solution for it.
Erik:
That would be weird to do, right? This is the middle of the book anyway. It's chapter six. Let me just, here's one a little harder, readers, but give it a shot. I'm sure you'll make progress.
Mike:
So I sat down and I went through, you know, all sorts of scenarios and I tried and tried and tried. And literally two weeks later, I said, you know, okay, I give up. This is clearly too hard for me. So I go into the solutions at the back of the book and it essentially says, you know, this is a problem known to not have a solution.
Erik:
Gotcha. Why did you do that to me, Dr. Ecco?
Mike:
That was so unfair.
Erik:
Can we work through a little bit of what the problems are? Like, okay, so I sent off a pigeon. I have no way to know if the pigeon died on the way, if it ever got there, right? There's no way for me to know that. I can't know it was received.
Mike:
Yeah, you can't. General A can't really know that General B received the original pigeon. And, you know, if General B sends back an acknowledgement of receiving it, then General B can't know that General A got the acknowledgement.
Erik:
Okay, so that's one problem, one part of it.
Mike:
Yeah, and you can keep extending that sequence indefinitely without ever reaching a point where it actually works.
Erik:
It's kind of counterintuitive, though, right? It doesn't seem like eventually if you just exchange enough messages, you'd eventually be like, yes, yes. There's some protocol where if we get to 20, then we're definitely going to go forward, right? You're still out there? 20.
Mike:
Yeah, I mean, it's conceivable that, you know, that sort of thing would work in practice. You know, also, if you think about it in terms of like a modern scenario, you sort of think, hey, I'm going to just call this guy on my cell phone and say, hey, let's attack at dawn. And he's going to hear me and say, okay, that sounds good to me. And so it seems like that's a solution, too, if the messages were just exchanged fast enough. But that's really a different scenario if you think about it deeply enough.
Erik:
Well, every time I've heard of this problem, though, there's also the idea that those pigeons, like let's say I'm General A and I get a pigeon from General B. There's a possibility I can't even trust that, right?
Mike:
Yeah, although in this particular scenario, there's no condition where, you know, the other general might try to trick you.
Erik:
Okay, all right. So I'm thinking of how I've heard of this problem later then, maybe.
Mike:
Yeah, so long story short, this problem is a reasonably well-known concept in distributed systems. It would later come to be called the two generals problem, and that's where Shasha was taking the description of his problem from.
Erik:
Now, but can I ask you, what's the difference between two generals problem and Byzantine generals? Are we going to get to that?
Mike:
Yeah, we will get to that. In fact, that's kind of our objective today. We're going to try to get to what's called the Byzantine generals problem.
Erik:
You know what's cool about not knowing anything is I don't have to put my thinking cap on. I can just sort of dumbly follow along and ask the questions.
Mike:
So where's this problem come from? So there's a couple of things. So for a long time, I think it was sort of the name of it was kind of attributed to Jim Gray, who was like a famous database guy.
Erik:
Database guy. Yeah, he did all the research, right? Database research?
Mike:
Yeah, he did database research, won the Turing Award, and then also famously disappeared on a sailing trip.
Erik:
Oh, that's that guy? Should we tell that story?
Mike:
I don't know it, I don't have it here. Well, let me back up just one bit here. So it turns out that the first known statement of this problem was actually in an earlier paper. And it's kind of funny, I guess, because the first paper was something about interprocess communications. So one variant of that would be processes communicating over a network. The paper was called Some Constraints and Trade-Offs in the Design of Network Communications.
Erik:
When do you think that paper came out? Do you have an idea?
Mike:
1970-something? Let me see if I have a date on that.
Erik:
Okay, so somewhere in the 70s, a lot of distributed systems papers.
Mike:
Yeah, I think it was 75, 76, something like that.
Erik:
Okay, I just kind of wanted to locate it with the Liskov and the Lamport and other stuff there.
Mike:
So the authors of that paper make this comment that, you know, if you have a well-designed interprocess communication system, then you really want something in there that's going to tell the user if the communication failed. And they make sort of an offhanded comment that if your interprocess method doesn't have that in it, it's almost impossible for the end user to build that themselves. And then they mention an appendix. And in the appendix, they have this paragraph. A group of gangsters are about to pull off a big job. The plan of action is prepared down to the last detail. Some of the men are holed up in a warehouse across town awaiting precise instructions. It is absolutely essential that the two groups act with complete reliance on each other in executing the plan.
Erik:
Gangsters. That's not how I've ever heard of it. I've never heard of the gangsters. Where did gangsters come from?
Mike:
Yeah, it's kind of a bummer that this didn't turn out to be called the gangster problem.
Erik:
That's a good name for it.
Mike:
Inevitably, it would have been called the original gangster problem, and then that would have been fine. So I don't know if this paper had much impact, but it's now sort of cited as being the paper that first kind of articulated this idea.
Erik:
But just in the appendix, so it was just a kind of aside.
Mike:
Yeah, it was just kind of a throwaway in this paper on interprocess communications.
Erik:
Okay, it starts as the gangsters, and then like a game of telephone, it gets turned into something else later on.
Mike:
Yeah, and then a couple of years later, Jim Gray, he writes this section of a conference proceedings called Notes on Database Operating Systems.
You know, again, this is the late 70s, and it's at a point where database systems are still kind of in their infancy. You know, there's been a lot of research going on, but people are still trying to figure out things. And Jim Gray sort of says in the preface to this document that, you know, this is just like a collection of tricks and tips that people working on databases and doing database research have figured out over the years. He writes a section on two-phase commit. There's a point in there where he describes what he calls the generals paradox. I'll just read it, so you'll see the similarity to what we read before. There are two generals on campaign. They have an objective, which they want to capture. If they simultaneously march on the objective, they are assured of success. If only one marches, he will be annihilated. The generals are encamped only a short distance apart, but due to technical difficulties, they can communicate only via runners. These messengers have a flaw. Every time they venture out of camp, they stand some chance of getting lost. They are not very smart. The problem is to find some protocol which allows the generals to march together, even though some messengers get lost.
Erik:
Yeah, it just sounds like the other one, like the pigeons. The Dr. Ecco book took this version of the problem, right?
Mike:
Yeah, so the Dr. Ecco version is obviously taken from this. He just changed the sort of silly human messengers into pigeons.
Erik:
Pigeons.
Mike:
But it's the same basic idea.
Erik:
So Jim Gray came up with the generals.
Mike:
Jim Gray, yeah, gave it the general's name. And this became known as the two generals problem from then on.
Erik:
So he brings this up in a chapter of this book on databases. What's the chapter about? Like, why does he use this?
Mike:
So he's using it as sort of a setup to discuss two-phase commit in databases.
And, you know, I don't know if we have the ability or time to discuss two-phase commit here, but basically he talks about two-phase commit as a way that you can sort of work around this problem of a never-ending sequence of acknowledgements.
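A minimal sketch of the two-phase commit idea Gray describes, assuming in-memory participants. The `Participant` class and the names here are hypothetical, and real 2PC adds durable logging, timeouts, and recovery; notably, it can still block if the coordinator itself fails between the phases.

```python
class Participant:
    """A toy resource manager: votes in phase 1, obeys the decision in phase 2."""

    def __init__(self, name, will_commit=True):
        self.name = name
        self.will_commit = will_commit
        self.state = "init"

    def prepare(self):
        # Phase 1: vote yes ("prepared") or no ("aborted").
        self.state = "prepared" if self.will_commit else "aborted"
        return self.will_commit

    def finish(self, commit):
        # Phase 2: apply the coordinator's decision.
        self.state = "committed" if commit else "aborted"


def two_phase_commit(participants):
    votes = [p.prepare() for p in participants]  # phase 1: collect votes
    decision = all(votes)                        # commit only if unanimous
    for p in participants:                       # phase 2: broadcast outcome
        p.finish(decision)
    return decision


nodes = [Participant("db-1"), Participant("db-2", will_commit=False)]
print(two_phase_commit(nodes))  # -> False, and both participants end up "aborted"
```

The point of the two phases is that no participant commits until the coordinator has heard a yes vote from everyone, which sidesteps the endless-acknowledgement problem at the cost of blocking.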
Erik:
But with two-phase commit, the difference there is you don't synchronize the two generals. You just sort of put them in order, right? You could say general A attacks, then general B, because general B can see that its knowledge is out of date due to General A's two-phase commit succeeding. You just said, don't talk about it. I just went right in that hole, didn't I? I just put my foot right in the hole.
Mike:
Yeah. I mean, basically what he's saying is we can't rely on this sort of protocol to guarantee that commits are happening and we have consensus in our distributed database. So what we need to do is this slightly more complicated, slower thing to try to ensure that commits are happening reliably. But in any case, you know, he seems to give it the original name.
Erik:
Mm-hmm. The two generals.
Mike:
The two generals. And kind of interestingly, some people have also referred to this as the Chinese generals problem.
Erik:
Oh, I've never heard that.
Mike:
I was not actually able to find any reference to that being used in a paper until later when Leslie Lamport mentions it under that name.
Erik:
Let me summarize a little bit here. So two generals problem. You got the two generals. They're sending messages to each other asynchronously. You can't have any kind of synchronous instantaneous feedback on your messages. And you want to coordinate those two generals so they do something at the exact same time. And there's a proof that they cannot do that. And this is academically interesting. Obviously, it feels a bit counterintuitive. It feels like there should be some sequence of messages that would get this to work. But what's the practical implication of this, do you think? Is this useful? Being told, no, you can't do that.
Mike:
Well, I think it's an interesting problem because, you know, this is related to distributed systems. And in particular, it's related to trying to develop consensus among the nodes of a distributed system. So, you know, fundamentally, this is a consensus problem. You know, if you've got a state machine or you've got a value in a database and you want to make sure that two servers actually have the same value, what do you have to do to make sure that the two things are communicating their value to each other correctly?
Erik:
It sounds hard. Do you have the same counterintuitive response I have, where it feels like there should be a solution? Like when you read this chapter six in The Puzzling Adventures of Dr. Ecco, your expectation was this should be solvable? And at the end, when you were told it wasn't solvable, did you still feel like, oh, but it feels like it should be solvable?
Mike:
Yeah, it seems like one of those things where at a certain point you've exchanged enough messages that you have a comfort level that you've received acknowledgments, or, you know, you agree beforehand that, unless you receive three acknowledgments from me or whatever... It just seems like there should be some sort of protocol you could set up that would solve it. But given the constraints, I could never quite find anything, for obvious reasons. But yeah, I agree. It's counterintuitive and it seems like there should be. Well, I mean, there are ways that people have worked around this just using different setups.
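One way to see why piling on acknowledgements never helps is a toy check in code. This is not the general induction proof, just a sketch under a narrow assumption: the generals use a protocol of the form "attack once you've received at least k messages," with messages alternating between A and B. For every threshold k there is a single lost message that leaves exactly one general attacking.

```python
def counts_after_drop(drop_index):
    """Messages alternate: odd-numbered go A -> B, even-numbered go B -> A.
    If message `drop_index` is lost, the exchange stalls there, so only
    messages 1..drop_index-1 are delivered."""
    delivered = drop_index - 1
    b_received = (delivered + 1) // 2  # B receives the odd-numbered messages
    a_received = delivered // 2        # A receives the even-numbered messages
    return a_received, b_received


def protocol_fails(k):
    """For the rule 'attack iff you received at least k messages', return a
    drop point where exactly one general decides to attack."""
    for d in range(1, 4 * k + 2):
        a, b = counts_after_drop(d)
        if (a >= k) != (b >= k):
            return d
    return None


for k in range(1, 6):
    print(f"threshold k={k}: dropping message {protocol_fails(k)} splits the generals")
```

Losing message 2k always leaves B with k messages and A with k-1, so raising the threshold never closes the gap; it just moves the fatal message later.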
Erik:
So we started with gangsters. Jim Gray turned them into generals. It shows up in this puzzle book, The Puzzling Adventures of Dr. Ecco, that you had. Where's the Byzantine generals come from? That's different? I thought maybe I'm still thinking of a different problem, I guess.
Mike:
Right. So I have to digress a little bit to a paper from 1980. This will sound off topic because there are no generals, but bear with me a little bit. So, in 1980, Leslie Lamport, along with two other authors, Marshall Pease and Robert Shostak, wrote this paper called Reaching Agreement in the Presence of Faults. So, you can tell from the name that it's basically a paper about trying to get consensus among a set of servers, more than two now.
Erik:
Does that matter, more than two? It matters, right, because you can't achieve it if you have a small number?
Mike:
It matters, I think, partially because, you know, it's sort of like a more realistic system now. You've got an arbitrary number of servers constituting your system, and you want to achieve some sort of consensus. And you're also, you know, proposing this scenario where some of the servers in your system may be faulty.
Erik:
I guess stupidly in the N equals two solution, you can't just be like, well, there's only one working one. We're just going to go with that.
Mike:
Right. So they give sort of an idea of what they mean by faulty. But, you know, you can sort of think of it as either the faulty ones don't return a response, or potentially they return the wrong response.
Erik:
This is right around the time of FLP too, right? In the presence of a single faulty server, no consensus is possible?
Mike:
I believe this actually came out a couple of years before FLP. So the FLP result was not yet known to them when they published this. So again, there's like no metaphors here. No generals, no armies, no pigeons, no people.
Erik:
No gangsters? That's a bummer.
Mike:
No gangsters. It's just processes and some sort of communication between these processes. And I think they even stipulate that the network is reliable here. So it's not a situation where there's like network partitions or whatever.
Erik:
Okay, interesting. Reliable network, can we get a consistent result?
Mike:
Right. So they come up with this surprising thing, which I think is also not intuitive. What they discover is that if there are n total processes in the system, the only way you can reach a consensus is if n is at least 3m plus 1, where m is the number of faulty ones. If there's only one faulty processor, then you would need four processors total to make this work reliably. Even with one faulty processor, you'd still need three good ones to get a reliable consensus value, which I think was a surprise to them and a lot of other people, too, who assumed that probably majority rule would work here.
Erik:
Now, I've read the later Lamport paper, and I've got to admit, I didn't really understand it. And I've seen this 3m plus 1 thing, and I really didn't understand that either. And I don't understand the proof. I'm just looking at it like, okay, I guess I'll just accept it takes 3m plus 1 total servers to get a consistent result if m are faulty. All right, I will accept that. Do you actually follow the proof? Is there some intuition you have for why you need three times the faulty ones plus one to get an overwhelming number of correct servers?
Mike:
Probably not well enough to describe it on a podcast. I think in the later paper, where they do use a metaphor of sorts, it makes a little bit more sense. Yeah, it's tricky and it's surprising. They also introduced the idea here that one source of fault is a processor intentionally sending you a wrong answer. So there could in this situation be a malicious server which is trying to, for whatever reason, defeat the consensus.
Erik:
It got compromised, yeah. Like I've got a bank or something, and somebody's trying to load up their account with a lot of money.
Mike:
Yeah, that's a good example. So, you know, one problem there is that since all of the processors are communicating one-to-one, if there's one faulty process, say there's A, B, C, and D and D is the faulty one, D could actually send different results to A, B, and C, which complicates the problem, because now they're not even reliably working from the same set of answers. So one thing they proposed in this earlier paper is we could use something like a message digest, where the message sent by the faulty process has to be signed. And so if they try to lie about the response in there, there's sort of a way for us to say, okay, this was what they actually sent. This was not modified in...
Erik:
Wait, wait, wait, wait. I have a question there. So in the Byzantine Generals paper, that later paper, don't they talk about how it's like generals send messages out and then those messages get propagated by their receivers? So if A sends to B, then B sends to C like, hey, I got this message from A. Am I misremembering? Isn't that what the paper... And so this message digest thing, like it's not... When I send you a message, it's not necessarily a message I'm originating, it may be one that I got from somebody else and I just want to make sure you heard it too. Is that what's going on here?
Mike:
Yeah, that's also in play here. So again, let's say we have A, B, C, and D. A sends their value to B, C, and D. B sends their value to A, C, and D, and so forth. One way you can sort of also verify that people are getting the same answers is that after A sends his answer to B, B can say, hey, here's the answer I got from A. Is this the same as you got from A?
Erik:
Okay, yeah. And the message digest thing helps there because you could have like a signature, for example, right? And C could be like, oh yeah, that looks like that was signed by A. I can tell that.
Mike:
Yeah. So there's no potential here for one of the generals acting like a man in the middle who's changing the messages and passing on different values to different generals.
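The signed-message idea can be sketched with Python's standard `hmac` module, though this is only illustrative: the paper's unforgeable signatures behave more like public-key signatures that anyone can verify, whereas an HMAC needs a shared key. The key and messages here are made up.

```python
import hashlib
import hmac


def sign(key: bytes, message: bytes) -> bytes:
    """Produce a tag that only holders of `key` could have computed."""
    return hmac.new(key, message, hashlib.sha256).digest()


def verify(key: bytes, message: bytes, tag: bytes) -> bool:
    """Check a relayed (message, tag) pair without trusting the relayer."""
    return hmac.compare_digest(sign(key, message), tag)


key_a = b"general-a-secret"  # hypothetical key for General A's messages
order = b"attack at dawn"
tag = sign(key_a, order)

# B relays (order, tag) to C; C can check it really came from A, unmodified.
print(verify(key_a, order, tag))               # -> True
print(verify(key_a, b"retreat at dawn", tag))  # -> False: tampering is detected
```

This is exactly the tamper-evidence Mike describes: a traitorous relay can drop a signed message, but it can't alter it without the change being detectable.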
Erik:
This problem is not easy for me to understand. And I don't think you're making it easier. I think you're making it harder on me. I honestly was like, okay, cool. We're going to talk about this topic. I find it interesting. That's great. I'm going to get it. But I feel like I'm getting farther away from my goal, which is understanding it.
Mike:
That's why they call it Byzantine, dude.
Erik:
But Byzantine, in your telling of the story, hasn't even emerged yet on the scene. I just keep saying, isn't this the Byzantine generals paper from Lamport? I keep kind of cutting to the chase there.
Mike:
Yeah, so let's cut to the chase. So not long after these three authors published this boring-sounding paper about consensus in the presence of faults, they wrote another paper called The Byzantine Generals Problem. And again, I will read this one because it'll probably sound familiar to you now. The way they put it in the abstract is: Reliable computer systems must handle malfunctioning components that give conflicting information to different parts of the system. This situation can be expressed abstractly in terms of a group of generals of the Byzantine army camped with their troops around an enemy city. Communicating only by messenger, the generals must agree upon a common battle plan. However, one or more of them may be traitors who will try to confuse the others. The problem is to find an algorithm to ensure that the loyal generals will reach agreement.
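The algorithm from that paper, OM(m), can be sketched in a few lines (the JVerwolf repo in the show notes is a fuller implementation). This is a toy under an assumed traitor model picked just for illustration: a traitor equivocates, telling even-numbered generals "attack" and odd-numbered generals "retreat". With four generals and one traitorous lieutenant, the n = 3m + 1 minimum for m = 1, the loyal lieutenants still agree on the commander's order.

```python
from collections import Counter

DEFAULT = "retreat"  # fallback order when the votes tie


def msg(sender, receiver, value, traitors):
    """What `receiver` hears from `sender`. A traitor equivocates by parity."""
    if sender in traitors:
        return "attack" if receiver % 2 == 0 else "retreat"
    return value


def majority(values):
    (top, n), *rest = Counter(values).most_common()
    if rest and rest[0][1] == n:  # tie -> fall back to the default order
        return DEFAULT
    return top


def om(commander, lieutenants, value, m, traitors):
    """Lamport's OM(m): return each lieutenant's decided value."""
    if m == 0:
        return {l: msg(commander, l, value, traitors) for l in lieutenants}
    # Step 1: the commander sends its value to every lieutenant.
    votes = {l: [msg(commander, l, value, traitors)] for l in lieutenants}
    # Step 2: each lieutenant relays what it heard to the others via OM(m - 1).
    for j in lieutenants:
        others = [x for x in lieutenants if x != j]
        relayed = om(j, others, votes[j][0], m - 1, traitors)
        for i, v in relayed.items():
            votes[i].append(v)
    # Step 3: everyone takes the majority of what they heard.
    return {l: majority(votes[l]) for l in lieutenants}


# General 0 commands 1, 2, 3; lieutenant 3 is the traitor.
print(om(0, [1, 2, 3], "attack", 1, traitors={3}))
# -> {1: 'attack', 2: 'attack', 3: 'attack'} -- the loyal lieutenants agree
```

Running it with a traitorous commander instead (`traitors={0}`) shows the other half of the paper's conditions: the loyal lieutenants all land on the same value anyway, agreeing with the commander when he's loyal and with each other regardless.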
Erik:
And they're Byzantine for some reason, right? That's the Roman Empire, the end of the Roman Empire.
Mike:
Yeah, so my understanding is that, you know, Lamport wanted to come up with this sort of metaphorical description of the problem, for which he had a tendency, if you've heard us talk about Paxos in the past. But he didn't want to offend anyone.
Erik:
Oh, like Chinese generals was potentially offensive.
Mike:
Yeah. So if you're going to accuse people of being traitors, you don't want to call out some particular country or whatever. And so he figured there are probably not that many Byzantine folks around anymore. Although I also heard a story that he was originally going to call it the Albanian generals problem, and then somebody pointed out to him that there were still Albanians.
Erik:
So he came up with Byzantine.
Mike:
Yeah. So they called it the Byzantine generals problem. And the weird thing is that this is roughly the same paper as the previous one.
Erik:
Oh, it is?
Mike:
It's got a few new things, but it's not really a new result.
Erik:
Can you do that? Can you just write a new paper that's like the same paper you wrote before? I didn't know you could do that. What is that like? Like, hey, we're going to re-release these songs, but we're going to produce them differently? Taylor Swift did that, right? It's about earning the money, right? Wasn't it?
Mike:
Yeah, that's probably why they did it, for the dollars.
Erik:
To own the rights to their paper?
Mike:
Yeah, I mean, to be fair, Lamport does have precedent here, since he wrote multiple Paxos papers. You know, it's like, here's the Paxos paper, and then there was the Paxos paper that you could maybe understand.
Erik:
Wait, wait, but Paxos was the 90s, wasn't it? Isn't Byzantine generals earlier than that?
Mike:
It's in the 80s, yeah.
Erik:
So they wrote the same paper. I'm still confused by that.
Mike:
Yeah, so Lamport, as we talked about in a previous episode, works for Microsoft Research, so this paper you can find on the Microsoft Research website. At some point, Lamport wrote a prologue that links to the paper, and he talks in there about how he'd always thought that Dijkstra's dining philosophers problem probably got more attention than it deserved. He thought that was because it had this clever name.
Erik:
Really? That's funny.
Mike:
There's a sentence in this prologue that says the main reason for writing this paper was to assign the new name to the problem.
Erik:
What? What? Wait a minute. The paper that I know, that I've heard of anyway, he wrote that just to give it a name that people would remember?
Mike:
That's what he claims, yeah.
Erik:
That's funny.
Mike:
So it's not like literally the same paper. They didn't just copy and paste the old paper and change a few things to give it color. There are some new results here, some new proofs. And there's a new section toward the end where they talk about what happens if the generals can't communicate directly and have to communicate through some other generals, an intermediary, kind of like a case where some network link is broken, and so they have to communicate through a different route or server or something.
Erik:
So, sidebar here. Let's just give our listeners the dictionary definition of Byzantine, because when I saw Byzantine generals, I always thought, oh, they're old generals from an ancient civilization. But this word also just means super complicated, right? So the dictionary definition of Byzantine, I took this from Merriam-Webster: related to Byzantium, which was the Roman name for Istanbul, and that is the Eastern Roman Empire period. It ends with the fall of Constantinople. That's Byzantine. There's also the architectural style of this period and location. But the sense of Byzantine with a lowercase b, which is what I usually think of: a devious or surreptitious manner of operation, they write in Merriam-Webster. And they also give the definition intricately involved or labyrinthine. That's what I thought Lamport was talking about. Like, here's a super complicated thing. So these are Byzantine generals. Everything they can do is really complicated, right? Or maybe they're being devious and we don't know; we can't actually tell if they're following the protocol or they're lying.
Mike:
Yeah, I actually thought that was the source of the name too. You know, I had heard the story about the word Byzantine coming from the idea that the bureaucracy of that period in history was so elaborate and layered that to get anything done, you had to go through all of these elaborate processes, and that's the source of the dictionary definition. So I just assumed that's why they had chosen the name for this problem, and maybe that was part of the inspiration. But apparently it was just, you know, he wanted it to have a cool name, and he wanted it to not reference a country that still existed.
Erik:
In my defense, I gotta say, I've tried to read this paper probably three times and it confuses the shit out of me every time. So first of all, with the two generals problem, it's like, I've got general A, I've got general B. Okay, I can hold that in my head. Lamport's Byzantine generals, they've got generals, they've got lieutenant generals, there's loyal lieutenants and disloyal lieutenants, and somehow I just cannot hold all this information in my head long enough to understand what's going on. I just have to sort of accept the conclusion. Like, okay, I get it. Not I get it, but I accept what you're saying. 3M plus 1, right?
Mike:
That's correct, yeah. And they do actually state the problem a little bit differently in this paper. As you mentioned, they formulate it in terms of a commanding general and some lieutenant generals. So instead of just having some arbitrary number of generals passing messages between each other, now one of the generals is the commanding general, and the rest are, I guess, subordinates you think of as lieutenant generals, but still generals.
Erik:
So this has some kinship with Paxos, right, where you've got deciders and proposers and stuff. It's sort of like that; you've got the general sending orders. It also has some kinship with some of the other distributed systems papers, where it's like, I have a receive queue and a send queue, sort of, maybe? You could map those ideas onto this?
Mike:
I think that's probably correct. A lot of these ideas, they're sort of developing ideas of consensus as they write these papers, and so things that we think of now as common in consensus systems are probably emerging here. Let me see if I can state this: they put it in such a way that they can state it in terms of just the messages sent from a particular general, which is why they're able to reformulate it in terms of a commanding general and lieutenant generals.
Erik:
My gut is like, do we need the lieutenants? Can we get rid of the lieutenants? It's too much information.
Mike:
But it's roughly the same thing, you know. So they're passing these messages and nobody sees everything. If the commanding general has two lieutenants, they don't see the other lieutenant's message directly. They only get it indirectly, when the other lieutenant passes along what they say they got from the general. So there are these rounds of message passing again, where the more generals there are, the more message passing they have to do in a series of rounds, because they have to send their messages to the other generals, and then they have to take the messages that they got from the other generals and communicate those to the other generals to verify that they're the same as what they got.
Erik:
I'm lost. I'm so lost. So the other problem with this too is like in the two generals problem, it's, hey, let's attack at seven or let's attack at dawn. But in the Byzantine generals problem, it's like, let's attack or let's retreat, which also complicates it. Maybe I'm getting hung up on something that's not actually complicating it. But definitely the generals and the lieutenants get me totally lost.
Mike:
Yeah, it's, I think, even slightly more complicated than that, because the way they talk about it in the paper is, it's possible that even the loyal generals can have a difference of opinion about what to do.
Erik:
Oh, I don't even remember that. That's in the paper?
Mike:
Yeah, so imagine there are 10 generals overall, and one scenario: there are seven loyal generals and three disloyal generals.
Erik:
That satisfies our 3M plus 1, right?
Mike:
Yeah, so this should satisfy the constraint. So if all seven of the loyal ones say, let's attack, and then the three disloyal ones say, no, let's retreat, that should not screw up the decision, so long as the generals have agreed beforehand that what they're going to do is go by majority vote.
Erik:
Oh, yes. Sorry, I remember. I recall now. Yeah, right. Go ahead. Keep going. Sorry.
Mike:
Every general is going to get a value from all the other generals plus their own. So at
the end of the process, every general is going to have 10 answers to the problem.
Erik:
Then you've got seven attacks, three retreats. Pretty overwhelming. Let's just go for it. We're going to win.
Mike:
So in that case, the loyal generals come to an agreement. They attack and the desired outcome is achieved.
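As an editorial aside, the majority-vote tally Mike describes can be sketched in a few lines of Python. The vote values are illustrative, not from the paper; this is just the final counting step each loyal general would perform on the ten votes it has collected:

```python
from collections import Counter

# Ten generals: seven loyal generals vote to attack, three traitors
# vote to retreat. After the message-passing rounds, every loyal
# general holds the same ten votes and applies the agreed-on rule:
# majority wins.
votes = ["attack"] * 7 + ["retreat"] * 3
decision, count = Counter(votes).most_common(1)[0]
print(decision, count)  # attack 7
```

With a 7-to-3 split, the three traitors cannot move the outcome; it's only in the 4-to-3 loyal split discussed below that their votes become decisive.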
Erik:
Aeneas running out of Troy. Ilium with his dad on his back. Got it.
Mike:
Exactly. But it's also possible that among the seven loyal generals, four of them might say, let's attack. And three of them will say, let's retreat.
Erik:
Right. Split. Right. And then you have three retreats, six retreats potentially, because the disloyal are also on the retreat side.
Mike:
Yeah, so then the disloyal can actually affect the outcome. And so the question there, which they put in the paper, is: is that a problem? In one sense, the outcome is different than what the majority of the loyal generals want, but also the loyal generals were split four to three, so they were already pretty unclear on what needed to happen. So is the fact that the disloyal generals influence the decision all that big of a deal here?
Erik:
You could reasonably be expected to support attack or retreat. It's like, eh, it could go either way. Yeah.
Mike:
They actually do give an algorithm in this paper that correctly passes all the messages between the generals. If you've got M faulty or traitorous generals, then you need to have at least 3M plus 1 generals overall to make it work. Assuming that is the case, they have this algorithm they call the OM(m) algorithm, which will effectively get the right information to the right generals.
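For listeners who want to see the shape of the algorithm, here is a rough Python sketch of the recursive OM(m) scheme. This is a simplified reading of Lamport's algorithm, not the paper's full formulation: the traitor behavior here (always relaying the flipped order) is one made-up attack out of the arbitrary behaviors the paper allows, and the node names are illustrative:

```python
from collections import Counter

FLIP = {"attack": "retreat", "retreat": "attack"}

def majority(values):
    # The agreed-on decision rule: take the most common value seen.
    return Counter(values).most_common(1)[0][0]

def om(commander, lieutenants, order, m, traitors):
    """Simplified OM(m): returns the value each lieutenant settles on.
    A traitorous commander relays the flipped order to everyone."""
    sent = {lt: FLIP[order] if commander in traitors else order
            for lt in lieutenants}
    if m == 0:
        return sent
    # Recursive step: each lieutenant acts as commander in OM(m-1),
    # re-sending the value it received to the other lieutenants.
    relayed = {lt: om(lt, [o for o in lieutenants if o != lt],
                      sent[lt], m - 1, traitors)
               for lt in lieutenants}
    # Each lieutenant votes on its direct value plus the relayed ones.
    return {lt: majority([sent[lt]] +
                         [relayed[j][lt] for j in lieutenants if j != lt])
            for lt in lieutenants}

# n = 4, m = 1: a loyal commander and one traitorous lieutenant.
result = om("C", ["L1", "L2", "L3"], "attack", 1, traitors={"L3"})
print(result["L1"], result["L2"])  # attack attack
```

With four generals and one traitor, the bound 3M plus 1 is met, and both loyal lieutenants outvote the traitor's flipped relay and settle on the commander's order.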
Erik:
What's the right information, the loyal general's answers, you're saying?
Mike:
Yeah, so according to the conditions of the paper, what they want to happen in the end is that every loyal general has to get the same set of answers in the end. All seven of the loyal generals would have the same set of 10 responses when the algorithm is complete.
Erik:
So hold on. You already said this earlier, but just to restate, that means disloyal general number four can't send attack to somebody and retreat to somebody else. You're saying everybody gets all the same answers.
Mike:
Yeah. After you go through all the rounds of passing messages, you've exchanged everything, because the more generals you have, the more rounds of data passing you have to go through to get everybody to have the same set of answers.
Erik:
Because you could be like, wait a minute, you told two to attack and you told me to retreat. Which is it?
Mike:
Right. Yeah. So like in that case where one of the disloyal generals has sent different responses to different people, the loyal generals have to collect enough values for that disloyal general to know what the majority response is that all of them have seen. So it requires several rounds of message passing to happen.
Erik:
Isn't this kind of like quadratic too because it's like I send messages to everybody, I get everybody's messages, I get everybody's messages to everybody else?
Mike:
Yeah, this is not necessarily a practical algorithm.
Erik:
Okay. Was I wrong? Is it exponential? Is it like...
Mike:
I would have to think about that, but it definitely grows pretty rapidly in the number of rounds.
Erik:
Yeah, it's one of those swoopy parabola graphs. Okay. I don't know if you're going to bring it home for me here. Maybe I'm asking for something I just cannot grasp. Where do I go with this?
Mike:
So I think the interesting thing about this paper, whether we understand it or not, is that when they proposed it, it was an interesting distributed systems problem and something that they were potentially going to have to deal with in real systems. But again, remember that the Byzantine Generals paper came out in 1982. That is significant because it's considerably before there were distributed systems on the internet.
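On Erik's earlier question about whether the message count is quadratic or exponential: a back-of-the-envelope recurrence (our own sketch, not from the paper's text) counts the messages one run of OM(m) sends, and it comes out on the order of n to the power m+1, so worse than quadratic once you tolerate more than one traitor:

```python
def om_messages(n, m):
    """Messages sent by one run of OM(m) with n nodes: the commander
    sends to n-1 lieutenants, then each of those lieutenants runs
    OM(m-1) among the remaining n-1 nodes."""
    if m == 0:
        return n - 1
    return (n - 1) + (n - 1) * om_messages(n - 1, m - 1)

# Polynomial in the number of generals for a fixed m, but it blows
# up quickly as m (the number of traitors tolerated) increases.
print(om_messages(4, 1))   # 9
print(om_messages(7, 2))   # 156
print(om_messages(10, 3))  # 3609
```

That growth is a big part of why Mike calls the algorithm impractical, and why later work like PBFT aimed at something cheaper.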
Erik:
Yeah, it's pretty early to be thinking about this stuff. Can I ask a dumb question? Jim Gray's two generals result is saying there is no answer, no protocol that's proven to solve it. Is that because it's two? Is the Byzantine generals paper saying, well, if you have 3M plus 1, it is possible to achieve a consensus?
Mike:
I think that is the case, because you can sort of agree beforehand: all the generals agree, this is the protocol we're going to use to decide what the answer should be.
Erik:
Right, a protocol.
Mike:
Which, you know, might be majority voting or whatever. But in the case of two generals, there's not really a way you can use majority voting.
Erik:
I guess I want to know if we can use Byzantine generals to figure out, when we're ordering lunch on Slack, are we ever going to agree on where we're ordering it from and what we're going to get?
Mike:
Yeah.
Erik:
Does it get harder the more people on the team? Go ahead.
Mike:
I'm actually surprised that that application does not exist. I was actually having a discussion with somebody about this just a couple days ago. Back when I worked for Yahoo, they had a contest to have people develop Yahoo Messenger applications, because they were trying to popularize it. And I had an idea for an application that would essentially do this very thing, where you would propose four or five places to go to lunch, people would choose their favorites, and you would try to come to a consensus on where the group was going to go to lunch.
Erik:
But it's maybe like these people are competitive. They're disloyal. They're proposing stuff that'll make people sick. So they drop out. They're no longer competing against them at work that day.
Mike:
Yeah. Or, you know, there's always the guy who wants to go to Subway or something like that, you know.
Erik:
You're saying the protocol disallows Subway as an answer to the Byzantine lunches problem. I forget where we were going with that.
You said this paper came out in 1982, very early distributed systems research, before the internet. And that's a fair point. But where are you going with that?
Mike:
So, two things happened in subsequent years that made this problem considerably more important. One was the internet. You are now in a position where you're not running your distributed systems on some internal network that you have control over any longer. You might be communicating with servers where you didn't write the code. You have an expectation of the answer, but you can't really rely on the answer. Another thing that happened much later on, and which has caused research in this area to explode again, is blockchain.
Erik:
Oh, yeah. That's what I think of. I think of Byzantine fault tolerance.
Mike:
Right. So this whole idea of dealing with faults and potentially malicious servers in a distributed system over time became known as Byzantine fault tolerance. And it's still a pretty active area of research.
Erik:
Can I ask, so when I think of Byzantine fault tolerance, I think, again, it's the Byzantine generals problem, but we've got a collection of servers. And those servers might be just broken, or they might be malicious, and I really have no way to know. I mean, broken could look like malicious. Malicious could look like broken. Working could also look like those things if I'm not sure. So that's how I think of it. Is that a fair informal description?
Mike:
It is, yeah. And then in the case of blockchain later on, blockchain generally is, by definition, decentralized. So now you are in a position where you can't even control the participants in the distributed system.
Erik:
Massive shared ledger. Anybody can write to it. How in the world do you agree on a consistent reckoning of who paid what to whom?
Mike:
Yeah. And, you know, people have incentives to try to figure out ways around the consensus. Yeah.
Erik:
I mean, if you can only buy some pizza for 10 Bitcoin, then maybe the incentives are smaller. But if you're buying hundreds of thousands of dollars worth of pizza for every Bitcoin, then yeah, I guess the incentive is there to try to figure out how to mess with that system.
Mike:
There is a lot of incentive. Although, to be fair, it's probably easier just to scam people than it is to defeat consensus. So anyway, yeah, this became a very significant area of research in the years after the problem was proposed by Lamport and Pease and Shostak. And, you know, Lamport's idea was that if they gave it a fun name, it would get more attention. Maybe that's part of it, but I think the emergence of the internet and these giant distributed systems probably had more to do with it.
Erik:
I think he's on to something. For a while there, people were collecting together technology stacks and then giving them names, and then it'd be like, hey, are you using this stack? And it'd be like, what the hell are you talking about?
Mike:
Yeah, yeah, I can see that. You know, it's a lot more fun to say I'm using the LAMP stack than it is, yeah.
Erik:
Exactly, yeah. You also pointed me at a paper called Practical Byzantine Fault Tolerance, and I muddled my way through it. I could almost kind of squint and almost sort of see what they're talking about. That's a Miguel Castro paper, right?
Mike:
Yes. So this goes all the way forward to 1999, at which point, you know, the internet is in play. They say in the abstract that they think these algorithms will be increasingly important in the future, because malicious attacks and software errors are increasingly common and can cause faulty nodes to exhibit arbitrary behavior.
Erik:
Wow, software errors too. I've got to consider that, because honestly, Mike, when I think about this problem, I think, yeah, let's just make those infra guys make the firewall really strong. I've got all my trusted stuff in my trusted network. I don't really have to try to solve Byzantine fault tolerance, right? It just sounds way beyond my skills and capabilities, permanently. So maybe David Beazley should do a Practical Byzantine Fault Tolerance paper class and try to get his students to implement this. Did you read this paper and think, I could build what they're talking about?
Mike:
No.
Erik:
You're like me. Go ahead.
Mike:
I read the paper, and I was like, okay, they used this to build a network file system, I believe. And I was like, yeah, maybe with four or five years, or, you know, a group of postdoctoral researchers, I could probably hack this together. But the idea here was, you know, Miguel Castro was a student of Barbara Liskov, who we talked about on a previous episode. She's mostly known for her abstract data type work, but she also did a lot of stuff on distributed systems. And I think the idea with this method was that they had reached a point in history where they actually had to solve this problem. People had been trying to do it for a number of years, and most of the systems had been somewhat impractical, either requiring synchronous calls or setting things up in such a way that they could be DDoSed and things like that. The ideas in this paper were an attempt to, as the title says, make Byzantine fault tolerance practical.
Erik:
If I got nominated to do that work, I'd be like, I think I'm just going to nominate someone else. Can I give someone else this ticket?
Mike:
Yeah, that would be a tough Jira ticket. You might have to break it up into an epic.
Erik:
I have one other question for you. Besides blockchain, do you know of any other application where Byzantine fault tolerance appears, or the Byzantine generals problem appears?
Mike:
Yes. They are in certain, like, network file systems. So I think I want to say Dropbox, but that might not be right. This stuff actually comes up in, like, sensor networks, too. So, like, when you're trying to look at sensors on airplanes and things like that, it's less likely that one of those sensors will be malicious. but it's sort of crucial that you be able to deal with the possibility that it might be faulty.
Erik:
Oh, okay. Okay, so possibly faulty, and we have to figure that out somehow with our protocol.
Mike:
Yeah, you know, like, you know, half of our sensors say we're upside down, the other half say we're right side up. Yeah.
Erik:
Wow, in a single plane, that's actually really interesting. So maybe you don't have to figure out that they're faulty, but you have to figure out the right course of action in the presence of faulty sensors.
Mike:
Yeah, and there are some actual papers over the years on systems like this. There was an early paper called SIFT that was about the design of a fault-tolerant computer for aircraft control.
Erik:
Jeez, this is heavy stuff. So we went pretty far here. We started with your book that you enjoyed, and it's an unsolvable problem, two generals. Then we went back, and we found out they were originally two groups of gangsters, and then we went forward again to the generals, and then we found out that Lamport just really wanted to give it a sexy name, and his sexy name was Byzantine generals, and that's how we know the problem today.
Mike:
Interesting chain there, I thought.
Erik:
Well, it's a good journey. I'm glad that I too have gone on it with you. You know what, though? I'm really glad that you are not going to be testing me on this at the end of today's episode. I don't think I would pass.
Mike:
Did I not mention the test?
Erik:
No, you didn't mention the test.
Erik:
Do I have to build something? If I have to build something, I want to go back and study for a few months.
Mike:
Fair enough. We'll do another episode sometime in August about your attempt.
Erik:
Thanks for listening, everybody. Hopefully we did a passable job here on Byzantine fault tolerance, or actually, hopefully Mike did a passable job and dragged me along with him. If you'd like to reach out to us, you can email us at podcast@picturemecoding.com. We do have a Patreon. If you'd like to support the show for $4 a month, you can become a supporter of the show, and we will thank you on the program. This has been Picture Me Coding with Erik Aker and Mike Mull. Thanks so much for listening. See you again next week. See you next time.