Serverless Craic from The Serverless Edge

Serverless Craic Ep46 Resilience Hub

October 05, 2023 Treasa Anderson Season 1 Episode 46

How important is Resilience Hub, Chaos Testing and Well-Architected?
We attended the AWS Resilience Day at the Titanic Hotel. We were sitting in the same room where the ill-fated Titanic was designed and drawn! We discuss what we learned, including the tools and strategies that help software engineers build resilience and that were not available to the Titanic engineers. And we talk about the fact that it isn't just one thing that leads to disaster for ships or workloads.

Check out our book The Value Flywheel Effect
Follow us on X @ServerlessEdge
Follow us on LinkedIn
Subscribe on YouTube

Dave Anderson:

Hi, folks. Welcome to the latest edition of Serverless Craic. We're back from a summer break, so we thought we'd get things going again. My name is Dave Anderson. I'm in sunny Greece for a conference. Hey, Mark, how's it going?

Mark McCann:

It is going good. I have been event storming this morning, which was good craic. I am looking forward to getting back into recording Serverless Craic videos.

Dave Anderson:

Mike, how are you?

Michael O'Reilly:

I'm good. I'm well rested after our summer break. And keen to get into serverless craic.

Dave Anderson:

One thing I notice is that Serverless is becoming a normal thing. A lot more people are talking about EDA (event-driven architecture) and well-architected. There's a maturity thing happening with a lot of the stuff in the book. There's a lot of interest around modernization. We're in the cloud. Now, how do we take it to the next level? For me that is Serverless, EDA and well-architected. A lot of people talk about resilience. As Werner Vogels says, 'Things fail all the time'. So how do you make sure your workloads are resilient?

Mark McCann:

With re:Invent early season there are announcements on Serverless, EDA and well-architected services and capabilities. It's never been easier to build a well-architected, resilient and reliable workload on AWS. The tools and capabilities are coming online to make it easy for teams. In the past, you needed to acquire knowledge to understand and deliver a resilient and reliable workload. The guidance is getting better to provide you with guard rails.

Dave Anderson:

Adrian Cockcroft talked about continuous resilience for years. Adrian was way ahead of the curve. When he started talking about it, it was hard to do. It is still hard, but there are good tools available. We were at an AWS Resiliency Day in Belfast. There were discussions on chaos testing, Resilience Hub and correction of errors.

Mark McCann:

It was a really well run event. The speakers were fantastic, and the content was good. It covered everything from the Well-Architected Framework to resilience modelling capabilities. In other words, the Resilience Hub tool, which looks very promising. And the Chaos Engineering practices that we've been advocating for, with the AWS Fault Injection Simulator (FIS), which is also good. They also looked at theoretical stuff around disaster recovery and business continuity planning. And one thing I liked was the correction of errors, incident analysis and post-incident responses. How do you incorporate these things into your ways of working to make sure that they don't happen again? All in all it was a good day and if you get a chance you should go.

Dave Anderson:

It wasn't lost on us that the event was in the Titanic Hotel, in the drawing room that the designers of the Titanic used. It is a big room with loads of windows on the ceiling that was built over 100 years ago. It's where the architects sat and designed the ships built by Harland and Wolff. We were looking at Titanic artefacts and talking about Resilience. A lot of reminders why this stuff's important!

Mark McCann:

Speakers later in the day referenced it. It was a gift that kept on giving. One of the things we discussed is that it is never just one thing. Delivering a reliable and resilient workload does not come down to just one thing. My daughter is researching Titanic for her school project. It wasn't just the fact that Titanic hit an iceberg. The ship was on fire when they left the harbour. The guy who had the keys for the binoculars didn't make the trip, so they couldn't see the iceberg. They didn't have enough lifeboats because of time pressure to set sail on time. Why were they going too fast? It's because the coal was on fire. They were doing 22 knots instead of a normal speed. And there were time pressures from White Star Line to beat the record. It's not just one thing. There are lots of different things that culminate in a disaster. The small things that seem innocuous or inconsequential in isolation build up and result in a tragic disaster.

Michael O'Reilly:

Planning for Resiliency for a workload or app must be proactive. With Titanic, did they try to simulate any of these things, or practice their reactions? How would they detect when issues happened? Technology helps with Resiliency, but a lot of it is just technique, process and practice. How do you plan for unforeseen circumstances? Adrian Cockcroft discusses his Netflix experience with chaos engineering, chaos testing, chaos monkeys and chaos gorillas. And it's an interesting way of thinking about stuff. What happens if we lose this part of the system? What would we expect to see? How would we recover? With the Titanic, you can imagine that they didn't run scenarios, and maybe they had too much confidence in the unsinkable ship.

Mark McCann:

With software, we're in an enviable position where we can run experiments and test hypotheses for relatively little money. It would be hard to do that in the physical world, with ships. What happens if we hit an iceberg and the bunkers are on fire? You're not going to simulate that too many times, at scale. We can do that in the cloud. We can run experiments, inject these faults and simulate what happens. Do you have good engineering and well-architected practices for testing and gaining confidence in your system? What scenarios are you adding into example mapping sessions to cover what-ifs?
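To make "inject these faults and simulate what happens" concrete, here is a minimal sketch in Python. It is not how AWS Fault Injection Simulator works and it uses no AWS APIs. `FlakyDependency`, `call_with_retries` and `run_experiment` are hypothetical names invented for illustration. The hypothesis under test is simply that bounded retries keep the success rate high when a downstream dependency fails some of the time.

```python
import random


class FlakyDependency:
    """Hypothetical downstream service that fails a configurable share of calls."""

    def __init__(self, failure_rate, seed=0):
        self.failure_rate = failure_rate
        self._rng = random.Random(seed)  # seeded so the experiment is repeatable

    def call(self):
        if self._rng.random() < self.failure_rate:
            raise ConnectionError("injected fault")
        return "ok"


def call_with_retries(dependency, max_attempts=3):
    """The behaviour under test: retry a failing call a bounded number of times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return dependency.call()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # out of attempts, surface the failure to the caller


def run_experiment(failure_rate, requests=1000, max_attempts=3):
    """Inject faults at the given rate and return the fraction of successful requests."""
    dependency = FlakyDependency(failure_rate)
    successes = 0
    for _ in range(requests):
        try:
            call_with_retries(dependency, max_attempts)
            successes += 1
        except ConnectionError:
            pass  # a request that failed even after retries
    return successes / requests
```

Comparing `run_experiment(0.3)` with `run_experiment(0.3, max_attempts=1)` shows how much the retry policy contributes. It is the kind of cheap what-if you could never run with a real ship.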

Dave Anderson:

It is correction of errors. Titanic is no different to a modern project, because there is a human element. There are immediate pressures on the company, competition with other companies, people being arrogant and vicious, not knowing what they don't know, being in a hurry or not having enough money. In other words, all the human elements that come into play when you're doing something big. The correction of errors stuff is interesting. When you do post-event analysis, it's not about someone forgetting to do something. It's about what part of the system prevents that from happening. Let's not put a lock on the cupboard with the binoculars, because that will happen again. Correction of error is not about the human error. It's about what system design allows that error to happen. Humans are humans and stuff is going to happen. You must make it hard to do the wrong thing. The pressures are the same because White Star was a company with normal pressures.

Michael O'Reilly:

Even when you know the complexities of your system, there's only so much you can do up front with those practices. Things still go wrong. But when things do go wrong, information is valuable. If you have been proactive, setting RTOs and RPOs and running chaos tests, the information you get when something happens is invaluable. It highlights weaknesses and gives you an opportunity to tighten things up. From an engineering perspective, it's an interesting area because scenarios are creative.
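As a rough illustration of how RTOs and RPOs turn an incident into usable information, here is a small Python sketch. The RTO (Recovery Time Objective) is the maximum tolerable downtime and the RPO (Recovery Point Objective) is the maximum tolerable window of data loss. The class names, field names and timestamps are assumptions made up for this example, not output from any AWS tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryObjectives:
    rto: timedelta  # Recovery Time Objective: maximum tolerable downtime
    rpo: timedelta  # Recovery Point Objective: maximum tolerable data loss window


@dataclass
class IncidentTimeline:
    last_good_backup: datetime   # most recent recoverable copy of the data
    outage_start: datetime       # when the workload went down
    service_restored: datetime   # when the workload was serving traffic again


def evaluate(objectives, timeline):
    """Compare what actually happened in an incident or game day with the targets."""
    downtime = timeline.service_restored - timeline.outage_start
    data_loss_window = timeline.outage_start - timeline.last_good_backup
    return {
        "downtime": downtime,
        "data_loss_window": data_loss_window,
        "rto_met": downtime <= objectives.rto,
        "rpo_met": data_loss_window <= objectives.rpo,
    }
```

Feeding the timeline of each chaos test or real incident through something like `evaluate` gives a simple pass or fail against the targets, which is one way to highlight the weaknesses worth tightening up.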

Mark McCann:

There are key elements for a psychologically safe environment. Failure is a learning experience and it's something that you can use to improve. You must schedule mechanisms to get people into the right headspace. Don't expect exploratory testing or chaos engineering to just happen. You must put space in your plan to do exploratory testing, chaos engineering, or to do a game day. Put mechanisms in place, have conversations and get everybody involved.

Dave Anderson:

It might sound like it's optional to do this stuff. But when your system goes down, you're losing money every hour. And you realise that it wasn't optional.

Michael O'Reilly:

We see a proliferation of serverless adoption, serverless usage, and EDA, which goes hand in hand with distributed micro architectures. Resiliency and planning for disaster recovery are critical.

Dave Anderson:

It's still emerging. And it's a leapfrog thing. You may think that you haven't started this yet. But it's a good time to start now. If you started this 10 years ago, you would have had to figure it all out, like serverless or cloud. If you're in there at the start, you have to knock corners off and figure out how things work. If you start late, it's not a disaster, because you can just use the latest and greatest. It's like standing on the shoulders of giants. So if you jump into Resilience Hub and Well-Architected, there's a whole load of stuff you can use out of the box. As we have discussed, it's never one thing. It reminds us of the book 'The Perfect Storm' and the boat that left Gloucester and headed into the big storm of 1991. It wasn't just the storm that caused the boat to sink, it was a whole bunch of stuff. They documented a whole bunch of stuff. It was a perfect storm of events that ended in tragedy. It's what you have to be resilient against.

Mark McCann:

During the AWS Resiliency Day, we went over disaster recovery strategies from 'Backup and Restore' to 'Pilot Light' and 'Warm Standby' to 'Multi-Site Active-Active'. When you embrace a Serverless First, Well-Architected mindset, your workloads are intrinsically mature on the disaster recovery spectrum compared to a traditional workload. You can have Resiliency and Reliability baked in when you embrace serverless managed services and capabilities like Multi-AZ and Multi-Region. You still need to plan for this stuff but you will be further down the road than somebody trying to do it themselves. When you consider engineering excellence, you should be doing this anyway as part of quality.

Michael O'Reilly:

Chaos tests build confidence, especially when you're doing EDA and integrating with several components or services in different ways. What happens if one of these things goes down? You must simulate those things and game day it. Let's go in and take one or two of them out and see how the team reacts and how resilient the solutions are. We do this a lot. Every time you do it, you learn. You add to your arsenal and experience.
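A toy version of "take one of them out and see what happens" for an event-driven system can be sketched in a few lines of Python. Everything here is hypothetical and in-memory: `Consumer`, `EventBus` and `game_day` are illustrative names, and the dead-letter queue stands in for whatever mechanism your real event bus provides. The question the game day asks is whether events are lost while a consumer is down.

```python
class Consumer:
    """Hypothetical event consumer that can be switched off for a game day."""

    def __init__(self, name):
        self.name = name
        self.available = True
        self.processed = []

    def handle(self, event):
        if not self.available:
            raise RuntimeError(f"{self.name} is down")
        self.processed.append(event)


class EventBus:
    """Toy in-memory bus: failed deliveries are parked on a dead-letter queue."""

    def __init__(self, consumers):
        self.consumers = consumers
        self.dead_letters = []  # (consumer, event) pairs that could not be delivered

    def publish(self, event):
        for consumer in self.consumers:
            try:
                consumer.handle(event)
            except RuntimeError:
                self.dead_letters.append((consumer, event))

    def redrive(self):
        """Retry parked events; anything that still fails stays parked."""
        pending, self.dead_letters = self.dead_letters, []
        for consumer, event in pending:
            try:
                consumer.handle(event)
            except RuntimeError:
                self.dead_letters.append((consumer, event))


def game_day():
    """Take one consumer out mid-stream, recover it, and report what happened."""
    billing, email = Consumer("billing"), Consumer("email")
    bus = EventBus([billing, email])
    bus.publish("order-1")
    email.available = False          # inject the fault: one component goes down
    bus.publish("order-2")
    bus.publish("order-3")
    parked = len(bus.dead_letters)   # events waiting for the failed consumer
    email.available = True           # recovery
    bus.redrive()
    return billing.processed, email.processed, parked, len(bus.dead_letters)
```

In this sketch the healthy consumer is unaffected and the failed one catches up after the redrive. In a real game day the interesting part is how the team detects and reacts to the parked events.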

Dave Anderson:

When I first started working, I wrote a test to detect if someone pulled the network cable out of the machine. The machine had dual network cables. The second test was when we actually unplugged the machine when it was running. It was a Telco system! This stuff is not new, but it's evolving rapidly. So that's the craic. Like and subscribe to Serverless Craic. Have a look at The Serverless Edge blog and follow us @ServerlessEdge on X.