WeCyberYou! Unlocked Podcast

Cyber Security Frameworks Demystified Part 8 - ISO/IEC 27031

Season 1 Episode 8

In this episode, we break down what ISO/IEC 27031 is, how it helps organisations prepare for cyber incidents and major disruptions, and why ensuring ICT readiness is critical to keeping businesses running when everything else fails.

Duration: 0:19:30

Visit https://www.wecyberyou.com for more cyber security education, resources and awareness content like this. 

Thank you for listening. 
WeCyberYou! Team

Like and follow us to be notified when a new episode is released on this channel.

SPEAKER_01

When you picture corporate cybersecurity, you probably imagine a giant digital fortress.

SPEAKER_00

Right, like massive impenetrable walls.

SPEAKER_01

Yeah, exactly. Heavy cryptographic gates and automated guards just keeping all the malicious actors out.

SPEAKER_00

Yeah.

SPEAKER_01

But uh what happens when those walls are breached?

SPEAKER_00

Or worse.

SPEAKER_01

Right. Or worse, what happens when the massive power grid running that entire fortress just completely fails? You know, the alarms go silent and the gates are stuck wide open.

SPEAKER_00

That is the nightmare scenario.

SPEAKER_01

It really is. And welcome to this deep dive on the WeCyberYou! Unlocked podcast. Before we get into today's topic, please take a quick second to follow the channel and remember to visit wecyberyou.com for more content exactly like this.

SPEAKER_00

Highly recommend checking the site out.

SPEAKER_01

Definitely. So today, for you the listener, we are looking at the official documentation for ISO 27031. Our mission on this deep dive is to understand the actual physical mechanics of how organizations keep their digital lights on when absolute disaster strikes.

SPEAKER_00

Yeah, and it really is the ultimate playbook for the worst case scenario. I mean, we spend so much time in the tech industry discussing how to prevent an incident, right?

SPEAKER_01

Oh, all the time.

SPEAKER_00

But ISO 27031 is entirely focused on the mechanics of surviving one. It forces an organization to accept that the unthinkable is going to happen. And when it does, how do you mathematically and structurally ensure the business keeps functioning?

SPEAKER_01

Okay. Let's unpack this because before we can look at how to fix a disaster, we really need to understand the overarching framework of technology readiness.

SPEAKER_00

Right, the big picture.

SPEAKER_01

Yeah. So if we look at other frameworks, like say ISO 27001, you can think of that as the anti-lock brakes on a car.

SPEAKER_00

That's a good way to put it.

SPEAKER_01

Right. It's doing everything in its power to stop the crash from happening. But ISO 27031 is the crumple zone and the airbags. It accepts that the crash is currently happening.

SPEAKER_00

Yeah, you're already hitting the wall.

SPEAKER_01

Exactly. And its entire mechanism is designed to absorb the kinetic impact so the passengers, which is the business, actually survive the wreck. It specifically zooms in on technology readiness.

SPEAKER_00

What's fascinating here is how it acts as a mechanical bridge. It complements broader standards like ISO 27001 for information security and ISO 22301, which handles overall business continuity. But ISO 27031 carves out its own critical niche by focusing purely on the technology side of that equation. In the documentation, the core concept driving all of this is IRBC.

SPEAKER_01

IRBC. Wait, which stands for what again?

SPEAKER_00

Right. So that is ICT readiness for business continuity.

SPEAKER_01

ICT readiness for business continuity, which sounds like really heavy corporate jargon, but the underlying concepts are actually quite elegant.

SPEAKER_00

They are. I mean, IRBC is really the architectural heart of ISO 27031. It basically means engineering your IT systems to possess three specific qualities. They must be available, they must be resilient, and they must be recoverable.

SPEAKER_01

Hold on, let me push back on that for a second. Availability and resilience. Those sound like the exact same thing to me. Like if a system is available, isn't it inherently resilient?

SPEAKER_00

Not necessarily. And the mechanical difference between those two is actually where a lot of infrastructure design completely fails. Yeah. So availability is about redundancy. It means having a secondary load balancer or a backup power supply so that if component A fails, component B takes over.

SPEAKER_01

The system just stays up.

SPEAKER_00

Right. Resilience, on the other hand, is about how a system behaves under extreme stress or partial failure.

SPEAKER_01

Oh, okay.

SPEAKER_00

Like if a massive denial of service attack hits your network. A resilient system doesn't just crash, it degrades gracefully.

SPEAKER_01

Degrades gracefully.

SPEAKER_00

Yeah. So maybe the high resolution images stop loading on the application, but the core text-based transaction database keeps functioning. It bends instead of snapping.
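
The graceful degradation just described can be sketched in a few lines. This is only an illustration under stated assumptions: the feature names, the `FEATURES` table, the `enabled_features` function, and the 0.8 load threshold are all hypothetical and do not come from ISO 27031.

```python
# Hypothetical feature table: name -> whether the feature is core to the business.
FEATURES = {
    "text_transactions": True,
    "high_res_images": False,
}


def enabled_features(load_ratio, degrade_above=0.8):
    """Return the set of features to serve at the given load ratio.

    Below the (assumed) threshold everything is served. Above it, only
    core features stay on, so the system bends instead of snapping.
    """
    if load_ratio <= degrade_above:
        return set(FEATURES)
    return {name for name, is_core in FEATURES.items() if is_core}
```

Under normal load both features are returned; under heavy load only `text_transactions` survives, which mirrors the "images stop loading, transactions keep working" example.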

SPEAKER_01

Well, that makes total sense. So availability is having a spare tire.

SPEAKER_00

Right.

SPEAKER_01

Resilience is having run-flat tires that let you keep driving at a slower speed. And recoverability is how fast you can get the car to a mechanic and replace the axle entirely if you hit a massive pothole.

SPEAKER_00

That is a highly accurate way to look at it. The ultimate goal here is to engineer an environment that prevents disruptions where possible, responds effectively when the environment is compromised, and recovers the underlying data quickly.

SPEAKER_01

Just maintaining critical operations during the chaos.

SPEAKER_00

Exactly.

SPEAKER_01

So we have our crumple zone, but how do we actually design it? Because wait, so not all systems are created equal, right? Definitely not. Like if my company gets hit by a massive power outage, surely the IT team isn't trying to save the employee cafeteria menu server at the exact same time they're trying to save the primary customer transaction database.

SPEAKER_00

No, they absolutely shouldn't be. You cannot save everything all at once. And honestly, if you try, you will just end up saving nothing.

SPEAKER_01

It's like emergency room triage.

SPEAKER_00

Exactly like that.

SPEAKER_01

If you walk into an ER, some patients need a heartbeat instantly. They get rushed straight to the back.

SPEAKER_00

Right.

SPEAKER_01

Other people with a sprained ankle can sit in the waiting room for hours. But here's my question. In a hospital, a doctor makes that call. In a massive corporation with a complex distributed architecture, every department head thinks their specific application is the patient bleeding out.

SPEAKER_00

Oh yeah. Everyone thinks their project is tier one.

SPEAKER_01

Right. So if every department claims their system is tier one, who breaks the tie?

SPEAKER_00

And that is exactly why the standard introduces two core mechanics: alignment with business continuity and risk management. An IT department cannot operate in a vacuum. That makes sense. Nor can they be the ones making the final triage decisions.

SPEAKER_01

So how does that actually work in practice? Do they just put the sales director and the IT architect in a room until they agree?

SPEAKER_00

No, it is much more empirical than that. A proper business impact analysis, or BIA, relies on financial and operational modeling. The business units have to quantify the exact cost of an outage.

SPEAKER_01

Like literally assigning a dollar value to the downtime.

SPEAKER_00

Exactly. They calculate that if the CRM goes down, they lose $50,000 an hour in sales.

SPEAKER_01

Wow.

SPEAKER_00

But if the internal HR portal goes down, it costs them maybe a few hundred dollars in lost productivity. The numbers dictate the triage order.
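
A minimal sketch of how quantified outage costs could drive the triage order. The system names and the `SystemImpact` and `triage_order` helpers are hypothetical; the $50,000 figure is the one quoted above, and 300 is an assumed stand-in for "a few hundred dollars".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemImpact:
    name: str
    cost_per_hour: float  # estimated loss in dollars for each hour of outage


def triage_order(systems):
    """Return systems sorted from the most to the least costly outage."""
    return sorted(systems, key=lambda s: s.cost_per_hour, reverse=True)


# Illustrative inputs only; real figures would come from the business units.
systems = [
    SystemImpact("hr_portal", 300.0),
    SystemImpact("crm", 50_000.0),
]
ordered_names = [s.name for s in triage_order(systems)]
```

Here `ordered_names` puts the CRM ahead of the HR portal, so the numbers, not the loudest department head, decide what gets restored first.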

SPEAKER_01

Oh, so it completely removes the emotion from the equation.

SPEAKER_00

Yes. The technology serves the business, not the other way around.

SPEAKER_01

Okay. The documentation highlights two crucial metrics that come out of this business impact analysis. And I want to make sure you, the listener, really grasp the mechanics of these because they are basically the foundation of any recovery architecture. They are RTO and RPO.

SPEAKER_00

Yeah, let's start with RTO, the recovery time objective. This dictates how fast specific systems need to be back online before the business suffers unacceptable, irreversible damage.

SPEAKER_01

Okay.

SPEAKER_00

So for a critical financial transaction system, your RTO might be five minutes. For that internal HR portal, your RTO might be a week.

SPEAKER_01

Okay, so the BIA gives us our RTO. It's our math. We know the bank app needs to be up in five minutes. Now the second metric is RPO or recovery point objective.

SPEAKER_00

Right.

SPEAKER_01

If RTO is about time moving forward to get back online, RPO is about looking backward at your data. It defines how much historical data the business can afford to lose.

SPEAKER_00

Exactly. It dictates your backup point.

SPEAKER_01

So if my RPO is 24 hours, I am backing up my data once a day. If the system crashes, I lose yesterday's work, and that is deemed acceptable by the business.

SPEAKER_00

Correct. But if you are a major financial institution, an RPO of 24 hours is catastrophic. You cannot tell millions of customers that their transactions from Tuesday simply vanished into thin air.

SPEAKER_01

Oh yeah, that would be a nightmare.

SPEAKER_00

A total disaster. For those Tier 1 systems, the RPO must be near zero, meaning data is backed up continuously in real time.
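
The two metrics can be expressed as simple time arithmetic. The sketch below assumes hypothetical helper names (`data_loss_window`, `meets_rpo`, `meets_rto`) and treats RPO as the maximum tolerable gap between the last good backup and the failure, and RTO as the maximum tolerable gap between the failure and restoration.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup, failure_time):
    """Data written after the last backup is what would be lost."""
    return failure_time - last_backup


def meets_rpo(last_backup, failure_time, rpo):
    """True if the lost-data window fits inside the recovery point objective."""
    return data_loss_window(last_backup, failure_time) <= rpo


def meets_rto(failure_time, restored_time, rto):
    """True if the system came back within the recovery time objective."""
    return restored_time - failure_time <= rto
```

For example, a nightly backup followed by a failure 18 hours later satisfies a 24-hour RPO but badly misses a five-minute one, which is why Tier 1 systems need continuous replication rather than scheduled backups.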

SPEAKER_01

Wait, an RPO of zero, backing up data continuously in real time sounds insanely expensive and resource heavy.

SPEAKER_00

It is incredibly expensive.

SPEAKER_01

How does a company even pull that off without slowing their entire network to an absolute crawl? Every time I save a file, it has to instantly copy to another server.

SPEAKER_00

Well, which is why the BIA is so vital, right? You only apply an RPO of zero to the most critical databases. Mechanically, to achieve near zero RPO without tanking network performance, organizations don't use standard backups.

SPEAKER_01

What do they use then?

SPEAKER_00

They use synchronous replication over dedicated high-speed fiber channels. When a customer makes a deposit, the primary server writes the data. And before it even confirms the transaction to the user, it waits for a confirmation from a secondary server miles away that the data was also written there.
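
The write path described here, where the transaction is confirmed only after the remote site acknowledges it, can be sketched with in-memory stand-ins. The `Replica` class and `synchronous_commit` function are hypothetical; a real system would involve storage arrays and network links rather than Python lists.

```python
class Replica:
    """In-memory stand-in for a storage site (an assumption for illustration)."""

    def __init__(self, name):
        self.name = name
        self.log = []
        self.online = True

    def write(self, record):
        if not self.online:
            raise ConnectionError(f"{self.name} is unreachable")
        self.log.append(record)


def synchronous_commit(primary, secondary, record):
    """Confirm the transaction only if both sites have written the record."""
    primary.write(record)
    try:
        secondary.write(record)
    except ConnectionError:
        # No acknowledgement from the remote site: undo the local write
        # and refuse to confirm, so the two copies never diverge.
        primary.log.pop()
        return False
    return True
```

The design choice to notice is the trade-off: the user waits for the remote acknowledgement on every write, which is what buys the near-zero RPO.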

SPEAKER_01

So they are literally writing the data in two places simultaneously.

SPEAKER_00

Exactly. Or they use asynchronous replication with continuous delta syncing.

SPEAKER_01

Delta syncing.

SPEAKER_00

Yeah. Where only the tiny block level changes in the data are streamed to the backup site milliseconds after they happen. It requires massive bandwidth and incredibly sophisticated storage arrays.
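
Block-level delta syncing can be illustrated as follows. This is a toy sketch: the 4-byte `BLOCK_SIZE` and the `split_blocks`, `compute_delta`, and `apply_delta` helpers are assumptions chosen for readability, and real storage arrays track changed blocks very differently.

```python
BLOCK_SIZE = 4  # unrealistically small, chosen so the example is easy to read


def split_blocks(data):
    """Cut a byte string into fixed-size blocks (the last one may be short)."""
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]


def compute_delta(old, new):
    """Return ({block_index: new_block}, total_block_count) for changed blocks."""
    old_blocks = split_blocks(old)
    new_blocks = split_blocks(new)
    delta = {}
    for i, block in enumerate(new_blocks):
        if i >= len(old_blocks) or old_blocks[i] != block:
            delta[i] = block
    return delta, len(new_blocks)


def apply_delta(old, delta, block_count):
    """Rebuild the new data at the backup site from the old copy plus the delta."""
    blocks = split_blocks(old)[:block_count]
    blocks += [b""] * (block_count - len(blocks))
    for i, block in delta.items():
        blocks[i] = block
    return b"".join(blocks)
```

Only the changed blocks cross the wire: changing the middle four bytes of a twelve-byte record produces a delta containing a single block.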

SPEAKER_01

Here's where it gets really interesting though. We've established the risk and we know our objectives with RTO and RPO. But math doesn't reboot a server.

SPEAKER_00

No, it certainly doesn't.

SPEAKER_01

What actually happens when the nightmare becomes reality and the data center is literally underwater? How do we move to the action phase?

SPEAKER_00

Right, the response.

SPEAKER_01

Yeah, the standard outlines the key components of an incident response. And one thing that really caught my attention was the communication plans. I find it fascinating that a highly technical IT standard explicitly demands communication protocols.

SPEAKER_00

Well, because a disaster is rarely just a technical failure. It is almost always a human coordination failure as well. Oh, for sure. Think about it. If a massive outage hits, the IT engineers are frantically digging through code to restore routing tables, the executives are demanding answers for the press.

SPEAKER_01

And the customer service team is giving clients completely inaccurate information.

SPEAKER_00

Exactly. The disaster multiplies exponentially due to chaos.

SPEAKER_01

But practically speaking, if the network is down, the company email server is down.

SPEAKER_00

That is exactly why ISO 27031 requires out-of-band communication plans. You cannot rely on the infrastructure you are trying to fix to communicate about fixing it.

SPEAKER_01

Oh, that's a great point.

SPEAKER_00

Mature organizations will have entirely separate cloud-hosted communication platforms, like a standalone Slack workspace or dedicated emergency cellular devices that do not touch their primary network.

SPEAKER_01

So it's totally isolated.

SPEAKER_00

Totally isolated. The communication plan dictates exactly who talks to whom on what isolated platform and at what intervals.

SPEAKER_01

So it's like the core switch is down, we are failing over to the secondary site, notify the client success team, we will be degraded for two hours. It basically manages the panic.

SPEAKER_00

This raises an important question about the actual restoration, though. Communication is just the wrapper. Inside that wrapper, the standard requires an ICT continuity strategy and ICT recovery plans.

SPEAKER_01

What's the mechanical difference between a strategy and a plan in this context?

SPEAKER_00

So the strategy is the overarching architectural approach. Are we using redundant physical systems in a hot site, or are we relying on cloud failover? Okay. The recovery plans are the tactical, step-by-step technical procedures. It is the literal runbook an engineer executes.

SPEAKER_01

When you say cloud failover as a strategy, what is actually happening mechanically? We hear that term all the time. Is there a literal heartbeat signal between the primary data center and the cloud backup?

SPEAKER_00

In many architectures, yes. Quite literally. You will have automated monitoring systems constantly sending a health check or a heartbeat to the primary application. If that primary application fails to respond to three consecutive heartbeats, the automated failover mechanism kicks in.
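
The "three consecutive missed heartbeats" rule can be sketched as a small state machine. The `FailoverMonitor` class and the site labels are hypothetical, and this sketch assumes that failing back to the primary is a manual decision, which the conversation does not address.

```python
class FailoverMonitor:
    """Tracks consecutive missed heartbeats and decides which site is active."""

    def __init__(self, threshold=3):
        self.threshold = threshold  # consecutive misses that trigger failover
        self.missed = 0
        self.active_site = "primary"

    def record_heartbeat(self, responded):
        """Feed in one health-check result and return the active site."""
        if self.active_site != "primary":
            # Already failed over; automatic failback is out of scope here.
            return self.active_site
        if responded:
            self.missed = 0  # any successful response resets the counter
        else:
            self.missed += 1
            if self.missed >= self.threshold:
                self.active_site = "standby"
        return self.active_site
```

Requiring consecutive misses, rather than any three misses, keeps a single dropped packet from triggering an expensive and disruptive failover.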

SPEAKER_01

And how does the user traffic know to go to the cloud instead of the dead server?

SPEAKER_00

That usually happens at the DNS level. The system automatically updates the domain name system records, essentially changing the internet's address book on the fly. Oh wow. It drops the time to live or TTL of the routing record to maybe 30 seconds. So all incoming internet traffic is instantly rerouted from the dead physical data center to the standby environment in AWS or Azure.

SPEAKER_01

That's incredibly fast.

SPEAKER_00

Yeah. When designed perfectly, the end user might just experience a slight page load delay.

SPEAKER_01

That is incredible. But you know, the sources also give us some very practical examples of real-world threats ISO 27031 prepares you for. And not all of them are simple hardware failures.

SPEAKER_00

No, definitely not.

SPEAKER_01

We are talking about ransomware attacks shutting down entire networks. And this is where I kind of get hung up. If you get hit by advanced ransomware, how do you even recover? If everything is connected and syncing continuously to hit those RPO targets we talked about, wouldn't the ransomware just sync to the backup and corrupt that too?

SPEAKER_00

That is the exact nightmare scenario. And it happens frequently. Wow. This is why the ICT continuity strategy must account for the specific nature of the threat. To combat ransomware, you cannot just rely on standard continuous replication. You need immutable backups.

SPEAKER_01

Immutable, meaning they cannot be changed.

SPEAKER_00

Correct. The storage architecture is designed so that once a backup snapshot is written, it is cryptographically locked. For a set period of time, it cannot be modified or deleted by any user or administrator, even one with top-level network credentials.
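
A write-once store with a retention lock can be modeled in a few lines. This is a conceptual sketch only: `ImmutableSnapshotStore` and its methods are hypothetical, and real immutable storage enforces the lock in the storage layer itself, not in application code that an attacker could bypass.

```python
from datetime import datetime, timedelta


class ImmutableSnapshotStore:
    """Toy model of write-once snapshots protected by a retention period."""

    def __init__(self, retention):
        self.retention = retention
        self._snapshots = {}  # snapshot_id -> (data, locked_until)

    def write(self, snapshot_id, data, now):
        if snapshot_id in self._snapshots:
            raise PermissionError("snapshots are write-once")
        self._snapshots[snapshot_id] = (data, now + self.retention)

    def delete(self, snapshot_id, now, is_admin=False):
        # is_admin is deliberately ignored: credentials do not lift the lock.
        _, locked_until = self._snapshots[snapshot_id]
        if now < locked_until:
            raise PermissionError("retention lock still active")
        del self._snapshots[snapshot_id]

    def read(self, snapshot_id):
        return self._snapshots[snapshot_id][0]
```

The point of the model is that `delete` fails during the retention window even when `is_admin=True`, which is the property that stops ransomware holding stolen administrator credentials.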

SPEAKER_01

So even if the ransomware encrypts the primary network and the administrative servers, it just hits a brick wall when it tries to infect the immutable storage array.

SPEAKER_00

Exactly.

SPEAKER_01

So the recovery plan runbook for ransomware wouldn't just be restore the backup. It'd be step one, isolate the network. Step two, verify the immutability of the backup repository. Step three, forensically scrub the hardware. And step four, initiate the restoration.
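
The four-step runbook just listed can be written as an ordered procedure that stops at the first failed step. The `run_runbook` helper and the `execute` callback are hypothetical; in practice each step would be a detailed procedure carried out by engineers, not a single function call.

```python
# The four steps named in the conversation, in the order given.
RANSOMWARE_RUNBOOK = [
    "isolate the network",
    "verify the immutability of the backup repository",
    "forensically scrub the hardware",
    "initiate the restoration",
]


def run_runbook(steps, execute):
    """Run steps in order and stop at the first failure.

    `execute` is a caller-supplied function that performs one step and
    returns True on success. The completed steps are returned.
    """
    completed = []
    for step in steps:
        if not execute(step):
            break  # never restore onto an environment that failed a check
        completed.append(step)
    return completed
```

Stopping early matters: if the backup repository fails verification, the procedure never reaches restoration, so a compromised backup is not written back onto clean hardware.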

SPEAKER_00

Exactly. It turns an existential company-ending crisis into a highly stressful but entirely manageable engineering workflow.

SPEAKER_01

So what does this all mean for day-to-day operations? Because a brilliant architectural strategy and an immutable backup on a piece of paper are completely useless if the engineer just freezes when the alarms actually go off.

SPEAKER_00

If we connect this to the bigger picture of the standard, that is why the final major component of ISO 27031 is testing and continuous improvement. You cannot assume your plan works. You have to actively prove it works through rigorous testing.

SPEAKER_01

What does that actually look like? Are IT directors running around like chaos agents, literally pulling fiber optic cables out of server racks to see what happens?

SPEAKER_00

Honestly, in the most advanced tech companies, yes.

SPEAKER_01

Wait, really?

SPEAKER_00

Yeah, that is a practice known as chaos engineering. Companies like Netflix famously developed software like Chaos Monkey that intentionally terminates production instances randomly during the workday.

SPEAKER_01

Just to see if the failover works.

SPEAKER_00

Exactly. To ensure the automated failover mechanisms we discussed actually work under real-world conditions. But for most standard enterprises, testing starts with tabletop exercises.

SPEAKER_01

Walk me through a tabletop exercise. How do you simulate a digital disaster in a conference room?

SPEAKER_00

The security leadership will draft a highly specific worst-case scenario. Let's say a novel ransomware strain has just compromised our Active Directory environment.

SPEAKER_01

Okay.

SPEAKER_00

Active Directory is the system that controls all user permissions and passwords. So they say the attackers have changed all administrative passwords, you cannot log into your laptops, you cannot access the digital runbooks, the manufacturing floor has halted. Go.

SPEAKER_01

Wow. That immediately exposes the human flaws. If the recovery plan is saved as a PDF on a server you can no longer log into, your plan is useless.

SPEAKER_00

Exactly. The tabletop exercise forces the team to realize they need physical printed copies of the runbooks stored in a secure offline safe. It forces them to realize they need an out-of-band communication method because their corporate email uses Active Directory to authenticate. Right. The entire goal of the simulation is to break the plan, identify the gaps, and continuously improve the architecture based on those lessons learned. Technology changes, cloud environments shift, and threat actors evolve daily. And an untested plan is an obsolete plan.

SPEAKER_01

So taking a step back and looking at everything we've covered, from crumple zones to immutable backups to tabletop simulations, why should the listener care?

SPEAKER_00

It's a valid question.

SPEAKER_01

Yeah, why does going through the rigorous, expensive process of aligning with ISO 27031 actually matter to the bottom line of a business?

SPEAKER_00

Well, the documentation makes it clear that this isn't just an IT checklist. It is a core business survival tool. First, it drastically reduces downtime and financial hemorrhage. When transaction systems are offline, the company is burning money by the second.

SPEAKER_01

Every single minute of an RTO has a dollar amount attached to it.

SPEAKER_00

Absolutely. Second, it fundamentally changes your posture against cyber attacks. If you have a tested, immutable backup architecture and a zero trust recovery plan, ransomware loses its leverage.

SPEAKER_01

Because you can just ignore the ransom.

SPEAKER_00

Right. You don't have to pay it because you can confidently rebuild the environment yourself. Furthermore, it supports strict regulatory and compliance requirements that governments are increasingly mandating for critical infrastructure and financial sectors.

SPEAKER_01

And I think the point that really hits home is that it protects reputation.

SPEAKER_00

Oh, massively.

SPEAKER_01

If an airline scheduling system goes down for an hour, it makes the evening news. If a bank app is offline for two days, people are moving their direct deposits to a competitor. Trust takes decades to build and a single poorly managed IT incident to destroy. Yep. ISO 27031 proves that disaster recovery is no longer an IT problem. It is a core business operation.

SPEAKER_00

Trust is the ultimate asset you're protecting when you implement these frameworks.

SPEAKER_01

So to summarize the core takeaway from the source text: ISO 27031 is the standard that helps organizations engineer their IT systems to handle extreme disruptions and keep the business running no matter the circumstances. It moves an organization from hoping they survive a crash to mathematically engineering the crumple zone so they know they will.

SPEAKER_00

It does. But uh, I want to leave you with one final slightly provocative thought to ponder based on one of the practical examples the standard addresses. Cloud service outages.

SPEAKER_01

Oh, this is the cloud paradox.

SPEAKER_00

Exactly. We discussed cloud failover. The idea that if your local systems fail, the massive, highly redundant infrastructure of a major provider like AWS or Microsoft Azure will seamlessly take over. Right. And many modern organizations build their entire disaster recovery strategy around this exact assumption. But what happens when the disruption isn't local? Oh man. If a massive cascading infrastructure event takes out the major global cloud providers themselves, which we have seen happen in limited capacities due to routing errors or massive localized weather events, where is the ultimate fail-safe?

SPEAKER_01

Wow. If your backup plan relies entirely on someone else's computers, what is your strategy when their computers go down?

SPEAKER_00

It really forces us to question the extreme limits of external redundancy. At a certain point, the cloud is just another physical data center vulnerable to the laws of physics.

SPEAKER_01

That is a chilling thought. And definitely something to keep you up at night if you are a chief technology officer. Just when you think your digital fortress is secure and your crumple zones are perfect, you realize you might not control the actual ground the fortress is built on.

SPEAKER_00

A very unsettling but necessary realization.

SPEAKER_01

Well, on that slightly terrifying but incredibly important architectural note, we are going to wrap up this deep dive on the WeCyberYou! Unlocked podcast. Thank you so much for joining us. Please make sure you follow the channel so you never miss a deep dive. And head over to wecyberyou.com right now to keep exploring the complex frameworks that keep our digital world turning. Until next time, keep your backups immutable and your plans ruthlessly tested.