Andrew (00:00): Those things would make Kubernetes secure by default for the average workload, but they would dramatically increase the complexity for a new user tinkering and learning by themselves.
Marc (00:21): Welcome to DevOps Sauna Season 4, the podcast where technology meets culture and security is the bridge that connects them.
Marc (00:36): Hello, we are back in the sauna. This is the first episode of season four. I'm super excited to be here with our first guest of the season, Mr. Andrew Martin, founder and CEO of control-plane.io. Hello, Andy.
Andrew (00:54): A very good day to you.
Marc (00:55): Very good day to you too. It's so nice to have you. And I have a co-host here that I would like to introduce: Mr. Darren Richardson.
Darren (01:03): Hey, Marc.
Marc (01:04): Hello, Darren. Nice to be back in the sauna. And Andy, I'm really, really excited; there are so many topics that we talked about in the pregame here. And I know that you'll be at The DEVOPS Conference Global, which will be local in London and streamed across the world on March 14th of 2024. But Andy, control-plane.io: what are you onto?
Andrew (01:25): We are big into all things cloud and container security, and specifically, these days, that has its largest intersection with supply chain security. Of course, it still all anchors back to standard Linux kernel security, cloud configuration management, and broadly how developers can ship their workloads to production as quickly as possible with guardrails in large, regulated organizations.
Marc (01:50): Cool. One of my favorite things about the state of DevOps in the last half a year or so has been that DevOps has finally gotten to the point where security measures can actually be taken into use. Shift left is not just a buzzword anymore, and people are actually starting to get it and take on security practices on a day-to-day basis. Would you like to open up a little bit about how you see this now?
Andrew (02:16): Absolutely. The journey to merge development teams with their security colleagues has taken us quite some time, and there are various reasons for that. Of course, from a development perspective, optimal engineering involves just getting the thing shipped to production as quickly as possible, and nobody likes the traditional view of security, which is the 'no department' blocking deployments and pushing timelines to the right. So merging these two has been a kind of two-pronged assault. On the one hand, we've got the DevOps or 'security champion' approach, where an individual, especially starting off in application security, sits with a team and helps to champion the kind of practices that make sense to reduce vulnerabilities in production. That's everything from static code analysis and IDE embellishments to, these days, actually looking at the libraries that are being pulled in for supply chain risks: transitive dependencies that may have an indirect but actually significant impact on production safety. Those tools have spread out into the infrastructure and operations community as well. We now have things like static analysis on Terraform manifests, and for that push, we're doing most things in languages that can be parsed, things like YAML. DSLs are moving away from JSON, which was very difficult to even comment, for example, which meant we had sort of amorphous blobs that defined security or production configuration. So on the one hand, we've got this advent of tooling and a spreading of the impetus to use it from a development perspective. On the other hand, we've seen security teams move from what we'd consider more traditional point-and-click into automation. This has involved changing skill sets in security teams. It's involved crossing the chasm between the previously siloed development and operations teams, and now the third party, security. We're starting to see things that really will move the needle dramatically in terms of overall infrastructure provisioning and security. There is a Common Cloud Controls framework that was proposed by Citi, which ControlPlane helped to instigate by threat modeling cloud services, and that has now been moved into FINOS, the fintech open source foundation. This is an exhaustive list of cloud threats and controls. What's happening next with this initiative is moving those defined and sort of static definitions into a dynamic test language called OSCAL, which is out of NIST. What it does at the moment is provide a machine-readable format for security testing, in fact generic testing; we're looking to extend that to add automation into the specification itself. In my mind, that finally closes the loop between the automated testing that developers are used to (unit, integration, acceptance, and system acceptance tests) and transmuting that across into the security team. They're able to run security-flavored system acceptance tests, and more dynamic independent tests further in, for example into a call graph, for things that are not just available from the public-facing attack surface. Finally, we will see a cohesive approach to DevSecOps across all of those three departments.
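A minimal sketch of what a single machine-readable control might look like in OSCAL's YAML serialization; the catalog title, control ID, and prose here are invented purely for illustration:

```yaml
# Hypothetical OSCAL catalog fragment; IDs, title, and prose are illustrative only
catalog:
  uuid: 11111111-2222-3333-4444-555555555555
  metadata:
    title: Example Common Cloud Controls Catalog
    last-modified: 2024-01-01T00:00:00Z
    version: "0.1"
    oscal-version: 1.1.1
  controls:
    - id: ccc-net-01
      title: Deny public network exposure by default
      parts:
        - id: ccc-net-01_smt
          name: statement
          prose: >-
            Workloads must not be reachable from the public internet
            unless an explicit, reviewed exemption exists.
```

The automation Andrew describes would then hang tests off controls like this one, so that security-flavored acceptance tests run from the same definitions the auditors read.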
Darren (05:28): That's interesting. I'm wondering about this kind of automation scripting, because from a security standpoint, it's been the case for a long time that automation is definitely the thing lacking for security teams. It's a problem that's held us back purely because of the low footprint on the ground for security teams. When you talk about this security scripting language, it makes me wonder how that fits in, maybe modularly, with something like existing YARA rules and those kinds of frameworks, and whether they slot together to create this ecosystem of automation.
Andrew (06:02): I agree entirely; the absence of available APIs to program security tooling has led to that situation where the security team was UI-driven. Where OSCAL and YARA might intersect: OSCAL is focused broadly on cloud configuration at large. It doesn't specifically, for example, look at packets on the network. What it could do would be to take a sampling of the last interval, overlapping with its last run, because the idea with OSCAL is to run this continuously; there are constant updates on cloud security posture. If that last set of network traffic packet dumps were moved into a location where YARA could be run on them, then the assertion from OSCAL would potentially be the absence of malicious packets, or some tolerance, depending on where things sit further down the line; obviously, deep within a call graph, you'd expect to have no malicious indicators of compromise. The intersection is that kind of dynamic system acceptance testing (DAST), and looking to integrate with other frameworks that way. And to your point, I think we probably saw a sea change here when Burp Suite finally added an API, as one of the first and loudest to do so.
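As a rough illustration of that pattern (continuously sweeping the latest packet dumps with YARA so a higher-level assertion can test for the absence of matches), a scheduled job might look something like the following sketch; the schedule, rule file, and dump path are all hypothetical:

```yaml
# Hypothetical scheduled sweep (GitHub Actions syntax); paths and rule file are illustrative
name: yara-packet-sweep
on:
  schedule:
    - cron: "*/15 * * * *"   # re-scan roughly every capture interval
jobs:
  scan:
    runs-on: self-hosted
    steps:
      # yara -r recurses into the dump directory; matching rule names are
      # printed to stdout, which is the signal an OSCAL-style assertion
      # could then test against (ideally: no matches at all)
      - run: yara -r ioc_rules.yar /var/pcap-dumps/latest/
```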
Darren (07:12): Yeah, very much so. But it sounds like you're describing a layered approach where OSCAL is taking the broader view and then more in-depth automation sits in the systems themselves, covering that kind of area. Then my question would be: if it's dealing with cloud configuration, how does it handle cloud agnosticism? How does it handle spanning multiple cloud platforms? Because most of the security posture tools I've worked with are all focused on a specific cloud provider.
Andrew (07:46): Absolutely. The dream of Terraform providing a unified abstraction across different clouds never really came to fruition; there is some incentive for vendors to keep their own uniqueness in the APIs. There are obviously very large-scale nuances, especially in the way that packets are routed. For example, Google will try and get the IP as close to the endpoint from the edge as possible, AWS will encapsulate stuff furiously, and Azure will do a mixture of everything, wormholing packets between places in unusual ways. The cloud controls and OSCAL project actually focuses on duplicating the human-readable controls into concrete implementations on a per-cloud basis; this part of the project has yet to be done, so this is my expectation. Even though hybrid cloud has been feted as a mechanism to distribute resilience, it is rarely used and a very difficult methodology to deploy with, and many organizations don't use it because of the costs of employing individuals and re-deploying onto those targets. Realistically, the links between data centers of different cloud providers generally carry reasonably substantial latency. We see fewer hybrid cloud approaches, and we expect OSCAL to test against each different cloud provider with a unique set of tests but with the same name, describing the same threats and mitigating controls, and then with exemptions where that's not appropriate. And to your point, absolutely, this does sit hand in hand with dynamic testing. If we were to divide up the types of controls, we would have preventative pipeline controls like static analysis, signature verification, SBoMs, and software composition analysis; controls that run at runtime and look to identify indicators of compromise, YARA being a good example there; and then potentially remediating controls, which, with AI being the hot word on everyone's lips, could mean delegating to actuating AI models that can then, for example, tweak Terraform configurations to close down something that's incorrect. We already see things like Cloud Custodian do that. An embellishment of those capabilities is probably more likely than something coming up from scratch, because obviously the intelligence required is already codified into those projects, rather than being emergent in some magical way.
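Cloud Custodian policies, for reference, are plain YAML. A sketch of the kind of remediating control Andrew mentions might look like this; the policy name and the exact filter and action settings are illustrative rather than a recommendation:

```yaml
# Hypothetical Cloud Custodian policy: find S3 buckets without public-access
# blocks and turn the blocks on
policies:
  - name: s3-enforce-public-access-block
    resource: aws.s3
    filters:
      - type: check-public-block
        BlockPublicAcls: false
    actions:
      - type: set-public-block
        BlockPublicAcls: true
        BlockPublicPolicy: true
```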
Marc (10:13): You mentioned SBoMs in there. I know you have a talk called Smarter Than Your Average SBoM. There's a lot of talk about SBoMs among our customers and in some of the work that we're doing. Would you like to tell us a little bit about what you're looking at, maybe a little bit about this talk, Smarter Than Your Average SBoM, and how you see this today and in the near future?
Andrew (10:37): Yes, absolutely. My dear friend Matt Jarvis and I put together this talk for Open Source Summit Bilbao; he is now part of the AI strategy team at Snyk. What we looked at was a multitude of things. First of all, where do SBoMs get us compared to software composition analysis? Because ultimately, the initial discovery phase is the same: given a package or folder or file system, what are the visible open source components there, or what can be inferred? For example, Golang packages up its dependencies in a readable form within the statically compiled binary; open source software will publish the manifest in source code, which we have access to; and Java will give you an SBoM XML. First of all, it's a case of discovering the first-level dependencies, or the packages that are installed. This can be other applications, in the case of a container image or file system; it can be libraries that are installed; it can be things that are just downloaded from GitHub; and it can be application libraries that are part of the file system node that we're scanning from, or perhaps linked in from other places. Already the nuance and complexity of this is dependent upon the context, the language, and perhaps even the configuration of the language itself. From that perspective, we initially gather everything that the application needs for its context. This first level of dependencies is what some SBoMs will contain. Other SBoM generators will walk that dependency graph and pull all the transitive dependencies down. There are pros and cons to both of these approaches. On the one hand, taking the first level of dependencies doesn't necessarily tell you what the package was built with, because not all dependencies pin their versions. If I'm an application running in Node, for example, I can use semantic versioning to indicate that I'll accept any patch version. A semantically versioned patch shouldn't ever break an API, so I can 'safely', in inverted commas, do that. Where that breaks for SBoMs and supply chains is that it's a moving target defined in a static manifest file in the SBoM output, so there's a risk at that point. On the other hand, we've got an SBoM that will drag down every single transitive dependency. At that point, there is a risk, because of the exponential volume of those dependencies compared to the first level, that a user essentially has dependency blindness. There are far too many things. If we look at CVEs already, they're not necessarily clear as to their actual impact or the context that they're required to run in. One of the classics was a SQLite vulnerability in Chrome, which would always show up in container image scans for a long time; contextually, it's not used in that way, and therefore there is no vulnerability. But context is everything with these things, and this really muddies the water as well. Then we've got the actual SBoM itself: what does it comprise? Is it an accurate rendition of what's in the image? The two competing standards, CycloneDX and SPDX, take slightly different approaches. One question is whether to include vulnerability data, the CVEs linked to that version, in the SBoM itself. This comes down to a fundamental question: is an SBoM a build manifest for the point in time that the application was built from source?

Or is it a living and mutating document, as CVEs are released against the packages that are included in the SBoM? My view on this is that the vulnerabilities should be pulled from an API whenever the SBoM is validated, because otherwise it might lead to a false sense of security further down the line. The idea with shipping these SBoMs is that the producer signs the SBoM in combination with packaging their artifact, whether that's into a container to upload to a package registry, into a binary, or into a tarball, whatever that would be. When they distribute that, they offer a side-channel download of the SBoM to the consumer. The consumer is then able to validate the signature that the producer added to the SBoM, to validate that both the distributed artifact and the exported metadata come from the same producer, or that both come from trusted producers with a relationship, and then to use the SBoM to identify, for COTS or vendor-produced software, what artifacts are in there. If Log4j turns up again, which inevitably it will, as we see internet-melting vulnerabilities every couple of years, where in my estate's tens or hundreds of thousands of machines is it? We know from AWS's monitoring of Log4j that it's still going on; there are still plenty of deployments that are vulnerable, and in some cases they've had to take proactive measures. That's the dream of the SBoM. The question then becomes: where is an SBoM different from software composition analysis and vulnerability scanning? Ultimately, they have the same fundamental mechanism, but with SCA on open source software we're not validating the contents of an SBoM, so the two things should sit hand in hand. For closed source software, we don't have much option, because we weren't able to scan it in the first place. There's also a question of the sanctity of the signature and the key material used to generate that signature throughout the entire SBoM generation life cycle. The talk goes into that threat model in a little bit more detail.
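For a sense of what these documents contain, here is a stripped-down CycloneDX-style component entry, rendered as YAML for readability even though CycloneDX itself is serialized as JSON or XML; the component shown is simply the Log4j example from the discussion:

```yaml
# Illustrative CycloneDX-style SBoM fragment (real documents are JSON or XML)
bomFormat: CycloneDX
specVersion: "1.5"
components:
  - type: library
    group: org.apache.logging.log4j
    name: log4j-core
    version: 2.14.1   # a version affected by Log4Shell (CVE-2021-44228)
    purl: pkg:maven/org.apache.logging.log4j/log4j-core@2.14.1
```

A signed document like this, shipped alongside the artifact, is what would let a consumer search an estate of machines for the next Log4j.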
Darren (15:44): Okay, this is actually kind of great, because I want to touch on something you mentioned towards the end about Log4j and discovering where it is, which is the dream of the SBoM, obviously. But it also seems like it's trying to address an issue that I've had with security reporting for some time, which is that we measure with CVSS scores, on a scale of 1 to 10, the severity of an impact on a system, but we have no good metrics for measuring the impact of exploits on the whole. We knew Log4j was huge because it was hitting Java, it was hitting the Apache logging system, which was basically everywhere. It sounds like the idealized SBoM is intended to be used in that way, to cover the base that we're currently not covering in cybersecurity, which is not 'how hard is my system being hit?' but 'how hard is the internet as a whole being hit?'
Andrew (16:49): Yes, there's 'is the internet on fire' dot org, or dot com, I don't remember; it's a service that's even resolvable via DNS records. I haven't looked at it for a while, but it used to be a favorite. I entirely agree. Asset management down to the component or library level is not something that organizations have generally matured into. Looking at some of the IBM reports on data breaches last year, about 50% of companies in the US are actually using supply chain security or hardening techniques, but we keep on seeing supply chain attacks rising up in the rankings. Certainly some of the targeted attacks that we've seen could bring widespread devastation, on the back of SolarWinds, of course, where the CISO and some of the C-levels have just had something issued by the US government; it's not necessarily an indictment or formal proceedings, but it doesn't set the scene very favorably. So yes, absolutely, asset management is one path of that. And the second, as you say: CVEs are a generalized and oft-criticized metric. We've just seen CVSS 4 land. Broadly, I think, and this plays into VEX, the Vulnerability Exploitability eXchange format, which I'll move on to, the fix for CVEs is in my opinion probably completely intractable, but it's to give that contextual hint to the consumer, the person trying to use that piece of potentially vulnerable code: it's being used in this way. And the side effects are wide, because one of the magical things about running containers, of course, is being able to lock down the Linux kernel's response to the application: seccomp for the system calls, AppArmor for user space behavior, and SELinux, depending on Ubuntu or Red Hat, for a huge amount of file system access prevention. These are very effective countermeasures. If a CVE has a vulnerability in a container, and the container has CAP_SYS_ADMIN, let's say, well, that's the entry point for a lot of kernel bugs, so we turn it off in general. When Dirty COW came out, it was the ptrace system call. So the question is: is it possible to run a vulnerable application in production with the correct security context? I think it is, because that's the point of having these contexts. It does then require further vulnerability research by the analysts who discovered it, to retest under those conditions. But without that contextual information, and then the branch beneath it giving a hint to the consumer as to what is actually wrong, yeah, we're in a space where it's very difficult to understand everything. This concept of VEX essentially says: if I'm a vendor, and I'm shipping you an appliance, and I use OpenSSL, and OpenSSL just had a CVE, that CVE is registered against the version of OpenSSL. Now, I can do a number of different things in that situation. I could run some sort of symbolic execution and analyze whether I'm actually hitting the vulnerable method signature. Let's say it's a very old encryption cipher; well, I don't use that, and I can prove it. I can walk the graph of my application, its abstract syntax tree, and see that I never hit that method. Should I be able to use that application in production? Absolutely, because otherwise it adds toil and delays releases.

Ultimately, we're in a competitive landscape where speed of deployment is key, and so VEX should give us a mechanism to run theoretically insecure software in a safe manner, be that by not making the vulnerable method call, or, if it's open source software, by tree-shaking the application library itself and just removing that code altogether. There may be side effects on test suites doing it that way, of course. We can also instrument the application, or the library itself, to prevent that code from being called altogether. The combination of CVEs that actually give us useful information (I don't mean to besmirch the great work that goes on, but that give more contextual clues to the user), in addition to VEX documents being distributed by vendors, will give us a slightly clearer landscape and ultimately save the redundancy of organizations repeating the same vulnerability analysis work in parallel across the world. Of course, the caveat here is that a VEX document is signed by the producer, so who do we trust? Do we trust the original author, again with key material compromise and theft in mind? As a side note, do we trust, let's say, an independent consultancy that has done this analysis work and released it into the commons, and where then is the chain of trust that allows that trust relationship? So while this is the direction we're heading in, we still come back to what is ultimately a form of PKI, and the problems that are inherent in that ecosystem.
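On the container-hardening point (dropping capabilities such as CAP_SYS_ADMIN and constraining system calls), a minimal container-level securityContext along these lines is a common baseline, though not a universal recommendation:

```yaml
# Illustrative Kubernetes container securityContext: drop all capabilities,
# forbid privilege escalation, and apply the runtime's default seccomp profile
securityContext:
  runAsNonRoot: true
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]
  seccompProfile:
    type: RuntimeDefault
```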
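And for VEX itself, an OpenVEX-flavored statement for the OpenSSL scenario might look roughly like this, rendered here as YAML although OpenVEX documents are published as JSON; the document ID, product identifier, and author are invented, and the CVE is a real OpenSSL identifier used only as an example:

```yaml
# Illustrative OpenVEX-style statement (real documents are JSON; IDs invented)
"@context": https://openvex.dev/ns/v0.2.0
"@id": https://vendor.example/vex/2024-0001
author: Example Vendor Product Security
timestamp: 2024-01-15T00:00:00Z
statements:
  - vulnerability:
      name: CVE-2023-0464   # real OpenSSL CVE, used purely as an example
    products:
      - "@id": pkg:generic/example-appliance@1.2.3
    status: not_affected
    justification: vulnerable_code_not_in_execute_path
```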
Marc (21:25): All right. That was quite comprehensive. <laughs> I'm still sort of making sense of it, but you're on fire, Andy. There was some talk in the, I think, pre-pregame about Kubernetes, and how there are still so many organizations that we're talking to now for whom Kubernetes is still a new thing. And they think that it's secure kind of out of the box, and there's a bunch of common pitfalls, and I think there's a bunch of things that you've talked about in some different areas. So how would you like to address that: organizations taking Kubernetes into use and needing to secure it?
Andrew (21:59): Yeah, absolutely. We operate in the regulated sphere, helping organizations to unlock next-generation technologies; for many organizations, that is still containers, still cloud, still Kubernetes. There is an inherent complexity in shifting to a distributed-systems mindset that many developers didn't have to concern themselves with. When we would deploy LAMP stacks, we scaled vertically instead of horizontally, and generally just interacted through a shared data store connection. That's all changed with Kubernetes: a distributed system by default that has incredible benefits for resilience, scalability, and deployment, but it does add some additional issues. Because inherently, as TabbySable (Tabitha Sable) has said, Kubernetes is a robot that we pray to with our hopes and dreams, and hope that they manifest into reality. We are waiting for eventual consistency to resolve the YAML that we put to the API server, and so we have to completely remodulate the operational standards and expectations that we have of a distributed system compared to a single monolith. Where this has caused all sorts of problems in Kubernetes starts with the fact that Kubernetes is designed for developers; hence its rapid growth and transformative effect on the industry and all the major clouds. If you remember, EKS was the last of the cloud-supported Kubernetes platforms to launch, because AWS had already moved prior to that with the ECS container service. But still, the biggest cloud eventually stood up something Kubernetes-flavored, because everybody was using it and dreaming of this cross-cloud portability. So when starting off with a fresh Kubernetes system: it doesn't have network policies installed, it doesn't have what are now called Pod Security Standards, and it sits without any sort of integrated secrets management
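As a concrete example of one of those missing defaults: nothing restricts pod-to-pod traffic until a NetworkPolicy like the following default-deny sketch is applied in each namespace, and even then only if a policy-enforcing CNI plugin is installed; the namespace name is illustrative:

```yaml
# Illustrative default-deny NetworkPolicy; requires a CNI that enforces policies
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: my-app
spec:
  podSelector: {}   # selects every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
```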