Genealogy of Cybersecurity - Startup Podcast

Ep 4. Innovation Sandbox Winner, Hidden Layer, on ML System Attacks and MITRE Atlas

April 24, 2023 Paul Shomo / HiddenLayer Founder Chris Sestito Season 1 Episode 4

HiddenLayer Founder Chris Sestito, of RSAC Innovation Sandbox winner HiddenLayer, discusses adversarial attacks on ML systems, how ML lowers the cost of stealing intellectual property, his history at Cylance building next-gen ML systems, and a notable 2019 adversarial ML attack on Cylance.

HiddenLayer brings us up to speed on this important new attack surface. Paul and Chris discuss MITRE ATLAS, MITRE's new framework for attacks on ML systems, and whether the media is undercovering adversarial machine learning attacks.

Find Hidden Layer on the web at HiddenLayer.com, or on Twitter @hiddenlayersec

Check out MITRE ATLAS, a knowledge base of adversary tactics, techniques, and case studies for machine learning (ML) systems.

Hidden Layer CEO and Founder Chris Sestito can be reached on LinkedIn: https://www.linkedin.com/in/ctito/

RSAC Innovation Sandbox startup competition can be found here, and RSAC is on Twitter @RSAConference.

Paul Shomo can be found at LinkedIn.com/in/paulshomo. Send Paul Shomo feedback on Twitter @ShomoBits.

I wanted to ask about a very specific type of machine-learning-on-machine-learning attack, which surprised me. When I look at some of the resources on your website, it sounds like stealing a machine learning model's intellectual property is a little easier. Let's say you have a targeted advertising model and I want to steal it. You spent a million dollars over a couple of years making it; maybe it's a Google targeted advertising model. If I want the inputs and outputs to that model, I spin up a Google account, I surf the web, and I start seeing what kinds of ads you're serving.

Automate that, extrapolate it over many, many labels, and I'll have the labels required to ultimately rebuild a surrogate of that model, trained to come up with the same results. So with very little in the way of resources, I have now recreated something you spent quite a bit of time on in research and development. It sounds like the cost of stealing intellectual property just went way down.

The Genealogy of Cybersecurity is the cybersecurity startup and emerging tech podcast. I'll interview top entrepreneurs, startup-advising CISOs, venture capitalists, and more.

The startup world is full of innovative minds. This is the place to explore new threats, new approaches to cybersecurity, and, more importantly, the attack surfaces that arise as technology, regulation, and business evolve. Welcome to the Genealogy of Cybersecurity. I'm Paul Shomo.

Really nice to meet you. My name is Chris Sestito; I go by Tito. I'm one of the co-founders here at HiddenLayer, as well as the CEO.

I've been in cybersecurity now for about 15 years. About ten years ago I became very interested in the application of machine learning in the cybersecurity space, and I've dedicated the rest of my career to the bridge between those two domains.

Well, first off, congratulations. Welcome to the show, and congratulations on becoming a finalist at Innovation Sandbox. It's a huge accomplishment.

One of the things I wanted to ask you first: you spent five years at Cylance during its rise in the mid-to-late 2010s, which were the five good years to really build it out. Cylance was one of the few vendors that actually delivered a primarily machine-learning-based approach to cybersecurity that succeeded. And it's underselling it just to say it succeeded, because that approach really revolutionized antivirus.

Could you tell us a little bit about your work with them, and if you can, give us any insight into why they were successful where so many in that first wave of AI and ML weren't?

Yeah, that's a great question, and I'll try to answer it in order. I spent a good amount of my time there in what we called research and intelligence, which I ended up running; it covered both the data science and threat research parts of the organization.

It was a really great organization, especially in its willingness to pioneer new technologies for stale problems we hadn't gotten right over a long period of time. I was excited there because one of the challenges I'd always had on the antivirus side of things was the way we labeled threats and the labels we used to inform our customers. I always thought we, meaning cybersecurity, were far too superficial in that regard.

We tend to use very scary labels like Trojan or virus, things that weren't really all that informative about what you're up against, what you should do about it, or what it means at a code-functionality level, and I think those things are all very important. If you're an antivirus company that's been around for a million years and you want people to think you saved them from 10,000 things today, then scary-sounding labels are great.

But if you really want to start helping someone understand what true threats look like, and what true risk looks like as a result of them, labeling is really important. That becomes even more important when you start using machine learning models to identify threats, because the model doesn't care. Take ransomware like WannaCry back in 2017: we as human beings go, okay, WannaCry is ransomware, but a machine learning model goes, well, it also has worm functionality and exploit capabilities and all these other things.

And it doesn't really know why you care more about the ransomware label than any of those other functions. So it becomes even more important to start labeling accurately. It was a really cool place for that; one of my favorite things was trying to tell the true stories behind functionality and capabilities. But yeah, it was a fantastic era. We really believed in machine learning.

One thing you said in particular: you talked about the labeling of families, which people don't think about, but that's very politicized for publicity, and the math doesn't care.

So that's kind of a cool perspective I hadn't thought about: the truth-telling of machine learning for categorizing. You also brought up the dataset you could get hold of; I assume you're talking about malware samples.

Now, one of the quirky things I think about with AI companies, and we have this now, is that when I listen to some people, whether it's Cylance or another company that succeeded, they're always talking about building up artificial intelligence. They're building up models, they're shoveling in data that's normalized, and they're doing training, reinforcement, deep learning, et cetera. They're building up an intelligence, and it always involves exposing it to data.

With a lot of startups you can never tell where they get the data, but I guess with threat intelligence you actually did have a good dataset to build from. That's what it sounds like you're alluding to.

Yeah, I think that's an incredibly important part of the equation, for sure, and it's something I like to talk about, because some organizations tout how large their datasets are: we can build the best machine learning because we have this vast dataset to train it on. That's important, but more important than having a large dataset is having a dataset that's well representative of the problem space you're trying to operate within.

Generally speaking, that requires a large set that's still very well representative of the real world. In the case of Cylance and the models we were using, you need a dataset that represents all the good parts of a Windows environment and all the bad things we were trying to stop, and it needs to be very representative of what the model is expected to make decisions on when it's live, making those real-world product decisions. So it's very important: your data needs to be representative, and it needs to be labeled very well.

It needs to be very explicit about what is good and what is bad.

I'm glad I have you here, because there are a lot of people doing it wrong or not understanding what right looks like. It's a very interesting new era. Before, and I spent a lot of years in commercial software development, we were amassing lines of code, building features and architectures; now we're amassing data. We're building up neural networks and large language models; we're accruing intelligence, basically, through this process.

It's wildly different. So do you want to tell us a little bit about HiddenLayer and how you fit into this? I know you have something to do with machine-learning-on-machine-learning attacks. Why don't you give us a little overview, and I'll ask you some questions.

Yeah, absolutely. HiddenLayer was really born out of an event that happened to us at Cylance: an adversarial machine learning attack on the ML model we had deployed. That was highly consequential, and it was brand new.

In the 2019 time frame that was a pretty novel type of attack. After the dust settled, my co-founders, who were also on the team with me at Cylance, and I said, you know what, this is going to be a problem for anybody deploying machine learning models into their hardware or software products. We knew there was going to need to be a dedicated type of security product for this, because on the threat landscape side these threats are starting to evolve.

When you think about the extreme rate of adoption of AI and ML, whether it's generative AI and some of the large language models you mentioned, or things being developed in-house to solve problems, different decision-making engines stepping into the space, it's getting deployed everywhere. It's in every industry. Every company is a big data company now, and they all want to take advantage of what this technology can bring in terms of efficiency or increased quality. But security, as usual, tends to be a bit of an afterthought.

Most of this is being built on open-source code, and when you take all of that together, you start to realize how vulnerable artificial intelligence is. From a code standpoint it's very easy to abuse; you can hide malware in a machine learning model's weights, and we see this and show it quite often in demos. The technology is abusable at a code level as well as at a logical level: somebody can infer information or data out of a model that you allow them to interact with, because that interaction is part of the product design.

All of that is really why we created HiddenLayer. We have an entire platform now, multiple products dedicated to protecting models in real time, detecting and responding to attacks against them, as well as scanning them from an asset perspective to make sure they haven't been compromised at the code level. That's what we're bringing to the marketplace today.

Very cool. And you mentioned that machine learning and AI are obviously taking off even for core business: whether you're a lumber distributor or you run retail chains, you're developing machine learning models and AI.

Is that what you're protecting, or are you also protecting cybersecurity products that have machine learning in them, or is it both?

Yeah, it's everything. We protect anywhere machine learning models are being used. We like to say we look at them very much from the outside in, versus other organizations or academic exercises that think about protecting the model from the inside looking out. We wanted to make sure we could build a solution that was noninvasive, one that could secure your algorithm.

We can secure your data and your product without having to reach in and touch all of it. We took a lot of the same technologies we worked on in the EDR space and applied them to machine learning. That's why we call our product machine learning detection and response: we look at behavioral patterns of interactions with those models, and that allows us to protect you and understand when an attack is taking place without having to see any of your raw data or the algorithms themselves. We also have our scanner, which does analyze the model, but we actually hand the scanner off to you so you can run it yourself.

That way we still don't have to see any of that IP, and it allows us to work with many different organizations.

Very cool. I want to get into MITRE ATLAS, because it sounds like you might have had some involvement with that. But first, I wanted to ask about a very specific type of machine-learning-on-machine-learning attack, which surprised me. If you think about the idea of stealing code and then trying to reverse engineer mountains of code, it seems like a lot of work. I was always surprised that the Chinese government would get a lot of publicity, whether it's overplayed or not, for trying to steal intellectual property and reverse engineer it.

It always seemed like a lot of work compared to just putting those resources into innovation. But when I look at some of the resources on your website, it sounds like stealing a machine learning model's intellectual property is a little easier, because I can put forward my own adversarial machine learning models to learn how your stuff works and steal it. Could you explain that?

Yeah, absolutely. There are many different techniques to do exactly that. One common technique we talk about a lot is building a surrogate model.

If you have a model, maybe in a hardware product, maybe in a software product, as long as somebody can interact with it, they can see the input side, they can see the output side, and they can ultimately build enough labeling on their own to retrain your model. There are a lot of examples of it. Let's say you have a targeted advertising model and I want to steal it, because you spent a billion dollars over a couple of years making it; maybe it's a Google targeted advertising model. If I want the inputs and outputs to that model, I spin up a Google account, I surf the web, and I start seeing what kinds of ads you're serving.

If I automate that and extrapolate it over many, many labels, I'll have the labels required to ultimately rebuild a surrogate of that model, trained to come up with the same results. So with very little in the way of resources, I have now recreated something you spent quite a bit of time on in research and development. That's just a quick example I came up with on the spot, but there are many types of examples of building a surrogate to essentially extract a model.
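To make that surrogate idea concrete, here is a minimal sketch of black-box model extraction, using a toy scikit-learn classifier as a stand-in for the victim. Everything here (the models, the synthetic queries, the agreement metric) is a hypothetical illustration, not HiddenLayer's research or any real target.

```python
# Minimal sketch of surrogate-model extraction with only black-box access to
# the victim's predictions. The "victim" is a toy scikit-learn classifier
# standing in for a proprietary model; all names and data are hypothetical.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Victim: an expensive model the attacker cannot see inside.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
victim = GradientBoostingClassifier().fit(X, y)

# Attacker: generate probe queries and harvest the victim's outputs as free labels.
rng = np.random.default_rng(1)
queries = rng.normal(size=(2000, 20))      # synthetic inputs the attacker controls
stolen_labels = victim.predict(queries)    # observed outputs become training labels

# Train a cheap surrogate on the harvested input/output pairs.
surrogate = DecisionTreeClassifier(max_depth=10).fit(queries, stolen_labels)

# Agreement with the victim on fresh inputs approximates how much was "stolen".
probe = rng.normal(size=(1000, 20))
agreement = accuracy_score(victim.predict(probe), surrogate.predict(probe))
print(f"surrogate agrees with victim on {agreement:.0%} of fresh queries")
```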

You might also extract a model so you can attack it offline, building an attack you then bring online knowing it will be successful, without having to construct it and learn about decision boundaries and features live, where somebody could spot you with network anomaly detection, for example. So there are many different ways to do it, and many different motivations as well.

As you walk through that, it sounds to me like the cost of stealing intellectual property just went way down, which means it's probably going to be a big activity in the future.

Yeah, probably the scariest thing we've learned throughout all of this is just how easy these adversarial machine learning attacks are. Think back to the early 2000s, when we started seeing tools like Metasploit become available and all of a sudden anybody who wanted to conduct some sort of hack or attack really just had to interact with a menu. We are now at that spot with adversarial machine learning as well. We actually published some research about a month and a half ago showing 26 different automated adversarial machine learning attack tools available on GitHub.
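To give a sense of how low that bar is, here is a hedged, from-scratch sketch of a naive black-box evasion attack: a random perturbation search against a toy classifier. It is only an illustration of the idea; the tools HiddenLayer catalogued automate far more sophisticated attacks, and nothing here reproduces any of them.

```python
# Toy black-box evasion sketch: randomly perturb an input until the victim
# model's predicted label flips. Illustrative only; real attack tools are far
# more efficient, and the model and budget below are arbitrary choices.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
victim = LogisticRegression(max_iter=1000).fit(X, y)   # stand-in victim model

def evade(x, model, budget=1.0, tries=10_000, seed=0):
    """Search for a bounded perturbation that changes the predicted class."""
    rng = np.random.default_rng(seed)
    original = model.predict([x])[0]
    for _ in range(tries):
        candidate = x + rng.uniform(-budget, budget, size=x.shape)
        flipped = model.predict([candidate])[0]
        if flipped != original:
            return candidate, original, flipped
    return None

result = evade(X[70], victim)        # X[70] sits near a class boundary
if result is None:
    print("no evasion found within the budget")
else:
    adv, before, after = result
    print(f"label flipped {before} -> {after}; max feature change "
          f"{np.abs(adv - X[70]).max():.2f}")
```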

These attacks are just as commoditized as any other type of attack now. You really don't have to be a data scientist; you don't have to be an expert developer. You just need to want to attack a model and go download some of the tools that are available. They'll build the parameters and walk you through conducting these types of attacks with significant ease of use.

Yeah, it's interesting that you bring up those 26 tools, because attacking machine learning already has a framework, MITRE ATLAS, which is a little surprising.

DevOps doesn't have a framework, API attacks don't, but attacking machine learning systems does. Did you have some involvement in that?

Yeah, they're a fantastic team over there. The MITRE team, Dr. Christina Liaghati and her colleagues, have really built something we all need to pay a lot of attention to, which is the MITRE ATLAS framework. Just like MITRE ATT&CK in the EDR and endpoint protection space, MITRE ATLAS defines the techniques and tactics adversaries can use in the adversarial machine learning space. And we are involved with them now.

We weren't involved at its conception, but we work with them both on use cases of things we're seeing in the wild and on how to continue to grow and develop the framework, so that organizations can take advantage of that kind of data and understand how to protect themselves. We've actually built MITRE ATLAS codes and terminology into every single part of our product. One of our core beliefs is that data scientists should not have to be security operators. They shouldn't have to worry about whether what they're building is hardened enough; they should be able to just keep solving problems.

Likewise, security operators shouldn't have to be data scientists in order to protect ML assets. We keep that in mind with everything we build: how do we empower both of those groups in their existing domains? MITRE ATLAS is an enormous tool for doing exactly that. When we detect something and put it out, say as output to Splunk or Datadog for security operators, we do it with MITRE ATLAS terminology, so they have a third-party resource to go understand exactly what's going on.
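As a rough illustration of what putting MITRE ATLAS terminology on detection output could look like in practice, here is a hedged sketch of enriching an event before forwarding it to a SIEM. The event fields, the technique-ID placeholders, and the webhook URL are all hypothetical; this is not HiddenLayer's schema, and real ATLAS technique IDs should be looked up on atlas.mitre.org.

```python
# Hypothetical sketch: tag a model-attack detection with MITRE ATLAS-style
# terminology, then POST it to a SIEM webhook (Splunk, Datadog, etc.).
# Technique IDs, field names, and the endpoint are placeholders, not a
# verified ATLAS mapping or HiddenLayer's real output format.
import json
import urllib.request

ATLAS_TAGS = {
    "model_inference_probing": {
        "atlas_technique": "AML.TXXXX (placeholder - look up on atlas.mitre.org)",
        "atlas_tactic": "ML Model Access",
    },
    "suspected_model_extraction": {
        "atlas_technique": "AML.TXXXX (placeholder)",
        "atlas_tactic": "Exfiltration",
    },
}

def enrich_and_forward(detection: dict, siem_url: str) -> None:
    """Attach ATLAS-style tags to a detection and ship it to a SIEM webhook."""
    event = dict(detection)
    event.update(ATLAS_TAGS.get(detection["type"], {}))
    req = urllib.request.Request(
        siem_url,
        data=json.dumps(event).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

# Example usage (placeholder endpoint):
# enrich_and_forward(
#     {"type": "model_inference_probing", "model": "fraud-scorer-v3", "severity": "high"},
#     siem_url="https://siem.example.com/ingest",
# )
```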

And vice versa, when data scientists see, hey, somebody's performing an inference attack against our model, they can go out to MITRE ATLAS and read about it in that exact same language and understand it. So it's a very powerful tool when it comes to defining this problem space.

You mentioned an inference attack, and there's all kinds of stuff I saw there that gives you a new vocabulary, which is an important part of talking about this, but also of thinking clearly about it. The question I had, though: is MITRE jumping the gun and getting excited about a future attack surface, or are we really seeing this impact companies now and it just hasn't been publicized?

Because you mentioned 26 adversarial machine learning tools, and I'm not really hearing a lot about that. Is this just undercovered by the media?

There are a few reasons for that, and the answer is yes, it is absolutely happening in the wild. I think there's a handful of reasons you're not hearing a whole lot about it. Number one, in the security world we've gotten very used to ransomware, which is very loud and very in-your-face; nobody is unaware that they've been hit by a ransomware attack. They can't get into their files, and there's a ransom note.

If you think back to before ransomware, we had to look for things like backdoors and rootkits that existed a little bit under the radar. These types of adversarial ML attacks are a lot more akin to that environment, where you really have to go looking for the attack. If somebody is skewing your model over time, poisoning something, or has built something into that model at the code level, you need to understand that it's happening; you need to look for it. So part of it is that organizations really haven't embraced a way to look for it, and there really aren't that many solutions that exist. HiddenLayer is one of the first, or actually is the first, and others are coming around now as well.

But it's definitely a problem that, if you're not looking for it, you may not be aware is happening. The other side of that is regulation. We have some strict regulation in place from different agencies on what you need to do and what you need to disclose to the public when it comes to things like data that's been exfiltrated from your network or pulled out of a database, but all of that is very far behind when it comes to ML operations. So we're working with some agencies to update what the requirements should be if information is inferred out of your model or ML pipeline, and how you should tell your shareholders or the public, or something along those lines.

I think the UK is a little in front of us right now on that, but we certainly need to catch up there as well. So it's a bit of a mix: it's organizations not knowing when they've been exploited, and the ones that do know are currently not under any major obligation to tell everybody. We need to work on that from a few different angles, but I can tell you with absolute confidence that it's something that is pretty widespread at this point.

Well, hopefully this podcast gets some information about that out there, because it's very fascinating, and the fact that a whole MITRE framework, ATLAS, came about and this just isn't part of the public discourse is a huge thing. I want to get into a little more detail about how your product works, or how it's deployed, so we can picture the way it works. You're part of the AI application security space; that's what Gartner calls it. And when you're securing applications, it's wildly different from what a lot of us are used to, chasing malware.

From what we've seen with appsec and securing development operations, DevOps, et cetera, there are all these products that have shifted left into code, doing static analysis, and I already mentioned poisoning training sets, so that would be during the development of the ML. So there's that bucket. Then there's dynamic testing, where you're testing it, I guess like in QA, for vulnerabilities. But then there's also this other bucket that really frustrates me, because they keep changing the words for it.

I think it's baked into the application in the production environment; runtime is typically the word, but now they're saying interactive. The buzzword keeps changing. So there's static while you develop, dynamic while you're testing, and then this other runtime or interactive category when it's deployed. Are you in one of those categories, or more than one? What do you look like in terms of deployment?

Great question. We have a product in two of those categories. For our machine learning detection and response product, if the term is runtime in the application world, the term would be inference time in ML operations.

This would be when the model baked into your product or pipeline is making a decision; that's what we're protecting. And that's really important, because that's where it tends to be exposed. For instance, let's say you have a model protecting a fraud use case, a model that's deciding whether credit card transactions are fraudulent. That's happening in real time, to potentially interrupt a transaction that may be fraud.

But it's also highly exposed, because as an attacker all I have to do is go steal some credit cards from the dark web and start performing transactions that are very measured, potentially even using some reinforcement learning myself to try to understand that fraud model's decision boundaries. I have all the inputs and all the outputs of that model. So that's a really good use case for our product, because it's something that protects in real time. That's the real-time solution, and that's where we're heavily differentiated.
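To make that "very measured" probing idea concrete, here is a toy behavioral heuristic over a single requester's query stream, flagging inputs that cluster unusually tightly and arrive at a machine-steady cadence. It is a hedged sketch of the general concept only, not HiddenLayer's MLDR detection logic.

```python
# Toy behavioral heuristic (not HiddenLayer's MLDR logic): score a requester's
# query stream for extraction-like behavior, using only metadata about the
# queries rather than the model's raw data or algorithm.
import numpy as np

def probing_score(queries: np.ndarray, timestamps: np.ndarray) -> float:
    """Higher score = more boundary-probing-like behavior for one requester."""
    # 1) Near-duplicate inputs: probers often re-query tiny variations of one point.
    spread = queries.std(axis=0).mean()
    duplication = 1.0 / (spread + 1e-6)

    # 2) Machine-steady timing: very low variance in inter-request gaps.
    gaps = np.diff(np.sort(timestamps))
    steadiness = 1.0 / (gaps.std() + 1e-6) if gaps.size > 1 else 0.0

    return duplication + 0.1 * steadiness

# Synthetic comparison: a scripted prober versus organic traffic.
rng = np.random.default_rng(0)
prober = probing_score(rng.normal(0, 0.01, size=(200, 8)), np.arange(200) * 2.0)
normal = probing_score(rng.normal(0, 1.0, size=(200, 8)),
                       np.cumsum(rng.exponential(5.0, 200)))
print(f"prober score {prober:.1f} vs. normal-traffic score {normal:.1f}")
```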

There are some other folks out there who, as I understand it, help you harden your model, essentially making it a little more robust against these types of attacks. We don't love that approach, because it's highly invasive. It makes your model more expensive, there's not a ton you can distribute and extrapolate out from there, and if you have a lot of different use cases in your organization, it requires a whole bunch of different solutions. So we like the generic MLDR approach that can protect you across the board.

Now, that's all at decision time. We also have the model scanner. That can run while you're building, or while you're bringing ML assets into your organization. At the moment we have found hundreds of examples, over 500, of compromised models out on the big model repositories, OpenAI or Hugging Face or sites like that. As a data scientist, if you pull one of those models down into your environment, antivirus isn't scanning it and EDR is not going to pick up on it. So you need a way to know you didn't bring anything unsafe into your organization.
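As a rough illustration of the supply-chain risk in pulling down serialized models, here is a hedged sketch of one check a scanner might run on a pickle-based model artifact: walking its opcodes for imports that have no business in a model file. It is a simplified example of the general idea, not HiddenLayer's scanner, and real tooling covers many more formats and evasion tricks.

```python
# Simplified illustration: inspect a pickle-serialized model artifact for
# suspicious imports *before* loading it. Not HiddenLayer's scanner; pickle is
# only one of several model formats, and this check is easy to evade.
import pickletools

# Modules a legitimate model pickle should rarely, if ever, import.
SUSPICIOUS_MODULES = {"os", "subprocess", "builtins", "socket", "posix", "nt"}

def scan_pickle(path: str) -> list:
    """Return suspicious GLOBAL / STACK_GLOBAL imports found in a pickle file."""
    findings, recent_strings = [], []
    with open(path, "rb") as f:
        data = f.read()
    for opcode, arg, _pos in pickletools.genops(data):
        if isinstance(arg, str):
            recent_strings.append(arg)
        if opcode.name == "GLOBAL":                      # arg looks like "module name"
            module = arg.split()[0].split(".")[0]
            if module in SUSPICIOUS_MODULES:
                findings.append(f"GLOBAL {arg}")
        elif opcode.name == "STACK_GLOBAL" and len(recent_strings) >= 2:
            module = recent_strings[-2].split(".")[0]    # module pushed just before the name
            if module in SUSPICIOUS_MODULES:
                findings.append(f"STACK_GLOBAL {recent_strings[-2]}.{recent_strings[-1]}")
    return findings

# Example usage: refuse to load the artifact if anything suspicious turns up.
# findings = scan_pickle("downloaded_model.pkl")
# if findings:
#     raise RuntimeError(f"refusing to load model: {findings}")
```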

You could say that's more at the training and development stage for the model, because I would say roughly 90% of model builds out there start from some sort of pre-trained state, and typically that comes from one of those repositories. Let's say you're building a facial recognition model: you're probably starting with some sort of image classification model, maybe pulling down something like ResNet from Microsoft, and then training from that point on, because it's just a much better and more efficient way for data scientists to move. So there's a lot of transfer learning.

There are a lot of pre-trained models being passed around, and they're a really good opportunity for attackers to hide things in. So in that regard, the model scanner we have sits a little more on the transfer learning side in terms of MLOps; you might say it's on the supply chain side for application security or any other cybersecurity domain. So we exist in those two components. We also have other solutions we're working on right now for things like, hey, when you're building a model, if you build it using these parameters, you may be susceptible to attacks X, Y, and Z, and here are some recommendations.

But our major differentiator is being able to stop these attacks in real time, as well as the supply chain side, making sure you're neither the cause, issuing a compromised model out into the world, nor receiving anything into your organization that's already been compromised.

Many vendors who look at behaviors and live attacks, and actually go back and forth with the adversary, tend to put services on top of that, and it already sounds like you have some kind of, is it a managed service, or some type of service?

I guess that would be IR for machine learning system attacks.

That's exactly right. Yeah, we do have a selection of services we offer, and it's all really based on the fact that this is a very new space for a lot of people to consider. We do some AI red teaming, where we can show you what's going on with your model and where it's exposed, and we can do that with some level of customization toward what an organization thinks is important. Typically, when we start with customers we've already worked with, we come in, do an overall assessment, and say, here's where you're most exposed.

Here's where your models have potential exposure, to which types of attacks, and that kind of thing, before we go in. We also offer trainings and some educational services. We have trainings that teach security folks more of the data science side, and trainings for data science folks on more of the security side of the house, to get everybody plugged into solving the same problem. Because at the end of the day, we want to empower data scientists to keep building and developing. We don't want to slow them down, but we want them to know that their models and their efficacy are going to be protected from the outside world.

The whole goal here is really to continue that incredible adoption of AI without letting threat actors slow it down.

Makes sense. It sounds like you have the full spectrum of what we need for a brand new attack surface. How do we find you and your company, HiddenLayer, on the web?

The best place to reach us is HiddenLayer.com. You can make a request there to take a look at the product. We're also going to be at RSA this year, so if you're going to be there, you can sign up on the website to come see us as well.

We have RSAC 2023 just weeks out, but we have plenty of different ways to get in contact if you want to see a demonstration or reach out to us. We're also on LinkedIn as HiddenLayer and on Twitter @hiddenlayersec. So there are plenty of ways to catch us on social or on our website.

Well, good stuff. Thanks so much, Chris. This is very fascinating, very cutting edge, and I can see your differentiator being first. This is an area I'm going to be watching closely, and I'll be learning about MITRE ATLAS; I think we all should look at this brand new area of cybersecurity. Thank you so much for coming on.