Reimagining Cyber - real world perspectives on cybersecurity

AI and ChatGPT - Security, Privacy and Ethical Ramifications - Ep 62

April 05, 2023 Reimagining Cyber Season 1 Episode 62

This episode features “the expert in ChatGPT”, Stephan Jou. He is CTO of Security Analytics at OpenText Cybersecurity. 

“The techniques that we are developing are becoming so sophisticated and scalable that it's really become the only viable method to detect increasingly sophisticated and subtle attacks when the data volumes and velocity are so huge. So think about nation state attacks where you have very advanced adversaries that are using uncommon tools that won't be on any sort of blacklist.”

“In the past five years or so, I've become increasingly interested in the ethical and responsible application of AI. Pure AI is kind of like pure math. It's neutral. It doesn't have an angle to it, but applied AI is a different story. So all of a sudden you have to think about the implications of your AI product, the data that you're using, and whether your AI product can be weaponized or misled.” 


“You call me the expert in ChatGPT. I sort of both love it and hate it. I love it because people like me are starting to get so much attention, and I hate it because it's sort of highlighted some areas of potential risk associated with AI that people are only now starting to realize.”

“I'm very much looking forward to using technologies that can understand code and code patterns and how code gets assembled together and built into a product in a human-like way to be able to sort of detect software vulnerabilities. That's a fascinating area of development and research that's going on right now in our labs.”

“[on AI poisoning] The good news is, this is very difficult to do in practice. A lot of the papers that we see on AI poisoning, they're much more theoretical than they are practical.”



Follow or subscribe to the show on your preferred podcast platform.
Share the show with others in the cybersecurity world.
Get in touch via reimaginingcyber@gmail.com

[00:00:00] Stephan Jou: The techniques that we are developing are becoming so sophisticated and scalable that it's really become the only viable method to detect increasingly sophisticated and subtle attacks when the data volumes and velocity are so huge. So think about nation state attacks where you have very advanced adversaries that are using uncommon tools that won't be on any sort of blacklist.

[00:00:29] Rob Aragao: Welcome to the Reimagining Cyber Podcast, where we share short and to the point 

[00:00:33] Rob Aragao: perspectives on the cyber landscape. It's all about engaging, yet casual conversations and what organizations are doing to reimagine their cyber programs while ensuring their business objectives are top priority. With my co-host, Stan Wisseman, Head of Security Strategists, I'm Rob Aragao, Chief Security Strategist, and this is Reimagining Cyber.

[00:00:54] Rob Aragao: Who do we have joining us for this? 

[00:00:56] Stan Wisseman: Well, Rob, today we're going to be talking about artificial intelligence. I can't think of anybody I'd rather have this discussion with than Stephan Jou. Stephan is the CTO of Security Analytics at OpenText Cybersecurity, and he leads various analytics-related initiatives for us.

[00:01:12] Stan Wisseman: Prior to OpenText, Stephan was the CTO of Interset at Micro Focus, a leading-edge cybersecurity and In-Q-Tel portfolio company that uses machine learning and behavior analytics. He was also at IBM Cognos, where he led the development of 10 products in the areas of cloud computing, visualization, data mining, and neural networks.

[00:01:33] Stan Wisseman: Stephan, it's great to have you on the episode with us today to help us better understand what's going on in this market. Do you have anything else that you'd like to add about your background first before we get started? 

[00:01:43] Stephan Jou: I guess the only thing I'd add is in the past five years or so, I've become increasingly interested in the ethical and responsible application of AI.

[00:01:51] Stephan Jou: You know, pure AI is kind of like pure math. It's neutral. It doesn't have an angle to it, but applied AI is a different story. So all of a sudden you have to think about the implications of your AI product, the data that you're using, and whether your AI product can be weaponized or misled. And so I've started thinking a lot about ethical and responsible AI.

[00:02:11] Stephan Jou: Blogging about it, writing about it, and I was honored to be an invited participant to the 2018 G7 Multi-Stakeholder Conference on AI. And I was consulted by the Privacy Commissioner of Canada on the regulation of AI and data policy back in 2020, just before everything shut down for COVID.

[00:02:29] Stephan Jou: So this is a fond memory, but it was also prescient. 

[00:02:33] Stan Wisseman: And you also did a lot of work tracking the COVID virus using some AI technology for that. 

[00:02:41] Stephan Jou: That's a fantastic example of AI for good. So using analytics to track the spread of COVID, starting with Canada and some of the regions, but then eventually expanding our models to track the global spread of COVID.

[00:02:55] Stephan Jou: That was a fun project that we sort of did over a couple of weekends because the team and I felt that we wanted to sort of give something back to the world. 

[00:03:04] Stan Wisseman: I remember that. So ChatGPT is the topic du jour, right? I mean, of course it's taking the media by storm, and it is kind of cool seeing how different users are posting the results they're getting, whether it be directly from ChatGPT or using some of the tools.

[00:03:22] Stan Wisseman: Like Bing. Microsoft invested in OpenAI, I think around 10 billion dollars, and they're using it in Bing and other tools. And that seems to have spawned some of the AI players that we think of as early adopters of AI, like Google, to accelerate some of their offerings to market.

[00:03:43] Stan Wisseman: You're the expert. What's going on as far as these generative AI tools, and what's all the fuss about?

[00:03:50] Stephan Jou: It's so funny. I mean, you call me the expert in ChatGPT. I sort of both love it and hate it. I love it because people like me are starting to get so much attention, and I hate it because it's sort of highlighted some areas

[00:04:07] Stephan Jou: of potential risk, I guess, associated with AI that people are only now starting to realize. And at the same time, it's also sort of being credited with AI powers that aren't as realistic as I think people want them to be. That said, ChatGPT, I think we can all agree, is pretty amazing.

[00:04:25] Stephan Jou: It is officially the fastest-growing internet service ever. I believe it got over a million users in five days, a hundred million users in two months. That's faster than Twitter, faster than Facebook, faster than anything else we've ever seen before. But in some ways, like I said, it's kind of nothing new.

[00:04:43] Stephan Jou: GPT-3, which is the engine underneath ChatGPT, was released in 2020, and it more or less had the same capabilities in generating text and doing what's called few-shot learning. But with ChatGPT, what they added, which was new and very clever, is the chat interface, right? So by learning from human conversations like movie scripts, all of a sudden you didn't need to be a data scientist or programmer to use it.

[00:05:09] Stephan Jou: That was the case with GPT-3. With the same engine, you kind of needed to know how to use the API. You kind of needed to be a bit of a data scientist. There was a playground web UI, but it was still pretty geeky; there were these knobs and levers that most people wouldn't understand. But by basically removing the barrier of entry to use this very powerful pre-trained large language model to generate whatever you want by just talking to it, that

[00:05:35] Stephan Jou: got everyone's attention. So suddenly everyone became an AI expert, right? They're all sort of talking about all the things that AI could do, whereas I've been talking about it for years and years. So that's why I kind of smile a little bit when I hear stories on the news. But yeah, it's certainly worth fussing over, right?

[00:05:52] Stephan Jou: It's sort of made very powerful AI incredibly accessible.

[00:05:59] Rob Aragao: Well, that's the key though, right? We talked about this before with you and it's been around for a little bit, but basically all of a sudden here comes this interface and it unlocks kind of the access for people to start kind of digging in and seeing what it's all about.

[00:06:12] Rob Aragao: When you think about the value of AI and how it can help different industries, and specifically obviously cybersecurity, we've seen it help in different types of capabilities. Detection and response, as an example, driving better efficiency. What are some of the things that you're seeing, maybe even beyond detection and response capabilities now? What's exciting you relative to the cybersecurity applications of AI?

[00:06:37] Stephan Jou: Yeah, that's a great question. Two examples come to mind. One is similar to the example that you gave: we do have a lot of AI that we've been developing in the labs for what's now called user and entity behavioral analytics. So that's my bread and butter. Essentially, we're using time-series anomaly detection to learn the normal behavioral patterns of users, machines, processes, networks, devices, and so on, in order to be able to detect and quantify abnormal behaviors that may be indicative of some sort of threat.

[00:07:08] Stephan Jou: And the current state of the art is pretty good. It's pretty good at detecting, I'll call it, naive insider threats and, you know, standard script-kiddie-type activities. But what I'm really excited about is that the techniques we are developing are becoming so sophisticated and scalable that it's really become the only viable method to detect increasingly sophisticated and subtle attacks when the data volumes and velocities are so huge.

[00:07:36] Stephan Jou: So think about nation-state attacks, where you have very advanced adversaries using uncommon tools that won't be on any sort of blacklist to carry out their attack. The most recent example I can think of is someone who tried to exfiltrate data intentionally but did not use any of the standard file copy commands.

[00:07:57] Stephan Jou: He just used disk partitioning tools to be able to mass-clone the entire contents. Now, that's not a tool that would typically show up in any sort of rule, so you can only detect that through anomaly detection. And in terms of scale, think IoT scale, where we have nation-state attacks on power grids, right?

[00:08:16] Stephan Jou: There's so much data that it's impractical to have a single rule set that can encompass all the possible attack vectors. You really need AI that can scale and understand each device in your IoT grid individually, and be able to get at these really subtle patterns that otherwise could be lost in the volume.
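For readers who want a feel for the per-entity baselining described here, below is a minimal, illustrative sketch in Python (not OpenText's actual implementation); the entity name, feature, and threshold are hypothetical:

```python
import math
from collections import defaultdict

class EntityBaseline:
    """Running mean/variance for one behavioral feature of one entity
    (e.g. files copied per day), using Welford's online algorithm.
    Purely illustrative, not a production UEBA model."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def anomaly_score(self, x: float) -> float:
        """Z-score of a new observation against this entity's own history."""
        if self.n < 2:
            return 0.0  # not enough history to judge yet
        std = math.sqrt(self.m2 / (self.n - 1))
        return abs(x - self.mean) / std if std > 0 else 0.0

# One baseline per (entity, feature) pair, learned independently, so each
# user or device is only ever compared against its own "normal".
baselines = defaultdict(EntityBaseline)

def observe(entity: str, feature: str, value: float, threshold: float = 3.0) -> None:
    key = (entity, feature)
    score = baselines[key].anomaly_score(value)
    baselines[key].update(value)
    if score > threshold:
        print(f"unusual {feature} for {entity}: value={value}, score={score:.1f}")

# Hypothetical usage: a user who normally copies about five files a day.
for count in [5, 4, 6, 5, 5, 7, 4, 300]:
    observe("rob", "usb_files_copied_per_day", count)
```

A real product would track many such features per entity with more robust statistics and combined scoring, but the core idea of learning each entity's own normal and flagging deviations is the same.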

[00:08:35] Stephan Jou: The second example is something a little bit closer to some of the technologies that we see in ChatGPT, where you are using something called a large language model, which really understands how to read stuff, right? So ChatGPT has learned how to read and speak human language. And then we also have computer language models that learn how to read and write source code.

[00:08:58] Stephan Jou: So think about the Codex model that's underneath Copilot, the GitHub extension that allows you to generate code from a comment. So we're using that same type of AI to get a deeper understanding of source code and vulnerabilities, patching, and complex SBOMs, the software bills of materials, that really require

[00:09:19] Stephan Jou: someone like a programmer, or an AI that can think like a programmer, to be able to read the source code. So that is, I think, something I'm very much looking forward to: using these technologies that can understand code and code patterns and how code gets assembled together and built into a product, in a human-like way, to be able to detect software vulnerabilities without having to rely on traditional pattern-matching techniques.

[00:09:43] Stephan Jou: So that's a fascinating area of development and research that's going on right now in our labs.

[00:09:48] Stan Wisseman: Well, I can imagine a lot of companies, or employees at companies, would be interested in experimenting and seeing what they can do with those different aspects of generating code for different use cases. Going back to your comment at the beginning about ethical behavior of AI, but also just proper use:

[00:10:07] Stan Wisseman: Are there concerns about IP ownership, or any kind of concerns about sharing confidential information with these different tools, that companies should factor into their policies? And what other steps should they potentially take to control the risks of employees leveraging these technologies in ways that perhaps the company doesn't even know about?

[00:10:32] Stephan Jou: Yeah, that's a tricky one as well. A lot of these most powerful models are exclusively cloud-based, because these models are so big you can't download them and run them on your notebook. I remember GPT-2 was small enough that I could download it and run it on my notebook.

[00:10:49] Stephan Jou: But with these large language models, part of the secret power is that they've read so much stuff and have so many parameters that they seem to have developed these emergent properties that make them very, very powerful. But that same largeness makes it impractical to run them anywhere but in a cloud.

[00:11:07] Stephan Jou: So what that means is that if you're using certain models, like the GPT-3 model that's underneath ChatGPT, if you are using those APIs, you are inevitably going to a cloud service. So that means whatever data you're sending to be analyzed, whatever prompts you're providing to train or fine-tune your model, it's being sent to essentially a third party, and

[00:11:32] Stephan Jou: the way confidential customer information is handled, for example, if that's something you were concerned about your employees potentially uploading to the cloud, really varies from company to company, and it's only enforced basically by policy. There's no easy technical solution to prevent confidential data from being uploaded, because the model can't tell the difference between real confidential data and fabricated confidential data, essentially.

[00:11:58] Stephan Jou: And so really, the onus, unfortunately, is on the company to understand the technology that their employees might be using, the policies around it, whether those policies are being enforced, or whether it's something that can even be enforced. That leaves companies in a tricky place.

[00:12:17] Stephan Jou: They'll essentially be forced to review the AI policies of their applications, and vendors should hopefully be encouraged, if not forced by regulation, to disclose those AI policies. That includes things like how they train the model, where the data comes from, how uploaded data might be reused, and whether there's bleed from one tenant to another.

[00:12:37] Stephan Jou: I guess the good news is this is a very comfortable position, because it's what vendors and companies should be doing today for data privacy, right? GDPR brought similar regulations where we had to disclose how data is being used. That said, there are some alternatives. You can look at some of these offline models.

[00:12:54] Stephan Jou: Some models are small enough that you can download them, and some of these things can be reproduced on a smaller scale inside the lab. I'm a big fan of Stable Diffusion, for example, the image generation technology that is similar to what's in DALL-E, except instead of consuming it as a cloud-hosted API, you can just download it and run it on your laptop.

[00:13:14] Stephan Jou: You can air-gap it and reassure yourself that nothing's been sent outside. So there are alternatives to a lot of these more impressive technologies. 

[00:13:23] Rob Aragao: So, Stephan, let's kind of delve a little bit deeper into the data side of it, right? And you kind of talked about it a bit and you said fabricated data.

[00:13:31] Rob Aragao: When we look at AI, it truly relies on good data sets. And one of the attack scenarios that we're seeing out there is AI poisoning. So as data is being captured, the attacker is basically trying to get in and maybe modify the data, or inject their own doctored dataset into that model.

[00:13:50] Rob Aragao: And obviously that can lead towards the AI algorithms going in their own direction, right? Incorrect results, biased results, and so on. What are the things that are out there that you're seeing that can help alleviate that type of concern?

[00:14:02] Stephan Jou: Yeah, that's an area of active research actually, so I don't have a perfect answer for you, but I can sort of give you directionally what people are thinking about.

[00:14:11] Stephan Jou: And you're exactly right, Rob. I mean, the way machine learning works is it needs to learn from something, right? So without data there's no model to be built. And of course, as you might expect, garbage in, garbage out, right? If you give it bad data, then you get bad results. Now, this happens

[00:14:30] Stephan Jou: normally, I guess, just as part of the way a model like GPT-3 learns, right? A lot of these large language models read from the internet, and as we all know, stuff on the internet is not exactly high quality, right? Because the internet is made up of humans, and not all humans are perfect.

[00:14:48] Stephan Jou: So why does Codex, the code-writing model, sometimes generate bad code? Well, it's because it's learned from humans that sometimes write bad code. The other problem with a data source like the internet is that it's not filled uniformly, right? There's a demographic skew to content on the internet: it's largely, you know, based on English-speaking countries, and it's largely male.

[00:15:13] Stephan Jou: There's much more male content there, so there's overrepresentation and underrepresentation of different social demographic groups, and that leads to bias. Now, the good news, I guess, is that there are technical and mathematical techniques to detect and quantify bias and try to correct for it. So that's the good news.
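As a rough illustration of what "detect and quantify bias" can mean in practice, here is a small sketch of one common fairness measure, the demographic parity difference, computed over made-up model outputs; real bias audits use richer metrics and real data:

```python
from collections import defaultdict

def demographic_parity_difference(predictions, groups):
    """Gap between the highest and lowest positive-prediction rate across
    demographic groups; 0.0 means every group gets positive outcomes at
    the same rate. Illustrative only."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for pred, group in zip(predictions, groups):
        totals[group] += 1
        positives[group] += int(pred == 1)
    rates = {g: positives[g] / totals[g] for g in totals}
    return max(rates.values()) - min(rates.values()), rates

# Hypothetical classifier outputs with a demographic label per example.
preds  = [1, 1, 0, 1, 1, 0, 0, 1, 1, 0]
groups = ["a", "a", "a", "a", "a", "b", "b", "b", "b", "b"]

gap, rates = demographic_parity_difference(preds, groups)
print(rates)          # {'a': 0.8, 'b': 0.4}
print(round(gap, 2))  # 0.4 -- a large gap is a signal worth investigating and correcting
```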

[00:15:31] Stephan Jou: So an area of active research for everyone in the AI community is both the detection and the correction of bias. And that's better than what happened in the past, right? In the past, you know I think a lot of the challenges of human society and prejudices were just kind of not really recognized or talked about.

[00:15:48] Stephan Jou: But here, because this is essentially math, we also have mathematical techniques that people are developing and actively working on to try to detect these things. Now, that's just sort of unintentional bias and problems. But you asked specifically about AI poisoning, and that's very interesting. AI poisoning is much more intentional, right?

[00:16:07] Stephan Jou: So this is where the adversary intentionally inserts bad data in order to trick the model. There are all kinds of sophisticated examples of AI poisoning that vary from algorithm to algorithm, but I'll give you my favorite simple example. Let's suppose you have a very simple system that is trying to look for unusual amounts of files

[00:16:27] Stephan Jou: copied to a USB key. You can imagine how such a system might be able to learn over time, with enough data from you, how much specifically Rob uses his USB key. And it might say, okay, on average this key copies five files a day. So if you are trying to intentionally poison that model, you can imagine how the adversary might start off by copying 10 files a day, and maybe 20, and 30.

[00:16:55] Stephan Jou: And essentially he's poisoning the model by increasing the threshold that the attacker believes is built into the model. The general idea is, if you know the algorithm, and you know that that's the only algorithm at play, you can apply these sorts of mathematical tricks in order to poison the model.
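To make the "slowly raise the threshold" idea concrete, here is a toy sketch of how an attacker who knows the exact model could drift its baseline upward; the rolling mean-plus-three-standard-deviations detector is an assumption for illustration, not how any real product works:

```python
import statistics

def naive_threshold(history, k=3.0):
    """Toy detector: a day is anomalous if its file count exceeds
    mean + k * stdev of the history it has seen. Purely illustrative."""
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history) or 1.0
    return mean + k * stdev

# The user's genuine baseline: roughly five files a day to a USB key.
history = [5, 4, 6, 5, 5, 6, 4]
print("starting threshold:", round(naive_threshold(history), 1))

# Poisoning phase: knowing the algorithm, the attacker copies just under
# the current threshold every day. Nothing is flagged, but each poisoned
# day is absorbed into the baseline and quietly pushes the threshold up.
for _ in range(30):
    copied = int(naive_threshold(history))  # stay just below the limit
    history.append(copied)                  # the model "learns" the poisoned value

print("threshold after poisoning:", round(naive_threshold(history), 1))
# A bulk copy that would have stood out against the original baseline can
# now slip under the drifted threshold -- but only because this toy system
# has a single, known model, which is exactly the assumption Stephan
# challenges next.
```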

[00:17:14] Stephan Jou: The good news is, this is very difficult to do in practice, quite honestly. A lot of the papers that we see on AI poisoning are much more theoretical than they are practical, right? They say: if you are employing this type of algorithm, then here's something we've fabricated that can defeat this algorithm;

[00:17:33] Stephan Jou: if you're using a clustering algorithm, here's an anti-clustering algorithm that would make you misclassify. But with actual cybersecurity algorithms in actual cybersecurity products, it's a lot more difficult to do this in practice. So, as an example, take what I just talked about for tracking exfiltration using a USB key. In a typical cybersecurity application that has models like this to detect unusual behavior, such as the line that we have at OpenText,

[00:18:01] Stephan Jou: you don't just have a single model that is looking at the number of files that you're copying to your USB key. You have hundreds of models that are all running in parallel, simultaneously looking for different vectors of unusual behavior. So to avoid detection completely, the attacker would essentially have to poison all these models simultaneously.

[00:18:21] Stephan Jou: So maybe you think that we are tracking the file sizes copied to your USB key exclusively. But did you also know that we might be looking at the processes that you're running? Did you also know that we might be tracking the types of IP addresses that you are using to log in? Did you know that we're also tracking the timing between keystroke behaviors, for example?

[00:18:41] Stephan Jou: So the fact that there are hundreds means that in order to completely evade detection, you would have to launch a simultaneous AI poisoning attack, and that would be really, really impractical to do in real life.

[00:18:55] Stan Wisseman: So how does that differ from an attack technique like, you know, prompt injection?

[00:19:01] Stephan Jou: They're technically different techniques, although they do share a similar goal. Both prompt injection and AI poisoning are the intentionally malicious insertion of some sort of data to make the system do something harmful, right? But we need to take a step back and talk about prompt engineering specifically.

[00:19:19] Stephan Jou: So prompts are the text that you type into the text box to tell ChatGPT to do something, right? Those are called prompts. And because of systems like ChatGPT, we now have this entire area called prompt engineering, where there are essentially tips and techniques to be able to get

[00:19:41] Stephan Jou: the chatbot system to do what you want it to do. So here's a certain phrasing that you can use to make your GPT system talk like a pirate, or here's a series of prompts with which you can get the GPT system to drop the filters that it normally has and make it, you know, swear more, right?

[00:20:01] Stephan Jou: And so there are all these sorts of things that you can do to make unintended behavior happen. Prompt injection, then, is essentially sending prompts to the ChatGPT system, or an equivalent, without the knowledge of the user who is also interacting with that chat system.

[00:20:22] Stephan Jou: The examples are pretty theoretical, but they have shown that you could do something like embed Bing chat, which uses a similar chat system to ChatGPT from OpenAI, into a website, but you've embedded these prompts so that when the user subsequently uses that chat system, it's already been fed guiding information to give malicious answers, or to try and steal credit card information, or to

[00:20:49] Stephan Jou: talk about something that it shouldn't really be talking about. 

[00:20:54] Rob Aragao: I want to go back to something you brought up earlier. So we all know the kind of craze around ChatGPT exploded just at the turn of the new year. And so, as you said, over a hundred million people jumping in.

[00:21:10] Rob Aragao: Well, as they're engaging with the platform, they may be supplying some of their own personal information. So it's capturing all this personal information that could potentially be attributed to you as an individual.

[00:21:27] Rob Aragao: And then the other thing you mentioned was, each organization has to think about and should be applying their own policies or guardrails and how that type of information will be kind of contained, shared, and whatnot for AI purposes.

[00:21:41] Rob Aragao: But there's this potential issue around privacy and regulation specifically, right? So connecting those two things together, what are the concerns with that captured personal data? And then, what are your viewpoints and what are you seeing

[00:21:55] Rob Aragao: around regulation. There's been some chatter, but you know, obviously it's still very early on. 

[00:22:00] Stephan Jou: So let's start with personal data and how that's handled. I guess I'll tell you a story first. When I first used ChatGPT, the first thing I did was ask it a question about a health condition that my wife was going through.

[00:22:19] Stephan Jou: And so I went in and I said, my wife suffers from this, what are the symptoms and what can I do about it? And the first thing ChatGPT said, which was amazing, was that it was sorry to hear about my wife. So it actually apologized and then gave the answer.

[00:22:36] Stephan Jou: And I thought that was just a fantastic example of the kind of emergent human-like behaviors; it was actually really, really interesting. So what did ChatGPT do with the personal data that I uploaded? I now know that it recorded some of that information, and according to OpenAI it went in to possibly be used for model improvement, right?

[00:23:02] Stephan Jou: And that's known, but to be honest, it's probably no different than doing my searches in Google today versus doing my searches in DuckDuckGo. Just like if I entered information asking about the treatment for condition X in Google, I know that Google is storing some of that information, at least at a metadata level.

[00:23:23] Stephan Jou: But I know if I enter that same search query into DuckDuckGo, it's not. So different tools handle personal data differently. OpenAI's policies do make it clear that they store and use all your prompts, but the same is not true of other systems. For example, if you go to Perplexity AI, you get a lot of the same benefit as Bing chat and ChatGPT,

[00:23:47] Stephan Jou: but without the policies related to storing that personal data; there essentially is no history there. So unfortunately it goes back to understanding the policies of the company behind the system that you're interacting with. And to make it even trickier,

[00:24:03] Stephan Jou: these policies change all the time. 

[00:24:06] Rob Aragao: Well, Stephan, thanks again for coming on. It will not be the last time we talk about this topic, and I'm sure we'll have you back on, because this is just moving very quickly, which is great to see. And again, given the passion you've shown for the topic, and knowing your background,

[00:24:18] Rob Aragao: obviously this will continue to evolve, and we're going to have much more conversation on the topic relative to, of course, the implications for cybersecurity. So thanks again. 

[00:24:27] Stephan Jou: Absolutely, anytime. 

[00:24:29] Stan Wisseman: Hey, thanks Stephan.

[00:24:33] Rob Aragao: Thanks for listening to the Reimagining Cyber Podcast. We hope you enjoyed this episode. If you would like to have us cover a specific topic of interest, feel free to reach out to us 

[00:24:42] Rob Aragao: and you can find out how in the show notes. 

[00:24:45] Producer Ben: Hello, producer Ben here, and before you move on to another podcast on your playlist, I have a suggestion for you.

[00:24:53] Producer Ben: Why not stay here and listen to even more from Rob and Stan? You've just listened to episode 62 of Reimagining Cyber, which means there are plenty of others for you to devour. For example, back in episode eight, the guys spoke to Jeremy Epstein, Lead Program Officer with the National Science Foundation.

[00:25:14] Producer Ben: He shared the importance of sociotechnics and sociotechnical research. And how it can be used to improve one's cybersecurity landscape. 

[00:25:24] Jeremy Epstein: There are just cybersecurity challenges everywhere, whether it's your coffee pot or your car. All of these are sociotechnical problems. They're not just purely scientific problems; they

[00:25:37] Jeremy Epstein: cut across computer science and social sciences like criminology, sociology, geriatrics, et cetera, et cetera. There are many different areas, and it's pretty hard to find anything that isn't affected by cybersecurity. 

[00:25:51] Producer Ben: Jeremy Epstein there from the National Science Foundation. Finally then, you know the drill.

[00:25:57] Producer Ben: We'd love you to listen, rate, review, subscribe, follow, and share with your friends and other folks in the cyber industry. Thanks so much for listening.