PrivacyLabs Compliance Technology Podcast

Privacy Preserving Technologies with Patricia Thaine

June 10, 2021 Paul
Transcript
Paul Starrett:

Hello, and welcome to our podcast today, sponsored by PrivacyLabs. My name is Paul Starrett, the founder of PrivacyLabs. Remember, PrivacyLabs is one word. We have the pleasure of having Patricia Thaine here today, and I'm very honored to have her; her background is essentially unrivaled in her area. What we're going to do today is try to bring the perspective of what privacy preservation means. That's a fairly broad topic technically, and its application to the types of data and the ways in which it can bring a solution to your compliance needs is quite broad. So we're going to try and wrangle that first and then get down into some specifics. With that said, Patricia, if you could tell us about yourself and about your company, then we'll get into the first question.

Patricia Thaine:

That was great. Paul, thank you so much for having me; it's an honor to be here. I am a PhD candidate at the University of Toronto. My research is on privacy-preserving natural language and spoken language processing. And I am the co-founder and CEO of a company called Private AI, where we make it super easy for developers to integrate privacy into their software pipelines.

Paul Starrett:

Got it. Great. And just so people know, your company is at private-ai.ca, that's private, hyphen, AI, dot CA, because you're in Canada, Toronto-based, if I'm not mistaken. So I think the first question really is: there are many different technical solutions to preserving privacy, and I wanted you to give us a sense of the different buckets, the different areas, and how they are sort of distinguishable.

Patricia Thaine:

Absolutely. And I actually have a privacy-preserving technologies decision tree that you can take a look at on the Private AI website. It can get quite confusing because, one, you have to understand the technology, you have to understand the use cases, and you have to understand where they fit in, depending on which regulations you want to comply with, for example. So you'll often hear about homomorphic encryption, secure multi-party computation, anonymization, synthetic data generation. And in a lot of cases, you might hear these technologies being pitted against one another. But really, in the best of all worlds, these are all complementary. These aren't one or the other. You can't solve everything with homomorphic encryption, you can't solve everything with anonymization, you can't solve everything with synthetic data generation.

So if you're looking at homomorphic encryption, for example, what might be some good use cases? If you want to compute something in the cloud, and it's personal data that you're ingesting from users, but it's a fairly repeatable computation. Or you could search through a database to find the result of a query, for example, and then send it back to the devices on the edge. Homomorphic encryption allows you to add and multiply numbers that are encrypted, while they're still in their encrypted form, and then the decryption of the output can happen just on the user side. So that is a really cool technology. It has limited use cases, but those limited use cases mean that if you do want to use this technology, it tends to be for things like credit card numbers, for really, really sensitive information, where you're okay taking an extra amount of processing time in order to do this, because the downside of having that information leaked is so huge that you need to prevent it in any way possible.

In terms of things like differential privacy: differential privacy is used, in various cases, if you're talking about structured data, or querying a structured database. Differential privacy can be used at the output of a query to a database. So you add a little bit of noise: if you query how many people in this database have, you know, green hair and smoke, you'll get an approximate result. That means you don't know whether there was one person, no people, or ten people, but what you will get is a result that allows you to make generalizations about a population. The noise that's added shouldn't be so much that you can't make generalizations like, you know, smoking causes cancer, because you'll see that that's something a lot of the population has in common when they smoke. But you wouldn't be able to figure out that somebody specifically has a rare disease and is part of that data set.

Differential privacy is also used for training machine learning models. There's this great paper by Carlini et al. called The Secret Sharer, where they hide a social security number within the Penn Treebank data set. And they show how language models, specifically character language models in this case, memorize the digits of that social security number, even though it only appears once. So that, as well as other examples that have been brought up of language models spewing out personal data, showed the need for either, one, using differential privacy when training the models, which adds a little bit of noise to the training.
Paul Starrett:

So you don't memorize the specific information, but rather the overarching patterns that you've seen multiple times?

Patricia Thaine:

Yes. Or, two, what we've observed at Private AI is that you can also redact the sensitive information or pseudonymise it. And if it's not there in the first place, you can't memorize it.
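To make the noisy-query idea from earlier in the answer concrete, here is a minimal sketch of the Laplace mechanism applied to a single count query. The records, the predicate, and the epsilon value are invented for illustration; a real deployment would also track the cumulative privacy budget across queries.

```python
import numpy as np

def noisy_count(records, predicate, epsilon=0.5):
    """Differentially private count via the Laplace mechanism.

    A counting query has sensitivity 1 (one person joining or leaving the
    data set changes the true count by at most 1), so Laplace noise with
    scale 1/epsilon gives epsilon-differential privacy for this one query.
    """
    true_count = sum(1 for record in records if predicate(record))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Invented example: "how many people in this database have green hair and smoke?"
people = [
    {"hair": "green", "smokes": True},
    {"hair": "brown", "smokes": True},
    {"hair": "green", "smokes": False},
]
print(noisy_count(people, lambda p: p["hair"] == "green" and p["smokes"]))
```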

Paul Starrett:

I see. Interesting. Um, does that cover most of the various areas, or are there more, like secure multi-party computation? Yes? Okay. And I can sense a reluctance to go there, but I'd definitely go there. There are quite a few technologies, but I'm happy to cover them all. Okay, I think I've heard of that one, and then federated learning. Maybe if we could touch on those; I think it is a good idea to kind of keep this, you know, focused. The problem is, I'm technical, and I get a lot of where you're going, but I think that's probably going to be a bit much. So yeah, maybe secure multi-party computation, and then federated learning, maybe.

Patricia Thaine:

Yeah, so secure multi-party computation: the idea there is that you have two or more parties; if it's two parties, it's secure two-party computation. These parties don't want to show each other their inputs, but they want to share an output. One example of secure multi-party computation in use is genomic data analysis: where hospitals can't share raw data with one another, they can use secure two-party or multi-party computation to ultimately get that shared result. The way it often works is using what's called garbled circuits. These are circuits where you can't understand what they did or what the inputs were, but you can understand the output. And you have to frame the inputs in such a way that, when you put them into the circuit, they combine nicely to make a sensible output.

This is used in combination with federated learning for machine learning. Federated learning is about training your machine learning models directly on the devices where the information is being produced. So, for example, Google uses federated learning for Gboard, and Apple uses it for emoji prediction. What you want to avoid when you're using federated learning is for the resulting weights of the model, the updates that you're making to the models, to reveal anything about what kind of information they were trained on. The way to do that is by combining the models from multiple users using secure multi-party computation. And the secure multi-party computation protocols that are available for this, in some cases, have a set minimum of users that need to put in their models for the output to make any sense. So that prevents someone from saying, let's just get these two users' models, get the output, and then try to figure out what the updates were.
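As a toy illustration of the combining step described here, the sketch below shows additive masking, the basic trick behind secure aggregation for federated learning. It is not any particular production protocol (real ones add key agreement, dropout handling, and a minimum participant count); the client count, update sizes, and masks are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
num_clients, dim = 3, 4

# Each client's true (private) model update, invented for the example.
true_updates = [rng.normal(size=dim) for _ in range(num_clients)]

# Every pair of clients agrees on a shared random mask (in practice derived
# from a key exchange); the lower-indexed client adds it, the other subtracts it.
pair_masks = {(i, j): rng.normal(size=dim)
              for i in range(num_clients) for j in range(i + 1, num_clients)}

def masked_update(i):
    """What client i sends to the server: its update plus masks that cancel in the sum."""
    masked = true_updates[i].copy()
    for (a, b), mask in pair_masks.items():
        if a == i:
            masked = masked + mask
        elif b == i:
            masked = masked - mask
    return masked

# The server only ever sees masked updates...
server_view = [masked_update(i) for i in range(num_clients)]
# ...but their sum equals the sum of the true updates, because the masks cancel,
# so the aggregate model update is recovered without exposing any single client's.
assert np.allclose(sum(server_view), sum(true_updates))
print(sum(server_view))
```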

Paul Starrett:

I see, I see. Wow, that's a lot, all these different technologies. But it sounds like, as you said, each one is useful in a different context. So, for example, differential privacy is for machine learning, maybe synthetic data and such; federated learning and secure multi-party computation are for when you have multiple inputs from different areas; and then, let me see if I can remember what we discussed, because there's so much, the idea with homomorphic encryption is that you're sort of querying a database on an ongoing basis, in a repetitive way.

Patricia Thaine:

Yeah, that's one example. So if you have a database of social security numbers, for example, and you want to see whether there's a match with one social security number that's input. But you can also do some light machine learning using homomorphic encryption, though that's still a very early area of research.
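To make the encrypted-match example concrete, here is a rough sketch using the open-source python-paillier (`phe`) library, which implements an additively homomorphic scheme. The social security numbers are fake, and a production protocol would need more care around blinding, encoding, and key management; this only shows the add-and-multiply-while-encrypted idea.

```python
from phe import paillier  # pip install phe
import secrets

# Client side: generate keys and encrypt the query value.
public_key, private_key = paillier.generate_paillier_keypair()
query_ssn = 123456789                       # fake value for illustration
enc_query = public_key.encrypt(query_ssn)

# Server side: holds plaintext records and never sees the query in the clear.
database = [111223333, 123456789, 987654321]

def blinded_differences(enc_q, records):
    """Return r * (query - record) for each record, still encrypted.

    Paillier lets the server subtract a plaintext from a ciphertext and
    multiply a ciphertext by a plaintext scalar. The random blinding factor r
    hides the size of the difference; only "zero or not" survives decryption.
    """
    results = []
    for ssn in records:
        r = secrets.randbelow(2**32) + 1
        results.append((enc_q - ssn) * r)
    return results

# Client side: decrypt and check for a zero, which indicates a match.
matches = [private_key.decrypt(c) == 0
           for c in blinded_differences(enc_query, database)]
print(matches)  # expected: [False, True, False]
```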

Paul Starrett:

I see. I see. So it really depends on the context, and you mix and match these things in order to bring yourself to a place that is optimal for the balance. That brings me to my next question, and that is what I've heard referred to as a privacy budget. There is this sort of balance between the usefulness of the underlying original data, where the private, sensitive information is, and the obfuscation, to coin my own phrase, of that data, so that it is compliant with whatever law you have and whatever infrastructure or workflows you have going on. I want to say it's subjective, but it is very difficult to pin down a metric that says how much to take out of the data to make it privacy preserving, compared against how much of the insight the data provides is lost in that process. What are your thoughts there? I think one of the things the listeners would really be keen to know is, is it fair to say that there's always a possible net gain with the use of these technologies? Can we always say, for any workflow or infrastructure or technical topology, yes, this is always worth doing, we will always find the right place to put the needle? How would you approach that thought, if it made sense to you?

Patricia Thaine:

Totally makes sense. It really, once again, depends on the use case. Where to place the needle is still a research topic, and will continue to be for many years on end, on a case-by-case basis.

Paul Starrett:

Yes, and actually, that's a fantastic segue for us.

Patricia Thaine:

One new thing I've seen come up is, for example, this work by Khaled El Emam, who does synthetic data generation of unstructured data through Replica Analytics. They have worked on, once you figure out where to place the needle, how can you know what the re-identification risk is of the data that you're creating? They've combined re-identification risk metrics from anonymization methods with synthetic data generation to get a clear idea of whether the data produced is actually privacy preserving. So that gives you an even better idea of whether you're placing the needle in the right spot.

And then in other cases, you could just make the data privacy preserving and not lose anything; it really depends on the use case. This is particularly visible for unstructured data: images, video, text. If you want to tell how well a conversation went, you don't need any information about the participants; you just need, you know, the positive and negative words that are present in the conversation. So you can remove names, you can remove social security numbers, ages, and none of that is giving you any extra information for your task. And for images, there are a lot of tasks that don't even require people to be in them, so those are already anonymous by default, things like temperature analysis or relief analysis. And if you just remove the people from, for example, a video of streets, and you want to compute just how many people are on the street, rather than extract information about an individual, you're already getting a certain amount of useful information, like how busy is it, is there anything funny going on? And you don't need any personal data to figure that out. So I think a lot of the dialogue that's been happening about utility versus privacy has been on structured data, where you do have the set columns with the very specific information that you're trying to get, like pharmaceutical trials, and very little has focused on what that means for unstructured data.

Paul Starrett:

Before we move on, just for a second: would you agree that there's an aspect of the needle, again, of privacy preservation versus the utility of the underlying data, that might be a risk question for the lawyers or the compliance folks? They might say, well, here's our budget, here's our risk, here's our commercial purpose. You know, there are some end goals; machine learning, or what have you, has some commercial value. So with a set of your statements, there's some question for the compliance steering wheel.

Patricia Thaine:

Yeah, I think what we need to do as a research community is make that needle more clear: what exactly does it mean when you have a certain amount of privacy budget? I do see some work headed in that direction. But really, it's about understanding the data that's produced afterwards, and that's something that does not have very much research behind it yet.

Paul Starrett:

I see. Interesting. So that is an area of challenge, for sure. Got it. That's very helpful. Great. So I guess we can go then into the data, which really gets right down to the heart of what your firm does: the type of data really drives a lot of this. So you have your tabular data; if our audience doesn't know, that's like a spreadsheet, columns and rows. They tend to hold simple numbers or simple values, like a name, which would be called categorical. It may have text in it too, but it's generally not as much of a challenge. But when you get into emails, research documents, and other natural freeform text, and you get into audio and some of these other areas, that's really a whole different can of worms, if you will. So if you can just tell us, and this is really a place for you to discuss what your company does, because I would recommend, as you alluded to, the flowchart and decision tree you have at private-ai.ca, not only for the app but just to see how these things are handled. Maybe just cover the different types of data and the challenges you're finding for each of those, especially text, because that's an area where I have some background, in NLP and such.

Patricia Thaine:

Sounds great. Actually, we do focus more on text than we do on images and video, so I can definitely focus on what we're doing with regards to text. For text, what we're doing is redacting direct identifiers and quasi-identifiers, and also doing automatic pseudonymisation. So redacting things like names, ages, approximate locations, exact locations, credit card numbers; we have a huge list of information that we can redact. And in addition to that, we also have the ability to pseudonymise based on information from the context. So it's not a table where you take names and then replace them, which becomes really predictable and also doesn't work well with the context. Pseudonymisation itself, when done properly and context-based, one, decreases the re-identification risk of an individual, because if anything was missed by the de-identification system, it's very difficult to tell what the original data was versus the fake data, because of how natural it looks. And number two, it also prevents lowering machine learning model accuracy for downstream models. So if you're doing sentiment analysis, named entity recognition, or other tasks with data that's de-identified, and it's actually de-identified through pseudonymisation, then your models are better suited for that task because the text looks more natural.

Paul Starrett:

I see. So the audience knows, pseudonymisation is a term that basically means you're pseudo-anonymizing; it's not complete anonymization, but it's some steps in the middle. But with freeform text, natural language processing, and the inclusion of search engines and things like that, the freeform text is so diverse that I think it's difficult to glean the underlying information.

Patricia Thaine:

Yeah, so it really depends on the kind of text that you're dealing with. And to your point about pseudonymisation, there is this misconception that replacing the information with other names, locations, and so on is less secure than redacting it. Part of that comes from the GDPR saying that pseudonymisation is less secure than anonymization. But if you look more deeply into it, the GDPR actually defines pseudonymisation as there being a link between the information used for replacement and the original data, and that link is what counts as making it less secure. So if you remove that link, it becomes much harder to re-identify an individual.
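As a toy sketch of the distinction drawn here, the example below pseudonymises names without keeping any link back to the originals. The regex-based replacement, the surrogate name list, and the sample sentence are all invented; real systems, including Private AI's, detect entities with far more sophisticated models than a hand-written list.

```python
import re
import secrets

# Invented surrogate pool; a real system generates context-appropriate
# replacements rather than drawing from a tiny fixed list.
FAKE_NAMES = ["Alex Morgan", "Jordan Lee", "Sam Rivera"]

def pseudonymise(text, names_to_replace):
    """Replace detected names with random surrogates, keeping NO mapping.

    Because the original-to-surrogate link is discarded, the output cannot
    be joined back to the source records, which is the property that makes
    link-free pseudonymisation harder to re-identify.
    """
    for name in names_to_replace:
        surrogate = secrets.choice(FAKE_NAMES)
        text = re.sub(re.escape(name), surrogate, text)
    return text

note = "Patricia called John Smith about the claim John Smith filed last week."
print(pseudonymise(note, ["John Smith", "Patricia"]))
# e.g. "Sam Rivera called Alex Morgan about the claim Alex Morgan filed last week."
```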

Paul Starrett:

I see. I see. Got it. Thank you for clarifying that, because I think that's part of what you were talking about earlier, when you said, how do we determine what the needle is actually doing, before you can have a risk professional say, yes, I bless this as compliant. Got it. I think the last few questions will focus on where you see the law going. I mean, this is almost rhetorical, but it seems to me as though there's a real move towards being more strict, and you're seeing more of this spread across different jurisdictions. So we're getting an onslaught, and you're probably a little more familiar with this than I am: in the United States, the individual states are now starting to come out with laws, and CCPA is now CPRA, which is even more ominous, or onerous, I should say. What's your sense of where the law is going? I would imagine that would bode well for our various firms. Where do you see that going? What's your sense?

Patricia Thaine:

Well, we are seeing more and more laws that take up the GDPR standard and then customize it for their location. And we are seeing more companies that just take the GDPR as the standard regulation they comply with, and then adapt their practices from place to place. Privacy by design and privacy by default are both part of the GDPR, part of the LGPD in Brazil, and part of a bunch of regulations that are popping up. And I think that as lawyers figure out what these different technologies can do, and where the appropriate scenarios are for these different technologies, we might get more and more refined laws. Or it might move more towards experts having to validate whether or not you're doing the right thing. So it's really up in the air where it's going to go. But what we are seeing is that grace periods are finishing for a lot of these laws; the GDPR is going to be enforced more severely in the coming years, and regulatory bodies are being set up that will keep closer track of which companies are doing the right thing. So it will be really curious to see, ultimately, how this is going to affect government budgets for privacy regulation, how it's going to affect privacy research, and how much more privacy research budget there is going to be. I'm really curious about how it's going to go.

Paul Starrett:

Yes, and I would agree, because I think part of this is educating, as I've alluded to, the lawyers. A lot of lawyers, and I am one myself, though my license is inactive right now, I don't use it, but I can go back anytime I want, generally chose liberal studies as a direction because they're self-professed technophobes. So getting them up to speed in a way that they can make an informed decision, I think, is going to be one of the challenges, because the technologies we've discussed are, even at a superficial level, still somewhat difficult to grasp, and it's no one's fault; it's just the nature of the beast. Let's see. So I guess the last question I might put to you here is this: there's a certain cost to do research and development, and there's a certain cost to build the infrastructure for testing and for production. For the audience, production means what actually goes into the enterprise infrastructure to help them, in their real-world systems, maintain privacy preservation. There's a cost to all that, and that's going to somehow guide how well an enterprise can operate in whatever it wants to do commercially. So there's a certain barrier to those purposes. Do you think that these technologies will remain cost effective, in general, for the purposes they're meant for?

Patricia Thaine:

Yeah, that's a good question, Paul. There's normally a barrier in cost, but also a barrier in education. There really aren't that many privacy courses in technical colleges or in computer science departments at the moment; that is growing in quantity, but there's still quite a bit of an education barrier when it comes to figuring out which privacy technologies to integrate and how to integrate them properly. So there is that huge bit of onerousness that companies need to deal with, where it's very hard to find the talent and very hard to figure out when to integrate different privacy technologies. What I've noticed a lot is that data protection officers in some companies get empowered to go and talk to the technical teams and get into the nitty-gritty. But in other cases, they're the people the technical teams go to for a yes or no on a particular problem, and then don't work with to find a solution that gets to a yes. So there's that barrier in education, cost, and internal politics, all of which need to be solved in order to get to the next level. What we're trying to do at Private AI is make it a bit more cost effective for companies to do this, and also more secure than having your engineers, with no privacy training, working on privacy problems. What I see happening is that there is more open source software going out there that some software engineers can take advantage of, and in a lot of cases there are more privacy companies around to help guide you. But there's still this really big gap when it comes to open source software, or to easily integrating privacy into your software pipeline, and that's really where the market needs to move.

Paul Starrett:

Interesting. Interesting. Great, I think that's a good answer, because I think that's really where people are going: what is privacy preservation, and where is the solution. Frankly, I think it's overall very much in that game; I think it's a hockey stick, if you will, a kind of floodgate that's going to happen. I always give every person I have a podcast with the opportunity to say anything they want to the audience that we haven't discussed, or to emphasize anything you think might be important for them to know. Take your time; we'd love to let you take the time to come up with whatever is worthwhile.

Patricia Thaine:

I'd like to point out that the work Twilio is doing for privacy is really spectacular. I'm a huge fan of their data protection officer, Sheila Jambekar. We did a webinar with her a couple of weeks ago at Private AI; it's on YouTube. You can take a look at that to see how they're dealing with privacy internally within the organization. I think they're quite a leap ahead of most companies in how their privacy team interacts with their engineers, and it's definitely worth taking a look at.

Paul Starrett:

Interesting. That's Twilio. Twilio? That's right.

Patricia Thaine:

Her name is Sheila Jambekar.

Paul Starrett:

Do you mind spelling that real quick? Yes.

Patricia Thaine:

I just want to make sure that I'm getting Sheila Jambekar and Twilio.

Paul Starrett:

Got it. That's really good, because that really gives people a good sort of bellwether, if you will. I'm an organizer of the SF Python Meetup group here in San Francisco, and one day we had Twilio come in and we did some Python programming in their environment, and I was impressed. A very good crew, very into education; I noticed they're very big on that. Listen, I'm going to close out here with a little bit of a discussion of PrivacyLabs, and then we can let you go and finish up. What we do is help unify all these different moving parts: the synthetic data, the homomorphic encryption, the cybersecurity, the cloud, on-premise, all of those things. One of the things we do is also perform audits of artificial intelligence and workflow automation, along with the holistic aspect of what's happening. So I could definitely see from what you said today how, if you pull on one thing, it tugs at many different things, and you really have to consider those things. So I think I will end with getting in touch with you: would it make sense for people to go to the website and contact you that way, or do you have an email address you'd like to share?

Patricia Thaine:

Feel free to email me. My email address is patricia@private-ai.ca.

Paul Starrett:

Got it. And my mother's name is Patricia, so that's great. Okay, and I'm Paul Starrett of PrivacyLabs. You're already on our website if you're listening to this. So, thank you again, Patricia. I really appreciate it; this was very useful and helpful, and hopefully we'll have you on another one soon.

Patricia Thaine:

Would love that. Thank you so much, Paul. Really, Thank you.