Data Democratization Podcast: Stories about AI, Data, and Privacy

52. Synthetic Data as a Strategic Enterprise Data Asset with AWS' Faris Haddad

Alexandra Ebert (MOSTLY AI) Season 5 Episode 52

Welcome back to season 5 of the Data Democratization Podcast - and our first-ever live studio recording.

In this episode, Alexandra Ebert sits down with Faris Haddad, AWS' Global AI Technical Strategy Lead. Together, they delve into the transformative role of synthetic data in modern enterprises.  

Faris shares his journey into the world of synthetic data, highlighting its evolution from a niche solution to a cornerstone of enterprise data strategy. He discusses the challenges organizations face with data silos, legacy systems, and privacy concerns, and how synthetic data offers a pathway to overcome these hurdles.  

Whether you're grappling with data accessibility issues or exploring innovative ways to leverage your organization's data assets, this episode offers valuable perspectives on integrating synthetic data into your enterprise data strategy. 

[00:00:00] Alexandra Ebert: Hello and welcome to a very special episode of the Data Democratization Podcast. I'm Alexandra Ebert, your host and MOSTLY AI's Chief AI and Data Democratization Officer. And today's episode is the very first episode that we are actually recording live in the studio. Today's guest is none other than Faris Haddad, who is the Global AI Technical Strategy Lead for AWS. Faris, it's such a pleasure to have you here in London today. Welcome to the show.

[00:00:25] Faris Haddad: Thank you. Really appreciate it, and what a great inaugural talk we're having together, live.

[00:00:30] Alexandra: Absolutely. Before we dive into today's conversation about synthetic data and its emerging role as a new standard data asset within any modern enterprise's data strategy, can you briefly let our listeners know what the job of a global AI technical strategy lead involves, and what makes you passionate about your daily work?

[00:00:47] Faris: Yeah, it's one of those roles that you kind of evolve into over years. As you can tell, I've been doing this for a few years. It's the intersection of how an organization can transform itself with data and, of course, AI, and the strategic approach to doing so. And it's not just a technology discussion very often. It's the way you help your employees, for example, to be able to understand, use, and optimize the way they use AI, and the way the organization has to change alongside them. Because very often, as soon as you say the word transformation, it does mean you're breaking the model of how you work now and reinventing something, which is quite painful for some organizations, because they're very rigid in the way they've been constructed, let's say, over the last 50 years sometimes.


So my role is to go into organizations, and while it often does become mainly a technology discussion, the precursor to that is to understand the business pain points and what they're really trying to accomplish going forward. What are the aspirations? What are they trying to gain more insights about, for example, particular operations they run, or some part of the business? And I help them understand how they could do that. So it's a very interesting job, and that's why I'm still in it and loving every day.


[00:02:22] Alexandra: I can imagine no day will be like the other. Now, having this conversation not only about which technology is out there but about how to actually put it into place to make it work for your objectives is, I think, a crucial success factor, because there are so many organizations who are just following the buzzwords, following the hype, investing in some technology, but not connecting it to the business cases and the use cases. And therefore I'm so curious about the insights you're going to share with us. As mentioned, we want to talk specifically about data democratization, synthetic data, and why there is a need for this new standard, privacy-safe data asset within a modern enterprise's data strategy. You have now been in the space of synthetic data for several years, and I'm curious, because we talked about this off the record quite a lot: how did the role of synthetic data evolve when you think back three or four years, versus what you're observing within enterprises or public sector organizations today?

[00:03:12] Faris: The way I became interested in synthetic data really derived from the pain points and friction points I could see with organizations having access to the main data they had internally, and there were two reasons for that. One was that the data was stuck in silos or legacy applications, which were very hard to interact with or pull stuff out of. Or they didn't have the mechanisms to share data, to, I use the word, liberate the data from these silos, or to share it securely amongst a wider audience in the organization. Because, for example, it could be regulated, or contain very sensitive data, or there could be IP in there that they don't want to reveal, or anything like that.


And this kind of noise I was coming across became really louder and louder: organizations where so many projects were impacted by the inability of an organization to freely use its own data. It would stop, for example, a new piece of software being developed, or some new AI project being taken on board to change the organization for the better, because they don't have enough data. Or even if they're working with external partners, like I work with Accenture a lot, where they want one of these GSIs to come and help them transform, as I mentioned earlier, the number one friction point was always: "Ah, but we can't give you any of our real-world data to be able to even build a POC for us."


So you can see this chicken-and-egg situation was always coming up. That's, I think, the earliest stage when I really became interested in a better way to liberate this data. Because I know, and we've discussed this before, data is the lifeblood of an organization. Used well, it's your main differentiator in the market; and if you don't use it well, it's one of your biggest assets going to waste. And that's the heart of it.

[00:05:18] Alexandra: So you mentioned that back in the day it was more a specialist solution for case-by-case problems, maybe used by some data scientists. When you talk with executives around the world now, from all the different industries, financial services, telco, how are they seeing synthetic data in this bigger picture of their data strategy?

[00:05:36] Faris: Yeah, it's been fascinating how these conversations have evolved, because earlier on, as you mentioned, as soon as you talked about synthetic data, the first assumption that customers, or anybody you talked to, would have was to call it fake data. And that has been around for years: creating generic mock data that simulates something closely. The problem with that was it's never good enough, because when you just create very generic data, it might have the same columns, let's imagine it just as a table for now, a spreadsheet table.


There's so much nuance in your data which is lost if you just create a generic version of it: a lot of what we call statistical correlations, the way the data points form particular patterns in the organization. And that's where all the real magic is in a company's data. So that was the early days of these discussions. Then, as the requests became more sophisticated in what they were trying to do, people started understanding that the techniques to create synthetic data have evolved and advanced so much. These generative-AI-type models predate large language models, so people suddenly think it's a new thing, but this has been going on for much longer than language models. And they're so efficient at creating synthetic data twins, as I call them, because when you generate a synthetic set, it matches the original data very closely.


The only difference is it's privacy preserving. So you haven't got any PII in there, or anything else that you might deem as sensitive. And it's not always the obvious things. It's not always just credit card numbers or addresses. There are sometimes quite nuanced data points in there which could cause harm if you combine them with some other data sets, so you've got to be really clear about that. As soon as you start demonstrating to organizations and customers how well you can recreate this original data of theirs, it changes the dynamic of the conversation. That's the shift I've been seeing over the last few years; especially the last year, it's been quite dramatic.
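To make the "statistical correlations" point concrete, here is a minimal, illustrative sketch. It uses the open-source SDV library purely as a stand-in for the class of generative tabular models discussed in the episode; the columns and toy data are invented for the example. It contrasts a trained synthesizer with naive mock data:

```python
import numpy as np
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

rng = np.random.default_rng(0)
age = rng.integers(18, 80, size=5_000)
# Income correlated with age: the kind of nuance naive mock data loses.
income = 1_000 * age + rng.normal(0, 10_000, size=5_000)
real = pd.DataFrame({"age": age, "income": income})

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data=real)
synth = GaussianCopulaSynthesizer(metadata)
synth.fit(real)
twin = synth.sample(num_rows=len(real))  # the "synthetic data twin"

# Naive mock data: right columns and ranges, but columns drawn independently.
mock = pd.DataFrame({"age": rng.permutation(age),
                     "income": rng.permutation(income)})

print("real corr:", real.corr().loc["age", "income"].round(2))  # strong
print("twin corr:", twin.corr().loc["age", "income"].round(2))  # preserved
print("mock corr:", mock.corr().loc["age", "income"].round(2))  # near zero, lost
```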

[00:08:06] Alexandra: Absolutely. Because you just mentioned large language models, I think for the sake of our debate we should also define the type of synthetic data we talk about. Obviously, we collaborate a lot, so we know which type we mean, but we're mainly focusing here on privacy-preserving, AI-generated data in the structured domains: tabular data, time-series data; think healthcare records, financial transactions. Yes, of course, there are also many other types of synthetic data, from privacy-preserving unstructured data, to the mock data that Faris alluded to, to even the type of synthetic data that everyone knows nowadays when we think of synthetic images that you can get by using ChatGPT and the like. But the important point, I think, from an enterprise perspective, when you really think about how you can make use of your data assets in a strategic manner, is: how can you get these synthetic data twins, these replicas that are statistically so highly representative, without infringing privacy? And this, of course, requires that there are data assets already in existence.


So we are not alluding to the synthetic data technologies which are more in the bucket of creating data out of thin air. Most large enterprises and public sector organizations have mountains of data. It's just, as Faris said, that it's locked in silos and not available to the larger organization.

[00:09:23] Faris: Yeah, and I think that's a really important distinction. So let's imagine a company is trying to, say, test a new software system they've just developed. Now, normally they would not give their developers the real production data to go and test the system with, because there tends to be sensitive data in there. So traditionally they would just make some random mock data, put it through the system, and, yeah, everything is fine.


What tends to happen very frequently, and that's why people have changed the way they approach this, is that when you put the real data through, let's say it passed all the tests you ran originally, but the real data has data points which you did not have in the mock data you created. And these outliers, as we call them, are the ones that might trip up the system, because you did not develop any way to handle those things in testing. So that's one scenario. Then you've got another scenario where, let's say, you want to train an AI model and you don't have enough data for that. This is very often what kills off so many of these potential AI projects. The organization might be very, very interested in exploring the space, particularly recently with all the hype around generative AI and agentic stuff and everything else, which we'll come back to in a bit. But AI models need enough representative data to be able to pick up the patterns in the data. For example, I work a lot with financial institutions; let's say they're doing a fraud detection case. Now, by their nature, those are very anomalous...

[00:11:04] Alexandra: transactions. So because otherwise that would be quite complicated.

[00:11:09] Faris: They're so rare that it's very hard to train a model well enough for it to detect these. So another use case which is really powerful with synthetic data is not just to create a mirror image of the original data, but to actually boost a particular subset of that data to balance your data, exactly. We call that the data augmentation use case. You can, for example, create 100x the anomalous transactions, and then you can add that to the real data. So it's not always about trying to create a fully synthetic data set; this is a semi-synthetic version, and you can do things like that.
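A minimal sketch of the augmentation workflow Faris describes, boosting the rare fraud class with conditionally generated synthetic rows. Again, open-source SDV stands in for the generator, and the column names and rates are invented:

```python
import numpy as np
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.sampling import Condition
from sdv.single_table import GaussianCopulaSynthesizer

rng = np.random.default_rng(1)
n = 20_000
real = pd.DataFrame({
    "amount": rng.lognormal(3.5, 1.0, n).round(2),
    "hour": rng.integers(0, 24, n),
    "is_fraud": rng.random(n) < 0.002,  # ~0.2% fraud: heavily imbalanced
})

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data=real)
synth = GaussianCopulaSynthesizer(metadata)
synth.fit(real)

# "100x the anomalous transactions": generate extra synthetic fraud rows only.
n_fraud = int(real["is_fraud"].sum())
boost = Condition(column_values={"is_fraud": True}, num_rows=100 * n_fraud)
synthetic_fraud = synth.sample_from_conditions(conditions=[boost])

# Semi-synthetic training set: real data plus the synthetic minority boost.
augmented = pd.concat([real, synthetic_fraud], ignore_index=True)
print(augmented["is_fraud"].mean())  # minority share is now far healthier
```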

[00:11:50] Alexandra: And I think that's super interesting from two perspectives. First, because you can boost data that is lacking in the real data, which of course is something that can be helpful, though with some limitations: when we think of medical applications, I can't go to a synthetic data generator and say, please now generate me healthcare study data for the female body, which was not researched for the past 100 years or so. But one thing that is super helpful when we think about responsible, trustworthy AI applications is that, of course, enterprises not only need to care about the predictive performance of their fraud detectors, they also need to care about things like explainability and transparency.


Now, one of the benefits that our customers have seen when they use synthetic data to boost the fraud cases is that it made it possible to actually use less powerful downstream models, which from the model architecture are way easier to explain than the most sophisticated deep learning models, for example. That then aids their efforts in explainability, also from an AI governance perspective, and leaves them better set up to be future-proof and compliant while they innovate.

[00:12:41] Faris: And I think the important thing there is, if you actually want to be very open about how you came to a particular decision, or how your model decided on one outcome versus the other, before you would have to use very complex techniques to kind of work backwards: how did that neural network come up with that particular permutation you're testing against? But now you can not only look at this simpler model, which is much easier to see inside of, but also at the data set itself that it was trained on. Because it's synthetic, you can share it with anybody who's interested in really delving into it, and they can recreate that thing, or use it to balance the data better, or even...

[00:13:26] Alexandra: spot problems that are. There we go. Exactly. I think UK is also quite leading in the concept of having a responsible AI assurance ecosystem, having people auditing your models. Of course, as a large organization, you want to have capital capacity and expertise in house, but you also want to have the capacity to have your models audit mitigate risks. That way, you can't have a human auditor looking at your model and just putting a check mark below, saying this is not going to discriminate. They need granular examples. And if you can't give them production data and real customer data, of course, due to privacy, synthetic data is the only option which actually reminds me one of our joint conversations we're leading with public sector organizations is also how to actually use synthetic data to create hyper diverse data sets that then by the regulators, by the public entities, could be distributed to private sector organizations to give them, actually this training material and testing material prior to putting the models into production, figuring out if there are any biases that need mitigation, any discrimination against minority groups. And I think that's another fantastic application area for synthetic data.

[00:14:32] Faris: Yeah, it is. And I've seen this in quite a few of the regulated verticals I work in. We talked about FSI before, but in healthcare and life sciences I've seen, for example, research organizations that work with rare diseases in the oncology space. They have research centers, one in China, one in America, one in Switzerland, all running separate trials across the world. Now, one of the problems with any medical data, genomics data, or just your general medical records, is that it is such a complex terrain to navigate.

[00:15:20] Faris: Cross-geography data sharing, with all the privacy rules, is really hard. And this is internal, you know, internal company sharing. That causes two problems. First of all, you're not able to join all these data sets and trials together to accelerate development, so you're again working in a forced silo, let's say. And then the other problem is when you're looking at the wider ecosystem. The opposite happened during COVID, where suddenly countries and organizations were sharing huge amounts of data sets about, you know, the COVID variants and all this kind of stuff.


And that really accelerated development so much with the vaccines and everything else. That's the kind of nirvana that a lot of the life science organizations I talk to want to recreate all the time: not just sharing data internally, but also with their research partners, with other global organizations that they collaborate with, just to speed up the whole process. And then there is this other idea: we talked about balancing your data.


So one of these organizations knew that they did not have enough representation, let's say, from the Hispanic community for the trials. So immediately they knew that it's not going to be a good idea to try and go to market with anything that has not been tested thoroughly. So they were using synthetic data sets to boost the way the model is trained to detect things, and also to approach other groups who would otherwise be underrepresented. So that's another one of those hybrid approaches: augmenting, as opposed to completely recreating the data. Yeah, absolutely.

[00:17:00] Alexandra: And then I think it's also quite interesting, we talked about financial services, insurance, healthcare, all these regulated industries, when we think about the potential they have when they go to the cloud, and all the AI services that are available there, yet the restrictions they face, with their data not being allowed to move, and all the internal policies. So these applications of using synthetic data to migrate to the cloud are, I think, also something that is gaining a lot of traction. And maybe we can also come back to the point you made earlier about why specialized models, which are also way more compute-efficient and capable of running locally, help with these data liberation efforts.

[00:17:39] Faris: Yeah, you bring up a really important point. Obviously, people might think I'm invested in making sure that everyone moves their data into AWS, but I worked at Microsoft before, so, you know, you're fine with Microsoft and AWS. I think the bigger question is: what are they missing out on by having some of this data locked up on-prem and not part of their wider analytics platforms?


Many organizations have some of their data sets in the cloud, and they do some great AI, advanced analytics, and everything on there, which is super. But frequently the most interesting data sets are locked up, like you said, on-prem, because they contain highly sensitive information or they're highly regulated. So the approach that we've used, successfully, with so many different organizations is this idea of saying: okay, we'll leave that data where it is, we're fine with that. Let's generate a synthetic version of it, and that will go into the cloud and sit amongst all the other data sets you've got. And the beauty of this type of synthetic data we keep mentioning is how accurately it mirrors the original data. So the way we describe it is: you can ask questions of the synthetic data and you get the same answers as if you had asked the real data, without losing any of those insights. So that's a nice kind of data democratization story, let's say, away from on-prem.
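A short sketch of the "same answers" idea: ask the same analytical question of the real data and of its synthetic twin and compare. It reuses the `real` and `twin` DataFrames from the earlier correlation sketch:

```python
# Same question, asked twice: average income by age band.
bands = [17, 30, 50, 80]
real_answer = real.groupby(pd.cut(real["age"], bins=bands),
                           observed=True)["income"].mean()
twin_answer = twin.groupby(pd.cut(twin["age"], bins=bands),
                           observed=True)["income"].mean()

# The two columns should line up closely if the twin mirrors the original.
print(pd.concat({"real": real_answer, "synthetic": twin_answer}, axis=1).round(0))
```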

[00:19:06] Alexandra: I think on the one hand it's that you can ask these questions, but then also: who can ask these questions? There is this big gap between organizations who are already way more mature in their data democratization efforts, understanding that this is a strategic data asset that should not only belong to some privileged high priests of data science, but that anybody, also on the business side of things, could benefit from, versus those who are still in the days of: okay, a team, product development, AI, whatever, wants to have access to some data, and then they have to wait five weeks, five months. Absolutely. And I think that's just not the way of working when we think about the potential that data can bring to an organization.

[00:19:46] Faris: Yeah, I totally agree, and it's a great story that really mirrors a lot of the experiences I've had with some of the customers, where the subject matter experts would presume things. "I've been doing this for 10 years, I know what's happening," those kinds of lines. And very often they might have known the area or the market from a few years ago, but the data subtly changes, and you're not monitoring data at that level ever; people who are just using it would just look at the aggregate outcomes from that data. So there's this idea of it ever evolving and changing. It's like a living organism, a lot of this kind of data.


So you've got to have a way to capture these kinds of subtle shifts without exposing, of course, any private data. But then there's the other aspect: okay, so you're saying I've got this really interesting data. Within AWS we're very much a data-driven organization; everything we do has to be underpinned by very strong data evidence. And we talk to our customers a lot about this: how do you make data the foundation, really, of everything you do? But often, when you talk to companies, they look at data as a cost center, because "I've got to pay so much money to store it, to move it, to transform it," the usual things. And when they start understanding, especially with these emerging use cases, how much difference access to that data gives them, and what incredible transformations it can lead to, they become very interested in freely making it available to the organization. So we started to help companies think of these synthetic data sets as a new type of asset. Because before, you'd have your real data everywhere and you're trying to move it around, as I said, or use it with all the restrictions. Or you might have, for example, a one-off project where somebody comes in and synthesizes data, a data science team does a project, and then that's it.


It's gone, as opposed to it being the new de facto standard. So the idea is to say: could we imagine a world where a company has two types of data assets that coexist? One is your real data, and this will still have the same barriers and constraints that you should have on it. But you also have the ability to create multiple synthetic data versions of it for different downstream use cases. We mentioned a couple, model augmentation and testing, and there are many other scenarios. So you could say: I can create different versions of this data depending on the use case, and I don't have to pre-prepare it all. I can even set it up in a way that I can invoke it on demand; I train the generator on the real data so it's ready to go. And of course, you can share that model freely, without exposing any of the data...

[00:27:04] Alexandra: ...because the synthetic data generation models already have the privacy mechanisms baked in. And regardless of whether you then replicate what was learned from the original data or rebalance it, this will be anonymous data that doesn't pose any privacy risks.

[00:27:17] Faris: So this becomes like a new, incredible source of never-ending data. You can generate as much as you want, on demand, so you can have it available there. And as people come and request something, you can say: okay, you can't have the real data, but I can give you synthetic data; how much do you want? Obviously you don't have that actual conversation; this is automated, and we kind of bake it into the actual platform.
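The "never-ending, on-demand" pattern boils down to persisting the trained generator itself, rather than any single synthetic extract. A sketch, continuing with the SDV stand-in from the earlier examples, where `synth` is a fitted synthesizer and the file name is invented:

```python
from sdv.single_table import GaussianCopulaSynthesizer

# Persist the trained generator itself, the reusable asset,
# rather than any one synthetic data set.
synth.save("customer_generator.pkl")

# Later, an approved consumer regenerates as much data as they need, on demand.
generator = GaussianCopulaSynthesizer.load("customer_generator.pkl")
fresh_sample = generator.sample(num_rows=250_000)  # size is the consumer's call
```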


So we have data catalogs and governance platforms, like DataZone, for example, where you can have the real data there in the catalog, and then, depending on your role and what you're allowed to see, you either get to see the real data, or it generates a synthetic version on the fly, or it already has one which you can use. But it's not just about generating the data; you also have to have the metadata to go with it, so you understand where the data came from, how recently the generator was trained on this data, because it sometimes changes dynamically. And this becomes a very different approach to liberating this really important asset we keep talking about. But it also opens up other opportunities, data monetization, for example. Normally, if you go and talk to somebody about selling really sensitive data, that conversation will never happen. But it's sometimes really valuable to be able to share, let's say, in some of the regulated industries we talked about, some data set which would be so great to share. Telco, we both...

[00:28:49] Alexandra: Yeah. So when we think about these data sets: most organizations are super interested in the mobility data that, of course, telcos own. But with synthetic data it then suddenly becomes possible to enrich those location movement patterns with demographic information, so that you not only see aggregated movement but can actually drill into how a college student is moving, how a 40-year-old top manager is moving, et cetera. So many monetization opportunities arise.

[00:29:19] Faris: And there was another whole sector. So let's imagine, we talked about fraud detection, and we talked about this complexity of sharing data. Let's say you've got a banking group, or a whole range of banks globally, who want to collaborate on training all their fraud detection models on a combined data set that contains all these different variations at each bank...

[00:29:44] Alexandra: Or even including the network providers, which have a different lens on fraud patterns and movements. There are so many exciting things happening. What we've also recently been discussing are these emerging clean room use cases, where we say: okay, bank A wants to collaborate with insurer B, merging the data sets in a privacy-preserving manner to get a much more holistic view of the customers' lives. Because for a bank, obviously, their view on the customer starts with them entering the organization, but we as human beings natively transition between x, y, z services. And I think this will also open up a lot of opportunities in this platform economy, really having platforms that allow users to go from A to B in their user journey without being artificially stopped by: okay, this is now a banking service, this is an insurance service, this is an Uber mobility service.

[00:30:36] Faris: Yeah. It really changes the whole terrain of how people collaborate with data. And this idea of having a clean room: it's a very specific kind of environment where neither party can see what the other's data is. You're both putting your original data in there, and the algorithms automatically join them up, so it understands where the correlations are. But the point is, normally with clean rooms, and AWS has a great offering in the space, you stop people running queries below a particular level. You might say you're not going to return fewer than 1,000 results, because you might reveal some private information. But we suddenly have this ability, we just embedded MOSTLY AI's model into Clean Rooms, and it gives you a whole new approach: okay, let's put this data together like we did before, but now neither of us is going to ask questions of that real data. We're going to create synthetic data from the combined, amalgamated sets, and then both of us can freely use it. It's absolutely transformative, that kind of approach.
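A heavily simplified sketch of that combined-then-synthesized idea. This is not the AWS Clean Rooms API, just the shape of the workflow, with SDV standing in for the embedded generator and invented tables for the two parties:

```python
import numpy as np
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

rng = np.random.default_rng(2)
n = 2_000
keys = np.arange(n)
# Hypothetical contributions, matched on a shared (e.g. hashed) customer key
# inside the clean-room environment; all names and values are invented.
bank = pd.DataFrame({"key": keys, "balance": rng.lognormal(6, 1, n).round(2)})
telco = pd.DataFrame({"key": keys, "data_gb": rng.gamma(2.0, 5.0, n).round(1)})

# Join inside the clean room, then drop the identifying key before modeling.
joined = bank.merge(telco, on="key").drop(columns="key")

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data=joined)
synth = GaussianCopulaSynthesizer(metadata)
synth.fit(joined)

# Neither party queries the joined real data; both take away the synthetic set.
shared = synth.sample(num_rows=10_000)
```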

[00:31:46] Alexandra: Absolutely, absolutely. I want to dig a little deeper into what you talked about a while ago, because this maturity of thinking that you and AWS have on bringing synthetic data and data democratization at scale to organizations is something I have not seen with any other organization. So to make it more actionable for our listeners, an executive thinking about integrating synthetic data into their data strategy: what's the difference between just using a synthetic data tool on an ad hoc basis versus implementing it in a manner that serves enterprise needs? You mentioned access control, metadata, and data cataloging. Could you paint that picture again, maybe as a step-by-step approach? What should our listeners think about when they want to transition to this data democratization style of operating?

[00:32:31] Faris: Yeah, I think it depends on the organization, but let's just look at some scenarios. First, you've got to have an understanding of what the important data assets are that you're looking to work with. Let's start there. They'll be scattered everywhere: some on-prem, some data sets stuck in a mainframe or an SAP ERP system, something like that. You could be a global organization with data centers in different parts of the world, and of course some things in the cloud; it could be multi-cloud, and some estates are so complicated. So the first question is: am I looking at just creating an internal data marketplace? Fine, that's one approach. We're going to create synthetic versions of all these data sets we have; we can either put the real ones in there plus the synthetic, or just the synthetic, and then we can make it a free-for-all in this catalog. People can access it, ask questions. Great.


So that's the simpler way of thinking about it. The more interesting one is what we touched upon earlier, which is making data available now, across all these applications. We have an initiative called the data migration factory, which is this idea of liberating data from numerous sources. They could be legacy applications that they're not ready to move, or that will take five years to move; a mainframe can take five to ten years sometimes if you're going to move the application. The problem with that is: I need the insights now. I can't wait five years. Or every time I run a query on the mainframe, it costs me huge amounts, because, you know, the MIPS costs are so high.


So how can I really solve for that scenario? There are multiple ways, but what we can help them do is take a near-real-time image of, not the whole data, just the parts that matter to you. Exactly. It could be 10% of the whole thing that's used to run the mainframe. We can put that in the cloud, and then you can start using it; so that's the real data. Or you can do that, but then synthesize it and have that available too. So it's not just about taking a whole set and putting it in. You can be very strategic in the way you think about what you're going to move there and why, and the best way to do that is to work backwards from the insights you're trying to derive.


Say: okay, what are the questions my business is desperate to answer? Then you go back to: okay, where is all this data sitting, if I had a free-for-all? And then the final question is: which of this data is sensitive, such that I cannot make it available to everybody, and could I synthesize that, for example? So there are lots of paths through this, to make it part of your normal day-to-day operations.

[00:35:23] Alexandra: And I think there's also this discussion we had about having data cataloging and data discovery features, where all the different lines of business and individuals can come and see which assets are available and either proactively request a synthesized version of a data set, or, if it's something where leadership has decided, hey, this is a data set of strategic value,


already have automated processes set up for refreshing that data, maybe providing the data assets in a refreshed format on a daily or weekly basis, or just providing the synthetic data generators, to then give the data consumers the ability to say: I need a replica of this data set for analytical purposes, or I need a rebalanced version for my use case, or I just need conditional generation, college students only, because I'm building an application that's focused on that customer group. So it's really, really exciting to see where we're currently heading, and what that will enable, not only from a cost optimization perspective, but really from an unleashing-innovation perspective, when we think of data as an asset.
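A sketch of that last request, conditional generation for a single segment, again with the SDV stand-in and an invented `segment` column:

```python
import numpy as np
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.sampling import Condition
from sdv.single_table import GaussianCopulaSynthesizer

rng = np.random.default_rng(3)
n = 10_000
real = pd.DataFrame({
    "segment": rng.choice(["college_student", "professional", "retiree"], n),
    "monthly_spend": rng.lognormal(5, 0.8, n).round(2),
})

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data=real)
synth = GaussianCopulaSynthesizer(metadata)
synth.fit(real)

# A data consumer asks only for the segment their application targets.
students = Condition(column_values={"segment": "college_student"},
                     num_rows=5_000)
student_data = synth.sample_from_conditions(conditions=[students])
```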

[00:36:23] Faris: And obviously there's a whole different kind of problem here. Most organizations collaborate with external partners and GSIs, like, you know, I mentioned Accenture, and there are many others, of course, to build applications for them or to develop something new and transformative, or to take a mainframe, let's say, and break it down into microservices and things like that, which takes years. But the classic problem with that is: you want them to be able to build this for you, but you can't give them your production data to test it on.


And we talked about this earlier. So that's another way to get ready, because if you're trying to test all this new gen AI stuff that's everywhere, and you want your line-of-business people in the organization to start playing around with it, experimenting and getting value from it, you would feel very uncomfortable considering: do I trust that these models are not going to release any data that people aren't allowed to see? So you're going to have to think about a new construct of access controls. Or you can say: well, you know what, I'm going to think laterally here, and I'm going to make synthetic versions of all this data.


So even though they can ask questions now, you might want to restrict it more by the grade of somebody seeing a particular depth of data, like we talked about, somebody's wages, their salary. But it's more about: how do you deconstruct your current way of deriving insights from your data and think of it in a whole new paradigm, which is being driven by gen AI applications and agentic workloads, like we mentioned earlier? That gives you a real incentive to rebuild the data foundations which have to underpin all that.

[00:38:13] Alexandra: Do I understand you correctly that you're saying that setting up synthetic data democratization in the way we just discussed, with data discovery, access control, metadata, and everything, also prepares you for this age of agentic AI, where we of course also need to consider access controls and all these different rules that are currently, quite complicatedly, baked into existing applications, and move them into a shape and form that AI agents can also effectively and compliantly navigate?

[00:38:39] Faris: Exactly. So you're now winning on both fronts. First of all, your data can be used in a much more effective way with the traditional analytics workloads you run now, but it's also not just ready but actually optimized for these new emerging workloads we're talking about, these gen AI applications and things. And then, beyond that, when you start looking at the wider ecosystem, you say: well, now we can actually start collaborating with the whole supply chain of organizations that work with us. Because the supply chain isn't just about the products; it's about all the data that my and my partners' supply chains are generating. Can we now use that synthetic data paradigm to give us a whole new level of traceability and understanding of every single aspect of my supply chain, which I can study together as a group, all the way downstream, with everybody in this supply chain? And there are multiple variations of all these conversations; we need to have a follow-up conversation.

[00:39:37] Alexandra: Absolutely, it'll be a series. Because we need to come to an end: now that you've explained how our listeners should think about it from a strategic perspective, how could they get started with it? Or what is AWS's current development on synthetic data and making it available at this enterprise data democratization scale?

[00:39:54] Faris: Yeah, I think the last few months have been quite interesting, and I think what prompted a rapid change in our approach is the collaboration we've been doing with MOSTLY AI. Maybe some of the listeners already know, but you open-sourced your industry-leading model, and that kind of changed the whole dynamic of how people can get access to this kind of power. But of course, you need to run that somewhere, and there are so many different personas. So we made it available directly from SageMaker; if you're a data scientist, or even a citizen scientist, you can do it. We've also baked it into specific products, like Clean Rooms, which I mentioned. And you've got a great drag-and-drop UI that people who know nothing about any of these techniques can use.

[00:40:41] Alexandra: And now I think it's also available on the Marketplace.

[00:40:45] Faris: Yeah, it's just been released on our marketplace. And the beauty of it is, because the first question people might ask is: hang on, if that's a SaaS product that I run over the internet, I'm not going to put my data up there. But one of the incredible things that you guys have done with this is that you can run it and deploy it directly into your own VPC, your virtual private cloud, in your own account.


So you're not actually putting your data anywhere except inside your own space, and you've got full control and visibility over that. And that gives you a whole other layer of access for the whole organization to go in, not just to create data sets, but actually to understand them; there are all these incredible metrics and graphs and things to showcase how private something is and how closely it matches the original.


So you're not guessing, and you don't need a data scientist sat next to you to walk you through it; very visual interfaces. So it's accessible today, and many organizations across all the sectors we discussed are using it currently. That's all happened over the last few months.

[00:41:50] Alexandra: Definitely exciting, with all these conversations that are ongoing. And maybe, if I can ask you again to highlight them: the top three recommendations that you would give to our listeners who say, we want to get started on synthetic data, we want to think about it in the right way to really get these strategic data utilization benefits. The three most important takeaways?

[00:42:09] Faris: I think one is to take a really deep look at your current data terrain. And this is not from the lens of optimizing it for cost or availability, but actually to understand: what are the new questions I want to be asking of my data, and is that data freely available now or not? So you're trying to get a sense of what is where, what is restricted and what isn't. Then, once you've got that visibility, say: okay, I'm definitely going to be using gen AI, because everybody in the world has to; it's not just an interesting thing, it's really imperative. Would I want to let that loose on my real data, or do I want a different kind of data paradigm? And then finally: how can I get that to become our new type of asset in the organization? How can I make it so available and accessible that it changes the way we talk to our data?

[00:43:12] Alexandra: Absolutely. And I think this is where it also nicely feeds into these AI democratization efforts: AI assistants helping those folks who are more citizen data scientists, such as business users, to actually get an understanding and a perspective on what becomes possible once they have access to data, and once AI tooling helps them not only use it for analytical purposes, but also, in some of the initiatives you alluded to earlier in off-the-record conversations, for building their own AI applications, prototyping on a highly abstracted natural-language interface at a level of difficulty anybody, my grandpa even, could handle. And I think that's just a match made in heaven when we think about data access paired with data insights and the possibility to use data for impact, for prototyping, for innovation. So that's an area I'm particularly excited to see. I want to conclude with a question to you: outlook, future, wish list. What is something that you can't wait to see happening at larger scale, or happening at all?

[00:44:10] Faris: I think the pace of change now is breathtaking. The thing which would be fantastic is if I could give organizations ultimate confidence that their data will be safe to use, in any form, with all these emerging technologies; that would be the best thing I could do. You know, once they know that by using these techniques they can step back from that constant fear of being exposed or causing a problem, that confidence really changes the way people start thinking about what they could do in the future.

[00:44:46] Alexandra: Agreed. Well, with that said, I can't keep you here any longer; back to your important mission to take the stage. But Faris, thank you so much for this very first live recording. I loved it. Thank you for coming, and I'm already very much looking forward to having a repeat conversation, maybe at the one-year anniversary of live recordings. Thank you so much.

[00:45:04] Faris: Thank you. Sounds amazing. Bye-bye.