Futureproof by Xano

Futureproof by Xano is a podcast for technical builders, entrepreneurs, and engineering leaders who want to stay ahead of what’s next.

Hosted by Xano’s CEO & Co-Founder Prakash Chandran, each episode features conversations with innovators and industry experts who are shaping the future of technology, business, and product development.

Your Agent Doesn't Know What It Doesn't Know—with Heather Lutz (Datasite)

June 11, 2026 • Prakash Chandran, CEO & Co-Founder of Xano

If you plug an AI agent into your data, how do you know it's giving you the right answer—and not just a confident one?

In this episode of Futureproof, Prakash Chandran sits down with Heather Lutz, Director of Engineering at Datasite, the provider of AI-powered solutions that enable private market investment, including virtual data rooms for mergers and acquisitions. Together, they unpack what happens when you point an agent at a massive data set without doing the foundational work first, why data readiness and governance are non-negotiable prerequisites for any AI initiative, and how Datasite is layering semantic views, verified queries, skills, and a "data doorman" to make agents actually useful.

About Datasite

Datasite provides the infrastructure that enables information flow for private market transactions, with purpose-built tools to optimize outcomes. Datasite’s innovative product portfolio, spanning sell-side virtual data rooms, buy-side intelligence, agentic AI applications, and an open data infrastructure layer, drives execution across the full investment lifecycle while generating unique data insights to empower investors, advisors, and deal professionals worldwide. Trusted by top private equity firms, investment banks, and consultancies, Datasite is built on 26 years of enterprise-grade security, compliance, and reliability. For more information, visit www.datasite.com

Topics covered include:

Agents are confident interns, not seasoned analysts: Why an AI agent querying your data won't know about data quality issues, duplicate revenue tables, or missing filters—and why confidence without context is worse than no answer at all.
Data readiness as CI/CD: Why testing data should follow the same discipline as testing software—with checks at every stage of the pipeline—and why continuous data quality monitoring barely exists as a standard practice yet.
Data governance makes agents work: How domain ownership, shared metric definitions, and semantic layers turn an unnavigable ocean of tables into a surface an agent can actually be expert on.
Don't work with the ocean: Why starting with your top ten metrics, your most important structures, and a bounded consumable layer is the only practical path to making AI-over-data work at scale.

Episode ID: 19329740-your-agent-doesn-t-know-what-it-doesn-t-know-with-heather-lutz-datasite

Subscribe to Futureproof wherever you get your podcasts.

From Xano - The fastest way to create a production-ready backend for any app or agent. Xano unifies AI speed, code control, and visual clarity, so you never trade reliability for velocity. Sign up for free today.

SPEAKER_01 0:00

They're going to ask the analyst, I need this data. I need the current metrics on this given KPI. And if there's something wrong with that data, the analyst would work around that or massage it, maybe let you know. But let's say you just put an MCP layer over top of your data. They get an answer back. And the agent, it sounds like it could be the right answer. That agent might not have any idea that there is a data quality issue with that table.

SPEAKER_00 0:44

I'm Prakash Chandran, CEO of Xano. Today I'm joined by Heather Lutz, Director of Engineering at Datasite, a SaaS platform for the full M&A lifecycle that processes some of the most sensitive financial and legal documents in the world. Heather's background is unlike that of most engineering leaders you'll hear from. She started as an English teacher in Japan, went through a coding boot camp, got hired as a junior engineer, and over the past decade built out the core data infrastructure that backs Datasite's platform today. She introduced Kafka, Airflow, and real-time data flows to the organization, started the analytics engineering and product analytics team, and now leads data infrastructure and engineering across the US and Costa Rica. There is a line Heather said that I think every builder needs to hear. AI is going to accelerate you in one of two directions. It's either going to give you intelligence at a speed you've never had, or it's going to give you wrong answers faster without you recognizing how wrong they actually are. The difference, of course, is what's underneath: the quality, governance, and structure of your data. This is a conversation about the foundation that most teams are skipping in the rush to plug in AI. Heather, so good to have you here today. Thank you for joining.

SPEAKER_01 2:00

Yeah, thank you for the invite. This is great.

SPEAKER_00 2:03

Yeah, awesome. So I wanted to start with something I mentioned at the top. You went from teaching English in Sapporo, Japan to building core data infrastructure for a company that handles some of the most sensitive data in the world. Can you walk us through that journey at a high level?

SPEAKER_01 2:19

Yeah. So I graduated with an English degree in about 2008. And if you were around then, you know that was a really rough time to be graduating with an English degree, or pretty much any degree. So I took the opportunity at the time. I had taken four years of Japanese by that point, and ended up in Japan, teaching English. I absolutely loved it and thought I wanted to go into teaching and be in education forever. I still enjoy education, but I did start to burn out from that position relatively quickly. I was in it for a while, then moved into higher ed administration. I actually have my master's in higher ed. And thankfully, during that time as an administrator, I had the privilege of advising software students, folks pursuing a degree in software engineering. It was a small cohort of students. I got to know them quite well. And I also realized that I thought what they were doing was far more interesting than what I was doing. So I decided to quit my job, go to a boot camp for a little less than half a year, try my luck, and very fortunately got picked up as a junior engineer by the organization that became Datasite. I moved through a bit, and spent some time developing the MongoDB infrastructure for Datasite back when we were still greenfield. Thankfully, my boot camp program had trained me in the MEAN stack. I was one of the only engineers on the team who had even touched MongoDB at that time. So I got to work with the data. I got to set up that infrastructure from scratch: everything from setting up the clusters, downloading the actual instances, doing the installs, the monitoring, indexing, some of the initial schemas, all of it as a junior.

SPEAKER_00 4:17

I remember, in preparation for this, you were saying it was within your first week. They asked, who's worked with Mongo before? And because of your MEAN stack training, you raised your hand, and look, it has parlayed into quite a career for you.

SPEAKER_01 4:30

That was amazing. I am still surprised at the things they let a junior engineer do. Set up all their data. Yeah, that was an experience. But I did that. I was in back-end engineering for most of the first part of my career, about five-ish years or so, alternating between working with the data store on the operational side and coding for everything from authorization and authentication to document metadata related work. And eventually they needed somebody to manage the analytical data space. That team had spun up about a year prior, and I accepted. They also tossed in the operational store with it, and I began managing the entire data stack for the product. And eventually I brought in, as you mentioned, a full-on analytics team, an analytics engineering team, started bringing in Kafka streaming for our data, and here I am.

SPEAKER_00 5:35

Here you are, a decade later, after raising your hand for Mongo in that first week, and you're now a data expert. This is a conversation about data. Arguably, data's always been important, which I know is something you're a very strong proponent of. But increasingly, especially in the world of AI, it is more and more important. And for those who are avid listeners of this podcast, almost every organizational leader I talk with says it always starts with the data. So, going back to the quote I mentioned, you said AI is going to accelerate you in one of two ways. Either you're gonna get intelligence at a speed you never thought possible, or you're gonna get the wrong answers faster than you even recognize how wrong they are. So I'd like for you, in the context of this data conversation, to unpack what you mean by that.

SPEAKER_01 6:29

So previously at Datasite, if somebody needed information, and when I say somebody, I typically mean a business user, somebody in sales and marketing, somebody over on the product management side, typically what they're gonna do is ask the analyst: I need this data, I need the current metrics on this given KPI. Can you tell me this information? Can you put it in a dashboard? Can you pull it out of the data store? And the analyst will go do that. And if there's something wrong with that data, or with that data set, or if there are known quality issues, the analyst would work around that or massage it, maybe let you know. But at the end of the day, you would hopefully have a fairly accurate report, assuming there aren't multiple sources of the same thing in the data store, and the analyst is pulling the right information and knows that data quite well. But let's say you just put an MCP layer over top of your data, and you allow this business leader to query it using the AI tooling of their choice. You plug in Claude, you plug in chat. They get an answer back. And it sounds like it could be the right answer. The agent seems very confident; it has analyzed the response. That agent might not have any idea that there is a data quality issue with that table. That agent, depending upon how that question was asked, might not realize that there are actually three different revenue tables somewhere within the data structure, and it doesn't know which one to look at. Or it could be that they're used for two slightly different questions, and it doesn't have the context around that. So your business user just got the wrong information, but it looks awfully correct because that agent is awfully confident. I had one of my engineering leaders say that the agents right now are like the extremely intelligent but overly confident intern. Very talented, very smart, but they don't know what they don't know.

SPEAKER_00 8:48

They don't know what they don't know. I think that is really important, because I do think there is still, I don't know if stigma is the right word, but definitely a thought that you can plug an agent into a data set, no matter how large, and it will figure things out. Sure, it might take a little bit more time to think, but it's gonna figure it out for you and it's gonna come back to you with a confident answer. You actually ran an experiment, if I remember this correctly. You said you sent an agent to Snowflake with 22,000 tables and two petabytes of data. Talk to me a little bit more about what that experience was like, what you learned from it, and then, from those learnings, how you decided to segment out agent access to data moving forward.

SPEAKER_01 9:37

Yeah. So we actually did not give it access to the full 22,000 tables. We did give it access to a subset of it. We gave it access to our usual consumable layer, basically our analyst layer, the layer that the analysts typically use. And we let a business user set up a connection to the MCP and start asking it questions. Thankfully, we paired one of our analysts with this individual. And honestly, even the business user knew the answer was wrong. That particular business user knew their data pretty well. They look at these KPIs, they look at these numbers all day. It wasn't even close, mostly because there are additional filters that just weren't being applied. There's terminology the business user might be using where the column name or the table name is not quite the same thing. So it was very evident that the agent was making some guesses. It did give an answer. You got one. It was not the correct answer, and it would certainly not be something we would have put on a report.

SPEAKER_00 10:48

Yeah. Even though you might have a much larger data set than what people are traditionally used to, I think there's a question of figuring out where to start. If people are generally aligned with the fact that you can't just throw a large data set at an agent, well, what can you throw at it? Later on, I'm sure we'll talk about the beginnings of it, but how do you at Datasite, and as a data leader, think about how to partition different agents, different MCPs, different roles, and what they get access to?

SPEAKER_01 11:29

Sure. Well, first of all, if you want to put the entire landscape of, let's just say, your consumable data in front of the agent, and you have somebody writing a skill who really knows what they're doing, who can get detailed into it and specifically point to the exact tables to look at and the exact information to filter on, you actually will get a half-decent answer. This is assuming that the person who wrote that skill knows what they're doing. And you're probably going to repeat some of the same instructions over and over and over again. Eventually that's going to start to become a little bit untenable. Skills are typically meant to answer a "how" question, and in that case you're essentially instructing it to answer a lot of "what" type questions as well. So from there, how do you start chunking this out so that the agent doesn't have 20-something thousand tables to go and look at? Well, first of all, take a look by domain and by role. Most people who are pulling information in an organization are pulling it by some sort of domain. If you are working in product, you are probably pulling things around feature usage, metrics around the usage of the components in the application. Hopefully you have some kind of OKRs that you want that information for. You want to know how many people are using your product. Or if you're in marketing, maybe you're looking at cross-selling information. But you are typically working out of a general bucket of information. So segment your information from a data perspective. You should have ownership by domain in a perfect world, probably with some cross-dependencies underneath. And it's the same thing with the MCP and with the agent. Give it a surface of data where it's going to be the expert. I literally at one point thought to name our product agent after our analyst.
I'm like, this is essentially an analyst's companion for product. That's what this is doing. So it's over top of this particular domain. Hopefully, that domain is also fairly well curated. If you have your gold consumable layer of data and you say this is my data product, and I don't necessarily mean it's client-facing in that sense, I mean this is the data that I care about. It's the one that I'm saying is ready to use. It's of a quality that I am satisfied with. It is well documented, it is governed. Ideally, you have some kind of semantic layer, or semantic views in the case of Snowflake, on top of this. That's a good surface to then put an agent or an MCP on top of. It is also typically attached to your RBAC solution, so your users in that area should also only have access to that data.
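
The "surface of data" idea above can be pictured as a small lookup that maps each domain agent to a bounded set of curated tables and checks the caller's role first. This is a minimal Python sketch, not Datasite's implementation: the domain names, table names, roles, and the `tables_for` helper are all hypothetical, and a real setup would rely on the warehouse's own RBAC rather than application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentSurface:
    """The bounded set of curated tables one domain agent may query."""
    domain: str
    tables: frozenset[str]
    allowed_roles: frozenset[str]


# Hypothetical domains, tables, and roles, for illustration only.
SURFACES = {
    "product": AgentSurface(
        domain="product",
        tables=frozenset({"gold.feature_usage", "gold.active_users"}),
        allowed_roles=frozenset({"product_analyst", "executive"}),
    ),
    "marketing": AgentSurface(
        domain="marketing",
        tables=frozenset({"gold.campaign_results"}),
        allowed_roles=frozenset({"marketing_analyst", "executive"}),
    ),
}


def tables_for(user_roles: set[str], domain: str) -> frozenset[str]:
    """Return the tables the domain agent may see, or refuse outright."""
    surface = SURFACES[domain]
    if surface.allowed_roles.isdisjoint(user_roles):
        raise PermissionError(
            f"no role in {sorted(user_roles)} may use the {domain} agent"
        )
    return surface.tables
```

The point of the design is that the agent never sees the full table catalog: it only ever receives the handful of tables its domain owner has declared ready to use.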

SPEAKER_00 14:43

Yeah, there are a couple of threads I want to pull on there. One of the first things I want to recognize is that, before you define the skill file or anything like that for the specialty of what that agent pulls and how it pulls it, you're talking about all of this pre-work that needs to be done before we even get there. You're even using words like governance, right?

SPEAKER_01 15:10

In an ideal world, yes.

SPEAKER_00 15:12

In an ideal world. But broadly speaking, you need to be able to have the data. I think you said you have to be able to curate it, maintain it, and make it consumable or accessible.

SPEAKER_01 15:26

It's like an API. If you're writing a piece of software, you are putting something into production. In this case, it's data. If it's not ready for somebody to read and use, ideally it shouldn't be in there. Curate it like it's an API. You are delivering a product.

SPEAKER_00 15:47

So I want to talk about two separate things. I want to talk about data readiness and data governance. Correct. Let's start with data readiness. When you say if the data is not ready, what does that mean to you?

SPEAKER_01 16:00

So to me, saying the data is ready means that you have some kind of CI/CD and testing process involved. For instance, if you have a column that is never supposed to be null, you've checked that. You've checked that before it's even landed anywhere in production. If you have a certain set of options, enums for a given field, you know what those are. You know it's not supposed to be outside of those parameters. You know that if you're joining two pieces of information, you're expecting the result to be within a certain range of values. You've already checked that before your CI/CD process has written it into the productized version of your tables. So all of that is in there, and that's just data in motion. That's data as it's landing in that consumable environment. Ideally, you also have some data quality monitoring over top of that longer term. So you're not just monitoring, hey, I don't expect this type of column to be null, except maybe in these circumstances. Okay, great, it landed in prod, but what about over time? Let's say all of a sudden I'm seeing more null values than I would expect for this type of value. Or hey, all of a sudden I'm seeing a great increase in, for the Datasite process, this type of project. What does that mean? It's changing an expectation. It's going outside of what you would expect. It's outside of a standard deviation. So you're measuring it in terms of: can I put this data into prod? Can I promote this into a working layer? Plus, and this is something that I think you don't necessarily measure as much in typical software engineering, over time I'm examining how my data set is changing, possibly in ways that I don't expect.
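
The "data in motion" checks described above (a never-null column, an enum with known options, a value that must stay in range) can be sketched as a promotion gate that runs before rows are written to the consumable layer. This is a minimal, hedged Python sketch over plain dictionaries; the column names, allowed values, and bounds are hypothetical, and in practice a team would more likely express these as tests in its pipeline tooling.

```python
from typing import Any

Row = dict[str, Any]


def check_not_null(rows: list[Row], column: str) -> list[str]:
    """Report every row where a required column is missing or null."""
    return [
        f"row {i}: {column} is null"
        for i, row in enumerate(rows)
        if row.get(column) is None
    ]


def check_allowed_values(rows: list[Row], column: str, allowed: set[str]) -> list[str]:
    """Report every non-null value that falls outside the known enum."""
    return [
        f"row {i}: {column}={row[column]!r} not in {sorted(allowed)}"
        for i, row in enumerate(rows)
        if row.get(column) is not None and row[column] not in allowed
    ]


def check_range(rows: list[Row], column: str, low: float, high: float) -> list[str]:
    """Report every non-null value outside the inclusive range [low, high]."""
    return [
        f"row {i}: {column}={row[column]} outside [{low}, {high}]"
        for i, row in enumerate(rows)
        if row.get(column) is not None and not (low <= row[column] <= high)
    ]


def promotion_gate(rows: list[Row]) -> list[str]:
    """Collect all failures; an empty list means the batch may be promoted."""
    # Column names, enum values, and bounds below are illustrative assumptions.
    failures: list[str] = []
    failures += check_not_null(rows, "project_id")
    failures += check_allowed_values(rows, "project_type", {"sell_side", "buy_side"})
    failures += check_range(rows, "page_count", 0, 1_000_000)
    return failures
```

A pipeline step would call `promotion_gate(batch)` and refuse to write the batch whenever the returned list is non-empty, which mirrors failing a build in software CI/CD.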

SPEAKER_00 18:15

And I think the interesting part there, when we were talking earlier, is that you have a back-end engineering background. With back-end engineering, you have CI/CD and you have a testing process. With data, the continuous testing of data quality is not necessarily a standard. I'd love for you to talk a little bit more about that, because I remember being surprised that it was something you felt very strongly about. You touched on it a little bit, but I'd love for you to say more about why more people need to make it a standard, especially in a world where you have agents and AI accessing your data.

SPEAKER_01 18:55

Correct. I mean, essentially, once you have AI accessing your data, again, whatever they're accessing, that's prod. So if you wouldn't want this committed into your application, don't put it into your data set either. One of the questions I almost always ask in an interview, and it's very general, so if you ever interview with me, take note: how do you test data? And you should hopefully have an opinionated answer, or make a good one up. But you have tests that appear at various places, and you will now see this even in some of the tooling that you have. Whether you're using Snowflake or Databricks, they're starting to build data quality metrics and monitoring into their tool sets. dbt, if you're using it, has testing inside of it. With Airflow, you are able to detect when a given DAG fails, why it fails, and the logs for it. So you should have testing in the orchestration of the data. As the trigger goes off, you should be able to test as the data is transformed, joined, and landed, and you're confirming it at every point. And then you're going to monitor again at the landing zone, over a period of time, that the data is meeting your expectations. The orchestration, that's your CI/CD process. Did it fail, or something like that? You can almost think of your landed monitoring as kind of like an end-to-end test or a full-stack test. I view it like software. If it's important enough, I should be testing it and making certain it's okay.
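
The longer-term monitoring at the landing zone, where a value drifts outside what you would expect, can be sketched as a comparison of today's measurement against its own history. This is a minimal Python sketch, assuming a daily null-rate series and a two-standard-deviation threshold; both the metric and the threshold are illustrative choices, not something stated in the conversation beyond the mention of a standard deviation.

```python
from statistics import mean, stdev


def is_outside_expected(history: list[float], today: float, max_sigma: float = 2.0) -> bool:
    """Flag today's value if it sits more than max_sigma sample standard
    deviations away from the historical mean."""
    if len(history) < 2:
        raise ValueError("need at least two historical points to estimate spread")
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat history: any change at all is unexpected.
        return today != mu
    return abs(today - mu) > max_sigma * sigma


# Hypothetical daily null rates for one column over the past five days.
null_rate_history = [0.010, 0.012, 0.011, 0.009, 0.010]
```

Calling `is_outside_expected(null_rate_history, 0.08)` flags the jump, while a value such as `0.011` stays within the expected band. Unlike the promotion gate, this check passes or fails on how the data set is changing rather than on any single row.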

SPEAKER_00 20:51

Well, what's really interesting is that, at least as far as I'm aware, and of course I am not a data expert, I just don't think there's a standard around this. You just made some correlations around unit tests and dbt that totally make sense to me. But in the world of data CI/CD or data pipelines, the quality standard around how you continuously test that is not, I think, something people are thinking about. But they will start thinking about it more moving forward, I think.

SPEAKER_01 21:18

I do hope that comes out more. I have noticed that you will find walkthroughs for how to create dbt tests. You certainly have tools like Datadog and Monte Carlo that will monitor data for you. But you're right, I have not seen too many references to an end-to-end testing process for data, actually written down. Certainly not as many as I see for software engineering, where you have whole practices like test-driven development that have been built out. I don't think it's as big as that.

SPEAKER_00 21:53

Yeah. Okay. I'm sure we could talk about each one of these topics for much longer, but we talked a little bit about data readiness. Let's talk about data governance. What does that mean to you?

SPEAKER_01 22:07

So, data governance to me is the method of making your data accessible, usable, understandable, and consistent for everybody. So now you're stepping outside of your own domain, and this all needs to make sense holistically. This means that, first of all, I hope that if you have data for product, you have one team that is responsible for data for product. So this goes back to the idea of domain ownership. Who's in charge? This also goes to making certain, for instance, hey, we're talking about, let's say, total number of pages. A simple metric. Okay, what does total number of pages actually mean? Is that the total number of pages on a project? Is that the total number of pages that were processed? Is that the total number of pages that are available to a certain segment of users? What does that actually mean? And everybody then has to agree on that particular definition. This is behind the phenomenon where people in the organization can get a report on supposedly the same thing and get two totally different answers. That's because multiple people interpreted that information in completely different ways. Or the person requesting the data thought they were asking for one thing, and the person filling that request thought it was something completely different. So it's that agreement, even around how we use these terms. And I think there is an overlap with data quality as well, but governance is far more holistic. It's also making certain of how data fits together, because data is actually most important when it's used together. This is, I think, one of the biggest differences between data and software engineering. In software engineering, you want to try and keep that tight ownership, but loosely cohesive. Data is most useful when it's with other data. So you also really want to make certain that you understand where that data is coming from, tracking that lineage, in some governance structures even down to the field level.
So really understanding the origin of that data. But at the heart of it, it's putting the tools in people's hands to make certain that they have enough information about the data they are using to make very good, logical business decisions.
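
The "total number of pages" example above suggests one way to make shared definitions enforceable: a registry that holds exactly one agreed meaning per metric name and rejects a second, conflicting one. This is a minimal Python sketch with hypothetical class and field names, offered only as an illustration of the idea; a semantic layer would normally carry these definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricDefinition:
    name: str
    meaning: str        # the agreed, human-readable definition
    owner_domain: str   # the team responsible for this metric


class MetricRegistry:
    """Holds a single agreed definition per metric name."""

    def __init__(self) -> None:
        self._metrics: dict[str, MetricDefinition] = {}

    def register(self, metric: MetricDefinition) -> None:
        existing = self._metrics.get(metric.name)
        # Re-registering an identical definition is harmless; a different one is a conflict.
        if existing is not None and existing != metric:
            raise ValueError(
                f"{metric.name!r} is already defined by {existing.owner_domain}: "
                f"{existing.meaning!r}"
            )
        self._metrics[metric.name] = metric

    def resolve(self, name: str) -> MetricDefinition:
        """Look up the one agreed definition; raises KeyError if none exists."""
        return self._metrics[name]
```

With this shape, two teams cannot quietly publish different meanings of the same metric: the second registration fails loudly, which forces the agreement the conversation describes.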

SPEAKER_00 24:43

When we talk about data governance, I think holistically about what you're saying about governance in terms of its accessibility to everyone. But also in terms of access, you've got the platform layer, you've got the data layer, you've got the application layer. Where does governance actually start? And especially in a world, maybe we're getting into harness territory here, where do the instructions for these agents live? How do you think about that?

SPEAKER_01 25:16

Okay, well, where does governance actually start? First of all, I am not in charge of governance. I leave most of that to our governance team at Datasite, but I am fully inspired by them and I take what they say to heart. In an ideal, perfect world, they would say governance starts as soon as that data lands. Most people don't start governance as soon as their data is up and running. Ideally you would have started governance when your product and everything was greenfield. If you did, good job. That's perfect. I don't think that's most of us. So for us, the way we have chosen to start is from our client-facing layer, even if that client is internal. What's your gold standard of data? What are the top 10 metrics and reports that you are responsible for, that are the most important thing to your area of the organization? Start with that. Start with those particular metrics. And typically you will find that there's a series of metrics, tables, and structures that underpins a large portion of what you do. Document that, solidify that, and start from there. Because once you find those key structures and you start from that point, I think many other things become easier. And you might actually get to the point where you don't necessarily need to document, analyze, and pick apart absolutely everything. For instance, our bronze layer of data, our landed layer, we sync it. No one's gonna see that other than us. And hopefully we know what we're doing. We hope we know. That layer we don't really worry about. We worry predominantly about that consumable layer of data, that gold and platinum layer, where we know that we have folks from other areas of the org, other analysts, doing that work. Or in this case, this is where our agents are actually plugged in. So we have curated a very bounded layer where we have said everything in here is documented.
There is a semantic layer, in the case of Snowflake semantic views, built over top of this. We actually built those views when we started designing the product itself that became this layer. So it was built out first, before we even started coding that newer version of our layer. Anything new immediately has definitions coming in. This way, the most important things in your org, plus anything going forward, end up getting governed and end up in that consumable layer. And it's gonna be a lot easier for an agent then to roll through that and understand it.

SPEAKER_00 28:29

Yes. I mean, that makes a lot of sense. And I think that semantic layer is hugely critical. And I like your approach of starting from the client: the gold standard data, the top 10 metrics and reports. I am still interested, though, in the cross-functional piece, because giving agents access to the instructions on how to query the data is not a single-player game that the semantic layer can just solve on its own. How are you handling that today at Datasite?

SPEAKER_01 29:05

So the semantic layer helps. It helps a lot, but it's not everything. So this is where we're starting to dig into other pieces of information coupled with it. If you're using Snowflake, you have something, for instance, called a verified query. As we have users start to use this layer, I can't see the prompts, but we can see everything that Snowflake is querying on. So we can actually go and say, oh, this is queried very frequently. Let's create a verified query. So essentially training it and saying, okay, when you see this request, this is what you should be running. And that will also help it with very similar queries and similar filters. So that is a verified query, where it's saying, I know this is right. This is the answer that you should then expect. Tune from there. We also monitor these agents for their accuracy against these verified queries. So as we're deploying a new version of the agent, we actually test on that too. How closely did you hit these? And if it fails, we won't even deploy it. However, even going back further than that, so you've got the verified queries, you've got your semantic layer. We have also started to dig into skills as well. We do have skills out for our individual sets of users. So certain filtering requirements, almost like what the analyst is trained to ask: did they include this in the request? If they didn't, please prompt them for it, because it'll make the answer better. So that would be certain filters. We also train the agent itself on synonyms. We might call it a project, we might call it a data room. Those are synonyms; they could use either one. So that's layered on top of it. There's also potentially another layer we're considering, more on the LangSmith side.
There was a thought, once again coming from our lead developer, of what we call the data doorman, which is: maybe you actually don't need Snowflake. Maybe you don't actually need that data set. Maybe what you're actually looking for is click data. Maybe what you need is actually Pendo data. Or, you know what, if you're asking about an active project, maybe I could send you to our operational data store and that would be more efficient. The doorman will help you with that question, so that I can direct you to the correct place and you don't even have to ask us anymore which data sets and which tools you should be looking at. We'll just direct you there. So that's another thing we are experimenting with. So there are multiple layers in this. And we tried to keep the Snowflake agent portion very specific to the Snowflake data itself.
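
The practice of testing each new agent version against verified queries, and refusing to deploy on failure, can be sketched as a small regression gate. This is a hedged Python sketch under stated assumptions: `generate_sql` stands in for whatever produces SQL from a question and is hypothetical, the comparison is a naive text normalization rather than anything Snowflake provides, and the questions, SQL, and threshold are illustrative.

```python
from typing import Callable


def normalize_sql(sql: str) -> str:
    """Crude normalization for comparison: lowercase, collapse whitespace, and
    drop a trailing semicolon. It also lowercases string literals, so it is only
    suitable for a sketch like this one."""
    return " ".join(sql.lower().split()).rstrip(";").strip()


def verified_query_accuracy(
    verified: dict[str, str],
    generate_sql: Callable[[str], str],
) -> float:
    """Fraction of verified questions for which the agent reproduces the verified SQL."""
    if not verified:
        raise ValueError("no verified queries to test against")
    hits = sum(
        1
        for question, expected_sql in verified.items()
        if normalize_sql(generate_sql(question)) == normalize_sql(expected_sql)
    )
    return hits / len(verified)


def can_deploy(
    verified: dict[str, str],
    generate_sql: Callable[[str], str],
    threshold: float = 1.0,
) -> bool:
    """Block deployment when accuracy on verified queries falls below the threshold."""
    return verified_query_accuracy(verified, generate_sql) >= threshold


# Hypothetical verified queries keyed by the business question.
VERIFIED = {
    "how many active projects are there": "SELECT COUNT(*) FROM gold.projects WHERE status = 'active';",
    "total pages processed": "SELECT SUM(page_count) FROM gold.pages",
}
```

A stricter version would compare result sets rather than SQL text, since two different queries can return the same correct answer; text equality is used here only to keep the sketch self-contained.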

SPEAKER_00 32:23

Got it.

SPEAKER_01 32:23

It's about the Snowflake data and the semantics that are there. The layer before that, that skill layer, has tended to be more about the business user: how they're asking, how they're interacting. And then there's yet another layer around, okay, well, maybe we need to redirect you to a completely different location, or perhaps to a different agent in Snowflake. Maybe you're an executive and you're using this agent normally, but you need to ask about something in product, so we'll just redirect you to the product agent. And you're an executive, so you can know that information.
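
The "data doorman" and the synonym handling described in this answer can be pictured as a thin routing step that sits in front of the domain agents. This is a minimal Python sketch using keyword matching; the route names, keywords, and synonym table are hypothetical, and the real experiment is described only as an idea being explored, so nothing here reflects how it is actually built.

```python
# Hypothetical synonym table: user wording on the left, canonical term on the right.
SYNONYMS = {
    "data room": "project",
}

# Hypothetical routes, checked in order: (keyword, destination).
ROUTES = [
    ("click", "pendo"),
    ("active project", "operational_store"),
]

DEFAULT_DESTINATION = "snowflake_agent"


def normalize(question: str) -> str:
    """Lowercase the question and rewrite known synonyms to canonical terms."""
    text = question.lower()
    for alias, canonical in SYNONYMS.items():
        text = text.replace(alias, canonical)
    return text


def route(question: str) -> str:
    """Send the question to the first matching destination, else the default agent."""
    text = normalize(question)
    for keyword, destination in ROUTES:
        if keyword in text:
            return destination
    return DEFAULT_DESTINATION
```

For example, `route("Which users are in this active data room?")` first rewrites "data room" to "project" and then returns `"operational_store"`, while a revenue question with no matching keyword falls through to `"snowflake_agent"`. A production router would more plausibly use a model to classify intent; keyword matching just makes the layering visible.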

SPEAKER_00 33:04

Yeah. I love the way that you are approaching it. And I love the concept of a data doorman that kind of ushers you into the right kind of set to be looking at. Who owns this? Because it is so cross-functional, who currently owns this at Datasite, if you're allowed to share?

SPEAKER_01 33:27

Ah, that starts to get a little complicated. But let's go into who ideally owns it. If you own a domain, you've curated the domain, guess what? You own the agent. It is for your analysts, for your services. You are the best person to train it. Ideally, you own it. Or if you don't own it, you're at least heavily directing and influencing the engineering team that is doing it, would be my opinion on it. However, we are not alone in this. For instance, we're working on this doorman concept. We don't own LangSmith. That actually belongs to our AI platform team. They have the AI platform, and we work with them in concert to get features and functionality like this out the door. So at a really high level, that's the ownership. But we are still in such an experimental phase that we have not fully fleshed it out. We are still testing this. And currently we have released it to just internal product clients who are trying it out.

SPEAKER_00 34:36

Well, one thing is for sure: everyone is still experimenting and evolving. But at least you have some opinionated or prescriptive ways of what you've seen work and certainly what you have seen not work. And in that spirit, I think you made a comment earlier that, obviously, it feels easy to build right when you're starting from scratch and it's greenfield, but a lot of application development leaders might be listening to this, and they might be part of organizations that have been collecting data for years, decades, right? So I'm curious if you have a perspective on what they should start implementing now to start gaining clarity. Where would you recommend that they begin?

SPEAKER_01 35:24

This may seem like a really obvious spot, but do you know what you're asking for? That would be the first thing I would say. Do you have a good handle on what metrics you need, what those metrics actually represent, and what the definitions of those metrics are? Do you know what you should be tracking regularly? Do you know what you should be tracking as potentially new functionality comes out, or as functionality shifts? Having a good handle on that is really helpful for just everybody involved. So as we're building out the data sets, as we're building out these layers, get a good grasp on what we're actually supposed to be measuring, because you also have limited space and time for semantic layers, et cetera. They can only be so large. They can only hold so much before you have to divide them up again, or they won't perform very well. So making certain you have a handle on that would actually be one of the first things I would say to do. And then outside of that, once you have a good handle on the metrics themselves, I would say having an understanding of the flow of how you want to interact with those metrics. So I'm starting on the front end; this would be an application. Do you see yourself interacting with an agent and asking these questions via MCP? Or would you actually prefer a dashboard? One of the things our doorman actually does is, if metrics exist in a dashboard, we'll just send you to the dashboard. Why waste the tokens? Why consume more than we have to? We will send you to where those metrics are actually persisted, and you can just go and look at it rather than spending the extra money. But even then, having an idea of that flow also figures into, okay, well, what kind of a skill am I building out? What kind of functionality am I building out to support you in doing this? That does help.
And then if you haven't created a curated layer of data, if you threw it all in the data layer and hoped for the best, creating that layer, once again starting with, okay, well, here are my most important structures to have. Here are my most important things to surface. And even if you're manually putting it out there, if you don't have any kind of CI/CD process, or you're triggering some kind of cron job for dbt or whatever transformation you're doing, just making certain that you are getting your monitoring and your quality in hand around that key layer, building it up, setting up the basic monitoring, even the basic testing around landing it in prod, as we discussed, is helpful. And you'll slowly start building up the rest of it from there.
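[Editor's sketch] "Basic testing" on a curated layer can start very small. The example below checks two expectations on one key column, that it is never null and never duplicated, which are the same two ideas dbt ships as its `not_null` and `unique` tests. It uses an in-memory SQLite table so it runs anywhere; the table name `curated_projects`, its columns, and the sample rows are made up for the illustration.

```python
import sqlite3


def run_checks(conn: sqlite3.Connection, table: str, key_column: str) -> dict[str, int]:
    """Count rows and violations of two basic expectations on a key column.

    Table and column names are interpolated into SQL, so they must come from
    trusted configuration, never from user input.
    """
    row_count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    null_keys = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {key_column} IS NULL"
    ).fetchone()[0]
    duplicate_keys = conn.execute(
        f"SELECT COUNT(*) FROM ("
        f"SELECT {key_column} FROM {table} WHERE {key_column} IS NOT NULL "
        f"GROUP BY {key_column} HAVING COUNT(*) > 1)"
    ).fetchone()[0]
    return {
        "row_count": row_count,
        "null_keys": null_keys,
        "duplicate_keys": duplicate_keys,
    }


# Hypothetical curated table with one null key and one duplicated key.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE curated_projects (project_id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO curated_projects VALUES (?, ?)",
    [(1, "alpha"), (2, "beta"), (2, "beta-copy"), (None, "orphan")],
)

results = run_checks(conn, "curated_projects", "project_id")
print(results)

# A load would be considered healthy only if both violation counts are zero.
healthy = results["null_keys"] == 0 and results["duplicate_keys"] == 0
```

Run from a cron job or a CI step, a check like this gives the key layer a first line of monitoring before any heavier tooling is in place.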

SPEAKER_00 38:32

Yeah. I think one of the key things there is like, you know, you can work with like a bounded set, right? It doesn't have to be everything all at once.

SPEAKER_01 38:43

Don't work with the ocean. I actually recommend not.

SPEAKER_00 38:46

Yes, don't work with the ocean. That's a good recommendation. And then, in that bounded set, I think you can do a lot of the recommendations that we've discussed here today. Make sure that that set is governed, like that readiness that we were talking about, the governance that we were talking about. And then also, on top of that, you kind of started with this, but what are you asking for, which bleeds into the definitions? Because if you're confused, I think the agent's gonna be confused, right?

SPEAKER_01 39:17

Yes. If you're confused, the agent is not really gonna necessarily make it any clearer. In fact, it might lead you to further wrong answers. So the more pinpointed you can be in your requests, in your definitions, in what you are asking for, the more clarity we are going to be able to give back to you. And honestly, it's more important for agentics, I think, and for these new processes, as you have the business users themselves starting to interact with the data more. But even prior to that, it's still helpful for an analytics team or folks who are just creating dashboards: having a really clear idea of what you're asking for, realizing that it's important. It's not just something you're gonna ask for once and toss away. It is something important for you to track over time, and having a good grasp on what those metrics are versus the things that I'm just gonna check once and then turn around and that's it.

SPEAKER_00 40:19

Yeah. You know, as we start to wrap up, I'd love to hear a little bit more on what you are excited about in the landscape of data, in the landscape of analytics and data hygiene. I actually think data, this is gonna sound weird, is having a renaissance. I think people are realizing the importance of this data hygiene at the foundation of what's going to power AI. I don't think I've ever had so many conversations about how, foundationally, your data has to be right. What is that saying? Crap in, crap out type of thing, right?

SPEAKER_01 40:59

Garbage in, garbage out. Still applies.

SPEAKER_00 41:01

There it is. Because you live in this world, what are you looking forward to? What are you thinking about? What are you excited about?

SPEAKER_01 41:08

I think the thing that excites me most is just people getting greater clarity on their data itself and really understanding their own needs and what they're asking for. It's really not easy, but it's really common to have organizations where data is very siloed, it's kind of all over the place. You have to really sift through it to find its value. I'm really looking forward to people putting effort into their data, into their governance, and actually having data provide the clarity and insight that it is fully capable of, without having to worry about whether or not those are the right metrics, the right information. I'm really looking forward to and excited about that level of confidence and clarity. It has historically been very difficult to get people excited about doing data definitions and data cleanliness. I still don't think people are going to be absolutely excited about it, but I think the need for it will at least convince people to do it. It's essential.

SPEAKER_00 42:36

We've covered so much today, a lot of tactical advice around how to get started with data cleanliness, especially if you're dealing with a legacy data set. And what it sounds like to me is that you have kind of a data governance council, maybe it's not necessarily called that, around what the definitions are and what role the semantic layer plays. This whole introduction of the doorman and where that fits into the stack, that's probably a cost-savings initiative on its own. Where are those conversations being held? And is there a group that you can recommend, so that other people can maybe adapt what works well at Datasite?

SPEAKER_01 43:22

Sure. So we do have a bit of data by committee, but it is a centralized committee. We have a data governance department. They are the ones who have counseled all of us on what data governance is and its importance. They actually have some of the best documentation on using their tools and finding the information in our data sets, in our metadata tool, for instance. They are great at putting that information together. We do also currently have a data management office. They lead and have essentially regular meetings for data folks in the org to contribute to conversations. Because I work in product R&D, a lot of the agentics that we're working on came from our team, but there are other folks working in different areas of the org and doing other experiments. So we do have that centralized area to share that information. In R&D itself, we have a concept of communities of practice at Datasite, which has been absolutely invaluable for sharing the research and insights and experiments that folks throughout the product organization are doing. We have communities of practice on data, on AI, on Python, on back-end engineering, where we are doing demos, sharing thoughts, asking for feedback, and it is cross-team, oftentimes cross-org, where anybody is welcome to join. And it is for the purpose of sharing information, sharing philosophies, and sharing methods of practice and determining how we want to carry things out. So those are kind of the mechanisms that we have used for collaboration that have been pretty effective.

SPEAKER_00 45:32

I think that's extremely helpful. I mean, I think there is the communities of practice model, which is people that are actually building: what's working, what's not working. And then when there's alignment and an agreed-upon standard, that can then be bubbled up into a more formalized adoption within the organization.

SPEAKER_01 45:51

I don't care what it does.

SPEAKER_00 45:53

Yeah. Well, Heather, this has been a fascinating conversation for me. I've learned a lot. I know that others will take away a lot from it. If there's one thing that you wanted to leave the audience with, or anything you wanted to share before we close, what would that be?

SPEAKER_01 46:10

I mean, it's definitely a period of experimentation and excitement. It can seem overwhelming. There's a lot happening all at the same time. Don't hesitate to experiment in a sandbox, not in prod, in a sandbox. Share the information, but don't forget about some of the core foundational data concepts, like your governance standards, like your cleanliness, like your data quality that are more important now than ever.

Prakash Chandran

Host