DataTopics: All Things Data, AI & Tech

#94 Agents Are Rising: Why Data Quality Matters More Than Ever

DataTopics


Trust collapses fast when a dashboard misleads or an AI agent learns from messy data. We dig into how data quality became business critical—and how to move from reactive fire drills to proactive systems—through real stories from clinical trials and large platforms where a single broken test could escalate to the C‑suite. With Stan and David, we map the shifts driving this moment: AI adoption, rising reliance on metrics, and the urgent need for shared definitions, lineage, and monitoring that let teams find root causes before customers feel the impact.

We get practical about agents that actually help. Instead of vague hype, we break down a low‑risk architecture for read‑only, metadata‑aware agents that handle repetitive, high‑leverage tasks: writing dbt documentation, proposing data tests, performing lineage‑driven root cause analysis, and auto‑drafting tickets with queries, diffs, and impact notes. We explain why integrated agents beat copy‑paste prompts, how to add guardrails that limit scope and permissions, and what human‑in‑the‑loop review should look like to build real trust without slowing the work.

Expect candid guidance on adoption and observability: two layers of visibility—agent behavior and data quality posture—help teams track costs, measure time to resolution, spot repeating incidents, and choose structural fixes. We also explore buy vs build as platforms begin embedding agent capabilities, and we share a clear starting path for any team: prioritize critical datasets, standardize KPIs and definitions, enable tests, and surface lineage so automation has the context it needs. By the end, you’ll have a blueprint to reduce firefighting, improve stakeholder confidence, and make your AI agents smarter by feeding them cleaner, governed data. If this resonates, follow the show, share with your data team, and leave a review with the one task you’d automate first.

SPEAKER_01

Hello and welcome to the Data Topics podcast. Today I'm joined by Stan and David. Yes, yes. Stan, could you introduce yourself, please?

SPEAKER_00

Sure. Nice intro, by the way. I've been working at Data Roots for about four years now. My role is team lead and competence lead on analytics engineering, which is similar to what I do on the client side. There the focus is more on the engineering part: data modeling, things around the end-to-end analytics flow, and of course data quality.

SPEAKER_01

Cool. We go way back. David, can you also introduce yourself?

SPEAKER_02

Sure. I'm David, and I've been working at Data Roots for two and a half years now as a data and cloud engineer. I've done a few projects already, including some data quality projects, so I'm curious about this one.

Why Data Quality Matters

SPEAKER_01

Already hinting at the topic of today. Indeed, we're going to talk about all things data quality, and also about how agents can accelerate the work being done in data quality. How did you first come to understand how important data quality is? Is there one specific moment or memory that comes to mind?

SPEAKER_00

Data quality is very relevant nowadays, but I do remember a project, a while back now, for a clinical trial company. Their data quality was absolutely crucial; from the start, having data quality issues was simply not an option. I remember we took that into the modeling phase as we started the migration towards Databricks and dbt. But if a quality issue arose, and it didn't necessarily have to do with the data itself, the whole company got involved, meaning it was not just two people escalating one-on-one to fix the issue, but really a high-level, company-wide resolution.

SPEAKER_01

So basically every data issue is business critical.

SPEAKER_00

Yeah, exactly. It was very, very crucial.

SPEAKER_01

David, can you think of a memory?

High‑Stakes Use Cases And Impact

Common Failures And Root Causes

SPEAKER_02

Yeah, I also have one, from a past client where I was working on their data platform. The job was basically to maintain it; someone had previously set it up, and for the most part the work was fixing the data quality issues that came up during my time there. I would always start from an issue noticed in a table and then try to find the root cause. If I had to guess, probably 85 to 90 percent of the time it led back to a developer building a feature on the platform. Data was continuously being ingested, so it always traced back to testers or developers creating data that was then ingested into the data platform but was never valid for the data tests. It kept causing issues, which of course means a lot of time and money wasted, because guardrails weren't in place for these ingestions.

SPEAKER_01

Maybe to add to that: I'm currently also working at a telco company as an AI product manager, and there governance is a huge topic, because it's important to have governance on these AI products. For data quality, we've put a lot of effort in over the past few months to have data quality checks in place, but also alerting and resolution mechanisms, and so on. So, since everyone is now looking at data quality, and we see it at multiple customers: why do you think data quality has become that important? It's always been important, obviously, but in the past one, two, three years it has clearly become even more so. Do you have reasons for that?

Rising Importance With AI Adoption

SPEAKER_00

Yeah, it's very much the case, definitely. I think it's in line with the rise of data and AI. If data grows from a nice-to-have, basically a commodity, into a must-have in a lot of processes and at a lot of clients, it also means people rely more on data in their day-to-day. If they want to make decisions data-driven, they need more trust in the data. If something is wrong in a dashboard, you get an embarrassed data team, but also quite a lot of impact on business-critical processes if it's not anticipated in time. The difference is that it used to be okay to look at data quality issues in a reactive way, coming to a solution together with the person using the data, as an additional check. But if you need day-to-day trust in your data, there has to be a proactive way to look at the data quality process itself.

SPEAKER_01

Interesting. So the adoption of AI has grown, and with it the need for trust. And trust is super important: to gain trust with stakeholders, we need reliable models and thus reliable data sources. Do you have something to add to that, David?

SPEAKER_02

Yeah. Why I think it's becoming so important recently is mainly the rise of agents. You can't go two weeks without OpenAI or Anthropic releasing a new model that can do all of these things even better. So of course a lot of companies want to use those and let the agents do some of the heavy lifting. But in a data context, the data you feed these agents is the only thing they know. If you give them data that's not correct or not up to your standards, you're not going to get the most out of them. So in that sense, data quality is really important if you want those agents to do really good work for you. That's why I think data quality is becoming far more important than it was in the last few years.

Typical Pain Points And Bottlenecks

SPEAKER_01

Yeah, thanks. Considering the different data quality and governance maturities across companies, what would you say are still some typical pain points that you see across companies and industries today?

SPEAKER_00

Well, I think especially as an analytics engineer you're often involved in these processes, and a lot is still being done manually. That in itself is not an issue, but it often becomes a bottleneck. You have to discover an issue, and that can be the product team or the creator of the dashboard, but then the process is still not focused on the solution: there is still a need to communicate with a specific person, it needs to be escalated, and it takes a lot of time, so the time to resolution is long. On the other hand, if there is nothing in place yet, if it's not a mature company in terms of data quality and governance, it makes sense to start with this manual process: focus on the most critical aspects, on alerting and monitoring, the kinds of insights you need to get started with data, and involve people as well, of course. But then there is value in also looking at how to automate this and avoid these bottlenecks in the process.

Prioritizing Critical Data And Ownership

SPEAKER_01

What I also see at the company I'm currently active at is that they already did an assessment of what the critical data is, because there are hundreds or even thousands of different data sources, all used in different applications, dashboards, AI models, and so on, for different business purposes. Within a business there are more and less critical processes and products. So there too, they started by identifying those, so that the data quality strategy has some sort of prioritization, because you just cannot do it all at once.

SPEAKER_00

Yeah, indeed, I very much agree with identifying those and focusing on the most important ones first. Thanks. David?

SPEAKER_02

Yeah, I think for these governance tools, one thing that's often overlooked is adoption within the team and the company. You can have all of these tools, and a lot of teams create a lot of data, but they never see how it's used, or how other people govern the data that they actually own. So adoption within teams, but also company-wide, is often overlooked, and there are a lot of very good features in these tools that teams then miss out on.

SPEAKER_01

So there's a need for visibility and ownership.

Tooling, Standardization, And Adoption

SPEAKER_00

Ben, you mentioned tooling. In an ideal scenario there is one central tool that everyone is using, but if another team logs their tickets in a different system, the communication bottleneck I mentioned becomes even more difficult, because you need to transfer that knowledge from one system to the other. You have to make sure the issue isn't scoped down and reactively resolved there, but that it becomes some kind of central source of truth.

Where Agents Add Real Value

SPEAKER_01

So standardization might also be helpful here and there, if it's feasible, of course. I already mentioned that we would talk about agents. Considering the pain points, but also the topics of the day within data quality, where would you say agents can make a difference?

SPEAKER_02

I feel like currently, where you can get the most out of agents is when you have a lot of repetitive tasks. Say, documenting a lot of your data; you mentioned dbt models. If you want to use those with agents, they need proper documentation, and that's a pretty repetitive task an agent could do for you or facilitate. Writing test cases to put guardrails on your data is also pretty repetitive, and an agent can help you with it. But also doing analysis of certain quality issues, where at the end you have a human in the loop who can verify it, but the agent has already done a lot of the processing for you, and afterwards it can maybe create a ticket after some verification.
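As a hedged illustration of that test-writing task: an agent can profile columns and propose dbt-style tests for a human to accept or reject. Everything below is a sketch in Python; `profile_column` is a hypothetical stand-in for a warehouse query.

```python
# Sketch: propose dbt-style tests from simple column profiling.
# `profile_column` is a hypothetical stand-in for a warehouse query.
def profile_column(table: str, column: str) -> dict:
    # hypothetical: compute these stats against the warehouse
    return {"null_fraction": 0.0, "distinct_ratio": 1.0}

def propose_tests(table: str, columns: list[str]) -> list[str]:
    """Suggest tests for a human to review; never auto-apply them."""
    suggestions = []
    for col in columns:
        stats = profile_column(table, col)
        if stats["null_fraction"] == 0.0:
            suggestions.append(f"{table}.{col}: not_null")
        if stats["distinct_ratio"] == 1.0:
            suggestions.append(f"{table}.{col}: unique")
    return suggestions

print(propose_tests("orders", ["order_id", "customer_id"]))
```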

Agents Vs Chatbots: Automation Depth

SPEAKER_01

Could you elaborate on that? You just mentioned documentation, and I'm thinking about using GenAI for it. What would be the difference between doing it manually with the help of tools like ChatGPT, Gemini, or Copilot, and letting an agent run it? Is there a big difference between the two?

SPEAKER_02

Well, I think there is quite a difference. You can still do it with tools like ChatGPT or Gemini, but you have to create all the files and format them yourself, plus all the other manual steps. With an agent integrated into your system, you can just say: hey, I have all of these dbt models, can you please generate documentation for me? And it will create those files, format them correctly, and place them in the right spot in your repository. There's no manual intervention from the user, while with those external tools you still have to go and create the files, format them, and so on. These agents can really do that for you, which is where most of the value comes from.
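To make that concrete, here's a minimal sketch of such an integrated documentation step. The `llm_complete` helper is a hypothetical stand-in for whatever model API you use, and the layout assumes dbt models live as `.sql` files under `models/`; it drafts a `schema.yml` per undocumented model for human review.

```python
# Sketch: draft dbt documentation stubs for undocumented models.
# `llm_complete` is a placeholder for your LLM provider's API call.
from pathlib import Path

def llm_complete(prompt: str) -> str:
    # placeholder: call your LLM provider here
    return "version: 2\nmodels: []  # draft, to be filled by the model\n"

def draft_docs(project_dir: str) -> None:
    for sql_file in Path(project_dir, "models").rglob("*.sql"):
        yml_file = sql_file.with_suffix(".yml")
        if yml_file.exists():
            continue  # already documented; a human wrote or approved it
        sql = sql_file.read_text()
        docs = llm_complete(
            "Write a one-paragraph description and per-column docs "
            f"for this dbt model, as dbt schema.yml:\n\n{sql}"
        )
        # Write a draft for human review rather than committing directly.
        yml_file.write_text(docs)
        print(f"Drafted docs for {sql_file.name} -> {yml_file.name}")
```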

Concrete Agent Architecture

SPEAKER_01

So you're even more efficient and it wouldn't make sense to still have those manual steps in between.

SPEAKER_00

And the setup David mentioned also allows you to curate the context you want the agent to have. You can limit it to the context it needs for this process, but also extend it with more tools, meaning that next to just looking at the models you have, it can also perform additional tasks: look at the data, look at previous things that happened, and make the documentation even richer, for example. So that adds to the value it actually creates.
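One way to picture that context limiting, as a sketch: the agent only ever calls tools from an explicit registry, so extending or shrinking its capabilities is a one-line change. All tool names here are hypothetical.

```python
# Sketch: scope an agent to an explicit allowlist of read-only tools.
# All tool implementations here are hypothetical illustrations.
from typing import Callable

TOOLS: dict[str, Callable[..., str]] = {}

def tool(fn: Callable[..., str]) -> Callable[..., str]:
    """Register a function as a tool the agent is allowed to call."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def get_model_sql(model_name: str) -> str:
    return f"-- hypothetical: SQL source of {model_name}"

@tool
def get_recent_incidents(table: str) -> str:
    return f"hypothetical: past quality incidents for {table}"

def call_tool(name: str, **kwargs) -> str:
    if name not in TOOLS:
        raise PermissionError(f"Tool '{name}' is outside the agent's scope")
    return TOOLS[name](**kwargs)

print(call_tool("get_model_sql", model_name="orders"))
```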

SPEAKER_01

I can already hear our listeners thinking: what do they mean by an agent, what are the components, how does it work? Could you give a very specific example of an agent setup, talking through the different components and the steps it takes?

Triggers, Cadence, And Triage

SPEAKER_02

Yeah, I can definitely do that. We have created a data quality agent. You can either start from a known issue, where you have already defined some test cases that are failing, or you have a table without a guardrail, so you don't know the issue yet. Say you have the test case created and you want the agent to do the analysis for you. It can start from the metadata of the test: the query that was added to the test, or the specific test case. Then it knows which table to start from. It starts on that table and checks whether there actually is an issue; you might simply have a wrong test case. Then it can follow the lineage and walk backwards through each step to see where the issue can be found. It can also check whether the schema changed in one of the nodes, or whether a code change caused the issue. So it can really go through the whole lineage, see where the issue could potentially be, and produce its own analysis: hey, I noticed that in table X you made this change, that's probably the cause of this issue. At the end the user still needs to check: okay, this seems valid, let me create a ticket or take some other action.
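A minimal sketch of that backwards lineage walk, assuming hypothetical lookups against your lineage graph, schema history, and git log; it collects evidence for a human rather than fixing anything.

```python
# Sketch: lineage-driven root cause analysis from a failing test.
# `upstream_of`, `schema_changed`, and `recent_code_change` are
# hypothetical stand-ins for your lineage graph / catalog / git history.
from collections import deque

def upstream_of(table: str) -> list[str]:
    return []  # hypothetical: query your lineage graph here

def schema_changed(table: str) -> bool:
    return False  # hypothetical: diff current vs. last known schema

def recent_code_change(table: str) -> str | None:
    return None  # hypothetical: check git history for the model

def root_cause(failing_table: str) -> list[str]:
    """Walk the lineage upstream, collecting suspicious changes."""
    findings, seen = [], {failing_table}
    queue = deque([failing_table])
    while queue:
        table = queue.popleft()
        if schema_changed(table):
            findings.append(f"{table}: schema changed recently")
        if (commit := recent_code_change(table)) is not None:
            findings.append(f"{table}: code change {commit}")
        for parent in upstream_of(table):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return findings  # evidence for a human to review, not an auto-fix

print(root_cause("orders"))
```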

SPEAKER_01

So that's one part of the full process, because afterwards there's still communication and so on. But for that first part, would you set it up to run every hour or every day, or does that depend on the use case?

SPEAKER_02

Yeah, I think that really depends on the use case. But you could also have, for example, an analytics engineer working on the data who notices that something shouldn't be there or is incorrect, and who then uses an interface to ask the agent to do the analysis for them.

SPEAKER_01

Then it's again a bit more reactive, let's say, because you had the intervention of a person; that would be the reactive side.

SPEAKER_02

You could also make it more proactive, where you set up a sort of pipeline that automatically checks: okay, these tests all failed, let me do some root cause analysis and then triage it to a team to say, hey, we noticed some issues.

SPEAKER_01

So the trigger would be one of those checks: if there is an alert, that's the trigger for the agent to start, and then it does everything you just mentioned.
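A hedged sketch of that alert-driven flow: failed checks trigger the analysis and a draft ticket, with a human doing the triage. All helper functions are hypothetical stand-ins.

```python
# Sketch: proactive trigger. When scheduled checks fail, run the
# analysis and draft a ticket for human triage. Helpers are hypothetical.
def fetch_failed_tests() -> list[dict]:
    # hypothetical: read failed test results from your DQ platform
    return [{"table": "orders", "test": "not_null_customer_id"}]

def analyze(table: str) -> list[str]:
    # hypothetical: e.g. the lineage walk from the earlier sketch
    return [f"{table}: upstream schema changed recently"]

def create_ticket(title: str, body: str) -> None:
    # stand-in for your ticketing system (Jira, Linear, ...)
    print(f"[DRAFT TICKET] {title}\n{body}")

def on_schedule() -> None:
    for failure in fetch_failed_tests():
        findings = analyze(failure["table"])
        create_ticket(
            title=f"Data quality incident on {failure['table']}",
            body=f"Failed test: {failure['test']}\nSuspected causes:\n"
                 + "\n".join(findings),
        )

on_schedule()
```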

Human‑In‑The‑Loop And Trust

SPEAKER_00

Yeah, I think it's also interesting to note that every use case we've discussed so far is basically very low risk. It has very little impact on the process itself; it doesn't make the intervention, it's just a way to accelerate the process. The root cause analysis, for example, basically looks into your lineage and gathers lots of information, but it also considers components you wouldn't focus on from the start. With access to different tools and the history of other issues, it can create an extensive report so you can go ahead and focus on the solution from the start.

SPEAKER_01

And so in an ideal scenario, you would then have a multi-agent setup with one root cause analysis agent, one communication agent, and so on, and these can then talk to each other, let's say, to fully automate that process.

SPEAKER_00

That would be awesome.

Observability And Guardrails

SPEAKER_01

Now another question arises: how much can you outsource these tasks? Say you build a root cause analysis agent, or a multi-agent system. Would you today already have enough trust in agents to run the process fully through them? How would you start, and where would you draw the line between automating and still having checks or doing things manually?

SPEAKER_00

Yeah, autonomously, no. I think it's always crucial to keep a human in the loop and to give that autonomy to agents semi-automatically, to help you in a certain way, in a specific process. It's also not capable of doing everything from the start. As we mentioned in the beginning, it's still relevant to start with the critical processes, to build the platform, create your own metadata about the processes, and get the documentation and the tests in place to actually monitor these things, because the agent needs that information to do what it does well. Then it becomes an accelerator that handles all of these processes in a better way, and that's when agents get interesting. But it's also up to the creator of the agent to add the kinds of steps that let people start trusting the process. Going from a manual process to an agent helping you with these steps, you still need insight into how it actually gets there. Observability, for example: you need to know what it's doing at some point to get to a solution, to gain people's trust.

SPEAKER_01

That's an interesting topic. How would you set up that observability and the transparency of how the system works?

SPEAKER_00

You can do it in lots of ways. For example, you add an intermediate step where the agent tells you what it did before it moves to the next step, and gives you insight: a piece of code, a query, the data it generated to base the next conclusion on. You can then validate what it actually does based on those steps, the evidence it used to come to a conclusion. That way the black box becomes far more transparent, and you learn how to operate with it.
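A minimal sketch of that intermediate-step visibility: the agent records each action together with its evidence (the query it ran, the conclusion it drew) so a reviewer can audit the trail. Real agent frameworks offer richer tracing; this only shows the idea.

```python
# Sketch: record each agent step with its evidence so a reviewer
# can audit how the agent reached a conclusion.
import json
import time
from dataclasses import dataclass, field, asdict

@dataclass
class Step:
    action: str      # e.g. "ran_query", "checked_schema"
    evidence: str    # the SQL, diff, or data sample used
    conclusion: str  # what the agent inferred from it
    ts: float = field(default_factory=time.time)

@dataclass
class Trace:
    steps: list[Step] = field(default_factory=list)

    def log(self, action: str, evidence: str, conclusion: str) -> None:
        self.steps.append(Step(action, evidence, conclusion))

    def dump(self) -> str:
        return json.dumps([asdict(s) for s in self.steps], indent=2)

# Usage: the agent logs before moving on; the human reads the dump.
trace = Trace()
trace.log("ran_query", "SELECT count(*) FROM orders WHERE id IS NULL",
          "42 null ids found, issue confirmed")
print(trace.dump())
```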

Cost, Metrics, And Visibility

SPEAKER_01

How would you evaluate these systems? If you're setting up the agent, when is it ready to go live? And how do you capture feedback?

SPEAKER_02

I think you can use the users, right? Let them answer: this was actually a pretty valuable answer, or: this was not valuable at all, this was completely different. You could have mechanisms in place for that, where users interact with the agent itself and say, okay, this was a good answer, and then maybe feed that back into the agent as well.
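As a sketch, that feedback loop can start very simply: store a rating per agent answer and route the low-rated ones to human review. The schema below is a hypothetical minimal version.

```python
# Sketch: capture per-answer user feedback so unhelpful responses
# can be reviewed and used to improve prompts or context.
import sqlite3

db = sqlite3.connect("agent_feedback.db")
db.execute("""CREATE TABLE IF NOT EXISTS feedback (
    answer_id TEXT, rating INTEGER, comment TEXT)""")

def record_feedback(answer_id: str, rating: int, comment: str = "") -> None:
    """rating: 1 = helpful, 0 = not helpful."""
    db.execute("INSERT INTO feedback VALUES (?, ?, ?)",
               (answer_id, rating, comment))
    db.commit()

def answers_to_review() -> list[tuple]:
    """Return unhelpful answers for human review."""
    return db.execute(
        "SELECT answer_id, comment FROM feedback WHERE rating = 0"
    ).fetchall()

record_feedback("rca-2024-001", 0, "blamed the wrong upstream table")
print(answers_to_review())
```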

SPEAKER_01

So it's something very human-centric, where over tens of cases you discuss with the consumer of the agent to verify whether it's working as expected or not.

Future Of Data Quality And GenBI

SPEAKER_00

Yeah, I think so. Basically every off-the-shelf model from the hyperscalers works as expected, and the quality is increasing very quickly, but it still needs to work well in your business context. You need people to actually use it and give that feedback back to you, to tailor it to your specific context. I think that's where your edge or advantage as a company lies: using that interaction to build the process the way you want it to be. That also makes you agile in the face of changing business processes and changing technology; it operates as an iterative flow, not as a static AI that keeps working the way it did in the beginning. That will never be the case.

SPEAKER_01

Yeah, I agree. I think the key, when you go from agent development to production, is that you have this validation in place and do the testing before putting it into production. Would you also consider guardrails?

Buy Vs Build And Tool Roadmaps

SPEAKER_00

And if yes, which ones? Yeah, very much. To give you an extreme example: if you allow the agent to do any operation, it can also just drop every table in your company. That's something you want to avoid, so you definitely put guardrails in place and limit the context to what it actually needs. For example, if it just needs information from your database, it can do that with read-only access, limited to what it needs to get to the next phase. So it's up to the developer to put this in place, yes.
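A minimal sketch of that read-only guardrail: reject anything that isn't a single SELECT before it ever reaches the database. In practice you would enforce this at the database-role level as well; the application-level check below is just defense in depth.

```python
# Sketch: allow the agent to read, never to mutate. Combine this
# application-level check with a read-only database role.
import re

FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|truncate|grant|create)\b",
    re.IGNORECASE,
)

def guard_query(sql: str) -> str:
    stripped = sql.strip().rstrip(";")
    if not stripped.lower().startswith("select"):
        raise PermissionError("Agent may only run SELECT statements")
    if FORBIDDEN.search(stripped):
        raise PermissionError(f"Blocked potentially mutating SQL: {sql!r}")
    if ";" in stripped:
        raise PermissionError("Multiple statements are not allowed")
    return stripped

# Usage: every agent-issued query passes through the guard first.
safe_sql = guard_query("SELECT count(*) FROM orders WHERE id IS NULL")
print(safe_sql)
```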

SPEAKER_01

Okay. Anything else to add on the architecture of the agent, or the way of working around it, the operating model?

SPEAKER_02

No, not really, I think we covered a lot of it. Maybe on observability: I think it's really important that in these cases you don't let the agents start solving a lot of issues themselves, because that can really spiral into things like changing your data, manipulating the data in some way. I don't think you want that for, let's say, a quality agent. That can spiral really fast into messing up your database or a lot of other things. So in these cases, the read access Stan mentioned is already a good guardrail for those things.

Getting Started: First Steps

SPEAKER_01

Let's broaden the discussion on observability, because it's sparking my interest a bit. There's maybe also a risk in letting the agent handle every data quality incident: you lose visibility on what's happening from a data quality point of view. What's your reasoning on that?

SPEAKER_02

Yeah, for sure. You don't want it running for some amount of time without a view of what it's actually doing, resolving, or querying all the time. That also leads to cost: you want to see what it is costing you, but also what it is resolving or looking at. So you can add a layer where you have a view on what it's solving, and if it's been running for a while, go back and see: okay, it solved very similar questions or problems, which can lead to, okay, maybe we should really look into these and find structural solutions. But cost monitoring too, for sure. If you have an agent running for some time, you really want to see what it's actually doing and how much it's costing you. So there are layers of observability you want to build into these kinds of systems.
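A hedged sketch of that second observability layer: aggregate what the agent resolved, how often the same issue repeats, and what it cost. The field names are hypothetical examples.

```python
# Sketch: aggregate agent activity into cost and incident metrics so
# repeating problems surface as candidates for structural fixes.
from collections import Counter
from dataclasses import dataclass

@dataclass
class Incident:
    table: str
    issue_type: str        # e.g. "null_ids", "schema_drift"
    minutes_to_resolve: float
    llm_cost_usd: float    # hypothetical per-incident token cost

def summarize(incidents: list[Incident]) -> None:
    repeats = Counter((i.table, i.issue_type) for i in incidents)
    total_cost = sum(i.llm_cost_usd for i in incidents)
    avg_ttr = sum(i.minutes_to_resolve for i in incidents) / len(incidents)
    print(f"total LLM cost: ${total_cost:.2f}, "
          f"avg time-to-resolve: {avg_ttr:.0f} min")
    for (table, issue), n in repeats.most_common():
        if n > 1:  # a repeating incident hints at a structural fix
            print(f"{table} / {issue}: {n} occurrences, consider a fix")

summarize([
    Incident("orders", "null_ids", 32, 0.41),
    Incident("orders", "null_ids", 18, 0.27),
    Incident("customers", "schema_drift", 55, 0.63),
])
```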

Metadata, Definitions, And Governance

SPEAKER_01

Yeah, if I can summarize: you have two layers of observability. One is at the agent level, where you want to know how it reasons and why it takes certain actions, and then there's a second layer where you obviously want an overview of which data quality issues are happening. That's a bit related to GenBI. Maybe it's a good bridge to what data quality will look like in the coming three years. What are we expecting in terms of evolution and new innovations in the field?

Closing Thoughts And Next Steps

SPEAKER_00

Yeah, we've talked a lot about the continuous aspect of handling and monitoring data quality. If we increase that, and the semi-autonomous aspect of the agent increases as well, it would help in this kind of situation, because GenBI basically goes from data to solution very fast, with very little intervention from data people to look at possible data quality issues. So there it becomes even more important to look at how it happened, how it came to a solution, and to iterate on it, improve and fine-tune the models to get to different solutions in the future. Because next to quality, it's also about how you give context to the data: you're no longer creating the dashboard based on how you want to model the data for certain user questions or topics. You basically have to give access to a part of the data that is able to answer the question, but also not able to give a wrong answer to the question. So there should be some kind of unification of definitions, KPIs, everything. There is additional risk to this as well.

SPEAKER_02

Yeah, if we look at the data quality part of this, I think a lot of systems, tools, and platforms in the coming years will have these things built in, like these agentic tools, certainly platforms which already have access to your data or at least its metadata. There are also a lot of tools coming out where you just put in some config and they do a lot of data quality, or data monitoring in general, for you. So I do think this will grow in the coming years.

SPEAKER_01

Do you have an example, a specific tool roadmap in mind?

SPEAKER_02

There is a tool, I think it's mainly a CLI tool, called Datus. You connect it to an LLM and it can query databases for you; it's also a bit conversational, but I think mostly on the CLI side. But also managed governance platforms: I think they will start implementing these kinds of agents, since they already have the connection details and metadata. It's just a matter of time before they start doing all of these things too.

SPEAKER_01

And so it's becoming a bit of a buy-versus-build decision too. Today there's not much on the market yet, so you would advise companies to start building an agent, explore the capabilities, test it in a safe environment, and then roll it out into production. Going back to today, to make it a bit easier for the listeners: say you've heard this podcast and have the ambition to start developing an agent and experimenting with it. What would be step one?

SPEAKER_00

Step one would still be to make sure you have the information ready to give to your agent, meaning again the context, the metadata it needs to come to good results. If you have that in place, if you have the fundamentals, the platform, and the people involved, then it depends on your own projects and stack. More and more tools are jumping on the hype and offering all kinds of AI solutions, with suggestions and conversational interaction with the tool to learn more about the data or about data quality. But at the same time, if you have all the fundamentals in place, if you have the platform where you can monitor and see the results of all these quality issues, it's not that big a step to build an interface on top of it, interact with your data, and build these guardrails yourself. Because, like you said, you have the observability aspect, which gives you insight into how it's operating, but also gives you the edge to fine-tune it and make it work even better in your environment. So it depends on your situation.

SPEAKER_01

To make the concept of metadata a bit more tangible, because we know what metadata is, but could you give some examples in this specific case?

SPEAKER_00

Yeah, metadata is basically just information about your data. To be very specific, it means you need the right business definitions: if every department defines a specific term in a different way, it's very hard for the agent to know which one to pick. It needs the business context of how the processes run, what to take into account, what to focus on. At the same time, it also needs to know what is happening and whether it's happening in the right way. If there are no tests at all, it's difficult for an agent to anticipate quality issues.
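To sketch what handing that metadata to an agent can look like: an explicit context bundle of shared definitions, tests, and lineage, rather than open-ended database access. All content below is illustrative.

```python
# Sketch: bundle business definitions, tests, and lineage into the
# explicit context an agent receives. All values are illustrative.
import json

context = {
    "definitions": {
        # shared, company-wide definitions so the agent picks the right one
        "active_customer": "placed >= 1 order in the last 90 days",
        "revenue": "gross sales minus refunds, in EUR, excl. VAT",
    },
    "tests": [
        {"table": "orders", "test": "not_null", "column": "customer_id"},
        {"table": "orders", "test": "unique", "column": "order_id"},
    ],
    "lineage": {"orders": ["raw_orders", "raw_customers"]},
}

prompt = (
    "Use ONLY the metadata below to reason about data quality.\n"
    + json.dumps(context, indent=2)
)
print(prompt)
```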

SPEAKER_01

Okay, interesting. So governance remains key in the story. And I assume you can also run agents on governance; maybe that's a topic for a next episode. I assume people can also reach out to both of you to discuss potential next steps. Thank you both very much for being here today and sharing some very interesting insights. I'm very much looking forward to those new tools on the market and their new features. So let's wrap up this session, and talk to you soon.
