The Macro AI Podcast
Welcome to "The Macro AI Podcast" - we are your guides through the transformative world of artificial intelligence.
In each episode - we'll explore how AI is reshaping the business landscape, from startups to Fortune 500 companies. Whether you're a seasoned executive, an entrepreneur, or just curious about how AI can supercharge your business, you'll discover actionable insights, hear from industry pioneers and service providers, and learn practical strategies to stay ahead of the curve.
Dirty Data, Big Losses: Unlocking AI Success with Clean Data
In this Macro AI Podcast episode, hosts Gary and Scott explore why clean data is key to AI success, offering practical tips for business leaders and tech details for enthusiasts. They cover real-world examples, the cleaning process, and trends like synthetic data, equipping listeners to implement effective strategies.
Why Clean Data Matters
Clean data underpins reliable AI. Gary and Scott share examples: a retailer mismanaging inventory due to inconsistent location data, or healthcare errors from duplicate records. Dirty data, whether missing or inconsistent, leads to poor AI predictions and financial losses. For tech listeners, they explain how it disrupts machine learning, affecting loss functions and model convergence, making quality data essential.
The Data Cleaning Process and Industry
The hosts outline a five-phase cleaning process: analyzing data, defining rules, verifying, transforming, and integrating. Big data’s volume and variety complicate this, but tools like Cleanix (parallel processing) and HoloClean (probabilistic inference) help. The data cleaning industry—engineers, scientists, and firms—is critical, with 40-50% of data budgets spent here. Leaders are urged to prioritize governance for quality.
HoloClean: http://www.holoclean.io/
Synthetic Data and Conclusion
Synthetic data is highlighted as a fix when real data is limited, like simulating sensor data for self-driving cars. The episode wraps up stressing clean data’s role in AI success, offering steps to achieve it—invest in tools, talent, and explore synthetic options—making it a must-listen for leveraging AI.
Send a Text to the AI Guides on the show!
About your AI Guides
Gary Sloper
https://www.linkedin.com/in/gsloper/
Scott Bryan
https://www.linkedin.com/in/scottjbryan/
Macro AI Website:
https://www.macroaipodcast.com/
Macro AI LinkedIn Page:
https://www.linkedin.com/company/macro-ai-podcast/
Gary's Free AI Readiness Assessment:
https://macronetservices.com/events/the-comprehensive-guide-to-ai-readiness
Scott's Content & Blog
https://www.macronomics.ai/blog
00:00
Welcome to the Macro AI Podcast, where your expert guides Gary Sloper and Scott Bryan navigate the ever-evolving world of artificial intelligence. Step into the future with us as we uncover how AI is revolutionizing the global business landscape from nimble startups to Fortune 500 giants. Whether you're a seasoned executive, an ambitious entrepreneur,
00:27
or simply eager to harness AI's potential, we've got you covered. Expect actionable insights, conversations with industry trailblazers and service providers, and proven strategies to keep you ahead in a world being shaped rapidly by innovation. Gary and Scott are here to decode the complexities of AI and to bring forward ideas that can transform cutting-edge technology into real-world business success.
00:57
So join us, let's explore, learn and lead together. Welcome to the AI Podcast. I'm Gary Sloper, joined as always by my co-host, Scott Bryan. We're here to help business leaders like you navigate the rapidly evolving world of AI, giving you the tools and insights to lead this era and compete on the global stage. That's right, Gary. So whether you're an executive looking for practical strategies or a tech enthusiast craving some nitty gritty details, we've got what you're looking for.
01:26
Today, we're diving into a topic that doesn't always get the spotlight it deserves, clean data. It's the foundation of every successful AI project. And trust me, if your data is a mess, your AI project is going to go nowhere fast. Absolutely, Scott. We're going to dig into why clean data is critical. We'll sprinkle in some real world examples and even explore the booming industry of data cleaners and cutting edge ideas like synthetic data. By the end, you'll know why this matters and how to make it happen for your business. So let's get started.
01:57
Sounds good. So let's set the stage. Imagine you're a retailer, like say Walmart, using AI to predict inventory needs. And you've got data pouring in from suppliers, sales, and customer trends. But what if half your supplier records list NY and the other half say New York, or worse, some entries are even blank? Your AI is going to choke on that mess and you'll end up with some erroneous results like
02:27
overstocking flip-flops in December in upstate New York. Exactly. So dirty data, it's inconsistent, it's incomplete. Sometimes it's just plain wrong. It can tank your AI's performance. So we talked about this in a prior episode, garbage in, garbage out. There was a, I don't know if you saw this, but there was a 2021 survey from Forbes. So a couple of years back, it said data scientists spend
02:55
I think it was up to 60% of their time cleaning and organizing data. And it's not because they love it, it's because they have to. So the bad data leads to bad predictions. And for business, that means lost revenue. I also read a study from PwC in which they found back in 2001, so almost feels like eons ago, that 75% of companies surveyed suffered losses due to data quality issues. That's still true today.
03:23
If not more so with the explosion of big data. Yeah. It's not just retail. It's, you know, across industries. So in healthcare, dirty data can be a life or death issue and expensive. I reviewed a case a little while back where a hospital's AI system flagged the wrong patients for urgent care because of duplicate records and typos and medical histories. So clean data isn't just a nice to have. It's, it's often mission critical. Yeah.
03:53
Spot on. So, why is this such a big deal for AI? At its core, AI learns patterns from your data. So if you feed it garbage, like missing values or inconsistent formats, it's like teaching your kid math with a textbook full of typos. So the output's unreliable and your business decisions suffer in the end. For our technical listeners, think about training a machine learning model.
04:19
like a logistic regression or a neural network. If your features are noisy or misaligned, your loss function's going to skyrocket and convergence, forget about it. Additionally, you must be able to answer where your data came from and what sets were used. Plus, you really need to answer whether the data was secure. Obvious reasons include outside vulnerabilities. We know that. However, what a lot of organizations overlook is: was it secured internally against modifications, especially if others had access to it
04:49
or modified it for their own AI project or something else. So if that data has been manipulated, how do you know that it's accurate? Yeah, no, I agree. So imagine you're working with a data set for a customer churn model and you've got a column for subscription status, but 20% of the entries are null. Another, say, 10% say active in five different ways. So active or A or act, you name it.
05:19
Your model accuracy is going to tank because it can't generalize. So you need to pre-process that data, impute the missing values, standardize terms before it's usable. So that's kind of data cleaning in a nutshell.
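Editor's sketch of the pre-processing step just described, using the subscription status example. The label mapping, the "inactive" variants, and the choice of mode imputation (filling gaps with the most common label) are all assumptions for illustration; the hosts do not specify a method.

```python
from collections import Counter

# Hypothetical mapping from raw spellings to canonical labels (illustrative only).
CANONICAL = {
    "active": "active", "a": "active", "act": "active",
    "inactive": "inactive", "i": "inactive",
}

def standardize(value):
    """Return the canonical label, or None if the value is missing or unrecognized."""
    if value is None:
        return None
    return CANONICAL.get(value.strip().lower())

def clean_status(values):
    """Standardize labels, then fill gaps with the most common label (mode imputation)."""
    standardized = [standardize(v) for v in values]
    known = [v for v in standardized if v is not None]
    if not known:
        return standardized  # nothing to impute from
    mode = Counter(known).most_common(1)[0][0]
    return [v if v is not None else mode for v in standardized]

raw = ["Active", "A", None, "act", "inactive", " ACTIVE ", None]
cleaned = clean_status(raw)
print(cleaned)
```

Mode imputation is the simplest option and can hide real signal in a churn model; in practice a separate "unknown" category or a model-based imputer may be a better fit.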
05:34
Now let's talk about how this cleaning happens. It's not magic. It's a process, a process that can be broken down really into five phases. Analyze the data to spot errors, define rules to fix them, verify with experts, transform the data, and then integrate it back into your system. So for big data though, this gets tricky. The volume, variety, velocity, veracity, value, the famous five V's, make traditional methods really too slow or clunky.
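Editor's sketch of the five phases as plain functions, applied to the NY / New York supplier example from earlier in the episode. Every function name, the rule table, and the "expert approval" set are hypothetical; this is a toy outline of the flow, not how any particular tool works.

```python
def analyze(records, field):
    """Phase 1: profile a field by counting each distinct raw value."""
    counts = {}
    for record in records:
        value = record.get(field)
        counts[value] = counts.get(value, 0) + 1
    return counts

# Phase 2: define rules (hypothetical mapping from variants to a canonical value).
RULES = {"NY": "New York", "N.Y.": "New York"}

def verify(rules, approved_targets):
    """Phase 3: keep only rules whose target value a domain expert has approved."""
    return {src: dst for src, dst in rules.items() if dst in approved_targets}

def transform(records, field, rules):
    """Phase 4: apply the verified rules, returning new records."""
    return [{**r, field: rules.get(r.get(field), r.get(field))} for r in records]

def integrate(store, records, key):
    """Phase 5: write the cleaned records back into the system of record."""
    for record in records:
        store[record[key]] = record
    return store

suppliers = [
    {"id": 1, "state": "NY"},
    {"id": 2, "state": "New York"},
    {"id": 3, "state": None},
]
profile = analyze(suppliers, "state")
verified_rules = verify(RULES, approved_targets={"New York"})
store = integrate({}, transform(suppliers, "state", verified_rules), key="id")
print(profile)
print(store)
```

Note that the blank entry for supplier 3 survives untouched: the rules only cover known variants, so missing values still need their own handling.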
06:04
Right. Right. That's where some of the modern tools come in. So take something like Cleanix, and I'm not talking about the hand sanitizer or the cleaning spray. It's a parallel processing system designed for big data. So it tackles issues like duplicates and missing values across massive datasets, all while running on multiple machines. Or there's HoloClean, which uses probabilistic inference to guess what's dirty data
06:32
and what dirty data should look like based on patterns. Right, and don't forget about the human element. Tools like Katara blend knowledge bases with crowdsourcing to validate fixes. So think of it this way, an e-commerce company like Amazon, they're cleaning their product listings often. The AI flags a blue widget that might be listed as a red widget elsewhere. So Katara checks a knowledge base and then asks a human to confirm.
07:00
It's slow for huge data sets, but it definitely boosts accuracy where it counts in the end. Yep. That's a good example. So here's the curve ball. What if you don't have enough data to clean? So that's where we look at synthetic data. So companies like Nvidia are generating artificial data sets to train AI where real-world data might be sparse or messy. So think of self-driving cars. So simulating
07:28
sensor data to test algorithms without crashing real cars. So that's like driving a car in a video game with random obstacles. It makes me think of Spy Hunter. But it's not a game. It's not a silver bullet, but it's a game changer when clean data is hard to come by or data privacy is a major concern. I now have the Spy Hunter theme song in the back of my head. I'll probably have that all day. Thanks. Thanks for that.
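Editor's sketch of the synthetic sensor data idea: a toy generator for forward range-sensor readings with random obstacles and Gaussian noise. The function name, parameters, and noise model are assumptions made for illustration and do not represent how Nvidia or any production simulator generates data.

```python
import random

def simulate_range_sensor(n_steps, max_range_m=50.0, obstacle_prob=0.1,
                          noise_std_m=0.2, seed=0):
    """Generate synthetic distance readings; a fixed seed makes runs repeatable."""
    rng = random.Random(seed)
    readings = []
    for step in range(n_steps):
        # With some probability an obstacle appears at a random distance ahead;
        # otherwise the sensor sees nothing and reports its maximum range.
        obstacle = rng.random() < obstacle_prob
        true_distance = rng.uniform(1.0, max_range_m) if obstacle else max_range_m
        # Add measurement noise, then clamp to the physically valid range.
        measured = true_distance + rng.gauss(0.0, noise_std_m)
        measured = min(max(measured, 0.0), max_range_m)
        readings.append({
            "step": step,
            "obstacle": obstacle,
            "distance_m": round(measured, 2),
        })
    return readings

synthetic = simulate_range_sensor(n_steps=1000, seed=42)
print(synthetic[:3])
```

Because the ground truth (the `obstacle` flag) is known for every row, data like this comes pre-labeled, which is part of the appeal when real labeled data is scarce.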
07:57
So who's making all this happen? There's a whole industry of data cleaners out there, data engineers, scientists. There's even specialized firms. We have some great ones here in the Boston area for sure. You've come across a few of those. So if I remember reading about this on some of the estimates, businesses spend around 40 to 50% of their data budgets on cleaning. It's really tedious, but it's definitely gold
08:26
in terms of the return. There's platforms like Trifacta and Talend, they automate this, but really, for more complex cases, you need the professionals who know the domain very well, especially like a healthcare expert spotting errors in patient records that an algorithm might miss, similar to your use case you just mentioned. Yeah. And for the executives listening, here's the leadership angle. So prioritize data
08:55
quality from the top. Set up governance policies, define who owns the data, how it's cleaned and how often, and invest in the tools and the talent. Obviously a key area for investment to make sure that you're building upon clean data for your AI moving forward. And doing some research for this episode, I came across a 2021 survey, again, a couple of years old, in which the team noted that dirty data wastes 27% of revenue. So it's just a lot of action taken on
09:25
dirty data, which is a big waste. And that's, yep. So that's an upstream stat that'll wake up any C-suite leader. So clean data, it's not just a tech problem. It's a business imperative. Yeah, totally agree. So if we wrap up with a real world example, let me think.
09:46
Uh, how about Netflix? So their recommendation engine is, it's pretty legendary, right? It's built on clean structured data. I believe, at least from what I've heard, it has user ratings, watch history, even metadata about shows. If that data was a mess, you'd get Teletubbies recommended after Breaking Bad, right? Their investment. Yeah, I would be, you'd be like, what?
10:16
I don't think this is relevant to me. Their investment in data quality drives billions in viewer engagement. Seems to be working well in my house with all the subscriptions. Yeah, you know, I just heard Bryan Cranston from Breaking Bad is coming out with a new movie. He's going to be in Lone Wolf. I think that's coming out this summer. They just finished recording it. Lone Wolf. Yep. I'll write that down. Yeah. So back to it. So the takeaway. So clean data is going to power AI success.
10:45
So for the techies, just dive into the tools like AlphaClean, CPClean, and there's a bunch more out there. Just keep searching and check out how they optimize pipelines or handle missingness, which is really just simply the absence of data. So for leaders, build a culture of data quality. Some people love it, some don't, but it's the backbone of competing globally and competing efficiently. Yeah. I couldn't have said it better. I mean, you're, you're exactly...
11:14
I think you have to try some of these tools. It's controlling your learning curve 101. We've all been doing it since we got in the industry. This is the next wave of technology. So I think it's imperative. So Scott, I will leave you with a question which could tease a future episode. I'm ready. Since we're out of time today. What do you think? Is that okay? Go for it. All right. Here it is. So do organizations...
11:42
have the responsibility to their customers to be forthcoming on the transparency of the source data they're using in these models? Yep, that's one to chew on. Yeah, I agree. All right, so that's all we have for today on the Macro AI podcast. If you enjoyed this, subscribe, share it with your team, give us a like and subscribe, and join us next time as we explore more cutting edge AI solutions. Until then, keep leading, keep learning, and keep your data clean.
12:13
See you