The Dashboard Effect

Data Warehouse vs. Data Lake - Which do you need?

October 28, 2022 Brick Thompson, Jon Thompson, Caleb Ochs Episode 50
The Dashboard Effect
Data Warehouse vs. Data Lake - Which do you need?
Show Notes Transcript

Click here to watch this episode on our YouTube channel.

In this episode, Brick and Caleb discuss the differences between a data warehouse and a data lake and what each is best suited for. For mid-market companies, in many cases both could have an important role in an overall BI strategy and implementation.

Blue Margin helps private equity owned and mid-market companies organize their data into dashboards to execute on strategy and create a culture of accountability. We call it The Dashboard Effect, the title of our book and podcast

 Visit Blue Margin's library of additional BI resources here.

For a free, downloadable copy of our book, The Dashboard Effect, click here, or buy a hardcopy or Kindle version on Amazon.

#BI #businessintelligence #datawarehouse #datalake

Brick Thompson:

Welcome to The Dashboard Effect Podcast. I'm Brick Thompson.

Caleb Ochs:

And I'm Caleb Ochs.

Brick Thompson:

How's it going, Caleb?

Caleb Ochs:

Pretty good.

Brick Thompson:

So, today we're going to be talking about a topic we covered a few weeks ago, maybe a month and a half ago. A little bit on the why you would use a data lake and a data warehouse. And I thought it might be useful for our listeners to talk just more generally about what the difference is between a data lake and a data warehouse? And where do you use each one? So that's our topic for today.

Caleb Ochs:

Yeah, I think this will be interesting. You know, back like five years ago, was when data lakes really started to come on the scene started to become pretty popular. And I had that question. I was like, "Why would I do this? Like, why use this data lake thing, instead of just sticking this data in a database? I don't understand." So, recently we think we found like, "This is it. This is this is the reason why you'd use a data lake." And now I think how we're viewing these two things is this really nice clear separation of function between what the data lakes purpose is and what the data warehouse purpose is. I think it's important for people to understand that if you conflate those twoyou could set your self up for some trouble.

Brick Thompson:

Right, each one is good at a different thing. And I would say most midsize or larger businesses should consider having both, these days. It sort of depends on what they're doing. Maybe we can talk about that first, and then go back to specifics.

Caleb Ochs:

Sure

Brick Thompson:

Okay, so what are the use cases? What's the data warehouse you use for? Just high level versus the data lake? And how would you benefit from running both?

Caleb Ochs:

Yeah, maybe what might be helpful is is just a theoretical example. So if I was just running a company, you can think of like our company share margin. We don't have tons of data, we have a few data sources. You know, it's important data, but they're known sources. We know what they are, we know what's in them, we know what they're used for, we know how to get the data out of them that we need to. It's all just known, and we're not going to go acquire five companies tomorrow. So we have no use for a data lake. We don't want we don't need to just dump our data somewhere, right? We know what reports we need to run, we know what data we need, and we'll just pull that into a database, into our data warehouse. So there's no reason that you need a data lake for that. Our business, from a data source perspective, is pretty stable.

Brick Thompson:

We know the data sources, there aren't that many, it's not huge data. So a traditional data warehouse serves us well.

Caleb Ochs:

Exactly right. Now, let's say that you are running an e-commerce website, and you have tons and tons of log data from your from your website, you don't really know what you're going to use that for yet, but you probably should keep it. Perfect use case for a data lake. Or maybe you are doing a bunch of acquisitions, and you don't have a plan for using that data right away. But you do need it as you pull companies onto your main ERP, or maybe you leave them there, but you don't really know what's in those data sources, and you don't have the time to quite figure it out yet. Pull all that stuff into a data lake, make it easy and in one spot, so that you can use that data down the road.

Brick Thompson:

Yeah, maybe it's got unstructured data, or semi-unstructured, semi-structured. Maybe you've got a lot of streaming data, you think it's going to be useful, but you're not sure what to do with it yet. That's a perfect place to put data. Data lakes are a perfect place to put that kind of data so that you can start doing experiments and analysis and figure out how you might use that data. And at some point it might make sense to put that into the data warehouse. We'll talk about why. Or it might make sense to leave it there, apply a schema on top of that data lake and just report on it from there.

Caleb Ochs:

Right. And as we're talking, it really does seem like it all comes down to if you know what you're going to do with the data. You don't need a data lake, more than likely, there's some nuances, which I think we'll get into, but if you know the purpose for it, and you know the data, you don't need a data lake.

Brick Thompson:

Yeah. Well, let's let's just talk about, we'll keep this sort of conceptual level, but a little bit of technical, what's the difference between a data warehouse and a data lake?

Caleb Ochs:

So your data warehouse is going to be built on like a relational database. You're gonna have facts, you're gonna have dimensions. It's a denormalized, which means combining tables, getting it into a format that a reporting tool like Power BI can easily consume and build some good reports off of.

Brick Thompson:

Exactly. And then a data lake is more like, I think you gave a great example inother podcasts we touched on, where you can almost think of it like Windows Explorer on your computer. You can have file folders and put anything you want to in them. You probably want to have more of a plan than that in your data lake, but you can literally put whatever kind of digital data you want into a data lake and then figure out how you're going to use it later.

Caleb Ochs:

Right. I'm glad you brought that example up again. It's making me remember when I was an analyst, I used to do that. I used to build a data lake on my own computer, and it was laid out a lot like how we build data lakes now where you've got like, "Here's your topic, area, inventory, or whatever." And then you go in, and there's a bunch of folders with dates on them. And then it's got the version of data for that date that I might have to go back and look at. That's exactly what a data lake is, like almost to a T. There's some nuances, you get more complicated than that. But most of them that you run into are going to be structured very similar.

Brick Thompson:

Caleb was a genius.

Caleb Ochs:

I guess I was ahead of my time.

Brick Thompson:

That's great. So data lake is different in that it's unstructured. It's not normalized or de-normalized, you can put schemas on top of it, it can take any kind of data. One of the things is that data storage is generally less expensive in a data lake. You've got blob storage. It may not matter unless you have massive amounts of data. I also think another differentiator is, the data warehouse tends to be used more by the general business user and business analyst, and a data lake more BI data scientists and specialized analysts who are trying to tease information out of this unstructured data, or maybe trying to figure out how to get that data into the data warehouse. So the more business oriented folks can use it.

Caleb Ochs:

Yeah, I think thats a good way to think of it. The data warehouses is trying to simplify the data to make it easy to consume, right, for the business to consume. And the data lake is just raw stuff.

Brick Thompson:

Yeah, you gotta be willing to adventure into the jungle a little bit.

Caleb Ochs:

Exactly.

Brick Thompson:

All right. So advantages of the data warehouse was sort of covered. It's really good for getting the answers for predictable known areas. I mean, it's been built to support certain types of reporting, generally, for business users. Maybe for all the business users in a company, that type of thing. You don't have to do data prep when you want to report on it. A lot of times, it'll have a data cube attached to it that already has measures and KPIs. So you've got good governance. Making sure people are all, for example, when they say revenue, it's the same thing. It uses the same measure every time. It's just good for governance in general, because you're doing a lot of processing on the way in and your ETL. So you're making sure the data is clean, which tends to make it more consistent, more trustworthy. You know, the big downfall of BI systems generally is just user adoption, and adoption takes a hit whenever there's a trustworthiness question. And so it really becomes sort of the single source of truth for a company. Those are the advantages I see.

Caleb Ochs:

Yeah, I always think of the data warehouse as, they can serve other purposes, but really its main job is to support the key pillar type reports for the business. Metrics that you just need to run your business, that foundational stuff. You know, there might be other tangential things, like you can do some data exploration and stuff, to some extent there too. But you know, that I would see that as the main main purpose.

Brick Thompson:

Okay. And so then the main advantages of the data lake, I think, is you're not quite so constrained. It's super flexible, you don't have to process data a bunch before putting it in there, you can just stuff it in there. It's highly accessible. So people who have access to it can just get in there and start doing their exploration. It's fast update, you don't have to run a big ETL. In fact, you can just have data streaming into it, you're not doing any processing on it at all. It's great for storing massive amounts of data. Although you can do that in a traditional data warehouse too, it's easy to add new data sources to it. You just add it, whereas with your traditional data warehouse, you're gonna have to really think about, "How do we process the data and transform it, so it conforms to the data model we've got, and supports that same pillar?" I like that term, the pillar reporting for the business. A data lake too, you can put a schema on top of it if you have good metadata on the data in the data lake itself. And there's various tools to help with that. And actually, it's kind of nice, because you can play with imposing different schemas on the data lake and seeing how those get you the results you want or not. And then maybe have that guide how you might put that data into the data warehouse or not. You know, certain certain use cases you don't want to at all. Anything else?

Caleb Ochs:

I think you covered a lot of it. I think most of what makes the data lake easy to write stuff to and that type of thing is that it is not meant for that business user. You're not trying to make it easy to consume. So you do what you can in terms of putting it into folders and stuff, like my earlier example, where you might have a date and you're in a topic area. And then you might name the files a certain way. So you know what it is, but you're not going to say, "I want to make sure I don't have any duplication from this file versus every other file in here," which is what you would do in a data warehouse, right? You don't want that duplication. So it makes it harder to deal with. But that's why you need to separate those two things out. If you're gonna send business users to your data lake, good luck.

Brick Thompson:

Right, they're gonna struggle with it, they're gonna have a much harder time getting on the same page with each other. Now, a data scientist, on the other hand, wants to be able to get at that raw data. They don't want it to be processed. If there's duplications, they want to maybe discover that themselves. Maybe there's some information hidden in that fact that there's duplication. You know, that type of thing.

Caleb Ochs:

Yeah, exactly.

Brick Thompson:

All right. Disadvantages, what are the big disadvantages of a data warehouse? I mean, some of them are sort of the opposites of what we just talked about.

Caleb Ochs:

Right, yeah, it can take a little while to get things in there to where you're happy with them. Because, again, your end user audiences, you're going to be building these really pillar key reports. And you're also building it for less technical people to be able to consume some of these tables. So it can take a little bit longer to get to where you've got data, and you're seeing numbers. Ultimately, though, I think that with any data work, you're building this data to be consumed. Data lake is where it gets a little bit gray, because you're kind of building the data just to have it. Anyway, the big one of the disadvantages of a data warehouse, if when you're comparing it to a lake, is that.

Brick Thompson:

Just the complexity of preparing data so that business users and business analysts can have easy access.

Caleb Ochs:

Right, exactly.

Brick Thompson:

Okay, data lake disadvantages. I mean, again, sort of the opposite of the things we've been talking about. It's not that tightly governed, may not be very well organized, could have bad data, garbagey data in it. So it's gonna take a lot more skillful analyst to wade through that, and make sure that they're getting good insights out of it. It might be hard to decipher the data, depending on the form it's in. If you're just dumping tables into it from another database, you're gonna have to figure out those relationships. And there's still a lot of work to do there. And definitely, as you just said, difficult to navigate for business users.

Caleb Ochs:

Right. You're just gonna be dumping data into it. One of one of the tough scenarios, and there's technologies out there that handle this, but let's say that you're only polling, you know, changed new records each day. It's actually going to be hard to get a comprehensive look at the data if you're just looking at the data lake. Because each file is going to be its own day, and there is going to be duplication based on changes. So then you need to pull the most recent record based on some sort of key, so it can get really complicated, depending on how you set your data lake up. But with that said, it's so much easier and cheaper to just store raw history of data and do some sort of auditing or something. Data lakes are going to be much better for that. So you can see what changed and when. Don't do that in your data warehouse, it's gonna cost you a ton of money. Unless you need to do that on a regular basis for some reason.

Brick Thompson:

Yeah, if it's a pillar report.

Caleb Ochs:

Right.

Brick Thompson:

Okay, so jumping into what they're used for. Again, data warehouse is sort of general, pillar reporting, business use known data sources, known types of reports, data that you want to be carefully curated, that type of thing. And then the data lake is a place where maybe you're storing data, you don't know when you're going to use it. Or maybe you've got a use case, like, "Hey, we want to reduce the size of our data warehouse, how are we going to archive it?" Put the archive in the data lake. Maybe you just need to do some ad hoc exploration of the new data source. Or, as we talked about in our podcast a few weeks ago, maybe you've acquired a company. They've got some transactional systems that you haven't dealt with yet. You don't have ETL for them, but you still want to get to that data so your analysts can start looking at it. It's not perfectly curated, it's not perfectly clean, but you can start learning stuff and actually figuring out, "Alright, do we maybe want to pull some of this data into the data warehouse? Or do we want to just live on this now until we integrate that acquired company into our base systems," those types of things.

Caleb Ochs:

Yeah, it'll help you make those decisions. If you if you can say, "This is going to take way more effort, and it's actually not worth doing that. So let's not do it." That's kind of a huge benefit, having that data available.

Brick Thompson:

Yeah. I also think that for data scientists out there, as I read articles about, I'm not a data scientist, but I read articles about what they're doing, they seem to like to have those masses of raw data to feed into their machine learning models. And so I think a data lake serves that well. Certainly, you can feed data out of the data warehouse into that as well. But they may miss insights, because it's been so curated, and so clean.

Caleb Ochs:

Sure, right. Exactly.

Brick Thompson:

All right. Any final comments on this topic?

Caleb Ochs:

I don't think so. I guess the only thing is that, to boil it down, you know, the data lake is raw data that you may not know what to do with. Data warehouses, this is purpose built data, "This is going to feed these things in our business."

Brick Thompson:

And it does seem like most midsize or larger businesses would have use cases for a data lake almost always. Like if you've got some old system that you're retiring, how are you going to archive that data so that is accessible. I can think of years ago, you know, 8 or 10 years ago... when we were doing a new data warehouse for a company and they had a bunch of old data from an old ERP system sitting on hard drives on shelves. And it was a pain to deal with that because you got to get the software running to be able to extract that data. And if you had it in a data lake already, it would be much easier.

Caleb Ochs:

Yeah, we still see that, where they'll have some old sunsetted system and it's a huge pain to get to the data anymore.

Brick Thompson:

Yeah. All right. Cool, good discussion. Thanks.