The Dashboard Effect

A Case Study: Unifying Data Sources with Azure Data Lake and Serverless SQL

Brick Thompson, Jon Thompson, Caleb Ochs Episode 117


Brick and Caleb discuss Blue Margin's use of Azure Data Lake for unified data management. Using a recent client project as an example, Caleb explains how he integrated multiple data sources using serverless SQL to create a single data repository with a semantic model, enhancing consistency across projects and reports. They also discuss the current performance benefits of serverless SQL and the Microsoft Fabric OneLake updates they're waiting on before transitioning from Azure Data Lake.

In this episode:

  • Introduction to Azure Data Lake: Caleb kicks off the discussion with an overview of Azure Data Lake, emphasizing its role in creating a unified data repository. The conversation highlights the significance of maintaining "one copy of your data" to eliminate redundancy and ensure consistency across various data analysis projects.
  • Client Case Study: Caleb walks through a recent project of creating an Azure Data Lake for a client. Learn about the challenges of integrating multiple data sources and how serverless SQL views simplify data modeling and reporting.
  • Serverless SQL Explained: Caleb outlines the benefits of serverless SQL, including its flexibility, cost-effectiveness, and how it caters to users familiar with SQL Server without the need for a traditional database setup.
  • Semantic Modeling for Unified Reporting: Understand the process of creating a semantic layer that merges data from disparate systems, enabling consolidated reporting across different business units without compromising data integrity.
  • Performance Insights: Caleb shares his experience with the performance of serverless SQL on Azure Data Lake, busting myths about query execution times and handling large datasets efficiently.
  • Looking Ahead: Brick and Caleb discuss the features they're waiting on within Microsoft's Fabric OneLake before transitioning from Azure Data Lake. 

Click here to watch this episode on our YouTube channel.

Blue Margin increases enterprise value for PE-backed, mid-market companies by serving as their fractional data team. We advise on, build, and manage data platforms. Our strategy, proven with over 300 companies to date, expands multiples through data transformation, as presented in our book, The Dashboard Effect.

Subscribe here to get more episodes of The Dashboard Effect podcast on your favorite podcast app.

Visit Blue Margin's library of additional BI resources.

Brick Thompson:

Welcome to The Dashboard Effect podcast. I'm Brick Thompson.

Caleb Ochs:

I'm Caleb Ochs.

Brick Thompson:

Hey, Caleb. Today we thought we might talk about how you're using data in OneLake. Is it OneLake?

Caleb Ochs:

I'm using an Azure Data Lake. But I think the concepts that we're going to talk about are basically the same. Right?

Brick Thompson:

Okay.

Caleb Ochs:

And I think that's kind of the key. What we want to talk about today is how Microsoft is thinking about OneLake, and how right now, at least in how we're building out some of our models, we're keeping some of those concepts. The biggest one being "one copy of your data".

Brick Thompson:

So for different projects, you're not pulling data all over the place, like having a set of data inside an import model within Power BI, for example, or inside of Excel or something like that, right?

Caleb Ochs:

Right, or just having a copy of your data in one layer, and then another one, and then another one until you finally get to your end result. And people are using stuff in between. You don't want to do any of that. Because then you've got one person using one version of the data, and another person who's got another because a job failed or whatever.

Brick Thompson:

It didn't sync... or

Caleb Ochs:

Yeah, exactly.

Brick Thompson:

Okay. All right. So how are you approaching that now? I know you just did a project for a client. You and the team do projects like this all the time, but this one came to mind for you this afternoon.

Caleb Ochs:

Yeah, so I actually did a lot of the work on this one. So the client has a bunch of data sources, but we're starting with one, and we created a model for it. The way that we set things up is we've got an Azure Data Lake, and we piped the data in as parquet and delta files. Then we build our model with serverless views on top of that, so it's just querying those delta files and shaping them with SQL into a form that's going to work really well for our Power BI model, so we can ultimately build the reports that we want. So we decided to pull in another data source. A little bit of background on how this used to work in the old days: you would have a database - and technically you could have done this in that database, but I don't think anybody thought of it, because database tools had a lot of things built in like constraints. And when I say constraints, I mean foreign keys, primary keys, auto-incrementing IDs, and things like that. The tools now are just better for doing this type of thing. Delta and parquet files are just better for handling large amounts of data and doing what I'm about to explain.
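As a rough sketch of the pattern Caleb describes, a serverless SQL view (shown here in Azure Synapse serverless syntax) queries delta files in the lake directly and shapes them for a Power BI model. The storage URL, view name, and columns below are hypothetical, not the client's actual setup:

```sql
-- Hypothetical serverless SQL view over delta files in the lake.
-- The storage path and column names are illustrative only.
CREATE VIEW dbo.Sales AS
SELECT
    s.OrderID,
    s.CustomerID,
    s.OrderDate,
    CAST(s.Amount AS decimal(18, 2)) AS Amount  -- shape the type for Power BI
FROM OPENROWSET(
        BULK 'https://examplestorage.dfs.core.windows.net/lake/sourceA/sales/',
        FORMAT = 'DELTA'
     ) AS s;
```

No data is loaded anywhere: the view is just compute over the one copy of the files sitting in the lake.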

Brick Thompson:

Right.

Caleb Ochs:

So I'll get into it. So anyway, we brought in the other data source, all into its own area in the data lake and its own parquet and delta files. And then inside of our serverless SQL, what we ended up doing is unioning and creating views on top of those files, views that query the files individually, and created our common dimensions like "customer", "item", "salesperson", "location", things like that. So when you look at our serverless SQL database... it's kind of hard to call it a database, because it's really just an engine.
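The conformed-dimension step might look something like this hedged sketch, with hypothetical paths, columns, and key convention: each source system keeps its own files in the lake, and a single view unions them into one "customer" dimension.

```sql
-- Hypothetical conformed "customer" dimension unioning two source systems.
-- Storage paths, column names, and the key prefixes are illustrative only.
CREATE VIEW dbo.DimCustomer AS
SELECT
    CONCAT('A-', a.CustomerID) AS CustomerKey,  -- prefix keeps keys unique per source
    a.CustomerName,
    'SystemA' AS SourceSystem
FROM OPENROWSET(
        BULK 'https://examplestorage.dfs.core.windows.net/lake/sourceA/customers/',
        FORMAT = 'DELTA'
     ) AS a
UNION ALL
SELECT
    CONCAT('B-', b.CustID),
    b.Name,
    'SystemB'
FROM OPENROWSET(
        BULK 'https://examplestorage.dfs.core.windows.net/lake/sourceB/customers/',
        FORMAT = 'DELTA'
     ) AS b;
```

To a report builder, `dbo.DimCustomer` then looks like any ordinary dimension table, even though both sources still hold their own single copy of the data.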

Brick Thompson:

You should probably explain what serverless SQL means.

Caleb Ochs:

Yeah, so serverless SQL is really just compute on top of your files. So you've got your files sitting in the data lake. That's different than a standard database. In a standard database, you actually load the data into the database, and that database maintains...

Brick Thompson:

Like a SQL server database.

Caleb Ochs:

Yeah, it'll maintain statistics and stuff on it. Parquet and delta files don't have that... it's not a database, right? And that's nice, because you can query them with a bunch of different tools, which kind of opens the door to more options for you. So serverless sits on top of that, and allows you to write code against those files and manipulate them.

Brick Thompson:

In T-SQL.

Caleb Ochs:

Yeah. T-SQL.

Brick Thompson:

So people who are familiar with using a SQL Server can still write views and so on the way they're used to, but they're not having to write it against a SQL Server database. They're writing it against the parquet files in the data lake.

Caleb Ochs:

Right, exactly. So the nice thing about that is that (as we were saying earlier), it's one copy of data, right? You're just querying those files and shaping them the way that you want. Now, that might give you some concern about performance. And believe me, I've had those same concerns. Serverless SQL has completely blown me away with how well it performs. I'm doing pretty basic stuff, like some joins and some CTEs. I wouldn't say super-advanced, I wouldn't say super-beginner, kind of intermediate type stuff with the SQL, but it performs really, really well. You know, we're talking like 3 million rows in our fact table, and it refreshes in, like, three minutes. So it's really impressive. So anyway, we put that layer of SQL on top of these files in our serverless SQL. And then when you're looking at the objects inside of your serverless SQL database, you see, you know, just your standard stuff: here's your "customer" dimension, here's our "item" dimension table.
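The "joins and CTEs" Caleb mentions might look something like this hypothetical fact view; the paths, table names, and columns are assumed for illustration, including a conformed customer dimension view (`dbo.DimCustomer`) that prefixes keys by source system:

```sql
-- Hypothetical fact view: a CTE plus a join, the kind of "intermediate"
-- shaping described above. All names and paths are illustrative only.
CREATE VIEW dbo.FactSales AS
WITH orders AS (
    SELECT o.OrderID, o.CustomerID, o.OrderDate, o.Amount
    FROM OPENROWSET(
            BULK 'https://examplestorage.dfs.core.windows.net/lake/sourceA/orders/',
            FORMAT = 'DELTA'
         ) AS o
)
SELECT
    o.OrderID,
    c.CustomerKey,   -- key from the conformed dimension view
    o.OrderDate,
    o.Amount
FROM orders AS o
JOIN dbo.DimCustomer AS c
    ON c.CustomerKey = CONCAT('A-', o.CustomerID);
```

Power BI can then import the fact and dimension views as a standard star schema, even though everything underneath is just files in the lake.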

Brick Thompson:

It just looks like what you're used to.

Caleb Ochs:

Yeah, it looks like what you're used to. So the great thing about that, like I said, is you have one copy of data in your data lake, and now you've got this, call it the "semantic model", on top of that, that's pulling together two totally different systems and making them look like one, so you can have consolidated reporting across both of those business units.

Brick Thompson:

So it's what we used to do in the old days, where we would load the data from both of those systems into a SQL Server, then do all kinds of work to get it so that we've got conformed dimensions and those types of things, and then do "reporting views" to make your semantic model there. But you're not constrained by all the things you're constrained by in a SQL Server. You can just load it into the data lake and write that semantic model using your serverless SQL, just against the data that you want. You might have other data sources in there, you might have all kinds of stuff in there. But you produce this semantic model layer, this view, for the user, and that makes it really easy for someone who is used to that star schema type approach to reporting to deal with their stuff.

Caleb Ochs:

Right, right. Exactly. I think what I like about it most is that it's kind of like what we used to do inside of a data warehouse, where you would do a bunch of that transformation in SQL code, but then you'd actually load it to another physical table.

Brick Thompson:

Right.

Caleb Ochs:

And with the way that serverless works, there might still be really heavy data cleansing work where you would want to do that, just for performance's sake. But in most cases, for us, having those queries directly on top of the data simplifies things a lot. For example, if something was wrong in that separate physical table (usually a fact table), one of the troubleshooting steps would be: okay, go look at the SQL that puts that table together, and try to piece it together. Like, "where did it go wrong?" Now, it's just one step. You're not having to say, okay, what happened here?

Brick Thompson:

Let me go back and look at staging,

Caleb Ochs:

What happened there? You're just querying the data. As long as that query is good, you're still good. You don't have to worry about missing an update, you don't have to worry about something going wrong in a procedure, or it failing halfway through, or whatever. You just run your query against your data.

Brick Thompson:

Because it's kind of raw data sitting in there in the parquet files. Yeah, that's cool. All right. So you're doing that in Azure Data Lake right now. Fabric OneLake is out, and we've used it some. Some people are using it a lot, but it's not quite at parity with Azure Data Lake, but it's close. Right?

Caleb Ochs:

Yeah. I mean, it's pretty close. I think some of the code is actually a little bit simpler inside of Fabric, which is nice. The reason why, in this particular instance, we're not in Fabric, and why we won't be for I don't know how long in these scenarios, is that we're pulling data from an on-prem SQL Server. And you have to have an integration runtime, or a data gateway, to pull data into the cloud, into OneLake or into Azure Data Lake.

Brick Thompson:

And that's a little agent sitting on that SQL Server.

Caleb Ochs:

And right now, those are not supported inside of Fabric pipelines. So we're doing it in Azure.

Brick Thompson:

That's got to be around the corner. I mean, it has to be.

Caleb Ochs:

Yeah, I think so. Yeah, I was actually surprised that it didn't come right out of the box, but there are other ways you can do it. They're just not well-suited for moving a bunch of data.

Brick Thompson:

Got it. Okay, interesting. All right. Well, I liked the case study there. And we're going to just keep doing more and more of this, getting better and better at it. You know, the whole industry has been moving this way for the last three years or so, but it's really becoming mainstream. And I think with Azure Data Lake and Fabric OneLake, the performance is what really makes it easy to do. Like, you don't have to go back to the SQL Server to get the performance you want, generally. It sounds like there might be some cases still where you would do that.

Caleb Ochs:

Yeah, your SQL Server is gonna have the same problem. You know, it's gonna choke on that, too. It just gets into some heavier data engineering questions, but yeah.

Brick Thompson:

You're right. All right. Okay, well, I think that's good for today. Oh, Nicole's gonna kill us.

Caleb Ochs:

Yeah, you say it.

Brick Thompson:

Please hit that subscribe button. Yep, like and subscribe if you liked the content. Alright, thanks. Talk to you soon.

Caleb Ochs:

See ya.

Brick Thompson:

Bye.