The Dashboard Effect

ETL & Data Warehouse Architecture with David Dimmen

December 07, 2022 | Brick Thompson, Jon Thompson, Caleb Ochs | Episode 56

In this episode, Brick and David Dimmen give a high-level overview of Blue Margin's approach to ETL and data warehouse architecture. Warning: this one is a bit on the technical side. Consider viewing the YouTube version of this episode here so you can see the diagrams David refers to during the discussion.

Blue Margin helps private-equity-owned and mid-market companies organize their data into dashboards to execute on strategy and create a culture of accountability. We call it The Dashboard Effect, the title of our book and podcast.

Visit Blue Margin's library of additional BI resources here.

For a free, downloadable copy of our book, The Dashboard Effect, click here, or buy a hardcopy or Kindle version on Amazon.

#BI #businessintelligence #ETL #DataWarehouse

Welcome to the Dashboard Effect Podcast. Today we're doing an experiment: it's our first video podcast. I've got David Dimmen here with me. He's one of our lead data architects here at Blue Margin, and we're going to be talking about some technical stuff: how we do ETL, how data moves into a data warehouse, and that type of thing.

Yeah, hopefully we don't get too into the weeds, just enough to give a good taste of how things work.

Good, all right. Based on the preview, I think we're going to be getting a little weedy.

Yeah, we'll see.

All right. So where do you want to start on this?

There are a couple of things that we can talk about. I think the first thing to start on is the types of things that we're utilizing as we move through the ETL process. There's obviously the data storage component: we have data stored somewhere, and we need to get it into a data warehouse so we can transform it. So data storage is one element. The other element is how we're moving the data, so data movement, and the orchestration of that, how that works. And the final element is the visualization of it. Those three pieces are the beginning step, and then we can talk about what happens once the data is stored in the data warehouse.

Perfect. Okay. One term that we use a lot is orchestration. For a non-technical person, what does that mean?

It's basically how the execution of the movement happens. When a trigger fires to get data flowing in one direction, the orchestrator is the one that says, "I'm going to take the data from this place, and I'm going to move it over to this place," essentially.

Okay. So it's basically a set of tools, software, things that are doing all that.

Exactly.

Great. Okay, well, where do you want to start?

So I've got this first slide up here, and I would say it's a little bit bland, honestly, but it's a good way to think about this. Up at the top of this visual, we have the data movement and orchestration bar. In the middle, we have data storage, and at the bottom we have data visualization. On the far left of that data storage area is where we see the client information. The client would typically have at least one source where the data is living. That may just be Excel for some clients. It might be NetSuite for others, it might be multiple sources, it might be Workday, it might be NetSuite plus ADP... there are so many different sources. Here, we've got four different sources depicted. And the client needs to get those into the data warehouse.

So how are we going to do that? Well, what we're going to set up is a data movement or orchestration tool. In this case, we've shown Data Factory or a Synapse workspace; those are in Azure. We can also use Microsoft SQL Server Integration Services (SSIS), and that's a great tool. It's been around forever, and it works really, really well. But there are so many benefits to moving into the cloud that we're now using Synapse and Data Factory almost exclusively for all of our ETL.

And I should make the point that, as a company, Blue Margin has decided to focus on the Microsoft stack, so we do almost all of our work in the Azure cloud, and now in Synapse for BI work.

And it's such a powerhouse. There's so much that Synapse can do, so many different functionalities that it employs, and so many different connectors that you're able to use with it seamlessly. It's a great tool to utilize for this type of activity.
So what we'll do is we'll set that up. We'll link the client source with our data warehouse, which you can see in this visual if you're watching this podcast, but the idea is that we're taking something out of the client's environment and moving it into the data warehouse. That orchestration tool is what we're using just to move data from one place to another, in this case.

Got it. And you're going to go into more detail about how that movement works. So this picture is very high level.

Very high level. This is the 10,000-foot overview, so that we can understand, at a very quick glance, the path from source to end product. This is how we're getting from where your data is into the report at the very end, and there's a whole lot that happens in between. But what this does is take everything and move it into our Azure data warehouse. Once it's in the data warehouse, we use that same orchestrator, the one that took the data from A to B, and we use it to focus on executing tasks inside the data warehouse. So we'll immediately schedule some views to be pushed through with some procedures, and do different technical things to shape and mold the data from its raw form into a nice data model that can be reported on really, really well, with good functionality for time intelligence and all the bells and whistles that we want to employ in a Power BI report. We do all of that work in the data warehouse, and we execute the tasks with the data movement orchestrator, or Synapse in this case.

Okay, so just to make sure everybody's caught up: you start at the client's data source. You need some method to move that into the data warehouse, and there are various things that happen there that we're going to talk about. Once the data is in the data warehouse, you still do what we call transforms, so you change the data. You also create views on top of the database tables that make the data easier to understand and interact with for analysis, for report writing, and that type of thing. That's where we've gotten to, right?

Exactly, we make the data palatable for the business user. That's the goal in the end. So that's really what we're looking at here. As we see in the visual, if you're able to see it, we have these pipeline activities that are touching the data warehouse, going in and doing the transforms. And just to touch on that again, this is the ETL process, or sometimes known as the ELT process, where we're extracting from the client source, loading into the data warehouse, and transforming in the data warehouse. ELT: Extract, Load, and Transform, to bring that data into a nice ingestible format that we then connect to Power BI. And we build different DAX measures and expressions on top of that, to really show the data as it needs to be seen. So that's what we're seeing here with this visual.

To zero in, if anybody is able to see this, on this step here, the pipeline activities that are affecting the data warehouse and how we do those transforms, that's what this next slide will show. So if I jump over here, this slide looks crazy offhand.

So much for not technical.

So much for not technical, that's exactly right. I still want to keep this as high level as we possibly can, and we can focus in on a couple of little items at first. So on the left here, we show a source system.
And we're just saying this is any source system. It could be SQL Server, Oracle, it could be HubSpot, it could be QuickBooks.

Whatever your ERP is.

Exactly, whatever the ERP is. So we're going to take data from there and move it into what we call our staging layer. And in our staging layer there are two different types of tables we use. We have our persisted staging table, which is an exact replica of what's in the source system. We're not changing any data types, we're not changing anything about the data at all. It exists in the persisted staging layer the same way it does in the source system.

And that gets updated every time you run that ETL.

Exactly.

And so the advantage of having a persisted staging layer is that if you have analysts who want to go back and look at specific columns and specific tables from the source systems, they can get to those. Whereas by the time the data gets into reporting views, it may have been modified and transformed quite a bit. So if they're trying to figure out, "What is going on with this one part of this system?", which we definitely get sometimes, they can go right to the persisted staging layer and get that.

Exactly. It's a great layer for a business analyst on the client side to be able to just jump in and say, "I would like to connect to the source system, but it's a bit more difficult to connect to the back end of the source system. If I have the back-end tables of the source system right here in front of me, well, I can just do my basic SQL queries on that. I just want to select everything from this table where this column is equal to this value. Oh, now I'm getting some different insight, because I can look at the data as it existed in my source."

Right. So they don't have to worry about that. As you said, it's often difficult to connect to source systems, but it's also a bad practice, because a source system is your transactional system. Presumably the business is doing things on it, and if you go in there and run a query that somehow locks up the system, that is bad. Whereas if you're working in the persisted staging layer, I guess you could affect some of the ETL processes, but you're not going to affect the business's ability to conduct business.

And it's highly unlikely that you would affect any of the ETL processes when looking at the persisted staging layer. Most clients are typically updating this ETL once or twice a day; sometimes you can get up to five or six times a day. But even then, those runs are spaced out in pretty large increments, so you're not going to be affecting that. Whereas, like you said, if you connect to your source system and run a query that locks up a table, it's, "Oh, you may have just crashed your point of sale."

That's not a good practice. I've seen that.

Yeah, exactly. So it's not preferred to connect to the source system. Connecting to persisted staging is a great way to get insight and really dissect source system data without having any impact or negative ramifications on the source system.

Okay, so let me just ask a clarifying question. On your way to persisted staging you go through staging, which only survives for the purpose of that one ETL run?

That's right.

And you're using that to pull in updates from the various tables that you've been doing ETL on, get those into persisted staging, and then you dump the staging tables.

Exactly.
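For illustration, here is a minimal sketch of the kind of ad hoc query an analyst might run against persisted staging rather than against the live ERP. The schema, table, and column names below are hypothetical, not Blue Margin's actual naming.

```sql
-- Hypothetical ad hoc query against a persisted staging table.
-- Runs in the warehouse, so a slow or careless query can't lock up the ERP itself.
SELECT
    InvoiceID,
    InvoiceDate,
    CustomerID,
    LineAmount
FROM stg_persisted.InvoiceLines
WHERE InvoiceDate >= '2022-11-01'
  AND CustomerID = 10042;
```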
That's why we refer to it as persisted staging, and the other layer we don't call non-persisted staging, just staging.

It's great to think of it as persisted staging and maybe transient staging, where we dump the table. Then we go back to the source and say, "Well, in our persisted staging layer, we've got data up to yesterday. So we don't need the 100 million rows from the day before yesterday back to the beginning of your ERP system; we just need yesterday's data." So we go into the source and query it: "Give me all of the data from this table from yesterday up till now." Okay, well, that's 1,000 rows rather than 1 million or 100 million rows. So we can bring in just the changed data, and then we merge that into our persisted staging layer so that we have a complete source replica again.

And then we'll run some checks where we say, "Hey, let's validate and make sure that the source system has this count of rows and we have the same count of rows." Sometimes we'll also check some of the sums: okay, add up all the amounts in the source table and see if that matches our persisted staging table. Okay, they all match; we're valid. So you can confidently know that if you're doing any queries on the persisted staging layer, you're querying exactly what's in the source system.

Yeah. Actually, there is one difference, I think. Well, you'll correct me if I get this wrong, but I think in persisted staging, the way we deal with deleted records coming out of the source system is that we just mark them deleted. We have a flag, "is deleted" or something, on that record in persisted staging. So if you're querying persisted staging, you need to make sure to filter out any that are flagged "is deleted," unless you want to see records that have been deleted.

Exactly. That way, you really have a full record of what's been going on in the system. Sometimes that's nicer, because in a source system, maybe they are able to delete a record. And we will sometimes delete records; it depends on the functionality of the table. Is it a dimension, or is it a fact? And were these records really in error, or did they just need to be audited or something? We can check that a bunch of ways, but we typically do a soft delete, where we say, "Okay, does this record no longer exist in the source system? If it doesn't exist in the source system, we're going to retain it for audit purposes and just mark it as deleted, so that we can see it and can filter it out of any queries moving forward."

That's right. I'm taking us down too weedy of a rabbit hole there, sorry. All right. So I think we've covered persisted staging really well there. It's a really powerful part of our architecture, and I think it's missing from a lot of data warehouses. So anyway, thank you for that. Why don't we go on from there: what happens after persisted staging?

So after persisted staging, we're going to be moving toward the data warehouse schema, and this is where we've done our transformations. So we take, again, if you can see this visual, SQL objects that are views. We'll build a SQL view that goes back and looks at multiple tables, or maybe just one table, and gets all the data that's necessary for the data warehouse. We're not going to carry everything from the persisted staging layer through to the data warehouse layer. A lot of times it's too much; we just don't need all that.
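The incremental load, soft delete, and validation steps described above might look roughly like the following T-SQL. This is a sketch under assumed, hypothetical names (stg_transient.InvoiceLines holds only the rows changed since the last load, stg_transient.InvoiceLineKeys is a lightweight full extract of the current source keys, and IsDeleted is the soft-delete flag); the actual procedures will differ.

```sql
-- Hypothetical incremental merge from transient staging into persisted staging.
-- stg_transient.InvoiceLines contains only rows changed since the last load (e.g., yesterday forward).
MERGE stg_persisted.InvoiceLines AS target
USING stg_transient.InvoiceLines AS source
    ON target.InvoiceLineID = source.InvoiceLineID
WHEN MATCHED THEN
    UPDATE SET
        target.InvoiceDate = source.InvoiceDate,
        target.CustomerID  = source.CustomerID,
        target.LineAmount  = source.LineAmount
WHEN NOT MATCHED BY TARGET THEN
    INSERT (InvoiceLineID, InvoiceDate, CustomerID, LineAmount, IsDeleted)
    VALUES (source.InvoiceLineID, source.InvoiceDate, source.CustomerID, source.LineAmount, 0);

-- Soft delete: flag rows that no longer exist in the source, rather than removing them.
-- Assumes a full extract of current source keys is available for the comparison.
UPDATE ps
SET ps.IsDeleted = 1
FROM stg_persisted.InvoiceLines AS ps
WHERE NOT EXISTS (
    SELECT 1
    FROM stg_transient.InvoiceLineKeys AS k
    WHERE k.InvoiceLineID = ps.InvoiceLineID
);

-- Simple validation: active row count and amount total should match the source system.
SELECT COUNT(*) AS ActiveRows, SUM(LineAmount) AS TotalAmount
FROM stg_persisted.InvoiceLines
WHERE IsDeleted = 0;
```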
So we're going to get the columns that we need, we're going to do the transformations that we need with different functions and all that, join tables back together that need to be joined, and then push that into a data warehouse table. This is the clean version of all the data that you need: all of your pertinent invoice line information, or your pertinent invoice header information, or your pertinent GL detail information, where we've gotten rid of all of the excess, trimmed it down to what's meaningful, and cleaned it up. And now it's ready to be reported on.

Yeah, so that allows you to build a nice, easy-to-understand dimensional model in the data warehouse, where you just get rid of some of the technical stuff that the transactional system needs to run, and you pull out the stuff that you care about for the fact areas that you want to be able to do reporting on.

Exactly. And most of these source systems are built in an OLTP fashion, which is a transactional fashion. They're meant to be transacted on frequently, so they do what's called normalizing their data: they expand all these tables into as many different tables as they can and join them up in strange ways, which is good for transactions. It's a great process for transactions. For reporting, it's suboptimal. So we like to be in what's called an OLAP format, which is for analysis. We're taking all of these normalized tables and combining them back together, as it makes sense, to make a denormalized model, which is what Power BI likes to see. It likes to see a nice denormalized star schema, so that we can do quick, really fast reporting and get the right data at the right time.

Gosh, at the risk of going too technical again, let's just talk about the two terms. OLTP, kind of an old-fashioned term, is Online Transaction Processing. OLAP, usually pronounced "oh-lap," is Online Analytical Processing, I believe, right?

I believe that's right. You'll hear "OLAP cubes," and that's usually referring to denormalized, star-schema dimensional models that you can report on.

Okay, good.

And like we said, I'm trying to keep this out of the weeds but still use really common terminology, so that we can understand the process. Because the process, though it can be complex and a little bit difficult, is very intuitive once you start to understand it. We're taking the data that was blown out into all of these different areas...

So let's just talk for a second. That's called normalizing.

Yes.

And normalizing allows you to use the least amount of storage space and is highly organized. So you might have customers that have an address, or maybe you have a customer table and you want to know what state they're in. This is a simple example, but you want to know what state and what zip code. You probably need city too for this example to work, but let's say you have a customer. If you normalize, you'd have a customer record that would maybe just have a zip code associated with it, and then you'd have another table that tells you what city and state that zip code goes to. When you denormalize, you go ahead and just put the city and state in the customer table, so you don't have to do that join out to the extra table.

Yes. So anyway, getting back to this visual, we take these tables, and this is where we denormalize. We're taking multiple tables here and putting them into a single ETL view.
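Using the customer and zip code example above, a denormalizing ETL view might look roughly like this. The etl schema, the table names, and the columns are all hypothetical and for illustration only.

```sql
-- Hypothetical ETL view that denormalizes customer data for the warehouse dimension.
-- The zip code lookup is joined back onto the customer so city and state live on one row.
CREATE VIEW etl.vw_DimCustomer
AS
SELECT
    c.CustomerID,
    c.CustomerName,
    c.ZipCode,
    z.City,
    z.StateCode
FROM stg_persisted.Customer AS c
LEFT JOIN stg_persisted.ZipCode AS z
    ON z.ZipCode = c.ZipCode
WHERE c.IsDeleted = 0;   -- exclude soft-deleted source records
```

Power BI then sees one wide customer dimension instead of having to chase the zip code lookup table at query time.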
And then with stored procedures, and this goes back to what we talked about at the beginning, we have that orchestrator that's going to say, "We're going to execute an activity in SQL Server or in the Synapse workspace." We tell it, with the orchestrator, "Execute this procedure." And that takes the data and moves it from the persisted staging layer, through the views, through the procedures, and into the data warehouse table. So now we have our final transformed data warehouse dimensions and our data warehouse facts, and that could be the end. We could say, "Let's just connect Power BI right to those data warehouse dimension tables and call it good."

The thing that we've noticed, and really benefited from, is that having a final view, which is usually just a select-all from that one table, allows us to do some final little logic or little tweaks or little joins that help us supplement whatever's going out to Power BI. So if there's a small calculation where, after we got it into the model, we said, "Ah, it would be really nice if we just had this one extra flag," well, we can go right to the report view, add that flag, and have it in the report within, sometimes, 15 minutes.

Because you're already reporting on a view, you don't have to say, "Oh, shoot, this should have been a view. Let me create that. Okay, now point your report at that view instead of the table." You just start from there.

Yeah.

I like that. That's great.

So that gets a little more into the weeds of how we're handling all that. We don't need to go in depth on those things, but the idea is that we have a nice view layer that just looks at the report model, and that gets pulled into Power BI and Analysis Services. Whether it's a Power BI dataset or Analysis Services explicitly or something else, it pulls that in right away at that layer. And it's a great system, especially with the persisted staging layer and bringing in just new records. It makes it incredibly fast. It makes the reporting model very fast, and as we push everything out to Power BI, it makes that a quick model as well.

That's great. Okay, I think you've simplified that well. Are we going to go to the next step, which is going into reporting cubes, or is that for another day?

I think that's another day.

All right. Anything else you wanted to add here before we wrap up?

No. I think it can seem like a very complicated process, but especially once you can visualize it and think linearly through the whole thing, it's just a series of steps to get your data from where it was in its raw form to where it needs to be in its clean, reportable form. That process, though it seems difficult, can be really easy to understand if you can just see it laid out.

And it can be difficult to get done, but conceptually it's quite simple. Sometimes the technical details of making it happen have some twists and turns.

Yeah, exactly. There's very complicated logic in those ETL views and procedures. That being said, once you understand the concept of how the data moves through, even as a business user you can say, "Oh, I can understand why this number looks different than it did in the source system initially. Because we've transformed it. We've made it better: more accurate, and more palatable."

Yeah. Great. All right. Thank you.

Awesome. I look forward to doing this again soon.

Absolutely.
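To tie the walkthrough together, here is a rough closing sketch of that last hop described above: a procedure the orchestrator can execute to load a dimension table from the ETL view, and a report view on top of the table where a small, late-added flag can be dropped in. As before, every object name here is hypothetical, not Blue Margin's actual implementation.

```sql
-- Hypothetical load procedure the orchestrator executes to rebuild the dimension.
CREATE PROCEDURE dw.usp_LoadDimCustomer
AS
BEGIN
    SET NOCOUNT ON;

    TRUNCATE TABLE dw.DimCustomer;

    INSERT INTO dw.DimCustomer (CustomerID, CustomerName, ZipCode, City, StateCode)
    SELECT CustomerID, CustomerName, ZipCode, City, StateCode
    FROM etl.vw_DimCustomer;
END;
GO

-- Hypothetical report view: mostly a select-all from the dimension, plus one extra flag
-- added after the fact for Power BI, without re-pointing the report at a new object.
CREATE VIEW rpt.vw_DimCustomer
AS
SELECT
    d.*,
    CASE WHEN d.StateCode = 'CO' THEN 1 ELSE 0 END AS IsColoradoCustomer
FROM dw.DimCustomer AS d;
GO
```

Because Power BI connects to the report view rather than to the table, the new flag shows up on the next refresh with no re-pointing, which is the quick turnaround described in the episode.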