
Auditing with data: for Performance Auditors and Internal Auditors that use (or want to use) data
The podcast for performance auditors and internal auditors that use (or want to use) data. Produced by Risk Insights.
55. Data Science and KNIME with Maarit Widmann
Maarit Widmann is a data scientist at KNIME (www.knime.com)
In this episode, we discuss:
- What KNIME is
- Why the visual programming paradigm (workflows, rather than code) is appealing to both technical and non-technical professionals
- The most challenging aspects of the data science lifecycle, and what auditors should be aware of
- The most important aspects of the data science lifecycle
About this podcast
Hosted by Conor McGarrity and Yusuf Moolla.
Produced by Risk Insights (riskinsights.com.au).
You're listening to The Assurance Show. The podcast for performance auditors and internal auditors that focuses on data and risk. Your hosts are Conor McGarrity and Yusuf Moolla.
Yusuf:Today we're really excited to have Maarit Widmann on the show. Maarit is a data scientist at KNIME. I'll let Maarit explain what KNIME exactly is and what they do; we've spoken about it on the podcast before. But Maarit is a data scientist there and looking forward to an interesting chat. Maarit, welcome to the show.
Maarit:Thank you very much.
Yusuf:Okay. So we'll kick straight into it. The first thing we wanted to understand is what your professional background is. If you're able to give us a little story about where you've come from academically and what you do now.
Maarit:My professional background is my history at KNIME. I came directly from university to KNIME as a very junior data scientist, and here I have been growing together with the company and as a data scientist. My background in studies is in data science, and I also have a bit of sociology and social sciences there, which fits the teacher role and the education role that I have at KNIME now. The software that the company develops is based on the visual programming paradigm. The idea is that you don't need to code. You can build data science operations of any kind, complex or simple, without coding. You just build your workflows in a very intuitive way, as blocks of operations that you can see in the user interface. This is the tool, and our audiences are data scientists, business analysts, several kinds of users. My role, and the role of my colleagues in the education team, is to bring the tool to these different audiences. It's a lot of writing, it's giving courses, it's teaching. It is also giving presentations of use cases.
Conor:What attracted you to data science in the beginning?
Maarit:As a student, data science was very innovative. And the learning is so huge: in everything that you do there is always some learning, and that is very rewarding. The other thing is that the results are so concrete. You build an application for churn prediction, you apply it somewhere, and then you see immediately, okay, this saves us this much money.
Yusuf:With the work that you do at KNIME, what is it that keeps you going and keeps you interested and excited about the data science field?
Maarit:Firstly, I very much like working at KNIME, with people who have different skills. Data science is the common thing that we have: good people who can develop the software or can use the software, and who come up with great ideas about what the applications could be. This kind of environment is something very exciting. On a personal level, it's also the variety in the tasks that you can do. It's a lot of writing. It's a lot of teaching. It's also a lot of thinking on your own and coming up with good logic.
Conor:Can you give us a little bit of the history about KNIME itself as a company - how it began and a little bit about its journey?
Maarit:Actually, the size and the reach of the software and the company at the moment is much more than it was when I started at KNIME, for example. So it has been growing quite fast. The beginning was about 15 years ago, and it was at the university. So from there maybe comes the innovative background that we still live at KNIME.
Yusuf:What would your use of ...
Maarit:I use KNIME daily. I start from an application, and I already have an idea of how the application should look. I know what kind of data I have, I know what I want to have as a result, I know what my audience is. Then the focus is on, okay, what am I going to do? I'm going to build a visual workflow. I am going to do something that is easy to present to the audience. I'm going to build something that meets the expectations of the audience. There is also a lot of this creative part: looking at what other solutions there are, how I can maybe make it easier with the visual workflow, or how I can communicate it besides the workflow. So there are a lot of other aspects included than just building the little pieces in the analysis process. That is maybe the main way I use KNIME. But then there is also: what do we want to analyze internally? If I want to know how popular my blog post was in the last year, then I do an analysis with KNIME. So there is also this experimental way of building the workflows and seeing what comes out.
Yusuf:So lots of time spent with little yellow boxes.
Maarit:Exactly, exactly, especially the yellow ones. These are for the things that you never present, but that are so important: filtering rows, filtering columns, making aggregations.
Conor:Can you tell us some of the benefits of using a workflow approach, as opposed to say a more traditional approach that's heavily based on code?
Maarit:The workflow-based approach. Firstly, what I like very much is the communication part: you can bring together so many different experts who are all looking at the same workflow, and they can all understand it. They understand different things about it, but it's all relevant to all of them. If they look at a block of code, it is the data scientist, the coding expert, who sees what is happening in there. It's not super useful for the other people who are interested in the bigger picture of the process. This is what I like the most. And then the learning curve: it's so easy to learn. Internally I also have some people who use KNIME in the finance and the marketing departments. I really like that if they need help with their workflows, I can go there and say, okay, here is the node, you need to match these columns, or you need to change the string here to date and time. I go there once, and I don't need to go there as a coding expert and say, ask again when you need this another time. I show them once. Then they recognize which yellow node it is. They use the yellow node a second time, and it's already in their toolkit. People outside the data science field can learn it so quickly and intuitively.
Yusuf:Excellent. So those are internal users focused on corporate administrative functions that are using KNIME themselves not just the data scientists or data experts.
Maarit:Yeah, definitely.
Conor:Over the years Maarit you must have worked on many interesting and fascinating projects. Is there one particular project that you see as the most interesting that you've worked on?
Maarit:The most recent one that I'm working on is very interesting. I'm writing a book about time series analysis with my colleagues. It is about codeless time series analysis, and it combines so many interesting aspects. The first thing is "codeless". Time series analysis is super relevant in different industries. You want to predict sales, or you want to predict disease spread, or something like that. But how do you do that? Many algorithms are in Python or in R, and then you need a data science expert to build it. But at KNIME, we have been working on components to make these algorithms for time series analysis, ARIMA and so on, codeless. These have existed for a while, and now we are promoting them more in a book. This has been a very long journey, but also a very nice project to work on.
Yusuf:Time series analysis can become quite complex when it comes to forecasting, et cetera. But one of the things that we often see within the audit, internal audit and performance audit realm is where we have data for some years or some periods, but we don't have data for others, or where there is potentially missing data for, say, certain days or certain weeks or certain months. The techniques that you're developing, can you use them to predict what those numbers might be and then fill in those gaps?
Maarit:These real-life problems are something that fascinates me. In time series analysis you need to know your data. With machine learning models, sometimes you can just take a bunch of data, give it to a model, and get some results. It's on you to think whether they make sense or not. But in time series analysis, the pre-processing part and the data access part are such a big part of the complete analysis that you need to think, what do I do? Okay, I'm missing one week of data. I have missing values. To make the model actually work, it's the analyst's task to say, okay, how do I replace the missing values? Where do I get more data so that it works? Or if I don't have data, maybe I change the model. This is also where the business analysts, or the people giving you the data, come together to make the whole project or the analysis work best. In the book that we are writing, and in the time series materials that we have, we put a lot of emphasis on this too. Okay, how do you start? What do you see from the data just by data exploration, and can you already use that for forecasting? Then at the end, we've built a model, but let's see if the model is actually bringing anything on top of that, or if the regular things that we find by data exploration already do the work.
Yusuf:So you're talking about things like, if we're missing one week in a year, we can't necessarily look at the week before and the week after, because there might be seasonality, which means we need to bring in the same week from the previous year and try to adjust it that way.
Maarit:Exactly. And this is where it comes to knowing the data.
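The adjustment Yusuf describes can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not a KNIME component and not a method taken from the book: the function name `fill_missing_week`, the `window` parameter, and the sample figures are all hypothetical. It takes the same week from the previous year and scales it by how much the surrounding weeks have changed year over year, then compares that with naively averaging the week before and the week after.

```python
from statistics import mean

def fill_missing_week(series, year, week, window=2):
    """Estimate a missing weekly value (hypothetical helper).

    Uses the same week one year earlier, scaled by the year-over-year
    ratio of the neighbouring weeks that exist in both years.
    Simplification: assumes week numbers do not wrap around a year boundary.
    """
    last_year_value = series[(year - 1, week)]
    neighbours = [
        w for w in range(week - window, week + window + 1)
        if w != week and (year, w) in series and (year - 1, w) in series
    ]
    if not neighbours:
        # No basis for an adjustment: fall back to last year's value.
        return last_year_value
    this_year_level = mean(series[(year, w)] for w in neighbours)
    last_year_level = mean(series[(year - 1, w)] for w in neighbours)
    return last_year_value * this_year_level / last_year_level

# Made-up weekly totals keyed by (year, week); week 12 is a seasonal peak.
series = {
    (2020, 10): 100, (2020, 11): 110, (2020, 12): 150,
    (2020, 13): 115, (2020, 14): 105,
    (2021, 10): 120, (2021, 11): 132,  # (2021, 12) is missing
    (2021, 13): 138, (2021, 14): 126,
}

# Seasonal estimate: last year's peak scaled by the year-over-year growth.
estimate = fill_missing_week(series, 2021, 12)

# Naive estimate: average of the week before and the week after.
naive = mean([series[(2021, 11)], series[(2021, 13)]])

print(f"seasonal estimate: {estimate:.1f}, naive estimate: {naive:.1f}")
```

With these made-up numbers the seasonal estimate comes out around 180, while the naive average of the adjacent weeks gives 135 and misses the peak entirely. Whether the year-over-year scaling is appropriate still depends on knowing the data, which is Maarit's point.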
Conor:And then the audience for your book Maarit, will that be aimed at beginners or intermediate level practitioners?
Maarit:Yes. Firstly, people who want to learn about time series analysis. We start there with some theory: what is seasonality, what is stationarity. They can also be advanced data scientists who just don't have the knowledge of time series analysis. Then, for the audience who are going to use time series analysis in practice, we have kept it on a very basic level. You don't need to be a KNIME expert to be able to start with the workflows in the book. If you can open a workflow and build your first bar chart, it should be enough to open the workflows in the time series analysis book and say, okay, let's learn this.
Yusuf:So you mentioned earlier that when undertaking a time series analysis exercise, one of the things that you need to make sure you do properly is understanding your data, exploring the data. More broadly, what do you see as the most important aspect of the data science lifecycle?
Maarit:Based on the experience that I have, and on what I just described, it is communicating and bringing the people together to work together on the whole process, as it is described by the life cycle. Maybe a quick review of the life cycle, what it is: it is a diagram that shows the dynamics in the data analysis process. It starts from data access, then data pre-processing, these yellow nodes. Then it goes to modeling. Then it also goes to deployment, because we don't build the model just for having a nice model; we want to make some use of it. And how these different pieces of the process follow each other and interact with each other is often not super clear to single people working, for example, on the modeling or on accessing the data. If I have a super good model, I cannot concentrate only on tuning it and taking it from 98% accuracy to 99. If my data is not representative of the population, for example, maybe I need to go back to the data access phase to make it actually work in practice. Or if I have a model that is good now, maybe I need to check it next month or next year: is it still good? And then go back to the beginning of the process.
Conor:So some of our listeners will be starting out on their data science journey as auditors. Are there any sort of common pitfalls or common traps, or any tips you might have for them about areas to focus on in that data lifecycle?
Maarit:I would say that the challenges for auditors are at the beginning of the life cycle. There is maybe not so much in the modeling part, but there is this data access and data pre-processing. One of the challenges is to get into the life cycle and let the workflow do things for you, and not say, okay, I have been doing this manually, let me copy and paste this Excel sheet, or let me just write the functions myself, or if I need something more complicated, I will ask someone else and forget about it. It is about seeing your own work as part of this process and saying, okay, I have my yellow nodes. I let them do the work that I have been doing manually so far. And if I need something more complicated, I go to the modeling experts. I let them help me and I let them complement my workflow. Then the workflow is still under my control, and I can build on that.
Yusuf:So you mentioned various aspects that are really important in the life cycle. But based on your experience across several projects, I imagine, and having worked with many people, especially when you're training lots of people who are at various levels of capability and understanding: what have you found to be, across all of those people and across all of your projects, the most challenging part of delivering a data science project?
Maarit:The start is often quite difficult, because people have preferences in terms of what kind of tools they use. And you need to make them speak the same language. I cannot say, "I have an idea: how do we get this model to 99% accuracy?" if the language that the business analyst speaks is, "I want to save five hours of my people's working time." So finding the common goals is maybe the first challenge. Then we say, okay, how are we going to reach the goals with this workflow? And it's about having a bit of patience there and seeing, okay, you are now using this tool. I may be using Python, and I could do this in two minutes. But let's build a workflow together. Maybe instead of five minutes it takes 15 minutes, and now we have something that is understandable to everybody. And it's the common place of communication.
Yusuf:So that's interesting. The actual technical aspects of the project are probably the easier part, once you decide you're going to be doing this and then actually going and executing it, pulling all the boxes together and linking them up.
Maarit:That's maybe also related to my role as a teacher, and to what kind of tasks I look for myself, but yeah.
Conor:I think your example was perfect. The example you gave of, we want to reduce the time spent by five hours. But somebody may have a different goal, and it's trying to bring those two goals together, so that you're working on the same objective, albeit from different perspectives.
Maarit:Definitely. That's also how, if I write a blog post and the blog post says this is the workflow, there always needs to be a story around it, so that the audience understands, okay, this workflow is actually exactly the same thing that could also work for my use case. And then the story needs to be something practical, like fraud detection or churn prediction.
Conor:Do you get many examples of users in the real world coming back to KNIME and saying, oh, we use KNIME for this particular problem, or to help us with this business issue, and it even surprises you guys at KNIME? Oh, we hadn't thought about how the software could be used for that use case. Do you see that much?
Maarit:Yes, because we have quite an active community. On the KNIME Hub we have workflows shared by the community, and it happens more often than people think that I look for something. Okay, do we have anything about marketing analytics, or about some specific technique? Then I look for it on the KNIME Hub and I see, okay, a user who is active in the community has created example workflows. Maybe these example workflows come from a request on LinkedIn or on the KNIME forum. So there is a lot of interaction going on, where the workflow is also the communication between the members of the community.
Yusuf:Yeah, that's interesting. I remember at one of the summits there was an individual that worked for a fire regulation authority that was using KNIME to translate documents into multiple languages so that they could share that with other countries. And not something that you would naturally think about in your normal day-to-day work.
Maarit:Yeah. Especially these use cases that I never thought of, I could build myself as an example. There are so many interesting things circulating.
Yusuf:It definitely goes beyond the "identify cat faces" example, which is interesting, but it's a bit overdone. In terms of people out there who would like to learn how to use KNIME themselves, how would they get started? If I've never seen it myself before, maybe I've heard of it but don't really know it, what's the best place to start?
Maarit:If you want to know what KNIME is and what use cases there are, we regularly have events and data talks, and maybe a good starting point is to go to our website and see when the next events are happening, so that you get to know the people working here, and get to know the customers and their stories of how they use KNIME. This is on the very high level, what the company is. But if you say, I want to use the software, I want to use the analytics platform right now and start with visual programming, then in the same place, from our website KNIME.com, you can land on a learning page. Depending on the type of learning you prefer, there are online courses. You can find me, for example, or one of my colleagues in the education team, and join a course where we go through the steps in learning the basics of data science or the basics of KNIME Analytics Platform. Or if you say, I can do it myself, there are also self-paced courses where you just watch videos and complete the same steps on your own. And we have webinars, we have books. So a good starting point for that is the learning page on our website. The software is open source, and on the first page of the website KNIME.com you can find the download button, and it's two or three steps to download it to your machine.
Conor:Fantastic tool. And one of the things from a non-data scientist, certainly not even close to being one, the workflow certainly gives a degree of transparency and understandability around the analysis, and you can see exactly what's going on in the logic chain, which is fantastic, I think, and would be really helpful to auditors.
Yusuf:For sure. Yeah. We've been using it since 2016 now, and it's so easy. You create a workflow and, I can show it to Conor and he can review it. Whereas if I had to give him a thousand lines of code, he'd point a gun at me very quickly.
Maarit:It becomes beautiful. You see your process that started with a mind map that is so chaotic on paper, and then you extract the different pieces from it and put them in the analysis itself, so that at the end it looks as logical as your thinking is.
Yusuf:Maarit, how can people read more about the work that you do, read some of your blog posts, or get in touch with you and the rest of the KNIME team?
Maarit:When you download KNIME, there is also the option to register for regular updates regarding blogs and upcoming courses. This is one good way of keeping in touch with what is being organized by KNIME and what is happening with the software and the people here. Every time you open the analytics platform, you will also see a welcome page. There we show you news: what blog post is coming out, what some new features are, for example. So we will keep you updated if you just keep using the software. We also have a Medium blog for low code data science, and LinkedIn and Twitter. We update there regularly, and the people working at KNIME are very active there.
Conor:Hopefully a lot of our listeners will jump on the KNIME website and get stuck into it. Maarit, great conversation. Thanks for your time.
Maarit:Thank you very much. I hope to see more people joining them and hope to teach you at some point as well.
Narrator:If you enjoyed this podcast, please share with a friend and rate us in your podcast app. For immediate notification of new episodes, you can subscribe at assuranceshow.com - the link is in the show notes.