Quality Bits

Data Advocacy, Machine Learning Testing, and Data Teams with Aurimas Griciunas

April 17, 2023 Lina Zubyte Season 1 Episode 17

With over a decade in data-related roles, working with Data Science, Data Engineering, Machine Learning, and MLOps Engineering, Aurimas Griciūnas is a great role model for many people wanting to learn more about data. The author of the SwirlAI newsletter with over 7000 subscribers, Aurimas' posts with visuals explaining data have reached many more people on LinkedIn.

In this episode, Aurimas shares valuable insights about what working with data means. Is it just a trend to be data-driven, or, a reality? Are all the data products big data products? Who and how should help define good business success metrics? Tune in to learn the answers to these questions and much more. 

Find Aurimas on:
- LinkedIn: https://www.linkedin.com/in/aurimas-griciunas/
- Twitter: https://twitter.com/Aurimas_Gr
- SwirlAI newsletter: https://www.newsletter.swirlai.com 

Mentions and resources:

Follow Quality Bits host Lina Zubyte on:
- Twitter: https://twitter.com/buggylina
- LinkedIn: https://www.linkedin.com/in/linazubyte/
- Website: https://qualitybits.tech/

Follow Quality Bits on your favorite listening platform and Twitter: https://twitter.com/qualitybitstech to stay updated with future content.

If you like this podcast and would like to support its making, feel free to buy me a coffee:

Thank you for listening! 

Lina Zubyte 00:00:07
Labas! Or... hello! Welcome to Quality Bits -  a podcast about building high-quality products and teams. I'm your host, Lina Zubyte. The language I started this episode with is Lithuanian - my native language. It's because in this episode I'm for the first time talking to a fellow Lithuanian - Aurimas Griciunas. We actually went to university together years back, and now Aurimas is a familiar face when it comes to advocating for all things data. In this episode, we're talking about various things data related. What does it mean that the data is high-quality? The trend of companies using data and machine learning and how we could test it. Enjoy this conversation.

Hi, Aurimas. Welcome to Quality Bits!

Aurimas Griciunas 00:01:20
Hello. Hello. Thank you for having me here.

Lina Zubyte 00:01:22
It's so great to talk to you again. It's funny because we studied together a long time ago in Vilnius and we both studied mathematics. So how are you doing? And could you give a brief introduction about yourself? What have you been up to since studying mathematics in Vilnius years ago?

Aurimas Griciunas  00:01:44
Yeah, we did study together. Good times. A long time has passed, actually. So what was it, ten plus years, right?

Lina Zubyte 00:01:52
Wow. I did not realize. But yes, we're getting old.

Aurimas Griciunas 00:01:56
Yes. We are. So, yeah, many things happened after finishing the studies. So we both studied financial and actuarial mathematics. Right? So I actually pursued what I was studying. So I started working as an actuary right after university. And from that day on, everything was connected to data for me in the last 10 to 11 years now. So I filled quite a lot of different roles when it comes to the data lifecycle. I was a data analyst, a data scientist, machine learning engineer, data engineer, data platform engineer and data platform engineering lead, data engineering lead, cloud engineering lead. And my latest gig was mostly connected with machine learning operations and helping people deploy machine learning models to production. And now I am a Senior Solutions Architect at Neptune AI. There we provide a SaaS platform that helps you track and manage your machine learning model metadata.

And I also have a side gig, my own company called Swirl AI, where I mostly produce educational content so that I can help other people to both get into data space and improve their skills in data space. And yeah, so in my free time, like in my out of work time, at this moment, I'm actually quite focused on my side gig. So I do spend a lot of time creating content and writing technical content, so I don't really have that much time for doing leisure activities. I am doing some sports. I have two dogs, so I do have some physical activity outside of my job, but most of the time I'm doing work.

Lina Zubyte 00:03:54
Your educational passion and advocacy is the reason why I actually connected with you again. Speaking of free-time hobbies and sports, I went to a workout with a former colleague of mine. We worked together in Budapest and she is a data engineer, and she somehow mentioned, Oh yeah, there's this other Lithuanian that I follow. And he posts so many things about data. I'm like, Excuse me, wait, is his name Aurimas? And she's like, Yeah, yeah, yeah, I think so. And this is how I was like, Yeah, we should talk about this, we should talk about data and quality. And I also still love data, even though I did not work as an actuary. So what made you start educating the community and sharing knowledge publicly? What's the motivation here?

Aurimas Griciunas  00:04:49
Actually, it was kind of when I was working at HomeToGo. I was already in a leadership role for 3+ years before I actually left. And I did spend quite a lot of time just speaking to members of my teams about technical things, educating them. And I already had quite a broad understanding of the entire end-to-end data lifecycle in the data systems, both data engineering and machine learning related. And when I left HomeToGo, I actually went on to do straight hands-on work in machine learning operations. And there was this kind of, I would call it an empty space that I wanted to fill, to actually put out the knowledge I had. And at that time, I also met the person who was my team leader in that company.

He had already been producing content on LinkedIn about machine learning operations for some time. So I decided, why not try it out as well? And I tried it out. I was writing mostly about organizational stuff rather than deep technical things. I did that for several months and then I decided to actually go deeper into the technical bits of it. It became kind of a hobby of mine because I really liked writing, and I understood that the best way to actually move the industry forward is to educate others, because I can't do much on my own. So I wanted to also do good for the industry. And people started liking my writing and the content I produced. Much of this content is actually something that I learned throughout my career, and that is not really easy to find in books or blog posts, like in a single place. So it's not condensed very well. And I think that's why people started liking it. And I wanted to share these things because it really doesn't make sense for me to hold them in my head and not educate others. Yeah, and then it took off quite well and I got a couple tens of thousands of followers on LinkedIn. And then I continued doing that. And I think that educating others is kind of the best I can do to bring the industry forward. So that's why I continue doing it. And my goal is to help others level up in everything that is data. So it's probably more data engineering, data analytics, data science, machine learning, etc. Yep, that's my motivation.

Lina Zubyte 00:07:42
That's a great motivation. I'm extremely happy to see your posts pop up in the feed. I feel like they are clearly put, and the visuals always help. I'm a big fan of visualizations, and simplifying concepts that can seem so scary is such a nice thing to do for the community. I'm sure it's extremely helpful. To start from scratch, though. When I'm thinking about data, I remember all the companies that say we're data-driven. We're amazingly great with data. We make all our decisions data driven. Data became a little bit like greenwashing because we are saying that, yeah, we care so much about it, but do we actually? Of course, data is such a vital part of businesses, but more often than not, we just hear how companies aim to be data driven, but in practice they're not really doing it. What is your opinion about that? Do you see a trend in companies embracing data more nowadays? Is it reality or is it just some kind of marketing trick?

Aurimas Griciunas  00:08:57
I do think that it is becoming, for most companies, the reality nowadays. But probably we need to understand that not all businesses have data at the core of their function, and some of the businesses also depend on legacy systems that they have. And maybe becoming more data driven does not make much immediate sense to them, because maybe they are in a legacy business model where other competitors in the industry are not data driven either. So sometimes it does not make sense to become data driven today, and it is better to just wait for others to pave the way. But if the company is a new company, not necessarily new but new-ish, and doesn't have that much legacy, I do believe that such companies do not necessarily start as data driven companies, but very quickly become ones.

And when you say data driven, it's usually several things that people mean. There are several layers of being data driven. So the first one is simpler: being able to describe your business, right? So data analytics. Can you describe how your business behaves? Can you effectively measure your financial results? And this is, I think, where most companies are doing at least a decent job. Of course, it does involve moving data around, and moving data around has never been easy. But with the rising understanding of the importance of data engineering in the industry, I think that in the upcoming years it will become more and more regular to see companies doing well at descriptive analytics.

Of course, when it comes to predictive analytics, which is mostly data science and machine learning, where you have to apply statistical methods to forecast the future, it is a lot harder. I wouldn't say there are not many good practices in the field, but the good practices are still very new. And that also comes to what I have been doing in the past two years, which is machine learning operations. We help organizations with embracing machine learning at scale. So there is a set of practices, tooling and organizational structures required to do that. And companies, unfortunately, are not natively built for that because of how all of it started. I was also in such a situation once in my career: everyone wanted to do data science and create some value out of it, even though they wouldn't understand much about it. And that comes down to the leadership deciding to hire a data scientist into a company and just saying: do something with the data, bring us value, and whatever that is, is up to the data scientist to decide. I was in that situation once, and unfortunately there are, or at least were, many things that organizations wouldn't consider before doing that. One of those would be, of course, data not being in the correct format and in the correct place for data scientists to leverage it. And the next one would be, once the data is already there and the machine learning models are already trained, how do you actually expose those machine learning models to the end user? So the users of whatever your organization is actually creating. But nowadays, with the rise of machine learning operations, people are starting to understand how all of this has to happen, and it's becoming apparent that more and more companies are becoming data driven.

Another thing is the rise of this hype over foundational models and large language models. So everyone now wants to grab a bit of the goodness that it, kind of, not offers but promises. So we'll see how this goes. But I think that there's also quite a lot of innovation happening around large language models: how you can actually leverage them in your business, how you can productionize it, how you can deploy all of those products leveraging large language models, and what practices are needed to make a specific use case work for your business. Yes. So this is a short summary of my thoughts around being data driven. And of course, I know that this podcast is pretty focused on quality. So in recent years, data quality has also become quite a hype. Not necessarily in recent years, but I would say last year, actually. Before, it wasn't that much. So hopefully we can also improve on that front, but it will take time.

Lina Zubyte 00:14:18
What would you say is a way to increase data quality in general, especially if we work with big language products?

Aurimas Griciunas  00:14:29
So when it comes to large language models, data quality is maybe different compared to what you would need in regular data products. So large language models only work on text, right? So you will need a quality, cleaned-up text basis for your product, something that is not yet present on a global public network, if you want to actually reap some additional value that large language models cannot provide on their own. Because if you're just using, let's say, ChatGPT or the ChatGPT APIs, then you don't actually need any quality data, because it already provides its capabilities to you out of the box. But if you want to fine tune, for example, a large language model or implement any other capabilities, like maybe a documentation search that is specific to your documentation, you will basically need to have a bulk of quality text information. So it's not that difficult. You will probably already have it because you are using it for your product in some way or another. But you will need to do some well-known NLP cleanups on this text.

The larger movement around data quality is when it comes to regular data sets that are not text. The idea here and the sentiment here is that you can't have data quality unless it is aligned on at the top layers of management in your organization. Because if you are specifically focusing on the business value that you produce, like you focus on some business metrics that you want to optimize for, you will never put high emphasis on something that is not directly contributing. And with data quality, it's really hard to show how it directly impacts your balance sheet bottom line. And in order to actually implement these data quality best practices in your organization, you will need to make your leadership understand that it is important. And it only becomes important after a certain amount of time, because initially it is very easy to bootstrap your first version of a data driven product by just not caring too much about clean data pipelines or data contracts, because you can just dump some data somewhere.

You can create the PoC model on top of it and you can actually even deploy it and it will bring some value. 100%. But if your organization starts growing and your data warehouse starts growing and you don't have stable, clean data pipelines that ensure data quality, then it becomes really hard to horizontally scale data products. And when I say that, I mean specifically having more machine learning models of different types running in production, or even dashboards, correct dashboards that display certain business metrics to your management, or just data products that are based on reverse ETL. So you calculate something in your data warehouse and then you push that aggregated data back into your operational system. Yeah. So it's all about, I would say, horizontal scale and alignment at the top layer of your management so that you can actually spend time on data quality.

Lina Zubyte 00:18:14
When it comes to data quality, what I think of usually is no duplications, some kind of structure, schemas, data schemas as well. What else is there? What would you say is actually data quality in a sense? What is for you, maybe, a good sample of data? Is it just like having it structured and standardized or is there something else as well?

Aurimas Griciunas  00:18:41
I would say that data quality is extremely important in two parts of your data pipeline. So what you just mentioned is data samples. So it's samples used for producing either dashboards or charts. And then there is the other side of it: operational data. Because if you deploy a machine learning model to production and it is using online data, then you also have to ensure data quality while the data is being produced and not only when it is actually piped through the pipeline. But coming back to your initial ideas about schema imputation, non-missing values, probably normalized field names, etc. So this is one part of the data quality spectrum.
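As a rough illustration of the basic checks mentioned so far (schema and types, missing values, duplicates), here is a minimal Python sketch. The field names, the `EXPECTED_SCHEMA` mapping and the `validate_rows` helper are hypothetical and were not discussed in the episode; real pipelines would typically rely on a dedicated validation tool.

```python
from typing import Any

# Hypothetical expected schema: field name -> Python type (an assumption for this sketch).
EXPECTED_SCHEMA = {"user_id": int, "country": str, "revenue": float}


def validate_rows(rows: list[dict[str, Any]], key: str = "user_id") -> list[str]:
    """Return human-readable problems: missing values, wrong types, duplicate keys."""
    problems = []
    seen_keys = set()
    for i, row in enumerate(rows):
        for field, expected_type in EXPECTED_SCHEMA.items():
            value = row.get(field)
            if value is None:
                problems.append(f"row {i}: missing value for '{field}'")
            elif not isinstance(value, expected_type):
                problems.append(f"row {i}: '{field}' has type {type(value).__name__}")
        row_key = row.get(key)
        if row_key is not None and row_key in seen_keys:
            problems.append(f"row {i}: duplicate {key}={row_key}")
        seen_keys.add(row_key)
    return problems


# Made-up sample data with one missing value, one wrong type and one duplicate key.
rows = [
    {"user_id": 1, "country": "LT", "revenue": 10.0},
    {"user_id": 2, "country": None, "revenue": 5.5},
    {"user_id": 1, "country": "DE", "revenue": "7"},
]
problems = validate_rows(rows)
for problem in problems:
    print(problem)
```

Checks like these cover the data as it sits in a batch; validating operational data at the moment it is produced would need the same rules applied in the producing service.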

Then we have things like SLAs, which are as important as the schema, missing values, data types and other qualities of the data. So SLAs would include when you expect, for example, all data up until 12 PM of yesterday to be present in your data warehouse today. Because usually you would send out the business reports to all of the stakeholders, let's say, at 6 AM or 8 AM today, and they have to include all of the data from, as an example, your website that happened yesterday. Then you could think about things like what the expected distribution of a certain data set is, like what the distribution is between countries that you are serving your product to, just as an example. And specifically when it comes to machine learning products, it is very important to track these distributions, because it is a statistical model that you eventually deploy to production, and that means that it is heavily based on the distribution of the data that you fed into it when training the machine learning model. And if, for example, the distribution of your features starts shifting in production, it might have a very high impact on the results that these machine learning models produce. In the industry this is called detection of model drift and data drift or feature drift. So this is an additional kind of data feature that you would want to track. However, this is not necessarily connected to data quality. It's more about the behavior of your data. But yeah, so these are mainly all the things that you probably would want to track.
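One common way to quantify the kind of distribution shift described here is the population stability index (PSI). The sketch below is an illustration chosen for this write-up, not a method named in the episode; the country data is made up, the helper names are hypothetical, and the 0.2 threshold is only a frequently quoted rule of thumb. It assumes both samples are non-empty.

```python
import math
from collections import Counter


def category_shares(values, categories):
    """Share of each category in the sample, floored to avoid log(0)."""
    counts = Counter(values)
    total = len(values)
    return {c: max(counts[c] / total, 1e-6) for c in categories}


def population_stability_index(expected, actual):
    """PSI between a reference sample (e.g. training) and a new sample (e.g. production)."""
    categories = set(expected) | set(actual)
    e = category_shares(expected, categories)
    a = category_shares(actual, categories)
    return sum((a[c] - e[c]) * math.log(a[c] / e[c]) for c in categories)


# Made-up country feature: the mix seen in training vs. the mix seen in production.
training = ["DE"] * 50 + ["LT"] * 30 + ["PT"] * 20
production = ["DE"] * 20 + ["LT"] * 30 + ["PT"] * 50

psi = population_stability_index(training, production)
print(f"PSI = {psi:.3f}")
# 0.2 is a rule-of-thumb threshold, not a universal constant.
if psi > 0.2:
    print("Feature distribution has shifted; worth investigating.")
```

For numeric features the same idea applies after binning the values, and a statistical test could be used instead of a fixed threshold.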

There are also things that are really hard to discover. When you, for example, validate for schema changes and even feature distributions, you might miss some important data quality issues that are not easy to find. For example, if you have a dataset that has some categorical features and, let's say, one of the categorical features has 10,000 different values that it can be represented with, it might happen that some of these categorical values are very minor in amount. Let's say only ten users are from, I don't know, let's say, Portugal, right? And then it is really hard to catch data inconsistencies that only happen in these small underrepresented regions, because you would have to analyze them separately. So you could have, let's say, 80% of your population that behaves correctly. What behaving correctly means to you is also dependent on your business. But there might be inconsistencies in smaller regions, and you usually can't catch these in your regular data pipelines using tools like Great Expectations or dbt-expectations. So you need something more powerful that you can actually mount on top of your data warehouses to find these data inconsistencies. Usually there are some products that provide machine learning capabilities on top of your data sets, and they intelligently search for these data inconsistencies. And I have experience with certain products that work very well here, and they basically surface something that would not be visible to tools where good data quality has to be manually defined.
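To make the underrepresented-segment problem concrete, here is a small hand-rolled Python sketch: the overall null rate looks healthy, while one tiny segment is almost entirely broken. The data, field names and the 0.5 threshold are invented for illustration, and this is a manual check, not the machine-learning-based tooling mentioned above.

```python
from collections import defaultdict


def null_rate_by_segment(rows, segment_field, checked_field):
    """Fraction of rows with a missing `checked_field`, computed per segment."""
    totals = defaultdict(int)
    nulls = defaultdict(int)
    for row in rows:
        segment = row[segment_field]
        totals[segment] += 1
        if row.get(checked_field) is None:
            nulls[segment] += 1
    return {s: nulls[s] / totals[s] for s in totals}


# Made-up data: 990 healthy rows from one country, 10 rows from a tiny segment.
rows = (
    [{"country": "DE", "price": 100.0}] * 990
    + [{"country": "PT", "price": None}] * 8
    + [{"country": "PT", "price": 80.0}] * 2
)

overall = sum(r["price"] is None for r in rows) / len(rows)
per_segment = null_rate_by_segment(rows, "country", "price")
# Illustrative threshold: flag segments where more than half the values are missing.
flagged = {s: rate for s, rate in per_segment.items() if rate > 0.5}

print(f"overall null rate: {overall:.3f}")
print(f"flagged segments: {flagged}")
```

With thousands of categorical values, writing and maintaining such per-segment rules by hand quickly becomes impractical, which is the gap the automated products are meant to fill.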

Lina Zubyte 00:23:41
I actually worked with a machine learning product in the past. It was a chat bot, and I really like that you mentioned having a diverse set as well, that the distribution of your data really matters, because it is what feeds the model and that really affects the model. So of course we can remember examples like Tay, the Microsoft bot that became racist within like 24 hours on Twitter. If we feed bad data, then, you know, we will get the results accordingly. So this is like a super important part. And I always say that I believe in some kind of open source future, because only then we can make a sample that's not just, you know, me, myself. And then I thought of something and then, you know, the chat bot speaks like me. What are your thoughts about the open source aspect of data? Because now, for example, when we think about ChatGPT or Alexa, they have really powerful data samples, but that is their moneymaker.

Aurimas Griciunas  00:24:49
Yeah, so actually large language models, or rather GPT models, have been trained on the entire public internet, right? So it's not like they have something proprietary. It's just all that we fed to these models and trained them on. I do think that this is the future because these models will be available to everyone. It's not like they will not. So the only thing that you will need if you want to create something custom is your own extra dataset that is not exposed on any website, because if it is on the public internet, that means that GPT models have it. So either you make your data private and not accessible from a public network or you don't even need to bother because it's already in those models. And the future is that we will be using large language models for most of the NLP applications, including chatbots. I don't think that anyone will be using any custom chatbots that are not large language model based anymore. So we need to kind of adapt to this reality from today.

Lina Zubyte 00:26:01
Yeah, and it needs lots of data, lots of diverse samples.

Aurimas Griciunas  00:26:07
Not necessarily. I wouldn't say necessarily lots of data, because there are two ways you can use large language models today. If you want, for example, a chat bot, you don't even necessarily need to fine tune the model. So if you need a chat bot that could search your documentation more efficiently, as an example, you can simply use your own embedded documentation text as a data store and use it to actually search for whatever you want. Or you can, in a more complicated case, fine tune the model, which is also something exposed via some ChatGPT APIs that you can do online. But it doesn't mean that it has to be huge amounts of data. It only means that it has to be something that is proprietary to you.

Lina Zubyte 00:27:03
And depends on your goal, I guess. When it comes to testing products that involve a lot of data, how would you test a product like that? We could talk about machine learning ones or big data ones. Some of the more challenging ones.

Aurimas Griciunas  00:27:22
So again, it doesn't need to have big data. It can be a small-data-based model. How you usually test machine learning models in production is you come up with a PoC model - basically the first model that you would like to test. And you would then have a baseline, which is your regular operations without the machine learning model deployed to production. You would use techniques like A/B testing as an example. There are more, but let's focus on A/B testing now. So you deploy your machine learning model on maybe 50% of your customer traffic and you keep the remaining 50% of your customers on your baseline, which is basically no machine learning model. And then you collect statistics around it and you estimate which case behaves better and produces more business value. And when it comes to business value, you need to understand what you want to optimize for. And usually what businesses want to optimize for is revenue. And revenue can come in multiple forms and shapes. It can be just cash, so basically the money you generate, pure revenue. It could be conversion rate, for example, how many click-throughs happen on your website. It could also be how well your model, if it's a computer vision model, detects objects. That is not directly tied to your revenue, but if it is something you're providing to your customers, then you can try and evaluate things like how many customers remain with your product and how many customers churn from your product if there are inconsistencies in the quality of the product that you provide. And for this to happen, you want to actually expose these models to production and do A/B testing. You also need quite solid data engineering flows after that, because you need to collect the data that is being produced by your product. It could be your website or a device that you provide or something similar. So there also have to be high quality, robust data engineering pipelines that collect this data and then, maybe daily or weekly or every hour, aggregate the data to actually calculate these business metrics and apply some statistical analysis on top of it, to understand if your tests are statistically significant and whether you can actually accept or reject the hypothesis that, let's say, your machine learning model is producing better results than a non machine learning based system.
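The statistical analysis step can be sketched with a two-proportion z-test on conversion rates, one standard way to check whether a difference between control and treatment is significant. This is an illustrative choice rather than the specific analysis used by the speaker; the function name and all numbers below are made up.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates between groups A and B."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up numbers: A = baseline without a model, B = traffic served by the ML model.
z, p_value = two_proportion_z_test(conv_a=500, n_a=10_000, conv_b=580, n_b=10_000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
if p_value < 0.05:
    print("The difference is statistically significant at the 5% level.")
```

The z-test assumes reasonably large samples and a conversion-style metric; revenue per user or other continuous metrics would usually call for a different test.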

If it does produce better results, then you basically shift your system entirely to the machine learning based product. So you don't expose the non machine learning based system to your users anymore. And then you start iterating on the machine learning model that you want to expose. So you create a new version of it. You maybe collect more data, more quality data that you would use to train that model. Once you have a new version of your model, you can deploy it to production on 50% of traffic while keeping the old model also running on 50% of traffic. In the industry these are usually called control and treatment groups. And you apply the same procedure as you did before. You make a hypothesis that one of the models will produce more business value, and if it does over time, then you simply swap the two models, you retain the best performing one in production and you continue iterating on training new models endlessly. And by doing so, you basically just ramp up your business value continuously.

There are more techniques for doing this apart from A/B testing. There are techniques like interleaving experiments or a popular one, multi-armed bandits, where you wouldn't just expose your system to two machine learning models, but rather to multiple. One of them could be running on 90% of traffic and, let's say, 10 additional models would split the remaining 10% of traffic, and then they would automatically switch which one is exposed to 90% of traffic based on some business metrics. So the best model would always be exposed to the largest amount of traffic, and the remaining ones would dynamically shift in and out if one of them became better than the one currently considered the best.
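The 90/10 traffic split described here resembles an epsilon-greedy strategy, one of the simplest multi-armed bandit approaches. The sketch below is a minimal simulation under that assumption: the `EpsilonGreedyRouter` class, the model names and the conversion rates are hypothetical, and production systems often use more refined strategies.

```python
import random


class EpsilonGreedyRouter:
    """Send most traffic to the best model so far; split the rest among the others."""

    def __init__(self, model_names, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.successes = {name: 0 for name in model_names}
        self.trials = {name: 0 for name in model_names}

    def rate(self, model):
        # Untried models count as 0.0 here; real systems handle cold start more carefully.
        return self.successes[model] / self.trials[model] if self.trials[model] else 0.0

    def best_model(self):
        return max(self.trials, key=self.rate)

    def choose(self):
        best = self.best_model()
        others = [m for m in self.trials if m != best]
        if others and self.rng.random() < self.epsilon:
            return self.rng.choice(others)  # the ~10% exploration share
        return best  # the ~90% share goes to the current best model

    def record(self, model, converted):
        self.trials[model] += 1
        self.successes[model] += int(converted)


# Made-up true conversion rates, used only to simulate user behaviour.
TRUE_RATES = {"model_a": 0.05, "model_b": 0.15, "model_c": 0.02}
router = EpsilonGreedyRouter(TRUE_RATES)
outcome_rng = random.Random(1)
for _ in range(20_000):
    model = router.choose()
    router.record(model, outcome_rng.random() < TRUE_RATES[model])

print("best model:", router.best_model())
print("traffic per model:", router.trials)
```

The business metric here is a simple conversion flag; any metric that can be attributed to a single request could take its place.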

Lina Zubyte 00:32:16
It's really cool to learn more about that, especially about experiments. And I think more and more companies are embracing AB tests, which is awesome. And this is how they say they make data driven decisions.

Aurimas Griciunas  00:32:30
Yeah, and maybe one more important point. You don't need to have a machine learning model to do an A/B test. The A/B test is an extremely old concept and it was for a long time used to test different front end elements in your website. For example, you can change the coloring to see if it works.

Lina Zubyte 00:32:48
A different color button: which one performs better? For sure. Yeah, we do quite a lot of experiments in my company as well. What I'm wondering about is this aspect that you mentioned about having good business metrics, having good understanding of what success means for us. What advice would you give for defining good business metrics?

Aurimas Griciunas  00:33:10
Business metrics are never defined by data scientists themselves or even teams that do data science. Usually, business metrics are defined by business units and even upper management. I would say there are pretty clear standards around what needs to be measured. So I've seen things like click-through rate, revenue per user in the last seven days, sometimes user satisfaction. This is a very interesting one, maybe because, as I mentioned, for example, in chat bots it is really hard to understand what you should measure. So companies like Google, in their search and similarly in, for example, their translation apps, would have these buttons that allow users to leave feedback. Was it a good translation? Was the answer correctly formulated for the question? And similar. So these are also very important pieces of infrastructure. But yes, eventually it is whatever drives the business forward.

So maybe a very important thought here is that you shouldn't necessarily overfocus on a specific revenue based metric, because sometimes there are side effects. A very good example: if you say that revenue is the only thing that you want to optimize for, it could be that your entire system becomes unbalanced. As an example, if you're selling something or allowing people to rent objects, houses or similar, the system might optimize for showing the most expensive items to the user. And that might mean that you are making the most money. However, that does not mean that you're maximizing user satisfaction. So people might start leaving your website because they are only shown the items that are very expensive. So even though revenue is maximized in the short term, in the long term you might start losing your audience because you're not showing them what they want. You need to understand where you want to go in the long term rather than just the short term and optimize for that. So it's kind of a complex topic.

Lina Zubyte 00:35:32
That's a very good example and it makes me think of biases and the ethical aspect, which could be a different conversation and likely a long one, because we for sure could have algorithms winning over, which are just looking at one aspect of metrics. When we look at metrics, we need to look at multiple dimensions to make sure they work. But yeah, we see lots of examples. For example, even booking.com, right? They play so much with the ethical aspect, it's almost sold out. Only one room is left and then you're like, Oh no, I have to buy now. And this is the whole big conversation. How we could make some products that like charge for certain class or even the amount of money you earn and then decide the customers and you're losing others. I'm wondering also that you mentioned at the start that sometimes companies hire a data team and they say, okay, now, you know, tell us things. Is that also one of the common misunderstandings that they hope the data team to figure out the metrics when they don't understand the success themselves?

Aurimas Griciunas  00:36:44
If they hire a data team, that's already a good sign, because what I've seen is one data person being hired and told, "Figure out everything on your own." I was one of those data scientists who was once hired that way, but I know more people, a lot more people, that were thrown into this kind of situation. But it was a long time ago, at the rise of data science. I don't remember how long ago, but I think 6 to 7 years back. Now, if you hire a senior enough data person, and when I say senior, I don't mean a person who can code well or who knows machine learning well. A senior person to me is the person who takes time to understand the business and how it works and what kind of, as mentioned, metrics you should optimize for. I think that we are in such times when this person could actually lead the initiatives, as long as she or he has strong support from management. So there's also always space to understand what the business needs. But then again, this is a tip for anyone who is hired to basically bootstrap data value in the company. It's not only about tooling and infrastructure and how much data you have. It's about understanding what your business actually needs. So you need to be able to go to the business units and talk to them, and not just come up with a nice shiny idea, start implementing it, and then eventually understand that no one really needs what you have created.

Lina Zubyte 00:38:27
Right? That's a very common challenge: building something that we think people need that they actually don't. So to wrap up this conversation and the topics you talked about, what is the one piece of advice you would give for building high quality products and teams?

Aurimas Griciunas  00:38:46
These are two separate questions. So when it comes to teams, I actually gave a talk recently about the effective organizational structure for machine learning operations. So the ideas are quite standardized: when it comes to machine learning, you need to have end-to-end machine learning teams. So teams that can deliver the data product from the very beginning to the very end. So they should have the ability to extract data, manipulate the data, build machine learning models themselves and deploy them to production. And there should be no handovers happening in this flow of work. And there needs to be a very efficient feedback loop from the products that you have deployed to production. So you need to be able to monitor your running products, meaning that you should see and be able to react to things like latencies, error codes, etc. You would also need to be able to see any feature and concept shifts happening in production so that you can react swiftly. And you should also have online testing implemented from day one, meaning that you need to be able to understand if your machine learning products are delivering value against a non-machine-learning-based system. So this is what is important when building teams.
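
The feedback loop described above can be sketched in a few lines, under stated assumptions. The helper names `psi` and `lift_z_score` are hypothetical, and the techniques are swapped in for illustration rather than taken from the episode: the Population Stability Index is one common way to flag a feature shift, and a two-proportion z-test is one simple way to compare a machine learning variant against a non-ML baseline in an online test. The thresholds in the comments (0.2 for PSI, 1.96 for the z-score) are widely used rules of thumb, not values the speaker gave.

```python
import math


def psi(expected, actual, bins=10):
    """Population Stability Index between a training and a production sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # values outside the training range are clamped into the edge bins
            counts[min(max(idx, 0), bins - 1)] += 1
        # a small floor avoids log(0) for empty bins
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def lift_z_score(control_conv, control_n, treat_conv, treat_n):
    """Two-proportion z-score: ML variant (treatment) vs non-ML baseline (control)."""
    p_control = control_conv / control_n
    p_treat = treat_conv / treat_n
    pooled = (control_conv + treat_conv) / (control_n + treat_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / control_n + 1 / treat_n))
    return (p_treat - p_control) / se


# Made-up data: a feature as seen in training, and the same feature shifted in production.
training = [i / 100 for i in range(100)]
production = [v + 0.5 for v in training]

drift_same = psi(training, training)       # identical samples, no shift
drift_shifted = psi(training, production)  # rule of thumb: above 0.2 is worth a look

# Made-up online test: 100/1000 conversions for the baseline, 130/1000 for the model.
z = lift_z_score(100, 1000, 130, 1000)     # rule of thumb: |z| above 1.96 is roughly a 5% level

print(drift_same, drift_shifted, z)
```

A real setup would run checks like these on a schedule against production logs and alert the owning team, which is the "react swiftly" part of the loop.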

If you are scaling a lot and you have multiple machine learning teams delivering multiple machine learning products to production, then you should also think about how you can alleviate stress from these teams so that they can focus on working on what matters, which is understanding a business problem and how to tackle it. And you usually do that by eventually abstracting the infrastructure and tooling away from the team by creating a machine learning platform team, which is then responsible for providing machine learning capabilities to other teams, so that those other teams can very easily, and without too much stress, implement a machine learning flow.

And of course, when it comes to more general data products, definitely focus on quality. If you're just starting with machine learning, or not necessarily machine learning but data-based products in your company, always consider the fact that you should start by hiring senior people and not mid-level, which sometimes happens, by the way. And being senior, as I mentioned before, is not only about tooling and technologies, it's also about understanding, or at least wanting to understand, the business. Yep. And that's, I guess, that's it.

Lina Zubyte 00:41:47
Great. Thank you so much, Aurimas.

Aurimas Griciunas  00:41:50
Yeah, thank you. Thank you for having me here.

Lina Zubyte 00:41:53
And that's a wrap. Thank you so much for listening. If you're here till the very, very end, wow, kudos to you. And I'd like to ask you a little favor. It will take you just a little bit of your time. Rate this podcast and share it with one friend that may be interested in data or any other episode that we have and spread the word so that we can continue caring and building those high quality products and teams. See you next time.

Motivation of spreading knowledge on data topics
Data-driven companies: trend or reality?
How could we increase data quality?
Machine learning models and open-source aspect
Testing machine learning or data products
Business metrics: who defines those and what makes sense?
Advice for building high-quality products and teams