Data Brew by Databricks

Welcome to Data Brew by Databricks with Denny and Brooke! In this series, we explore various topics in the data and AI community and interview subject matter experts in data engineering/data science. So join us with your morning brew in hand and get ready to dive deep into data + AI! For this first season, we will be focusing on lakehouses – combining the key features of data warehouses, such as ACID transactions, with the scalability of data lakes, directly against low-cost object stores.



Data Brew Season 2 Episode 8: Feature Engineering

July 08, 2021 • Databricks • Season 2 • Episode 8


For our second season of Data Brew, we will be focusing on machine learning, from research to production. We will interview folks in academia and industry to discuss topics such as data ethics, production-grade infrastructure for ML, hyperparameter tuning, AutoML, and many more.

Is there ever a “one-size fits all” approach for feature engineering? Find out this and more with Amanda Casari and Alice Zheng, co-authors of the Feature Engineering for Machine Learning book.

See more at databricks.com/data-brew

Denny Lee (00:06):
Welcome to Data Brew by Databricks with Denny and Brooke. This series allows us to explore various topics in the data and AI community, whether we're talking about data engineering or data science. We will interview subject matter experts to dive deeper into these topics. And while we're at it, we will be enjoying our morning brew. My name is Denny Lee. I'm a Developer Advocate at Databricks and one of the co-hosts of Data Brew.

Brooke Wenig (00:30):
And hi everyone. My name is Brooke Wenig, the other co-host of Data Brew and Machine Learning Practice Lead at Databricks. Today, I have the pleasure to introduce Alice Zheng and Amanda Casari to Data Brew. The two of them co-authored the Feature Engineering for Machine Learning book. Alice is a Senior Manager of Applied Science at Amazon, and Amanda is a Developer Relations Engineering Manager at Google. Amanda, how about let's get started with how you got into the field of machine learning?

Amanda Casari (00:55):
Sure. So I was a graduate student in 2009, 2010, and I was studying applied math and computational modeling at the University of Vermont, and working in areas around complex networks. And then, out of school I did a few jobs and then I went and worked at a place where I was a consultant. And there, I was helping to start the data science practice. And it was here on the east side, very close to Seattle. And I actually met Alice through an open house that the startup she was at was having. And so, I was working as a data scientist in a consultancy group as kind of my first job in industry with machine learning. But that's where I met Alice too.

Brooke Wenig (01:36):
That's very interesting. I don't think I've ever heard of open houses for software engineering companies before. So Alice, I think this is a great transition. How did you get into the field of machine learning?

Alice Zheng (01:46):
I got into it as a compromise. So straight out of undergrad, I applied to grad school and I was trying to decide what to study. So in undergrad, I was a double computer science and math major. So, I like math. I liked hacking on hardware. And I was kind of okay with programming. I think software was probably my least favorite, but I always like math, software, and hardware. Perfect one would be robotics, but it's too hard. I would spend six years of my life soldering. And then, I'd have to figure out what to write a thesis on. Like, "No, robotics is out. What's next? Okay. AI, it is." No hardware, but I get software and math. So, that's how I went into machine learning. And having one of the best ML experts at Berkeley, Michael Jordan. He had just joined a couple of years prior to that. That certainly helped to cement my decision.

Denny Lee (03:01):
That actually is hilarious especially because when you said compromise, I thought it was something similar to mine, which is basically I ended up doing a Master's of Biostatistics to make sure my parents were happy that I could claim that I went to medical school. That was the entire basis of it. So anyways, that's not the real question here. So, let's definitely talk about what inspired you to write the Feature Engineering book in the first place. At the time, I believe when you were starting, I believe Amanda, you were still at SAP Concur. I believe Alice, you were still at Turi at the time you had the startup, right? So yeah, you're obviously really busy. So, what made you decide to go and still write a book on this stuff at the same time?

Alice Zheng (03:46):
I can take that. So, the idea occurred to us as I was working at Turi. At the time, Turi built a machine learning platform. And among other things I helped to do customer outreach and education. And I realized that people talk a lot about machine learning, about models, sometimes about data. But feature engineering was one of those areas that for anyone who's worked in the industry, we know how important it is, but there's very little writing or talks or just general dialogue around it. So I thought that was an important topic and got very excited about the subject, so decided to write a book.

Denny Lee (04:44):
Cool. And then Amanda, I'm going to switch it to you. I believe you were leading up SAP Concur Labs I think, or SAP Labs at Concur at the time. So yeah, what made you decide to partner with Alice on this book?

Amanda Casari (04:58):
I think I remember going out for a coffee with Alice, and her describing the book process. And I think it was more of a, I was like, "If you need a technical reviewer, I'm super excited to do things like that." Mostly because I'm really nosy and I like to read things before they come out to the general public. And I just thought it sounded like an awesome idea. And I liked the way that Alice was laying out her vision for how the book would flow, which was different than what I had heard from other tech books. And every time I talk with Alice about feature engineering, I learn something new. So it was definitely one of those [inaudible 00:05:34] was like, you mean, I get to read a friend's book as we're going along and learn how their brain works. This sounds great. I would love to do that.

Amanda Casari (05:40):
And then, I think after a few revs on some chapters, we started talking about it, and Alice was interested in not working on it by herself. I think at some point it was the... It's hard. It's hard to try to do those kinds of projects by yourself. Yeah. So, she asked if I wanted to help kind of with wrapping things up and working on different parts of it. I think at one point there was definitely some cursing involved on both of our sides, which was great. So, it was a lot of fun. It was a lot of fun to finish it up together and definitely still feel the pride in kind of pulling things together. There's a lot of math jokes in the book that I wished that I had written, but I know Alice was like, it was such fun to read. So yeah, it's hard to say no to something like that.

Denny Lee (06:28):
That's awesome. I love the idea of math jokes. I'm going to have to actually make you pull some up. But before I do that, actually I do want to ask the quick question: okay, well, then based on what you've written, are there any key insights or tips that you want to provide, at least from a high-level overview, of what you would like people to understand and why they would want to read the book in the first place?

Alice Zheng (06:55):
So for me, I think the thing that I really want people to take away from Feature Engineering and Machine Learning Modeling in general is that they are connected. The best features depend on the model that you're using and what is the input data. So you can't treat any of them as just a standalone black box and you just throw different things at it and whatever comes out, comes out. I think the best kind of science that I have experienced is this process where it's a holistic process where you start by looking at the data and to see what kind of features make sense. You bring your domain knowledge into it. You bring knowledge about spatial characteristics of your data perhaps. And you think about what is the right model for... What is the task? What is the right model for that? And then, what are the best features that fit the data and the model?

Alice Zheng (08:03):
And then it comes back around, because when you do your final evaluation, don't just take a number that says, "Oh, AUC is up or down. Therefore, I will launch with this or no." Dive deeper to understand why it's going up or down, whether there are specific problem areas where the model is deficient or the features are deficient, and make this a cycle of investigation to really dive deep into results, dive deep into features. And then think about what the models and the features and evaluations should be, rather than just treating them as black boxes. This book covers the feature engineering part. I think the overall evaluation and research pipeline is also important.

Amanda Casari (09:09):
Yeah. I think that for me, it's... I really hope and I really love the framing of the book and how it focuses on really that intentional mindset. And like Alice said, on diving deeper and thinking about when you're trying to... Based on what you're trying to build, we talk a lot about here's the different kinds of models you can use to solve problems like this. But the big piece in technical guides and literature and developer guides and things that I felt like was missing before we started to do this work was that you're building statistical models on top of statistical models on top of something that has its own distributions. So yes, you can play whack-a-mole and you can just try to do all kinds of... Sorry, my computer just went to sleep.

Amanda Casari (09:58):
So you can play whack-a-mole with it and you can just try to continue to develop things on top of things. Or, you can take a look and understand what is the shape of the statistical distribution and the data and the data sets that you were working with, before you try to jam it into some kind of model and make some kind of evaluation. So maybe save yourself some time, like Alice was saying, and walk through it with a little bit more rigor. I feel like it's a little more proactive than reactive when you're trying to get something to work, to have that kind of statistical rigor and intentionality as you're building things out and giving it that additional step in care, rather than trying to just throw as many things into something and hope that you get useful results out of it.

Brooke Wenig (10:44):
So Alice, I know you had mentioned that domain knowledge is really important when you're trying to do feature engineering. Is that where people typically go wrong or is there something else that causes them to go wrong? For example, they don't understand the assumptions of the downstream models. Where do you see people go wrong with feature engineering?

Alice Zheng (11:00):
I think I'd have to go back to the point that I just talked about. Where people... where current practices may be deficient is when people don't think enough about what they are producing, but just blindly apply feature engineering and/or modeling techniques. So for instance, in the later chapters of the book, we talk about how deep learning, which is very hot these days and in some cases almost synonymous with machine learning, deep learning, you can actually see it as... the various layers as a model that's aimed to extract useful structural features out of the data. And there are typical canonical layers that you can apply, and people have composed together certain deep learning models that you can just take and use, right, like standard deep learning architectures these days.

Alice Zheng (12:29):
But if you just do that, you may not be thinking about what exactly is it that you're trying to extract out of the data. So what does a fully connected layer mean? It means that you are... the next layer of output is a linear... is several linear combinations of all of the input, right, from the previous layer. And then what does dropout mean? You're saying you want to add a bit more robustness to the features that you learn, such that they are resistant to noise. So you're artificially adding noise by dropping out a certain subset of the signals. And if you just compose together a certain, as in computer vision, for instance, when you're analyzing images, you are saying... you might be wanting to build localized filters. So you just connect together pixels that are next to each other, or features that are next... spatially located close to each other, right?
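To make the two layer descriptions above concrete, here is a minimal sketch in plain Python. It is an illustration only, not code from the book or from any deep learning library: the function names, weights, and the 0.5 dropout rate are all made up for the example, and the dropout shown is the common "inverted" variant that rescales the surviving values.

```python
import random

def fully_connected(inputs, weights, biases):
    """Each output is one linear combination of ALL inputs, plus a bias."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]

def dropout(values, rate, rng):
    """Zero each value with probability `rate` and rescale the survivors.

    This injects artificial noise so that later layers cannot rely on
    any single signal being present.
    """
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]

# Hypothetical numbers, chosen only to keep the arithmetic easy to follow.
inputs = [1.0, 2.0, 3.0]
weights = [[0.5, -1.0, 0.25],   # first output mixes all three inputs
           [1.0, 1.0, 1.0]]     # second output is their plain sum
biases = [0.0, 0.5]

hidden = fully_connected(inputs, weights, biases)
noisy = dropout(hidden, rate=0.5, rng=random.Random(0))
```

Reading the layer as "several linear combinations of the previous layer" and dropout as "noise added on purpose" is the kind of mental model Alice is describing; a real network would add nonlinearities and learn the weights.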

Alice Zheng (13:49):
So by thinking about it that way, you can mentally understand what are the types of features that I'm trying to extract from the data? Rather than blindly just applying known architecture or known models. I think that is the gap sometimes in application, is when people are just blindly applying. And so with this book, by going through here are the different types of features, here's why we construct them this way, and here's what you can do with them, and in some cases, we go a little deeper and say, "Here's where they might break down", we hope to give people that kind of intuition so that they can go off and more intelligently apply these techniques.

Amanda Casari (14:36):
It's interesting to me that you said the robotics piece, because I did an undergrad in control systems engineering, and then my work actually was in robotics for a short period of time. And that idea of the intentionally walking through your parameter tuning earlier and figuring out how to optimize it for this control, that was the least... my least favorite part of the job was actually going through. And even though you could run through emulators and you can do the real testing in real life, but walking through those super precise control loops and different kinds of control loops, similar to spinning up a cluster in your cloud preference of choice and putting some things in and seeing how things respond and then evaluating your results, that for me is I'm like... the idea of continuously running the experiments just to see what it comes out with is endlessly frustrating. Because if I don't understand the underlying mechanisms, if they don't actually make sense to me in a predictable way, I get so frustrated and I just want to walk away. It's not a fun experimentation for me. And so I like the idea of really understanding the underlying mechanisms that you're trying to model on top of. And also, because I feel like that's the way... that's where you go back and debug afterwards, right? So when things are going off the rails, you don't just look at the model, you have to go back and debug the data.

Amanda Casari (16:03):
And I would rather do that first than do it afterwards when I'm just frustrated that nothing's working out with the results that I want.

Alice Zheng (16:12):
I think that's what distinguishes practitioners from artists. I want everybody to be feature engineering and machine learning artists. You get there by developing a deep intuition about the model, and you get there by understanding, what is the model or the feature trying to do? How is it manipulating data? Not treat it as a black box.

Alice Zheng (16:43):
As you were saying, interpreting outcome ... Yeah, I could just run a grid search and get the output here. Here's the evaluation metric. Here's my process, set up a cluster, go off and run it, and then I come back and pluck the configuration that gives me the best metric. You can do that, and often people do. We program our pipelines and large machine learning systems to do that automatically every day.

Alice Zheng (17:16):
To some degree, that is fine. But I think when you're doing the research, you do want to understand not just what is the outcome, but why? And sanity check. Is that the right thing? Because you could very well have set up the wrong experiment. You could have set up the wrong grid search. What if your grid wasn't wide enough, and the parameter that you retrieved is on the edge of the boundaries? Which means that you probably need to extend it further. If you don't look at the outcome and think about it, then you probably are not discovering edge cases that will be important, because that's how it's going to break.
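The grid-boundary check Alice mentions can be sketched in a few lines. This is a hedged illustration, assuming a one-dimensional grid and a metric where higher is better; the function name `best_on_grid_edge` and the example grid and scores are hypothetical.

```python
def best_on_grid_edge(grid, scores):
    """Return (best_value, on_edge) for a 1-D hyperparameter grid.

    `on_edge` is True when the winning value is the smallest or largest
    value tried, which suggests the grid was not wide enough.
    """
    if not grid or len(grid) != len(scores):
        raise ValueError("grid and scores must be non-empty and equal length")
    best_index = max(range(len(scores)), key=scores.__getitem__)
    best_value = grid[best_index]
    on_edge = best_value in (min(grid), max(grid))
    return best_value, on_edge

# Hypothetical search over a regularization strength (higher score is better).
grid = [0.001, 0.01, 0.1, 1.0]
scores = [0.71, 0.78, 0.83, 0.86]

best, on_edge = best_on_grid_edge(grid, scores)
if on_edge:
    print(f"Best value {best} sits on the grid boundary; consider widening the grid.")
```

Here the score is still rising at the largest value tried, so plucking 1.0 as "the best configuration" would hide the fact that the search never bracketed the optimum.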

Alice Zheng (18:08):
This is the part of the artistry, is, almost via a sixth sense, you know where to look. But that comes from years of experience of actually doing it and having things break, and then learning more about why it breaks, and then so the next time you can anticipate that. That's where I hope the field will go, is that we move from a field of just throwing tools at a problem to creating art out of our problems.

Amanda Casari (18:46):
I don't know if you feel this way, but I also think it's the step. It is one of the most crucial steps in the process, whether it's for a one-off analysis or for a production pipeline, where you really can start to add checks and add additional pieces to keep things from going off the rails. Whether it's because there are additional data inputs that you were not expecting, or whether the data shape starts to change, you'll see that first. It's a leading indicator, that you'll see it first when you're doing feature transformation.

Amanda Casari (19:15):
But I also think that's the place where some of the work that's being done around inclusion metrics, and looking at subsetting different kinds of data, and population to make sure that you're not creating models that are going to be biased in a way you could have predicted, that's the step that all those things can happen at. I think if we start normalizing those practices earlier in that transformation step, we allow those human evaluations and we allow those pieces where we do want humans in the loop to be figuring out or to be automating those checks. That way we know we're not creating things through neglect, or creating any kinds of thing that might harm people in a way that we could have predicted if we had done some additional analyses.

Denny Lee (19:58):
This is actually really true. We actually have a couple of Data Brew vidcasts which actually talk about privacy and fairness for exactly that reason. It's interesting segue there.

Denny Lee (20:08):
But then I did want to ask the question ... We've been talking about the artistry of building these models, and, for that matter, gaining that intuition, as you called out, Alice.

Denny Lee (20:19):
But then the one I'd like to roll back into is, what does it take for all the infrastructure to support all of this? Basically you're going to need the infrastructure to build features. I'm not asking for you to refer to your specific clouds or anything like that, I'm just saying, what does it take underneath the covers in order to be able to store these features, transform these features, analyze these features, automate them, from your perspectives? Whoever'd like to tackle that first, that's cool with me.

Amanda Casari (20:50):
I used to use Hadoop and Pig, and that gets the job done, so you know ... For me, the idea of feature store is what you're really doing is you're saying there is value in transforming data in ways and not having to do it with every workload. The idea of a feature store is that you know you're using the same kind of processed information in multiple places or in multiple ways.

Amanda Casari (21:17):
Yes, there's a lot of other infrastructure ways and MLOps benefits that you get out of it, but it's really the acknowledgement that it's not worth your time, your company's time or your individual money and account, to try to have six people doing the same transformation at different times and different days. For me, the idea of creating feature store, I'm like, "Yeah, that's just a logical next step when you're trying to scale out your work."

Amanda Casari (21:43):
Is it necessary if you're all by yourself? If it saves you time that you don't have to rerun variables constantly as part of a pipeline, then yeah. At that point, it saves you time and efficiency. I think it depends for me at what kind of scale you're looking at doing, and then asking yourself whether or not that is saving you time and workloads, and saving a team time and workloads to invest that as a separate work stream.
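The core idea Amanda describes, transforming once and reusing the result across workloads, can be sketched as a tiny in-memory cache. This is a toy illustration under stated assumptions, not how any production feature store works: the class `TinyFeatureStore`, its methods, and the `log_spend` feature are all hypothetical.

```python
import math

class TinyFeatureStore:
    """Hypothetical in-memory store: compute each feature once, then reuse it."""

    def __init__(self):
        self._transforms = {}   # feature name -> (function, description)
        self._values = {}       # (feature name, entity id) -> computed value
        self.compute_count = 0  # how many times a transform actually ran

    def register(self, name, transform, description):
        """Record how a feature is computed, with a human-readable note."""
        self._transforms[name] = (transform, description)

    def describe(self, name):
        """Return the note explaining how the feature was generated."""
        return self._transforms[name][1]

    def get(self, name, entity_id, raw_value):
        """Return the cached feature, computing it only on first request.

        Simplification: the cache is never refreshed, so a changed
        raw_value for the same entity would be ignored.
        """
        key = (name, entity_id)
        if key not in self._values:
            transform, _ = self._transforms[name]
            self._values[key] = transform(raw_value)
            self.compute_count += 1
        return self._values[key]

store = TinyFeatureStore()
store.register("log_spend", math.log1p, "log(1 + raw spend)")

first = store.get("log_spend", "user-1", 99.0)   # computed
second = store.get("log_spend", "user-1", 99.0)  # served from the cache
```

Whether this is worth building depends, as Amanda says, on how many people and pipelines would otherwise repeat the same transformation.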

Denny Lee (22:08):
Makes sense. Alice, anything that you'd like to add? Especially because we did already jump right to the feature store. That usually is an implication of production pipelines and massively large workloads, big data.

Denny Lee (22:21):
Amanda, you brought up Hadoop and Pig. Man, you're bringing up the old days now. No, but in terms of, even for your small data environments, like when you're working with pandas, the same infrastructure is still not required for them. You alluded to that already, Amanda, that, yeah, you're doing this for yourself, but even having some of the infrastructure will make your life easier because you're often forgetting about this stuff. I'm just wondering if there's anything else to add, whether it's the big data or the smaller data, per se.

Alice Zheng (22:52):
You just want us to talk about Spark, don't you?

Denny Lee (22:55):
No. Actually I am not trying to do that. That's why I said small data. That's why I said smaller data.

Alice Zheng (22:59):
Right. Sure.

Denny Lee (22:59):
That's precisely what I'm not doing. I'm precisely not doing this.

Alice Zheng (23:09):
I agree with what Amanda just said, and I'll add to that in that I think the ... I'm not an infrastructure expert, by any means, though I'm fascinated by the area and I'm certainly learning a lot about it in my current job. I think whether or not you need a feature store depends on the scale of the project and how many other collaborators there are. Yeah. If you're just working on it by yourself, it's probably not worth the trouble to figure out and build the infrastructure to store all of this.

Alice Zheng (23:49):
The other downside, I think, for a feature store is that I don't know how the feature was generated. It loses some of its interpretability, maybe.

Alice Zheng (24:03):
Feature store, almost by definition is kind of like a black box, right? In order for me to effectively, efficiently use it, I think I would need to invest a lot of time to understand how were they trained?

Alice Zheng (24:17):
And also the feature store, unless you're talking about a constantly refreshed feature store, if it's a static one, then it could go out of date. So there's a question of how are the features updated? And are they updated frequently enough? And whenever a feature gets updated, what does that mean for the downstream gamma, or whatever, experimentation systems that are depending on it, because if the update changes certain aspects of the feature that...

Alice Zheng (24:54):
For instance, if it all of a sudden flips all the positives and negatives, which some updates are fine, right? Because some feature generation, it doesn't matter if you just multiply a negative sign, it's still geometrically pointing, the vectors are pointing in the right direction. So positive, negative...

Alice Zheng (25:15):
But if you have something downstream that is sensitive to the sign, or the scale, then that's going to be a problem. So again, I don't think you should blindly rely on the feature store without really understanding how is it being generated and maintained.
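A small sketch of the sign-flip example: when every feature vector is multiplied by -1, the geometry between vectors is preserved, but a consumer that reads the sign directly changes its answer. The vectors and the `flag_if_positive` rule are hypothetical, chosen only to illustrate the point.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def flag_if_positive(features):
    """Hypothetical downstream rule that depends on the sign of feature 0."""
    return features[0] > 0

# Feature vectors before an update, and after an update that flips every sign.
old_a, old_b = [1.0, 2.0], [2.0, 1.0]
new_a = [-x for x in old_a]
new_b = [-x for x in old_b]

# Relative geometry is unchanged by the global flip...
similarity_before = cosine_similarity(old_a, old_b)
similarity_after = cosine_similarity(new_a, new_b)

# ...but the sign-sensitive consumer now makes the opposite decision.
decision_before = flag_if_positive(old_a)
decision_after = flag_if_positive(new_a)
```

A consumer built on similarities would not notice this update, while the threshold rule silently inverts, which is why knowing how a stored feature is generated and maintained matters.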

Alice Zheng (25:41):
And I think ultimately, a lot of machine learning and data science in production relies on infrastructure. And there is definitely not enough discussion about infrastructure. I don't know if one infrastructure will apply to all cases, probably not.

Alice Zheng (26:07):
So each application, almost, needs to build this infrastructure. And then it's a question of how much of that can be generalized and shared. I don't know the answer to that question.

Amanda Casari (26:21):
Can I add onto that, Alice? I think the point that you brought up earlier around not knowing for feature stores how things have been transformed, brings us into the... That's part of the larger question, I think, that we have of documentation in general, for datasets and for transforming, first of all, raw data generation collection, the transformation properties that we apply to it.

Amanda Casari (26:46):
And so if we reference back to the work done for data cards, for data cards for datasets, or data sheets for datasets, which has been now transformed into a few different, there's the project for data, there's the data cards project.

Amanda Casari (27:04):
I feel like a lot of that still hasn't caught on. So documentation in general, I think, for everything including... Like leading up to the model, but everything before that, we really need better practices for that, so that we can ask questions, not just of the outcomes, but of the original collection and transformation. And we can ask better questions, not just of the products themselves, but also for the people who are responsible for those sections.

Amanda Casari (27:31):
And I think that for feature stores, that needs to be a piece as well, where it's not just the metadata. That's great, but you also need more information about original intention. You need information about what was mathematically done along the way. And we just need better documentation for all of those.

Brooke Wenig (27:53):
So you raised an amazing point, Amanda. And I love the call-out to data sheets for data sets. How do you suggest we document these things and share it with other people?

Amanda Casari (28:02):
Yeah, so I think, I asked this question actually on Twitter a while ago when I was trying to figure out... Because I'm right now doing quite a bit of dataset curation, and release, and trying to figure out that whole piece of the puzzle.

Amanda Casari (28:17):
I do think, I will call out... I want to look at their names to make sure I get it correctly, but so I know, Andrew Zaldivar and then Mahima, I can't remember her last name right now, did recently speak at the ACM FAccT conference on their project for creating data cards.

Amanda Casari (28:34):
And it's actually a design process. So it's not just the output itself, or integrating that into our workflow. But what they've found as part of their work over the last two years was that it's not as simple as integrating checklists into tools.

Amanda Casari (28:49):
If you want to really capture that kind of information, then making sure that things are more intentional, and identify pieces of information and passing that on, there's lots of social components, there's lots of components for technical integration.

Amanda Casari (29:04):
I don't think it's a solved problem yet, because I think that we, as an industry, still haven't embraced that social context matters, and that social construct affects our outcomes, and our workflows, and what we're trying to create.

Alice Zheng (29:17):
I was going to say that documentation is great, but documentation goes out of date. And that requires a process by which to keep things up to date. I think that needs to be supplemented by automated checks in the systems themselves.

Alice Zheng (29:36):
So every system, at the boundaries of input and output, should perform sanity checks. You should have covariate shift checks in place for how your input distribution may have changed.

Alice Zheng (29:55):
And you should have sanity checks for making sure that your output isn't crazy. Or you might want to... Yeah, that could be the input check. I need to check for the next system.
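As one hedged illustration of checks at a system's input and output boundaries, the sketch below compares a feature's current values against a reference sample with a two-sample Kolmogorov-Smirnov statistic, and verifies that predicted probabilities are finite and within [0, 1]. The function names, sample values, and the 0.3 alert threshold are assumptions made for the example; other shift tests and thresholds are equally reasonable.

```python
import math
from bisect import bisect_right

def ks_statistic(reference, current):
    """Largest gap between the two empirical CDFs (0 = identical, 1 = disjoint)."""
    ref, cur = sorted(reference), sorted(current)
    return max(
        abs(bisect_right(ref, x) / len(ref) - bisect_right(cur, x) / len(cur))
        for x in ref + cur
    )

def input_has_shifted(reference, current, threshold=0.3):
    """Input check: flag a feature whose distribution moved past the threshold."""
    return ks_statistic(reference, current) > threshold

def outputs_are_sane(probabilities):
    """Output check: every prediction is a finite number in [0, 1]."""
    return all(math.isfinite(p) and 0.0 <= p <= 1.0 for p in probabilities)

# Hypothetical training-time sample and two batches seen in production.
reference = [0.1, 0.2, 0.3, 0.4, 0.5]
similar_batch = [0.15, 0.25, 0.3, 0.45, 0.5]
shifted_batch = [1.1, 1.2, 1.3, 1.4, 1.5]

print(input_has_shifted(reference, similar_batch))
print(input_has_shifted(reference, shifted_batch))
print(outputs_are_sane([0.2, 0.9, float("nan")]))
```

Checks like these run automatically on every batch, so they keep working even when the written documentation has gone out of date.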

Alice Zheng (30:09):
So I think documentation, safety guardrails in the system, monitoring, continuous monitoring and metrics, and diagnosis, those are all part. All of these steps, none of them are optional. They should all be... There should be a process around keeping all of those running.

Alice Zheng (30:40):
So yeah, it's not just about systems, obviously it's about people and processes.

Brooke Wenig (30:45):
That was an amazing point you had made, both of you, about documentation processes, continual monitoring, and making sure that we have these social constructs in place.

Brooke Wenig (30:53):
I think we're almost at the bottom of the hour here. So I just want to wrap it up and say thank you both so much, Alice and Amanda, for joining us today on Data Brew to talk about feature engineering and all things related to it, infrastructure, et cetera.

Alice Zheng (31:04):
Thank you, Brooke and Denny. Spark, spark, spark.

Brooke Wenig

Host

Denny Lee

Host