Feature engineering is an integral part of creating a machine learning model. The process transforms raw complex data sets into explanatory variables -- "features" -- that make machine learning algorithms work, explained Max Kanter, CEO and co-founder at Feature Labs Inc., during a recent conversation about automating feature engineering.
"Automating feature engineering optimizes the process of building and deploying accurate machine learning models by handling necessary but tedious tasks so data scientists can focus more on other parts of the process," Kanter explained.
In this Q&A, he highlights the benefits and challenges of automating feature engineering for machine learning and explains how the Deep Feature Synthesis algorithm works. He also discusses what CIOs need to know about feature engineering.
Editor's note: The following has been edited for clarity and brevity.
How can automating feature engineering help data scientists?
Max Kanter: One of the most time-consuming and error-prone steps in building a machine learning model is taking the raw data and figuring out the domain-specific transformations you need to perform. For example, say you are trying to predict how much a customer is going to spend in the future and you have all of the actions they have taken in the past. Then the data scientist says, 'The time since their last purchase and the average time between their purchases -- those are the variables that I want to extract to train my model.'
That requires them to understand the problem and translate their understanding into a piece of code or a script they can run to extract the variables. Ultimately, they train their machine learning model on those variables. If they don't extract the right variables, the model they build won't be accurate and won't perform well enough to actually deploy and help their business.
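The hand-written feature extraction Kanter describes might look something like the following sketch, which computes his two example variables -- time since last purchase and average time between purchases -- from a hypothetical purchase log (the column names and cutoff date are illustrative assumptions, not from the interview):

```python
import pandas as pd

# Hypothetical purchase log: one row per customer purchase.
purchases = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "purchase_time": pd.to_datetime([
        "2024-01-01", "2024-01-10", "2024-01-25",
        "2024-01-05", "2024-02-04",
    ]),
})

# "Now" for the prediction problem: features are computed as of this date.
cutoff = pd.Timestamp("2024-03-01")

def extract_features(group: pd.DataFrame) -> pd.Series:
    times = group["purchase_time"].sort_values()
    return pd.Series({
        # Days since the customer's most recent purchase.
        "days_since_last_purchase": (cutoff - times.iloc[-1]).days,
        # Average gap, in days, between consecutive purchases.
        "avg_days_between_purchases": times.diff().dt.days.mean(),
    })

features = purchases.groupby("customer_id")[["purchase_time"]].apply(extract_features)
print(features)
```

Each such feature requires its own hand-written transformation, which is exactly the tedious, error-prone work that automated feature engineering takes over.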
By automating this process, we can accelerate the time it takes to extract one of these variables, which means we can extract more variables and we can avoid errors that come out of this process. Ultimately, automating feature engineering helps companies and data scientists create more models and get better accuracy.
What are the challenges of automating feature engineering?
Kanter: It is very challenging to automate the process because every company's data is different and has its own complexities. One company might collect information about how its customers behave on its website in one way, while another company will collect the same information with different column names, table structures or databases.
Beyond that, if you want to create a general-purpose way of automating it that works for companies in the retail space but also for a financial services company, where the domain is completely different because they are trying to predict something like credit card fraud, you need very general-purpose algorithms.
How does Deep Feature Synthesis work?
Kanter: Deep Feature Synthesis is an automated feature engineering approach that, essentially, can be applied to many different types of data, ranging from marketing use cases to financial services use cases to healthcare use cases. The general principle behind it is we're trying to emulate how human data scientists would approach these problems.
Deep Feature Synthesis works by having a library of feature engineering building blocks called primitive functions, and each one of these primitives is labeled with the type of data it can input and the type of data it can output.
To give you a very simple example, you can imagine a primitive that takes in a list of numbers and outputs the maximum value in that list. We have a library of many of these primitives, and when we get a new data set, Deep Feature Synthesis looks at the specific columns and relationships in the data and figures out which primitives to apply. That's how it can take generic primitives and create specific features.
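The matching Kanter describes -- primitives labeled with input and output types, applied wherever a column's type fits -- can be sketched in a few lines. This is an illustrative toy, not Feature Labs' actual code; the primitive names and type labels are assumptions:

```python
# Toy type labels for columns and primitive signatures.
NUMERIC, DATETIME = "numeric", "datetime"

# Each primitive: (name, input type, output type, function over a list of values).
PRIMITIVES = [
    ("max", NUMERIC, NUMERIC, max),
    ("min", NUMERIC, NUMERIC, min),
    ("mean", NUMERIC, NUMERIC, lambda xs: sum(xs) / len(xs)),
]

def synthesize(columns):
    """Apply every primitive whose input type matches a column's type."""
    features = {}
    for col_name, (col_type, values) in columns.items():
        for prim_name, in_type, _out_type, fn in PRIMITIVES:
            if in_type == col_type:
                # Generic primitive + specific column = specific feature.
                features[f"{prim_name}({col_name})"] = fn(values)
    return features

# One customer's transaction amounts, typed as numeric.
print(synthesize({"amount": (NUMERIC, [12.0, 30.0, 18.0])}))
```

Because the matching is driven by type labels rather than by column names, the same primitive library transfers across retail, financial services or healthcare data sets.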
You might also need to extract very complex features to get a highly accurate model, so these primitives can also be combined on top of each other and stacked. That is why we call it Deep Feature Synthesis, because we figure out how to combine primitives in the right order to create the right features. Much like a human data scientist would, we brainstorm a list of potential features we can calculate on a data set. Then we start calculating them one by one and prioritizing and ranking them so that the end user of our software gets recommendations of the most important features to use for their data sets.
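The stacking idea -- combining primitives so one primitive's output feeds the next -- can be sketched as simple function composition. The primitive names here are illustrative assumptions, not the real Deep Feature Synthesis API:

```python
def diff(values):
    """Transform primitive: gaps between consecutive values."""
    return [b - a for a, b in zip(values, values[1:])]

def mean(values):
    """Aggregation primitive: average of a list of numbers."""
    return sum(values) / len(values)

def stack(*primitives):
    """Compose primitives right-to-left into one 'deep' feature function."""
    def feature(values):
        for prim in reversed(primitives):
            values = prim(values)
        return values
    return feature

# Depth-2 feature: average gap between purchases, with purchase times
# given as days since some reference date.
avg_gap = stack(mean, diff)
print(avg_gap([0, 9, 24]))  # gaps are [9, 15], so the mean is 12.0
```

Each added layer of stacking is a step "deeper," which is where the name Deep Feature Synthesis comes from; the algorithm's remaining job is choosing which compositions to compute and how to rank them for the end user.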
What should a CIO know about feature engineering?
Kanter: If it's their first time using machine learning to develop a new service or deploy any product, the most important thing for CIOs to recognize is that without the right features, their machine learning model won't work. It's very important that they recognize that they need to do this, whether it's hiring a team of expert data scientists to do it, or taking advantage of new software to automate [feature engineering] to enable their existing resources, or using open source software that can help them do it.
Every CIO trying to apply machine learning should recognize that they need to have a strategy around how they are going to prepare data for machine learning and feature engineering.
I suggest CIOs adopt three best practices for developing this strategy. First, focus on the problem they're solving, not just on building the most accurate model possible. Second, when they begin that process of finding the right problem, make sure they have a structured approach and aren't just developing ad hoc scripts and connecting random pieces together, because then they won't be able to scale the process up in the future. Finally, they should make sure they bring all of the stakeholders to the table.