Feature engineering is a complex part of the machine learning process. It plays an important role in organizing raw data sets for different AI and machine learning algorithms. It is an iterative process that evolves as data scientists explore different hypotheses about what the data represents and which algorithms are best suited to a specific result.
Some of the top feature engineering tips include working with domain experts, continuously training the models, starting with standard techniques, calculating the correlation between features and thinking back from the objective.
What is feature engineering?
At a high level, feature engineering is the practice of transforming raw data into the most appropriate form for a specific machine learning algorithm. Piyanka Jain, president and CEO of Aryng, a data science consultancy, said that this consists of two distinct parts.
The first part involves preparing the data so it is amenable to modeling. This includes processes such as outlier treatment and handling missing values. The second part involves feature creation and transformation to get the best model outcome, including some combination of metrics such as accuracy, recall or lift.
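The first part of this process can be sketched in a few lines of pandas. This is a minimal, hypothetical example, not Aryng's method: it clips outliers to the 1.5x-IQR fences and fills missing values with the column median, two common choices among many.

```python
import numpy as np
import pandas as pd

def prepare_features(df: pd.DataFrame) -> pd.DataFrame:
    """Clip numeric outliers to the 1.5*IQR fences, then fill missing values with the median."""
    out = df.copy()
    for col in out.select_dtypes(include="number"):
        q1, q3 = out[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        out[col] = out[col].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)
        out[col] = out[col].fillna(out[col].median())
    return out

# Hypothetical raw data: one extreme age outlier and two missing values.
raw = pd.DataFrame({"age": [25, 31, 29, 400, None],
                    "income": [50, 52, None, 49, 51]})
clean = prepare_features(raw)
print(clean["age"].isna().sum())  # 0 -- no missing values remain
```

Whether to clip, drop or keep outliers, and whether the median is the right fill value, are exactly the judgment calls that depend on the business framing done before this step.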
Feature engineering is also part of what Jain calls the insight step of the complete data science project. Prior steps include formulating a business question, creating an analysis plan and data collection.
"Before we get to feature engineering, we have already done a fair bit of business framing around what the model should deliver, how it is going to be used and what are the best hypotheses around the problem that model is solving," Jain said.
The importance of feature engineering
"Feature engineering is a process of making data actionable for the model and is important to ensure your AI models perform correctly," said Ivan Yamshchikov, AI evangelist at Abbyy, a document processing tools provider.
Good feature engineering requires experts with the appropriate context of the application, the data sources and the ways the data has been processed and managed. Various tools are emerging to automate some aspects of feature engineering, but many experts see these as augmenting rather than replacing human skills at feature engineering.
"While the industry is beginning to see levels of automated feature engineering, particularly with image recognition, this level of automation still isn't possible for the majority of business cases," said Tom Dyar, product specialist at InterSystems, a database tools provider.
Introduce expertise as domain knowledge
One of the top feature engineering tips is to use subject matter experts to guide data engineers and data scientists on how to think about structuring the data.
SoftSmile, which uses AI to create orthodontics, has seen tremendous benefits around its data workflows by applying domain knowledge of its clinical data to the feature engineering process. Khamzat Asabaev, co-founder and CEO of SoftSmile, said his team found that standard preprocessing techniques such as noise removal and correlation-based filtering were computationally efficient but could only address part of the issue. Domain experts can sort out other aspects of data preparation.
In practice, this involves tight cooperation with expert clinicians to build a framework for feature engineering which is then enhanced by applying filtering techniques.
Train AI models continuously
Another feature engineering tip is to train AI models continuously, a practice closely tied to exploratory data analysis, Yamshchikov said. This is especially important if models are used to predict outcomes in potentially life-saving scenarios, such as in a hospital setting.
This practice involves training several models on different combinations of data sources and selecting the one with the best overall performance, which may change as new data arrives. For example, a hospital might want to use process intelligence to predict when a patient will need a hospital bed based on the results of certain medical tests. The data scientists would need to decide which tests to run while weighing test accuracy, time requirements and costs, and answer those questions before training a network to predict when or if a patient will have to be hospitalized.

Although running more tests could improve the final accuracy of recommendations, it could also slow the process enough to worsen the overall outcome. Finding the optimal combination of data sources by training and comparing several candidate models can be a long and difficult process.
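The compare-candidates loop might look something like the following sketch. The data, feature sets and least-squares models here are all hypothetical stand-ins; the point is the pattern of scoring each candidate feature combination on held-out data and keeping the winner.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 300
test_a = rng.normal(size=n)                      # informative test result
test_b = rng.normal(size=n)                      # uninformative test result
y = 3 * test_a + rng.normal(scale=0.5, size=n)   # outcome depends only on test_a

def holdout_mse(X, y, split=200):
    """Fit least squares on a training split, score on the holdout split."""
    Xtr, Xte, ytr, yte = X[:split], X[split:], y[:split], y[split:]
    coef, *_ = np.linalg.lstsq(Xtr, ytr, rcond=None)
    return float(np.mean((Xte @ coef - yte) ** 2))

# Candidate feature combinations, each with an intercept column.
candidates = {
    "test_a only": np.column_stack([test_a, np.ones(n)]),
    "test_b only": np.column_stack([test_b, np.ones(n)]),
    "both tests": np.column_stack([test_a, test_b, np.ones(n)]),
}
scores = {name: holdout_mse(X, y) for name, X in candidates.items()}
best = min(scores, key=scores.get)
print(best, scores[best])
```

In the hospital scenario, the scoring function would also fold in the cost and turnaround time of each test, not just predictive accuracy.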
"This part of the research is usually called explorative data analysis, but to some extent it is already a form of feature engineering," Yamshchikov said.
Start with standard techniques
Data scientists may be tempted to experiment with cutting-edge techniques, but this can add unnecessary complexity.
"The risk with too much experimentation is that it overloads the model with unnecessary additional data features," said Rosaria Silipo, principal data scientist at KNIME, an open source data analytics company. She recommends starting with standard techniques and then experimenting later if required.
For example, in time series analysis, data scientists should start with standard techniques such as first-order differences or a logarithmic transformation to make the series stationary. Only if those fall short should they consider more experimental approaches.
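Both standard techniques fit in a couple of lines. The sketch below uses a synthetic, hypothetical series with exponential growth: the log transform turns multiplicative growth into a linear trend, and first-order differencing then removes that trend, leaving a roughly constant series.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly series with exponential growth (non-stationary).
ts = pd.Series(np.exp(0.05 * np.arange(48)) * 100)

log_ts = np.log(ts)               # log transform stabilizes multiplicative growth
diff_ts = log_ts.diff().dropna()  # first-order differences remove the trend

# After both transforms the series is essentially constant at the growth rate.
print(round(float(diff_ts.mean()), 3))  # 0.05
```

Real series are noisier than this, but the same two transforms are the usual first attempt before reaching for anything exotic.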
Calculate the correlation between feature values
Another essential and easy feature engineering tip is to calculate the correlation between feature values. This helps select only appropriate features by removing redundant ones that carry largely duplicate information, said Alex Ough, senior CTO architect at Sungard Availability Services, an IT production and recovery services provider.
A number of popular tools, beyond the native support in languages such as R and Python, help with this task. One is pandas-profiling in Python, which generates several kinds of correlation matrices along with other helpful feature analysis output, such as data types, missing values and summary statistics.
"The bottom line of how to calculate the correlation is to compare the change patterns of two feature values," Ough said. For example, does the other value also increase or decrease when one of the values increases? The correlation calculation will be more accurate if the feature values are unique and have fewer missing values.
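This correlation-based filtering can be done directly with pandas. The sketch below uses made-up data in which feature "b" is a near-duplicate of "a"; the 0.95 threshold is an arbitrary assumption, not a rule from the article.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.normal(size=200)})
df["b"] = df["a"] * 2 + rng.normal(scale=0.01, size=200)  # near-duplicate of "a"
df["c"] = rng.normal(size=200)                            # independent feature

corr = df.corr().abs()
# Keep only the upper triangle so each feature pair is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
reduced = df.drop(columns=to_drop)
print(to_drop)  # ['b'] -- "b" moves in lockstep with "a", so one of the pair can go
```

As Ough notes, this comparison of change patterns is only as reliable as the inputs: features with many missing or non-unique values will produce noisier correlation estimates.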
Think back from the objective
Michael Yurushkin, CTO and founder of Broutonlab, a data science consultancy based in Russia, recommends starting a project by working with the business to determine what is predictive of the outcome and translating this into features.
"You can decide better on which model would perform best even before starting the modeling if you know the end goal," Yurushkin said.
For example, he had one client who wanted to create an app that would recognize particular objects and people in a video, extract frames automatically and send the selected frame to the user. The models the client had been using were not doing the job. Broutonlab started by looking at the video streams to understand how the objective could be achieved and built several toy models to validate various approaches. As a result, they found the optimal combination of models and features, reduced the video processing time from three hours to 30 minutes and streamlined the manual work required in the process.