Why and when to consider a feature store in machine learning
Feature stores exist to make data for training machine learning models reusable. Explore both the benefits and challenges of feature stores that organizations can experience.
Many machine learning and AI models work best on summaries of raw data called features. These features structure information into a form that makes it easier to train algorithms.
A simple feature might involve transforming a raw date into a weekday-or-weekend flag, which might be a better predictor of behavior than the raw date itself. Other kinds of features can be more complex and require intricate calculations across many data streams. A feature store provides a place to organize the most popular features so they can be reused across projects rather than rebuilt from scratch every time they're needed.
A feature store can increase automation, improve productivity by promoting sharing and reuse, reduce technical debt in software code, ensure consistency in calculations and provide governance, auditability and lineage for regulatory compliance, according to David Sweenor, senior director of product marketing at data science tools company Alteryx. However, a feature store isn't ideal for every company. Smaller companies may struggle with the overhead required to create and maintain one, and companies of any size may struggle to reuse features across departments.
What are the benefits of a feature store?
A feature, as it relates to data science, is any variable that can be used for analytics. Simple examples include name, age, sex, zip code and amount. These raw variables are transformed through a process known as feature engineering to yield better predictions. For example, a date could be transformed into a day of the week, a day of the year or a holiday.
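The date example above can be sketched in a few lines of Python. This is a minimal illustration rather than a production recipe: the function name `date_features`, the output keys and the tiny `HOLIDAYS` set are all assumptions made for the example.

```python
from datetime import date

# Hypothetical holiday calendar, for illustration only.
# A real project would load this from a maintained source.
HOLIDAYS = {date(2024, 1, 1), date(2024, 12, 25)}


def date_features(d: date) -> dict:
    """Turn a raw date into several engineered features."""
    return {
        "day_of_week": d.weekday(),            # 0 = Monday ... 6 = Sunday
        "is_weekend": d.weekday() >= 5,        # Saturday or Sunday
        "day_of_year": d.timetuple().tm_yday,  # 1 ... 365 (366 in leap years)
        "is_holiday": d in HOLIDAYS,
    }


christmas = date_features(date(2024, 12, 25))
saturday = date_features(date(2024, 1, 6))
```

A model trained on `is_weekend` or `is_holiday` can pick up patterns that would be hard to learn from the raw date value alone.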
A feature store lets one data scientist define a transformation once rather than having every data scientist recreate the same feature. This ensures consistency, since everyone uses the exact same transformation in their models. It also reduces the need to copy the same logic into each model's code. If a company decides to change a complex feature, a feature store makes it possible to change the definition once and have the change propagate to all models that use it. Otherwise, someone would have to manually edit every model that uses that feature.
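The define-once, reuse-everywhere idea can be illustrated with a toy in-memory registry. This is only a sketch of the concept: `FeatureRegistry`, the feature name `is_large_order` and the two model functions are hypothetical, and real feature stores add storage, versioning and serving on top of this basic pattern.

```python
from typing import Any, Callable, Dict, List


class FeatureRegistry:
    """Toy stand-in for a feature store: one shared definition per feature name."""

    def __init__(self) -> None:
        self._features: Dict[str, Callable[[dict], Any]] = {}

    def register(self, name: str, fn: Callable[[dict], Any]) -> None:
        # Registering an existing name replaces its definition for all users.
        self._features[name] = fn

    def compute(self, names: List[str], record: dict) -> Dict[str, Any]:
        return {name: self._features[name](record) for name in names}


registry = FeatureRegistry()
registry.register("is_large_order", lambda r: r["amount"] > 100)


# Two hypothetical models that depend on the same shared feature.
def churn_model_inputs(record: dict) -> dict:
    return registry.compute(["is_large_order"], record)


def fraud_model_inputs(record: dict) -> dict:
    return registry.compute(["is_large_order"], record)


order = {"amount": 150}
before = (churn_model_inputs(order), fraud_model_inputs(order))

# Change the definition once; both models pick it up without being edited.
registry.register("is_large_order", lambda r: r["amount"] > 200)
after = (churn_model_inputs(order), fraud_model_inputs(order))
```

Neither model function contains the threshold, so updating the registered definition is enough to keep both consistent.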
"Since processing these data is very expensive, and these data are slow-changing," said Edward Scott, CEO of ElectrifAi, "it makes sense to process them once every hour or day and store the features into a feature store for hundreds of teams to use ML [machine learning] to solve their business problems."
Computing features once and sharing them reduces costs and improves feature quality. Feature stores also reduce development time and enable developers to launch new projects more quickly. In one use case, a feature store can improve time-to-market and campaign effectiveness by supporting analysis of how the quality of translated content affects campaign results across different countries, according to Olga Beregovaya, vice president of machine translation and AI at language translation service Smartling.
Challenges with implementing a feature store
A feature store may not be suitable for every use case or organization as it involves some overhead, which can increase data science complexity, particularly for smaller projects. A feature store could make things unnecessarily complicated when a company has many different sets of data, and each data set is small.
"The feature store adds no value if there are very limited data science use cases within a company," Beregovaya said.
She has also found that feature stores aren't helpful when the data is so disparate that no shared modeling practice will be of benefit. If the data sets are disparate and the metadata is drastically different, then the features built on them are difficult to reuse. For example, a feature store wasn't helpful when one team was building an ROI prediction and another was working on time-to-market estimation, with the data coming from completely incompatible sources. Feature stores can also create problems when data is shared by various teams that have different service-level agreement requirements.
Other challenges could arise when the raw data from a transaction doesn't contain all of the data needed for a predictive model to run, observed Alteryx's Sweenor. For example, a fraud detection algorithm may require a date, transaction amount, vendor, average amount spent over the last seven days and maximum amount spent over the past 30 days. If the raw data only includes the date, amount and vendor, the system will have to retrieve the average and maximum spend from somewhere else. This can be problematic because those aggregates must be computed and made available before the model can score the transaction. Data engineers would have to work with business domain experts and enterprise architects to ensure all the features in their model are available at runtime.
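The fraud detection example above can be sketched as follows, to show where the two missing values have to come from. Everything here is an assumption for illustration: the function `spend_features`, the feature names, the sample history and the choice to define each window as the N days before the current transaction, excluding the transaction's own date.

```python
from datetime import date, timedelta
from typing import Dict, List, Tuple

Transaction = Tuple[date, float]  # (transaction date, amount)


def spend_features(history: List[Transaction], as_of: date) -> Dict[str, float]:
    """Aggregate prior spending into the two windowed features the model needs."""

    def amounts_within(days: int) -> List[float]:
        # Assumed window: strictly before as_of, at most `days` days back.
        return [
            amount
            for d, amount in history
            if timedelta(0) < as_of - d <= timedelta(days=days)
        ]

    last_7 = amounts_within(7)
    last_30 = amounts_within(30)
    return {
        "avg_amount_7d": sum(last_7) / len(last_7) if last_7 else 0.0,
        "max_amount_30d": max(last_30) if last_30 else 0.0,
    }


# Hypothetical spending history for one customer.
history = [
    (date(2024, 3, 1), 500.0),
    (date(2024, 3, 20), 40.0),
    (date(2024, 3, 25), 60.0),
    (date(2024, 3, 28), 20.0),
]

# The raw transaction carries only date, amount and vendor.
raw = {"date": date(2024, 3, 30), "amount": 75.0, "vendor": "example-vendor"}

# The remaining inputs must be looked up or computed before scoring.
features = spend_features(history, raw["date"])
model_input = {**raw, **features}
```

In practice the history lookup is the hard part: the aggregates have to be precomputed and served fast enough to score the transaction, which is why the features need to be available at runtime.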
When considering the use of feature stores, each organization should carefully assess whether the benefits outweigh the costs for its specific needs and ML projects. A company with a small number of ML models in production may not reap the same benefits as one with considerably larger ML projects and data sets. Either way, feature stores are becoming more commonplace in the tech world and are worth keeping an eye on.