The early days of developing AI and machine learning models were characterized by a slow interplay between ideas and experiments, followed by the occasional deployment. That approach has yielded limited success in moving models from experimentation to production environments.
Enterprises are starting to adopt DevOps practices that have proven successful with software and apply them to AI and machine learning models through machine learning operations. Tools that once supported isolated aspects of the workflow are coalescing into coherent MLOps platforms that streamline data preparation, run AI workloads in the cloud using AI-as-a-service and weave the resulting models into front-line applications.
Taking notes from DevOps integration successes
Until recently, there was no standard way to deploy, serve, monitor and govern models in production. This created silos that made integration work much more difficult.
An MLOps framework is about using technologies to enable streamlined delivery of machine learning into the business or products, rather than simply developing technologies that leverage machine learning, said Santiago Giraldo, senior product marketing manager at Cloudera.
"The biggest win for DevOps leaders and teams is the growth of MLOps as a standardized practice supported by innovations in data and model lifecycle platforms," he said.
Good MLOps practices make getting to production -- and, in turn, enabling other parts of the business to adopt and use machine learning models -- easy.
Streamlining a repeatable process
Streamlining end-to-end machine learning workflows from data to production in a transparent, repeatable way remains a challenge. A common misconception about machine learning workflows is that model code is the most important part. That's not the case, Giraldo said, as evidenced by the low success rates of enterprise machine learning projects. He believes successful production machine learning is built on a solid foundation that starts with data management and data engineering. These good data management practices empower data scientists to work effectively and deliver machine learning models to the business.
Enterprises also need to consider the requirements for operating models once they are in production. This includes everything from accuracy monitoring to proactive alerting, full lifecycle lineage tracking, model cataloging and governance features that enable transparency and facilitate adoption.
Early MLOps tools focused on delivering one aspect of these capabilities; building a full MLOps toolchain requires a way to seamlessly tie tools together or adopt full MLOps suites. These capabilities need to address the full data and machine learning lifecycle to simplify the collaboration of business managers, IT, data engineers, data scientists and data protection officers.
"It's incredibly important to have a machine learning platform designed to foster collaboration across various teams and [to] enable data scientists to deploy, serve and monitor machine learning models powering ML business use cases at scale," Giraldo said.
Keeping models current
The biggest challenge with provisioning and updating machine learning applications in production is how the models are deployed, governed and monitored for ongoing accuracy. Machine learning models are living applications that often continue to learn from new inputs; ensuring they stay accurate can be a difficult task.
"The nature of the data can evolve over time, making the initial models trained on this data no longer effective," said Jeff Fried, director of product management at database tools company InterSystems, based in Cambridge, Mass.
This drift complicates how teams monitor the results and outcomes that machine learning models produce. MLOps techniques that measure and adjust for drift are available, which is a big help. Updating a model in production can be tricky, too: the new version may be less accurate than the previous generation and need time to catch up, Fried said.
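As a rough illustration of the drift measurement Fried describes, the sketch below compares a feature's live distribution against its training-time distribution with a two-sample Kolmogorov-Smirnov test. The function name, sample data and significance threshold are illustrative assumptions, not part of any particular MLOps product.

```python
import numpy as np
from scipy.stats import ks_2samp

def detect_drift(reference, live, alpha=0.05):
    """Flag drift when a two-sample Kolmogorov-Smirnov test finds the
    live feature distribution differs from the training distribution."""
    statistic, p_value = ks_2samp(reference, live)
    return p_value < alpha, statistic

rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, size=5_000)  # distribution at training time
live_feature = rng.normal(0.8, 1.0, size=5_000)   # live traffic; mean has shifted

drifted, stat = detect_drift(train_feature, live_feature)
print(drifted)  # True: a 0.8-sigma mean shift is easily detected at this sample size
```

In practice this kind of check would run per feature on a schedule, feeding the proactive alerting mentioned earlier rather than a print statement.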
Challenges in the MLOps lifecycle
Because of the interdependency of data and code, changes in data can have a significant impact on the outcome, said Clemens Mewald, director of product management at Databricks, based in San Francisco.
Thus, there needs to be a strong focus on data quality monitoring and feature engineering. One of the most common reasons for production outages is an incompatible change to the modeling approach such as adding or removing features.
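One cheap way to catch the incompatible feature changes Mewald describes is a schema check before scoring. This is a minimal sketch with made-up feature names, not a substitute for a real feature store's validation:

```python
def check_feature_compatibility(training_features, serving_record):
    """Report features that were added or removed since training --
    a common cause of production outages for ML services."""
    expected = set(training_features)
    received = set(serving_record)
    return {
        "missing": sorted(expected - received),     # dropped since training
        "unexpected": sorted(received - expected),  # added since training
    }

# Hypothetical feature sets, for illustration only.
trained_on = ["age", "income", "tenure_months"]
incoming = {"age": 42, "income": 55_000, "region": "EMEA"}

report = check_feature_compatibility(trained_on, incoming)
print(report)  # {'missing': ['tenure_months'], 'unexpected': ['region']}
```

A serving layer could refuse to score (or page the team) whenever either list is non-empty, turning a silent accuracy problem into an explicit failure.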
"Even if a model is compatible, its performance during the training process is not guaranteed to lead to good performance in deployment," Mewald said.
Consequently, A/B testing is a recommended best practice to ensure that new models perform better than the ones they replace. Once a model is deployed, its performance can deteriorate as data distributions shift; purchasing trends, for example, can change over time, requiring a recommendation model to adapt to the new information. Machine learning platforms need to provide the ability to retrain models on new data to keep them fresh and to deploy them safely, closing the loop. A good machine learning platform or service needs to address this issue in a unified way to ensure robust governance, Mewald said.
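A minimal sketch of that A/B pattern: hash-based traffic splitting so each user consistently sees one variant, plus a promotion gate that requires the challenger to beat the champion by a margin. The function names, the 10% split and the 0.01 lift threshold are all illustrative assumptions.

```python
import hashlib

def assign_variant(request_id: str, challenger_pct: int = 10) -> str:
    """Stable traffic split: hashing the request/user id means the same
    id lands in the same bucket on every call and every process."""
    digest = hashlib.sha256(request_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "challenger" if bucket < challenger_pct else "champion"

def should_promote(champion_acc: float, challenger_acc: float,
                   min_lift: float = 0.01) -> bool:
    """Promote only if the challenger beats the champion by a clear margin,
    guarding against noise in the evaluation metric."""
    return challenger_acc - champion_acc >= min_lift

print(should_promote(0.91, 0.93))   # True: 2-point lift clears the threshold
print(should_promote(0.91, 0.912))  # False: lift is below the threshold
```

A real rollout would also log which variant served each request, so the two models' live metrics can be compared before the gate fires.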
Popular MLOps tools
Bharath Thota, vice president of data science for the advanced analytics practice at management consultancy Kearney, sees many larger companies building their own platforms for the end-to-end machine learning pipeline. Examples include Uber's Michelangelo, Facebook's FBLearner Flow and Airbnb's Bighead. Other companies are also open sourcing their platforms, such as Google's TFX and Databricks' MLflow.
This trend is driven by the need to manage the entire machine learning pipeline and to enable easier model updates. It can also help bring the sophistication of DevOps to data science, with orchestration and management capabilities that enable effective machine learning lifecycle management.
Thota said some of the important capabilities to consider when building a new MLOps platform include:
- Managing iterative machine learning experiments and keeping track of metadata like data versions, hyperparameters and learning algorithms.
- Debugging training failures and automatically capturing issues like overfitting, gradient explosion and class imbalances.
- Planning gradual/staged deployment of significant updates to the production model and ensuring failure tolerance by testing appropriately.
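The first capability above, tracking experiment metadata, can be sketched as a tiny run logger. Platforms such as MLflow provide this as a managed service, but the core idea is the same; every name below is illustrative.

```python
import time
import uuid

def log_run(params, data_version, metrics, store):
    """Record one experiment run with enough metadata to reproduce it:
    hyperparameters, training-data version and resulting metrics."""
    run = {
        "run_id": uuid.uuid4().hex,
        "timestamp": time.time(),
        "data_version": data_version,
        "params": params,
        "metrics": metrics,
    }
    store.append(run)
    return run["run_id"]

runs = []  # a real platform would use a tracking server, not an in-memory list
log_run({"lr": 0.01, "max_depth": 6}, "v2024-01-15", {"auc": 0.87}, runs)
log_run({"lr": 0.05, "max_depth": 4}, "v2024-01-15", {"auc": 0.84}, runs)

best = max(runs, key=lambda r: r["metrics"]["auc"])
print(best["params"])  # {'lr': 0.01, 'max_depth': 6}
```

Recording the data version alongside the hyperparameters is what makes a winning run reproducible later, which is the point of the capability Thota describes.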
Dawn of the model development lifecycle
Today's machine learning model development cycle is inefficient, much like the early days of the software development lifecycle, where the speed of software delivery was hampered by the lack of automation and best practices, said Shekhar Vemuri, CTO of technology service company Clairvoyant, based in Chandler, Ariz.
Finding ways to weave together methodologies like Agile and DevOps into the model development lifecycle for machine learning will bring the same benefits to AI. The biggest problem facing MLOps is the siloed fashion in which teams work together. Data scientists use their own toolsets and environments, and often copy data into different systems to build their models. Once built, these models are then handed over to engineering teams to embed into production applications. This process can be extremely error-prone, as well as expensive in terms of both time and effort.
One change for MLOps is to focus on safely going faster, rather than building the perfect model.
"Most teams should focus less on achieving perfection in modeling before putting something in production but aim towards how they can design the product and the system to support more frequent changes, experimentation and then validation of these changes," Vemuri said.
Enterprises also need to focus on building a culture that spans data science, development and operations in a fluid way. Finding machine learning engineers or machine learning operations staff can be a challenge. Anand Rao, global artificial intelligence lead at PwC, recommends companies find promising candidates within their staff and upskill them.
A recent survey by O'Reilly Media revealed an ongoing trend of critical machine learning and AI-specific skills gaps in organizations. At the top of the list was a shortage of machine learning modelers and data scientists, cited by 58% of respondents; close to 40% selected data engineering as a practice where skills are lacking, among other gaps.
"Giving teams access to expert content on topics around AI and ML and forums of fellow practitioners, along with interactive online training scenarios, can help up-level skills and is a solution for both in-office and remote workers," said Rachel Roumeliotis, vice president of content strategy at O'Reilly.