Meeting the challenges of scaling AI with MLOps

As businesses race to capitalize on the promises of AI in the wake of ChatGPT's launch, strategies to move machine learning software from idea to reality are becoming essential.

By Lev Craig

Published: 14 Mar 2023

Although the past decade saw staggering advances in the capabilities of machine learning models, AI hype has reached new heights since the explosion of ChatGPT and other large language models. But without effective strategies for deploying and managing ML software, businesses might not realize AI's full potential.

Running ML models in production requires striking a tricky balance between ensuring model performance and scaling software to meet increasing demand. Even ChatGPT creator OpenAI isn't exempt; skyrocketing traffic has made server capacity and outages recurring issues since the chatbot's launch.

Enter MLOps: an approach at the intersection of data science, ML engineering and DevOps. Although the MLOps landscape remains in flux, the field seeks to address the challenges that arise when taking excitement around AI beyond the theoretical.

DevOps vs. MLOps

The term MLOps draws on the more familiar DevOps methodology, whose name combines software development and IT operations. Like DevOps, MLOps is an approach to producing software that involves a combination of tooling, strategy and culture.

ML software shares many similarities with its non-ML counterparts. Typical performance metrics like latency and load are still relevant with AI software -- sometimes even more so, given that large models can be highly computationally intensive.

Likewise, designing, deploying and operating ML models requires collaboration across teams. Just as the originators of the DevOps philosophy hoped to reduce conflict between software developers and IT ops teams, MLOps aims to promote collaboration among those groups, data scientists and data engineers.

In fact, some experts argue that MLOps shouldn't be viewed as distinct from DevOps at all: "I think MLOps is an extension of DevOps," said Mikiko Bazeley, head of MLOps at virtual feature store Featureform. "I don't think it's a separate category."

Piotr Niedzwiedz, CEO of MLOps startup Neptune.ai, shared a similar sentiment in an October 2022 blog post. "At the end of the day, we are all just delivering software here," Niedzwiedz wrote. "A special type of software with ML in it but software nonetheless."

Meeting the specific challenges of running AI at scale

That said, ML and AI do differ from traditional software in some important ways. ML pipelines involve several stages that aren't present in the traditional software development lifecycle, especially before and after models enter production.

[Figure: Stages of a DevOps cycle, shown as an infinite loop: plan, code, build, test, release, deploy, operate and monitor.] In addition to the more familiar phases of the DevOps cycle, ML software requires additional stages, such as data ingestion and model retraining.

The importance of data quantity and quality is one of the biggest factors that differentiates AI from non-AI software. "The model is driven by the code, which is driven by the data," Bazeley said, "which is totally not the way traditional software products work."

When creating an ML model, data scientists must gather, clean, transform and prepare their data for the model. Once the model is developed, it is trained and validated on subsets of that data, and evaluated using metrics like accuracy, precision and recall.
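As a rough illustration of the validation step, the sketch below computes those three metrics for a binary classifier from true and predicted labels. The function name and the sample labels are hypothetical, not taken from any tool mentioned in this article; in practice most teams would rely on an established library rather than hand-rolled code.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision and recall for binary labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one example is required")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Fall back to 0.0 when the denominator is empty (an illustrative convention).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Hypothetical validation labels and model predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
```

Accuracy counts every correct prediction, precision asks how many predicted positives were right, and recall asks how many actual positives were found. The three can diverge sharply when classes are imbalanced, which is why they are usually reported together.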

These stages can themselves be highly challenging and time-consuming. But using AI to solve a business problem requires actually putting the working model into production, and deploying and serving an ML model is very different from building one.

"Think about what's needed in order to deploy," said Gilad Shaham, director of product management at MLOps platform Iguazio, in a recent talk at the conference AI at Scale. "The question of how to package, how to secure everything, how to test, CI/CD, automation, monitoring, versioning -- a lot of moving parts that you don't necessarily want the data scientists to have to do."

Some of these are standard DevOps challenges, like integrating with existing applications and scaling software as demand increases. But even familiar issues can be more complex in an MLOps context.

For example, managing computing and infrastructure resource costs is a typical task for IT ops, but the computationally intensive nature of AI raises the stakes and complexity. Likewise, cybersecurity is increasingly integrated throughout the DevOps lifecycle, but security and compliance can be harder to ensure with models that encounter sensitive personal data.

Other challenges are more specific to AI. Producing and operating ML software requires communicating with a larger team that includes data scientists and ML engineers, as well as deciding when and how to retrain models as new data comes in.

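As an example of the difficulties that incoming data creates, Shaham described an ML application that uses the average amount a customer spent in the last three months as a model feature. Because that value changes as time passes and new transactions arrive, it has to be recomputed from fresh data at prediction time. "When you get to deployment, and especially in real-time deployments, how does one access that feature?" Shaham said.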
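One way to picture the problem is to write the feature as a function of a point in time, so the same logic can run over historical data during training and over live data at serving time. The sketch below is an assumption-laden illustration, not Shaham's or Iguazio's implementation: it treats "three months" as a 90-day window, averages per transaction, and uses a hypothetical in-memory list of transactions where a real system would query a database or feature store.

```python
from datetime import datetime, timedelta


def avg_spend_last_90_days(transactions, customer_id, as_of):
    """Average transaction amount for one customer in the 90 days before `as_of`.

    `transactions` is an iterable of (customer_id, timestamp, amount) tuples.
    Returns 0.0 when the customer has no transactions in the window.
    """
    window_start = as_of - timedelta(days=90)
    amounts = [
        amount
        for cid, ts, amount in transactions
        if cid == customer_id and window_start <= ts < as_of
    ]
    return sum(amounts) / len(amounts) if amounts else 0.0


# Hypothetical transaction history.
transactions = [
    ("c1", datetime(2023, 1, 10), 40.0),
    ("c1", datetime(2023, 2, 20), 60.0),
    ("c1", datetime(2022, 9, 1), 500.0),  # too old for either window below
    ("c2", datetime(2023, 2, 1), 15.0),
]

# Training: compute the feature as of the moment each historical label was observed.
training_value = avg_spend_last_90_days(transactions, "c1", datetime(2023, 2, 1))

# Serving: compute the same feature as of the moment the prediction is requested.
serving_value = avg_spend_last_90_days(transactions, "c1", datetime(2023, 3, 14))
```

Passing `as_of` explicitly keeps later transactions out of training rows and ensures both environments share one definition of the feature. The hard part in a real-time deployment is answering that query quickly enough, which this sketch does not attempt to address.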

Common points of failure in operationalizing AI

At a high level, any organization deploying ML models will go through a few of the same stages: manual design and development, followed by automation, and finally model monitoring, retraining and analysis.

To date, MLOps tools and platforms have largely focused on building repeatable processes and automations for these stages. "Getting the data, training and deploying the model, setting up monitoring -- the whole thing needs to be as automatic as possible," Shaham said.

But apart from those broad similarities, the MLOps journey can look very different from organization to organization. Companies that never intend to grow beyond several hundred employees, for example, will likely not require certain infrastructure capabilities that a large enterprise needs.

The post-deployment stages are where many companies looking to adopt ML get tripped up. Overall, there are simply many more ways for ML models to go wrong compared with traditional software, and the consequences can be more significant when they do.

"The hard lesson that many companies learn between getting their models from a proof-of-concept state to the real world is that there's this additional set of requirements that come after you've already developed the first model," said Eric Landau, CEO of active learning software company Encord, in another AI at Scale session.

On a practical level, deploying, maintaining and improving an ML model requires extensive tooling, IT infrastructure and collaboration across a wide range of teams and roles. "Most of the failures are on the production side: how to take that model and make it part of a pipeline and scale it up," Shaham said.

This problem is complicated by the tradeoff between model training and performance. Because data set processing and feature engineering can impact latency, those responsible for various stages of an MLOps pipeline must decide how much effort to sink into these processes.

At times, an ML model may run without issue from a performance perspective, but a change in data distribution leads to degraded accuracy and precision. Thus, in some cases, "when ML software goes wrong, it technically worked," Bazeley said. "But there's behavior or there's an impact that maybe wasn't surfaced early on."
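A monitoring job can watch for this kind of shift by comparing the distribution of a feature in production against the distribution seen in training. The sketch below uses the population stability index (PSI), one common drift statistic chosen here for illustration; the article's sources did not name a specific method. The function name, bin count and sample data are hypothetical, and both samples are assumed to be non-empty.

```python
import math


def population_stability_index(expected, actual, bins=10):
    """PSI between a reference sample and a production sample of one numeric feature."""
    lo, hi = min(expected), max(expected)
    # Equal-width buckets over the reference range; guard against a zero-width range.
    width = (hi - lo) / bins or 1.0

    def bucket_shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the reference range fall into the first or last bucket.
            counts[min(max(idx, 0), bins - 1)] += 1
        # A small floor avoids log(0) for empty buckets.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = bucket_shares(expected), bucket_shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical feature values: training data versus two production samples.
training_sample = [float(v) for v in range(100)]
stable_sample = list(training_sample)
shifted_sample = [v + 50.0 for v in training_sample]

psi_stable = population_stability_index(training_sample, stable_sample)
psi_shifted = population_stability_index(training_sample, shifted_sample)
```

A PSI of zero means the two samples fill the buckets identically, and larger values mean a larger shift. A frequently quoted rule of thumb treats values above roughly 0.25 as a significant change, though any alerting threshold should be tuned to the model in question.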

Training-serving skew refers to a phenomenon where a model performs well in training but is less effective when it's deployed to production. For example, if a recommender system that uses categorical variables encounters a category in the production data that the model wasn't trained on, the system may not be able to handle it.
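A common defensive pattern for the unseen-category case is to reserve an explicit "unknown" bucket at training time and to report which production values landed in it. The encoder below is a minimal sketch of that idea with hypothetical class and method names; it is not the API of any particular library.

```python
class CategoryEncoder:
    """Maps categories seen in training to integer ids and everything else to a reserved id."""

    UNKNOWN_ID = 0

    def __init__(self):
        self._ids = {}

    def fit(self, categories):
        """Assign ids 1, 2, 3, ... in order of first appearance."""
        for category in categories:
            if category not in self._ids:
                self._ids[category] = len(self._ids) + 1
        return self

    def transform(self, categories):
        """Encode categories, sending unseen ones to UNKNOWN_ID instead of failing."""
        return [self._ids.get(c, self.UNKNOWN_ID) for c in categories]

    def unseen(self, categories):
        """Return the categories that were absent at training time, for monitoring."""
        return sorted(set(categories) - set(self._ids))


# Hypothetical product categories for a recommender system.
encoder = CategoryEncoder().fit(["books", "music", "books", "games"])
production_batch = ["music", "garden", "books"]

encoded = encoder.transform(production_batch)   # "garden" maps to UNKNOWN_ID
new_categories = encoder.unseen(production_batch)
```

Routing unknown values to a reserved id keeps the service running, while the `unseen` report gives the team a signal that the model may need retraining on newer data.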

In his AI at Scale talk, Landau discussed a use case from Encord's experience, in which a company designed an ML-powered parking management system. The system worked well in its initial location, Boston, but when the company sought to expand to new cities, the model struggled to keep up.

"They realized -- oh my, the same model that we worked with in Boston, it doesn't work as well in Santa Monica and Seattle and Minnesota," Landau said. "There were certain subsets and use cases where it just wasn't really doing very well."

And this issue goes beyond the technical. When it comes to AI, the stakes are simply much higher than with traditional software, Bazeley pointed out. ML models' unpredictable performance on new data can raise big-picture ethical concerns -- and even a model that's accurate and precise on paper could still be problematic in other, deeply damaging ways.

"It could also be that [the model] is being unfortunately accurate and precise in, let's say, mimicking racist conversations," Bazeley said. "When we think of an ML model, is it working, or is it doing what we want it to? I think it begs a deeper question about what would we feel is acceptable behavior."

The future of AI and MLOps

Unlike DevOps, MLOps is still an emerging field. But as more companies seek to use AI in their applications and software, the need for a well-defined MLOps framework is growing.

In a 2022 Deloitte report, nearly all respondents described AI as critical to their organization's success over the next five years. But half reported problems with maintenance or ongoing support after initial launch, and 41% said their organization didn't have sufficient technical skills when it came to AI and ML.

Ultimately, "the maturity around how to build ML stacks just isn't quite there yet, including people knowing what to prioritize very early on," Bazeley said. "There's going to be a lot of confusion going forward."

Evaluating whether to adopt AI and MLOps

Given that immaturity, organizations considering adopting AI and ML should ask whether they have a genuine business need in mind or are simply jumping onto a trend.

Access to ChatGPT and other large language models, for example, has greatly reduced the barriers to using generative AI. This opens up exciting opportunities for new use cases, especially among those who weren't previously able to take advantage of these technologies.

But at the same time, it's important to think through whether the organization has the resources and need to use AI and ML. When it comes to implementing generative AI, "[companies] need to have an actual product and business model around it," Bazeley said.

[Figure: Chart listing questions to ask to evaluate an AI/ML project in terms of business, data and implementation feasibility.] There are several factors to consider when evaluating a possible ML implementation for your organization, including business needs, data availability and technical resources.

Any company seeking to implement a product with the ChatGPT API, for instance, will still need to deploy and serve that software on its own infrastructure. At a higher level, it will also need to differentiate itself from the many competitors now also using generative AI -- for example, by building up proprietary data assets.

"This is the gap between that's neat -- maybe they had a proof-of-concept model that might do a good job in a demonstration to investors," Landau said, "and that's useful -- something that many people can use consistently in an application over and over. And because AI technology is so new, there will just naturally be fewer companies on the right side of the scale."

In addition to building technical knowledge on the various stages of the MLOps pipeline, successfully operationalizing AI requires collaboration that bridges the business-IT gap. In many organizations, product thinking is pushed over to the business side or product manager, with engineers viewing the technical aspects as their only responsibility.

Combined with increasingly narrow specializations within technical roles, "you see less and less people serving as bridges between the different worlds," Bazeley said.

Supporting creativity and innovation in AI development

Successfully implementing AI in production might also require accepting that ML experiments entail a certain level of risk -- including the fairly mundane risk of running analyses that aren't ultimately profitable.

In a recent Open Data Science survey of ML practitioners, respondents reported that fewer than half of the models in their organizations were ever deployed to a production environment. This low deployment rate is often cited as evidence that ML projects fail, and it can discourage businesses from investing in them. But that mindset might be the wrong way to approach AI.

"What's the balance between supporting the creative spirit of data science," Bazeley asked, "while also making sure that [a model] is well tested and well structured before pushing to production?"

Rather than viewing any model that isn't deployed as a failure, businesses should keep in mind the reality that some analyses simply don't result in useful findings. To make progress and develop truly innovative software, organizations must continue to encourage creativity and experimentation, even in the face of projects that don't pan out as hoped.
