Published: 2025-03-26

SANTA CLARA, Calif. -- Machine learning introduced collaboration and performance management issues for engineers, but large language models present an even greater departure from traditional approaches to reliability engineering.

Machine learning operations (MLOps) versus large language model operations (LLMOps) was the topic of presentations and sessions at SREcon this week, including a discussion session specifically addressing attendees' experiences supporting machine learning models in production and how that compares with LLMs.

The rise of both MLOps and LLMOps echoes the earlier transition to DevOps -- past points of collaboration between IT ops specialists, developers and data scientists introduced similar organizational friction, attendees said.

"This goes back to DevOps versus SRE and how you handle expertise and responsibility," said Jacob Scott, an SREcon attendee and software engineer with 15 years of experience in operational excellence. "[Things] like, should data scientists be on call? And how do you get people to do that?"

With LLM-based apps, SRE teams face a similar question: "Who can fix it?" Scott said. "There are a lot of failures that SREs are best positioned to fix, like load shedding if your database is on overload and figuring out it's overloaded. But who is positioned to respond to an LLM being buggy, or hallucinations?"

MLOps and LLMOps are both all about the data

Another similarity between MLOps and LLMOps is an emphasis on precision when managing data inputs and the struggle to deal with the deep and subtle dependencies this can create in the overall system. That was the consensus among participants in a discussion session this week conducted under the Chatham House Rule, in which statements are repeatable but not attributable to their specific sources.

"I've seen multiple places where like you get bad data, like a sub-pipeline goes down, and you don't notice, or you silently stop emitting events that feed it in some way, and the model degrades," a discussion session participant said. "But if you don't have enough fidelity of measuring what success actually is, and days or weeks later, someone running a query across a higher fidelity system will be like, 'Why did this business metric go whack?' And you're like, 'Oh, [expletive]. Now I realize my pipeline is broken.'"

This can worsen political friction between parts of the organization, according to another participant. He recalled when an ML team perceived a similar failure as an indication that SREs didn't care about their systems.

The problem of subtle degradation in complex systems also applies to LLMOps, where small changes to prompts and models can have damaging results, participants said.
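The silent pipeline failure a participant described in this section, where an upstream feed stops emitting events and nobody notices until a business metric moves, is the kind of problem a simple freshness and volume check can surface earlier. The following is a minimal sketch in Python, not something presented at the conference. The `FeedStatus` fields, the `check_feed` function and both thresholds are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class FeedStatus:
    """Snapshot of one upstream data feed (all fields are illustrative)."""
    name: str
    last_event_at: Optional[datetime]  # None if the feed has never emitted
    events_last_hour: int
    baseline_events_per_hour: float    # e.g. a trailing average


def check_feed(
    status: FeedStatus,
    now: datetime,
    max_silence: timedelta = timedelta(minutes=30),  # assumed threshold
    min_volume_ratio: float = 0.5,                   # assumed threshold
) -> List[str]:
    """Return human-readable problems; an empty list means the feed looks healthy."""
    problems: List[str] = []

    # Staleness: the feed has gone quiet for longer than we tolerate.
    if status.last_event_at is None:
        problems.append(f"{status.name}: no events ever received")
    elif now - status.last_event_at > max_silence:
        silent_for = now - status.last_event_at
        problems.append(f"{status.name}: silent for {silent_for}")

    # Volume drop: events still arrive, but far fewer than the baseline.
    if status.baseline_events_per_hour > 0:
        ratio = status.events_last_hour / status.baseline_events_per_hour
        if ratio < min_volume_ratio:
            problems.append(
                f"{status.name}: volume at {ratio:.0%} of baseline"
            )

    return problems


now = datetime(2025, 3, 26, 12, 0)
healthy = FeedStatus("clicks", now - timedelta(minutes=5), 950, 1000.0)
silent = FeedStatus("purchases", now - timedelta(hours=3), 0, 1000.0)
degraded = FeedStatus("searches", now - timedelta(minutes=1), 200, 1000.0)

for feed in (healthy, silent, degraded):
    print(feed.name, check_feed(feed, now))
```

A check like this catches only the absence or thinning of data. It does not tell you whether the model's output quality has degraded, which is the harder measurement problem participants raised.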

LLMOps is made of people

However, the results of subtle changes to LLM data can be even thornier to track down and cause higher-profile failures, said Niall Murphy, co-founder and CEO of SRE tools vendor Stanza Systems, in an interview with Informa TechTarget.

"A lot of quality concerns [with LLMs] can be quite narrow, like, 'This model has now gone off the rails with respect to the Schleswig-Holstein debate of the 1850s, but it doesn't actually make a difference to 'How do I make pancakes?'" Murphy said. "There are some question spaces which are commoner than others, so you can have a degradation and still not affect the people who care about pancakes. And that's OK, except when that starts to drift and affect other things as well."

MLOps presented a monitoring challenge because failures could be more subtle than a system being "up" or "down." However, those failures could still be measured more concretely -- and avoided more easily -- than with LLMs, which must be measured using subjective human responses to how the AI responds to text-based prompts.

"You're taking a bunch of statistics that traditionally a product manager would look at, and you are pulling them into the reliability side of the house," said a discussion session participant.

According to another, "When you only have signals from a traditional application or a traditional system, you're figuring out how to keep an adaptive system undersaturated, under capacity, in a happy little circle [of performance] … [But] the golden signal for AI might be, 'Is this prompt in the right context? If I try a family of contexts, are the ones that I'm picking effective?' That replaces a golden signal with a process."

The human factor with LLMs, and the relatively high profile of the technology, can also worsen the kinds of organizational conflict some attendees saw with MLOps.

For example, during a presentation Tuesday, Microsoft corporate vice president Brendan Burns described intense mistrust of Azure Copilot among users when the company first rolled out the AI assistant.

"We had some very, very disturbed users when we first rolled out the Azure Copilot who strongly believed that we had stolen … their VM data … and trained it into the model," Burns said.

As the Azure Copilot team refined prompts behind the scenes, there were also sometimes tense discussions among internal stakeholders about which team's prompts were used to respond to user questions, Burns said.

"If I walk up to the Azure Copilot and I say, 'How do I back up my database?' Is that a prompt for the database team to handle, or is that a prompt for the backup team to handle?" he said. "You can look at agentic approaches to blend those answers together, but especially early on, we just chose one handler and went with it. And then teams would get very upset, and they'd say, 'Why didn't you ever choose my handler? Why do you hate us and love the backup team so much?'"

Some discussion session participants reported that user mistrust of AI went even further in their organizations. They recalled users deliberately sabotaging AI test responses because they feared AI would eventually replace their jobs.