March 25, 2025

SANTA CLARA, Calif. -- The engineers tasked with keeping the world's systems running smoothly face a steep -- and in some ways unprecedented -- learning curve as generative AI takes center stage in IT.

That was the dominant topic of discussion at SREcon Americas 2025 this week, from its kickoff general session by a Microsoft corporate vice president on lessons learned building Microsoft Azure Copilot to birds-of-a-feather sessions and hallway discussions about large language model operations (LLMOps).

Generative AI creates a profound change in the way systems behave and, thus, a profound shift in how they must be managed, said Niall Murphy, co-founder and CEO at SRE tools vendor Stanza Systems, who is better known in the industry as one of the co-authors of Google's seminal 2016 book, "Site Reliability Engineering."

With LLMOps, "We move into a world where determinism with respect to management of a system has gone away, and we are into probabilistic management," Murphy said in an interview with Informa TechTarget. "And so a huge amount of the techniques and mindsets and approaches that we learned from cybernetics in the '50s … have to be supplanted by things like confidence signals and approaches that attempt to look at holism of a system rather than a specific response."
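One way to picture the "confidence signals" Murphy describes is an interval around a sampled success rate, rather than a pass/fail check on a single response. The Python sketch below is illustrative only and is not drawn from Stanza or from the interview. It uses the standard Wilson score interval; the function names, the 90% target and the three-state result are assumptions made for the example.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def classify(successes: int, trials: int, target: float = 0.9) -> str:
    """Hypothetical health check: compare the whole interval to a target rate."""
    low, high = wilson_interval(successes, trials)
    if low >= target:
        return "ok"          # confident the true rate meets the target
    if high < target:
        return "failing"     # confident the true rate is below the target
    return "uncertain"       # not enough evidence either way; keep sampling


if __name__ == "__main__":
    for good, total in [(90, 100), (990, 1000), (50, 100)]:
        print(good, total, wilson_interval(good, total), classify(good, total))
```

The point of the third state is that, with a probabilistic system, "we do not know yet" is a legitimate answer: 90 good responses out of 100 does not establish a 90% success rate.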

Microsoft Azure Copilot: Lessons learned

Microsoft corporate vice president Brendan Burns shared some of his company's early experiences with probabilistic management as it deployed Microsoft Azure Copilot during a plenary session presentation Tuesday.

Among the issues that surfaced for the team that built Azure Copilot was a major change to testing, debugging and observing systems, Burns said.

"Obviously, we're going to monitor all the same things: Do we return results successfully? What's the latency? None of that stuff changes, but it no longer says that your system is working right," he said. "And what is going to say whether your system is working right is the user's feedback."

That means that the most important signal for SRE and DevOps teams maintaining an LLM-based app is a user's "thumbs up" or "thumbs down," which can be quantified with statistical measures such as net promoter score and net satisfaction but is ultimately "a lot less like measuring things and a lot more like social media," Burns said.

Measuring human behavior is slippery at best. For example, the Azure Copilot team noticed that an outage anywhere in the Azure system tended to skew human Copilot evaluations negatively.

"If Azure has an outage in general ... the net promoter score for the client tools takes a dip just because people are just a little bit grumpy," he said. "It's not that different from understanding, 'Is this outage due to my system failing, or some downstream dependency failing?' But it's a lot fuzzier."

Prompt engineering introduced another new wrinkle for the Azure Copilot team, Burns said.

"In these systems, the prompt -- and really, actually, it's more the meta prompt, the stuff that you're putting around the prompt -- is the code," he said. "And so the same things that you think about when you think about rolling out software, you need to be thinking about when you're rolling out the prompt. Any changes there can have a really big impact on the overall quality of the system that you're building."

LLM apps are still maturing, especially for structured approaches to versioning and developing meta prompts, Burns said.

"I don't think we have a really good way of having things like our integrated development environments [IDEs] reason about that right now, or even version it independently," Burns said. "The fact that it's tied into the code is probably a problem, because you'd want to be able to move through it in version space independently. … We're still exploring the right ways to do software development here."
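Burns' idea of moving the meta prompt through "version space" independently of the code could be sketched roughly as follows. This is a hypothetical illustration, not a description of how Azure Copilot works: the `PromptVersion` and `PromptRegistry` classes, their method names and the percentage-based canary rollout are all assumptions made for the example, borrowing rollout habits from ordinary software releases.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class PromptVersion:
    """An immutable meta prompt, identified by name and version (hypothetical)."""
    name: str
    version: str
    template: str

    @property
    def digest(self) -> str:
        # Content hash, so a version label can never silently change meaning.
        return hashlib.sha256(self.template.encode("utf-8")).hexdigest()[:12]


class PromptRegistry:
    """Hypothetical registry that versions prompts separately from app code."""

    def __init__(self) -> None:
        self._versions: Dict[Tuple[str, str], PromptVersion] = {}
        self._stable: Dict[str, str] = {}
        self._canary: Dict[str, Tuple[str, int]] = {}

    def register(self, pv: PromptVersion) -> None:
        key = (pv.name, pv.version)
        existing = self._versions.get(key)
        if existing is not None and existing.digest != pv.digest:
            raise ValueError(f"{pv.name}@{pv.version} already registered with different content")
        self._versions[key] = pv

    def set_stable(self, name: str, version: str) -> None:
        if (name, version) not in self._versions:
            raise KeyError(f"unknown prompt {name}@{version}")
        self._stable[name] = version

    def set_canary(self, name: str, version: str, percent: int) -> None:
        if (name, version) not in self._versions:
            raise KeyError(f"unknown prompt {name}@{version}")
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        self._canary[name] = (version, percent)

    def resolve(self, name: str, user_id: str) -> PromptVersion:
        """Pick the canary for a stable slice of users, else the stable version."""
        canary = self._canary.get(name)
        if canary is not None:
            version, percent = canary
            # Deterministic bucket in [0, 100) so a user always sees the same prompt.
            h = hashlib.sha256(f"{name}:{user_id}".encode("utf-8")).hexdigest()
            if int(h, 16) % 100 < percent:
                return self._versions[(name, version)]
        return self._versions[(name, self._stable[name])]


if __name__ == "__main__":
    registry = PromptRegistry()
    registry.register(PromptVersion("assistant", "1", "You are a helpful cloud assistant."))
    registry.register(PromptVersion("assistant", "2", "You are a concise cloud assistant."))
    registry.set_stable("assistant", "1")
    registry.set_canary("assistant", "2", 10)  # expose the new prompt to about 10% of users
    chosen = registry.resolve("assistant", "user-42")
    print(chosen.version, chosen.digest)
```

Logging the resolved version and digest next to each user rating would let a team attribute a change in feedback to a specific prompt revision rather than to a code deploy.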