AgenticOps for infrastructure faces a deterministic dilemma
Human-in-the-loop and static rules don't scale. LLM-as-judge might break the bank. Can AI agents become reliable and affordable enough to handle the explosion of AI infrastructure?
For enterprise AI to succeed in the long term, agents must assume responsibility for reliably managing IT infrastructure as it grows beyond human comprehension. While trust in AgenticOps is growing, the industry must also resolve a set of thorny tradeoffs among reliability, scalability and cost.
Already, AI-generated code is pushing defects, vulnerabilities and reliability issues downstream at high volume, overwhelming IT operations teams and typical code review mechanisms. DevSecOps vendors' proposed solution: more AI agents for testing, hardening and incident management, including automated remediation features.
There's evidence that IT operations teams are embracing this agentic help: A May Omdia survey of 400 IT and cybersecurity professionals in North America found that 81% cited IT operations as the functional area where they were using, piloting or planning to use AI agents. The next most popular answer -- data analytics or business intelligence -- was cited by 62% of respondents.
AI agent autonomy is another matter for most IT teams, although recent market research suggests that trust in agents to make autonomous decisions is growing in some areas.
"I've asked this question three or four times [in my research], and it usually comes out as a perfect bell curve of 20% [using AI for intelligent] alerts, 60% [for remediation] recommendations and 20% autonomous actions," said Bob Laliberte, an analyst at TheCube Research, who surveyed 330 IT professionals in May to gauge their use of AI in networking decisions. "This year, when I did it, using it for alerts only was down to 1.2%; recommendations only were at 10%; recommendations with manually triggered automation were at 60%; and autonomous actions were up to 28.5%. So we're starting to see that shift as people get comfortable with the technology."
Still, trust remains a significant roadblock to AgenticOps. In Laliberte's May 2026 survey, "Trust or time to validate and be comfortable with the technology" was the top answer to the question "What is the biggest barrier to adopting agentic AI for autonomous actions," chosen by 59.7% of respondents. This trepidation was echoed by attendees at industry conferences such as IBM Think, Red Hat Summit and Cisco Live this year.
The trust gap vs. runaway AI infrastructure scale
Enterprise infrastructure teams have good reason to be wary of AgenticOps autonomy, given anecdotal horror stories about AI agents going rogue. In two recent cases, agents deleted entire email inboxes and production databases along with their backups.
Industry research has also shown that agents based on frontier large language models (LLMs) are limited in their ability to reliably handle long multi-step workflows. A report published by Microsoft Research in April showed that, across 20-step document-based workflows, agents based on frontier models corrupted those documents on average 25% of the time.
"The somehow worse finding: adding an agentic harness with tools makes performance 6% worse on average," wrote Corey Quinn, chief cloud economist at Duckbill, in a Substack newsletter May 12. "The entire architectural premise of the agentic movement currently being rammed down our throats ('give the model tools and it becomes more capable') provably degrades outcomes in this benchmark."
Despite the warranted caution, pressure on IT operations teams continues to force the issue, industry watchers say.
"IT admins have been struggling with mounting complexity and having to take on new responsibilities for years," said Scott Sinclair, an analyst at Omdia, a division of Informa TechTarget. "There is too much work to do for the modern admin. The opportunity to leverage tools that can significantly reduce the workload is simply too enticing to pass up."
So far, a common answer from infrastructure vendors to worries about AI agents' reliability has been to keep a "human in the loop" to evaluate agents' proposed solutions to problems and oversee their results.
However, that human in the loop is quickly being outpaced by the sheer scale of AI infrastructure. Data centers are expanding to gigawatt scale as part of a $5 trillion buildout to keep pace with frontier AI development, and behemoth rack-scale hardware systems such as Nvidia's Vera Rubin are set to ship in the second half of this year.
AI infrastructure resources are in such demand that new network chips and devices are being developed to support server clusters spread across multiple gigawatt data centers. Cisco execs estimated during this year's Cisco Live conference keynote that network traffic associated with AI will triple in the next three years alone.
Long term, AgenticOps must work reliably and autonomously at scale for enterprise AI to succeed, said Roy Illsley, an analyst at Omdia.
"In the next 12 to 18 months, this whole agentic orchestration [movement] is going to accelerate, because as you start scaling, talking of megawatt racks, that's when you're going to hit the need for it," Illsley said. "At some point you're going to want to remove the human from the loop, because they're actually slowing down everything."
AgenticOps' deterministic dilemma
As AgenticOps tools evolve beyond the need for constant human oversight, the first proposed fix for the reliability issues inherent in probabilistic AI systems has been to introduce determinism through rules, runbooks and other non-AI workflows that guide agents' work.
"Enterprises should keep agents inside deterministic workflows, where agent reasoning handles orchestration and deterministic systems handle validation and execution," wrote Jonathan Lebert, director and lead of AIOps for the Booz Allen Chief Technology Office, in an email to TechTarget this month. "Long term, success comes from an architecture where every agentic action is bounded, observable, testable, reversible and governed. That's how organizations gain the speed of agentic AIOps without letting small model errors escalate into major infrastructure failures."
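As a rough illustration of the pattern Lebert describes, the sketch below keeps the agent's role to proposing an action, while plain deterministic code validates it against a fixed policy, records it and only then executes it. Everything here is hypothetical and chosen for illustration: the `ProposedAction` and `Policy` types, the `run_bounded` function, and the action and target names do not come from any vendor's product or from Booz Allen.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ProposedAction:
    """An action an agent wants to take (hypothetical schema)."""
    name: str
    target: str
    reversible: bool
    affected_hosts: int


@dataclass(frozen=True)
class Policy:
    """Static, human-authored limits (hypothetical schema)."""
    allowed_actions: frozenset
    protected_targets: frozenset
    max_affected_hosts: int


def validate(action: ProposedAction, policy: Policy) -> list[str]:
    """Deterministic checks: same input always yields the same verdict."""
    violations = []
    if action.name not in policy.allowed_actions:
        violations.append("action not on allowlist")
    if action.target in policy.protected_targets:
        violations.append("target is protected")
    if not action.reversible:
        violations.append("action is not reversible")
    if action.affected_hosts > policy.max_affected_hosts:
        violations.append("blast radius exceeds limit")
    return violations


def run_bounded(
    action: ProposedAction,
    policy: Policy,
    execute: Callable[[ProposedAction], None],
    audit_log: list,
) -> bool:
    """Log every proposal; execute only if no rule is violated."""
    violations = validate(action, policy)
    audit_log.append(
        {"action": action.name, "target": action.target, "violations": violations}
    )
    if violations:
        return False
    execute(action)
    return True


# Example run with made-up actions; `executed.append` stands in for a real executor.
policy = Policy(
    allowed_actions=frozenset({"restart_service", "scale_out"}),
    protected_targets=frozenset({"prod-db-backups"}),
    max_affected_hosts=5,
)
audit_log: list = []
executed: list = []

ok = run_bounded(
    ProposedAction("restart_service", "web-tier", True, 2),
    policy, executed.append, audit_log,
)
blocked = run_bounded(
    ProposedAction("delete_volume", "prod-db-backups", False, 1),
    policy, executed.append, audit_log,
)
```

In this sketch the restart goes through, while the volume deletion is rejected on three separate rules and still leaves an audit record. The policy itself is static and hand-maintained, which is the limitation the analysts quoted below raise.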
However, in the long run, rules and runbooks won't necessarily keep pace with agentic AI scale, either, according to some industry experts. Instead, they risk bringing AgenticOps back to where the first generation of AIOps ended up: used for specific, simple operations, but far from the grand "NoOps" vision it started with.
"Part of the problem with the older AIOps platforms was the fact that the rules under which they operated were fairly static, and you could change them, but it's a manual process to change them," said Jim Frey, an analyst at Omdia. "The idea of agentic operations is that it can be more adaptable, it can go and gather data and make its own conclusions about what to do next."
Another common approach to shoring up AI agents' reliability has been to use another AI agent, backed by a separate LLM, as a judge. But as the cost of AI tokens becomes its own major roadblock to enterprise AI adoption, even IT vendors are questioning the viability of this approach.
"When you have millions of transactions happening, these techniques simply don't scale," said Atindriyo Sanyal, co-founder and chief product and technology officer at Galileo, an AI observability company Cisco acquired in April, during a Cisco Live keynote presentation on June 3. "What you really need are metrics that are able to give you the same level of accuracy but dramatically cheaper and faster."
A new hope? Enter specialized observability models
Cisco's Galileo is one example of the latest approach to reliable AgenticOps: automation that uses lower-cost, specialized models to evaluate AI agent behavior, with the aim of balancing reliability and cost. Cisco officials detailed plans during Cisco Live to make Galileo's small language model, Luna, now part of Splunk, a foundational element of its new Cloud Control AgenticOps platform.
Datadog rolled out a similar agentic observability architecture during its DASH conference this month. IBM and Red Hat also advocated during conferences in May for the use of specialized models, including small language models and open weight models, to cut AI costs.
Specialized models are gaining validation outside IT management. A Cisco customer at a data analytics company said specialized models have become a necessity for accurate results.
"The thing that we do differently -- and this will matter a lot -- we're not just an LLM, we're not just something off the shelf," said Jono Luk, chief product officer at SumerSports in Palm Beach, Fla., which markets an American football analytics platform to professional and college-level teams, during a breakout session presentation at Cisco Live. "One particular [customer] organization said, 'I have five licenses for every single employee in the IT group for every single LLM vendor and still haven't solved my problem.'"
However, for now, industry watchers are still split on whether this approach will be viable long-term. Whether specialized and open-weight models will actually yield cost savings in practice for enterprises remains difficult to quantify, and they have yet to prove themselves as reliable arbiters of agentic behavior at scale in production.
"Autonomy is a goal, but not a necessary step to achieve meaningful benefits from adopting agentic AIOps. Full autonomy isn't required to see real value -- decision‑support use cases are where organizations gain the most impact today," wrote Booz Allen's Lebert. "Trust in agentic AIOps won't come from the perfect model; it will come from our ability to manage the risk using observability, continuous evaluation and other operational safeguards."
Another analyst said he's optimistic that specialized models will crack the autonomous AgenticOps problem.
"We could find ourselves at a point where models can 'think outside the box,' as it were," wrote Bradley Shimmin, an analyst at Futurum Group, in an email to TechTarget this month. "I think we already see that in how good models are at inferring intent and disambiguating user requests. That amazes even this grizzled skeptic. When it comes to ideas like self-healing based on agentic root cause analysis, I think we're already knocking on that door."
Even if specialized models move AgenticOps past deterministic training wheels, it will take significant organizational maturity, particularly in data management, to use them effectively, Omdia's Frey predicted.
"Tracking 100% of agentic activity implies clean logs, and that adds cost, too," he said. "All new architectures, including agentic, aren't going to come free."
One representative from a large customer organization said during a Cisco Live breakout session that he plans to implement "self-healing" IT ops over the next 12 months. This will come as the capstone of a multi-year data management project for observability that also included centralizing data collection on OpenTelemetry and consolidating multiple observability tools and data repositories.
"Our ultimate goal is self-healing, but there are a lot of phases we have to go through," said Nimesh Bernard, senior director of enterprise observability at mortgage lender Fannie Mae. "The data consolidation we have, by the end of this year or early next year, will be in a place to do a lot of the things in a self-healing [workflow]. ... That's why we are able to implement a lot of AI use cases without fear."
Beth Pariseau, senior news writer for Informa TechTarget, is an award-winning veteran of IT journalism.