Enterprise shops look to AIOps for IT root cause analysis

IT ops pros look to AIOps tools to help find the smoking gun in failing IT stacks, but some call automated root cause analysis more dream than reality.

Enterprises have automated IT root cause analysis in their sights as AIOps hype reaches a fever pitch, but some IT vendors are reluctant to jump on the AI bandwagon.

Automated IT root cause analysis is central to many AIOps tools, a category of operations tools that incorporate AI to improve their ability to monitor and manage IT deployments. New Relic's error profiles feature, for example, is meant to narrow down the cause of glitches and speed up IT incident response. Enterprise IT ops pros imagine a day when that response is automated through application development tools and IT service ticket systems. AIOps tool vendors even tout a proactive, rather than reactive, IT monitoring approach, which identifies the root cause of potential problems and stops them before they reach the troubleshooting stage.

But despite customers' demand for such features, some IT monitoring vendors won't adopt AIOps.

LogicMonitor, for example, touts its monitoring tools' ability to pinpoint the causes of errors in the IT stack, but its founder and chief evangelist, Steve Francis, balks at the suggestion that machine learning can automate IT incident response.

"People tend to define machine learning as anything they don't understand," he said. "Anything we do understand is just statistics. Everyone's saying we need to be talking about this, but I don't see the value of it yet."

LogicMonitor customers beg to differ.

"They need to think a little beyond red light/green light and [rather about] how to build innovation around root cause analysis," said Miten Marvania, founder and COO at Agio, a managed IT and cybersecurity firm in Norman, Okla.

Marvania wants LogicMonitor's tool to understand the real effect of events in the IT stack. For example, if a load balancer has five servers attached and one fails, he wants LogicMonitor to know that four out of those five servers must fail before it's a critical event.

"Right now, it's a binary approach where, if a server is down, it's critical," Marvania said. "They need to assess the real impact of that."
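The impact-aware rule Marvania describes can be sketched in a few lines. This is a minimal illustration, not LogicMonitor's API or configuration format: the function name `pool_severity` and the `critical_failures` parameter are hypothetical, and the numbers come from his five-server example.

```python
def pool_severity(total_servers: int, failed_servers: int, critical_failures: int) -> str:
    """Classify a load-balanced pool by how many members are down.

    critical_failures is a hypothetical, human-chosen threshold: the number
    of failed members at which the pool itself should be treated as critical.
    """
    if not 0 <= failed_servers <= total_servers:
        raise ValueError("failed_servers must be between 0 and total_servers")
    if failed_servers == 0:
        return "ok"
    if failed_servers >= critical_failures:
        return "critical"
    # Some members are down, but the pool still has enough capacity.
    return "warning"


# Marvania's example: five servers, critical only once four have failed.
print(pool_severity(5, 1, critical_failures=4))  # warning
print(pool_severity(5, 4, critical_failures=4))  # critical
```

The threshold itself still has to come from someone who knows how much capacity the application needs.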

AIOps fans flames of IT root cause analysis debate

The most advanced DevOps shops no longer want IT root cause analysis to be the focus of IT incident response anyway, as infrastructures become ephemeral and applications distributed. But in more traditional enterprise IT shops, automatic root cause analysis of ongoing problems is still the holy grail of AIOps.

"If LogicMonitor had a way to understand the least common denominator of problems, then it could directly tell us: 'It's this network switch that's acting up' or 'These database errors seem to be at the end of the chain of dependencies,'" said Andy Domeier, director of technology operations at SPS Commerce, a communications network for supply chain and logistics businesses based in Minneapolis. "It gives an engineer a lot more context about how to approach that problem."
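One simple reading of Domeier's "end of the chain of dependencies" idea is to keep only the alerting components that do not depend, directly or indirectly, on another alerting component. The sketch below assumes a hand-maintained, acyclic dependency map; the component names and the function `likely_root_causes` are hypothetical and do not represent any vendor's feature.

```python
def likely_root_causes(depends_on: dict[str, list[str]], alerting: set[str]) -> set[str]:
    """Return alerting components with no alerting dependency beneath them.

    depends_on maps a component to the components it relies on.
    The map is assumed to be acyclic.
    """
    def has_alerting_dependency(node: str) -> bool:
        seen: set[str] = set()
        stack = list(depends_on.get(node, []))
        while stack:
            dep = stack.pop()
            if dep in seen:
                continue
            seen.add(dep)
            if dep in alerting:
                return True
            # Keep walking down the chain through healthy or unmonitored parts.
            stack.extend(depends_on.get(dep, []))
        return False

    return {node for node in alerting if not has_alerting_dependency(node)}


# Hypothetical stack: web -> api -> (db, cache), db -> switch-1
depends_on = {
    "web": ["api"],
    "api": ["db", "cache"],
    "db": ["switch-1"],
    "cache": [],
}
print(likely_root_causes(depends_on, {"web", "api", "db"}))  # {'db'}
```

The result is a shortlist of candidates for an engineer to check, not a verdict, and it is only as good as the dependency map behind it.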

Root cause analysis pain points

LogicMonitor's Francis is not convinced, however, that the approaches Marvania and Domeier suggest are viable. Users can configure LogicMonitor with thresholds that indicate how many load-balanced servers can fail without a critical alert being triggered, for example, but he doubts the discovery of such configurations could be automated out of the box.

"That's not stuff we can know absent human knowledge about their application," Francis said. "That's human knowledge and configuration."

As for Domeier's common denominator idea, Francis said LogicMonitor plans to help customers narrow down likely root cause culprits by correlating alerts, but he is skeptical such correlations can be reliably precise.

"I'm not sure that's a legitimate thing to say anyone will have in the short term," Francis said. "We can shorten the time it takes you to look for your root cause. But I don't think we'll ever be able to say, 'This is it.'"

LogicMonitor could be bluffing about the workability of AIOps for IT root cause analysis as a response to heavy marketing messages from its competitors. But those who've seen AI deployed at scale in IT operations have said that Francis has legitimate concerns.

"The blessing and the curse of these things is that they often demo incredibly well," said Ben Sigelman, senior staff software engineer for Google from 2003 to 2012 and co-founder of infrastructure monitoring startup LightStep. "In a controlled environment where you know what the inputs are, you can show things that are almost magical -- which means someone's going to buy it and then you have the issue of making it work in production."

LightStep specializes in monitoring cloud-native microservices infrastructures, but Sigelman doesn't plan to use the AI buzzword to market LightStep either.

"I would rather we talk about the benefits or features of products outright and, below the fold, it can say it's because of AI, statistics processing or machine learning," Sigelman said. "Just saying, 'This works because it's AI' is really glib."

Could crowdsourced AIOps boost root cause analysis?

Some industry watchers wonder if proactive IT monitoring would be more realistic with a broader set of data collected from multiple enterprise customers of a particular tool. Such data analysis is already in use in manufacturing and refinery facilities, where equipment vendors analyze streams of data from their machines to identify potential failures before they happen.

"We'll see groups of like-minded companies allowing customers to subscribe to aggregated feeds of cleaned-up data," said Brad Shimmin, an analyst at Current Analysis. "AI processing against not just your data but everyone's can help make more accurate predictions."

Francis said this might enable proactive monitoring on broad terms, such as detecting whether a cloud service provider's data center or an internet service provider's network connection is down in a particular region.

"That is a solvable case of root cause analysis because you have enough data from enough data points, and it's a relatively simple problem," he said.

The word relatively is operative there, Francis added.

"Having been a network guy, I know you can have trace routes that work perfectly well for one device that's going over the same network and another one that totally fails" because of EtherChannel routing behind the scenes, he said. "So even that is not going to be a perfect use case."

That's to say nothing of the obvious potential security and compliance snags of aggregate data for automated IT root cause analysis. New Relic, for example, has held off on such a service because of customer worries about sharing IT monitoring data with other companies.
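As a rough illustration of the regional-outage case Francis calls relatively simple, the sketch below flags a region when most of the customers reporting from it see failures. Everything in it is an assumption made for illustration: the report format, the function name `suspected_regional_outages` and both thresholds are hypothetical, and it ignores the per-device routing differences Francis warns about.

```python
from collections import defaultdict


def suspected_regional_outages(reports, min_customers=3, min_failure_ratio=0.8):
    """Flag regions where most reporting customers see failures.

    reports is an iterable of (customer_id, region, reachable) tuples.
    A customer counts as failing in a region if any of its probes failed.
    """
    by_region = defaultdict(dict)  # region -> {customer_id: all probes reachable?}
    for customer, region, reachable in reports:
        previous = by_region[region].get(customer, True)
        by_region[region][customer] = previous and reachable

    flagged = []
    for region, customers in by_region.items():
        total = len(customers)
        failing = sum(1 for ok in customers.values() if not ok)
        # Require enough independent reporters before trusting the ratio.
        if total >= min_customers and failing / total >= min_failure_ratio:
            flagged.append(region)
    return sorted(flagged)


reports = [
    ("a", "us-east", False), ("b", "us-east", False), ("c", "us-east", False),
    ("a", "eu-west", True), ("b", "eu-west", False), ("c", "eu-west", True),
]
print(suspected_regional_outages(reports))  # ['us-east']
```

Requiring several independent customers before flagging a region is what separates a provider-side outage from one customer's local problem.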

Beth Pariseau is senior news writer for TechTarget's Data Center and Virtualization Media Group. Follow @PariseauTT on Twitter.
