AIOps tools expand as users warm slowly to autoremediation

AIOps tool vendors keep expanding the environments they can support with automated remediation features, but users are taking their time to move beyond root cause analysis.

AIOps tools promise to automate incident resolution tasks for an ever-expanding list of infrastructure types, but IT pros are still hesitant about how extensively to use automated remediation in production.

AIOps has generated industry hype since 2017, as advances in machine learning algorithms prompted IT monitoring vendors to envision a new method of automation for their products. At the same time, complex microservices infrastructures became impossible to manage by human hands alone. Since then, AIOps tools have grown more sophisticated, adding automated remediation features to event correlation and automated root cause analysis, and AIOps vendors that began in specialized areas have also broadened the workloads their tools can support.

Most recently, those vendors include Epsagon, which emerged in 2018 with AI-supported distributed tracing for serverless environments and expanded in 2019 to include container and cloud workloads. It now offers AIOps features it calls Applied Observability, which automate menial incident resolution tasks in response to metrics and logs in addition to traces. Last month, Epsagon launched a partnership with Microsoft centered on Kubernetes environments after previously inking a deal with AWS focused on its Lambda serverless compute service.

AIOps player OpsRamp also expanded its OpsQ tool set with new support this week for synthetic monitoring, which uses scripted transactions to emulate workloads and expose weak links in multi-transaction IT systems. This isn't unique to OpsRamp, as most application performance monitoring (APM) vendors such as Dynatrace and Datadog are also known for synthetic monitoring. But it brings support for more monitoring types under OpsRamp's purview, which already includes metrics and logs. This additional data will enhance OpsRamp's proactive performance degradation detection, automated root cause analysis and automated remediation features.
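To make the idea of scripted transactions concrete, here is a minimal sketch of a synthetic check in Python. It is not OpsRamp's implementation or API; the step names and the `run_synthetic_transaction` and `weakest_link` helpers are hypothetical, and real checks would call live endpoints rather than the stand-in functions used here.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class StepResult:
    name: str
    ok: bool
    seconds: float
    error: Optional[str] = None


def run_synthetic_transaction(
    steps: List[Tuple[str, Callable[[], None]]],
    clock: Callable[[], float] = time.perf_counter,
) -> List[StepResult]:
    """Run scripted steps in order, timing each one; stop at the first failure."""
    results = []
    for name, action in steps:
        start = clock()
        try:
            action()
            results.append(StepResult(name, True, clock() - start))
        except Exception as exc:
            results.append(StepResult(name, False, clock() - start, str(exc)))
            break  # later steps depend on this one, so the transaction ends here
    return results


def weakest_link(results: List[StepResult]) -> Optional[StepResult]:
    """Return the failed step if there is one, otherwise the slowest step."""
    failed = [r for r in results if not r.ok]
    if failed:
        return failed[0]
    return max(results, key=lambda r: r.seconds) if results else None


# Hypothetical three-step transaction; the checkout step simulates an outage.
def fail_checkout() -> None:
    raise RuntimeError("payment gateway timeout")


example = run_synthetic_transaction(
    [("login", lambda: None), ("checkout", fail_checkout), ("receipt", lambda: None)]
)
print(weakest_link(example))
```

Running the script on a schedule and feeding the per-step results into the monitoring tool is what lets a failed or slow step surface before real users hit it.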

Users of these tools say automated event correlation and root cause analysis have made a significant impact on their ability to respond quickly to IT incidents.

"Since we started monitoring [with Epsagon] we've experienced fewer incidents," said Arne Saupe, director of engineering at Farmer's Fridge, a food services company in Chicago, which uses Epsagon for an IT environment composed entirely of AWS Lambda functions. "Previously, if issues were intermittent, it might take us a while to trace it; so now we can see and resolve them on a permanent basis."

Overall, the company's mean time to repair (MTTR) for problems in its IT environment has fallen 55% compared with the mix of tools its engineers previously used, which included AWS CloudWatch and homegrown creations, Saupe said. Incident-related IT interruptions have been reduced 35%.

Epsagon was a standout for serverless monitoring because it uses AI to automatically discover all the parts of a serverless infrastructure and how they fit together, Saupe said. At the time Epsagon emerged, most other serverless monitoring tools, including native AWS tools, had blind spots as functions traversed multiple systems.

"In cases where we're having issues, it may be triggered by something three steps [upstream] from a Lambda [function] that's affected," Saupe said. "It used to take us a while to trace, but now we see all the inputs going into that Lambda, and can work our way back, see what we recently changed that might be causing the Lambda to fail."
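The backward walk Saupe describes can be illustrated with a small dependency-graph search. This is only a sketch of the general technique, not a description of how Epsagon works internally; the component names, the `upstream` map and the `recently_changed_upstream` helper are all hypothetical.

```python
from collections import deque
from typing import Dict, List, Set, Tuple


def recently_changed_upstream(
    upstream: Dict[str, List[str]],
    changed: Set[str],
    start: str,
    max_hops: int = 3,
) -> List[Tuple[str, int]]:
    """Breadth-first walk from `start` toward its inputs.

    Returns (component, hops) pairs for every component within `max_hops`
    that appears in `changed`, nearest first.
    """
    seen = {start}
    queue = deque([(start, 0)])
    suspects = []
    while queue:
        node, hops = queue.popleft()
        if node in changed:
            suspects.append((node, hops))
        if hops == max_hops:
            continue  # do not look further upstream than the hop limit
        for parent in upstream.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, hops + 1))
    return suspects


# Hypothetical serverless pipeline: each key lists the components feeding it.
upstream = {
    "charge-card": ["order-queue"],
    "order-queue": ["validate-order"],
    "validate-order": ["api-gateway", "menu-table"],
}
changed = {"menu-table", "unrelated-fn"}  # components deployed or edited recently

print(recently_changed_upstream(upstream, changed, "charge-card"))
```

The search only works if the map of inputs is complete, which is why automatic discovery of how the pieces fit together matters as much as the analysis that runs on top of it.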

While AI-based discovery and root cause analysis are important parts of that process, Saupe said he hasn't yet begun to experiment with automated remediation using Epsagon's Applied Observability features, though he intends to experiment with them soon.

"It's something the team is interested in learning how to use better, but we haven't dedicated the time to it yet," he said.

OpsRamp's AIOps tool now supports synthetic monitoring.

GreenPages takes small steps into automated remediation

GreenPages Technology Solutions, a systems integrator and managed IT services company, has been a reseller partner and user of OpsRamp since the vendor was spun off in 2014 from IT services vendor Netenrich, with which GreenPages also partnered. That was before OpsRamp focused on AIOps, but GreenPages found the tool's support for physical, virtual and cloud environments in a single product useful for running its managed services platform for midsize customers.

"At the time, the other vendors we worked with supported all three, but based on acquisitions, so even though you worked with one company, they were still disparate tools," said Ron Dupler, CEO of GreenPages, based in Kittery, Me. "'Single pane of glass' is an overused term, but at the time, OpsRamp had what we needed."

That has continued as OpsRamp supported new forms of IT infrastructure, including containers and serverless. It now competes with AIOps specialists such as Moogsoft, but can still fall back on its comprehensiveness, Dupler said.

However, while GreenPages heavily relies on alert reduction and root cause analysis from OpsRamp to run its IT managed services, it's been slower to embrace automated remediation features in production.

"Making sure only real issues get put in front of engineers, and that they're able to benefit from the context of our past experiences is where we've made the biggest bet [with OpsRamp]," GreenPages SVP of services Jay Keating said. As for automated remediation, "we're doing it, but we're timid."

So far, automated remediation has been put in place to resolve simple problems that may crop up, such as a system running out of disk space or a service that needs restarting, Keating said. The company is experimenting with more advanced remediation and evaluating the feedback OpsRamp gives IT staff about what it would have done to automatically resolve incidents if allowed.

"That's been hit or miss," Keating said. "It hasn't gone well enough for us to trust it in production yet."

Often, the solution the tool proposes is correct, Keating said, but proposed for a less than ideal time of day or type of system.
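One way to picture the cautious approach Keating describes is a policy gate that sits between a proposed fix and its execution. The sketch below is an assumption-laden illustration, not OpsRamp's logic: the action names, the protected system types, the 01:00 to 05:00 change window and the `decide` function are all hypothetical.

```python
from dataclasses import dataclass
from datetime import time as clock_time


@dataclass(frozen=True)
class Proposal:
    action: str           # what the tool wants to do, e.g. "restart_service"
    system_type: str      # kind of system affected, e.g. "web" or "database"
    at: clock_time        # local time at which the tool wants to act


# Hypothetical policy values an operations team might choose.
SIMPLE_ACTIONS = {"free_disk_space", "restart_service"}
PROTECTED_SYSTEMS = {"database"}
WINDOW_START, WINDOW_END = clock_time(1, 0), clock_time(5, 0)


def decide(proposal: Proposal, observe_only: bool = False) -> str:
    """Return "execute", "recommend" or "defer" for a proposed remediation.

    "recommend" means the proposal is only shown to engineers;
    "defer" means it is acceptable but must wait for the change window.
    """
    if observe_only or proposal.action not in SIMPLE_ACTIONS:
        return "recommend"
    if proposal.system_type in PROTECTED_SYSTEMS:
        return "recommend"
    if not (WINDOW_START <= proposal.at < WINDOW_END):
        return "defer"
    return "execute"


print(decide(Proposal("restart_service", "web", clock_time(14, 0))))
```

Under this kind of gate, a correct fix proposed at the wrong time of day is held rather than discarded, and anything beyond the short list of simple actions stays advisory until the team trusts it.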

OpsRamp officials said that AI performance and remediation depend on the amount of data and training the algorithms have, and that the company will continue to improve its AIOps products in response to customer feedback. OpsRamp also recently added transparency features for its algorithms, such as Observed Mode to show users which alerts would have been correlated before running OpsQ in production and Recommend Mode to point out optimization opportunities.

However, GreenPages will forge ahead, according to Keating. Eventually it hopes to use OpsRamp as a primary execution platform for IT security and cloud cost optimization, as well as IT ops workloads, which the tool also supports.

But GreenPages also has ServiceNow tools that offer such execution engine features, and both OpsRamp and ServiceNow offer integration that can take in data from the other. As AIOps vendors continue to expand, such overlap will only increase, and users such as GreenPages must ultimately decide which tool will take charge of centralized IT automation.

"At some point, [comprehensive data collection] content will be the king, and everything else will be just a scripting platform," Keating said.
