For businesses deemed essential during the COVID-19 pandemic, AIOps-driven IT incident response is key to keeping services available for customers amid a long-standing IT skills shortage, as well as more recent disruptions from social distancing.
At KeyBank, a financial services institution headquartered in Cleveland, the road to effective AIOps has been traveled gradually over the last three years. Its results didn't come about from deploying a single tool -- instead, KeyBank had to rebuild its IT monitoring data collection system from scratch, consolidating more than 21 monitoring tools down to an Elastic Stack data repository fed by a Kafka data pipeline.
From there, KeyBank attached AIOps software from Moogsoft, which uses machine learning to correlate events, eliminate false positives and ultimately reduce the high volume of alerts IT teams receive, a process that took several months. The bank also had to reconfigure the rest of its systems, such as its ServiceNow help desk, to integrate with Moogsoft, and wrote its own tool, WatchIt, which attaches runbook information to individual infrastructure components via monitoring ID codes. Some WatchIt runbooks automate the resolution of simple problems, such as a system that has run out of disk space or RAM. The KeyBank team also began to use Moogsoft features that alert them to potential issues before they become incidents and offer hints on how to resolve problems.
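The idea of attaching runbooks to components by monitoring ID can be illustrated with a short sketch. This is not WatchIt's code, which the bank has not published; the IDs, the `Runbook` class and the `handle_alert` function are all hypothetical, and the disk cleanup is a placeholder rather than a real remediation job.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Runbook:
    """Runbook entry for one monitored component (all fields hypothetical)."""
    monitoring_id: str
    description: str
    # Optional automated fix; None means a human follows the written steps.
    remediate: Optional[Callable[[], str]] = None


def free_disk_space() -> str:
    # Placeholder for a real cleanup job (log rotation, temp-file purge, ...).
    return "rotated logs and purged temp files"


# Registry keyed by monitoring ID, so an alert can be joined to its runbook.
RUNBOOKS: Dict[str, Runbook] = {
    "MON-1001": Runbook("MON-1001", "Disk usage above threshold", free_disk_space),
    "MON-2002": Runbook("MON-2002", "Payment API latency: page the on-call engineer"),
}


def handle_alert(monitoring_id: str) -> str:
    """Look up the runbook for an alert and run its automated fix, if any."""
    runbook = RUNBOOKS.get(monitoring_id)
    if runbook is None:
        return "no runbook: escalate to a human"
    if runbook.remediate is None:
        return f"manual: {runbook.description}"
    return f"automated: {runbook.remediate()}"
```

In this sketch, only runbooks that carry a `remediate` function resolve themselves; everything else falls back to written instructions or escalation, which mirrors the article's point that only simple problems are automated.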
"We're past crawl and we're starting to jog," said Mick Miller, senior DevOps architect at KeyBank. "We're seeing a dramatic drop in incidents this year, along with the time it takes to resolve them."
Miller estimated that Moogsoft's alert correlation has reduced the number of alerts sent to DevOps teams by 98% compared with previous years; mission-critical and high-priority incidents have decreased so far in 2020 by a factor of 10.
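To see why correlation cuts alert volume so sharply, consider a deliberately simple, rule-based stand-in. Moogsoft's correlation is machine-learning based and its internals are not described here; the sketch below only groups alerts from the same service that arrive close together in time. The `Alert` fields, the `correlate` function and the five-minute window are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Alert:
    timestamp: float  # seconds since some reference point
    service: str
    message: str


def correlate(alerts: List[Alert], window_seconds: float = 300.0) -> List[List[Alert]]:
    """Group alerts per service; an alert joins the service's open group
    if it arrives within window_seconds of that group's latest alert."""
    groups: List[List[Alert]] = []
    open_group: Dict[str, List[Alert]] = {}  # service -> most recent group
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        group = open_group.get(alert.service)
        if group is not None and alert.timestamp - group[-1].timestamp <= window_seconds:
            group.append(alert)
        else:
            # Start a new group, which a team would see as a single incident.
            group = [alert]
            groups.append(group)
            open_group[alert.service] = group
    return groups
```

Each returned group would be presented to a team as one incident rather than as separate alerts, so a burst of related alerts from one failing service collapses into a single notification.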
In addition to alert reduction, automated root cause analysis and some automated issue resolution through the WatchIt system, Moogsoft generates proactive tips on incident response through Situation Rooms. KeyBank recently replaced its Jabber ChatOps tool with this Moogsoft feature, which analyzes chat text to learn how past incidents have been resolved. Moogsoft then uses that data to issue advisories to KeyBank's IT teams when it detects that similar incidents might occur.
"It also allows you to score [the relevance of those tips] as an end user, which is the best kind of AI, when you have machine learning doing its thing with human input," Miller said.
However, Miller is less skeptical than he used to be about the prospect of self-healing systems built on AI as his team grows more comfortable with IT automation tools.
"We are on track now to really start doing this correctly -- talking to our team in the [network operations center], getting their teams to be much more SRE-oriented in terms of their skill set," Miller said. "When you've got people who are programmers and infrastructure people at the same time, autohealing becomes way more possible -- maybe even inevitable."
Signify Health bridges SRE skills gap with AIOps
Even before the upheaval of COVID-19, companies such as Signify Health, a Dallas-based provider of in-home care services, had to keep up with business growth while advanced IT skills were in short supply, a problem only exacerbated by the pandemic's economic headwinds.
But over the last three months, the company has tested AIOps features for its New Relic IT monitoring tools while they were in beta. The features became generally available last month, and Signify Health has begun to put them into production. Ideally, the company would like to hire SREs for each of its 16 cross-functional DevOps teams, but so far it has an SRE staff of one.
"They're hard to find," said that staff member, Jeffrey Hines, who's worked as a senior SRE at Signify for six months after joining the company as a senior software engineer nine months ago. "We've been looking for months for good people, and I think we've finally got some good candidates, but it's a challenge finding that many good people, so anything that reduces that need, is definitely a plus."
With a growing business to support, the existing DevOps teams have a huge workload that includes migrating on-premises systems to Microsoft Azure and maintaining CI/CD pipelines in addition to monitoring systems and troubleshooting incidents. Hines tested AIOps features added to New Relic One, previewed in September 2019 and released this spring, that included enhanced alert reduction and the automated creation of notifications and workflows in third-party IT workflow tools.
The AIOps features, especially alert reduction, are headed into production at Signify Health, and while they will take some getting used to, Hines expects them to reduce toil for SREs and eventually integrate with the company's Atlassian Opsgenie incident response system.
"I have high hopes, based on what I've seen so far," Hines said. "It's a little further down the road for us, but we really want to feed this into Opsgenie, and feed some kind of automation for resolving issues."
So far, Hines has compared alerts correlated by New Relic's AIOps engine to the full volume of alerts the IT team normally sees and found the correlations to be accurate and reliable.
"The tendency is to get so much noise that you can't figure out what's going on," he said. "That's the biggest impact that it's made so far -- I have a better idea of what to look for first."
Hines and his team are still learning the new features in New Relic One, but one advantage of a SaaS tool is that the company's data is already stored and indexed by New Relic, he said, so Signify Health won't have to update its data repositories for AIOps or migrate data to a new tool.