AIOps tools will smooth IT pros' takeoff into the SRE role at Alaska Airlines, as the air carrier's e-commerce division rolls out AppDynamics Cognition Engine.
Seattle-based Alaska has been a customer of AppDynamics, and of parent company Cisco Systems for on-premises networking equipment, since 2017. AppDynamics released Cognition Engine in January 2019, building it on technology from its 2017 acquisition of Perspica.
AppDynamics already offered some AI-based IT automation, or AIOps, features. But Cognition Engine added machine learning-based dynamic baselines for anomaly detection that don't require user intervention, along with finer-grained root cause analysis, automated remediation capabilities and deeper integration with Cisco's ACI network virtualization tools.
The site reliability engineering (SRE) team in Alaska Airlines' e-commerce division, which handles the company's web applications, had already used AppDynamics App iQ, Real User Monitoring, and Business iQ to reduce outages and cut the mean time to incident detection. AppDynamics automatically correlated IT monitoring notifications into critical alerts and used admin-set health rules to trigger some low-level tasks, such as restarting server nodes that had been in an unhealthy state for longer than 10 minutes.
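As a rough illustration of what an admin-set health rule of this kind does, the sketch below selects nodes that have been unhealthy for longer than 10 minutes. It is not AppDynamics code or configuration: the `Node` class, the `nodes_to_restart` function and the sample fleet are hypothetical names invented for this example, and only the 10-minute threshold comes from the team's description.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

# Threshold from the article: nodes unhealthy for longer than 10 minutes.
UNHEALTHY_LIMIT = timedelta(minutes=10)


@dataclass
class Node:
    """Hypothetical stand-in for a monitored server node."""
    name: str
    unhealthy_since: Optional[datetime] = None  # None means the node is healthy


def nodes_to_restart(nodes: List[Node], now: datetime) -> List[str]:
    """Return the names of nodes that have exceeded the unhealthy limit."""
    return [
        node.name
        for node in nodes
        if node.unhealthy_since is not None
        and now - node.unhealthy_since > UNHEALTHY_LIMIT
    ]


# Sample data, invented for illustration.
now = datetime(2020, 1, 1, 12, 0)
fleet = [
    Node("web-1"),                                              # healthy
    Node("web-2", unhealthy_since=now - timedelta(minutes=3)),  # too recent
    Node("web-3", unhealthy_since=now - timedelta(minutes=15)), # over the limit
]
print(nodes_to_restart(fleet, now))
```

In this sample, only `web-3` qualifies for a restart. A rule like this handles the immediate restart and leaves the question of why the node became unhealthy for later.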
"It gives us breathing time where we go back the next day and say, 'Hmm. What was the real cause behind this?' And still protect our [customers] so they don't experience a major outage," said Nemo Hajiyusuf, software engineering manager at Alaska, who oversees the e-commerce division's SRE and platform engineering teams. "But it still requires some manual process, as well as really figuring out what's the impact to the customer."
Cognition Engine, by contrast, will let the airline's SREs take their hands off the controls earlier in the process. As soon as AppDynamics agents are deployed, the AIOps tool can automatically learn normal traffic patterns, detect anomalies, pinpoint their root cause and impact to customers, and automate incident resolution more broadly across multiple layers of applications, servers and networks.
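The idea of a baseline that is learned from traffic rather than set by an admin can be sketched with a simple rolling statistic. The example below is not Cognition Engine's algorithm, which AppDynamics does not detail here. It swaps in a basic rolling mean and standard deviation check, and the `RollingBaseline` class, its parameters and the sample response times are all assumptions made for illustration.

```python
from collections import deque
from statistics import mean, stdev


class RollingBaseline:
    """Hypothetical baseline learned from recent samples (not a vendor API)."""

    def __init__(self, window: int = 60, threshold: float = 3.0, min_samples: int = 10):
        self.values = deque(maxlen=window)  # most recent normal samples
        self.threshold = threshold          # allowed deviation, in standard deviations
        self.min_samples = min_samples      # samples needed before flagging anything

    def observe(self, value: float) -> bool:
        """Return True if value deviates from the learned baseline."""
        anomalous = False
        if len(self.values) >= self.min_samples:
            mu = mean(self.values)
            sigma = stdev(self.values)
            if sigma > 0 and abs(value - mu) > self.threshold * sigma:
                anomalous = True
        # Learn only from normal samples so one spike does not skew the baseline.
        if not anomalous:
            self.values.append(value)
        return anomalous


# Response times in milliseconds, invented for illustration.
samples = [100, 102, 98, 101, 99, 100, 103, 97, 100, 101, 250, 100]
baseline = RollingBaseline()
flags = [baseline.observe(s) for s in samples]
print(flags)
```

With these samples, only the 250 ms reading is flagged, because it falls far outside the range learned from the first ten readings. A production tool would also need to account for seasonality, such as daily or weekly traffic cycles, which this sketch ignores.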
AIOps frees IT ops pros to develop as SREs
While AIOps hype has percolated for years, industry experts now believe enterprises have matured to the point where DevOps teams can focus on code quality, as well as velocity, which will prompt more production use of AIOps automation tools. Gartner predicts the AIOps market will jump to more than $9.2 billion by 2025, with particularly strong growth among large enterprises. AppDynamics competitors from Dynatrace and New Relic to GitLab have broadened AIOps and observability features in response to this trend.
Hajiyusuf agreed that an emphasis on reliability is on the rise in general, but said the nature of Alaska's business has required that focus from the beginning of its Agile and DevOps transformations.
"One of the major outcomes of having dedicated SRE and DevOps teams is allowing product teams to move fast, but make sure we balance that with not breaking too many things," she said.
Even before it deployed Cognition Engine, Alaska had used the time freed up by the automation of routine incident response tasks to refocus its SRE team on proactive tasks, such as detecting issues before they caused an outage or consulting with product teams.
"We were able to reduce the number of outages by 60% [in 2017], and we continue to sustain that," Hajiyusuf said. "Our mean time to detection went from hours, where customer care was calling us to say customers were having issues, to less than 10 minutes, so a majority of outages that happened in 2018 and 2019 were triggered by the site reliability engineering team, and a lot of times we mitigated those issues before our [customers] even knew it."
From there, the team began to experiment with chaos engineering in test and development environments to proactively identify reliability gaps. It also created a reliability checklist for application development teams.
"There was this question from our product group, 'What does reliability mean? And how do I know my system is reliable?'" Hajiyusuf recalled. "[SREs] put together a set of checklists, guidelines to basic reliability that became part of the definition of 'done'."
Alaska's AIOps expansion gains altitude
Hajiyusuf doesn't necessarily buy in to the NoOps concept, in which infrastructure operates without any IT oversight, but predicted that the SRE role will change again as Alaska rolls out Cognition Engine.
"If we get to a point where automated remediation is a real thing that we can trust, maybe SREs could be integrated within the product group, or maybe they could create site reliability engineering classes [for other engineers]," she said.
That will take time, however, as the AIOps tool set matures and Alaska's admins learn to use it. The company must establish a strong connection between the AppDynamics APM and user experience management tools it already knows and the less-familiar AppDynamics Infrastructure Visibility and Network Visibility tools. The airline will also explore relatively new AppDynamics support for cloud services such as Azure Functions.
"AppDynamics is giving us [part of] that holistic view because of their business transaction concept where you monitor the entire journey of user experience," Hajiyusuf said. "But there are some gaps on the cloud side where the licensing [model] and consumption [pattern] become difficult sometimes."