Artificial intelligence is a much-misused term. Many so-called AI systems on the market are little more than fairly simple rules-based engines. When it comes to AI in IT operations, the application of an if-this-then-that methodology does not really denote the presence and use of AI.
To gain real insight into AIOps use cases and how artificial intelligence should function, it is better to look at examples of how tasks have been dealt with historically -- and then consider how AI might assume those duties. This involves a comparison between AI and the thing it is meant to replace or augment: the human brain. Humans will usually notice when something is wrong within a normal pattern. For example, we can spot where a single pixel is dead on an HD or 4K screen. Across a full IT platform, however, this ability falls short. There are too many variables for humans to keep an eye on at one time.
As such, enterprise IT monitoring systems aggregate system logs and carry out basic pattern matching to see if everything is OK. When these systems see something amiss, they flag the problem via some mechanism -- often a traffic-light notification on the sys admin's screen -- so that a person can step in to take appropriate action.
When rules-based engines are applied to such systems, common problems can be codified. A standard automated response then rectifies the problem. Although this has helped in the operation of complex platforms, it still leaves a lot to be desired. This is where AI should come to the fore.
At the most basic level, we have three approaches for how AI can be applied:
- advanced if-this logic
- simple intelligence
- advanced intelligence
Let's explore how these approaches to AI in IT operations might play out in hypothetical scenarios.
AIOps and if-this statements
Imagine that you are presented with a problem that you have encountered before. Your brain applies a simplistic approach: "If I have come up against this before, and I fixed it successfully, then I may as well apply the same solution again." This is essentially a rules-based engine approach. It works, mostly.
AIOps systems can easily apply such logic. All they need is a set of known problems and solutions provided out of the box, plus the capability to learn as they go, adding new known issues specific to the environment. Examples would include the application of a patch to a system or the allocation of extra resources to meet a workload's needs.
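The lookup-and-learn behavior described above can be sketched in a few lines. This is a minimal, hypothetical illustration -- the issue names, fix actions and function names are all invented, not taken from any real AIOps product:

```python
# A minimal sketch of a rules-based remediation engine: known issues
# map directly to known fixes, and newly proven fixes are recorded.
# All names here (KNOWN_FIXES, handle_alert, etc.) are illustrative.

KNOWN_FIXES = {
    "disk_nearly_full": "expand_volume",
    "service_unresponsive": "restart_service",
    "high_memory_pressure": "allocate_extra_memory",
}

def handle_alert(issue: str) -> str:
    """Return the remediation for a known issue, or escalate."""
    fix = KNOWN_FIXES.get(issue)
    if fix is not None:
        return fix                 # if-this-then-that: apply the known fix
    return "escalate_to_admin"     # unknown issue: a human steps in

def learn_fix(issue: str, fix: str) -> None:
    """'Learn as it goes': record a newly successful fix for reuse."""
    KNOWN_FIXES[issue] = fix
```

The point is how little intelligence is involved: the engine only ever replays remedies it has been given, which is exactly why this level alone falls short.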
The application of simple intelligence
What if the simplicity described above doesn't work? For example, what if the patch is applied but doesn't fix the problem? The human brain would think, "If only we could go back to where we were, with a working system."
An AIOps system should be able to create a restore point before it applies any change; that way, in the event of trouble, the system would know to fall back to that restore point. Even better, AIOps should try to identify the reason a patch failed. Was it due to something that can possibly (or probably) be fixed? What are the odds that the fix will work? How long will it take -- and what will be the effect on the operation of the workload? Can the system evaluate this scenario -- not just in technical terms, but also in a way that considers which responses would minimize the financial hit the business might suffer? These types of questions and answers require real AI capabilities.
The application of advanced intelligence
What happens when AIOps comes up against a problem it has not seen before? The human mind will consider a range of options, from the fight-or-flight response against an unknown threat to the let's-think-about-this reaction to a less-threatening problem.
In this AIOps use case, the AI needs to be able to work through the problem in its own way. A malicious attack against a platform, for example, would be a fight-or-flight situation. Should a DDoS attack be blocked completely (flight), or can the workload be redeployed elsewhere to minimize impact (fight)? Is an intrusion attack something that needs to be blocked, or could it be a false positive caused by someone trying to access the system from an unexpected location or device?
For the let's-think-about-this case, AIOps isn't quite ready. Let's assume that the platform shows signs that something is wrong, but the available data does not point toward a clear root cause. People would try to think beyond their direct experiences. Maybe they have come up against something similar elsewhere that could spark ideas. If not, they might ask others for guidance or suggestions.
Similarly, an AIOps system would look at its installed rules and interrogate its platform-specific issue database. When there is nothing there that's able to address the problem at hand, it could then go to the cloud and see if any other platform with the same AIOps system has seen something similar. In much the same way that humans need to be able to describe a problem in a meaningful way -- "it's not working" isn't helpful -- tools that deploy AI in IT operations need a standardized taxonomy to offer a description of the problem so that the AI can get back meaningful responses from other possible resources.
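To make the taxonomy idea concrete, here is a sketch of what a standardized problem description might look like, and how a shared knowledge service could match incoming reports against known cases. The field names and matching rule are invented for illustration; no real AIOps product is implied:

```python
# A structured problem description beats "it's not working": fixed
# fields let one platform's report be matched against another's.
def describe_problem(component, symptom, severity, signals):
    return {
        "component": component,      # e.g. "database", "load-balancer"
        "symptom": symptom,          # e.g. "latency-degradation"
        "severity": severity,        # e.g. "high"
        "signals": sorted(signals),  # observed metrics or log patterns
    }

def matches(description, known_case) -> bool:
    """Two reports match when component and symptom line up and the
    known case's signals are a subset of what we observed locally."""
    return (description["component"] == known_case["component"]
            and description["symptom"] == known_case["symptom"]
            and set(known_case["signals"]) <= set(description["signals"]))
```

With a shared vocabulary like this, "the database is slow and its cache hit rate collapsed" becomes a query other platforms' AIOps systems can actually answer.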
The AI engine must then be able to take the information and analyze the problem again, coming up with possible solutions and the probabilities that those options would work. Any possible solution must be weighed against the potential risks the business would face should that option fail. Where the business risk is too high, the AIOps system must revert to the old standard of alerting admins, who could then apply human intelligence to the problem.
We are still at the early stages of real AI in IT operations, but advancements will be rapid. We should expect to see these technologies mature significantly in the coming years, and adoption will increase in pace with those improvements. Just make sure that any system chosen now can grow with your needs -- and that you understand the capabilities and challenges of AIOps before you fully commit.