There's a problem in the data center, and you've only just found out about it because users contacted the help desk. You send system administrators and engineers to investigate as quickly as possible and fix it. Then, you sit there, wondering what the support team actually did, how their actions might have changed the platform, what those changes mean going forward and whether the changes actually resolved the underlying cause or just masked it.
This approach feels antiquated in an era of DevOps, cloud and process orchestration and should be replaced with incident management automation. Monitoring software automatically picks up that data center problem before users notice performance degradation. Incident management software relies on AI to identify a root cause and then fix the issue -- without any humans involved.
Complete incident management automation is still somewhat of a pipe dream. However, systems management has grown from a means of reporting problems so that they can be fixed into largely automated systems in which only the worst problems require human intervention. Software automatically creates and routes incident tickets to the appropriate support team, and workloads can move off affected systems onto alternative resources without users being any the wiser.
With automation replacing ad hoc incident management steps, technical IT operations staff can focus time and budget on business initiatives. A company that uses report-and-respond systems management spends between 70% and 80% of its IT budget to keep the platform running, leaving only 20% to 30% for improvements. Shift just 10 percentage points of the IT budget away from time-intensive incident management and remediation and into improvements, and it yields a 33% to 50% increase in value-added work.
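The arithmetic behind that range can be checked in a few lines. This is only a sketch: the function name `value_added_gain` is made up for illustration, and it assumes the shift is 10 percentage points of the total IT budget.

```python
def value_added_gain(improvement_share: float, shifted: float = 0.10) -> float:
    """Relative growth of the improvement budget when `shifted` (a fraction
    of the total IT budget) moves from keep-the-lights-on work to improvements."""
    return shifted / improvement_share


# The two ends of the range quoted above: 20% and 30% spent on improvements.
for share in (0.20, 0.30):
    gain = value_added_gain(share)
    print(f"improvement share {share:.0%}: value-added work grows by {gain:.0%}")
```

The smaller the improvement budget is to begin with, the larger the relative gain from the same shift.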
Incident management automation tools
Many tool vendors claim to offer self-healing capabilities in their products. Buyers must be careful and look for the capabilities that actually save time, resolve issues and increase long-term reliability of the IT deployment.
Monitoring. A monitoring tool or integrated set of tools must pull operations data from across a modern, hybrid platform, including the on-premises hardware and software, as well as public clouds offering platform and SaaS options.
Root cause analysis. Automated incident management systems should understand the interactions between the different aspects of a modern platform to comprehend where a problem really resides, not just where the problem's effects are felt. Remediation requires a level of understanding about the interdependent components of an IT deployment.
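As a rough illustration of separating where a problem resides from where its effects are felt, the sketch below walks a dependency map. The component names and the `probable_root_causes` function are hypothetical, and real tools use far richer models; this version only looks at direct dependencies.

```python
from typing import Dict, List, Set


def probable_root_causes(depends_on: Dict[str, List[str]], alerting: Set[str]) -> Set[str]:
    """Return alerting components whose own direct dependencies are all healthy.

    A component that alerts while something it depends on is also alerting
    is treated as a symptom rather than a cause.
    """
    roots = set()
    for component in alerting:
        upstream = depends_on.get(component, [])
        if not any(dep in alerting for dep in upstream):
            roots.add(component)
    return roots


# Hypothetical platform: the web tier calls the API, which calls the database.
depends_on = {"web": ["api"], "api": ["db"], "db": ["storage"], "storage": []}
alerting = {"web", "api", "db"}
print(probable_root_causes(depends_on, alerting))
```

Here, users feel the problem in the web tier, but the database is the only alerting component with nothing unhealthy beneath it.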
Auditing. Incident control systems that fully document the response and remediation actions as they progress enable audits and deeper analysis for capacity and performance optimization. Even if incident management tools can remediate the problem without any human intervention, they must log what the problem was, the root cause, what steps were taken to remediate the issue and any other details, either within their own system or an allied system, such as help desk software.
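A minimal sketch of what such a log entry might hold follows. The `IncidentAuditRecord` class and its field names are assumptions for illustration, not the schema of any particular product.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import List


def _utc_now() -> str:
    """Timestamp in ISO 8601 format, in UTC."""
    return datetime.now(timezone.utc).isoformat()


@dataclass
class IncidentAuditRecord:
    incident_id: str
    problem: str
    root_cause: str
    automated: bool = True  # False once a human has to step in
    opened_at: str = field(default_factory=_utc_now)
    steps: List[dict] = field(default_factory=list)

    def log_step(self, action: str, outcome: str) -> None:
        """Record one remediation step at the moment it happens."""
        self.steps.append({"at": _utc_now(), "action": action, "outcome": outcome})

    def to_json(self) -> str:
        """Serialize the record so an allied system can store it."""
        return json.dumps(asdict(self), indent=2)


record = IncidentAuditRecord(
    "INC-0001", "API latency above threshold", "Database connection pool exhausted"
)
record.log_step("Restarted connection pool", "Latency back under threshold")
print(record.to_json())
```

Writing the record as JSON is one way to hand it to help desk software; the choice of format here is illustrative.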
Incident remediation automation tools
The biggest and hardest area of incident response is remediation. While automation enables a kind of self-healing infrastructure, AI only knows what it is taught. If a system cannot recognize the issue, humans must step in to help.
Remediation. Some systems identify the root cause and attempt to fix it using an approach based on known-best-attempt solutions. The trouble is that every IT deployment differs from every other one. The tool must fully understand what it is dealing with, from hardware through firmware versions, software patch levels, database linkages, directory services and security systems -- as well as all the interactions between them.
When a tool does identify the correct problem, it also should determine if a fix can be directly applied, whether firmware or software patch levels must be updated first or whether this is too complex and a human must be informed. This decision tree leads to escalation.
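That decision tree can be sketched as a small function. The `Diagnosis` fields and `Action` names below are hypothetical; a real tool would derive them from its own inventory and rules.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    APPLY_FIX = "apply the fix directly"
    UPDATE_THEN_FIX = "update firmware or software patch levels, then apply the fix"
    ESCALATE = "inform a human"


@dataclass
class Diagnosis:
    known_fix: bool             # the tool recognizes the problem and has a fix
    patch_levels_current: bool  # firmware and software meet the fix's prerequisites
    too_complex: bool           # the fix touches more than the tool can safely reason about


def decide(diagnosis: Diagnosis) -> Action:
    """Walk the three branches described above, most cautious first."""
    if not diagnosis.known_fix or diagnosis.too_complex:
        return Action.ESCALATE
    if not diagnosis.patch_levels_current:
        return Action.UPDATE_THEN_FIX
    return Action.APPLY_FIX


print(decide(Diagnosis(known_fix=True, patch_levels_current=False, too_complex=False)).value)
```

Checking the escalation branch first reflects a cautious design choice: when in doubt, the sketch hands off rather than acts.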
Escalation. When a problem goes beyond the tool's capabilities to automatically fix it, then that tool must quickly and effectively escalate the issue -- generally, to a support professional. Escalation must take place in a formalized manner, via help desk software or an in-built alert or messaging feature. The tool should be able to calculate the possible or probable scope and impact radius of the problem based on in-built and user-created rules, assigning it a priority level.
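One way to picture the combination of in-built and user-created rules is the sketch below. The thresholds, the `assign_priority` function and the example rule are all invented for illustration; priority 1 is treated as the most urgent.

```python
from typing import Callable, Iterable, Optional

# A user-created rule looks at the estimated scope and may return a priority.
Rule = Callable[[int, int], Optional[int]]


def assign_priority(affected_users: int, affected_services: int,
                    user_rules: Iterable[Rule] = ()) -> int:
    """Return a priority from 1 (most urgent) to 3 (least urgent)."""
    # In-built rules: hypothetical thresholds on scope and impact radius.
    if affected_users >= 1000 or affected_services >= 5:
        priority = 1
    elif affected_users >= 100 or affected_services >= 2:
        priority = 2
    else:
        priority = 3
    # User-created rules can only raise the urgency, never lower it.
    for rule in user_rules:
        override = rule(affected_users, affected_services)
        if override is not None:
            priority = min(priority, override)
    return priority


def any_multi_service_outage_is_urgent(users: int, services: int) -> Optional[int]:
    """Example user-created rule: two or more affected services is always priority 1."""
    return 1 if services >= 2 else None


print(assign_priority(50, 1))
print(assign_priority(50, 2, [any_multi_service_outage_is_urgent]))
```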
Orchestration. IT organizations must roll out a problem fix swiftly without breaking anything else. Orchestration packages deliver changes to the IT environment and/or application, verify that the changes succeeded and roll back the deployment to its last known-good state if needed.
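The deliver, verify and roll back cycle can be sketched in a few lines. Everything here is a simplification: the environment is a plain dictionary of component versions, and `apply_with_rollback`, `upgrade_web` and the compatibility check are hypothetical stand-ins for what an orchestration package does.

```python
import copy
from typing import Callable, Dict, Tuple

State = Dict[str, str]  # component name -> deployed version


def apply_with_rollback(state: State,
                        change: Callable[[State], State],
                        verify: Callable[[State], bool]) -> Tuple[State, bool]:
    """Apply `change` to a copy of `state`; keep it only if `verify` passes."""
    last_known_good = copy.deepcopy(state)
    candidate = change(copy.deepcopy(state))
    if verify(candidate):
        return candidate, True
    return last_known_good, False  # roll back to the last known-good state


def upgrade_web(state: State) -> State:
    state["web"] = "2.0"
    return state


def web_and_api_compatible(state: State) -> bool:
    """Hypothetical check: web and API must share a major version."""
    return state["web"].split(".")[0] == state["api"].split(".")[0]


current = {"web": "1.4", "api": "1.9"}
result, ok = apply_with_rollback(current, upgrade_web, web_and_api_compatible)
print(ok, result)
```

Because the change is applied to a copy, a failed verification leaves the original state untouched.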
Idempotency is a more advanced orchestration approach in which a change describes the desired end state, so applying it once or many times produces the same outcome no matter what state the underlying platform starts in.
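In practice, an idempotent operation is usually written as "make sure the system is in this state" instead of "perform this step," so running it repeatedly is harmless. A minimal sketch, using a dictionary as a stand-in for a system's configuration and a hypothetical `ensure_setting` function:

```python
from typing import Dict


def ensure_setting(config: Dict[str, int], key: str, desired: int) -> bool:
    """Bring `config[key]` to `desired`. Return True only if a change was made."""
    if config.get(key) == desired:
        return False  # already in the desired state; do nothing
    config[key] = desired
    return True


config = {"max_connections": 100}
first = ensure_setting(config, "max_connections", 500)   # changes the value
second = ensure_setting(config, "max_connections", 500)  # no-op on the second run
print(first, second, config)
```

The second call changes nothing, which is what makes it safe for an orchestration tool to retry or rerun the same fix.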
Workload management. With orchestration and/or idempotency, IT organizations can carry out intelligent workload management. Any workload with an issue that is visible to -- or will soon affect -- users should move to an alternative part of the platform. Because of the complexity of interactions between various IT workloads, migration must be done intelligently. Workload management, across on-premises, colocated and cloud resources, combines many aspects of effective incident management and remediation.
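A very small sketch of one piece of that intelligence, choosing where a workload should go, follows. The host names, the single CPU capacity figure and the `pick_target` function are assumptions; a real tool would also weigh dependencies, data locality, cost and more.

```python
from typing import Dict, Optional, Set


def pick_target(required_cpus: int, free_cpus: Dict[str, int],
                affected: Set[str]) -> Optional[str]:
    """Pick the unaffected host with the most free capacity that fits the workload.

    Returns None when nothing suitable exists, which is a cue to escalate.
    """
    candidates = [
        (free, name)
        for name, free in free_cpus.items()
        if name not in affected and free >= required_cpus
    ]
    if not candidates:
        return None
    return max(candidates)[1]


# Hypothetical hybrid platform: on-premises, colocated and cloud resources.
free_cpus = {"onprem-1": 4, "colo-1": 8, "cloud-1": 16}
print(pick_target(6, free_cpus, affected={"cloud-1"}))
```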
Seek out a single tool or set of integrated tools from one vendor to avoid the risk of problems slipping through the cracks. Few vendors can provide all the above; however, vendors leading the space include HashiCorp, Electric Cloud and Flexiant. Kubernetes, while not yet mature, is increasing its strength as a hybrid cloud orchestration and management system.
The incident management automation market continues to mature. Developments in machine learning and AI will make future automated incident management systems more effective overall. Some systems monitor and manage security issues, as well as technical ones. These tools identify patterns of usage that indicate distributed denials of service, intrusions and other attacks.