No IT environment is perfect. Issues can range from simple problems, such as a server running low on disk space that causes an application to stop responding, to more complex intermittent issues, such as a finance system that runs poorly at the end of each month when the accounting department prints a year's worth of invoices.
An IT operations administrator can't predict every issue, and neither can an automated IT incident management system.
Systems and operations are hard to look after. They're hard to implement, manage and troubleshoot. Environments change constantly, and ops admins must set up monitoring and change management for all seven layers of the Open Systems Interconnection (OSI) model -- eight, if you include users. Every environment is unique and, in turn, imperfect.
Where do you draw the line between employing IT gurus who manually maintain and fix the environment and investing in automated IT incident management systems that report or even remediate issues? Automation and internal IT knowledge must coexist for the best chance at a highly operational environment.
Where to draw that line is for each company to work out individually. For the best results, have expert IT staff rely on the systems that show environmental health, rather than act as eternal emergency repair personnel. Humans are fallible, and you are also unlikely to have multiple staff members with the exact same knowledge across all systems inside the company.
Incident management options
A properly configured IT incident management system uses monitoring tools to pick up an issue before a human does. For example, if a remote site's WAN link goes down, it might go undetected until an end user complains. However, a monitoring tool that tracks the availability of any device at the other end of the WAN link -- or even an IP address of the router that provides the WAN link -- will find an anomaly quickly. The IT team can use monitoring settings to trigger an event, such as sending an email alert to the whole IT team. The IT experts then determine the cause and communicate the problem to users. Acting on an alert from an automated IT incident management system requires less technical and environmental knowledge from the first-line support team than troubleshooting an issue brought up by a user, especially when the user's description of the issue is unclear.
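As a rough illustration of the WAN link example, the sketch below polls a remote target and raises an alert once it has failed a set number of consecutive checks. It is a minimal sketch under stated assumptions, not how any particular monitoring product works: the `LinkMonitor` class, the `tcp_probe` helper, the failure threshold and the port are all hypothetical choices made for the example.

```python
import socket
from dataclasses import dataclass, field
from typing import Callable, Dict


def tcp_probe(host: str, port: int = 443, timeout: float = 2.0) -> bool:
    """Hypothetical probe: True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


@dataclass
class LinkMonitor:
    """Tracks reachability of remote targets and alerts after repeated failures."""

    probe: Callable[[str], bool]      # returns True if the target responds
    alert: Callable[[str], None]      # e.g. send an email to the IT team
    failure_threshold: int = 3        # assumed: consecutive failures before alerting
    _failures: Dict[str, int] = field(default_factory=dict)

    def check(self, target: str) -> None:
        if self.probe(target):
            # Target is reachable: reset its failure counter.
            self._failures[target] = 0
            return
        count = self._failures.get(target, 0) + 1
        self._failures[target] = count
        # Alert once, at the moment the threshold is first reached,
        # so a long outage does not flood the team with email.
        if count == self.failure_threshold:
            self.alert(f"{target} unreachable for {count} consecutive checks")
```

In practice, a scheduler would call `check` for the remote router's address every minute or so, with `alert` wired to whatever notification channel the team already uses. Requiring several consecutive failures is one simple way to avoid alerting on a single dropped packet.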
IT incident management systems are becoming more flexible and powerful. For example, Microsoft Azure's Operations Management Suite uses basic functions, such as centralized logging, along with advanced features, such as Service Map, which automatically discovers and builds a dependency reference map of servers, processes and third-party services.
IT incident management systems that incorporate tools like Service Map, application dependency mapping and other features take the work of addressing issues out of the hands of the internal IT expert, who would otherwise have to remember every server by IP address, name and disk capacity. Instead, an ops admin can follow standard instructions to set up monitoring and incident management and visualize how specific servers and services interact. If this work is done at the time of system build, complex systems can be self-documenting, recording and showing the connectivity and requirements of all moving parts. The result is that, when something breaks, you can quickly see inside the system and easily discover the point of failure.
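To make the value of a dependency map concrete, here is a minimal sketch of how such a map can narrow down the point of failure. It does not reflect how Service Map or any other product is implemented; the server names and both function names are hypothetical, and the map is written by hand here rather than discovered automatically.

```python
from typing import Dict, List, Set

# Hypothetical dependency map: each entry lists what that service depends on.
DEPENDS_ON: Dict[str, List[str]] = {
    "finance-web": ["finance-app"],
    "finance-app": ["sql-01", "print-server"],
    "sql-01": ["san-01"],
    "print-server": [],
    "san-01": [],
}


def likely_points_of_failure(depends_on: Dict[str, List[str]],
                             unhealthy: Set[str]) -> List[str]:
    """Return unhealthy nodes whose own dependencies are all healthy.

    Nodes that are unhealthy only because something beneath them is
    broken are filtered out, leaving the probable root causes.
    """
    return sorted(
        node for node in unhealthy
        if not any(dep in unhealthy for dep in depends_on.get(node, []))
    )


def impacted_by(depends_on: Dict[str, List[str]], failed: str) -> List[str]:
    """Return every node that depends, directly or indirectly, on `failed`."""
    # Invert the map so each node points at the nodes that rely on it.
    dependents: Dict[str, List[str]] = {}
    for node, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(node)

    seen: Set[str] = set()
    stack = [failed]
    while stack:
        current = stack.pop()
        for parent in dependents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return sorted(seen)
```

With this map, if alerts fire for `finance-web`, `finance-app`, `sql-01` and `san-01` at once, `likely_points_of_failure` points at `san-01`, and `impacted_by(DEPENDS_ON, "san-01")` lists the three services sitting on top of it.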
Some large companies place heavy emphasis on automation, including for their IT incident management systems. "Automate all the things" is a popular IT catchphrase, but full automation can make little sense for resource management, depending on what the tasks are. Advanced automated IT incident response can lead to a self-healing infrastructure, but that's beyond the reality for most organizations. Automation must start with the most basic processes and build up from there to have any hope of reaching a fluid and functional state; for now, the promise of end-to-end automation remains a dream out in the ether.
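Starting with the most basic processes might look something like the sketch below, which covers the low-disk-space scenario: warn at one threshold, run a simple cleanup at a higher one, and always tell a human what happened. The thresholds and the function names are assumptions made for illustration, and the `remediate` callback stands in for whatever cleanup step an organization trusts enough to automate.

```python
import shutil
from typing import Callable


def percent_used(path: str) -> float:
    """Return how full the filesystem holding `path` is, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def handle_disk_usage(
    used_pct: float,
    notify: Callable[[str], None],
    remediate: Callable[[], None],
    warn_at: float = 80.0,        # assumed warning threshold
    remediate_at: float = 90.0,   # assumed threshold for automatic cleanup
) -> str:
    """Apply a basic escalation rule and report which action was taken."""
    if used_pct >= remediate_at:
        # Run the automated fix, but still notify so an admin can verify it.
        remediate()
        notify(f"Disk at {used_pct:.0f}%: cleanup ran, please verify")
        return "remediated"
    if used_pct >= warn_at:
        notify(f"Disk at {used_pct:.0f}%: approaching capacity")
        return "warned"
    return "ok"
```

A scheduled job could call `handle_disk_usage(percent_used("/var"), notify=..., remediate=...)`. Keeping the notification in the remediation path means the automation supports the admin's knowledge of the environment rather than hiding problems from it.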
Another way to approach IT incidents is to create them yourself. Netflix developed a chaos engineering program, Chaos Monkey, and sister tools collectively called the Simian Army, which test system resiliency by purposefully breaking processes or disrupting services. More conservative organizations can experiment with chaos engineering in small doses or in staging environments, rather than take down production systems.
Ultimately, IT operations admins exist to help the rest of the business do its job. Automated IT monitoring and incident management systems that are quick to deploy and easy to modify make this task easier. The right tooling will make a difference. If the effort to set up monitoring and remediation processes seems too cumbersome, maintenance will only become more so when future system changes occur; at the current rate of change in IT, that will be a continual job. Even with a successful IT incident management system with built-in automation, admins still need to understand their environments. Combine that knowledge with automation, and they'll do less manual troubleshooting and have a better focus on where to seek out issues.