Create an IT support process to take on any outage

Every IT incident must be met with a streamlined resolution procedure, all the way from critical to trivial. Ensure those plans are in place before it's do or die.

IT services and applications won't run 24/7, even with the best technology. Software flaws and unexpected problems in systems, storage and networks can disrupt workloads. When issues occur, IT support must bring order -- and transparency -- to the chaos.

The IT support process generally follows a series of common steps to track and resolve incidents: detect, assign, assess, escalate, delegate, review and resolve. Then, a post-mortem can take place to plan improvements and long-term fixes that reduce incident recurrence.

IT support process from start to finish -- and beyond

Issue detection. Detection can come from a user report or an automatically initiated alert from a systems monitoring tool. The issue management system creates a new ticket, assigning severity, fault and type criteria. The severity of an issue is typically minor, major or critical. A minor issue is little more than an inconvenience: Affected users have a suitable workaround, or they find the level of performance degradation acceptable. A major incident has significant impact: Important services work poorly or not at all for some or all users. A critical issue has extreme impact on business operations, such as one that disrupts vital services to all customers, results in data loss or carries the risk of data breach.

Once an issue is created, it includes status information, such as when it started, the severity and where it is in the queue of issues for support staff to address: pending, unassigned, assigned, working or fixed. Status details help to prioritize incidents and requests and justify escalations.
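The severity levels and queue states described above can be sketched as a small data model. This is a minimal illustration, not the schema of any particular issue management product: the Ticket class, the classify rules and the queue_order helper are hypothetical names, and the three yes/no inputs are a simplification of the criteria in the text.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class Severity(Enum):
    MINOR = 1
    MAJOR = 2
    CRITICAL = 3


class Status(Enum):
    PENDING = "pending"
    UNASSIGNED = "unassigned"
    ASSIGNED = "assigned"
    WORKING = "working"
    FIXED = "fixed"


@dataclass
class Ticket:
    summary: str
    severity: Severity
    status: Status = Status.PENDING
    # When the issue started; defaults to the moment the ticket is created.
    started_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def classify(workaround_available: bool,
             all_customers_affected: bool,
             data_at_risk: bool) -> Severity:
    """Assumed rules: data loss or breach risk, or a vital service down for
    all customers, is critical; a suitable workaround makes it minor;
    everything else is major."""
    if data_at_risk or all_customers_affected:
        return Severity.CRITICAL
    if workaround_available:
        return Severity.MINOR
    return Severity.MAJOR


def queue_order(tickets):
    """Most severe first; within a severity level, oldest first."""
    return sorted(tickets, key=lambda t: (-t.severity.value, t.started_at))
```

Sorting on severity first and start time second is one plausible way to use status details to prioritize the queue; a real deployment would likely weigh more factors.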

Support assignment. Criteria assigned to an issue help to determine who should work on it. Junior staff can handle simple requests, while specialized requests go to the administrator with expertise in that area. Severe issues bring in more, and more highly skilled, staff. The support system might add a manager to the incident to oversee its resolution.
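A routing rule along these lines might look like the following sketch. The role names, the EXPERTS table and the exact thresholds for adding senior staff or a manager are assumptions made for illustration only.

```python
from typing import List, Optional

# Hypothetical mapping from a specialty area to the administrator who owns it.
EXPERTS = {"storage": "storage-admin", "network": "network-admin"}


def assign(severity: str, area: Optional[str] = None) -> List[str]:
    """Return the responders for an issue, given its severity and area."""
    assignees = []
    if area in EXPERTS:
        # Specialized requests go to the administrator with that expertise.
        assignees.append(EXPERTS[area])
    if severity in ("major", "critical"):
        # Severe issues bring in more highly skilled staff.
        assignees.append("senior-oncall")
    if severity == "critical":
        # Assumption: only critical incidents get a manager to oversee them.
        assignees.append("incident-manager")
    if not assignees:
        # Simple requests with no specialty stay with junior staff.
        assignees.append("junior-support")
    return assignees
```

For example, assign("minor") returns ["junior-support"], while assign("critical", "network") returns ["network-admin", "senior-oncall", "incident-manager"].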

Assessment. Team members assigned to the issue assess its impact, the customers affected, when the incident started, whether there are related tickets and other relevant factors, such as data loss. Based on the assessment, the team decides how to approach and resolve the issue. For example, if an issue started after a software version update, the resolution team might roll back the version and prepare a patch. Usually, the assessment also yields information that can be shared at large as progress details.

Progress information covers the diverse steps IT teams take to resolve an issue and can inform next steps. An issue management tool should maintain a log of progress information throughout the IT support process. While progress logs are vital to the support team, streamlined versions can also go out to the application's user base.

Escalation and delegation. If the initial approach does not resolve the issue, the issue typically requires escalation. For example, a rule might escalate any high-severity issue that has been open for more than an hour.
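A time-based escalation rule of this kind is simple to express in code. In the sketch below, the one-hour limit for critical issues comes from the example above; the four-hour limit for major issues and the decision never to auto-escalate minor ones are assumptions, as are the function and variable names.

```python
from datetime import datetime, timedelta

# Maximum time an issue may stay open before escalation, by severity.
# Severities missing from this table (e.g. "minor") never auto-escalate.
ESCALATE_AFTER = {
    "critical": timedelta(hours=1),
    "major": timedelta(hours=4),  # assumed value, not from the article
}


def needs_escalation(severity: str, opened_at: datetime, now: datetime,
                     resolved: bool = False) -> bool:
    """True if an unresolved issue has been open longer than its limit."""
    limit = ESCALATE_AFTER.get(severity)
    if resolved or limit is None:
        return False
    return now - opened_at > limit
```

An issue management tool would typically evaluate a rule like this on a schedule and notify additional staff when it returns True.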

More staff members work on the issue and join existing communication channels about it. There is often a reassessment of the issue and its circumstances before the team attempts more corrective actions. In a major incident, the responding staff can break up into various roles to keep communication, remediation, management and other tasks organized.

Issue resolution is typically an iterative process, requiring several tests and steps to attempt a fix.

Review. Managers should review how well the assigned team communicates, makes decisions and approaches changes during the IT support process. Also, track how long teams work on each type of issue. For example, if a review shows that the team spent an hour on a problem and exhausted multiple avenues of investigation, consider further escalation or more drastic remediation, such as a regional failover.

Resolution vs. fix. Issue resolution generally means that the IT support team alleviated the immediate problem for the customers or the business. Resolution ends the emergency, but additional cleanup tasks might occur to implement a stable, long-term fix. Upon resolution, the support team normally informs affected users that operations are back to normal, and those follow-on tasks are completed at a lower priority level.

Time to recovery (TTR) is usually reported as the time elapsed between initial detection and the resolution. Mean TTR, tracked as MTTR, is often a vital business key performance indicator.
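As a worked illustration of that definition, MTTR is the average of the per-incident recovery times. The function name and the (detected, resolved) pair format below are assumptions chosen for the example.

```python
from datetime import datetime, timedelta
from typing import Iterable, Tuple


def mean_time_to_recovery(
    incidents: Iterable[Tuple[datetime, datetime]]
) -> timedelta:
    """Average of (resolved - detected) over a set of incidents."""
    durations = [resolved - detected for detected, resolved in incidents]
    if not durations:
        raise ValueError("at least one resolved incident is required")
    return sum(durations, timedelta()) / len(durations)


# Two incidents: one resolved in 30 minutes, one in 90 minutes.
history = [
    (datetime(2024, 3, 1, 9, 0), datetime(2024, 3, 1, 9, 30)),
    (datetime(2024, 3, 8, 14, 0), datetime(2024, 3, 8, 15, 30)),
]
print(mean_time_to_recovery(history))  # prints 1:00:00
```

Because the clock starts at detection rather than at the moment the fault began, faster detection does not shorten this figure on its own; teams that care about total user impact may track detection delay separately.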

IT incident post-mortems. The IT support process for an incident provides opportunities to learn and make improvements.

A post-mortem is a review process wherein the IT support team and perhaps relevant members from the larger organization consider the underlying causes of an issue, look for patterns and evaluate changes that could reduce its recurrence.

Issue management platforms can initiate post-mortems as a follow-up action to a major or critical issue. The tool supplies a detailed log of the incident response timeline and actions/results for review.

Post-mortems focus on root causes rather than proximate causes. Proximate causes are the reasons or triggers that started the issue. A root cause is the central fault that, if corrected, could prevent all such incidents. For example, an application throws an error because its volume runs out of storage. The full volume is the proximate cause of the error, but the root cause is a lack of monitoring of logical unit number (LUN) usage and remaining capacity. The post-mortem evaluation might result in new storage monitoring that triggers an alert when the LUN hits 85% full. With that fix in place, administrators can add storage before an application error ever occurs. Similarly, a post-mortem could inform a decision to upgrade systems or software.
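The 85% threshold check could be sketched as follows. This version inspects a mounted file system path with Python's standard shutil.disk_usage call rather than querying a LUN directly, which would depend on the storage vendor's own tooling; the function names and the alert message format are hypothetical.

```python
import shutil
from typing import Optional


def percent_full(used: int, total: int) -> float:
    """Share of capacity in use, as a percentage."""
    if total == 0:
        return 0.0
    return 100.0 * used / total


def check_volume(path: str, threshold: float = 85.0) -> Optional[str]:
    """Return an alert message if the volume holding `path` is at or above
    the threshold, otherwise None."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    pct = percent_full(usage.used, usage.total)
    if pct >= threshold:
        return f"ALERT: {path} is {pct:.1f}% full (threshold {threshold:.0f}%)"
    return None
```

In practice, a monitoring tool would run a check like this on a schedule and route any message it returns to the alert management system.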

A post-mortem process can yield process improvements for IT support as well. For example, upon review, the team determines that it should escalate a certain type of issue sooner.

IT support tools

It takes well-organized information and collaboration to maintain a 24/7 year-round enterprise IT deployment. The IT support process relies on a suite of tools: monitoring, alert reporting, issue or incident management, and communication. Use monitoring to find an issue; IT issue management tools to classify, organize and report on it; and communication products to keep support staff and users up to date on the progress made.

Monitoring and alert tools discover and report problems via logs, agent-based information or other input; problems they miss surface through user reports. Monitoring options span various categories: Datadog, New Relic, Nagios and Splunk are a few. Alert management tool options include PagerDuty and xMatters.

IT issue management platforms organize the remediation efforts of developers, operations staff and business leaders. They also offer a way to communicate to the affected software users. Popular tools to track and manage response to issues include Atlassian Jira Ops, Bugzilla and nTask. An increasing number of issue management tools are available as hosted services from the vendor rather than on-premises deployments. By running the product on an external cloud, the IT organization can keep it safe from outages that affect internal infrastructure.

These tools might rely on a ChatOps connection to products such as Slack for rapid IT support team communication.

Talk it out

Communication is critical in any IT support process, within the team and externally to affected users.

Text chat, video chat, shared notes and documents, and other media keep team members in touch and working collaboratively toward the most productive outcome. If ideas and results aren't shared quickly, teams can see duplicated effort and long MTTRs.

External communication is also vital to the user community. Users expect the business to inform them of an issue and provide timely updates on its resolution. This can keep down the number of duplicate tickets sent in by different users and inspire confidence in the support team and business as a whole. Generally, filter progress information for internal versus external communications. Users do not need to see the detailed progress notes generated within the issue management system. In fact, that information could include IP addresses, software stack configurations and other sensitive materials best kept confidential. User-facing communication is often supported through integrations with specialized communication tools, such as Atlassian Statuspage, or even an intranet or email update.
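One way to filter progress notes for an external audience is to publish only entries explicitly marked as public and to mask anything that looks like an IP address. The sketch below assumes a hypothetical log format, a list of dictionaries with "text" and "internal" keys, and masks IPv4 addresses only; real status updates would usually get a human review as well.

```python
import re
from typing import Dict, List

# Matches dotted-quad IPv4 addresses such as 10.0.0.5.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def to_public_updates(entries: List[Dict]) -> List[str]:
    """Keep only entries marked as not internal, with IP addresses masked.
    Entries without an 'internal' flag are treated as internal."""
    public = []
    for entry in entries:
        if entry.get("internal", True):
            continue
        public.append(IPV4.sub("[redacted]", entry["text"]))
    return public


log = [
    {"text": "Restarted db-primary at 10.0.0.5", "internal": True},
    {"text": "Service restored after failover to 10.0.0.9", "internal": False},
]
print(to_public_updates(log))
# prints ['Service restored after failover to [redacted]']
```

Treating unmarked entries as internal is a deliberate default: a note that was never classified stays private rather than leaking by accident.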
