IT incident management best practices -- and myths that scare you away

When IT activity inevitably goes awry, admins need to know the best course of action to take to ensure accountability, efficiency and accuracy -- and not get lost in myths.

Dealing with an IT incident is one of the most important aspects of the job. Sure, you'd rather never have incidents in the first place, but we don't live in a perfect world. Businesses judge IT by what happens when there's a problem -- not by how long they go without one.

Avoid three common traps that ultimately damage the IT department's reputation, and follow IT incident management best practices instead. The myths and corresponding best practices below relate to detecting the incident, responding to it and, finally, managing the aftermath.

Detect a problem

Myth: Only communicate and report major issues that users complain about, or IT will look bad because people see lots of things breaking.

Best practice: Record and report any service degradation, and have that information available to whoever needs it, especially decision-makers.

Ideally, IT identifies and remediates a problem or an incident as quickly and accurately as possible to minimize downtime for end users. If you detect an issue before a user calls about it, you can sometimes fix it before anyone actually notices. The IT staff should maintain monitoring tools to track service levels, whether an in-house scripted setup or a purchased tool, such as Microsoft System Center Operations Manager or SolarWinds. Use what works for the given IT deployment, budget and staff.
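As a rough illustration of the in-house scripted option, the sketch below checks whether a service accepts TCP connections and records the result. It assumes that a successful connection is a good enough health signal for the service in question, which is not always true. The names `CheckResult` and `check_tcp` are hypothetical, not part of any monitoring product.

```python
import socket
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckResult:
    host: str
    port: int
    up: bool
    latency_ms: Optional[float]  # None when the connection failed
    checked_at: float            # Unix timestamp of the check


def check_tcp(host: str, port: int, timeout: float = 2.0) -> CheckResult:
    """Try to open a TCP connection and record whether it succeeded."""
    started = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            latency_ms = (time.monotonic() - started) * 1000.0
            return CheckResult(host, port, True, latency_ms, time.time())
    except OSError:
        # Refused, timed out or unreachable: treat all of these as "down".
        return CheckResult(host, port, False, None, time.time())
```

Run on a schedule and stored somewhere durable, results like these become the raw material for outage reporting. A real deployment would likely add application-level checks, since an open port does not prove the service behind it is healthy.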

These monitoring and systems management tools should produce reports of outages over a certain period of time. It's tempting to sweep those issue reports under the rug. Instead, own those outages, and make them public. Following this IT incident management best practice forces the IT team to understand the why and how behind each outage and to find proper solutions to ongoing concerns, rather than quick, temporary fixes.
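If the monitoring tool in use cannot produce such a report, one can be derived from raw outage records. The sketch below is one possible approach, assuming each outage is stored with a service name, start time and end time. `Outage` and `summarize` are hypothetical names chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Outage:
    service: str
    start: datetime
    end: datetime


def summarize(outages, period_start, period_end):
    """Return {service: (outage_count, total_downtime)} for the period."""
    totals = defaultdict(lambda: [0, timedelta()])
    for outage in outages:
        # Clip each outage to the reporting window so that an outage
        # spanning the boundary only counts its in-period portion.
        start = max(outage.start, period_start)
        end = min(outage.end, period_end)
        if end <= start:
            continue  # entirely outside the reporting window
        totals[outage.service][0] += 1
        totals[outage.service][1] += end - start
    return {svc: (count, downtime) for svc, (count, downtime) in totals.items()}
```

Reporting both the count and the total downtime per service helps distinguish one long outage from a pattern of short, recurring ones, which usually call for different fixes.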

If you need more money or resources to address a problem, this historical evidence shows the real impact that an incident, or a trend of incidents, has on the business. Armed with this information and a drive to improve, you're in a better position than an IT organization that hides its problems.

Bring operations back to normal

Myth: Any fix is a good fix.

Best practice: A poorly thought-out quick fix can lead to bigger issues. Understand and agree upon a quick fix if a permanent fix isn't possible.

When the help desk phones are ringing off the hook and management has sent an angry email demanding that a problem be fixed right now, the IT team feels the pressure. It's easy to be pushed into just doing something to see if it makes the pain go away. However, this is the point at which things can go from bad to worse -- the IT incident response team accidentally overwrites a database, loses data or breaks a service beyond repair in a few keystrokes and clicks.

As with any task performed in a live IT environment, understand what you're doing, why you're doing it and how to back out if it goes wrong. The best action could be as simple as taking a snapshot of a VM before you make a change that is expected to fix the issue, in case it makes things worse instead. If you have daily backups, run another incremental backup so that a current restore point is available if required. Fix the problem, but protect the IT team and the business with cautionary steps. Think about how you'd approach the change if it weren't a fire to put out, and try to tick as many boxes as possible in your incident management approach. The trick is to balance speed and risk mitigation.
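The snapshot-then-change approach can be sketched as a small wrapper that refuses to apply a fix until a restore point exists and that restores it if the fix fails or does not verify. This is an illustrative pattern only: `apply_with_rollback` and its four callables are hypothetical, and in practice each would wrap whatever snapshot, backup or health-check mechanism the environment actually provides.

```python
import logging
from typing import Callable, TypeVar

T = TypeVar("T")
log = logging.getLogger("incident")


def apply_with_rollback(
    take_restore_point: Callable[[], T],
    apply_fix: Callable[[], None],
    verify: Callable[[], bool],
    restore: Callable[[T], None],
) -> bool:
    """Apply a fix only after a restore point exists; undo it on failure."""
    # If this raises, nothing has been changed yet and the error propagates.
    restore_point = take_restore_point()
    try:
        apply_fix()
        if verify():
            return True
        log.warning("Fix applied but verification failed; restoring")
    except Exception:
        log.exception("Fix raised an error; restoring")
    restore(restore_point)
    return False
```

The ordering is the point of the design: the restore point is taken before anything changes, and success is decided by an explicit verification step rather than by the absence of an error.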

After the fix

Myth: You survived, and everything on the dashboard is green again. Move on to the next task.

Best practice: After letting everyone know a service has been restored, perform an incident post-mortem.

The relief of restoring a service is great, but this success needs to lead to more work. This IT incident management best practice breaks down into several steps: Reassess what happened, advise the staff who have a stake in the service about the cause and resolution and, finally, decide which actions will reduce the issue's effect or prevent it from recurring. The post-mortem is also a good time to talk about lessons learned and discuss where these lessons apply to other services, too.

Without the follow-up incident tracking and review, unknown risks remain. What if the original fix leads to future problems? For example, data moved onto a different disk could go unmonitored, and nobody will be alerted when it's low on disk space. The disk might not have a scheduled backup. Some fixes require a planned outage to put things back the way they were, in the interest of the IT platform. Every situation is different, so keep all necessary staff members involved in the decision-making process.
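For the relocated-data example, a follow-up action might be a simple free-space check on the new location until it is added to regular monitoring. The sketch below uses Python's standard `shutil.disk_usage`. The function name `low_disk_paths` and the 10% default threshold are assumptions for illustration, not recommendations.

```python
import shutil


def low_disk_paths(paths, min_free_fraction=0.10):
    """Return (path, free_fraction) for each path below the threshold."""
    low = []
    for path in paths:
        usage = shutil.disk_usage(path)
        if usage.total == 0:
            continue  # skip pseudo-filesystems that report no capacity
        free_fraction = usage.free / usage.total
        if free_fraction < min_free_fraction:
            low.append((path, free_fraction))
    return low
```

A check like this covers only the alerting gap. Whether the new disk is included in scheduled backups still needs to be confirmed separately.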

It's easy to look at problems with the clarity of hindsight and see what IT could have done better. In the moment, however, IT incident management best practices often lose out to the alluring myths. Review incident processes, and have general discussions with staff on how to react to incidents to dispel bad habits and improve the deployment's resiliency.
