Outages and crashes are ubiquitous in IT, but every event is unique. And because even the best-laid plans sometimes go awry, these unpredictable events put IT teams to the test.
For example, a Fortune 500 organization's data center with dual power and generators in a fully meshed design might go dark -- because somebody runs into a power button with a plastic cart.
IT issues are nothing new for business leaders, IT managers or users. It's when they don't know the issue's status or progress that the frustration sets in. But strong crisis management protocols prevent chaos.
Formalize a plan
Every organization should have an IT crisis management playbook that outlines its contingency plans, equipment and processes to handle issues. Some solutions will work, and some won't. But the lessons learned from the crisis are key.
It's impossible to address every potential crisis scenario in the data center, so focus instead on the applications and systems. An IT crisis management playbook for applications should document ownership of each application, the hierarchy of responsibility to manage the situation and what information to share with users. Start with a definition of the applications -- it's unlikely that every member of the IT operations team knows this information for every app in use.
Simple knowledge of each application's profile is critical: At a quick glance, operations personnel should be able to discern the size of an app's user base, as well as which type of customer the application affects. Additionally, teams should have access to a list of application or service dependencies that creates a map of potential areas of effect.
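One way to keep such a profile usable at a glance is to store it as structured data. The following is a minimal sketch, not a prescribed format: the `AppProfile` fields, the `affected_by` helper and the sample applications are all hypothetical, chosen only to show how a dependency list can be turned into a map of potential areas of effect.

```python
from dataclasses import dataclass, field


@dataclass
class AppProfile:
    """Hypothetical at-a-glance profile for one application."""
    name: str
    owner: str
    user_count: int
    customer_type: str  # for example "internal" or "external"
    depends_on: list[str] = field(default_factory=list)


def affected_by(failed: str, profiles: dict[str, AppProfile]) -> set[str]:
    """Return every app that depends on `failed`, directly or indirectly."""
    affected: set[str] = set()
    frontier = [failed]
    while frontier:
        current = frontier.pop()
        for app in profiles.values():
            if current in app.depends_on and app.name not in affected:
                affected.add(app.name)
                frontier.append(app.name)
    return affected


# Illustrative data only; names, owners and user counts are made up.
profiles = {
    p.name: p
    for p in [
        AppProfile("auth-db", "DBA team", 0, "internal"),
        AppProfile("sso", "Identity team", 4000, "internal", ["auth-db"]),
        AppProfile("billing-portal", "Finance IT", 1200, "external", ["sso"]),
        AppProfile("wiki", "IT operations", 300, "internal"),
    ]
}

impacted = affected_by("auth-db", profiles)
print(sorted(impacted))  # ['billing-portal', 'sso']
```

In this sketch, a failure in `auth-db` surfaces both the single sign-on service and the customer-facing portal that relies on it, which is the kind of knock-on effect the dependency list is meant to expose.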
The second significant component of an IT crisis management playbook is a breakdown of common or recurring issues and their suggested fixes. Append the top resolution suggestions from the application vendors as well. Don't expect to create an exhaustive list, but cover five to 10 of the business's most critical applications. Create a comprehensive index so that both vendors and IT operations staff can see quickly whether they need to escalate an issue -- and to whom -- with internal contact information attached.
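An index like this can be as simple as a searchable list of entries. The sketch below is one possible shape, under stated assumptions: the `KnownIssue` fields, the `lookup` helper, the sample issues and the placeholder phone numbers are all hypothetical and stand in for whatever a given team actually records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KnownIssue:
    """Hypothetical playbook entry for one recurring issue."""
    app: str
    symptom: str
    suggested_fix: str
    source: str        # "internal" or "vendor"
    escalate_to: str   # who to call if the fix does not work
    contact: str       # placeholder contact details


# Illustrative entries only; issues, fixes and numbers are made up.
PLAYBOOK = [
    KnownIssue(
        app="billing-portal",
        symptom="Login requests timeout after deploy",
        suggested_fix="Restart the session cache, then verify SSO health",
        source="internal",
        escalate_to="Finance IT on-call",
        contact="555-0102",
    ),
    KnownIssue(
        app="billing-portal",
        symptom="Invoice export returns empty file",
        suggested_fix="Apply the vendor's recommended reindex procedure",
        source="vendor",
        escalate_to="Vendor support desk",
        contact="555-0103",
    ),
]


def lookup(app: str, keyword: str, playbook: list[KnownIssue] = PLAYBOOK) -> list[KnownIssue]:
    """Find entries for `app` whose symptom mentions `keyword` (case-insensitive)."""
    keyword = keyword.lower()
    return [
        issue for issue in playbook
        if issue.app == app and keyword in issue.symptom.lower()
    ]


for issue in lookup("billing-portal", "timeout"):
    print(f"{issue.suggested_fix} (escalate to {issue.escalate_to}, {issue.contact})")
```

Keeping the escalation contact on the same record as the fix means whoever finds the entry also finds who to call next, in print or on screen.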
A common question about crisis management playbooks is recommended format: paper or digital? If the modern paperless office is any indicator, create both. Paper binders require effort to update and store, but they also work without power -- something that's not a guarantee with a digital version.
A full power outage should never happen, but an organization must prepare for the possibility. Maintaining a communication protocol, along with application and systems information, saves time in a crisis. IT personnel can then focus on correcting the issue rather than hunting for basic information, which only lengthens time to resolution.
Talk to users and stakeholders
Communication is a priority in any outage. Without it, a minor problem can become catastrophic.
Every organization and application has stakeholders, such as application owners, management and customers. Whoever is on task to fix the issue -- an engineer, IT operations admin or application owner -- cannot be the communication point. Someone else must lead communications; otherwise, the fixer becomes too busy relaying information to address the issue.
Include a communication tree within an IT crisis management playbook. Keep a short and updated list of communication points to eliminate confusion over which admin or manager to contact when time is critical. Depending on the nature of the issue at hand, certain communication methods won't work. Large-scale network outages can take email, IM and IP phone systems offline. Cell phones are the obvious choice for communication in this scenario, so be sure to include staff's mobile numbers in the communication tree.
Also, include vendor contacts and key phone numbers for the promised support outlined in applications' service-level agreements.
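A communication tree can also be kept as data so that it is easy to check which contacts remain reachable when certain channels are down. The following is a rough sketch under assumptions: the `Contact` structure, the `call_list` helper, the channel names and every name and number in the sample tree are hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Contact:
    """Hypothetical node in a communication tree."""
    name: str
    role: str
    channels: dict[str, str]  # channel -> address, in order of preference
    reports: list["Contact"] = field(default_factory=list)


def call_list(
    root: Contact, channels_down: set[str]
) -> list[tuple[str, Optional[str], Optional[str]]]:
    """Walk the tree top-down and pick each contact's first working channel.

    Contacts with no working channel are returned with None values so the
    gap is visible instead of silently dropped.
    """
    result = []
    queue = deque([root])
    while queue:
        person = queue.popleft()
        usable = next(
            ((ch, addr) for ch, addr in person.channels.items()
             if ch not in channels_down),
            (None, None),
        )
        result.append((person.name, usable[0], usable[1]))
        queue.extend(person.reports)
    return result


# Illustrative tree only; all names, addresses and numbers are placeholders.
ops_manager = Contact("Ops manager", "management liaison",
                      {"im": "ops-manager", "mobile": "555-0101"})
vendor = Contact("Vendor support", "SLA hotline",
                 {"email": "support@vendor.example"})
comms_lead = Contact("Comms lead", "communication point",
                     {"email": "comms@example.com", "mobile": "555-0100"},
                     reports=[ops_manager, vendor])

# Simulate a network outage that takes email, IM and IP phones offline.
for name, channel, address in call_list(comms_lead, {"email", "im", "ip_phone"}):
    print(name, channel, address)
```

In this example the vendor entry lists only an email address, so it comes back with no usable channel during a network outage. Running a check like this ahead of time shows where a phone number is missing from the tree.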
One option with demonstrated success is to have an engineer on speakerphone with the vendor, while an IT operations admin in the room listens and then conveys key points to management outside. This practice lets the engineer work on the issue while the lines of communication stay open. Enact procedures that discourage management from checking in too often, which disrupts the resolution process. Set up these processes beforehand, but remember that failures are unexpected and responsibility is difficult to assign in advance. Be prepared to designate the communication point person on the fly; IT operations must come together to define every role in the crisis management process -- and then respect those roles when the time comes.