Runbooks are collections of procedures and information that guide IT ops staff as they resolve issues. These documents can cover anything from troubleshooting processes to interconnections, as well as how to restart complex applications.
While a runbook is not an ITIL practice, it can follow the same guidelines and triggers put in place by a service catalog -- and it does follow many of ITIL's guidelines. The runbook's purpose is to keep IT fixes and processes consistent, even when the staff changes. This consistency can improve uptime, reduce staff effort and cut costs. However, nothing in IT is as black and white as it might seem.
To write an effective incident response runbook, begin with a focus on the fundamentals. From there, address rebooting and troubleshooting practices, keeping the cloud and security in mind.
A company must be willing to pay staff and dedicate resources to create and maintain comprehensive documentation. And the argument that those investments will decrease once the runbooks are completed is fundamentally flawed. IT systems are fluid; they are always changing and updating. To keep up, applicable runbooks must also be fluid, and they require resources to remain up to date.
Resources might come from areas outside IT operations. They might include application owners and engineers, which raises costs in terms of staff expenses. Determine whether it's worth it to have a developer, for example, dedicate several hours to help with a runbook to save an IT operations admin a few hours in the future. The answer is tricky, because the employee cost is not the same between IT roles. These considerations might make comprehensive documentation seem difficult to achieve, but there is middle ground.
Rather than document everything, start with the processes that rarely change and are fundamental. Adjusting the target is not admitting defeat; it's about striking a balance between the end value and the effort put in. A cornerstone of any runbook should be the application or IT environment layout. Staff can't troubleshoot a problem with a report server if they don't know the server's IP address. If IT staff know the incident relates to a report server, but must determine which server is the report server, the struggle becomes twofold: first, to identify where the problem is, and second, to fix it.
Document the location of each component of the IT environment, and who is responsible for it -- along with that person's contact information, such as a cellphone number. An application can span multiple virtual environments or extend to the public cloud. Staff must know this arrangement so they can focus on issue resolution. Infrastructure details such as server names, functions and IP addresses rarely change, which makes them a solid foundation on which to start an incident response runbook.
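As a minimal sketch of what such an environment layout could look like in machine-readable form, the Python below stores one record per component and looks components up by function. Every name, address and phone number here is a hypothetical placeholder, and the field list is an assumption about what a team might choose to record.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str         # server or service name
    role: str         # function, e.g. "report server"
    ip_address: str
    location: str     # e.g. "on-prem" or "public cloud"
    owner: str        # person responsible for the component
    owner_phone: str  # contact number for that person


# Hypothetical entries; real values come from your own environment.
INVENTORY = [
    Component("rpt01", "report server", "10.0.4.21", "on-prem", "A. Rivera", "555-0100"),
    Component("db01", "database", "10.0.4.10", "on-prem", "J. Chen", "555-0101"),
    Component("web01", "web front end", "10.0.8.5", "public cloud", "S. Patel", "555-0102"),
]


def find_by_role(role: str) -> list[Component]:
    """Return every component that fills the given role (case-insensitive)."""
    wanted = role.strip().lower()
    return [c for c in INVENTORY if c.role.lower() == wanted]
```

With a layout like this, a call such as `find_by_role("report server")` answers the "which server is it, and who owns it?" question before troubleshooting starts.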
Once the runbook details the environment, document the startup and shutdown order for IT systems and the effects of each. Every system needs to be rebooted at some point. To do so correctly, the IT operations team needs instructions on the correct order of actions and the effect each server in the application stack will have in this process. Include any possible issues or events staff might see in the reboot process and how to address them.
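One way to derive a startup and shutdown order, sketched here under the assumption that the team records which systems each system depends on, is a topological sort. The example uses `graphlib.TopologicalSorter` from the Python standard library (3.9 and later); the dependency map itself is hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each key lists the systems that must be
# running before it can start.
DEPENDS_ON = {
    "database": [],
    "app server": ["database"],
    "report server": ["database"],
    "web front end": ["app server"],
}


def startup_order(depends_on: dict[str, list[str]]) -> list[str]:
    """Order systems so every dependency starts before its dependents."""
    # static_order() raises graphlib.CycleError if the map contains a loop.
    return list(TopologicalSorter(depends_on).static_order())


def shutdown_order(depends_on: dict[str, list[str]]) -> list[str]:
    """Shut systems down in the reverse of the startup order."""
    return list(reversed(startup_order(depends_on)))
```

Systems with no dependency between them, such as the app server and report server above, can come out in either relative order, so the runbook should still state any preference explicitly.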
Another important part of an incident response runbook is troubleshooting processes and how to handle incidents. Do not try to include every event or ticket type in the runbook: It will be both huge and out of date immediately after the first software update. Be selective with inclusions and focus on common issues -- those that multiple staff members encounter repeatedly.
Even if everyone already knows the fix, include it in the runbook for new staff so they don't have to repeat the research process. Also include issues that might take considerable time or effort to correct, but only if the issue might recur. There is no significant benefit to including a one-time issue.
The troubleshooting section requires more thought and planning than the other two sections described above. Add a table of contents and a thorough index. A runbook is no good if staff can't find what they're looking for.
Prepare for different types of incidents
IT incidents normally fall into a few categories. A feature not working is an example of a common incident, while an outage is a more serious type. Other types include security or access incidents.
Each type has its own place and, specifically, its own runbook. Security incidents, for example, are time-sensitive events that need quick examination and remediation. They also typically require follow-up actions, which could include additional documentation depending on legal needs or concerns.
Outage incidents require immediate attention as well, alongside a rapid escalation process depending on timing and scale. This might involve waking critical staff and alerting management and users. A security incident might not require this level of escalation, which is why it is critical for teams to know what type of incident they face.
Feature incidents are typically low priority and can be resolved as time permits. As such, these could have a much longer time frame, but response teams must still be sure to follow the runbook's documented process and close the incident.
Companies often tag incidents or alerts as critical, severe, moderate or informational -- and that classification system is a problem. It creates a traffic light system that does not consider the unique aspects of each incident, which can create more issues. Instead, categorize each incident by type first and then assign a severity, so teams follow the correct runbook.
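The category-first approach can be sketched as a small routing function: the incident type selects the runbook, and severity only adjusts how far to escalate. The runbook names and the rule for waking staff are illustrative assumptions, not a prescribed policy.

```python
from enum import Enum


class Category(Enum):
    SECURITY = "security"
    OUTAGE = "outage"
    FEATURE = "feature"


# Hypothetical runbook names, one per incident category.
RUNBOOKS = {
    Category.SECURITY: "security-incident-runbook",
    Category.OUTAGE: "outage-runbook",
    Category.FEATURE: "feature-incident-runbook",
}


def route(category: Category, severity: str) -> dict:
    """Pick the runbook from the category; severity only tunes escalation."""
    # Assumed rule for illustration: only serious outages wake on-call staff.
    wake_staff = category is Category.OUTAGE and severity in {"critical", "severe"}
    return {
        "runbook": RUNBOOKS[category],
        "severity": severity,
        "wake_staff": wake_staff,
    }
```

Two incidents tagged "critical" therefore land in different runbooks when one is an outage and the other is a security event, which a severity-only traffic light would not distinguish.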
Prepare for the cloud
Incident response runbook fundamentals span all environments; however, the modern decentralized data center requires more.
One of a runbook's most important functions is to document how different systems interconnect and the key steps to correct any issues. This is still critical, but the cloud and more widely distributed applications add other changes for which IT teams must account.
A runbook's scope no longer stops at the cloud or external process line. On-site IT operations staff can't fix issues behind the scenes at a cloud provider. But if the applications support multiple zones and teams, and IT staff can change services or availability zones when there is an issue, then the cloud becomes an operations issue again.
This ability to change cloud zones or switch services from on site to cloud, or vice versa, is critical information to include in an updated runbook. This additional information can reduce downtime and ease recovery. To be clear, simply pressing a button to move services and fix everything would be fine -- but it is unlikely. There is always a process and an impact to these actions that must be included in the runbook.
With a distributed application, clear application ownership and a defined troubleshooting flow are critical, as some pieces might be in clouds or remote sites. Cloud services also bring a new wrinkle, as teams must document support contract hours, contract terms, guarantees and support processes in modern runbooks.
For example, if 90% of services are in-house, but the last 10% in a public cloud goes down, then that 90% just became a very expensive brick. That is why it's important to include -- even in an appendix -- the vendor contact name, type of support paid for, stakeholders and key technical contact information. This information can also be put into the incident response runbook at certain checkpoints to ensure support services are called before critical troubleshooting steps are done.
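As a rough sketch of such checkpoints, the Python below marks runbook steps as critical or as a vendor-support checkpoint, then flags any critical step that would run before support has been called. The `Step` fields and the sample steps are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    description: str
    critical: bool = False           # risky or hard-to-undo troubleshooting step
    vendor_checkpoint: bool = False  # step that tells staff to call vendor support


def critical_steps_before_checkpoint(steps: list[Step]) -> list[str]:
    """List critical steps that come before any vendor-support checkpoint."""
    vendor_called = False
    flagged = []
    for step in steps:
        if step.vendor_checkpoint:
            vendor_called = True
        elif step.critical and not vendor_called:
            flagged.append(step.description)
    return flagged


# Hypothetical runbook for an outage of a cloud-hosted service.
RUNBOOK = [
    Step("Confirm the outage from two locations"),
    Step("Restart the cloud connector", critical=True),
    Step("Call vendor support (contact details in appendix)", vendor_checkpoint=True),
    Step("Fail the service over to the secondary region", critical=True),
]
```

Here the check would flag "Restart the cloud connector", a hint that the vendor checkpoint belongs earlier in that runbook.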
It's not about making the perfect runbook that covers all possible issues; it's about creating a working framework that guides staff toward additional internal or external support if an issue escalates to key levels. The requirements for an incident response runbook have increased, pushing some to instead adopt flows.
Runbooks vs. flows
While some IT operations professionals are familiar with the runbook's technical process aspects, the effect of cloud expansion on IT operations is still relatively new and must be accounted for. For example, moving a database service to the cloud can raise data loss or corruption concerns. Or, the database might have issues moving from a cloud environment back to on premises due to application versioning issues from updates performed or committed in the cloud.
This type of scenario turns the runbook's simple list of instructions into something closer to a decision tree -- which is where flows come into the picture. The runbook is a set of instructions to respond to events or conditions of an event. A flow, or playbook, covers more of the business side and impact rather than just the technical steps found in the traditional runbook.
Business impact includes IT services as well. As the modern data center and application stack grow beyond just IT, teams must consider the flow around these key processes in addition to the steps in a runbook. The impact also determines which teams might need to drive decisions at key gateways.
The runbook can tell IT operations teams when to fail over to a service, but that authorization might fall to engineering or management and the runbook must show that -- and whether an additional sign-off is required for cost, legal or other reasons.
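A decision gateway of this kind can be sketched as a record of which sign-offs an action needs and which it has received so far. The class name, role names and action below are hypothetical, chosen only to illustrate the idea.

```python
from dataclasses import dataclass, field


@dataclass
class FailoverDecision:
    action: str
    required_signoffs: set[str]  # roles that must approve, e.g. for cost or legal
    received_signoffs: set[str] = field(default_factory=set)

    def approve(self, role: str) -> None:
        """Record a sign-off; reject roles the runbook does not list."""
        if role not in self.required_signoffs:
            raise ValueError(f"{role!r} is not an approver for {self.action!r}")
        self.received_signoffs.add(role)

    def outstanding(self) -> set[str]:
        """Roles that still need to sign off."""
        return self.required_signoffs - self.received_signoffs

    def authorized(self) -> bool:
        """True only once every required sign-off has been received."""
        return not self.outstanding()
```

For example, a decision created with `required_signoffs={"engineering", "finance"}` stays unauthorized after engineering approves, and `outstanding()` shows the operations team that finance is still pending.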
Failing over systems into public clouds or shared co-locations can have serious financial consequences beyond the traditional IT operations level. The playbook -- which supports the runbook and doesn't replace it -- must capture this.
Think of the playbook as the branches of a tree and the runbook as the leaves. The branch leads to a specific path, but the leaves hold the details of what needs to be done. IT events are getting increasingly complex, and that affects the documentation needed to address them. Simple outages can still occur, but even a simple outage can have wide-ranging effects if it hits a core service shared between teams -- and teams must account for that, as well.
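The branch-and-leaf idea maps naturally onto a small decision tree, sketched below with playbook branches as yes/no questions and runbooks as the leaves. The questions, runbook names and steps are hypothetical examples.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Runbook:
    """Leaf: the detailed technical steps."""
    name: str
    steps: list[str]


@dataclass
class Branch:
    """Playbook branch: a yes/no decision that leads to another node."""
    question: str
    yes: Union["Branch", Runbook]
    no: Union["Branch", Runbook]


def select_runbook(node: Union[Branch, Runbook], answers: dict[str, bool]) -> Runbook:
    """Walk the playbook's branches until a runbook leaf is reached."""
    while isinstance(node, Branch):
        node = node.yes if answers[node.question] else node.no
    return node


# Hypothetical playbook for a database outage.
PLAYBOOK = Branch(
    question="Is the service shared between teams?",
    yes=Branch(
        question="Is cloud failover approved?",
        yes=Runbook("cloud-failover", ["Notify teams", "Fail over", "Verify data"]),
        no=Runbook("local-restore", ["Notify teams", "Restore from backup"]),
    ),
    no=Runbook("simple-restart", ["Restart service", "Verify health"]),
)
```

Adding a new scenario then means attaching one more branch or leaf rather than rewriting the whole document.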
Include a security breach plan
Security breaches are another big subject for playbooks and runbooks. For example, consider what steps and actions to take in the event of an incident, and what effect it could have upstream or downstream in the development pipeline.
As complex as this is, the branch/leaf approach shows how to handle it step by step. Start with common incidents. Focus on one branch for the playbook and add runbooks as necessary. Build that out and use it as a model to create the next set. This will take time, but dividing it into manageable pieces avoids overwhelming staff. It will also create a system that future teams can adjust and update without starting from scratch each time.