Step-by-step guide to develop an AWS runbook

Runbooks enable anyone on the team to execute a task and jump in when things go wrong. Learn how to create one to handle repetitive tasks like monitoring and incident response.

AWS admins should make it a habit to build runbooks with instructions to set up instances, configure resources or accomplish other tasks. Getting started takes only a few basic steps.

AWS runbooks are predefined procedures to achieve a specific outcome. They capture the institutional knowledge of an organization's cloud operations so that it passes from one person to another instead of leaving with an individual. A runbook should contain the minimum information necessary for team members to perform a given procedure successfully.

In this 101 guide, we'll go over the basic framework to create an AWS runbook.

Plan an AWS runbook

Before you write an AWS runbook, straighten out some details concerning your system and environment.

Start by identifying frequently executed procedures for operations on AWS. Common examples are recurring admin tasks, monitoring and troubleshooting. Look for procedures with high error rates. These are the activities that will benefit the most from a clear guidebook. A good runbook reduces the probability of failures in cloud operations and, by extension, in the business that depends on them.

Document the requirements to execute the runbook. The AWS runbook should list the required permissions for all the relevant cloud services in development, test and production. Resist the temptation to capture only the permissions for live instances in production -- issues can occur in any stage of the software's lifecycle. Get this information from business and technology stakeholders, particularly technology staff with day-to-day operational responsibilities on AWS.
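One way to keep the permissions list honest is to record it per environment and check for gaps. The sketch below is only an illustration: the `RunbookRequirements` class, the runbook name and the environment labels are assumptions made for this example, and the IAM actions shown are simply sample entries.

```python
from dataclasses import dataclass, field

# Environment labels taken from the text above; adjust to your own stages.
REQUIRED_ENVIRONMENTS = ("development", "test", "production")


@dataclass
class RunbookRequirements:
    """Permissions a runbook needs, grouped by environment (hypothetical structure)."""

    runbook_name: str
    # Maps environment name -> IAM actions the operator needs in that environment.
    permissions: dict[str, list[str]] = field(default_factory=dict)

    def missing_environments(self) -> list[str]:
        """Return the environments that have no permissions documented yet."""
        return [env for env in REQUIRED_ENVIRONMENTS if not self.permissions.get(env)]


requirements = RunbookRequirements(
    runbook_name="restart-web-tier",  # illustrative name
    permissions={
        "production": ["ec2:DescribeInstances", "ec2:RebootInstances"],
        "test": ["ec2:DescribeInstances", "ec2:RebootInstances"],
    },
)

# Only production and test were captured, so development is flagged.
print(requirements.missing_environments())  # ['development']
```

A check like this makes it harder to document production alone and forget the earlier stages.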

Next, identify the tools and configurations required to use these cloud services. Include any third-party cloud management and cybersecurity tools, and any other applications that support cloud operations. Then, identify the network connectivity and access needed to run the cloud services.

Finally, document all the constraints that might block runbook execution. Examples include maintenance windows and impacted resources. Identify and address conflicts with other business or operations activities, such as software updates or security mitigation. This precaution ensures continuous operation of cloud services when someone executes the runbook.
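Constraints such as maintenance windows and conflicting activities can also be expressed as a pre-flight check. The following is a minimal sketch, assuming a daily window given as start and end times and a plain list of in-progress activities; the function names and the sample data are hypothetical.

```python
from datetime import datetime, time, timezone


def in_maintenance_window(now: datetime, start: time, end: time) -> bool:
    """Return True if `now` falls inside the daily window [start, end).

    Windows that cross midnight (for example 23:00 to 02:00) are handled.
    """
    current = now.time()
    if start <= end:
        return start <= current < end
    return current >= start or current < end


def find_blockers(now: datetime, start: time, end: time,
                  active_changes: list[str]) -> list[str]:
    """Collect the reasons, if any, why the runbook should not run right now."""
    reasons = []
    if not in_maintenance_window(now, start, end):
        reasons.append("outside maintenance window")
    reasons.extend(f"conflicts with: {change}" for change in active_changes)
    return reasons


# Sample values, invented for illustration only.
now = datetime(2024, 1, 10, 23, 30, tzinfo=timezone.utc)
blockers = find_blockers(now, time(23, 0), time(2, 0), ["security patch rollout"])
print(blockers)  # ['conflicts with: security patch rollout']
```

An empty list means nothing documented stands in the way of running the procedure.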

AWS runbooks vs. playbooks

On the surface, AWS runbooks are similar to traditional playbooks. But whereas playbooks refer to business activities, such as corporate strategy or marketing, a runbook refers to computer systems or networks. Further complicating matters, AWS defines playbooks differently. In its Well-Architected Framework, playbooks are predefined steps to investigate an issue. Essentially, these are documents used for troubleshooting issues like application network connectivity. AWS runbooks, by contrast, document the steps to achieve a specific outcome, such as responding to an incident, rather than the steps to investigate one.

Document escalation procedures

Escalation procedures are an integral element of AWS runbooks. These procedures give the runbook users a designated person to contact if they cannot complete the task. Include these staff members' full contact information and the amount of time it should take for anyone to complete the runbook. This time estimate lets them know when to escalate. For example, if you have a business-critical application that you need to fail over to another AWS Region, your AWS runbook should include the time frame for that failover. If the person running the task cannot complete it quickly enough to avoid massive disruptions, they should escalate the task.
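The time estimate can be turned into a simple rule: escalate once the elapsed time exceeds the documented duration. This is a minimal sketch under that assumption; the `EscalationPolicy` class, the 30-minute figure and the contact address are hypothetical values chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class EscalationPolicy:
    """Escalation details documented in the runbook (hypothetical structure)."""

    expected_duration: timedelta  # how long the procedure should take
    contact: str                  # whom to reach when the time is exceeded


def should_escalate(policy: EscalationPolicy, started_at: datetime, now: datetime) -> bool:
    """Return True once the procedure has run longer than its documented estimate."""
    return now - started_at > policy.expected_duration


# Example: a Region failover documented as a 30-minute task.
failover_policy = EscalationPolicy(
    expected_duration=timedelta(minutes=30),
    contact="on-call-lead@example.com",  # placeholder address
)
started = datetime(2024, 1, 10, 10, 0, tzinfo=timezone.utc)
checked = datetime(2024, 1, 10, 10, 45, tzinfo=timezone.utc)

if should_escalate(failover_policy, started, checked):
    print(f"Escalate to {failover_policy.contact}")
```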

Include the full contact information for any third parties in the escalation path of the runbook, including under what circumstances the team should escalate issues to them. For example, you might have a support contract with a third-party systems integrator that built an application for your organization. A runbook for tasks related to that app should include the support contract, preferred contact and other contractual information.

If your organization requires that stakeholders and decision-makers, such as the IT director or chief information security officer, be informed before any system changes occur, identify the circumstances under which on-duty team members should reach out before they execute a procedure. For example, some organizations deal with personal health and classified information, and decision-makers must be aware of any changes that could compromise that data.

Create an AWS runbook

Now that you have your requirements and escalation procedures set, you can write the runbook. Outline the steps in any given procedure clearly, to reduce the level of operational effort when a team member picks up the document.

  • Do not skimp on detail in runbook procedures. Document each procedure as an action with an expected outcome. Take the time to write in enough detail for any user. It's OK if you don't get everything perfect the first time; you might iterate on a few runbook versions to get it right.
  • Develop AWS runbooks collaboratively. Runbook development benefits from cross-pollination, so aggregate the work of solution architects, developers, operations staff and technical writers.
  • Implement security and internal controls to ensure that only authorized staff and resources can execute the procedures in the runbook. Executing runbooks against defined targets is a standard security measure. Some organizations use metadata to verify that they're running against the correct target.
  • Operations teams also need procedures in place to reverse a runbook action in case they must roll back a system update or revert another change. Most often, this is through the execution of another runbook -- one to reverse the change and return the AWS environment to its previous state.
  • Include a mechanism in the procedure to verify whether or not a runbook was successful. You can base success on the return codes from the executed actions within the runbook. Alternatively, some systems may recognize successful runbook execution programmatically.
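Several of the points above (actions with expected outcomes, target verification through metadata, rollback and success based on return codes) can be sketched together in a few lines. This is an illustration under stated assumptions, not a reference implementation: the `Step` class, the `run_runbook` function, the tag names and the stand-in actions are all hypothetical, and the actions only record what they would do.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    """One runbook action plus the outcome the operator should expect."""

    description: str
    action: Callable[[], int]  # returns 0 on success, like a shell exit code
    expected_outcome: str
    rollback: Optional[Callable[[], int]] = None


def verify_target(target_tags: dict[str, str], required_tags: dict[str, str]) -> bool:
    """Check that the target's metadata matches what the runbook expects."""
    return all(target_tags.get(key) == value for key, value in required_tags.items())


def run_runbook(steps: list[Step], target_tags: dict[str, str],
                required_tags: dict[str, str]) -> bool:
    """Run steps in order; on a non-zero return code, undo completed steps in reverse."""
    if not verify_target(target_tags, required_tags):
        print("Refusing to run: target metadata does not match")
        return False
    completed: list[Step] = []
    for step in steps:
        if step.action() != 0:
            print(f"Step failed: {step.description}")
            for done in reversed(completed):
                if done.rollback is not None:
                    done.rollback()
            return False
        completed.append(step)
    return True


log: list[str] = []


def record(message: str, code: int = 0) -> Callable[[], int]:
    """Build a stand-in action that logs a message and returns a fixed code."""
    def action() -> int:
        log.append(message)
        return code
    return action


steps = [
    Step("Drain instance from load balancer", record("drain"),
         "No new requests reach the instance", rollback=record("undrain")),
    Step("Apply update", record("update", code=1),  # simulated failure
         "Service reports the new version"),
]
succeeded = run_runbook(steps, {"Environment": "test"}, {"Environment": "test"})
print(succeeded, log)  # False ['drain', 'update', 'undrain']
```

In this sketch, rollback is modeled as a per-step callable; in practice it is often a separate runbook, as noted above.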

Convert a runbook into code

AWS runbooks are most effective as code. While runbooks guide a human through documented processes, runbook automation takes the human out of the routine steps -- and with them, much of the opportunity for human error. Create AWS Systems Manager Automation documents to run scripts that automate the execution. Another automation option is Amazon CloudWatch Events, now Amazon EventBridge. Use it to create rules that match events and trigger an action, such as starting an automation.
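To give a sense of what a runbook looks like as code, here is a minimal sketch that builds a Systems Manager Automation document and an event pattern as Python dictionaries and prints them as JSON. The single step uses the `aws:changeInstanceState` action to stop an instance; the document description, the step name and the choice of a CloudWatch alarm as the trigger are assumptions made for this example, so verify the fields against the current AWS documentation before use.

```python
import json

# Automation document (schema version 0.3) with one step that stops an instance.
automation_document = {
    "schemaVersion": "0.3",
    "description": "Stop an EC2 instance (illustrative runbook).",
    "parameters": {
        "InstanceId": {"type": "String", "description": "Instance to stop"},
    },
    "mainSteps": [
        {
            "name": "stopInstance",  # hypothetical step name
            "action": "aws:changeInstanceState",
            "inputs": {
                "InstanceIds": ["{{ InstanceId }}"],
                "DesiredState": "stopped",
            },
        }
    ],
}

# Event pattern for a rule that matches CloudWatch alarms entering the ALARM state.
# Assumed trigger for this sketch; a rule with this pattern would still need a target.
event_pattern = {
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
    "detail": {"state": {"value": ["ALARM"]}},
}

print(json.dumps(automation_document, indent=2))
print(json.dumps(event_pattern, indent=2))
```

With credentials in place, the JSON could be registered through the Systems Manager `create_document` API with the document type set to Automation and run with `start_automation_execution`; those calls are left out here so the sketch runs without an AWS account.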

Maintain an AWS runbook

Continuously work on the organization's AWS runbooks. Create a process to review their execution and identify needed optimizations and revisions. Runbooks should become a regular review item as part of a compliance audit, incident postmortem or operations review.

Have new employees shadow experienced team members to learn how to document their work. Look for ways to include runbook creation and maintenance as part of the onboarding process for your technical staff. Runbook maintenance is also a task where you should involve your technical writers during their down periods.

