Modern software is complex, with dependencies on hardware infrastructure and other software platforms. This means, even with careful design and testing, incidents happen.
IT incidents can have serious business implications, such as customer dissatisfaction, lost revenue and compliance breaches. Fortunately, software developers routinely work with IT staff to identify and resolve incidents quickly, and both groups can learn valuable lessons along the way. These lessons can then roll into development forks and future releases.
But sorting through the aftermath of a significant incident, rooting out the underlying causes and gleaning value from the experience requires conscientious effort. A team must set aside the time to assess and analyze the chain of events that led up to, took place during and followed an incident. These incident post-mortems benefit the DevOps process.
Incident post-mortem basics
An incident occurs when software's behavior deviates from what is expected. For example, an incident might occur when developers fail to handle buffer overflow errors, which can give a hacker access to the software and sensitive business data. Or a server or storage device might fail, which renders the software unavailable and can precipitate business data loss.
Other incidents occur when poorly configured infrastructure prevents reliable communication between the software and a corresponding database, which results in performance problems or crashes. The list of incident examples is almost endless.
Open up the post-mortem
When an incident occurs, developers' and IT staff's immediate goal is to resolve the issue and return the software to normal operation. This typically involves dynamic responses, such as application reboots, database files restoration, VM migration to other hardware devices and configuration checks and corrections. IT professionals often manage this firefighting as part of their everyday jobs.
The aftermath of an incident -- once the troubled workload returns to normal service for the business -- gives developers and operations teams the opportunity to review the event, assess its effect, identify the core takeaway lessons and select a course of action to prevent a recurrence. This process is the incident post-mortem.
From a practical standpoint, the incident post-mortem is typically handled through a team meeting. Team members walk through the timeline and associated evidence -- error logs, for example -- to discuss root cause analysis results and determine next steps. Post-mortem meetings might cover numerous incidents.
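The timeline the team walks through can be assembled before the meeting by merging timestamped log lines from several sources. The following is a minimal sketch, assuming each log line starts with an ISO 8601 timestamp followed by a space and a message. The source names, sample lines and the build_timeline helper are hypothetical and only illustrate the idea.

```python
from datetime import datetime
from typing import NamedTuple


class TimelineEntry(NamedTuple):
    timestamp: datetime
    source: str
    message: str


def build_timeline(logs: dict[str, list[str]]) -> list[TimelineEntry]:
    """Merge log lines from several sources into one chronological list.

    Assumes every usable line looks like '<ISO 8601 timestamp> <message>'.
    Lines that do not parse are skipped rather than guessed at.
    """
    entries = []
    for source, lines in logs.items():
        for line in lines:
            stamp, _, message = line.partition(" ")
            try:
                when = datetime.fromisoformat(stamp)
            except ValueError:
                continue  # not a timestamped line; leave it out of the timeline
            entries.append(TimelineEntry(when, source, message.strip()))
    return sorted(entries, key=lambda entry: entry.timestamp)


# Hypothetical sample data, for illustration only.
sample_logs = {
    "app-server": [
        "2024-03-01T10:02:15 ERROR connection to database lost",
        "2024-03-01T10:20:40 INFO service restarted",
    ],
    "database": [
        "2024-03-01T10:01:58 WARN disk latency above threshold",
        "garbled line without a timestamp",
    ],
}

timeline = build_timeline(sample_logs)
for entry in timeline:
    print(entry.timestamp.isoformat(), entry.source, entry.message)
```

Sorting by timestamp across sources puts the database warning ahead of the application error, which is the kind of ordering a root cause discussion depends on. Real logs rarely share one format, so the parsing step is the part most likely to need adapting.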
However, not all incidents require a post-mortem evaluation. Incidents are usually categorized by severity as low, moderate or severe, so post-mortem meeting time can focus on the most serious incidents.
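A severity rule like this can be written down so that the choice of which incidents get a post-mortem is consistent. The sketch below assumes the three levels named above and a configurable threshold. The Incident fields, the incident IDs and the needs_postmortem function are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    LOW = 1
    MODERATE = 2
    SEVERE = 3


@dataclass
class Incident:
    incident_id: str
    title: str
    severity: Severity


def needs_postmortem(incident: Incident,
                     threshold: Severity = Severity.SEVERE) -> bool:
    """Return True when the incident meets or exceeds the severity threshold."""
    return incident.severity >= threshold


# Hypothetical incidents, for illustration only.
incidents = [
    Incident("INC-1", "Slow report generation", Severity.LOW),
    Incident("INC-2", "Database exposed to attackers", Severity.SEVERE),
    Incident("INC-3", "Intermittent login failures", Severity.MODERATE),
]

# Default policy: only severe incidents reach the meeting agenda.
agenda = [i for i in incidents if needs_postmortem(i)]

# A business that reviews moderate incidents too can lower the threshold.
wider_agenda = [i for i in incidents if needs_postmortem(i, Severity.MODERATE)]

print([i.incident_id for i in agenda])
print([i.incident_id for i in wider_agenda])
```

Keeping the threshold as a parameter lets the same rule serve both a strict policy and a broader one that batches less severe incidents into a single meeting.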
In DevOps organizations, post-mortem meetings are often most beneficial in the time after the release of a new iteration and before the planning phase of the next. This timing synchronizes the lessons and next steps -- such as patch implementation to handle buffer overflow errors -- with the next release cycle.
Incident post-mortem process
Post-mortems go by many names, such as root cause analysis or incident review. There is no single, universally accepted approach to a post-mortem. Options vary from casual to highly formal, depending on the size of the business, the nature of the product or the severity of the incident.
While the names and approaches might vary, the goals remain the same: Inform leadership and relevant stakeholders and take long-term corrective action. Post-mortems can involve significant time and effort to gather and assess information. Thus, the post-mortem meeting might occur days, or even weeks, after the actual incident.
Ideally, a post-mortem includes a consistent slate of content that covers a summary, triggers, effects, assessment, resolution and conclusions.
Summary. Post-mortem meetings typically start with a high-level outline or summary of the incident. The outline portrays what happened, the services involved, effects on users or customers, the severity and duration, business impact, staff involved in the response and the overall resolution of the incident. This summary is particularly beneficial to managers and application owners who must communicate details of the incident beyond the organization, such as to C-suite executives or even outside regulators.
Triggers. Beyond the summary, a post-mortem digs into the more technical and operational aspects of the incident -- usually starting with the causes and triggers. This part of the post-mortem explains the origins of the failure and highlights the underlying causes. For example, a failure in the application server might have precipitated the incident.
Effects. After the team clarifies the cause, the post-mortem discussion digs deeper into the effects on services, users and the overall business. This part of the post-mortem generally evaluates the extent or severity of the incident. For example, a software fault might have disrupted the organization's flagship SaaS product and left hundreds of paying users without access to the service for the duration of the incident.
Assessment and resolution. Review the incident timeline: Note the time of failure, the time when help calls or tickets began to arrive, the time until those tickets were addressed, the personnel involved, diagnostic procedures implemented and results obtained, and the steps followed to remediate the problem. This discussion can also include failed attempts to fix the issue -- a review of what did not work can be as useful as a review of what did. For example, a review might note that a technician rebooted a failed server to no effect, which led the technician to restore the affected application from backups on another available server and redirect traffic to the new host to restore functionality.
Conclusion. Post-mortems usually wrap up with recommendations and often include some dynamic discussion among the DevOps team about preventative actions or ongoing guidance. For example, a review might conclude that an important application or service would have benefited from a high-availability cluster or failover configuration that would have kept the service running on another server. The next steps would then be to budget for and deploy an application cluster to prevent a recurrence.
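One way to keep that slate of content consistent from one post-mortem to the next is a simple template that flags any section left empty. The following is a minimal sketch, not a standard format: the PostMortem class, its field names and its methods are assumptions made for illustration, and the sample text reuses the examples from the sections above.

```python
from dataclasses import dataclass, fields


@dataclass
class PostMortem:
    # One field per section described above; all default to empty.
    summary: str = ""
    triggers: str = ""
    effects: str = ""
    assessment_and_resolution: str = ""
    conclusion: str = ""

    def missing_sections(self) -> list[str]:
        """List the sections that have not been filled in yet."""
        return [f.name for f in fields(self)
                if not getattr(self, f.name).strip()]

    def render(self) -> str:
        """Render the report as plain text, one section per line."""
        lines = []
        for f in fields(self):
            heading = f.name.replace("_", " ").capitalize()
            body = getattr(self, f.name).strip() or "(to be completed)"
            lines.append(f"{heading}. {body}")
        return "\n".join(lines)


# Sample content based on the examples in this article.
report = PostMortem(
    summary="Flagship SaaS product unavailable to paying users.",
    triggers="A failure in the application server.",
    effects="Hundreds of paying users lost access during the incident.",
    assessment_and_resolution=(
        "Rebooting the failed server had no effect; the application was "
        "restored from backups on another server and traffic was redirected."
    ),
)

print(report.missing_sections())
print(report.render())
```

A check like missing_sections can run before the meeting so the team knows which parts of the report still need evidence or discussion.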
Improve incident post-mortems
So how can organizations make post-mortems more valuable and beneficial to the DevOps team -- and to the business?
First, eliminate blame. One of the biggest weaknesses of incident post-mortem meetings is the potential for blame, which undermines the post-mortem's purpose. For example, a developer wrote code that didn't handle an error gracefully, causing the application to crash, or an operations admin didn't update a server OS, which opened the company's database to hackers. The list goes on. People can be intimidated easily when personal job performance and reputations are on the line. Decouple personnel from actions and guard against personal culpability in incident assessments to conduct more open, honest and thorough discussions.
Assign a leader to each post-mortem cycle. While the leadership role might fall to a DevOps manager, it's helpful to assign others to lead the post-mortem effort. It is increasingly common to assign DevOps staff -- ideally with some knowledge of the incident -- to lead the post-mortems, which further fosters a more open and collaborative atmosphere and reduces the tendency to blame anyone for the issue.
As mentioned, post-mortem evaluations take time and involve gathering extensive information -- so reserve them for the most serious incidents, such as a data breach. If business needs dictate post-mortem analysis for less severe incidents, it might be possible to cover multiple incidents within the same meeting period.
Prepare as many factual details as possible, such as timelines combed from device or application logs, lists of applications involved and other definitive details, which eliminate guesswork and lend confidence to discussions. Also, record relevant metrics or key performance indicators, such as downtime, severity, time to troubleshoot and time to resolve. When an enterprise tracks metrics, DevOps teams can watch trends and understand how issue frequency and resolution effectiveness change over time.
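Metrics such as these can be derived from a few timestamps per incident and then averaged by month to show a trend. The sketch below rests on stated assumptions: time to troubleshoot is measured from the failure to the diagnosis, time to resolve is measured from the failure to service restoration, and downtime is treated as equal to time to resolve. Teams define these measures differently, and the record fields, function names and sample data here are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import NamedTuple


class IncidentRecord(NamedTuple):
    failed_at: datetime
    diagnosed_at: datetime
    restored_at: datetime


def incident_metrics(record: IncidentRecord) -> dict[str, timedelta]:
    """Derive per-incident durations under the assumptions stated above."""
    return {
        "time_to_troubleshoot": record.diagnosed_at - record.failed_at,
        "time_to_resolve": record.restored_at - record.failed_at,
    }


def mean_time_to_resolve_by_month(
    records: list[IncidentRecord],
) -> dict[str, timedelta]:
    """Average time to resolve, grouped by the month in which each failure began."""
    buckets = defaultdict(list)
    for record in records:
        month = record.failed_at.strftime("%Y-%m")
        buckets[month].append(record.restored_at - record.failed_at)
    return {
        month: sum(durations, timedelta()) / len(durations)
        for month, durations in sorted(buckets.items())
    }


# Hypothetical incident history, for illustration only.
history = [
    IncidentRecord(datetime(2024, 1, 5, 9, 0),
                   datetime(2024, 1, 5, 9, 30),
                   datetime(2024, 1, 5, 11, 0)),
    IncidentRecord(datetime(2024, 1, 20, 14, 0),
                   datetime(2024, 1, 20, 14, 20),
                   datetime(2024, 1, 20, 15, 0)),
    IncidentRecord(datetime(2024, 2, 3, 8, 0),
                   datetime(2024, 2, 3, 8, 10),
                   datetime(2024, 2, 3, 8, 30)),
]

first = incident_metrics(history[0])
trend = mean_time_to_resolve_by_month(history)
for month, average in trend.items():
    print(month, average)
```

With this sample data the monthly average falls from January to February, which is the kind of change in resolution effectiveness the team would want to see and explain.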
Invest in tools and templates
Organizations can draw upon existing tools to accelerate and even automate parts of the incident response -- and post-mortem -- process. For example, VictorOps incident response software combines log management, monitoring, chat, alerting, reporting and scheduling tasks, which together generate documentation for later post-mortem review. Similarly, Atlassian Opsgenie provides alerting and issue analytics, and PagerDuty handles incident management with human scheduling and data analytics. Such tools help organizations derive detailed and meaningful metrics for incident post-mortems.