What enterprises learn from software failure incidents

Research from Etsy and IBM suggests we are learning the wrong lessons from software failure incidents and points toward how to get it right.

George Lawton

Published: 02 Feb 2018

Software failure incidents can deal a crippling blow to morale across management, developers, QA and ops.

No one enjoys failure, which might be why enterprises tend to quickly gloss over post-mortems in an effort to get back on track. But a better practice is taking the time to recalibrate our mental models of how the complex system underlying apps really works, said John Allspaw, co-founder of Adaptive Capacity Labs, speaking at the DevOps Enterprise Summit in San Francisco.

This insight comes from Allspaw's earlier work as CTO of online marketplace Etsy and from collaborative research with IBM, IEX and Ohio State University. The research team analyzed software failure incidents to understand how enterprises can learn from them and improve software quality.

"We have to take human performance seriously, and if we don't, we will continue to see brittle systems," Allspaw said. He advised looking at incidents beyond a normal post-mortem.

Reframe incidents for the C-suite as unplanned investments in enterprise survival. "They are hugely valuable opportunities to understand your systems and what competitive advantage you are not pursuing," he said. Since an incident burns money, time, reputation and staff morale, enterprises should work to maximize the ROI of this investment.

Different mental models

Modern software stacks are built from many moving pieces that are not directly visible to the employees who work with them. As a result, different roles within a company create different models of how these pieces fit together. These models are each useful, but also incomplete. Testing, monitoring and deployment tools each present a different representation of the system, so QA, ops and developers end up with different models of how it works in practice. No single tool or employee holds a complete, universal model of the complex system.

Software failure incidents give QA, developers and ops the opportunity to see where their models of the system are inaccurate so they can build more accurate ones. Everyone interacting with the code will partake in at least one of these activities:

observe
infer
anticipate
plan
troubleshoot
diagnose
correct
modify
react

How DevOps can break models

Even when an incomplete model of the software is good enough to get things done, the underlying systems are in a constant state of change. Over time, drift develops between someone's working model and the constantly evolving enterprise software. This drift builds more quickly as organizations change faster with DevOps practices.

When they take the time to thoroughly understand how and why software failure incidents occur, everyone in the organization can develop a more accurate model. This can help shape the design of new components, subsystems and architectures. "The incidents of yesterday inform the architectures of tomorrow," Allspaw said.

Incidents also inform new regulations, policies and constraints. A 2012 Knight Capital trading software failure that disrupted U.S. stock prices was among the events that led to the Securities and Exchange Commission's Regulation Systems Compliance and Integrity (Regulation SCI) rules. Payment Card Industry Data Security Standard rules arose when Mastercard and Visa realized they had lost $750 million due to fraud.

Decoding the secret message of failure

The things that get an IT organization's attention are shocks to the system -- reminders that the model is not perfect.

"If we think of incidents as encoded messages from below the line [of our visibility], then your job is to decode them," Allspaw said. "Incidents help gauge the delta between how the system works and how we think it works, and this is almost always greater than we imagine."

To evaluate software failure incidents in a useful way, think about how employees across the enterprise focus their attention and make decisions. Take the time to gather subjective data from across the organization. What did people do? Where did they look? Which approaches were more fruitful and which ones less so? Everyone can then compare their mental models and identify potential errors in them.
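One way to make that comparison concrete is to write each role's assumptions down as data and diff them against what the incident revealed. The Python sketch below is only an illustration under stated assumptions: the service names, the roles, the debrief data and the `model_delta` helper are all hypothetical and do not come from Allspaw's research.

```python
from typing import Dict, Set, Tuple

# A "model" here is a set of (caller, callee) dependency edges.
# This representation is an assumption made for illustration only.
Edge = Tuple[str, str]


def model_delta(assumed: Set[Edge], observed: Set[Edge]) -> Dict[str, Set[Edge]]:
    """Compare one person's assumed model with what was observed."""
    return {
        # Dependencies that exist but were absent from the mental model.
        "missing_from_model": observed - assumed,
        # Dependencies the person believed in that were not observed.
        "not_seen_in_practice": assumed - observed,
    }


# Hypothetical dependencies observed while working the incident.
observed = {("web", "api"), ("api", "db"), ("api", "cache"), ("cache", "db")}

# Hypothetical mental models gathered from each role in the debrief.
mental_models = {
    "dev": {("web", "api"), ("api", "db")},
    "ops": {("web", "api"), ("api", "db"), ("api", "cache"), ("api", "queue")},
}

deltas = {role: model_delta(m, observed) for role, m in mental_models.items()}

for role, delta in deltas.items():
    print(role, delta)
```

The result is a prompt for discussion in the debrief, not a scorecard: each gap points to a question worth asking the person who held that model.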

Allspaw cautioned that the goal is not to align everyone's models. "The representations are necessarily incomplete. The idea is not to have the same mental models because they are always changing and flawed." A common practice in post-mortems is to conduct the inquiry blamelessly. Allspaw said this is necessary, but not sufficient.

It won't be easy to do a deep inquiry immediately following an incident, since the entire team is exhausted. However, this is when the specific details are clearest. It's also a good practice to observe all the factors that went into mitigating the software failure incident. You never know when that one quiet employee in the corner will reveal a side of the complex system that others overlooked.

Management might never have realized the importance of that developer without an extensive incident debrief.
