Software resilience engineering helps teams quash chaos

What does a nuclear power plant or government team of epidemiologists have to teach software engineers? Quite a lot, if your aim is not just better apps, but more resilient ones.

George Lawton

Published: 26 Feb 2019

Enterprise adoption of software resilience engineering is on the rise as a strategy to improve the quality of complex applications. To some extent, chaos engineering drives this increased acceptance, as it can reduce the effect of cloud service outages, network attacks and microservices failures in large distributed apps.

The differences between chaos engineering and resilience engineering can seem like an exercise in semantic juggling -- even Netflix, pioneer of open source tools like Chaos Monkey and Chaos Kong, recently made an effort to focus on the latter term. But they're not exactly synonymous. In the long run, software resilience engineering enables an organization to go beyond the technical details of a failure and improve the team's response to it.

Here's how to get started and enable more resilient software and culture.

Chaos ties to the system

In many ways, chaos engineering is a subset of resilience engineering, just as testing is a subset of software quality. With chaos engineering, teams attempt to figure out how the failure of seemingly isolated components can affect mission-critical systems -- typically, through experiments that simulate disruptive conditions.

Chaos engineering focuses on the specific technical influence of a component's failure in a complex software application. Engineers characterize the system's normal behavior based on measurements of metrics, like throughput, error rates and latency. Then, they deliberately apply some perturbation -- a server crash, domain name system outage, malformed response or traffic spike -- to one or more aspects of this system. The engineers measure how this disruption impacts system behavior -- the smaller the difference, the more resilient the system.
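The comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not a real tool: the `Metrics` class, the units and the 10% tolerance are all assumptions made for the example, and the numbers are invented.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Metrics:
    """Snapshot of the three metrics named in the text (units are assumptions)."""
    throughput_rps: float  # requests per second
    error_rate: float      # fraction of failed requests, 0.0 to 1.0
    latency_ms: float      # for example, a 95th-percentile latency


def relative_deviation(baseline: Metrics, perturbed: Metrics) -> Dict[str, float]:
    """Return how far each metric moved, relative to its baseline value."""
    def rel(before: float, after: float) -> float:
        # Guard against a zero baseline, such as an error rate of exactly 0.
        return abs(after - before) / before if before else abs(after)

    return {
        "throughput": rel(baseline.throughput_rps, perturbed.throughput_rps),
        "error_rate": rel(baseline.error_rate, perturbed.error_rate),
        "latency": rel(baseline.latency_ms, perturbed.latency_ms),
    }


def within_tolerance(deviation: Dict[str, float], tolerance: float = 0.10) -> bool:
    """True if no metric moved by more than the (hypothetical) tolerance."""
    return all(change <= tolerance for change in deviation.values())


# Invented measurements taken before and during a simulated server crash.
baseline = Metrics(throughput_rps=1000.0, error_rate=0.01, latency_ms=120.0)
perturbed = Metrics(throughput_rps=950.0, error_rate=0.03, latency_ms=180.0)
deviation = relative_deviation(baseline, perturbed)
print(deviation, within_tolerance(deviation))
```

In this made-up run, throughput barely moves, but the error rate and latency shift well beyond the tolerance, so the system would not count as resilient to that particular perturbation.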

Engineers can work with chaos engineering tools to gradually inject live systems with controlled amounts of failure, which they can immediately roll back if a significant problem occurs. While this kind of testing can be done in a test environment, that safer setup might not capture all the live system dependencies.
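A rough sketch of that gradual, reversible approach follows. It does not use the API of any real chaos engineering tool: `observe_error_rate` is a hypothetical hook standing in for whatever monitoring a team has, and `simulated_system` is a toy model invented for the example.

```python
from typing import Callable, List, Tuple


def run_gradual_injection(
    steps: List[float],
    observe_error_rate: Callable[[float], float],
    abort_threshold: float,
) -> Tuple[float, bool]:
    """Raise the injected failure fraction one step at a time.

    Returns (last_safe_fraction, rolled_back). `observe_error_rate` is a
    hypothetical hook that reports the system's error rate while the given
    fraction of requests is being disrupted.
    """
    last_safe = 0.0
    for fraction in steps:
        error_rate = observe_error_rate(fraction)
        if error_rate > abort_threshold:
            # Significant problem: stop escalating and roll the injection back.
            return last_safe, True
        last_safe = fraction
    return last_safe, False


def simulated_system(fraction: float) -> float:
    """Toy model: retries absorb failures up to 5% injection, then errors leak."""
    return 0.01 if fraction <= 0.05 else 0.01 + (fraction - 0.05) * 2


result = run_gradual_injection(
    [0.01, 0.05, 0.10, 0.25], simulated_system, abort_threshold=0.05
)
print(result)
```

The point of the step list is that the blast radius grows only after the previous level has been shown to be tolerable, so the experiment ends at the first sign of real trouble rather than at full scale.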

Chaos engineering can also alert engineers to specific aspects of the system that are susceptible to problems. For example, an exponential increase in latency that occurs during a linear increase in customer orders might point to database scaling issues. Engineers could then make the database or the systems that interact with it more scalable.
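One simple way to spot that pattern is to raise load in equal steps and check whether each latency increase is larger than the one before it. The sketch below is a heuristic under stated assumptions: the function name, the 1.2 acceleration factor and the sample latencies are invented, and the check flags any accelerating growth, of which exponential growth is one case.

```python
from typing import Sequence


def grows_superlinearly(latencies_ms: Sequence[float], min_acceleration: float = 1.2) -> bool:
    """Flag accelerating latency growth across evenly spaced load levels.

    Linear growth adds roughly the same latency at every step. Exponential
    (or other superlinear) growth adds more at each step than at the last.
    """
    if len(latencies_ms) < 3:
        raise ValueError("need at least three measurements")
    increments = [b - a for a, b in zip(latencies_ms, latencies_ms[1:])]
    if any(step <= 0 for step in increments):
        return False  # latency is flat or falling somewhere, so not accelerating
    return all(
        later / earlier >= min_acceleration
        for earlier, later in zip(increments, increments[1:])
    )


# Invented latencies measured at 100, 200, 300 and 400 orders per minute.
steady = grows_superlinearly([100, 120, 140, 160])
runaway = grows_superlinearly([100, 200, 400, 800])
print(steady, runaway)
```

A result like the second one would be a prompt to look at the database and the services around it, not proof of where the bottleneck sits.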

Respond with resilience

Software resilience engineering includes all these chaos engineering details, but it also looks at the bigger picture. Resilience engineering attempts to address issues like how the organization responds to complex failures, how failure modes affect business value and how organizations can create a culture of quality.

Resilience engineering has a history in industrial settings, applied in everything from nuclear power plant operations to massive public safety exercises. Resilience engineering terminology might be unfamiliar to software quality engineers: Safety culture in resilience engineering equates to quality culture in software engineering, while accidents relate to software failures or incidents.

The limit of chaos engineering's technical approach is that it is impractical to engineer tests around every potential failure mode. For example, if Amazon Simple Storage Service goes down in one data center for an hour, the incident might cause only a temporary inconvenience, or it might corrupt the accounting database and create significant business problems.

Software resilience engineering, as a practice, helps guide the organizational response to these types of complex failures. Chaos engineering shows teams what different failure modes look like in practice. Resilience engineering helps improve communication and problem-solving skills to rapidly address the root causes of those problems and improve the ability to conduct blameless post-mortems.

Prioritize the response

Some safety engineers advocate bringing more resilience engineering practices from industrial and public safety into software development. Sidney Dekker, a professor and author of books on human safety, suggested that enterprises can benefit when they shift thinking from error and defect reduction to fostering positive capacities within teams to deal with unexpected problems.

Through resilience engineering, teams can test how well they respond to failures, such as when clocks reset on servers, subsystems shut down and events like distributed denial-of-service attacks occur. All these issues might also be simulated as chaos engineering tactics to ferret out bugs and limitations within the code, complementing the team's work on its response practices. Teams can measure resilience metrics, such as team response time, engagement of team members and team member feedback, and use the results to make the response process go more smoothly.
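Response metrics of this kind can be tracked with very little machinery. The following sketch assumes a team records a few timestamps and a head count for each failure drill; the `Drill` record, its field names and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean
from typing import Dict, List


@dataclass
class Drill:
    started: datetime       # when the failure was injected
    acknowledged: datetime  # when someone on the team first responded
    resolved: datetime      # when service was restored
    responders: int         # how many team members took part


def minutes_between(earlier: datetime, later: datetime) -> float:
    return (later - earlier).total_seconds() / 60


def summarize(drills: List[Drill]) -> Dict[str, float]:
    """Average the response metrics across a set of drills."""
    return {
        "mean_minutes_to_acknowledge": mean(
            minutes_between(d.started, d.acknowledged) for d in drills
        ),
        "mean_minutes_to_resolve": mean(
            minutes_between(d.started, d.resolved) for d in drills
        ),
        "mean_responders": mean(d.responders for d in drills),
    }


# Invented records from two drills.
drills = [
    Drill(datetime(2019, 2, 1, 10, 0), datetime(2019, 2, 1, 10, 4),
          datetime(2019, 2, 1, 10, 30), responders=3),
    Drill(datetime(2019, 2, 8, 14, 0), datetime(2019, 2, 8, 14, 10),
          datetime(2019, 2, 8, 15, 0), responders=5),
]
summary = summarize(drills)
print(summary)
```

Numbers like these say nothing about why a response was slow, so they work best alongside the team member feedback mentioned above.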

Think of it like this: Chaos engineering focuses on improving the software, while resilience engineering makes chaos the starting point to focus on the response and all that it facilitates -- better team communication, culture and collaboration.
