Why security chaos engineering works, and how to do it right

While 'chaos' doesn't sound like something software security managers would want, chaos engineering has an enticing amount of value when it comes to identifying potential threats.

Tom Nolle, Andover Intel

Published: 22 Aug 2022

For many, security isn't the first thing that comes to mind when they hear about chaos engineering. It's likely that even fewer would consider it as a fundamental security practice on par with things like network firewall configuration, identity management and intrusion detection.

However, the growing complexity of modern software security layers, alongside increasingly modular and distributed architectures, has reached a point where the risk of failure justifies chaos engineering as a security tool. As such, chaos engineering may well become not just a routine part of security management, but an essential one.

Let's examine the reasons chaos engineering is gaining traction as a security management method, detail the way it's applied in security scenarios and review some of the best practices to follow when putting it into practice -- including some of the common pitfalls to avoid.

The role of chaos engineering in security

Chaos engineering is a broad term that describes the act of performing complex systems tests by injecting failures before an application encounters them in normal operations, monitoring the outcomes and documenting the right course of action. The concept of chaos engineering is often applied to operational hardware -- including networks and server pools -- as well as software development and product testing.

Chaos might not sound like the sort of thing a security specialist or compliance team would want to cultivate within their software systems. The goal of chaos engineering, however, is to prevent chaos by identifying inconspicuous problems and potential failures before they occur in production. And, as the practice matures, chaos engineering is garnering more attention in the field of application security.

By performing chaos engineering on the security layers directly, security specialists gain an opportunity to broaden the range of situations and attack vectors they are capable of simulating. Additionally, it allows them to test how the relationships among the multiple layers and features affect the impact of a given failure. Over time, this will reveal areas where security layers fail to create an effective barrier against attacks and intrusions.

Applying chaos to security: Injections and monitoring

Chaos engineering testing for security is a matter of balancing two layers. One layer handles the injection of faults; the other is where the monitoring and resolution processes take place. For example, one layer will inject test data to simulate unauthorized access attempts. The other layer will identify issues by watching for signals of security breaches, allowing security teams to locate gaps in access controls.
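
The two layers can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `AccessAttempt` record, the `access_control` stand-in for the system under test and the deliberately planted gap for the `svc-backup` account are hypothetical, and a real test would call the actual access control rather than a local function.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AccessAttempt:
    """One injected, simulated access attempt (hypothetical record shape)."""
    user: str
    resource: str
    should_be_allowed: bool
    injected_at: datetime


def access_control(user: str, resource: str) -> bool:
    """Stand-in for the system under test.

    The svc-backup entry is a deliberately planted gap so the sketch
    has something to find.
    """
    allowed = {("alice", "/reports"), ("svc-backup", "/admin")}
    return (user, resource) in allowed


def inject_attempts() -> list:
    """Injection layer: build simulated attempts, including unauthorized ones."""
    now = datetime.now(timezone.utc)
    return [
        AccessAttempt("alice", "/reports", True, now),
        AccessAttempt("mallory", "/admin", False, now),
        AccessAttempt("svc-backup", "/admin", False, now),
    ]


def monitor(attempts):
    """Monitoring layer: flag every attempt whose outcome differs from policy."""
    return [
        a for a in attempts
        if access_control(a.user, a.resource) != a.should_be_allowed
    ]


gaps = monitor(inject_attempts())
for gap in gaps:
    print(f"{gap.injected_at.isoformat()} gap: {gap.user} -> {gap.resource}")
```

Keeping the expected outcome on each injected record is one possible design choice: it lets the monitoring layer report a gap without knowing how the access control is implemented.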

If those injections induce a failure or reveal a hole in any existing security barriers, the monitoring process should identify the exact point and time where the problem or breach occurred. Logs and monitoring data from the application infrastructure side, along with the log of injected security-based faults, will also help correlate any problems related to infrastructure that may pose a security threat.

While it's possible to apply chaos engineering to security and infrastructure separately, this would likely be a mistake. Security breaches can result not only from unexpected events linked to security or threat-prevention tools, but also from events in IT infrastructure. For instance, faults in infrastructure often trigger systems to run in a "failure mode" that may not break functional elements, but may instead provide a potentially unwanted bypass around certain security elements to allow for fixes.
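
As a rough illustration of such a failure mode, the Python sketch below injects an infrastructure fault (an unreachable policy service) and checks whether an unauthorized user gets through. The policy service, the `fail_open` switch and the user names are all hypothetical; the point is only that the bypass appears when the infrastructure fault and the security check are tested together.

```python
class PolicyServiceDown(Exception):
    """Raised when the (hypothetical) policy service cannot be reached."""


def make_policy_lookup(healthy: bool):
    """Build a lookup function for a healthy or a faulted policy service."""
    def lookup(user: str) -> bool:
        if not healthy:
            raise PolicyServiceDown("policy service unreachable")
        return user == "alice"
    return lookup


def is_authorized(user: str, lookup, fail_open: bool) -> bool:
    """Authorization check with a configurable failure mode."""
    try:
        return lookup(user)
    except PolicyServiceDown:
        # Failure mode: fail-open keeps the application working,
        # but it skips the security check entirely.
        return fail_open


def bypass_found(fail_open: bool) -> bool:
    """Inject the infrastructure fault, then try an unauthorized user."""
    broken_lookup = make_policy_lookup(healthy=False)
    return is_authorized("mallory", broken_lookup, fail_open)


print("fail-open bypass:", bypass_found(fail_open=True))
print("fail-closed bypass:", bypass_found(fail_open=False))
```

With a healthy policy service the same unauthorized user is rejected in both configurations, which is why a security-only test would not surface the problem.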

4 tips for a chaos engineering security plan

When conducting chaos-style testing, it's important not to fall into the trap of focusing only on common, predictable problems. Instead, try to shift focus toward problems that are unlikely but still possible.

In fact, chaos engineering naturally demands testing faults that would be introduced because of both human errors and system failures. Because the goal is to create "chaos," constraining it to predictable behavior contradicts the goal. As such, test injections that introduce high levels of random faults are typically the most effective.
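
One way to mix human errors and system failures at random is sketched below in Python. The fault catalog and the `plan_experiment` function are hypothetical examples, not part of any particular chaos engineering tool. The sketch assumes the team records the seed for each run, so the injection schedule itself can be regenerated later even though the system's response to it may differ.

```python
import random

# Hypothetical fault catalog mixing human errors and system failures.
FAULTS = [
    ("human", "firewall rule opened to 0.0.0.0/0"),
    ("human", "expired certificate left in place"),
    ("human", "service account granted admin role"),
    ("system", "identity provider unreachable"),
    ("system", "log forwarder crashed"),
    ("system", "intrusion detection sensor dropped packets"),
]


def plan_experiment(seed: int, rounds: int, max_faults_per_round: int = 3):
    """Return a random schedule of fault combinations, one list per round."""
    rng = random.Random(seed)  # isolated generator; the seed is kept for replay
    schedule = []
    for _ in range(rounds):
        count = rng.randint(1, max_faults_per_round)
        # sample() picks distinct faults, so a round never repeats a fault
        schedule.append(rng.sample(FAULTS, count))
    return schedule


for number, faults in enumerate(plan_experiment(seed=7, rounds=3), start=1):
    print(f"round {number}: {[name for _, name in faults]}")
```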

Monitoring is the other key element of chaos engineering, especially when it comes to security validation. The sheer volume of test data and possible event combinations and interactions makes it very unlikely that most faults can be replicated. This highlights the critical importance of data: If all the information needed to identify and remedy a problem isn't gathered during routine testing, the entire process will waste a lot of time and money.

Logs and telemetry from both infrastructure and applications are a big part of meeting this requirement, as is accurate information regarding the injected events. Precise and synchronized timestamps are particularly critical because without them, there's no way to reliably document the relationships between certain causes and effects. It's the connection between chaotic events and bad outcomes that makes chaos engineering worthwhile, and that connection is easy to lose when there's a lapse in exact time records.
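
The role of timestamps can be shown with a small Python sketch that pairs each security alert with the injected faults that preceded it. The sample records, the 30-second window and the function names are assumptions made up for illustration. The sketch normalizes every timestamp to UTC and rejects any timestamp without an offset, since a record with an ambiguous time zone can't be correlated reliably.

```python
from datetime import datetime, timedelta, timezone


def parse_utc(stamp: str) -> datetime:
    """Parse an ISO 8601 timestamp with an offset and normalize it to UTC."""
    parsed = datetime.fromisoformat(stamp)
    if parsed.tzinfo is None:
        raise ValueError(f"timestamp has no UTC offset: {stamp}")
    return parsed.astimezone(timezone.utc)


def correlate(injections, alerts, window=timedelta(seconds=30)):
    """Pair each alert with injections that occurred at most `window` before it."""
    pairs = []
    for alert_time, alert in alerts:
        at = parse_utc(alert_time)
        causes = [
            name for stamp, name in injections
            if timedelta(0) <= at - parse_utc(stamp) <= window
        ]
        pairs.append((alert, causes))
    return pairs


# Made-up sample records; the first alert was logged in a +02:00 time zone.
injections = [
    ("2022-08-22T10:00:00+00:00", "identity provider unreachable"),
    ("2022-08-22T10:05:00+00:00", "log forwarder crashed"),
]
alerts = [
    ("2022-08-22T12:00:12+02:00", "login accepted without MFA"),
    ("2022-08-22T10:20:00+00:00", "unexpected admin session"),
]

result = correlate(injections, alerts)
for alert, causes in result:
    print(alert, "<-", causes or "no injected fault in window")
```

An alert with no injected fault in its window, like the second one here, is worth a separate look: it may be a delayed effect, or it may have nothing to do with the test.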

The final key element of chaos engineering revolves around the individuals responsible for it. Security staff can't conduct chaos engineering reviews effectively in isolation, because they probably can't accurately recreate the underlying system faults that trigger unexpected failure modes and open a bypass around certain security measures.

To be effective, chaos engineering requires a cooperative effort between operations personnel and security teams. It's important to establish this cooperative model from the outset when implementing such a program, and equally important to carry it through the resulting test design, execution and evaluation processes.
