Traditional testing methods largely get the job done for modern cloud and containerized applications, but there are some gaps that can leave teams open to potential problems. Chaos engineering has emerged as a process that allows teams to go beyond traditional testing methods and identify problems beneath the surface within applications in production.
Chaos engineering tools help automate the process of temporarily disabling or throttling specific infrastructure components to assess the effect on applications in production. Through chaos experiments, teams can turn off a given service or infrastructure component, monitor apps using various observability tools and troubleshoot known issues.
Automated tools -- like chaos engineering tools -- help teams experiment with potential problem areas that can lead to serious outages. Automation may also add an element of precision to the testing process that can identify the root cause of a problem. These tools can also make it easier to organize and manage a chaos engineering game day to help IT, operations and development teams practice how to coordinate a response to a significant outage.
Let's examine some popular chaos engineering tools and how teams can choose one that suits their needs.
How chaos engineering tools help
The first popular chaos engineering tool was Netflix's Chaos Monkey. The resiliency tool was crude, but it provided the bare components to run successful chaos experiments. Some IT organizations still use it. Several commercial and open source alternatives have since emerged, offering better controls, integrations with the latest platforms and more precise components for running chaos experiments.
Chaos engineering tools identify problems before they become significant issues, said Ryan Petrich, CTO at Capsule8, a Linux security service, who has experimented with many such tools. For example, chaos testing identified certificate-expiration problems with Capsule8's Transport Layer Security monitoring. Petrich and his team rectified the situation by updating the monitoring infrastructure to alert the team a week before certificates expire.
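The week-in-advance alert Petrich describes can be approximated with a small check. The sketch below is illustrative only and is not Capsule8's implementation: the function names and the `WARN_DAYS` constant are hypothetical, and it relies solely on Python's standard `ssl` and `socket` modules.

```python
import socket
import ssl
from datetime import datetime, timezone

WARN_DAYS = 7  # assumed threshold: alert one week before expiry


def fetch_not_after(host: str, port: int = 443) -> str:
    """Return the certificate's notAfter string, e.g. 'Jan  5 09:34:43 2030 GMT'.

    Note: an already expired certificate fails verification during the handshake,
    so this call raises instead of returning in that case.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Convert the notAfter string into days remaining relative to `now` (UTC)."""
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return (expires - now).total_seconds() / 86400


def should_alert(not_after: str, now: datetime, warn_days: int = WARN_DAYS) -> bool:
    """True when the certificate expires within the warning window."""
    return days_until_expiry(not_after, now) <= warn_days
```

A scheduled job could call `should_alert(fetch_not_after(host), datetime.now(timezone.utc))` for each monitored host and page the team when it returns true.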
In another instance, Capsule8 discovered through chaos testing that when various systems were taken down, not all of them would trigger a failover to another region.
Even though the tools can help, chaos engineering requires discipline and relatively hygienic ops practices to be successful, Petrich cautioned. "Once availability and resilience become business goals and it is assumed that measures are in place to ensure it, chaos engineering becomes an essential practice to challenge and validate assumptions," he said.
Even though Chaos Monkey is the oldest chaos engineering tool and hasn't evolved a lot, many developers still like the resiliency tool for its simplicity. Chaos Monkey -- and the related failure-injection tool Simian Army -- focuses on terminating virtual machine instances and replicating unpredictable production incidents. Jacek Zmudzinski, senior marketing specialist at the software consultancy Future Processing, likes both aspects.
"It has been great for creating a mindset for disaster preparation," he said. The only downside of the tool is its lack of a recovery or rollback mechanism. While Zmudzinski believes that its simplicity may make it a good option for smaller companies and startups, these companies will need a well-oiled rollback mechanism of their own to address any problems.
One of the first chaos engineering tools a team might start with is Chaos Toolkit. The open source project comes with a suite of standard chaos experiments, offers good introductory documentation and supports most major cloud providers.
The Chaos Toolkit establishes a declarative API that makes it easy to define chaos experiments as code, keep them in a version control system and automate them through a CI/CD system. It includes drivers for Kubernetes, AWS, Google Cloud, Azure and other chaos engineering tools, such as Gremlin.
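To give a sense of what a declarative experiment looks like, the following sketch builds a minimal Chaos Toolkit experiment document in Python and prints it as JSON. The service URL and pod label are hypothetical placeholders, and the action simply shells out to `kubectl` through the generic process provider rather than using one of the dedicated drivers, so treat it as an assumption-laden starting point rather than a recommended setup.

```python
import json


def build_experiment(service_url: str, pod_label: str) -> dict:
    """Build a minimal experiment: check health, delete matching pods, check again."""
    return {
        "version": "1.0.0",
        "title": "Service stays available when its pods are deleted",
        "description": "Illustrative experiment; URL and label are placeholders.",
        # The hypothesis is verified before and after the method runs.
        "steady-state-hypothesis": {
            "title": "Service responds with HTTP 200",
            "probes": [
                {
                    "type": "probe",
                    "name": "service-is-healthy",
                    "tolerance": 200,  # expected HTTP status code
                    "provider": {"type": "http", "url": service_url, "timeout": 3},
                }
            ],
        },
        # The method lists the faults to inject.
        "method": [
            {
                "type": "action",
                "name": "delete-matching-pods",
                "provider": {
                    "type": "process",
                    "path": "kubectl",
                    "arguments": f"delete pod --selector={pod_label} --wait=false",
                },
                "pauses": {"after": 10},  # give the platform time to reschedule
            }
        ],
        "rollbacks": [],
    }


if __name__ == "__main__":
    # Hypothetical values; replace with a real health endpoint and label.
    print(json.dumps(build_experiment("http://localhost:8080/health", "app=demo"), indent=2))
```

Saved to a file such as `experiment.json` and committed alongside application code, a document like this can be executed with `chaos run experiment.json`, for example from a CI/CD job.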
The open source Chaos Mesh fits into a development workflow and is easy to integrate into Kubernetes infrastructure without any changes to deployment logic, according to Nabil Mounem, founder of Have Websites, a website tools comparison service. Mounem also likes the use of CustomResourceDefinitions to define chaos objects. The API makes it easy to version, manage and automate chaos experiments. Chaos Mesh also includes a dashboard to keep track of experiments.
Another benefit of Chaos Mesh, Mounem said, is the ease with which his team can inject faults across various layers of a Kubernetes environment, at the pod, network, system I/O and kernel levels. It can also add latency, interfere with communications or mimic read/write errors. The tool works with all the major cloud platforms. Recently, the Cloud Native Computing Foundation approved Chaos Mesh to be part of its Sandbox program. Mounem believes Chaos Mesh has helped Have Websites build a more resilient application.
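Because Chaos Mesh experiments are Kubernetes custom resources, they can be generated and versioned like any other manifest. The sketch below builds two such objects as plain Python dictionaries, one that kills a pod and one that adds network latency. The resource names, namespace and labels are hypothetical, and the field layout follows the `chaos-mesh.org/v1alpha1` API as commonly documented, so it is worth checking against the installed Chaos Mesh version before use.

```python
import json

API_VERSION = "chaos-mesh.org/v1alpha1"


def _selector(namespace: str, labels: dict) -> dict:
    """Target pods by namespace and label."""
    return {"namespaces": [namespace], "labelSelectors": labels}


def pod_kill(name: str, namespace: str, labels: dict) -> dict:
    """PodChaos object that kills one randomly chosen matching pod."""
    return {
        "apiVersion": API_VERSION,
        "kind": "PodChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "pod-kill",
            "mode": "one",
            "selector": _selector(namespace, labels),
        },
    }


def network_delay(name: str, namespace: str, labels: dict,
                  latency: str = "100ms", duration: str = "30s") -> dict:
    """NetworkChaos object that delays traffic for all matching pods."""
    return {
        "apiVersion": API_VERSION,
        "kind": "NetworkChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "delay",
            "mode": "all",
            "selector": _selector(namespace, labels),
            "delay": {"latency": latency},
            "duration": duration,
        },
    }


if __name__ == "__main__":
    # Hypothetical target: pods labeled app=demo in the "staging" namespace.
    for manifest in (pod_kill("kill-demo-pod", "staging", {"app": "demo"}),
                     network_delay("slow-demo-net", "staging", {"app": "demo"})):
        print(json.dumps(manifest, indent=2))
```

Since `kubectl apply -f` accepts JSON as well as YAML, each printed manifest can be written to its own file and applied directly or stored in version control.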
Because chaos engineering tools can identify and help troubleshoot vulnerabilities and potential security issues, Eric Florence -- cybersecurity analyst at SecurityTech, a security consultancy -- recently began exploring them. Florence uses the vendor tool Gremlin because it offers plenty of failure scenarios that a large, complex digital infrastructure might experience at some point, such as CPU attacks or traffic strain on systems.
Another Gremlin perk is that the platform can autodetect infrastructure components and make experiment recommendations to identify common failure modes. The tool can also cut off experiments automatically when systems become unstable.
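The idea of cutting off an experiment when systems become unstable can be illustrated with a generic guard loop. This is not Gremlin's API; `run_with_halt`, its parameters and the 5% error-rate threshold are all hypothetical, and the fault injection and rollback steps are passed in as plain callables.

```python
from typing import Callable, Iterable


def run_with_halt(inject: Callable[[], None],
                  rollback: Callable[[], None],
                  error_rates: Iterable[float],
                  max_error_rate: float = 0.05) -> str:
    """Inject a fault, watch a stream of error-rate samples, and stop early if unstable.

    Returns "halted" if any sample exceeds max_error_rate, otherwise "completed".
    The rollback always runs, even if monitoring raises an exception.
    """
    inject()
    try:
        for rate in error_rates:
            if rate > max_error_rate:
                return "halted"  # abort condition met; stop observing
        return "completed"
    finally:
        rollback()  # always restore the system


if __name__ == "__main__":
    events = []
    outcome = run_with_halt(
        inject=lambda: events.append("fault injected"),
        rollback=lambda: events.append("fault removed"),
        error_rates=[0.01, 0.02, 0.09, 0.01],  # sample metrics; third one breaches 5%
    )
    print(outcome, events)
```

In practice the error-rate samples would come from an observability tool, and the threshold would reflect whatever steady state the team has agreed to protect.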
Gremlin includes native integrations for Kubernetes, AWS, Azure, Google Cloud and even bare-metal infrastructure.
It's still early for chaos engineering
"None of the available tools are complete in their ability to test and experiment with a comprehensive suite of failures that are likely to occur in real environments," Capsule8's Petrich said.
Outside of these tools, organizations can consider various commercial options that may be a better fit or offer vendor help on their chaos journey, Petrich suggested. Some of these options include tools from Steadybit or Verica, although he hasn't directly used them. Some tools focus on a particular type of failure, so the best approach will usually involve multiple tools and extending them with custom experiments based on assessing an organization's operational risks.
The tools are also limited by the APIs they support and the types of scenarios a team might want to simulate.
"These toolkits are by no means an actual solution to your faults," Florence said, "as they simply highlight them when these scenarios happen and can, at the least, allow you to be ready for when it does happen."
Most of these chaos tools are best suited to testing much larger and more complex systems that span vast regions, rather than the systems of small-scale businesses, Florence said. At the same time, he advises smaller enterprises to run some basic tests to understand what damages their infrastructure and to improve their response to an actual problem.