How can Chaos Monkey testing help with microservices?

Resilience testing isn't just for infrastructure. Architects can adopt this disaster recovery testing strategy to build more reliable microservice applications.

Twain Taylor, Twain Taylor Consulting

Published: 21 Feb 2018

Chaos Monkey is a popular open source tool developed by Netflix that tests reliability by causing failures on purpose. The idea behind Chaos Monkey testing is to deliberately kill random nodes across the system at regular intervals to assess whether the system can survive these failures. The same principle can help evaluate the reliability of microservice-based applications, but rather than intentionally killing nodes, architects should focus on interrupting services.

As one service fails, other dependent services could stall or fail in a ripple effect. This delivers a bad user experience. There are many ways to deal with this issue, and when used in combination, they can help architects design more resilient systems.
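To make the idea of interrupting services concrete, here is a minimal sketch in Python. It is not how Chaos Monkey itself works. The wrapper `with_chaos`, the exception `ServiceInterrupted`, and the `get_price` service are all hypothetical names invented for illustration. The wrapper fails a configurable fraction of calls so you can observe how callers cope.

```python
import random


class ServiceInterrupted(Exception):
    """Raised when the chaos wrapper deliberately fails a call."""


def with_chaos(service_call, failure_rate, rng=None):
    """Wrap a callable so that roughly failure_rate of its calls fail on purpose."""
    if rng is None:
        rng = random.Random()

    def wrapped(*args, **kwargs):
        # rng.random() returns a float in [0.0, 1.0).
        if rng.random() < failure_rate:
            raise ServiceInterrupted(f"injected failure in {service_call.__name__}")
        return service_call(*args, **kwargs)

    return wrapped


def get_price(item_id):
    # Hypothetical downstream service; stands in for a real network call.
    return {"item_id": item_id, "price_cents": 1999}


def run_experiment(call, attempts):
    """Return (succeeded, interrupted) counts over a number of attempts."""
    succeeded = interrupted = 0
    for item_id in range(attempts):
        try:
            call(item_id)
            succeeded += 1
        except ServiceInterrupted:
            interrupted += 1
    return succeeded, interrupted
```

In a real test, the interesting measurement is not the failure count itself but what the dependent services do when a call is interrupted: whether they stall, fail in turn, or recover.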

Fallback services matter

Replicas and fallbacks are not just for infrastructure components, but for mission-critical services as well. Consider creating an alternate, bare-bones service that can take on the load if the default service fails. This is especially useful for services such as billing and transaction processing for e-commerce applications.

Setting up a fallback service could be as simple as routing traffic away from the failed service. Before flipping the Chaos Monkey testing switch, though, you need to ensure backup services for critical processes are ready to take over.
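The routing described above can be sketched in a few lines of Python. This assumes the caller does the routing itself, which is only one option; a load balancer or service mesh could do the same job. The names `primary_billing`, `fallback_billing`, `call_with_fallback` and `ServiceUnavailable` are hypothetical.

```python
class ServiceUnavailable(Exception):
    """Signals that a service could not handle the request."""


def primary_billing(order_total_cents):
    # Hypothetical primary service, hard-coded to fail to simulate an outage.
    raise ServiceUnavailable("primary billing is down")


def fallback_billing(order_total_cents):
    # Bare-bones alternative: accept the order and queue it for later processing.
    return {"status": "queued", "amount_cents": order_total_cents}


def call_with_fallback(primary, fallback, *args):
    """Try the primary service; route to the fallback if it is unavailable.

    Returns a (result, source) tuple so callers can see which path was used.
    """
    try:
        return primary(*args), "primary"
    except ServiceUnavailable:
        return fallback(*args), "fallback"
```

Returning the source alongside the result is a design choice for this sketch: it makes it easy to log or alert on how often the fallback path is taken during a test.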

Building resilient infrastructure

With modern container stacks, it is now possible -- even easy -- to automatically restart failed containers and set up autoscaling container clusters. This automation helps ensure resiliency in the infrastructure layer.

At the networking level, shorter timeout limits let callers detect a failed service sooner, so requests are rerouted to a fallback service quickly instead of waiting on a service that will not respond. This keeps response times predictable during a failure.
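As a rough illustration of a timeout paired with a fallback, the Python sketch below uses the standard library's `concurrent.futures`. The services `slow_inventory` and `cached_inventory`, the helper `call_with_timeout`, and the timeout values are assumptions made up for the example. In practice the timeout is more often set on the HTTP client, proxy or service mesh.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Shared worker pool for outbound calls (the size is an arbitrary choice here).
_pool = ThreadPoolExecutor(max_workers=4)


def slow_inventory(item_id):
    # Hypothetical primary service that responds slowly.
    time.sleep(0.5)
    return {"item_id": item_id, "in_stock": True, "source": "primary"}


def cached_inventory(item_id):
    # Hypothetical fallback that answers immediately with less detail.
    return {"item_id": item_id, "in_stock": None, "source": "fallback"}


def call_with_timeout(primary, fallback, timeout_s, *args):
    """Wait at most timeout_s seconds for the primary, then use the fallback."""
    future = _pool.submit(primary, *args)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # cancel() cannot stop a call that is already running; the worker
        # thread finishes in the background and its result is discarded.
        future.cancel()
        return fallback(*args)
```

With a 0.05-second limit, the caller gets the fallback answer almost immediately rather than waiting the full half second. The tradeoff is that a limit set too low will reroute requests the primary service could have answered.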

Having persistent data storage for containers is also essential. Data written inside a container is lost when that container is removed and replaced, unless you configure persistent storage volumes. With persistent storage, a restarted service can resume from the data it had stored before the failure.
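From the application's side, resuming from stored data might look like the Python sketch below. It assumes the directory passed as `state_dir` sits on a persistent volume mounted into the container; the functions `save_state` and `load_state` and the file name `state.json` are hypothetical.

```python
import json
import os
import tempfile


def save_state(state_dir, state):
    """Write state to state_dir so that a crash never leaves a half-written file."""
    os.makedirs(state_dir, exist_ok=True)
    final_path = os.path.join(state_dir, "state.json")
    # Write to a temporary file in the same directory, then swap it into place.
    fd, tmp_path = tempfile.mkstemp(dir=state_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, final_path)


def load_state(state_dir, default):
    """Return the last saved state, or default if nothing has been saved yet."""
    path = os.path.join(state_dir, "state.json")
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default
```

A service would call `load_state` once at startup and `save_state` after each unit of work, so a replacement container picks up where the failed one left off.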

When data is lost despite all these efforts, a disaster recovery (DR) tool can restore it from backups, so you still have access to your data after mass failures. DR tools and services are worth purchasing if you need true data resilience.

Whether it's the service layer or the infrastructure layer, you can build more resilient applications by employing the Chaos Monkey testing principle in your microservice applications.

