How can Chaos Monkey testing help with microservices?

Resilience testing isn't just for infrastructure. Architects can adopt this disaster recovery testing strategy to build more reliable microservice applications.

Chaos Monkey is a popular open source tool developed by Netflix that takes reliability testing to new heights. The idea behind Chaos Monkey testing is to deliberately kill random nodes across the system at regular intervals to assess whether the system can survive despite these failures. The Chaos Monkey testing principle can help evaluate the reliability of microservice-based applications, but rather than intentionally kill nodes, architects should focus on the interruption of services.

As one service fails, other dependent services could stall or fail in a ripple effect. This delivers a bad user experience. There are many ways to deal with this issue, and when used in combination, they can help architects design more resilient systems.

Fallback services matter

Replicas and fallbacks are not just for infrastructure components, but for mission-critical services as well. Consider creating an alternate, bare-bones service that can take on the load if the default service fails. This is especially useful for services such as billing and transaction processing for e-commerce applications.

Setting up a fallback service could be as simple as routing traffic away from the failed service. Before flipping the Chaos Monkey testing switch, though, you need to ensure backup services for critical processes are ready to take over.

Building resilient infrastructure

With modern container stacks, it is now possible -- even easy -- to automatically restart failed containers and set up autoscaling container clusters. This automation helps ensure resiliency in the infrastructure layer.

At the networking level, shortening timeout limits will ensure services are quickly rerouted to a fallback service after a failure. This helps better optimize the system for performance.

Having persistent data storage for containers is also essential. When a container fails, its stored data is lost unless you configure persistent storage volumes. Persistent storage ensures that, even if a service fails, it can be easily resumed with the old, stored data.

When all efforts still result in lost data, a disaster recovery (DR) tool can ensure you always have access to your data in the event of mass failures. DR tools and services are worth purchasing if you need true data resilience.

Whether it's the service layer or the infrastructure layer, you can build more resilient applications by employing the Chaos Monkey testing principle in your microservice applications.

Next Steps

Choosing the right chaos engineering tools

Why contract testing can be essential for microservices

Dig Deeper on Application development and design

Software Quality
Cloud Computing