Resilient software strategies all developers should know

Failures are unavoidable. However, the right software design and development choices can help minimize their impact, isolate problems and speed recovery times.

Many architects strive to design application systems with the capabilities to avoid catastrophic failures. Unfortunately, in the real world, crash-inducing errors and overloads are inevitable.

To properly handle such failures, development teams must equip themselves with the right software resiliency practices. This is particularly important when pursuing design styles such as microservice-based architectures, where failures can spread across distributed components and cause widespread disruptions.

Various software resiliency techniques and mechanisms can help teams respond to errors, initiate recovery processes and maintain consistent application performance in the midst of failures. Let's examine five strategies architects can implement to address errors, minimize the effect of failures and maintain a resilient software architecture.

Create a dead letter queue

Individual communications can become stranded for any number of reasons, such as unavailable recipients, improperly formatted requests and missing data. This can be particularly problematic in event-driven architectures, where requests are often dropped into messaging queues to wait for processing while the requesting service moves on to the next operation. As such, unprocessed messages can quickly jam these queues.

A dead letter queue is a separate holding area for these stranded messages. Moving them there keeps them from cluttering communication channels and from consuming the resources needed to keep them in circulation. Development teams can set up a dead letter queue to identify stranded messages and isolate the failure. This enables architects to examine specific errors, as well as maintain a detailed historical record that can help guide future design choices.

Once these messages are considered obsolete, they can be removed from the dead letter queue at will. Alternatively, they can be resubmitted to either resume operations or replicate the error for debugging purposes.
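As a rough illustration of the mechanism described above, here is a minimal in-memory sketch. The class name `QueueWithDeadLetters`, the `max_attempts` limit and the `handle_order` handler are hypothetical names chosen for this example; real message brokers expose their own dead letter configuration, which will differ.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Message:
    body: dict
    attempts: int = 0
    last_error: str = ""


class QueueWithDeadLetters:
    """Hypothetical in-memory queue that sets repeatedly failing messages aside."""

    def __init__(self, max_attempts=3):
        self.max_attempts = max_attempts
        self.main = deque()
        self.dead_letters = deque()

    def publish(self, body):
        self.main.append(Message(body))

    def process_all(self, handler):
        # Drain the main queue; a message that keeps failing is moved to the
        # dead letter queue instead of circulating forever.
        while self.main:
            message = self.main.popleft()
            try:
                handler(message.body)
            except Exception as exc:
                message.attempts += 1
                message.last_error = repr(exc)  # kept for later inspection
                if message.attempts >= self.max_attempts:
                    self.dead_letters.append(message)
                else:
                    self.main.append(message)

    def resubmit_dead_letters(self):
        # Put dead letters back on the main queue, for example after a fix
        # or to reproduce the error while debugging.
        count = len(self.dead_letters)
        while self.dead_letters:
            message = self.dead_letters.popleft()
            message.attempts = 0
            self.main.append(message)
        return count


def handle_order(body):
    # Hypothetical handler that rejects improperly formatted requests.
    if "order_id" not in body:
        raise ValueError("missing order_id")


orders = QueueWithDeadLetters(max_attempts=3)
orders.publish({"order_id": 1})
orders.publish({"customer": "no id"})
orders.process_all(handle_order)
```

After `process_all` returns, the main queue is empty and the malformed message sits in `orders.dead_letters` along with its attempt count and last error, where it can be inspected, discarded or resubmitted.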

Use feature toggles for modifications

Software resiliency also depends on how a development team handles release cycles for feature updates. Rather than halt operations to add features and modify an application's capabilities, organizations can use the feature-toggle approach to keep applications up and running in the midst of rollouts and updates.

Feature toggles enable developers to incrementally modify applications while leaving existing production-level code intact. Techniques such as canary releases and A/B testing enable developers to roll out updated code to a limited number of instances while they keep the original code in production.

With the feature-toggle approach, teams can strategically configure releases by monitoring new-release instances and using a toggle-like mechanism for rollbacks should modifications cause breakages. In some cases, teams may be able to automatically trigger these rollback toggles if the system detects certain errors or performance inconsistencies.
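The following sketch shows one way a toggle with a limited rollout and an automatic rollback might look. Everything here is an assumption made for illustration: the `FeatureToggle` class, the percentage-based bucketing, the `error_threshold` parameter and the two `*_total` functions are hypothetical and are not taken from any particular feature-flag product.

```python
import hashlib


class FeatureToggle:
    """Hypothetical percentage rollout with rollback after repeated errors."""

    def __init__(self, name, rollout_percent, error_threshold=5):
        self.name = name
        self.rollout_percent = rollout_percent  # 0 = off, 100 = everyone
        self.error_threshold = error_threshold
        self.errors = 0

    def is_enabled(self, user_id):
        # Hash the user ID so each user lands in a stable bucket from 0 to 99.
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode()).hexdigest()
        return int(digest, 16) % 100 < self.rollout_percent

    def record_error(self):
        # Once enough errors are seen, switch the new code path off.
        self.errors += 1
        if self.errors >= self.error_threshold:
            self.rollout_percent = 0


def legacy_total(prices):
    # Existing production code, left untouched.
    return sum(prices)


def new_total(prices):
    # Hypothetical updated code path under rollout.
    return round(sum(prices), 2)


def checkout_total(prices, user_id, toggle):
    # Route the request to the new code only when the toggle allows it,
    # and fall back to the legacy code if the new path raises.
    if toggle.is_enabled(user_id):
        try:
            return new_total(prices)
        except Exception:
            toggle.record_error()
    return legacy_total(prices)
```

Because the bucket is derived from a hash of the user ID, a given user sees the same code path on every request for as long as the rollout percentage is unchanged, which makes monitoring the new-release group more straightforward.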

Basic resiliency design patterns

To maintain resilient software, development teams use certain design patterns that focus on containing failures and providing emergency countermeasures. A number of patterns provide these types of recovery mechanisms and stop errors from spreading uncontrollably from one distributed component to another. Here are some examples:

  • Bulkhead. This pattern isolates subsystems and configures individual modules to halt communication with other components upon failures, reducing the risk that the problem spreads.
  • Backpressure. The backpressure pattern automatically pushes back workload requests that exceed a preset throughput limit, protecting sensitive systems from overloads.
  • Circuit breaker. Building upon both the bulkhead and backpressure patterns, the circuit breaker provides a mechanism that automatically severs connections to problematic components. It will retry the connection at regular intervals to see if the error is resolved.
  • Batch-to-stream. Designed to manage batch processing throughput, this pattern modifies batch workloads and converts them into simplified OLTP transactions.
  • Graceful degradation. This design pattern essentially installs a fallback mechanism for all major components of an application. While it is mainly designed to help provide rollbacks for updates, it can also come in handy in the case of sudden failures.
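To make two of these patterns more concrete, the sketch below shows a simple circuit breaker and a bounded queue used for backpressure. It is a simplified illustration, not a production implementation: the `CircuitBreaker` class, its `failure_threshold` and `reset_timeout` parameters and the `submit` helper are hypothetical names, and the breaker here allows a single trial call after the timeout rather than retrying on its own schedule.

```python
import queue
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock        # injectable so the timeout can be tested
        self.failures = 0
        self.opened_at = None     # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is open; call skipped")
            # Timeout elapsed: let one trial call through (half-open state).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # Open on reaching the threshold, or reopen if the trial failed.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def submit(work_queue, item):
    # Backpressure: refuse new work once the bounded queue is full,
    # instead of letting the backlog grow without limit.
    try:
        work_queue.put_nowait(item)
        return True
    except queue.Full:
        return False
```

A caller would wrap each remote call in `breaker.call(...)` and treat `CircuitOpenError` as a signal to use a fallback, while a `False` return from `submit` tells the producer to slow down or retry later.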

Promote loose coupling between components

Traditional, monolithic applications tend to have rigid dependencies in tightly coupled architectures. As a result, a change or failure in one software component almost certainly affects another. In contrast, in distributed systems such as microservices, architects can minimize these dependencies by decoupling software components.

In a loosely coupled architecture, the dependencies that exist between application components, modules and services are kept to a minimum. Instead, abstractions handle necessary data transfers and messaging processes. As a result, updates or failures that befall one component are far less likely to cause unintentional changes to another. Decoupling isolates problems and prevents them from spreading to other parts of the system, limiting the risk of widespread errors.
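A small sketch can show what depending on an abstraction looks like in code. The names used here (`Notifier`, `EmailNotifier`, `OrderService`) are hypothetical and exist only for this example; the point is that the order logic knows only the interface, so the concrete notifier can be replaced or can fail without changes to the order code.

```python
from typing import Protocol


class Notifier(Protocol):
    """The abstraction: anything with a matching send() method qualifies."""

    def send(self, recipient: str, text: str) -> None: ...


class EmailNotifier:
    """One possible implementation; it records messages instead of sending."""

    def __init__(self):
        self.sent = []

    def send(self, recipient, text):
        self.sent.append((recipient, text))


class OrderService:
    """Depends only on the Notifier interface, not on a concrete class."""

    def __init__(self, notifier: Notifier):
        self.notifier = notifier

    def place_order(self, customer, item):
        self.notifier.send(customer, f"Order received: {item}")
        return {"customer": customer, "item": item, "status": "placed"}


notifier = EmailNotifier()
service = OrderService(notifier)
order = service.place_order("pat@example.com", "keyboard")
```

Swapping `EmailNotifier` for a different implementation, such as one that publishes to a message queue, would require no change to `OrderService`.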

Use sidecar containers to limit failures

A sidecar is a supporting container that runs in the same pod as the primary application container. The sidecar enables teams to add functionality to a container and integrate with external services without changing the main, existing application container instance.

For software resilience, this technique is beneficial in that the primary application logic and codebase remain isolated, limiting risks and failures. However, there are downsides to sidecars. For instance, adding sidecars means that developers must manage more containers and account for the extra resources they consume. Take care to ensure that sidecars won't complicate workloads to the point that application performance suffers. For starters, you'll want to establish a thorough container monitoring system that will track sidecars and measure their impact on the production-level containers they serve.
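As one simple way to measure that impact, the sketch below computes the share of a pod's CPU and memory consumed by its sidecars. The function name, the dictionary keys and the sample figures are all made up for illustration; in practice the numbers would come from whatever container monitoring system is in place.

```python
def sidecar_overhead(containers, primary_name):
    """Return the fraction of total CPU and memory used by non-primary containers."""
    totals = {"cpu_millicores": 0, "memory_mib": 0}
    sidecars = {"cpu_millicores": 0, "memory_mib": 0}
    for container in containers:
        for key in totals:
            totals[key] += container[key]
            if container["name"] != primary_name:
                sidecars[key] += container[key]
    # Guard against division by zero when no usage has been reported.
    return {
        key: (sidecars[key] / totals[key] if totals[key] else 0.0)
        for key in totals
    }


# Made-up sample readings for one pod: a primary container and one sidecar.
pod_usage = [
    {"name": "app", "cpu_millicores": 400, "memory_mib": 512},
    {"name": "log-forwarder", "cpu_millicores": 100, "memory_mib": 128},
]
overhead = sidecar_overhead(pod_usage, primary_name="app")
```

Tracking a ratio like this over time gives teams a concrete threshold to alert on if a sidecar starts to crowd out the application it supports.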
