Many poorly written software specifications say something along the lines of "errors will be handled appropriately" in the programming requirements. Unfortunately, this vague stipulation rarely says how, exactly, developers should address those errors, if it lists the potential errors at all.
The problem escalates once event-driven systems enter the mix, because a system can drop messages into a queue at will without assuming responsibility for what happens next. That makes it much harder not only to trace errors, but also to prevent cascading failures. Any specification for an event-driven system therefore needs to spell out the failures specific to event-driven methods and prescribe a remedy for each one.
In this article, we'll take a detailed look at some of the common event-driven architecture failures that development teams face, as well as the techniques that have emerged over the years as standard countermeasures.
Types of errors
An event-driven system has at least three pieces: the components that send requests, the components that receive and process them (potentially returning a result), and a message broker that facilitates the exchange. Any one of these pieces can go down or cause another to fail, which is why error handling in event-driven architectures presents some unique hurdles.
Here are some of the common situations that developers encounter:
Failed return result
A failed return result occurs when a system receives a request but can't return the result the requestor expects. For instance, this happens when a credit card is declined during a transaction, or when a warehouse doesn't have a requested item in stock. The receiving system will process the request but return an error result. In event-driven systems, the component that sends a request can simply drop a message into a queue and move on. That means a separate process will be responsible for detecting the error result and handling it as appropriate.
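As a minimal sketch of that separation, the Python below uses two in-memory queues as stand-ins for broker topics. The names charge_card and handle_results, the event fields and the credit limit check are all assumptions made for illustration; they are not part of any particular broker's API.

```python
import queue

# Hypothetical in-memory stand-ins for two broker queues.
requests = queue.Queue()
results = queue.Queue()

def charge_card(event):
    """Hypothetical payment handler: always publishes a result, even on failure."""
    approved = event["amount"] <= event["credit_limit"]
    results.put({
        "order_id": event["order_id"],
        "status": "approved" if approved else "declined",
        "reason": None if approved else "insufficient credit",
    })

def handle_results():
    """Separate process that reads results and collects the failed orders."""
    failed = []
    while not results.empty():
        result = results.get()
        if result["status"] != "approved":
            failed.append(result["order_id"])
    return failed

# The requestor drops its messages into the queue and moves on.
requests.put({"order_id": "A1", "amount": 50, "credit_limit": 100})
requests.put({"order_id": "A2", "amount": 500, "credit_limit": 100})

while not requests.empty():
    charge_card(requests.get())

failed_orders = handle_results()
```

The requestor never sees the declined card directly. Only the process that reads the results queue knows order A2 failed, so that process has to own the follow-up.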
Remote service down
In this case, the system a requestor wants to reach is out of operation. For instance, an order-entry system could receive an order and the order ID, and an event-handler could put the message into a queue for the credit card system, but the credit card system will never get the request. In an event-streaming system, the message will become stuck in the queue as it waits for the receiver to pick it up, resulting in an ever-growing pile of requests that could overload the queue mechanism or requesting services.
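One way to make that growing backlog visible is to put a bound on the queue, so that the sender learns the receiver has fallen behind instead of piling up requests indefinitely. The sketch below assumes an in-memory queue with an arbitrary capacity of three; submit_order and credit_card_queue are hypothetical names, and a real broker would expose its own limits and back-pressure settings.

```python
import queue

# Hypothetical bounded queue feeding the credit card system (capacity is arbitrary).
credit_card_queue = queue.Queue(maxsize=3)

def submit_order(order_id):
    """Try to enqueue an order; report failure instead of blocking when full."""
    try:
        credit_card_queue.put_nowait({"order_id": order_id})
        return True
    except queue.Full:
        # Nothing is draining the queue, so surface the problem to the sender.
        return False

# The receiver is down, so no messages are consumed while five orders arrive.
accepted = [submit_order(order_id) for order_id in range(5)]
```

Here the first three orders are accepted and the last two are refused, which gives the order-entry system an early signal that the credit card system is not picking up its messages.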
Event routing fails
If an event doesn't match the process it's supposed to trigger, it might never be picked up by the right system. For instance, a system that needs to receive messages related to customer orders will never see them if it isn't subscribed to the right event type. This produces the same situation as a system being down, because the request will never return a result.
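To illustrate, here is a minimal sketch of a publish and subscribe router that records events nobody is subscribed to, rather than dropping them silently. The subscribe and publish functions and the event type names are assumptions for this example, not a real broker's interface.

```python
from collections import defaultdict

subscribers = defaultdict(list)  # event type -> list of handler callables
unrouted = []                    # events that matched no subscription

def subscribe(event_type, handler):
    subscribers[event_type].append(handler)

def publish(event_type, payload):
    """Deliver to every subscriber; record the event if there are none."""
    handlers = subscribers.get(event_type, [])
    if not handlers:
        unrouted.append((event_type, payload))
        return 0
    for handler in handlers:
        handler(payload)
    return len(handlers)

received = []
subscribe("order.created", received.append)

publish("order.created", {"order_id": "A1"})  # routed to the subscriber
publish("order.create", {"order_id": "A2"})   # mismatched type, no subscriber
```

The second event uses a slightly different type name, so it lands in the unrouted list, where a monitoring job could pick it up.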
Event-handling system down
This occurs when the message queuing mechanism itself fails to do its job, which can happen when message handling is badly configured. It is also possible to design a messaging system for high availability and still overlook a component that supports a critical process. The original requesting system can be working fine, as can the intended recipient, but the message that was assumed to be sent is never passed along at all.
Ways to handle event-driven architecture failures
For all the issues that plague these types of systems, there are some dependable methods and patterns for error handling in event-driven architectures. Here are a few worth remembering:
Dead letter queues
A "dead letter" queue is a repository of messages that, for whatever reason, were never picked up and processed by the intended recipient. At regular intervals, a cleanup routine can comb through the dead letter queue, determine what went wrong and handle these lost messages as appropriate.
This routine can also be programmed to resubmit old jobs, if needed. At the very least, it can notify requesting systems that the intended recipient is down.
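The following is a minimal sketch of that idea, assuming a message is moved to the dead letter queue after a fixed number of failed delivery attempts. The MAX_ATTEMPTS value, the message fields and the function names are hypothetical, and real brokers implement dead lettering through their own configuration.

```python
import queue

MAX_ATTEMPTS = 3  # assumed retry limit before a message is dead-lettered

main_queue = queue.Queue()
dead_letters = queue.Queue()

def consume(handler):
    """Process queued messages, retrying failures up to MAX_ATTEMPTS."""
    while not main_queue.empty():
        message = main_queue.get()
        try:
            handler(message["body"])
        except Exception as exc:
            message["attempts"] += 1
            message["last_error"] = str(exc)  # context for the cleanup routine
            if message["attempts"] >= MAX_ATTEMPTS:
                dead_letters.put(message)
            else:
                main_queue.put(message)

def resubmit_dead_letters():
    """Cleanup routine: move dead letters back to the main queue."""
    count = 0
    while not dead_letters.empty():
        message = dead_letters.get()
        message["attempts"] = 0
        main_queue.put(message)
        count += 1
    return count

def failing_handler(body):
    raise RuntimeError("recipient down")

main_queue.put({"body": "order A1", "attempts": 0})
consume(failing_handler)             # three failures, then dead-lettered

resubmitted = resubmit_dead_letters()
processed = []
consume(processed.append)            # recipient is back, message goes through
```

Keeping the last error on the message gives the cleanup routine something to inspect before it decides whether to resubmit the job or notify the requesting system.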
Saga pattern
The saga pattern uses both a messaging bus and a "controller" mechanism that receives events from that bus and turns them into process-based messages. This controller could be something like a finite state machine that simulates sequential logic: anything that can take the message, determine where it came from, analyze its status and then issue the next command. For example, this can be accomplished using a switch statement or a set of nested if statements.
The result is a fairly sophisticated rollback system for event-driven architecture failures. A service can inform the bus of a failure, prompting the controller to send rollback instructions to each service along the executed chain. However, keep in mind that the controller adds a good deal more code, which can quickly turn a relatively simple application into a complex web of dependencies and potential bugs.
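A controller of this kind can be sketched as a small state machine. The version below is only an illustration under stated assumptions: the step names, the compensating commands and the event format are all hypothetical, and the controller simply records the commands it would send rather than talking to a real bus.

```python
class SagaController:
    """Minimal saga controller: tracks progress and issues rollback commands."""

    # Hypothetical ordered steps: (command, compensating command).
    STEPS = [
        ("reserve_stock", "release_stock"),
        ("charge_card", "refund_card"),
        ("ship_order", "cancel_shipment"),
    ]

    def __init__(self):
        self.position = 0        # index of the step currently awaiting a result
        self.state = "running"   # running, completed or rolled_back
        self.commands = []       # commands issued so far, in order

    def start(self):
        self.commands.append(self.STEPS[0][0])

    def on_event(self, event):
        """Take an event from the bus and decide the next command."""
        if self.state != "running":
            return
        if event["step"] != self.STEPS[self.position][0]:
            return  # ignore events that do not belong to the current step
        if event["status"] == "ok":
            self.position += 1
            if self.position == len(self.STEPS):
                self.state = "completed"
            else:
                self.commands.append(self.STEPS[self.position][0])
        else:
            # Undo every step that already succeeded, most recent first.
            self.state = "rolled_back"
            for _, compensation in reversed(self.STEPS[:self.position]):
                self.commands.append(compensation)

saga = SagaController()
saga.start()
saga.on_event({"step": "reserve_stock", "status": "ok"})
saga.on_event({"step": "charge_card", "status": "failed"})
```

In this run the stock reservation succeeds and the card charge fails, so the controller issues a single compensating command to release the stock. Even this small sketch shows how quickly the controller accumulates state and branching logic.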
Logs and alerts
The system that detects a failure, whether it sent the request or received it, should write a log entry recording that something went wrong and raise an alert. These alerts can take the form of web pages, emails, instant messages or phone calls that a person can respond to.
These messages should have enough context that an administrator can dive in, debug and fix the problem. One way to handle alerts consistently is for development teams to provide their own support, using a rotating schedule that assigns certain developers to support roles responsible for debugging, fixes and, when needed, escalations.
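What counts as "enough context" varies by system, but a structured log entry is a reasonable starting point. The sketch below uses Python's standard logging and json modules; the logger name, the field names and the example event are assumptions chosen for illustration.

```python
import json
import logging

logger = logging.getLogger("orders.events")  # hypothetical logger name

def build_alert(event, error, source):
    """Collect the context an administrator needs to debug the failure."""
    return {
        "event_id": event.get("id"),
        "event_type": event.get("type"),
        "source": source,        # which system detected the failure
        "error": str(error),
    }

def log_failure(event, error, source):
    """Write the alert to the log as one JSON line and return it."""
    alert = build_alert(event, error, source)
    logger.error(json.dumps(alert, sort_keys=True))
    return alert

alert = log_failure(
    {"id": "evt-42", "type": "order.created"},
    TimeoutError("credit card service unreachable"),
    "order-entry",
)
```

Because each entry is a single JSON line, an alerting tool can filter on fields such as event_type or source and route the alert to whoever is on support duty.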