Using the saga design pattern for microservices transactions

We explore how the saga design pattern can support complex, long-term business processes and provide reliable rollback mechanisms for multistep transaction failures.

Breaking business actions into their smallest possible units through microservices seems like a wonderful idea. However, in business-critical scenarios, the architecture needs to align and coordinate a half-dozen different distributed services and systems. If any step fails, an entire business process may need to roll back and remediate, much like a failed database transaction.

Microservices, by default, have no coordinating or controlling function: each service should have no knowledge of the services around it. As such, maintaining smooth process flows between systems is much easier said than done without some kind of supervising controller mechanism. Fortunately, we can accomplish this through an architectural design known as the saga pattern.

What is the saga design pattern?

Picture an electronic data interchange (EDI) system interface that takes orders from customers and passes them to a product ordering system. That ordering service then sets off a chain of service calls that alert downstream manufacturing and shipping systems to finish the transaction. The diagram below shows a sample pattern illustrating this, where services call each other in a "round robin" style. Once the entire transaction chain is finalized, shipping sends a message to the product order service to confirm its completion. This can be thought of as a service choreography method.

Web service orchestration through a saga pattern

Typically, the application performs these actions one at a time. If both the manufacturing order and the shipping order go through, but the payment transaction fails, either a team member or a system must alert the preceding services to roll back.

Unfortunately, things get a little more complex when large-scale business transactions span significant periods of time. If any of those systems fails along the way, or the order is cancelled, we need a system that can perform a logical rollback and reset all those systems and transactions. For instance, a single failure during the payment transaction could very well require teams to roll back dozens of previous transactions that were completed by dozens of separate systems.
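
To make the idea of a logical rollback concrete, here is a minimal Python sketch. It assumes each step is paired with a compensating action that undoes it, and that a failure triggers those compensations in reverse order. The names `SagaStep` and `run_saga` and the three example steps are hypothetical and only illustrate the manufacturing, shipping and payment scenario described above.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SagaStep:
    name: str
    action: Callable[[], None]      # performs the work for this step
    compensate: Callable[[], None]  # logically undoes that work


def run_saga(steps: List[SagaStep]) -> bool:
    """Run steps in order; if one fails, compensate the completed ones in reverse."""
    completed: List[SagaStep] = []
    for step in steps:
        try:
            step.action()
            completed.append(step)
        except Exception:
            for done in reversed(completed):
                done.compensate()
            return False
    return True


# Hypothetical example: manufacturing and shipping succeed, payment fails.
log: List[str] = []


def fail_payment() -> None:
    raise RuntimeError("payment declined")


steps = [
    SagaStep("manufacturing",
             lambda: log.append("manufacturing ordered"),
             lambda: log.append("manufacturing cancelled")),
    SagaStep("shipping",
             lambda: log.append("shipping scheduled"),
             lambda: log.append("shipping cancelled")),
    SagaStep("payment",
             fail_payment,
             lambda: log.append("payment refunded")),
]

succeeded = run_saga(steps)
```

The failed payment step is never compensated because it never completed; only the two earlier steps are undone. A production version would also need to handle a compensation that itself fails, which this sketch leaves out.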

Airline ticketing systems are a perfect example of this problem. An unexpected event can cause people to cancel a trip minutes before the plane takes off. That single cancellation will cause ticketing systems to adjust seat availability, baggage systems to reroute luggage, payment systems to issue any necessary refunds -- and those are likely just a few of the steps involved.

As you can imagine, this system needs some way to "backwash" itself by reversing some of the earlier messages between the web services. Unfortunately, our transactions are too complex and long-running to simply call all the services and commit them at once. This requires more nuance than a master program can offer: it calls for a controller that takes ownership of the entire process. This is where the saga pattern steps into the picture to deliver service orchestration.

History of the saga design pattern

Hector Garcia-Molina and Kenneth Salem introduced the saga pattern concept in their paper titled "Sagas" back in 1987, while stressing the danger of assuming a rollback will completely restore a system to its original condition. However, where Garcia-Molina and Salem talk about the saga pattern in the context of databases, the concept really shone once it was applied to SOA in 2012, when Arnon Rotem-Gal-Oz released his book SOA Patterns and advocated using the event-driven approach for things like the enterprise service bus.

Implementing saga for web services

To illustrate the saga design pattern, imagine that your team implements an enterprise service bus that listens for particular transaction events, and then forwards messages to the relevant systems to start their operations. Once the bus creates a message that represents the event, it sends that message to any service associated with the event. In this case, the controller is a web service that is triggered by that event. That controller makes function calls to the next business web service in the queue.

Note this gives us two kinds of services:

  • The controllers, which receive events in the form of messages and then relay functional instructions to other services; and
  • The services that carry out the actual business process that needs to happen, and then communicate their completion to move the transaction along.

Diagram with round robin style of service choreography
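
The split between controllers and business services can be sketched in a few lines of Python. This is an in-memory stand-in for a service bus, not a real messaging product. The `EventBus` class, the event names and the `A-100` order ID are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Message = Dict[str, str]
Handler = Callable[[Message], None]


class EventBus:
    """Hypothetical in-memory bus: forwards each event to its subscribers."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event_type: str, message: Message) -> None:
        for handler in list(self._handlers[event_type]):
            handler(message)


bus = EventBus()
trace: List[str] = []


def manufacturing_service(order_id: str) -> None:
    """Business service: does the work, then reports completion as an event."""
    trace.append(f"manufacturing started for {order_id}")
    bus.publish("manufacturing_done", {"order_id": order_id})


def order_placed_controller(message: Message) -> None:
    """Controller: receives an event and calls the next business service."""
    manufacturing_service(message["order_id"])


def manufacturing_done_controller(message: Message) -> None:
    """Controller for the following step (shipping is omitted in this sketch)."""
    trace.append(f"next step triggered for {message['order_id']}")


bus.subscribe("order_placed", order_placed_controller)
bus.subscribe("manufacturing_done", manufacturing_done_controller)
bus.publish("order_placed", {"order_id": "A-100"})
```

The business service never calls the next service directly. It only announces that it has finished, and a controller decides what happens next.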

To implement these controller services, you can essentially create an event handler for an event-driven application, or introduce a finite state machine that simulates sequential logic. That component can then take the message, determine where it came from, analyze its status, and then process the next command. This can be accomplished through a switch statement, a set of nested if statements, or even a single database lookup.
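
As a rough sketch of the finite state machine approach, the controller's logic can live in a lookup table keyed by the current state and the incoming event. The dictionary below stands in for the switch statement or database lookup mentioned above. Every state, event and command name is hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# (current state, incoming event) -> (next state, command to send next)
# All state, event and command names are hypothetical.
TRANSITIONS: Dict[Tuple[str, str], Tuple[str, Optional[str]]] = {
    ("NEW", "order_placed"): ("MANUFACTURING", "start_manufacturing"),
    ("MANUFACTURING", "manufacturing_done"): ("SHIPPING", "schedule_shipping"),
    ("SHIPPING", "shipping_done"): ("PAYMENT", "charge_payment"),
    ("PAYMENT", "payment_complete"): ("COMPLETED", None),
    # Failure path: undo the earlier steps in reverse order.
    ("PAYMENT", "payment_failed"): ("CANCELLING_SHIPPING", "cancel_shipping"),
    ("CANCELLING_SHIPPING", "shipping_cancelled"):
        ("CANCELLING_MANUFACTURING", "cancel_manufacturing"),
    ("CANCELLING_MANUFACTURING", "manufacturing_cancelled"): ("ROLLED_BACK", None),
}


def next_step(state: str, event: str) -> Tuple[str, Optional[str]]:
    """Look up what the controller should do for an event in a given state."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"unexpected event {event!r} in state {state!r}") from None


def run(events: List[str]) -> Tuple[str, List[str]]:
    """Feed a sequence of events through the machine and collect the commands."""
    state = "NEW"
    commands: List[str] = []
    for event in events:
        state, command = next_step(state, event)
        if command is not None:
            commands.append(command)
    return state, commands


final_state, commands = run([
    "order_placed",
    "manufacturing_done",
    "shipping_done",
    "payment_failed",
    "shipping_cancelled",
    "manufacturing_cancelled",
])
```

Keeping the transitions in data rather than in nested conditionals makes it easier to see every path, including the rollback path, at a glance.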

Keep in mind that implementing this design can still be tricky if the overall application demands high reliability. For example, imagine the controller service crashes after it triggers an 'order placed' event, but before it can pass a 'payment complete' event. When that service restarts, it will need to access some sort of transaction log, find the unprocessed transactions, resubmit the event (or events), and mark the work as done. This still leaves a gap: a system can shut down after it sends the event but before it confirms the commit, in which case the same event may be sent again after the restart. There are a number of architectural patterns that resolve this specific issue, but the simplest by far is to allow redundant messages and program the services to ignore any message they have already processed.
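
A small Python sketch of that last approach is shown below, assuming each message carries a unique ID and the receiving service remembers which IDs it has already handled. `PaymentService`, `replay_unconfirmed` and the shape of the transaction log are hypothetical, and a real service would keep the processed IDs in durable storage rather than in memory.

```python
from typing import Any, Dict, List, Set


class PaymentService:
    """Hypothetical business service that tolerates redelivered messages."""

    def __init__(self) -> None:
        self.processed_ids: Set[str] = set()
        self.charges: List[str] = []

    def handle(self, message: Dict[str, str]) -> bool:
        """Process a message once; return False if it was a duplicate."""
        if message["id"] in self.processed_ids:
            return False
        self.charges.append(message["order"])
        self.processed_ids.add(message["id"])
        return True


def replay_unconfirmed(log: List[Dict[str, Any]], service: PaymentService) -> int:
    """After a restart, resend every logged message not yet marked as done."""
    resent = 0
    for entry in log:
        if not entry["done"]:
            service.handle(entry["message"])
            entry["done"] = True
            resent += 1
    return resent


# The controller sent the message, then crashed before marking it as done.
service = PaymentService()
message = {"id": "msg-1", "order": "A-100"}
service.handle(message)
transaction_log = [{"message": message, "done": False}]

# On restart the message is sent again, but the service ignores the duplicate.
resent = replay_unconfirmed(transaction_log, service)
```

The controller can resend freely because a duplicate has no effect on the service. The customer in this example is charged once, even though the message was delivered twice.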

Should you implement the saga design pattern?

The goal of the saga design pattern is primarily to take long-running, multisystem business processes and add the ability to roll back failed systems in an intelligent manner. However, it does add more code, which means new layers of complexity, harder debugging, and greater demands on bandwidth and processing power.

Put bluntly, an orchestration-focused saga pattern will usually prove to be overkill for simple application-based transactions. Unless your organization is notably struggling to manage large chains of business processes, the code complexity involved in a saga design pattern may cause more problems than it solves. But, if long-running transactions keep you up at night -- especially when it comes to handling failures -- the saga pattern may be the answer you've been looking for.
