If enterprise developers and architects had to identify one operational demand that constantly haunts them, continuous uptime would be at the top of that list. Classic failover strategies, such as reverting to backup databases or servers, work fine when a single component fails in isolation. However, these strategies fall short when an error induces cascading component failures that could bring down entire application systems.
Cell-based architecture design is one approach aimed at remedying this issue by eliminating single points of failure. In this article, we cover the basics of cell-based architecture, detail the problems it solves, explore when it's an appropriate approach and, finally, assess the potential management risks it may pose.
The basics of cell-based architecture
The cell-based architecture approach addresses issues related to failover by decomposing a software system into large collections of partial or complete copies of the system's various application services and data components. In the case of microservices-based applications, each cell would encompass one or more microservices that operate in accordance with defined business logic. Each cell is independently deployable, configurable and observable as its own instance.
Cell-based architecture takes the fundamental concepts related to microservices and applies them across the entire software environment and the applications within. Microservices enable teams to create small, independent, separate services. If one microservice goes down, the rest of the application system can remain operational. Similarly, cell-based architectures make small, independent copies of the entire production environment; if one copy of production goes down, it likely affects only the small set of users served by that copy, possibly just a single user.
Cells don't need to include the entire architecture: They can encompass just one logically connected set of services. These services can, in turn, be combined into a Kubernetes cluster, creating a cluster of clusters. Cellery is an open source tool that accelerates the design and deployment of cell-based systems.
Consider a scenario where a single user, on average, sits on roughly 10 of 100 cell-based copies of application services. A single failure involving that one user might bring down the 10 cells in use, but a random distribution of user information makes it extremely unlikely that any other user sits on exactly the same 10 cells. Mathematically, there are about 17.3 trillion distinct ways to choose 10 cells out of 100, or about 62 quintillion if the order of selection is counted, so the odds that a single failure cascades to affect all users and services are vanishingly small. Developers can isolate a single problematic cell instance without locking other users out of an application while addressing the issue.
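The arithmetic behind this scenario can be checked with a short sketch. The cell count (100) and the cells per user (10) come from the scenario above; the `assign_cells` helper, the user IDs and the hash-seeded random assignment are assumptions made purely for illustration and do not describe any particular routing product. The 62 quintillion figure corresponds to ordered selections of 10 cells; ignoring order, there are about 17.3 trillion distinct cell sets.

```python
from __future__ import annotations

import hashlib
import math
import random

TOTAL_CELLS = 100     # from the scenario above
CELLS_PER_USER = 10   # from the scenario above

# Distinct 10-cell sets when order is ignored (combinations).
unordered_sets = math.comb(TOTAL_CELLS, CELLS_PER_USER)
# Ordered selections of 10 cells (permutations); this is the ~62 quintillion figure.
ordered_arrangements = math.perm(TOTAL_CELLS, CELLS_PER_USER)


def assign_cells(user_id: str) -> frozenset[int]:
    """Hypothetical helper: pick a repeatable pseudo-random set of cells for a user."""
    # Seed a private RNG from a hash of the user ID so the same user
    # always lands on the same cells.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return frozenset(rng.sample(range(TOTAL_CELLS), CELLS_PER_USER))


if __name__ == "__main__":
    print(f"{unordered_sets:,} distinct cell sets")
    print(f"{ordered_arrangements:,} ordered arrangements")
    a, b = assign_cells("user-a"), assign_cells("user-b")
    print(f"user-a and user-b share {len(a & b)} of {CELLS_PER_USER} cells")
```

Under this kind of random assignment, two users overlap on one cell on average (10 × 10 / 100), so a failure tied to one user tends to reduce another user's capacity rather than remove it entirely.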
Benefits of cell-based architecture
The first major benefit of the cell-based architecture approach is resilience. The independent nature of these cells helps development teams perform staged rollouts, blue-green deploys and other incremental approaches to updates and deployments. This makes rollbacks easier as well: Simply redeploy the older version of the cell in the cloud, and then flip the router to use it.
Another benefit of cell-based architecture is that it provides developers and testers a scriptable environment where they can conduct sandboxlike simulations for individual updates or feature additions. This type of environment can give software teams stuck in a Waterfall methodology a way to move toward a CI/CD approach. Finally, the ability to call up operational service instances on demand gives companies looking to improve business applications a significant competitive advantage over those struggling to maintain constant availability.
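The rollback step described above can be sketched as a small routing table that remembers the prior version of each cell. This is a minimal illustration, not a real gateway: the `CellRouter` class, its method names and the cell and version labels are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class CellRouter:
    """Hypothetical router that tracks which version of each cell receives traffic."""

    active: dict[str, str] = field(default_factory=dict)
    previous: dict[str, str] = field(default_factory=dict)

    def deploy(self, cell: str, version: str) -> None:
        # Remember the currently live version so it can be restored later.
        if cell in self.active:
            self.previous[cell] = self.active[cell]
        self.active[cell] = version

    def rollback(self, cell: str) -> str:
        # Flip traffic for this one cell back to its prior version;
        # every other cell is left untouched.
        if cell not in self.previous:
            raise LookupError(f"no previous version recorded for {cell}")
        self.active[cell], self.previous[cell] = (
            self.previous[cell],
            self.active[cell],
        )
        return self.active[cell]


if __name__ == "__main__":
    router = CellRouter()
    router.deploy("cell-07", "v1")
    router.deploy("cell-08", "v1")
    router.deploy("cell-07", "v2")      # staged rollout to a single cell
    print(router.rollback("cell-07"))   # cell-07 serves v1 again
    print(router.active)
```

Because the flip is scoped to one cell, a bad release can be withdrawn from the cells that received it while the rest keep serving traffic unchanged.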
When to use cell-based architecture
Given that cell-based architectures can impose additional management complexity and operational costs, the first question to ask is whether a software team faces -- or will face -- problems that the cell-based architecture approach can solve.
The concept of a cell-based architecture originally emerged as a way to address cascading errors and failover problems within complex application systems. A cell-based architecture is best suited to global, internet-scale systems with hundreds of millions or more users, where misconfigured subsystems could interact in unexpected ways.
If you are working with global-scale systems, or if you have so many combinations of test conditions that your test coverage is insufficient, then cell-based architecture provides increased value for the investment. A cell-based architecture may also make sense when the only alternatives are a complete application rewrite or intensive feature degradation.
Risks of cell-based architecture
Unfortunately, a cell-based architecture introduces another layer of complexity, as software teams must replicate the whole system over and over, likely storing those copies in a cloud environment. It also requires teams to add a service router, such as an API gateway, to relay requests to the correct cell. This router needs high availability and might also need to deal with queries that span multiple users and handle things like interservice communication or event sourcing.
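As a rough illustration of what relaying requests to the correct cell can involve, the sketch below maps each user to a home cell with a stable hash. The cell names, the `home_cell` and `route` functions and the `CellUnavailableError` exception are hypothetical; a production router would also need health checking, a way to migrate users between cells and its own redundancy.

```python
from __future__ import annotations

import hashlib

CELLS = ["cell-0", "cell-1", "cell-2", "cell-3"]  # hypothetical cell names


class CellUnavailableError(RuntimeError):
    """Raised when a user's home cell is not currently healthy."""


def home_cell(user_id: str) -> str:
    """Map a user to one cell with a stable hash, so routing is repeatable."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return CELLS[int.from_bytes(digest[:8], "big") % len(CELLS)]


def route(user_id: str, healthy: set[str]) -> str:
    """Return the cell that should serve this user's request."""
    cell = home_cell(user_id)
    if cell not in healthy:
        # Assumption: requests are not spilled into other cells, so a
        # failure stays contained instead of spreading load elsewhere.
        raise CellUnavailableError(f"{cell} is unavailable for {user_id}")
    return cell


if __name__ == "__main__":
    healthy_cells = set(CELLS)
    for user in ("alice", "bob", "carol"):
        print(user, "->", route(user, healthy_cells))
```

Failing fast when a home cell is down is only one possible policy, chosen here because it keeps cells isolated; whether to fail over to another cell depends on how user data is replicated.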
The additional layers of complexity introduced by a cell-based architecture make it harder to identify the root cause of errors, which makes automated monitoring tools a must-have. More complexity also adds use cases, which can lend themselves to hidden abuse cases, where malicious actors can escalate permissions or violate security undetected. Since teams will likely add components that require replication as well, a cell-based architecture tends to increase a system's processing power requirements over time. And, since these cells often run in a public cloud, that means increased costs for CPU and memory allocation from the respective cloud provider.