

How the SaltStack architecture enables config management at scale

SaltStack's one-to-many communication model enables it to manage configurations in massive IT deployments without network strain. To get the most out of SaltStack, understand ZeroMQ.

To determine whether a configuration management tool can deliver services at scale, examine the underlying engine that drives the system.

In a SaltStack architecture, the engine is a messaging library called ZeroMQ. And while it helps SaltStack scale to support large IT deployments, it comes with potential challenges as well.

Modern IT deployments' scalability issues

Configuration management systems, in general, face a few scaling limitations. When these tools first emerged, it wasn't conceivable that a single administrator would deploy configurations at the scale at which many enterprises now operate: one administrator was typically responsible for hundreds of machines. As enterprise IT has grown, that same administrator now manages thousands of virtual and physical devices, if not more. As a result, the original configuration deployment engines show their limits.

Network protocols, such as SNMP, SSH, Telnet and WinRM, also play an important role in how IT deployments scale. These protocols typically run in a client-server model, in which the server sends a command individually to each client, and each client acts on the command it receives. This is generally a one-to-one interaction between server and client, though there are exceptions. ClusterSSH, for example, enables a single SSH server to send commands to many clients at once. This sidesteps some of the protocol's one-to-one constraints, but it comes with limitations of its own -- ClusterSSH can only manage Linux machines, and SSH key authentication requires a specifically configured environment -- and it becomes resource-intensive when communicating with thousands or tens of thousands of machines.

To address these common scalability issues, SaltStack uses ZeroMQ in its minion and master model.

The role of ZeroMQ in SaltStack

ZeroMQ works behind the scenes in the SaltStack architecture to deliver configurations with little processing overhead. To grasp how ZeroMQ works, think of a chat service, such as Slack. In a traditional configuration management setup, each command sent from server to client is like a private Slack message to an individual user, with a wait for each response: commands run in a one-to-one relationship with the number of messages sent. With ZeroMQ, by contrast, a command is issued like a message to a Slack channel, and every listener on that channel can respond. This asynchronous, one-to-many model makes deployments faster than one-to-one communication, because a single published command reaches all clients while the master simultaneously listens for responses.
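The one-to-many model above can be sketched with Python's standard library alone. This is a conceptual simulation of ZeroMQ-style publish/subscribe fan-out, not SaltStack's actual transport -- in real SaltStack, ZeroMQ PUB/SUB sockets carry encrypted messages over TCP, and the names "minion" and "Publisher" here are purely illustrative:

```python
import queue
import threading

# Conceptual model of publish/subscribe fan-out: one publish call
# reaches every subscribed "minion" -- a one-to-many send, rather
# than one message per client.

class Publisher:
    def __init__(self):
        self.subscribers = []          # one inbox queue per subscribed minion

    def subscribe(self):
        inbox = queue.Queue()
        self.subscribers.append(inbox)
        return inbox

    def publish(self, command):
        # A single call fans out to every subscriber.
        for inbox in self.subscribers:
            inbox.put(command)

def minion(name, inbox, results):
    command = inbox.get()              # every minion receives the same broadcast
    results.put((name, f"ran {command}"))

master = Publisher()
results = queue.Queue()
threads = []
for name in ("web01", "web02", "db01"):
    t = threading.Thread(target=minion, args=(name, master.subscribe(), results))
    t.start()
    threads.append(t)

master.publish("state.apply")          # one publish, not three separate sends
for t in threads:
    t.join()

replies = sorted(results.get() for _ in range(3))
print(replies)
```

The master issues `publish` once and then simply collects replies as they arrive, which is the asynchronous, one-to-many pattern the Slack-channel metaphor describes.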

ZeroMQ keeps the network footprint of a typical SaltStack message small, which removes network throughput as a potential delivery bottleneck. This, in turn, enables a deployment to send configuration instructions to thousands of machines before the configuration management system takes any performance hit.

As an asynchronous messaging library, ZeroMQ doesn't need to maintain an open connection to each system it deploys or applies configurations to, which enables SaltStack to communicate with client systems quickly. Based on the messages SaltStack gets back from clients, admins can determine deployment status instantly, without waiting for a refresh, a check-in window or a round of client-server communication.
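The instant-status idea can be sketched in a few lines. In this hedged, stdlib-only model (again a simulation, not SaltStack's real return channel), minions push results onto a shared return queue, and the master drains whatever has arrived so far to report status immediately instead of polling each client:

```python
import queue

# Minions push results to a return channel as they finish; the master
# never holds a connection open per minion.
returns = queue.Queue()
for minion_id in ("web01", "db01"):    # simulate two minions replying
    returns.put((minion_id, "done"))

# Drain everything that has arrived so far -- non-blocking, so status
# is available instantly; minions that haven't replied simply aren't listed.
status = {}
while True:
    try:
        minion_id, result = returns.get_nowait()
    except queue.Empty:
        break
    status[minion_id] = result

print(status)
```

Because the read is non-blocking, the master can report partial deployment status at any moment, which is what lets admins see results without waiting on a check-in window.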

Salt minion authentication poses a challenge

The SaltStack architecture can, however, encounter a scaling issue when Salt minions -- the machines SaltStack manages -- connect and authenticate to the master node in a "thundering herd."

Authentication, especially, plays a big role in this issue. Even though the Salt master node can handle thousands of network requests, its CPU can only decrypt so many authentication payloads per second, and its disk can only sustain so much I/O for job history and performance metrics.

There are workarounds for each of these issues: reduce the size of the authentication key to lighten the CPU load, and use an external database as a job cache to minimize disk I/O. These fixes are far cheaper than rearchitecting a network to handle many thousands of one-to-one requests under a different configuration management system.
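Both workarounds are configuration-level changes on the Salt master. The fragment below is illustrative -- the option names (`keysize`, `ext_job_cache` and the Redis returner settings) follow the Salt documentation, but you should verify them against your Salt release, and the Redis host details are placeholder assumptions:

```yaml
# /etc/salt/master -- illustrative settings; verify option names for your release

# A smaller RSA key lowers per-minion decryption cost on the master's CPU.
keysize: 2048

# Send job returns to an external store instead of the local disk job cache,
# reducing master disk I/O. Assumes a Redis returner is available.
ext_job_cache: redis
redis.host: localhost
redis.port: 6379
redis.db: 0
```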

When it's time to select a configuration management system for a given IT environment, think about how large you expect to scale, the speed and efficiency with which you expect deployments to happen, and how many resources you can dedicate to the master node.
