- Mike Matchett, Small World Big Data
After getting just a little taste of data center transformation success, an overly enthusiastic enterprise can rush off on a headlong, downhill race to eliminate as much IT data center cost as possible. That can lead to trying multiple cloud services, experimenting with converged infrastructure and software stacks, and adopting DevOps-friendly technologies, such as containerization. Each of these can help force out burdensome Capex and provide increased agility.
Still, I can't help but think that in this mad rush to deflate the data center we may lose sight of something important. Cloud and containers are grand and hold lots of promise, but often they simply don't arrive on the scene with full enterprise management, battle-tested security or -- as we address here -- guaranteed ways to assure service levels.
Keep an eye on the prize
Convergence, cloud and containers are all hot technologies to be sure. The value they provide is increased abstraction between workloads and infrastructure. More abstraction is great for our new distributed, DevOps-oriented world, but it also tends to obscure ultimate visibility into what makes for good IT performance.
There are many ways to look at performance, but let's focus on workload response time as one of the important measures of how happy our end users will be with their IT experience. Imagine a chart that has CPU utilization increasing linearly on the X axis from 0% to 100%. If we plot the average interactive transaction response time for that CPU on the Y axis, we'll end up with a steeply nonlinear curve that starts at a reasonable service time at 0% but shoots up toward infinity as utilization approaches 100%. (Note: For the mathematically minded, the response time curve can be modeled using queuing theory to calculate the probabilistic waiting time for the increasingly busy resource.)
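For readers who want to see the shape of that curve, here is a minimal sketch. It assumes the simplest textbook queuing model (a single server with random arrivals and service times, known as M/M/1), where mean response time is the service time divided by the idle fraction of the resource. The function name and the 10 ms service time are illustrative assumptions, not measurements of any real system.

```python
def response_time(service_time: float, utilization: float) -> float:
    """Mean response time under the assumed M/M/1 model: R = S / (1 - rho)."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in the range [0, 1)")
    return service_time / (1.0 - utilization)


if __name__ == "__main__":
    service_time = 0.010  # hypothetical 10 ms of pure service time
    for rho in (0.0, 0.25, 0.5, 0.75, 0.9, 0.99):
        millis = response_time(service_time, rho) * 1000
        print(f"utilization {rho:>4.0%}: {millis:8.1f} ms")
```

Under this model, response time doubles by 50% utilization and grows tenfold by 90%. Real workloads rarely match the model exactly, but the hockey-stick shape is typical.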
Driving the utilization of an infrastructure resource as high as possible by adding load, especially to please accountants focused on utilization metrics, eventually becomes counter-productive in terms of IT performance management.
It is true that batch workloads are measured more by throughput, which does max out at maximum utilization. However, response time is critical for any interactive workload. And in today's fast-data world, we process more sources and streams of data in near real time by and for interactive operations and applications. Big data today is about bringing as much intelligence as possible as far forward as possible to the operational edge.
Cloud and containers each change IT performance management in distinct ways, and while change can be scary, there are ways for IT admins to ensure performance stays within an acceptable range.
Prized pets vs. cattle herds
Cloud-app designers tend to treat portions of critical code more like fungible cattle -- and less like the historical pets they used to be in the silos of traditional data center infrastructure. We used to carefully monitor and manage mission-critical applications and their resources; any deviation from normal would be triaged immediately. With many of today's cloud apps, deviation in performance will result in the app being re-provisioned with a new cloud instance that should perform better.
But not necessarily. To understand this, let's look at virtualization, which has brought many benefits, but not always better performance. In a virtualized host, guaranteeing actual response time has always been a problem.
While a virtual admin can assign a quota of host resources (in terms of utilization) to a given VM, each host resource by definition is shared by many VMs at the same time. Once we understand that response time performance is nonlinear with respect to total system utilization, we can immediately see how the noisy neighbor problem arises on heavily utilized virtual servers -- even if our critical VM has a guaranteed slice of utilization.
As an example, consider a host server on which every VM has a guaranteed slice of capacity. If enough VMs use their capacity at the same time to drive total utilization of the server above 50%-60%, response time will degrade for all of them. Past a certain threshold of utilization, far less than 100%, the underlying server resource still has remaining capacity, but experienced performance can degrade by half. As utilization approaches 100%, responsiveness can degrade to the point where little actual work is even getting through the system.
If we think of clouds, public or private, as large virtual server farms, we can see why a cloud machine instance may not always provide the performance we deserve. The cloud service provider promises a certain amount of resource utilization when we put in our credit card number and check out a cloud server. The cloud provider, however, does not generally certify that our particular machine instance will not be cohosted with many other competing instances. This means that during busy times, many hosted machine instances won't provide the same level of performance as when their underlying cloud infrastructure is less than half utilized.
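To put rough numbers on the noisy neighbor effect, the sketch below assumes the same simple single-queue approximation, in which every VM on a host sees a slowdown of 1 / (1 - total utilization). The VM demand figures are invented for illustration, and real hypervisor schedulers are considerably more sophisticated.

```python
def host_utilization(vm_demands: list[float]) -> float:
    """Total host utilization: the sum of each VM's current share (0.0 to 1.0)."""
    return sum(vm_demands)


def slowdown(total_utilization: float) -> float:
    """Response-time multiplier under the assumed model: 1 / (1 - rho)."""
    if not 0.0 <= total_utilization < 1.0:
        raise ValueError("total utilization must be in the range [0, 1)")
    return 1.0 / (1.0 - total_utilization)


if __name__ == "__main__":
    # Four hypothetical VMs, each entitled to a 25% slice of the host.
    quiet_hour = [0.05, 0.05, 0.05, 0.05]  # neighbors mostly idle
    busy_hour = [0.05, 0.20, 0.20, 0.15]   # our VM (first entry) is unchanged
    for label, demands in (("quiet", quiet_hour), ("busy", busy_hour)):
        rho = host_utilization(demands)
        print(f"{label}: host at {rho:.0%}, slowdown about {slowdown(rho):.2f}x")
```

In this toy scenario our own VM never exceeds its slice or changes its demand, yet its response time roughly doubles once the neighbors wake up.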
Fundamentally, clouds are cost-efficient because they pool and share infrastructure as widely as possible. A cloud service provider is economically incentivized to stuff as many virtual instances as possible into a given cloud infrastructure footprint. In fact, one of the key areas of profit margin for a cloud provider is in being able to oversubscribe real infrastructure as much as possible across multiple tenants, knowing statistically that many machine instances much of the time are highly underutilized, if utilized at all.
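The statistical bet behind oversubscription can be sketched with a simple binomial model. This assumes, unrealistically, that every instance is busy independently and with the same probability. The instance counts and probabilities are hypothetical, and no provider's actual placement policy is implied.

```python
from math import comb


def prob_overload(n_instances: int, p_busy: float, capacity: int) -> float:
    """Probability that more than `capacity` of `n_instances` are busy at once,
    assuming each is busy independently with probability `p_busy`."""
    return sum(
        comb(n_instances, k) * p_busy**k * (1.0 - p_busy) ** (n_instances - k)
        for k in range(capacity + 1, n_instances + 1)
    )


if __name__ == "__main__":
    # Hypothetical host: room for 10 fully busy instances, each busy 10% of the time.
    for sold in (10, 20, 40, 80):
        risk = prob_overload(sold, 0.10, 10)
        print(f"{sold} instances sold: overload probability {risk:.4%}")
```

The risk stays negligible at moderate oversubscription and then climbs quickly, which is why the bet usually pays off for the provider and occasionally hurts the tenant.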
Thus, web app administrators and clever DevOps folks treat their cloud applications more like cattle. They architect their web applications in a distributed fashion across many machine instances such that if any one machine instance within that pool ever suffers slow performance, they simply kill it and restart it. When your service provider is large enough, the restart operation almost guarantees that the new instance will be provisioned in a different area of the cloud infrastructure, away from its previously noisy neighbors. It's worth noting that this cattle approach might not work so well on less-expansive private clouds.
With containerized, microservices-heavy applications, performance can be even more opaque. A single microservice by original definition simply doesn't last long even if its performance is lousy. With a massively containerized application, we might only see poor performance in the aggregate end result. And because microservices can be ephemeral, we can't really manage them as either pets or cattle.
When we had pets assigned to their own isolated infrastructure, end-to-end infrastructure performance management tools enabled IT admins to identify and correct obvious performance problems. While virtualization began to muddy IT performance management, there were still effective ways to correlate application performance with virtualized infrastructure. But once we move our applications to a public cloud, managing for top-notch performance becomes more of a statistical cat-and-mouse game. And now with the rise of containers, managing performance is an even greater challenge.
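The kill-and-restart habit boils down to a simple policy. The sketch below flags any instance whose latency is well above the pool median. The instance names, the 2x threshold and the function itself are hypothetical, and the actual terminate and relaunch calls are left out because they depend on the provider.

```python
from statistics import median


def instances_to_replace(latencies_ms: dict[str, float], factor: float = 2.0) -> list[str]:
    """Return IDs of instances whose latency exceeds `factor` times the pool median."""
    if not latencies_ms:
        return []
    baseline = median(latencies_ms.values())
    return sorted(
        instance_id
        for instance_id, latency in latencies_ms.items()
        if latency > factor * baseline
    )


if __name__ == "__main__":
    # Hypothetical latency samples for a four-instance web pool.
    pool = {"web-1": 40.0, "web-2": 42.0, "web-3": 45.0, "web-4": 180.0}
    for instance_id in instances_to_replace(pool):
        # A real system would call the provider's terminate/launch APIs here.
        print(f"replace {instance_id}")
```

Comparing against the pool median rather than a fixed number keeps the policy from culling the whole herd when every instance slows down together, which is the situation a small private cloud is more likely to face.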
The good news is that, with container architectures, we can readily add performance instrumentation at a very fine-grained level within our application. Given a new crop of highly scalable and responsive management tools, it should be possible to shepherd flocks of containers to greener-performing pastures using clever IT operations automation (likely based on effective use of machine learning).
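As an illustration of fine-grained instrumentation, here is a minimal sketch of a timing wrapper that records how long each call to a function takes. The `timed` decorator, the in-memory `timings` store and the `lookup_price` function are all hypothetical. A production service would more likely export such measurements to a metrics or tracing system than keep them in a dictionary.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-memory store: operation name -> list of durations in seconds.
timings: dict[str, list[float]] = defaultdict(list)


def timed(name: str):
    """Decorator that records the wall-clock duration of every call under `name`."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Record the duration even if the call raised an exception.
                timings[name].append(time.perf_counter() - start)
        return wrapper
    return decorator


@timed("lookup_price")
def lookup_price(sku: str) -> float:
    """Stand-in for a microservice operation; the sleep simulates real work."""
    time.sleep(0.01)
    return 9.99


if __name__ == "__main__":
    lookup_price("sku-123")
    print(f"lookup_price took {timings['lookup_price'][-1] * 1000:.1f} ms")
```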
The real trick for a competitive technology organization will be to proactively, if not predictably and continuously, achieve high performance while implementing a deliberately chosen cost or spend policy. This balancing act in some ways gets harder with cloud and containers -- because of increased opacity and scale -- but also easier -- because of distributed data and processing technologies.