How to monitor CPU usage in SDN environments
Monitoring CPU usage in a software-defined network is necessary to determine adequate network capacity and compute capacity, especially if workloads are competing for resources.
The wonderful thing about software-defined networking is it takes core network function out of the realm of purpose-built hardware powered by custom silicon and moves it into the world of software that runs on generic x86 and commodity network hardware.
The bad thing about software-defined networking (SDN) is the same. Moreover, a lot of that network function in your data centers -- and even in some branches -- runs on the same compute infrastructure as the rest of your workloads. Consequently, SDN can actually compete with other work in the data center for limited CPU resources.
There are two good reasons to monitor CPU usage in an SDN environment. The first is to make sure you have the network capacity you need. The second is to make sure you have the compute capacity you need when all the other workloads sharing the infrastructure are taken into account.
What should you monitor?
In a software-defined network, you have three major classes of entity to monitor:
- physical data plane devices -- mostly switches;
- virtual data plane devices; and
- controllers.
Virtual devices and controllers usually run within the data center, mostly inside hypervisor spaces like VMware, Kernel-based Virtual Machine, Microsoft Hyper-V, Citrix or Oracle. Some devices will be on branch office hardware -- especially as WAN virtualization and software-defined WAN continue to gain ground -- in the form of customer premises equipment that is associated explicitly with the WAN or on a branch host server that runs more traditional workloads, such as a file server. IT teams should monitor resource consumption in all these places.
IT should also monitor multiple CPU-related metrics. First, of course, is utilization -- the share of time the CPU spends executing workloads. Second is latency -- how long processes wait to get CPU resources. IT teams can drill into many other metrics while troubleshooting, but these are the two that can alert them to trouble.
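As a rough illustration of those two metrics, the sketch below derives both from two snapshots of the aggregate `cpu` line in `/proc/stat` on a Linux guest. It rests on assumptions that are not from the text above: the device runs Linux, and "steal" time (time the virtual CPU waited for the physical CPU) stands in as a proxy for CPU latency. The function names `parse_cpu_line` and `cpu_metrics` are hypothetical.

```python
def parse_cpu_line(line):
    """Return the first eight counters from the aggregate 'cpu' line of /proc/stat.

    Order: user, nice, system, idle, iowait, irq, softirq, steal.
    """
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in fields[1:9]]
    # Very old kernels report fewer columns; pad the missing ones with zero.
    values += [0] * (8 - len(values))
    return values


def cpu_metrics(before, after):
    """Compute utilization and steal percentages between two snapshots."""
    delta = [b - a for a, b in zip(parse_cpu_line(before), parse_cpu_line(after))]
    total = sum(delta)
    if total <= 0:
        raise ValueError("snapshots must be taken at different times")
    idle = delta[3] + delta[4]   # idle + iowait
    steal = delta[7]             # assumption: steal approximates CPU wait
    busy = total - idle - steal  # time actually spent executing work
    return {
        "utilization_pct": 100.0 * busy / total,
        "steal_pct": 100.0 * steal / total,
    }


# Example with made-up counter values, one sampling interval apart.
sample_before = "cpu 100 0 50 800 50 0 0 0 0 0"
sample_after = "cpu 750 0 150 1000 50 0 0 50 0 0"
print(cpu_metrics(sample_before, sample_after))
```

On a hypervisor host, the platform's own wait counter would be the better source for the latency figure; steal time is only visible from inside a guest.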
How should you monitor CPU usage?
If IT teams are deploying commercially packaged software on actual switch gear, the management tools for the platform typically offer the ability to monitor resource consumption. IT can use that capability to monitor or send data and alerts to a manager of managers in the network operations center.
If IT teams are rolling out open source or a platform with no built-in monitoring, they can usually treat the switches like another Linux box and monitor them like a virtual machine (VM) host.
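If the switch really does behave like another Linux box, one place to look for CPU wait is the kernel's pressure stall information. The sketch below is a minimal example under several assumptions: the switch OS runs a kernel that exposes `/proc/pressure/cpu` (this is not available on older kernels or when the feature is disabled), and the helper names `parse_pressure` and `read_cpu_pressure` are hypothetical.

```python
def parse_pressure(text):
    """Parse pressure-stall text such as the contents of /proc/pressure/cpu.

    Each line looks like: 'some avg10=0.00 avg60=0.00 avg300=0.00 total=0'.
    Returns a dict keyed by the line's first word ('some', 'full').
    """
    result = {}
    for line in text.strip().splitlines():
        kind, *pairs = line.split()
        result[kind] = {
            key: float(value)
            for key, value in (pair.split("=", 1) for pair in pairs)
        }
    return result


def read_cpu_pressure(path="/proc/pressure/cpu"):
    """Return parsed CPU pressure, or None if the file is not available."""
    try:
        with open(path) as handle:
            return parse_pressure(handle.read())
    except OSError:
        return None


if __name__ == "__main__":
    pressure = read_cpu_pressure()
    if pressure is None:
        print("CPU pressure information is not available on this system")
    else:
        # 'some avg10' is the share of the last 10 seconds in which
        # at least one task was stalled waiting for CPU.
        print("CPU wait over last 10s: %.2f%%" % pressure["some"]["avg10"])
```

A monitoring agent or an open source tool's custom check could call a helper like this on an interval and forward the values.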
Everything else pretty much falls into the broad category of VM host. IT teams can monitor these in multiple ways, such as:
- using the virtualization platform's own monitoring tools, from VMware, Microsoft or Citrix, for example;
- using a general-purpose management suite, such as those from IBM, CA Technologies, BMC Software, ManageEngine or SolarWinds, for example; and
- using open source monitoring tools, like Nagios or Zabbix.
What to do when CPU usage goes into the red
The crucial considerations when monitoring CPU usage are the baseline load and anything else that needs the resources. The baseline average should be 75% utilization or less -- this leaves headroom for spikes -- and latencies should be around 5%, which means jobs are rarely waiting for CPU cycles. Sustained utilization above 90% should trigger alerts, as should latencies that climb above 10% or so.
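The alerting rule described here can be sketched in a few lines. The 90% and 10% thresholds come from the guidance above; everything else is an assumption made for illustration, including the class name `CpuAlerter`, the choice of three consecutive samples as the meaning of "sustained" and the short alert labels.

```python
from collections import deque


class CpuAlerter:
    """Flag sustained high utilization and high CPU latency.

    'Sustained' is assumed to mean that every one of the last `sustain`
    samples exceeded the utilization threshold.
    """

    def __init__(self, util_alert=90.0, latency_alert=10.0, sustain=3):
        self.util_alert = util_alert
        self.latency_alert = latency_alert
        self.recent = deque(maxlen=sustain)

    def observe(self, utilization_pct, latency_pct):
        """Record one sample and return a list of triggered alert labels."""
        self.recent.append(utilization_pct)
        alerts = []
        window_full = len(self.recent) == self.recent.maxlen
        if window_full and all(u > self.util_alert for u in self.recent):
            alerts.append("utilization")
        if latency_pct > self.latency_alert:
            alerts.append("latency")
        return alerts


def baseline_ok(samples, limit=75.0):
    """Return True if average utilization is at or below the baseline limit."""
    return sum(samples) / len(samples) <= limit


# Example with made-up samples: (utilization %, latency %)
alerter = CpuAlerter()
for sample in [(95, 2), (96, 3), (97, 4), (50, 12)]:
    print(sample, alerter.observe(*sample))
```

Requiring several consecutive samples keeps a single spike from paging anyone, which is the point of leaving headroom between the 75% baseline and the 90% alert line.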
IT teams should always do a deeper analysis of the performance issue at hand to make sure CPU issues are not masking other problems, such as a failing memory module or excessive waits for I/O. Often enough, however, the problem is actually in the CPU or is related to contention for CPU resources.
In a physical switch environment, tripping thresholds may indicate physical problems with the switch. IT should check the temperature metrics, for example, to see whether the switch is overheating and becoming inefficient. The thresholds might also indicate that the device is doing too much or coping with too much traffic. If so, IT should see whether the switch is doing anything that is not required. If the environment has simply outgrown the switch, IT can either replace it or re-engineer the traffic to lessen the load.
In a VM host environment, all those scenarios still apply -- except the scenario of an overheated switch -- with the qualification that doing too much might include sharing resources with non-network workloads.
In this kind of environment, IT teams can look at segregating workloads to protect resources for the network controller or data plane device. They could also consider provisioning host servers with network offload cards, which can dramatically reduce the amount of general-purpose CPU time a network application needs. IT teams could also add CPU resources to spread the work across more cores.
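One way to picture workload segregation is to set aside a fixed group of cores for the network function and leave the rest for everything else. The sketch below assumes a Linux host, where Python's `os.sched_setaffinity` is available; the helper names `split_cores` and `pin_process` are hypothetical, and most virtualization platforms offer their own reservation or affinity settings that would normally be used instead.

```python
import os


def split_cores(cores, reserved_for_network):
    """Split core IDs into (network_cores, general_cores).

    The lowest-numbered cores are reserved for the network workload;
    that choice is arbitrary and only for illustration.
    """
    ordered = sorted(cores)
    if not 0 < reserved_for_network < len(ordered):
        raise ValueError("must reserve at least one core and leave at least one")
    return set(ordered[:reserved_for_network]), set(ordered[reserved_for_network:])


def pin_process(pid, cores):
    """Restrict a process to the given cores (Linux only)."""
    if not hasattr(os, "sched_setaffinity"):
        raise NotImplementedError("CPU affinity is not supported on this platform")
    os.sched_setaffinity(pid, cores)


# Example: on an eight-core host, keep two cores for the network function.
network_cores, general_cores = split_cores(range(8), 2)
print("network:", sorted(network_cores))
print("general:", sorted(general_cores))
# pin_process(<pid of the network process>, network_cores) would apply the split.
```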