Five questions on systems management

As reliance on systems management tools increases, administrators face varied challenges in deployment, skill sets, infrastructure support and other issues.

Systems management is a critical part of any virtual environment, allowing busy administrators to handle an increasing number of virtual workloads and ensure their proper performance over time. But the tools themselves, and the difficulty of using them efficiently, can sometimes make management more complicated or less productive than it should be. In this Q&A, systems management expert Ian Parker shares his insights on systems management today.

What are the biggest challenges you face with systems management?

Ian Parker: I believe the two biggest challenges today are the premise of a single pane of glass for overall system status, and questionable analysis of the systems management data. Hardware vendors have their proprietary tools that often hook into specific hardware. By comparison, third-party tool providers sell tools that can collect a lot of data but may not have the ability to see every aspect of the hardware. Often, the third-party products operate at more of an operating system (OS) level, while the vendors have agent-based tools that run at the OS kernel level or below. So the problems for me are: How do I get my data in one place, how do I know that data is reliable and how do I make sense of what data I have?

Why are these challenges so important? Is it a matter of the tool, the experience, the infrastructure or some other factor?

Parker: It really involves all of those issues. Remember that the tools vary widely from one hardware vendor to another, and proficiency with one does not necessarily translate well to another product. The large third-party tools may have the ability to aggregate and interoperate with vendor tools, but [they] are typically a large investment in cash, time and integration into your environment.

Consider the staff issues. IT staff experienced in such products tend to be rare, expensive and sometimes difficult to retain – it's one of those occupational specialties that has much more demand than supply. And once you have things set up, who pays attention to them? For example, does the staff sufficiently understand the difference between a SQL server and a Terminal server to know what is most critical to each asset?

Don't even get me started on the reliability and quality of some of the products. Many vendors are at fault here. I've used one third-party product that does Windows server performance monitoring via Windows Management Instrumentation (WMI). The WMI queries it executes as it polls each server spike the CPU high enough and long enough that the software generates a performance alert about its own polling process.
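One way to reason about the self-triggered alert Parker describes is to check whether a sustained CPU breach overlaps the monitoring tool's own polling window. The following is a minimal sketch of that idea, not how any particular product works; the `Sample` type, the function names, the 90% threshold and the 30-second duration are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    t: float    # seconds since the start of the observation window
    cpu: float  # CPU utilization in percent


def breaching_runs(samples, threshold, min_duration_s):
    """Return (start, end) spans where CPU stayed at or above the threshold
    for at least min_duration_s. Samples are assumed sorted by time."""
    runs = []
    start = last = None
    for s in samples:
        if s.cpu >= threshold:
            if start is None:
                start = s.t
            last = s.t
        else:
            if start is not None and last - start >= min_duration_s:
                runs.append((start, last))
            start = None
    if start is not None and last - start >= min_duration_s:
        runs.append((start, last))
    return runs


def overlaps_poll(run, poll_windows):
    """True if the breach span intersects any (start, end) polling window."""
    start, end = run
    return any(start < p_end and p_start < end for p_start, p_end in poll_windows)


def classify_alerts(samples, poll_windows, threshold=90.0, min_duration_s=30.0):
    """Label each sustained breach by whether it coincides with polling."""
    return [
        (run, "possibly self-induced" if overlaps_poll(run, poll_windows) else "unexplained")
        for run in breaching_runs(samples, threshold, min_duration_s)
    ]


# Hypothetical data: one sample every 10 seconds, with one poll from t=0 to t=45.
samples = [Sample(t, 95.0 if (t <= 40 or t >= 80) else 20.0) for t in range(0, 130, 10)]
for run, label in classify_alerts(samples, poll_windows=[(0, 45)]):
    print(run, label)
```

A label of "possibly self-induced" is only a hint that the polling cost deserves a closer look; a real workload can spike during a poll, too.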

Infrastructure is a whole other question. If the environment is virtualized, you must decide whether to monitor at the virtual machine or hypervisor level. Worse, the hypervisor tools seem a long way from maturity – and then something like the ESX/ESXi service console migration might completely break your tools because ESXi no longer uses the Linux service console to perform some management functions like running scripts or installing agents. Infrastructure issues are even more problematic for cloud deployments. For example, it might be impossible to gain visibility into assets in the cloud that I might only be leasing or renting, or the provider might be running an OS that is too old for my newer tools to understand. I also have to deal with latency and correlation issues between those cloud assets and my non-cloud assets.

And there are other infrastructure concerns, like the need for database servers, licensing expenses, and the cost implications of multiple redundant instances. Remember that monitoring tools produce a great deal of data that also has to be stored, backed up and managed. There really is a lot to think about when planning for systems management.

What does a data center infrastructure need to support system management tools? Are there any particular storage, network, server hardware or other resources needed to best support a systems management tool?

Parker: Systems management requires assets, and those assets are going to cost you money. These assets may include SQL licensing, storage, machines to run consoles, and the software and OS licensing for them. The list can get a lot longer depending on the complexity of your data center and the tool you want to use.

In my opinion, systems management for performance and capacity planning only becomes truly useful when you have a decent amount of historical data to work with. That means significant amounts of storage. It might also demand a network infrastructure upgrade to handle the added systems management traffic.
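To get a feel for how quickly historical data adds up, a back-of-the-envelope estimate can help. The sketch below multiplies out the number of samples collected over a retention period; every figure in it (200 servers, 50 metrics each, a 60-second polling interval, one year of retention, 16 bytes per sample) is an assumption chosen for illustration, and it ignores indexes, compression, replication and backups.

```python
def estimate_metric_storage_bytes(servers, metrics_per_server,
                                  poll_interval_s, retention_days,
                                  bytes_per_sample):
    """Rough raw-storage estimate for retained monitoring samples."""
    samples_per_day = 86_400 // poll_interval_s  # seconds in a day / interval
    total_samples = servers * metrics_per_server * samples_per_day * retention_days
    return total_samples * bytes_per_sample


# Hypothetical environment; substitute your own numbers.
total = estimate_metric_storage_bytes(
    servers=200,
    metrics_per_server=50,
    poll_interval_s=60,
    retention_days=365,
    bytes_per_sample=16,
)
print(f"{total / 1e9:.1f} GB of raw samples")  # prints "84.1 GB of raw samples"
```

Halving the polling interval or doubling the retention period doubles the result, so those two settings are usually the first ones to revisit when storage runs short.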

All of this might need to be coordinated between multiple teams in a data center, each with its own agenda that might not be fully compatible with other teams. For example, do the server guys really want the network guys to have insight into their utilization (and vice versa)? The most frequent way companies in need of reliability and performance handle systems management is to over-buy assets. They accept the idea of "wasting" some money to guarantee the levels they need. And these days it always seems like storage is the overlooked or underestimated resource.

What kind of background or skill set should a systems management administrator have? What skills would be helpful (but not critical)?

Parker: Performance metrics for database servers are very different from those for desktops or file servers, and OS performance can vary a lot. So it helps if IT staff have a working knowledge of what they’re managing. It also depends on your goals; skill sets will vary depending on whether you’re monitoring basic uptime and health, performance level or service-level agreement goals, configuration management, provisioning, and so on.

Staff members with broad, conceptual computer science knowledge (things like hardware and OS design) are often best. However, some environments might also have very specific skill needs. For example, you might need a VMware hypervisor guru if you are heavily virtualized. This is why there is a shortage of the “right” IT people; good IT generalists are hard to find. The person who likes to keep current on a large variety of subjects is much different from someone who prefers specialization. Too many organizations undervalue a strong generalist.

Most organizations aren't really doing systems management well. If they think they are, they are often fooling themselves. For example, if someone calls to complain about an asset, you can go look at something and mostly figure out what is wrong, but that isn't good systems management – that's systems diagnosis. Your management process probably failed if an end user had to complain before you found an issue.

How do you get the most from your system management tools? Are there any tricks or practices that you can share?

Parker: Define what you really need, what you want, and what would be nice to have. Then be realistic about time and expense. It sounds cliché but it's almost always true. If systems management is important to your organization, accept the fact that it is going to cost time and money, and proceed accordingly.

Many organizations don't place a priority on systems management, instead choosing to stay reactive and just fix things when they break. If that is working for you, that's fine; just don't fool yourself about your real priorities. For example, cost control might be more important, or systems management may be important but you don't have the budget or manpower to really address it.

The reality of systems management is that it is a complex process. Successful organizations don’t simply "set it up and forget it" – it's an ongoing, evolving, constantly changing thing. And the resulting data can be your friend, so pay attention to it, mine it and understand that it will change over time.

One last note: Be careful about promising return on investment (ROI) to stakeholders. When systems management is done right, its payoff is difficult to quantify. Remember that the ROI is actually the absence of problems or the flexibility and agility to meet new computing demands.
