Data center monitoring system considerations
An effective monitoring system can track environmental variables and alert admins to a potential problem before it becomes a critical emergency.
Data center monitoring often focuses on computing: monitoring system performance, tracking virtual workloads, and reacting to the inevitable warnings and alerts that spell trouble for servers, networks or storage within the architecture. But modern data centers need a more holistic monitoring strategy that embraces environmental factors like temperature and humidity, not just within the room but at a granular level within racks and servers. Let’s cover some key monitoring points for the environment and show how to deal with environmental monitoring problems.
Aspects of data center environmental monitoring
Many data centers employ sophisticated management tools, but many tools still don’t provide granular insight into environmental conditions; or worse, data center owners simply don’t use the environmental data those tools provide. Part of the problem is heterogeneity. It simply may not be possible to use a single tool that can monitor voltages, fan speeds, temperatures, humidity levels, and other environmental factors across every possible system. In other cases, the availability and placement of necessary environmental sensors may be inadequate for proper monitoring. Yet another part of the problem is a lack of planning and coordination – IT administrators don’t worry about the data center environment as much as they should.
When you’re ready to extend data center monitoring to the environment, take time to consider the following monitoring points:
Sensing and monitoring temperature. One of the most significant results of data center growth is the issue of heat density. It has become much more difficult to manage temperatures on a facility level because rack densities (and corresponding rack heat) may vary widely. As a result, we see hot spots in one zone and cooler spots in another zone. Installing temperature sensors with network connectivity within the data center helps IT administrators look for those hot and cold spots to ensure that all equipment is operating safely. If not, early alerting can allow administrators to boost cooling, shift workloads, or take other pre-emptive action to avert failures.
A good metric to follow is the older ASHRAE recommended temperature range (64.4 to 80.6 degrees Fahrenheit) or the newer guidelines published by ASHRAE Technical Committee (TC) 9.9. Data center best practices recommend at least one sensor on every rack. If an environment has a hot-aisle/cold-aisle configuration, it becomes acceptable to place a sensor on every “hot” rack or row. Because hot air rises, place sensors near the top of the rack, where temperatures are generally highest. Another recommendation is to place sensors near the end of the row, where they can detect spillover, which is hot air entering the cold aisle from the hot aisle.
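The range check described above can be sketched in a few lines of Python. This is a minimal illustration, not a vendor API: the `Reading` class, the rack labels, and the sample temperatures are all hypothetical, and the only values taken from the article are the 64.4 and 80.6 degree limits.

```python
from dataclasses import dataclass

# Recommended range quoted in the article, in degrees Fahrenheit.
RECOMMENDED_LOW_F = 64.4
RECOMMENDED_HIGH_F = 80.6


@dataclass
class Reading:
    rack_id: str    # hypothetical rack label
    position: str   # where the sensor sits, e.g. "top" or "row-end"
    temp_f: float   # temperature in degrees Fahrenheit


def classify(reading: Reading) -> str:
    """Return 'cold', 'ok', or 'hot' relative to the recommended range."""
    if reading.temp_f < RECOMMENDED_LOW_F:
        return "cold"
    if reading.temp_f > RECOMMENDED_HIGH_F:
        return "hot"
    return "ok"


def find_hot_spots(readings: list[Reading]) -> list[str]:
    """Return the sorted IDs of racks with at least one 'hot' reading."""
    return sorted({r.rack_id for r in readings if classify(r) == "hot"})


# Made-up sample data: one sensor near the top of each rack.
readings = [
    Reading("rack-A1", "top", 78.2),
    Reading("rack-A2", "top", 84.5),
    Reading("rack-B1", "top", 62.0),
]
print(find_hot_spots(readings))
```

In practice the thresholds would come from whichever guideline the facility has adopted, and a "hot" result would feed the alerting path rather than a print statement.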
- Establish precision cooling control. With large enterprise data centers, maintaining consistent levels of cooling and room/row air conditions is essential. Deploying intelligent controls, which are sometimes integrated into cooling and monitoring systems, helps data centers run as efficiently as possible. The goal of intelligent control is to allow multiple large systems to complement, rather than compete with, one another. Take humidity control at a large data center as an example. Assume that one unit begins to report a high humidity reading from one of its sensors. Without an intelligent system, that unit’s remediation process may start immediately. With an intelligent cooling system in place, however, the data center monitoring tools will first query the humidity status of all the other units in the facility. If the other units are operating within range, the system will continue to monitor the situation to see if the levels even out. Otherwise, it will send an alert to an administrator or begin a predesigned remediation process.
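The humidity example above amounts to a small decision rule, sketched here in Python. Everything in it is an assumption made for illustration: the unit names, the 60 percent relative humidity limit, and the `Action` values do not come from any particular cooling product.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"            # reporting unit is within range
    WATCH = "watch"          # only this unit is high; keep monitoring
    REMEDIATE = "remediate"  # alert an administrator or start remediation


def decide_humidity_action(unit_id: str,
                           readings: dict[str, float],
                           high_limit: float) -> Action:
    """Decide what to do when `unit_id` reports its humidity reading.

    `readings` maps each cooling unit to its relative humidity in percent.
    """
    if readings[unit_id] <= high_limit:
        return Action.NONE

    # Query every other unit before letting one unit act on its own.
    others = [rh for name, rh in readings.items() if name != unit_id]

    # If all other units are in range, the reading may be local or
    # transient, so keep watching to see whether levels even out.
    if others and all(rh <= high_limit for rh in others):
        return Action.WATCH

    # Other units are high too (or there is nothing to compare against).
    return Action.REMEDIATE


# Hypothetical readings in percent relative humidity.
readings = {"crac-1": 68.0, "crac-2": 45.0, "crac-3": 47.5}
print(decide_humidity_action("crac-1", readings, high_limit=60.0))
```

A production controller would also add a time limit to the watch state, so that a reading that never evens out eventually escalates.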
Fluid and humidity detection. A single chiller leak inside a data center can cause thousands, if not millions, of dollars in damage to the facility and critical business hardware. This type of damage deals a serious blow to enterprise functionality and productivity. Use leak detection sensors strategically located within the data center to detect leaks, trigger alarms, and help prevent water damage. Leak sensors should be installed at every location where fluids are present in the data center. Depending on the data center environment, leak sensors can operate as a standalone system or connect to the central monitoring system to simplify management. In large environments where cooling areas are numerous, leak and fluid sensors can also monitor for areas of condensation and excess humidity. Including humidity sensors in the internal and external rack sensor array helps maintain consistent humidity levels. Drip pans and designated areas for liquid run-off help curb the risk of a major leak.
Humidity detection can also help detect excessively dry conditions that might precipitate electrostatic discharge (ESD) problems. Dry air is common when free air-side cooling technologies are adopted for the data center.
Integrate the environment with other sensors. Temperature and humidity/liquid sensors are just the beginning of intelligent data center environment monitoring. Smoke/fire alarms are needed at several locations throughout the facility to detect impending fire. While these alarms are usually tied to the building’s fire suppression system, they can also be integrated into the data center monitoring system to provide administrators with an opportunity for early action before more dramatic gas suppression is released.
Monitor power from each power distribution system (PDS) and integrate that data as well. Power monitoring can support a continuous evaluation of the data center’s Power Usage Effectiveness (PUE) and report power faults for early intervention by the IT staff. Some data centers also monitor and integrate data from intelligent uninterruptible power supply (UPS) systems to track UPS battery and alarm conditions.
Room and rack access (security) sensors report unauthorized access, alert IT administrators, and can even summon security assistance if necessary. At a minimum, such simple physical sensors can log door openings and closings to help narrow down the personnel present at the time.
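PUE is the ratio of total facility energy to the energy delivered to IT equipment, so a value closer to 1.0 means less overhead. The sketch below shows the arithmetic in Python. It assumes, for illustration only, that summing the per-PDS readings is a fair approximation of the IT load; the unit names and kilowatt-hour figures are made up.

```python
def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Power Usage Effectiveness: total facility energy / IT equipment energy."""
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    return total_facility_kwh / it_equipment_kwh


# Hypothetical energy readings over the same interval, in kWh.
pds_kwh = {"pds-1": 400.0, "pds-2": 350.0, "pds-3": 250.0}
it_kwh = sum(pds_kwh.values())   # assumed to approximate the IT load
facility_kwh = 1500.0            # utility meter reading for the whole site

print(pue(facility_kwh, it_kwh))
```

Tracking this figure continuously, rather than as a one-off measurement, is what lets a monitoring system show whether cooling or power changes actually reduce overhead.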
- Managing alarms and notifications. Uptime and data center efficiency have been the main justifications for implementing environmental monitoring controls. They remain the main drivers, since viewing immediate notifications of a failure and proactively monitoring a situation to prevent a failure are critical data center tasks. A centralized and well-managed system allows administrators to respond quickly to emergencies and helps maintain higher uptime. Creating a central alarm system is also very important for data center uptime and health. A good alarm system prioritizes issues by criticality to ensure the most serious incidents receive attention first. When setting up an alarm-based system, it is important to evaluate and designate every alarm for its impact on business and IT operations.
- Remote data center monitoring. Large environments often must leverage outside expertise when it comes to data center monitoring. Remote monitoring capabilities can help organizations keep an eye on their secondary or backup environments, or outsource the monitoring and management to a service provider. The ability to see the health of remote facilities can help IT administrators respond to emergencies faster and bring their environments back to a healthy state. By having external visibility into multiple sites, managers can keep track of alerts, alarms and general data center environmental statistics all in one central place.
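Prioritizing alarms by criticality, as described above, can be as simple as sorting on a severity level assigned to each alarm when the system is set up. The Python sketch below assumes a three-level scheme and hypothetical alarm sources; real monitoring products define their own levels and fields.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    INFO = 1
    WARNING = 2
    CRITICAL = 3


@dataclass(frozen=True)
class Alarm:
    source: str         # hypothetical sensor or site name
    message: str
    severity: Severity  # designated in advance by business and IT impact
    timestamp: float    # seconds since the epoch


def prioritize(alarms: list[Alarm]) -> list[Alarm]:
    """Order alarms most severe first; within a level, oldest first."""
    return sorted(alarms, key=lambda a: (-a.severity, a.timestamp))


# Made-up alarms arriving from two sites in a central console.
alarms = [
    Alarm("site-b/door-3", "Rack door opened", Severity.INFO, 100.0),
    Alarm("site-a/leak-1", "Fluid detected under chiller", Severity.CRITICAL, 130.0),
    Alarm("site-a/rack-A2", "Inlet temperature high", Severity.WARNING, 110.0),
    Alarm("site-b/ups-1", "UPS on battery", Severity.CRITICAL, 120.0),
]
for alarm in prioritize(alarms):
    print(alarm.severity.name, alarm.source, alarm.message)
```

Sorting by age within a severity level is one design choice among several; some teams prefer to surface the newest critical alarm first.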
Data center monitoring best practices
It’s important to remember that a data center monitoring infrastructure will require periodic maintenance and testing – just like any other part of the facility. In addition, the monitoring must change or scale to accommodate the data center’s evolution. Don’t ignore the sensors or allow their placement to remain static as other systems and racks move. Here are some other tips for data center environmental monitoring:
- Testing and maintenance. All sensors within a data center should undergo regular testing and maintenance. Faulty or erratic sensors should be replaced immediately. One way to identify a faulty sensor is to review readings from similar nearby sensors. For example, when several sensors within a rack report one temperature but another sensor reports a surprising alarm, the alarm warrants immediate investigation, but it should be approached with a modicum of skepticism until its root cause can be identified and confirmed.
- Be ready for emergencies. Sensors do not prevent emergencies, so common-sense emergency planning should still be part of every data center manager’s agenda. A disaster recovery plan must include immediate personnel notification; know who your data center maintenance team is, and how to reach them quickly. When a cooling failure occurs, your first call will be to your data center HVAC engineers. Be detailed in the description of the problem, too. If your engineers need to bring spare parts, this will help them. When it comes to data center environmental emergencies, every second counts.
- Have a backup plan ready. Monitoring systems can set off different alarm levels. If your data center is in a hosted environment, it is very important to specify and understand emergencies in your service-level agreement. The hosting provider must have a contingency plan prepared in case of a sudden disruption. In a private data center, always keep sensor monitoring and alert systems operational. Cooling systems may warrant local backup units in the event of an emergency, even if this means using temporary portable cooling systems.
- Have an automated recovery plan. Some monitoring systems have integrated automation systems. In the event of an isolated rack emergency, some systems are able to shut off non-essential servers. Development servers are often big power users that don’t need to be run during production. Any test server that is not essential can be set to shut down when emergency conditions arise.
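The cross-check from the testing and maintenance tip, comparing one sensor against similar nearby sensors, can be sketched as a median comparison. This is one simple approach under stated assumptions, not a standard method: the sensor names, the readings, and the 5-degree tolerance are hypothetical.

```python
from statistics import median


def suspect_sensors(readings: dict[str, float],
                    max_deviation: float) -> list[str]:
    """Return sensors whose reading differs from the group median by more
    than `max_deviation`. Needs at least three sensors to be meaningful."""
    if len(readings) < 3:
        return []
    mid = median(readings.values())
    return sorted(name for name, value in readings.items()
                  if abs(value - mid) > max_deviation)


# Made-up temperatures (degrees Fahrenheit) from sensors in one rack.
rack_readings = {"top": 79.0, "middle": 78.5, "bottom": 77.8, "door": 95.2}
print(suspect_sensors(rack_readings, max_deviation=5.0))
```

A flagged sensor is only a suspect. The reading may still reflect a real local hot spot, so the result should prompt investigation rather than automatic dismissal of the alarm.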
As data centers continue to evolve, managers will see more automated tools that help keep an environment running longer and without disruption. Automating and centralizing the management of physical infrastructure components for effective resource usage will be the next step in data center design and implementation. The key will always revolve around strategic uptime capabilities. By proactively monitoring server room environmental variables, IT administrators can greatly reduce their risk of extended downtime. This, in turn, creates a more robust and easier-to-manage data center.
About the author:
Bill Kleyman, MBA, MISM, is an avid technologist with experience in network infrastructure management. His engineering work includes large virtualization deployments as well as business network design and implementation. Currently, he is the Virtualization Architect at MTM Technologies Inc. He previously worked as Director of Technology at World Wide Fittings Inc.