Best practices for maintaining a data center CRAC unit

A CRAC unit failure can often be avoided through preventive maintenance and correctly sized cooling infrastructure.

Data center air conditioners rarely fail in the middle of winter -- Murphy has proven time and again that the failure will occur on a hot summer day. But no matter when the cooling system quits, the data center will experience a temperature increase that can threaten sensitive servers and other equipment. And if cooling capacity is running so close to the edge that you can't shut down a single computer room air conditioner (CRAC) unit for maintenance, you're on a guaranteed road to trouble! This tip reviews several best practices that are important for maintaining a CRAC unit. Before we go any further, it's worth noting that computer room air conditioners of all types are commonly called CRACs, but technically a chilled-water unit is a computer room air handler (CRAH).

Cooling too critical to leave to chance

Cooling has become such a critical element in our data centers that it's just as important to maintain a CRAC unit as it is to design cooling correctly in the first place. The substantial investment made in cooling equipment, as well as in the computing hardware it sustains, should make preventive maintenance a priority, but it's often not. These days we're trying to "right size" everything in pursuit of energy efficiency, which makes each piece of equipment more critical and reduces error margins. More commonly, though, equipment growth has taxed capacity, and there's fear of shutting anything down for preventive maintenance. Even worse, maintenance contracts are sometimes considered overly expensive, since their cost over a period of years can equal the price of a replacement CRAC unit. Alternatively, CRAC service is relegated to the facilities staff, with no checklist of what should be examined, adjusted and replaced, or how often. In short, instead of the relative simplicity of a maintenance call, a cooling failure can become a major repair shutdown when preventive maintenance has been inadequate, or there has been no maintenance at all.

The fear of intentional cooling shutdown

Let's first attack the concerns about short-term temperature rises. ASHRAE TC 9.9 expanded the thermal envelope in 2008, confirming that equipment could be operated continuously with inlet temperatures up to 27 degrees Celsius (80.6 degrees Fahrenheit), and as high as 32 degrees Celsius (89.6 degrees Fahrenheit) for several days, without harming the equipment or voiding warranties. These numbers were agreed to by all the major manufacturers. Since most data centers are still trying to keep equipment cooler than necessary, it is likely that even in a facility with marginal cooling or non-redundant equipment, an individual CRAC unit could be shut off for the few hours needed to do good preventive maintenance without exceeding temperature limits. It's far better to shut down intentionally for a few hours on a day when outside air temperature is not excessive than to lose a CRAC unit catastrophically and be without air conditioning for days, or even weeks, in the hottest part of the year. ASHRAE also defines "rate of temperature rise" limits, which will be covered in a later tip. If a maintenance shutdown causes heat to increase more rapidly than ASHRAE recommends, that should be a good indication that a professional cooling assessment is needed.
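As a rough illustration of how inlet readings taken during a planned shutdown could be compared with the two limits cited above, here is a minimal Python sketch. The function names and the sample readings are hypothetical, and no rate-of-rise limit is hard-coded, because that figure is not given in this tip; the sketch only computes the observed rate so it can be compared with whatever limit applies.

```python
from datetime import datetime

# Limits quoted in the paragraph above (server inlet air temperature).
RECOMMENDED_MAX_C = 27.0  # continuous operation
ALLOWABLE_MAX_C = 32.0    # tolerable for several days


def classify_inlet(temp_c):
    """Classify one inlet temperature against the two limits above."""
    if temp_c <= RECOMMENDED_MAX_C:
        return "ok"
    if temp_c <= ALLOWABLE_MAX_C:
        return "short-term only"
    return "over limit"


def rise_rate_c_per_hour(readings):
    """Average rise in degrees C per hour between the first and last reading.

    readings is a list of (datetime, temp_c) tuples in time order.
    """
    (t0, c0), (t1, c1) = readings[0], readings[-1]
    hours = (t1 - t0).total_seconds() / 3600.0
    if hours <= 0:
        raise ValueError("readings must span a positive time interval")
    return (c1 - c0) / hours


# Hypothetical readings at the warmest rack while one unit is shut off.
readings = [
    (datetime(2024, 5, 14, 9, 0), 22.0),
    (datetime(2024, 5, 14, 10, 0), 24.5),
    (datetime(2024, 5, 14, 11, 0), 26.0),
]
print(classify_inlet(readings[-1][1]))  # ok
print(rise_rate_c_per_hour(readings))   # 2.0
```

The classification is applied to the warmest inlet rather than an average, since one hot rack can exceed the limits while the room average still looks fine.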

While we're discussing operating parameters, let's not forget the most-overlooked item in cooling maintenance -- set points. All air conditioners should be checked to ensure they are maintaining the same temperature and humidity levels, with operational readings recorded for all units. When set points differ from unit to unit, air conditioners can fight each other, using a lot of energy while actually degrading cooling. Experimenting with changes in sensor placement can also help achieve uniform control. An often overlooked fact is that factory locations are not necessarily the best. Over time, variations in temperature or humidity can also indicate faulty sensors or changes in equipment installation patterns, which makes it difficult for the units to maintain a proper environment. Consideration should also be given to increasing set points per the ASHRAE guidelines, but only if inlet temperatures can be maintained within ASHRAE limits at the servers with the highest inlet air temperatures. This could improve cooling efficiency and reduce wear on the air conditioners.
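A set point audit can be as simple as comparing each unit's recorded settings with the rest of the room. The Python sketch below is one way to do it; the unit names, the readings and the tolerance values are all assumptions for illustration, not manufacturer or ASHRAE figures.

```python
from statistics import median


def setpoint_mismatches(units, temp_tol_c=0.5, rh_tol_pct=2.0):
    """Return the units whose set points stray from the room median.

    units maps a unit name to {"temp_c": ..., "rh_pct": ...}.
    The tolerances are illustrative assumptions.
    """
    temp_ref = median(u["temp_c"] for u in units.values())
    rh_ref = median(u["rh_pct"] for u in units.values())
    return sorted(
        name
        for name, u in units.items()
        if abs(u["temp_c"] - temp_ref) > temp_tol_c
        or abs(u["rh_pct"] - rh_ref) > rh_tol_pct
    )


# Hypothetical set points recorded during a maintenance walk-through.
units = {
    "CRAC-1": {"temp_c": 22.0, "rh_pct": 45.0},
    "CRAC-2": {"temp_c": 22.0, "rh_pct": 45.0},
    "CRAC-3": {"temp_c": 20.0, "rh_pct": 45.0},
    "CRAC-4": {"temp_c": 22.0, "rh_pct": 50.0},
}
print(setpoint_mismatches(units))  # ['CRAC-3', 'CRAC-4']
```

The median is used as the reference so that a single misconfigured unit does not shift the baseline it is being compared against.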

What CRAC unit maintenance should entail

The most important maintenance task for any CRAC unit is filter replacement. Dirty filters overwork the motors and also reduce cooling capacity. If filters are getting dirtier than they should between changes, it would be wise to look for the source of the problem. Particulates also accumulate on computing hardware filters and heat sinks, raising their internal temperatures. One of the most common sources of contamination is the storage and/or unpacking of cardboard boxes inside the data center -- an absolute no-no!

Taking care of mechanical items

Which mechanical items should be checked depends on the type of CRAC unit, but if there are belts involved, their tensions should be adjusted quarterly. Belts stretch, and factory parameters should be maintained. Over-tensioning wears both belts and bearings, and under-tensioning results in slippage and reduced performance. Self-tensioned belts may last for five years, but replacing other belts yearly could be a good rule of thumb. In any case, belts should be changed per the manufacturer's recommendations, even if they appear in good condition. It's also important to check motor mounts and pulley set screw tightness. Lubrication of anything that can be oiled or greased should certainly be done, but it's just as important to inspect for leaking lubricants or spattering due to over-lubrication. Clean mechanical systems always run better and last longer.

One of the most overlooked signs of a problem is unusual noise. Operations personnel should be particularly alert to changing sound conditions, since they may be intermittent or change slowly over time, making them easy to get used to. Maintenance technicians may not notice them at all, but they shouldn't be ignored, and could be precursors of bigger problems to come.

The importance of refrigerant levels, electrical testing

Refrigerant levels must be checked at least annually in direct-expansion (DX) units. Falling refrigerant levels could indicate a leak, which should be found and fixed immediately. The proportional valve in chilled-water type air conditioning units (CRAHs) should be checked for proper control and operation.

It is also important to make sure condensate drain lines are not clogged and condensate pumps are working. Depending on conditions, condensation may not form for many months, meaning the pumps sit idle and drains are not flushed. Water should be introduced to ensure proper operation.

Humidifiers should also be checked regularly. Steam canisters may need replacement, or the water pans in infrared humidifiers may have accumulated mineral scales and need cleaning. Ultrasonic humidifiers can also clog if water filters are not replaced regularly. Note that the service cycle for humidifiers will vary with water conditions. A water analysis may help determine how frequently component replacements need to be made.

Another often-overlooked aspect is electrical testing. The fact that a CRAC unit runs doesn't mean all is well. Records should be kept of the current (amperage) being drawn by different components. A motor's rpm should also be recorded, along with the amperage readings. A changing trend in current draw, and/or a motor slowdown, most likely indicates a developing problem that may need in-depth investigation to uncover. Power readings should never be taken without first checking the tightness of electrical connections. Clamp-on meters cause wires to move, and a loose connection on something like a fire sensing wire could end up shutting down the entire data center. Air conditioner power connections should also be part of an annual infrared thermal scan of all electrical systems.
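Recorded amperage readings are only useful if someone looks at the trend. The Python sketch below fits a least-squares line through a series of readings and flags a series that is drifting; the 2 percent-per-reading threshold and the sample values are assumptions chosen for illustration, not a published criterion.

```python
def trend_slope(values):
    """Least-squares slope of values against their index (units per reading)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two readings")
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def drifting(values, max_fraction_per_reading=0.02):
    """True if the fitted trend exceeds a fraction of the first reading.

    The 2 percent default is an illustrative assumption.
    """
    return abs(trend_slope(values)) > max_fraction_per_reading * abs(values[0])


# Hypothetical fan motor current (amps), one reading per maintenance visit.
rising = [10.0, 10.1, 10.4, 10.9, 11.5]
stable = [10.0, 10.1, 9.9, 10.0, 10.1]
print(drifting(rising))  # True
print(drifting(stable))  # False
```

The same check could be run on recorded motor rpm, where a downward slope is the warning sign.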

Making time for external maintenance

Maintenance on the external parts of the cooling plant (chillers, pumps, cooling towers and valves) is a major undertaking that is beyond the scope of this tip and is not normally monitored by IT personnel. But shutdowns in these systems should be carefully scheduled with IT, particularly if components are non-redundant, because they may affect the entire data center cooling plant. Facilities personnel are usually aware of the maintenance requirements for these big components, but the one thing often overlooked is manual valve operation. Shutoff and bypass valves may not be used for years and are often located outdoors. Valve failures, usually due to corrosion, prevent the valves from even being operated. They should be cleaned externally, protected if necessary and cycled periodically to ensure they will work when needed. If necessary, replacements can then be scheduled at times when the data center will be least affected.

In short, vendor maintenance contracts are well worth the cost if they provide thorough inspections and service on a monthly, quarterly, semi-annual and annual schedule. For almost all data centers, repair response time coverage on an eight-hour, five-day basis is sufficient. Temperatures can rise somewhat for several days without real consequence, saving the extra cost of 24/7 maintenance contracts. If internal facilities or a third party is handling service, it should be based on the manufacturer's maintenance procedures. Regardless of who is responsible, IT operations should keep track of when maintenance calls were made, have a copy of all readings, problems found and corrective actions taken, and observe the preventive maintenance work from time to time to ensure that what is expected is actually being done, thoroughly and completely.
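For IT operations staff who want to keep their own record of maintenance calls, a minimal Python sketch of such a log follows. The record fields mirror the items listed above (date of the call, readings, problems found, corrective actions). The task names, the day counts used for "quarterly" and "annual", and the sample visit are assumptions; the manufacturer's schedule should take precedence.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta

# Intervals mentioned in this tip, approximated in days (an assumption).
INTERVALS = {
    "belt tension check": timedelta(days=91),         # quarterly
    "refrigerant level check": timedelta(days=365),   # at least annually
    "infrared electrical scan": timedelta(days=365),  # annually
}


@dataclass
class MaintenanceVisit:
    unit: str
    visited: date
    tasks: list
    readings: dict = field(default_factory=dict)
    problems: list = field(default_factory=list)
    actions: list = field(default_factory=list)


def overdue_tasks(visits, unit, today):
    """Return tasks for one unit that were never logged or are past their interval."""
    last_done = {}
    for visit in visits:
        if visit.unit != unit:
            continue
        for task in visit.tasks:
            if task not in last_done or visit.visited > last_done[task]:
                last_done[task] = visit.visited
    return sorted(
        task
        for task, interval in INTERVALS.items()
        if task not in last_done or today - last_done[task] > interval
    )


# Hypothetical log entry.
visits = [
    MaintenanceVisit(
        unit="CRAC-1",
        visited=date(2024, 1, 10),
        tasks=["belt tension check", "refrigerant level check"],
        readings={"fan_motor_amps": 10.1},
    )
]
print(overdue_tasks(visits, "CRAC-1", date(2024, 6, 1)))
# ['belt tension check', 'infrared electrical scan']
```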

About the author: Robert McFarlane is a principal in charge of data center design for the international consulting firm Shen Milsom & Wilke LLC. McFarlane has spent more than 35 years in communications consulting, has experience in every segment of the data center industry and was a pioneer in developing the field of building cable design. McFarlane also teaches the data center facilities course in the Marist College Institute for Data Center Professional program, is a data center power and cooling expert, is widely published, speaks at many industry seminars and is a corresponding member of ASHRAE TC 9.9, which publishes a wide range of industry guidelines.
