Having uninterrupted clean power for the critical load is the objective of every data center. To achieve this goal, the systems in the power path must be properly maintained and tested. Ideally, this would be done without interrupting or potentially exposing the critical load to power loss.
However, some senior managers see maintenance as a disruptive, (un)necessary evil and expense. This is especially true in today’s economic climate, where every expense is examined to see if it can be reduced or eliminated. Nonetheless, periodic maintenance is required to achieve the projected level of equipment reliability and critical load uptime. Of course, this requires that some level of redundancy be built into the power chain to allow for concurrent operation during maintenance (i.e., tier 2-4).
The higher the level of power system redundancy (N+1, 2N or S+S, corresponding to tier levels 2-4), the lower the probability that power to the critical load will need to be interrupted during scheduled maintenance procedures. However, redundant equipment is meaningless unless it is properly maintained and tested. Improper procedures and human error have caused outages, even in tier 3- and 4-level systems.
Assuming there is redundancy available to allow for maintenance, let’s examine these key components and best practices for backup power maintenance.
- Main utility power panel
- Generator paralleling switchgear
- Automatic transfer switch (ATS)
- Main power distribution panel
- Maintenance bypass panel (MBP) for the UPS
- Uninterruptible power supply (UPS)
- Battery plant or energy storage for the UPS
- Load testing
- Planning, documentation, training and supervision
Main utility power panel
The main utility power panel is the first panel in the data center power path. At the utility service entrance, the utility hands off the power to the entire facility. Although this panel is normally untouched during operation, it should be visually and thermally inspected quarterly or semiannually, and no less than annually.
Backup generators
The need to constantly test and maintain backup generators is well recognized by data center facilities managers. In many cases, there is an automated weekly generator exercise routine initiated by the ATS. It is also imperative that all staff be apprised and immediately available during any scheduled maintenance or test. Virtually any type of testing requires constant supervision. For example, starting a generator and an ATS load transfer test, and then moving on to other tasks (or going out to lunch) once the initial load transfer succeeds, is poor practice and invites exposure to failure.
While it may be boring to stand around for 30 to 60 minutes just looking at a running generator, it is a good time to listen for unusual sounds and inspect the generator for fluid leaks. It is also good practice to take some voltage and current measurements, as well as rpm and frequency readings. Observe and record oil pressure and temperature gauges and also scan specific areas of the motor-generator with a hand-held IR thermometer or thermal scanner. By recording these readings, you will have a baseline and running record for reference that can be analyzed. You can also use the readings to help monitor for any problems and facilitate preventive service on the suspect areas. Maintenance schedules, such as oil and filter changes, are based on run-time hours, as well as periodic intervals, and are usually prescribed by the engine manufacturer. In addition, diesel fuel should be checked for quality semi-annually, or even more frequently, when warranted.
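The value of a running record can be illustrated with a minimal sketch. It assumes the readings are logged as simple dictionaries. The field names, sample values and 5% tolerance are hypothetical placeholders, not figures from any generator manufacturer.

```python
from statistics import mean

# Hypothetical log of readings taken during earlier generator runs.
# Field names and values are illustrative only.
history = [
    {"oil_pressure_psi": 58.0, "coolant_temp_f": 182.0, "frequency_hz": 60.0},
    {"oil_pressure_psi": 57.5, "coolant_temp_f": 184.0, "frequency_hz": 60.1},
    {"oil_pressure_psi": 58.5, "coolant_temp_f": 183.0, "frequency_hz": 59.9},
]

def baseline(readings):
    """Average each measured quantity across all earlier runs."""
    return {key: mean(r[key] for r in readings) for key in readings[0]}

def deviations(current, base, tolerance_pct=5.0):
    """Return quantities that differ from baseline by more than tolerance_pct."""
    flagged = {}
    for key, expected in base.items():
        change_pct = (current[key] - expected) / expected * 100.0
        if abs(change_pct) > tolerance_pct:
            flagged[key] = round(change_pct, 1)
    return flagged

# Today's readings: oil pressure has dropped noticeably.
today = {"oil_pressure_psi": 49.0, "coolant_temp_f": 185.0, "frequency_hz": 60.0}
print(deviations(today, baseline(history)))
```

Here only the oil pressure would be flagged, which is the kind of early signal that can prompt preventive service before the next utility outage.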
Generator paralleling switchgear
In larger sites with multiple generators, paralleling switchgear is required. This extra equipment increases the complexity of the data center’s backup power system, as the generator synchronization controls and paralleling switchgear require special attention. Ensuring that the sync controls are working correctly is critical, and regular testing and inspections should coincide with the generator’s physical maintenance. If the generators are not all synchronized -- rotating at exactly the same rpm and in phase with each other -- the load cannot be transferred to the generator array. The data center may go down even if some (or even all) of the generators are running, simply because they are not in sync.
Of course, some components of the sync controls are part of the systems mounted on the generators, and their upkeep must therefore be coordinated with the generator maintenance program. It is very common for the generator, ATS and paralleling gear to be maintained by the same vendor. In addition to attention to the specialized requirements of the sync controls, regular visual and thermal inspections are recommended, as with other switchgear.
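The synchronization condition can be sketched as a simple permissive check. This is an illustration only: the `Source` type and the voltage, frequency and phase tolerances below are assumptions made for the example, and real limits come from the switchgear and generator manufacturers.

```python
from dataclasses import dataclass

@dataclass
class Source:
    voltage: float     # RMS volts
    frequency: float   # hertz
    phase_deg: float   # phase angle in degrees

def phase_difference(a_deg: float, b_deg: float) -> float:
    """Smallest absolute angle between two phase angles (0 to 180 degrees)."""
    diff = abs(a_deg - b_deg) % 360.0
    return min(diff, 360.0 - diff)

def in_sync(gen: Source, bus: Source,
            max_volt_pct: float = 5.0,     # assumed tolerance
            max_freq_hz: float = 0.2,      # assumed tolerance
            max_phase_deg: float = 10.0    # assumed tolerance
            ) -> bool:
    """True only if voltage, frequency and phase all match within tolerance."""
    volt_ok = abs(gen.voltage - bus.voltage) / bus.voltage * 100.0 <= max_volt_pct
    freq_ok = abs(gen.frequency - bus.frequency) <= max_freq_hz
    phase_ok = phase_difference(gen.phase_deg, bus.phase_deg) <= max_phase_deg
    return volt_ok and freq_ok and phase_ok

bus = Source(voltage=480.0, frequency=60.0, phase_deg=2.0)
print(in_sync(Source(478.0, 60.05, 355.0), bus))  # phase wraps around 0 degrees
print(in_sync(Source(480.0, 59.5, 2.0), bus))     # frequency too far off
```

A generator that is running but fails any one of the three checks cannot be paralleled, which is why a running generator is not the same thing as an available one.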
Automatic transfer switch (ATS)
Note that unlike most types of switchgear, which typically remain in static positions and untouched during their service life, ATS equipment is far more frequently used to make, break and transfer power under load. Therefore, it must be closely watched to see if the contacts need to be serviced or replaced. Every time an ATS effects a power transfer, it essentially “uses up” the contacts through the arcing caused by making and breaking high-energy circuits. In most cases, the ATS gear must be disassembled to examine or replace the contacts. The electromechanical transfer mechanism must also be serviced to make sure it can move freely and is free from contaminants.
For complete maintenance, the ATS needs to be de-energized. The ATS also needs to have a functional isolation (bypass) path, either internally or externally, to allow for uninterrupted power to the load during maintenance. Not all ATS installations have this feature; those that don’t have it require that power be interrupted for ATS servicing. The ATS bypass must be part of the original design requirements to ensure that it can be serviced without interrupting power to the load. The ATS should be inspected quarterly or semiannually, and maintained annually.
Note that some data centers will operate on generator power during UPS or battery bypass operations to avoid possible exposure to a utility outage during maintenance, as there would be no UPS power available to provide ride-through until the generator has started and is ready to accept the load.
In addition to the major equipment categories listed above, larger sites with A-B power systems (2N or S+S) may also have one or more “tie” circuit breakers. These breakers allow a power source to be transferred to the alternate A-B side and permit concurrent operation during maintenance. This is normally done “hot-to-hot” (both sides are energized and must be in phase) to keep the critical load energized during the power source transfer. There can be multiple ties located at different points in the electrical system, such as both before and after the ATS and even downstream of the UPS, depending on the level and type of design redundancy. This allows different sections of the power path to be separately bypassed or shut down, while still permitting delivery of both sides of the A-B power to the racks. However, to prevent a system outage, it is extremely important to ensure that these tie circuit breakers are operated only in the proper sequence and only by authorized, fully trained personnel. Normally, tie breaker handles are kept locked to prevent unauthorized or out-of-sequence operation.
Main power distribution panel
After power has passed through the ATS, it goes into the main power distribution panel. Typically, this panel feeds the UPS and cooling equipment, as well as lighting and other data center systems. Like the main utility panel, it is normally untouched during typical operation, and it should be visually and thermally inspected annually (at a minimum).
Maintenance bypass panel (MBP) for the UPS
Power into and out of the UPS passes through the MBP and out to the critical load, so it is extremely important that it is visually and thermally inspected. Sometimes, in smaller data center sites, external MBPs are not installed, either to lower initial UPS purchase and installation costs or because someone assumed that since the UPS already had an internal bypass, they would not need to also purchase an external bypass panel.
Unfortunately, this assumption is a fairly common occurrence for smaller sites, and it has major consequences if the UPS needs to be de-energized or replaced. These same smaller sites also usually only have a single UPS, so they are forced to cut power to the critical load in the event the UPS needs to be de-energized.
In many cases, the MBP is matched to the UPS and manufactured and installed by the UPS vendor. These matching MBPs can also be equipped with Kirk Key Interlocks and can interactively communicate with the UPS controls to prevent mis-operation. They are usually also covered and maintained under the same UPS service contract. A written procedure and clear understanding of how to operate the MBP should be imparted to key site personnel to help avoid a problem should the need arise to safely bypass the UPS.
Uninterruptible power supply (UPS)
Internal systems are electrically checked and visually and thermally inspected. Factory-trained service technicians may also run diagnostics. In some cases, the UPS can be placed in internal bypass, while other tests or maintenance procedures require that the UPS be de-energized and externally bypassed via the MBP. In either case, the critical load is then exposed to utility failure unless there is a redundant UPS. As noted above, some data centers will operate on a generator during UPS or battery bypass operations to avoid exposure to a utility outage. Physical maintenance, such as cleaning the UPS fans and changing or cleaning the air filters, is also performed. This is typically done semi-annually, but should be done annually at a minimum.
Battery plant or other energy storage for the UPS
For the UPS to support the critical load from the moment the utility fails until backup power arrives from the generator, stored energy must always be instantly available. This energy is most commonly provided by one or more strings of batteries.
Battery banks require regular maintenance and inspection for signs of corrosion, leakage and temperature variations from cell to cell. Each battery is connected to the next in series via a jumper cable, and each cable must be checked to ensure it’s tightly connected and free of corrosion. In a typical 480V battery cabinet, there are forty 12-volt batteries and therefore 80 terminals that need to be inspected. This is in addition to the electrical voltage and internal impedance testing, as well as periodic load testing.
Note that some data centers will operate on a generator during UPS, battery bypass or load testing operations. Running on the generator avoids exposure to a utility outage while there is no UPS power available.
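The inspection count follows from simple arithmetic, sketched below for a single series string. The function name is hypothetical, and the jumper count assumes one unbroken series string with two terminals per battery, which may not match every cabinet layout.

```python
def string_inventory(string_voltage: int, battery_voltage: int,
                     terminals_per_battery: int = 2) -> dict:
    """Count the batteries, terminals and series jumpers in one battery string."""
    batteries, remainder = divmod(string_voltage, battery_voltage)
    if remainder:
        raise ValueError("string voltage is not a whole multiple of battery voltage")
    return {
        "batteries": batteries,
        "terminals": batteries * terminals_per_battery,
        # Assumes a single series string: one jumper between each adjacent pair.
        "series_jumpers": batteries - 1,
    }

# The 480V cabinet of 12-volt batteries described above.
print(string_inventory(480, 12))
```

For the 480V example this gives 40 batteries and 80 terminals, matching the figures above, plus 39 battery-to-battery jumpers under the single-string assumption.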
Many larger sites have dedicated battery-monitoring systems that can monitor each battery individually, not just the entire string. This is useful for detecting early signs that an individual battery is deteriorating and endangering the integrity of the entire string.
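As a rough illustration of per-battery monitoring, the sketch below flags any battery whose internal impedance stands out from the rest of its string. The readings and the 25% threshold are invented for the example; actual alarm limits depend on the battery type and the monitoring vendor’s guidance.

```python
from statistics import median

def weak_batteries(impedances_mohm, rise_pct=25.0):
    """Return (position, reading) pairs for batteries whose impedance
    exceeds the string median by more than rise_pct percent."""
    limit = median(impedances_mohm) * (1 + rise_pct / 100.0)
    return [(pos, z) for pos, z in enumerate(impedances_mohm, start=1) if z > limit]

# Hypothetical impedance readings in milliohms for a short six-battery string.
readings = [4.1, 4.0, 4.2, 6.3, 4.1, 4.0]
print(weak_batteries(readings))
```

Comparing each unit against the string median, instead of against a fixed number, lets a single deteriorating battery stand out even as the whole string ages together.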
While other forms of short-term energy storage are also used in the data center, such as the flywheel or the so-called “rotary UPS,” their maintenance is primarily mechanical in nature and varies according to each manufacturer’s recommendations.
Batteries need maintenance, testing and replacement more than any other power-related component. Depending on the type of battery -- VRLA, wet cell or NiCad -- testing should be done quarterly or semi-annually, and annually at the very minimum. Unless there is an allotted budget for the procedure, it is often deferred or ignored. It is worth noting that, statistically speaking, battery failure is the most common cause of downtime other than human error.
Load testing
Load testing is usually performed at the initial commissioning of the data center. Typically, it covers all the critical areas in the power path described above. However, once a site is operational, it is difficult to perform load testing without interrupting power, unless it is a tier 3- or 4-level facility. Opinions on the necessity for continued load testing are mixed. Purists will insist that it should be performed regularly. Some larger sites even have load banks onsite and they may be pre-wired into key points in the electrical system.
Other data center operators see load testing as unnecessary and, under normal conditions, as an additional exposure to failure that should be undertaken only if a piece of equipment is suspect or has been replaced. This is especially true for smaller tier 1- and 2-type sites, where the load banks need to be rented and temporarily wired into panels. Of course, in those cases, the critical load must have another source of power, with switchgear already in place to bridge the power without dropping the load, or the critical load must be shut down during the load test.
One of the more debated issues is runtime testing the battery banks, either directly or while powering the load bank from the UPS, because each full runtime discharge diminishes the working life and capacity of the cells. Even after a successful load test, a single cell can fail the next day, and if utility power is lost, the critical load will be dropped. The only way to mitigate this potential exposure is by having multiple battery strings.
Planning, documentation, training and supervision
Needless to say, this article provides only a top-level view of data center backup power maintenance issues. Actual maintenance procedures vary according to each manufacturer’s service recommendations and requirements and should only be performed by properly trained service personnel. Moreover, key data center staff, such as shift supervisors, should also observe normal maintenance that is performed by outside vendors and in-house technical resources to ensure procedures are followed. Staff should be familiar with, and even able to perform, some basic and emergency procedures, such as manual operation of equipment, starting the generator, ATS power transfers and operation of the UPS bypass gear.
These procedures should be well documented, reviewed and updated as needed. Equipment vendors or service personnel should conduct training, as well as semi-annual or annual refresher courses. In fact, the ability of in-house staff to properly operate critical bypass gear manually may help avert a downtime incident.
Moreover, properly documented, detailed procedures and supervision by on-site personnel may avert a total data center shutdown. This can happen when on-site staff must step in to stop improper maintenance by new service personnel who are not fully familiar with the site’s equipment and systems. Emergency procedure documents should be readily available and accessible to key personnel. They should contain clearly labeled photos of the equipment controls, along with instructions on the exact sequence of operation and emergency use. Also consider having one- to two-page emergency procedure cards that can be posted at or near the UPS-MBP and that also include information for manually operating the ATS.
The quality and frequency of maintenance is sometimes based on the size of the data center and facilities department. Facilities staff are often far more sophisticated if the organization is running a dedicated data center. Alternatively, a facilities department supporting a 2,000 square-foot data center in a large mixed-use building may not be as sensitive to some of these specialized data center requirements, because the emphasis and expectations are more often based on the building's systems. The overall culture and training level of the facilities staff makes a huge difference. Also, because many maintenance procedures are contracted out to either the equipment manufacturers or one or more service or sub-contractors, it is imperative that someone from the organization’s own management team be aware of the scheduling, what work is being performed and by whom, and who is supervising it.
Each data center site may differ in its types of equipment and maintenance requirements, yet all sites need preventive services that don’t affect the operation of the IT equipment. Some managers try to avoid full failover testing and major maintenance of critical systems because something could go wrong. This simply trades the limited (and presumably avoidable) known risk on the planned maintenance day for unknown exposure on the other 364 days of the year.
By avoiding or deferring maintenance, IT personnel could be exposing the data center to downtime from a variety of malfunctions that went unnoticed while on normal power, but that cause a failure during a utility outage. Proper training, planning, supervision and documentation of maintenance procedures, as well as upper management support, are crucial to ensuring that a normal scheduled event doesn’t turn into a downtime debacle.
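One way to keep management aware of the scheduling is a simple overdue-task check, sketched below. The intervals are the longest ones this article treats as acceptable for each item, expressed in days. The task labels and the half-year approximation of 182 days are assumptions for the example, and actual intervals should follow each manufacturer’s recommendations.

```python
from datetime import date, timedelta

# Longest acceptable interval per task, in days (labels are illustrative).
MAX_INTERVAL_DAYS = {
    "main utility panel inspection": 365,
    "generator exercise": 7,
    "diesel fuel quality check": 182,
    "ATS inspection": 182,
    "ATS maintenance": 365,
    "main distribution panel inspection": 365,
    "UPS physical maintenance": 365,
    "battery testing": 365,
}

def overdue(last_done: dict, today: date) -> list:
    """List tasks never recorded or last completed longer ago than allowed."""
    late = []
    for task, limit in MAX_INTERVAL_DAYS.items():
        done = last_done.get(task)
        if done is None or today - done > timedelta(days=limit):
            late.append(task)
    return late

# Hypothetical maintenance log: everything was done recently except two items.
today = date(2024, 7, 1)
log = {task: date(2024, 6, 28) for task in MAX_INTERVAL_DAYS}
log["generator exercise"] = date(2024, 6, 20)   # 11 days ago
del log["battery testing"]                      # no record at all
print(overdue(log, today))
```

Treating a missing record as overdue is a deliberate choice here: work that was never logged is indistinguishable from work that was never done.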
About the author: Julius Neudorfer is the CTO and founding principal of NAAT. Neudorfer has designed and managed communications and data systems projects for both commercial clients and government customers.