Power, a basic element in every data center, is sometimes taken for granted by server administrators. Furthermore, power loss or poor power quality is a major contributor to data center server downtime. I am not referring to a major utility power failure, but to common pitfalls in how the power distribution system is implemented and managed.
There are several basic, but key, power components in the data center:
1. utility power source(s) and main power panel(s),
2. back-up generator and automatic transfer switch (ATS),
3. uninterruptible power supply (UPS) and maintenance bypass panel,
4. power distribution unit (PDU) (or sub-panel from UPS),
5. rack-level PDU,
6. and servers' internal power supply
In most cases, server administrators are not involved in the design or operation of the first four items. But they do directly control the rack-level PDU and server power supply. A very common cause of power failures occurs at this level.
The case for dual power supplies
Implementing servers with dual power supplies is a common practice in mission-critical environments. Dual power supplies can improve reliability in your data center. However, these servers are sometimes improperly implemented when server administrators attempt to take maximum advantage of the redundancy made possible by having dual power supplies. In some cases, this actually reduces the power redundancy.
In a "perfect" installation, such as a Tier 4 data center, there are two completely independent power paths, each comprised of items 1-6. Each path and the items in the path must be capable of supporting 100% of the entire data center load by itself. This represents true 2N redundancy. 2N redundancy means that there is no single point of failure that will interrupt the operation of the data center equipment.
Of course, not everyone is fortunate enough to operate a Tier 4 data center. While we would all like to have complete power system redundancy, cost usually forces some trade-offs. Ensuring the highest level of system fault tolerance within budget restrictions usually means that although servers have dual power supplies, there are not two completely independent paths for items 1-5.
Server administrators often have a false sense of redundancy
As mentioned earlier, the administrator is directly responsible for installing and managing servers and PDUs at the rack level. There is often only one PDU per rack, hence the redundancy value of dual server power supplies is limited to only the failure of the server power supply itself.
A more common scenario, however, is two rack-level PDUs with (hopefully) each of the servers' power supply cords plugged into a different PDU. This creates a "sense" of redundancy for most administrators. However, this is also where the hidden exposure to power problems starts.
Servers are normally installed and operated with both rack-level PDUs available. When both supplies are active, the dual supplies share the server load at approximately 50% each. When either power supply fails or loses input power, the remaining supply must draw and provide 100% of the load. Therefore, it is best practice to load each PDU so that, even after it takes over the full load, it stays below the trip value of the circuit breaker that protects it.
However, even if each PDU is at only 60% of rated maximum load there is a problem. In fact, even if the PDU has a current meter, most administrators think they have the capacity to add more servers since they are "only" at a 60% power level. But even at 60% the PDUs are overloaded and no one even realizes it.
Why? If one PDU loses input power, or a server power supply fails, 100% of the affected power is drawn from the remaining power supply and the PDU it is plugged into. In the worst case, where one PDU goes down with each PDU at a 60% load, 120% of the PDU's power rating is put on the remaining PDU. The circuit breaker on that PDU or its branch circuit will trip and shut down all equipment in that rack. This is a classic cascade failure. The same scenario would hold if another server or other equipment were added that pushed the load past the tripping point of either PDU.
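The arithmetic behind the cascade can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions that are mine, not part of any vendor tool: both PDUs have the same rating, every device is dual-corded and shares its load evenly, and the breaker is treated as tripping as soon as the load exceeds 100% of the rating (real breakers follow a time/current curve). The function names are hypothetical.

```python
def failover_load_pct(pdu_a_pct: float, pdu_b_pct: float) -> float:
    """Load on the surviving PDU, as a percentage of its rating,
    if the other PDU loses power.

    Assumes equal PDU ratings and that all equipment is dual-corded,
    so the failed PDU's entire load moves to the survivor.
    """
    return pdu_a_pct + pdu_b_pct


def breaker_trips(load_pct: float, trip_pct: float = 100.0) -> bool:
    """Simplified trip model: anything above trip_pct is treated as a trip."""
    return load_pct > trip_pct


if __name__ == "__main__":
    for per_pdu in (40.0, 50.0, 60.0):
        after = failover_load_pct(per_pdu, per_pdu)
        status = "TRIPS" if breaker_trips(after) else "holds"
        print(f"{per_pdu:.0f}% per PDU -> {after:.0f}% after failover ({status})")
```

Note that in this simplified model a 50% load per PDU "holds" only because it ignores the 80% limit discussed in the next section.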
Implementing dual server power supplies properly
The only way to safely implement dual server power supplies with dual rack PDUs is to never exceed 40% of the face-rated value of each rack PDU. A PDU and its feed circuit must always be protected by a circuit breaker, and the UL and NEMA mandated codes allow you to draw no more than 80% of the rated value of the PDU.
For example, you cannot draw more than 16A from a 20A PDU (24A for a 30A circuit). This means that in a dual-PDU rack the entire equipment load should not exceed 16A for the rack. Therefore, each PDU should carry no more than an 8A load in order to avoid a cascade overload.
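The same calculation can be written as a small helper, shown here as a sketch. The function names and the `pdu_count` parameter are my own for illustration; the 80% figure is the limit described above, and the load is assumed to split evenly across the PDUs in the rack.

```python
DERATING_PCT = 80  # usable share of the face rating, per the 80% limit above


def max_rack_load_amps(pdu_rating_amps: float) -> float:
    """Largest total rack load that a single surviving PDU can carry."""
    return pdu_rating_amps * DERATING_PCT / 100


def max_per_pdu_load_amps(pdu_rating_amps: float, pdu_count: int = 2) -> float:
    """Largest normal-operation load per PDU, assuming an even split."""
    return max_rack_load_amps(pdu_rating_amps) / pdu_count


if __name__ == "__main__":
    for rating in (20, 30):
        print(
            f"{rating}A PDU: rack limit {max_rack_load_amps(rating):.1f}A, "
            f"per-PDU limit {max_per_pdu_load_amps(rating):.1f}A"
        )
```

For a 20A PDU this gives the 16A rack limit and 8A per-PDU limit from the example; a 30A circuit works out to 24A and 12A.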
Many racks do not have metered PDUs, either because they're older or there wasn't budget for the extra cost. But as I mentioned earlier, even administrators who do have metered PDUs do not realize that once they go past the 40% power level they are in danger of having a cascade power failure and losing the rack. In addition, because servers are upgraded and added all the time, the exposure may continue to increase without warning until a problem occurs. At this point, everyone involved is baffled because they thought they were "fully redundant."
If you are fortunate enough to have avoided this up to this point, I suggest reviewing your rack level current draw at each PDU. If you do not have a metered PDU, you should consider upgrading to one in the near future. If you have many racks, consider a metered PDU with remote monitoring (via SNMP and/or web) that can send SNMP traps to your management software. This lowers the burden of manually monitoring dozens or hundreds of PDUs. The above applies to virtually all the items in the power path.
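If you already collect current readings, whether by hand or from metered PDUs, a review like the one suggested above can be automated. The sketch below is a minimal example that works on readings you have gathered yourself; it does not talk to any PDU or SNMP agent, and the `PduReading` structure, field names, and sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import List

SAFE_PCT = 40  # per-PDU limit for a dual-PDU rack, as discussed above


@dataclass
class PduReading:
    rack: str
    pdu: str
    rating_amps: float    # face rating of the PDU
    measured_amps: float  # current draw read from the meter


def over_safe_limit(readings: List[PduReading]) -> List[PduReading]:
    """Return the readings whose draw exceeds SAFE_PCT of the PDU rating."""
    return [
        r for r in readings
        if r.measured_amps * 100 > SAFE_PCT * r.rating_amps
    ]


if __name__ == "__main__":
    # Illustrative sample values only, not measurements from a real site.
    sample = [
        PduReading("rack-01", "A", 20, 7.0),
        PduReading("rack-01", "B", 20, 9.5),
        PduReading("rack-02", "A", 30, 12.0),
    ]
    for r in over_safe_limit(sample):
        print(f"{r.rack} PDU {r.pdu}: {r.measured_amps}A on a {r.rating_amps}A PDU")
```

A reading exactly at the 40% limit is not flagged here; tighten the comparison if you want extra headroom.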
The bottom line: if you are implementing redundancy, make sure that each path can sustain 100% of the load if the other path fails. Review your load structure, and proactively monitor and manage the load levels on all PDUs and all the other elements of your power path. Changing out PDUs can involve some downtime. But like any power path maintenance, some downtime may be required if there is to be true 2N power redundancy. Take your choice: a little planned downtime, or an unplanned surprise shutdown.
ABOUT THE AUTHOR: Julius Neudorfer has been CTO and a founding principal of NAAT since its inception in 1987. He has designed and project managed communications and data systems projects for commercial clients and government customers.