Server hardware degradation is the gradual breakdown of the physical parts of a server.
Degradation problems tend to occur in a few general areas, including power, temperature, management and memory. The electrical components inside servers age over time, and heat sinks and fans become clogged with dust, reducing the server's efficiency and performance.
Server lifecycle management aims to mitigate the effects of hardware degradation by considering how and when servers should be replaced. Previously, IT teams would swap aging servers for new ones about every three years to avoid hardware failure. In the age of widely adopted server virtualization, though, server hardware often stays in production much longer. Clustering technologies, virtualization features such as live migration and improvements in hardware itself are all contributing toward servers living longer than ever. With the possibility of added server longevity, server hardware maintenance becomes even more important.
Common server hardware degradation issues
There are several ways in which server hardware degradation can occur. As a server starts to degrade, an organization might initially notice performance issues. If the problem isn't corrected, the server issue might eventually lead to a hardware malfunction.
Server hardware degradation typically occurs at the component level. Some of the components that are most prone to failure include power supplies, memory and disks.
As its name implies, a server's power supply is responsible for supplying the correct amount of electric power to the server's various components. Although server power supplies are generally reliable, they can and sometimes do fail. The most common cause of power supply failure is overheating. Power supplies have built-in fans that are designed to keep them cool. Over time, though, these fans inevitably draw dust and other contaminants into the power supply. If enough dust accumulates, it can reduce airflow across the power supply's components, causing heat to build up. In extreme cases, dust buildup can even cause the fans themselves to fail, which can in turn lead to a power supply failure.
Power surges and lightning strikes can also destroy a power supply. These events cause the input voltage to spike above the level that the power supply is designed to handle, destroying the power supply and possibly other components in the process.
Dust can also pose problems for a server's central processing unit (CPU). If dust is ingested into a server, it can inhibit airflow and clog fans and heat sinks. This can cause a server's CPUs to overheat. Most modern servers are thermally throttled, meaning that if the server gets too hot, it will force its CPUs to slow down to prevent damage. When this happens, it can produce noticeable performance degradation.
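The throttling behavior described above can be illustrated with a short sketch. The temperature thresholds, clock speeds and the linear slowdown used here are illustrative assumptions, not values taken from any particular server or CPU vendor; real hardware applies its own firmware-defined policy.

```python
def throttled_frequency_mhz(temp_c, base_mhz=3000, min_mhz=800,
                            throttle_start_c=85.0, critical_c=100.0):
    """Return the clock speed a hypothetical thermal governor would allow.

    All default values are assumed for illustration only.
    """
    if temp_c <= throttle_start_c:
        # Cool enough: run at full speed.
        return base_mhz
    if temp_c >= critical_c:
        # At or above the critical point: drop to the minimum speed.
        return min_mhz
    # In between: slow down in proportion to how far past the threshold we are.
    fraction = (temp_c - throttle_start_c) / (critical_c - throttle_start_c)
    return round(base_mhz - fraction * (base_mhz - min_mhz))


for temp in (70.0, 92.5, 100.0):
    print(temp, throttled_frequency_mhz(temp))
```

In this sketch, a CPU whose heat sink is clogged with dust would spend more time above the assumed 85-degree threshold and would therefore run at a reduced clock speed, which is the performance degradation an administrator would notice.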
Memory is another server component that is sometimes affected by degradation, and several factors can contribute to it. When memory degrades, the server can experience noticeable performance issues, data loss or system stability problems.
Memory problems are often attributed to excess dust or vibration. Dust can prevent memory modules from making good contact with the sockets in which they are installed. Similarly, excessive vibration can sometimes cause memory modules to become loose, causing them not to function properly.
Like power supplies and CPUs, server memory can also be damaged by excess heat or power surges.
Storage devices such as hard disk drives (HDDs), solid-state drives (SSDs) and disk arrays are among the components that are most susceptible to degradation. HDDs contain spinning media platters and motorized heads that move across the surface of the disk. Like any other mechanical device with moving parts, HDDs simply wear out over time.
SSDs are also susceptible to wear, but of a different kind. Unlike HDDs, SSDs don't contain any moving parts. Rather than storing data on spinning platters, SSDs retain data in flash memory cells. One of the biggest problems associated with flash storage is that write operations are physically destructive to the media. Each time a cell is erased and rewritten, it degrades slightly, and each cell is rated to endure a specific number of these cycles before it eventually fails. Flash storage vendors use wear leveling, which spreads writes evenly across the available cells, and other technologies to prevent SSDs from wearing out prematurely.
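A minimal sketch can show the idea behind wear leveling. The class name, the block count and the rated cycle count are hypothetical, and real SSD controllers use far more elaborate, vendor-specific schemes; this version simply directs each write to the least-worn block.

```python
class WearLeveledFlash:
    """Toy model of flash blocks with a limited number of write cycles."""

    def __init__(self, num_blocks, rated_cycles):
        # One wear counter per block; every block starts unused.
        self.erase_counts = [0] * num_blocks
        self.rated_cycles = rated_cycles

    def write(self):
        """Write to the least-worn block and return its index."""
        block = min(range(len(self.erase_counts)),
                    key=self.erase_counts.__getitem__)
        if self.erase_counts[block] >= self.rated_cycles:
            # Even the least-worn block is used up, so the device is worn out.
            raise RuntimeError("all blocks have reached their rated endurance")
        self.erase_counts[block] += 1
        return block


flash = WearLeveledFlash(num_blocks=4, rated_cycles=3)
for _ in range(8):
    flash.write()
print(flash.erase_counts)  # wear is spread evenly: [2, 2, 2, 2]
```

Without leveling, repeated writes to the same block would exhaust that block after only three cycles in this model, while the other blocks sat unused.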
Despite mechanisms designed to improve durability and longevity, both SSDs and HDDs wear over time and will eventually fail. Such failures almost always result in data loss, unless the disk is part of a disk array that has been configured to provide redundancy.
Although disk arrays protect against data loss, the failure of a disk within such an array can lead to decreased storage performance if the array uses a parity-based architecture -- such as RAID 5 or RAID 6 -- to safeguard data. When the data center operator replaces the failed disk, the parity information is used to populate the new disk with data. Performance will only return to normal when this rebuilding process is complete.
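The rebuild described above can be sketched with single XOR parity, the principle behind RAID 5. The four-byte blocks and the helper name xor_blocks are made up for illustration. Real arrays work on much larger stripes and rotate parity across the disks, and RAID 6 adds a second, differently computed parity block that is not shown here.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte position at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data disks, each holding one block of a stripe.
data_blocks = [b"AAAA", b"BBBB", b"CCCC"]

# The parity block is the XOR of all data blocks in the stripe.
parity = xor_blocks(data_blocks)

# Simulate losing the second disk, then rebuild its block from the
# surviving data blocks plus the parity block.
survivors = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt)  # b'BBBB'
```

Every block on the replacement disk has to be recomputed this way from reads across all of the surviving disks, which is why storage performance suffers until the rebuild finishes.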
Addressing server hardware degradation
Although server lifecycle management and hardware refreshes are important aspects of preventing server hardware degradation, there are other steps that data center managers can take. For example, data centers are commonly equipped with filtration equipment that is designed to trap dust. This helps to prevent dust from building up in servers and damaging power supplies, CPUs, memory and other components.
Likewise, servers are almost always plugged into uninterruptible power supplies (UPSes). These UPSes contain batteries that keep servers running in the event of a power failure. Most are also designed to act as surge suppressors and help to prevent servers from being damaged by electrical surges. Mission-critical servers also tend to be equipped with redundant power supplies, which allow a server to continue to function even if its primary power supply fails.
Data center operators commonly have protocols in place to protect against data loss and performance degradation related to disk failure. Many data centers, for example, replace disks at predetermined intervals. The goal behind these storage refresh operations is to replace aging disks before they have a chance to fail. Modern data centers also tend to avoid parity-based storage configurations to prevent these storage refresh operations from affecting the server's performance.
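As a rough illustration of an interval-based refresh policy, the sketch below lists the disks that have been in service longer than a chosen limit. The disk names, the installation dates and the five-year interval are all assumptions made for the example; each organization sets its own interval.

```python
from datetime import date


def disks_due_for_refresh(install_dates, today, max_age_days=5 * 365):
    """Return the names of disks in service for at least max_age_days."""
    return sorted(
        disk
        for disk, installed in install_dates.items()
        if (today - installed).days >= max_age_days
    )


# Hypothetical inventory: disk name -> date it was put into service.
inventory = {
    "disk0": date(2018, 1, 1),
    "disk1": date(2023, 6, 1),
}
due = disks_due_for_refresh(inventory, today=date(2024, 1, 1))
print(due)  # ['disk0']
```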
One of the main ways that data center operators guard against server hardware degradation is by monitoring server health. Monitoring software can, for example, detect fans that have failed or CPUs that are suddenly running at a hotter temperature than expected. Similarly, monitoring software can often detect an impending hard disk failure by looking at the disk's SMART (Self-Monitoring, Analysis and Reporting Technology) information.
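The kinds of checks that monitoring software performs can be sketched as follows. The layout of the reading dictionary, the thresholds and the function name are hypothetical; a real monitoring tool would collect these values from the server's management controller and from each disk's SMART data rather than from a hard-coded sample.

```python
def find_health_alerts(reading, max_cpu_temp_c=85.0, max_reallocated_sectors=0):
    """Return a list of alert messages for one hypothetical health reading."""
    alerts = []
    # A fan reporting zero RPM is treated as failed.
    for fan, rpm in reading["fan_rpm"].items():
        if rpm == 0:
            alerts.append(f"fan {fan} has stopped")
    # Flag a CPU that is running hotter than the assumed limit.
    if reading["cpu_temp_c"] > max_cpu_temp_c:
        alerts.append(
            f"CPU temperature {reading['cpu_temp_c']} C exceeds {max_cpu_temp_c} C"
        )
    # A growing count of reallocated sectors is a common SMART warning sign.
    for disk, count in reading["reallocated_sectors"].items():
        if count > max_reallocated_sectors:
            alerts.append(f"disk {disk} reports {count} reallocated sectors")
    return alerts


# Made-up sample reading for a server with one failed fan and one ailing disk.
sample_reading = {
    "fan_rpm": {"fan1": 5200, "fan2": 0},
    "cpu_temp_c": 91.0,
    "reallocated_sectors": {"sda": 0, "sdb": 12},
}
alerts = find_health_alerts(sample_reading)
for message in alerts:
    print(message)
```

Catching conditions like these early gives operators time to replace a fan or a disk before the problem turns into a hardware failure.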