What is reliability, availability and serviceability (RAS)?
Reliability, availability and serviceability (RAS) is a set of related attributes that must be considered when designing, manufacturing, purchasing and using a computer product or component. The term was first used by IBM to define specifications for its mainframes and originally applied only to hardware. Today, RAS is relevant to software as well and can be applied to networks, applications, operating systems (OSes), personal computers, servers and even supercomputers.
The three components of the term mean different things. Together they describe the level at which a user can expect a computer component or software to perform.
How does RAS work?
Each part of the term reliability, availability and serviceability describes a specific type of performance for computer components and software.
The term reliability refers to the ability of computer hardware and software to consistently perform according to certain specifications. More specifically, it measures the likelihood that a specific system or application will meet its expected performance levels within a given time period.
In theory, a reliable product is free of technical errors. In practice, vendors commonly express product reliability as a percentage. The IEEE sponsors the IEEE Reliability Society (IEEE RS), an organization devoted to reliability in engineering.
Mean time between failures (MTBF) is one metric used to measure reliability. For most computer components, the MTBF is thousands or tens of thousands of hours. The longer the uptime between system outages, the more reliable the system is. MTBF is calculated by dividing the total uptime hours by the number of outages during the observation period.
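As a minimal sketch of the MTBF calculation described above, the following Python function divides total uptime by the number of outages. The function name and the sample figures (8,700 uptime hours, three outages) are hypothetical and chosen only for illustration.

```python
def mtbf_hours(total_uptime_hours: float, outages: int) -> float:
    """Mean time between failures: total uptime divided by the outage count."""
    if outages <= 0:
        # With no recorded outages, the ratio is undefined for this period.
        raise ValueError("MTBF needs at least one outage in the observation period")
    return total_uptime_hours / outages


# Hypothetical observation period: 8,700 hours of uptime and 3 outages.
print(mtbf_hours(8700, 3))  # 2900.0
```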
Service-level agreements and other contracts often use the "nines" to describe guaranteed levels of reliability and availability. For instance, five nines means the system or component in question is promised to be available 99.999% of the time. Such a system can be down for only a little more than five minutes a year (about 5.26 minutes), so five nines is a high level of availability. Organizations relying on high-availability systems often require a minimum of four nines, which allows less than an hour (about 52.6 minutes) of downtime per year.
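To make the arithmetic behind the nines concrete, this short Python sketch converts a number of nines into the downtime allowed per year. It assumes a 365-day year, and the function name is illustrative rather than taken from any standard.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored


def downtime_minutes_per_year(nines: int) -> float:
    """Allowed yearly downtime for an availability of N nines (e.g. 5 -> 99.999%)."""
    unavailability = 10 ** -nines  # e.g. 5 nines -> 0.00001
    return MINUTES_PER_YEAR * unavailability


for n in range(2, 6):
    print(f"{n} nines: {downtime_minutes_per_year(n):.2f} minutes per year")
```

Under these assumptions, five nines works out to roughly 5.26 minutes per year and four nines to roughly 52.56 minutes.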
Availability is the ratio of the time a system or component is functional to the total time it is required or expected to function. This can be expressed as a proportion, such as 9/10 or 0.9, or as a percentage, which in this case would be 90%.
To calculate the availability of a component or software program, divide the actual operating time by the amount of time it was expected to operate. For example, if a device is working for 50 minutes out of an hour, it has 83.3% availability. MTBF bears on availability as well as reliability: for a given repair time, a higher MTBF means higher availability.
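The ratio above can be sketched in a few lines of Python. The function name is hypothetical. The example reuses the figures from the text, 50 working minutes out of an expected 60.

```python
def availability(operating_time: float, expected_time: float) -> float:
    """Availability as the fraction of expected time the system actually operated."""
    if expected_time <= 0:
        raise ValueError("expected_time must be positive")
    return operating_time / expected_time


# A device that works for 50 minutes out of an hour.
print(f"{availability(50, 60):.1%}")  # 83.3%
```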
Sometimes availability is expressed in qualitative terms. For instance, it might measure the extent to which a system can continue to work when a significant component or set of components is unavailable or not operating.
Serviceability is the ease with which a component, device or system can be maintained and repaired. Early detection of potential problems is a critical factor of serviceability. In determining serviceability, it's important to consider how easy it is to do the following:
- Diagnose issues.
- Repair problems.
- Obtain parts.
- Take a system down to effect repairs.
- Return it to operation.
Mean time to repair (MTTR) is a metric used to measure serviceability. It's calculated by taking the total amount of time spent on repairs in a given time period and dividing it by the number of repairs. For example, if 20 minutes of time is spent on repairs resulting from two outages, the MTTR is 10 minutes.
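The MTTR example above can be reproduced with the short Python sketch below. The second function is an addition not found in the text: it uses the common steady-state approximation, availability = MTBF / (MTBF + MTTR), to show how the two metrics combine. Both function names and the 990-minute MTBF are hypothetical.

```python
def mttr_minutes(total_repair_minutes: float, repairs: int) -> float:
    """Mean time to repair: total repair time divided by the number of repairs."""
    if repairs <= 0:
        raise ValueError("MTTR needs at least one repair in the period")
    return total_repair_minutes / repairs


def steady_state_availability(mtbf: float, mttr: float) -> float:
    """Common approximation (an assumption here): MTBF / (MTBF + MTTR).

    Both arguments must use the same time unit.
    """
    return mtbf / (mtbf + mttr)


# 20 minutes of repair time across two outages, as in the text.
mttr = mttr_minutes(20, 2)
print(mttr)  # 10.0

# Hypothetical MTBF of 990 minutes combined with the 10-minute MTTR.
print(f"{steady_state_availability(990, mttr):.2%}")  # 99.00%
```

Under this approximation, shortening repairs raises availability just as lengthening the time between failures does.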
Some systems are self-monitoring and use diagnostics to automatically identify and correct software and hardware faults before more serious trouble occurs. For example, OSes such as Microsoft Windows include built-in features that automatically detect and fix computer issues, and the autoprotect features of antivirus and antispyware software include detection and removal programs. Ideally, maintenance and repair operations cause as little downtime or disruption as possible.
Important RAS features and design elements
There are many ways to improve RAS, and availability and reliability in particular. These include deploying computer systems and subsystems with more powerful CPUs, multiple processors and multiple memory modules, and using component redundancy, error detection firmware and error-correcting code (ECC).
Some of the key ways that RAS is designed into hardware and software are the following:
- Overengineering. Systems are designed beyond the minimum specifications.
- Duplication. Extensive use of redundant systems and components eliminates single points of failure and improves RAS.
- Recoverability. Fault-tolerant engineering methods help ensure RAS.
- Automatic updating. These systems keep OSes and critical applications current without user intervention.
- Data backup. Effective data backup prevents catastrophic loss of critical information and maintains data integrity.
- Data archiving. Archiving systems ensure older data is available when needed for audits and recovery needs.
- Power-on replacement. This is the ability to hot swap components or peripherals, making upgrades and repairs easier.
- Virtual machines (VMs). The use of VMs minimizes the impact of OS and software issues.
- Surge suppressors. These minimize the risk of component damage resulting from power anomalies.
- Continuous power. Uninterruptible power supply lets systems remain operational when there is an interruption in the regular power supply.
- Backup power sources. Batteries and generators keep systems operational during extended power interruptions.
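To illustrate why the duplication item in the list above improves availability, the sketch below estimates the combined availability of redundant components. It is a simplified model that assumes the copies fail independently and that any single working copy keeps the system running. Real deployments often share failure causes, so treat the result as an upper bound. The function name and figures are hypothetical.

```python
def parallel_availability(component_availability: float, copies: int) -> float:
    """Availability of N redundant copies, assuming independent failures.

    The system is down only when every copy is down at the same time.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("component_availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1 - (1 - component_availability) ** copies


# Hypothetical component with 99% availability, deployed alone and in pairs.
for n in (1, 2, 3):
    print(f"{n} copies: {parallel_availability(0.99, n):.6f}")
```

Under these assumptions, pairing two 99% components yields roughly four nines, which is why redundancy is a common route to high availability.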
The RAS concept is particularly important when designing a data center.