Definition

What is reliability, availability and serviceability (RAS)?

Paul Kirvan

By

Paul Kirvan

Published: Mar 24, 2025

Reliability, availability and serviceability (RAS) are related operational activities that must be considered when designing, manufacturing, purchasing and using a computer product or component. The term was first used by IBM to define specifications for its mainframes and originally applied only to hardware. Today, RAS is relevant to software as well and can be applied to networks, applications, operating systems (OSes), personal computers, servers and even supercomputers.

The three components of the term mean different things. Together they describe the level at which a user can expect a computer component or software to perform. RAS applies to a broad range of technology elements, including hardware components, central processing units (CPUs) and OSes, system firmware and specialized high-availability computer systems. From an administrative perspective, RAS addresses issues such as maximizing system uptime, minimizing system downtime, identifying points of failure and ensuring data integrity.

How does RAS work?

Each part of the term reliability, availability and serviceability describes a specific type of performance for computer components and software.

Reliability

The term reliability refers to the ability of computer hardware and software to consistently perform according to certain specifications. More specifically, it measures the likelihood that a specific system or application will meet its expected performance levels within a given time period.

In theory, a reliable product is free of technical errors. In practice, vendors commonly express product reliability as a percentage. The Institute of Electrical and Electronics Engineers (IEEE) sponsors the IEEE Reliability Society (IEEE RS), an organization devoted to reliability in engineering.

Graphic explaining how the 9s translate to network downtime.

The nines are used to calculate the percentage of network availability guaranteed in a service-level agreement (SLA) or other contract. They can be translated into quantifiable hours, minutes and seconds of allowable network services downtime.

Mean time between failures (MTBF) is one metric used to measure reliability. For most computer components, the MTBF is thousands or tens of thousands of hours between failures. The longer the uptime is between system outages, the more reliable the system is. MTBF is dividing the total uptime hours by the number of outages during the observation period.

Service-level agreements and other contracts often use the nines to describe guaranteed levels of reliability and availability. For instance, five 9s means a reliability level of 99.999% is being promised. The system or component in question will be available 99.999% of the time. Such systems could only be down five minutes a year, so five nines is a high level of reliability. Organizations relying on high-availability systems often require a minimum of four nines or less than an hour of downtime per year.

Availability

Availability is the ratio of time a system or component is functional compared to the total time it is required or expected to function. This can be expressed as a proportion, such as 9/10 or 0.9 or as a percentage, which in this case would be 90%. Full availability or 100% is the desired goal.

To calculate availability of a component or software program, divide the actual operating time by the amount of time it was expected to operate. For example, if a device is working for 50 minutes out of an hour, it has 83.3% availability. MTBF can be used to describe availability as well as reliability. A higher MTBF would mean higher availability.

Sometimes availability is expressed in qualitative terms. For instance, it might measure the extent to which a system can continue to work when a significant component or set of components is unavailable or not operating.

Infographic on common availability management metrics. — Here are some common availability management metrics.

System and software availability are measured by several different metrics. See four important ones here.

Serviceability

Serviceability is the ease with which a component, device or system can be maintained and repaired. Early detection of potential problems is a critical factor of serviceability. In determining serviceability, it's important to consider how easy it is to do the following:

Diagnose issues.
Repair problems.
Obtain parts.
Take a system down to effect repairs.
Test the repaired system.
Document what was performed
Return it to operation.

Mean time to repair (MTTR) is a metric used to measure serviceability. It's calculated by taking the total amount of time spent on repairs in a given time period and dividing it by the number of repairs. For example, if 20 minutes of time is spent on repairs resulting from two outages, the MTTR is 10 minutes.

Some systems are self-monitoring and use diagnostics to automatically identify and correct software and hardware faults before more serious trouble occurs. For example, OSes such as Microsoft 365 include built-in features that automatically detect and fix computer issues, and antivirus software and spyware autoprotect features include detection and removal programs.

Ideally, maintenance and repair operations cause as little downtime or disruption as possible. The use of AI for serviceability is an important enhancement. In addition to supporting diagnostics and repairs, it can also analyze prior performance and provide insights on the likelihood of the item failing in the future.

Infographic of uptime data center tier standards. — Data centers use uptime tiers to ensure the right levels of availability are tied to specific components, systems and software.

Data centers use uptime tiers, as specified by the Uptime Institute, to ensure the right levels of availability are tied to specific components, systems and software.

Why is RAS important?

Two key goals of virtually any information system are: for the system to stay operational as long as possible and to be easily repaired and returned to service if a failure occurs. Additional reasons why reliability, availability and serviceability are important include the following:

Reliability. A truly reliable system is dependable, exhibits consistent performance and has minimal to no outages. Achieving these attributes helps establish confidence and trust in the system.
Availability. When a system fails, the organization can experience productivity, customer and revenue losses, as well as reputational damage. Since availability is closely tied to reliability, system administrators must use all the tools available to keep the system performing as required.
Serviceability. The ability to quickly and efficiently fix a system that has been disrupted is critical to maintaining its performance. Ease of serviceability helps ensure that the MTTR is as low as possible. Ease of troubleshooting is another important component of service management, and many tools are available to facilitate this essential activity.

Without an effective suite of RAS procedures and tools, system performance -- and the company's success -- can be in jeopardy. AI is expected to be an important component of RAS activities.

Pros and cons of RAS

While the advantages of reliability, availability and serviceability can far outweigh the negatives, both must be considered when initiating or updating RAS initiatives.

Reliability. Keeping systems reliable means they perform consistently without interruption, resulting in minimal downtime and reduced maintenance costs.
Availability. Keeping critical systems available to customers helps ensure their satisfaction and continued use, ensuring revenue protection for the company. Managing availability means that additional technology, e.g., redundant components and data backups, might be needed, potentially adding costs.
Serviceability. If critical systems can be easily fixed, their availability and reliability are improved. This helps minimize downtime, improve performance and reduce overall maintenance costs over time. Investment in the right repair tools and technologies might be needed as well as experienced repair personnel and training for repair techs -- all of which can add to operating overhead.

Important RAS features and design elements

There are many ways to improve availability and reliability, in particular. These include deploying computer systems and subsystems with more powerful CPUs, and multiple processors and memory modules, and using component redundancy, error detection firmware and error correcting code. AI will be an important factor in the enhancement of system reliability.

Some of the key ways that RAS is designed into hardware and software are the following:

Overengineering. Systems are designed beyond the minimum specifications.
Duplication. Extensive use of redundant systems and components eliminates single points of failure and improves RAS.
Recoverability. Fault-tolerant engineering methods help ensure RAS.
Automatic updating. These systems keep OSes and critical applications current without user intervention.
Data backup. Effective data backup prevents catastrophic loss of critical information and maintains data integrity.
Data archiving. Archiving systems ensure older data is available when needed for audits and recovery needs.
Power-on replacement. This is the ability to hot swap components or peripherals, making upgrades and repairs easier.
Virtual machines. The use of VMs minimizes the impact of OS and software issues.
Surge suppressors. These minimize the risk of component damage resulting from power anomalies.
Continuous power. Uninterruptible power supply lets systems remain operational when there is an interruption in the regular power supply.
Backup power sources. Batteries and generators keep systems operational during extended power interruptions.

Learn the essentials of designing a data center efficiently. Check out key components, infrastructure and industry standards before embarking on the project. Explore the what factors to consider to run a sustainable data center.

Continue Reading About What is reliability, availability and serviceability (RAS)?

Server hardware guide: Architecture, products and management

What's the difference between network reliability and availability?

High availability and resiliency: A DR strategy needs both

Data center tiers and why they matter for uptime

Network requirements for cloud computing

Search Networking

What is multi-access edge computing? Benefits and use cases
Multi-access edge computing (MEC) is a network architecture concept that brings cloud computing capabilities and IT services ...
What is 5G?
Fifth-generation wireless or 5G is a global standard and technology for wireless and telecommunications networks.
What is a small cell in wireless networks?
A small cell is a type of low-power cellular radio access point or base station that provides wireless service within a limited ...

Search Security

What is incident response? A complete guide
Incident response is an organized, strategic approach to detecting and managing cyberattacks in ways that minimize damage, ...
What is identity and access management? Guide to IAM
No longer just a good idea, IAM is a crucial piece of the cybersecurity puzzle. It's how an organization regulates access to ...
What is data masking?
Data masking is a security technique that modifies sensitive data in a data set so it can be used safely in a non-production ...

Search CIO

What is a chief data officer (CDO)?
A chief data officer (CDO) in many organizations is a C-level executive whose position has evolved into a range of strategic data...
What is user-generated content?
User-generated content (UGC) is published information that an unpaid contributor provides to a website.
What is business process outsourcing (BPO)?
Business process outsourcing (BPO) is a business practice in which an organization contracts with an external service provider to...

Search HRSoftware

What is HR technology (human resources tech)?
HR technology (human resources tech) refers to the hardware and software that support an organization's human resource management...
What is compensation management?
Compensation management is the discipline and process for determining employees' appropriate pay, incentives, rewards, bonuses ...
What is talent management? Definition, basics and strategy
Talent management is a strategic approach organizations use to attract, develop, retain, and optimize employees.

Search Customer Experience

What are virtual agents and how are they being used?
A virtual agent is an AI-powered software application or service that interacts with humans or other digital systems in a ...
Customer acquisition cost (CAC): How to calculate and reduce it
Customer acquisition cost (CAC) is the cost associated with convincing a consumer to buy your product or service, including ...
What is direct marketing?
Direct marketing is a type of advertising campaign that seeks to elicit an action (such as an order, a visit to a store or ...

Close