
reliability, availability and serviceability (RAS)

What is reliability, availability and serviceability (RAS)?

Reliability, availability and serviceability (RAS) is a set of related attributes that must be considered when designing, manufacturing, purchasing and using a computer product or component. The term was first used by IBM to define specifications for its mainframes and originally applied only to hardware. Today, RAS is relevant to software as well and can be applied to networks, applications, operating systems (OSes), personal computers, servers and even supercomputers.

The three components of the term mean different things. Together they describe the level at which a user can expect a computer component or software to perform.

How does RAS work?

Each part of the term reliability, availability and serviceability describes a specific type of performance for computer components and software.

Reliability

The term reliability refers to the ability of computer hardware and software to consistently perform according to certain specifications. More specifically, it measures the likelihood that a specific system or application will meet its expected performance levels within a given time period.

In theory, a reliable product is free of technical errors. In practice, vendors commonly express product reliability as a percentage. The IEEE sponsors the IEEE Reliability Society (IEEE RS), an organization devoted to reliability in engineering.

[Figure: chart showing how the nines translate to network downtime. The nines express the percentage of network availability guaranteed in a service-level agreement or other contract, and they can be translated into hours, minutes and seconds of allowable downtime.]

Mean time between failures (MTBF) is one metric used to measure reliability. For most computer components, the MTBF is thousands or tens of thousands of hours. The longer the uptime between system outages, the more reliable the system is. MTBF is calculated by dividing the total uptime hours by the number of outages during the observation period.
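As a minimal sketch of that calculation, the following Python function divides total uptime by the number of outages. The function name and the sample figures are hypothetical and chosen only for illustration.

```python
def mtbf_hours(total_uptime_hours: float, outage_count: int) -> float:
    """Mean time between failures: total uptime divided by the number of outages."""
    if outage_count <= 0:
        # With no recorded outages, the ratio is undefined for this period.
        raise ValueError("MTBF needs at least one outage in the observation period")
    return total_uptime_hours / outage_count


# Hypothetical observation period: 8,700 hours of uptime and 3 outages.
print(mtbf_hours(8700, 3))  # 2900.0
```

With these sample numbers, the system ran an average of 2,900 hours between failures.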

Service-level agreements and other contracts often use the nines to describe guaranteed levels of reliability and availability. For instance, five nines means an availability level of 99.999% is being promised: The system or component in question will be available 99.999% of the time. Such a system can be down for only about five minutes a year, so five nines is a high level of availability. Organizations relying on high-availability systems often require a minimum of four nines, which allows about 53 minutes of downtime per year.
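The arithmetic behind the nines can be sketched in a few lines of Python. This example assumes a 365-day year and continuous (24/7) service; the function name is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget for an availability of N nines (3 -> 99.9%)."""
    unavailability = 10.0 ** -nines  # for example, 5 nines -> 0.00001
    return MINUTES_PER_YEAR * unavailability


for n in range(2, 6):
    print(f"{n} nines: {allowed_downtime_minutes(n):.2f} minutes per year")
# 2 nines: 5256.00 minutes per year
# 3 nines: 525.60 minutes per year
# 4 nines: 52.56 minutes per year
# 5 nines: 5.26 minutes per year
```

Each additional nine cuts the allowable downtime by a factor of 10.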

Availability

Availability is the ratio of the time a system or component is functional to the total time it is required or expected to function. This can be expressed as a proportion, such as 9/10 or 0.9, or as a percentage, which in this case would be 90%.

To calculate the availability of a component or software program, divide the actual operating time by the amount of time it was expected to operate. For example, if a device works for 50 minutes out of an hour, it has 83.3% availability. MTBF can be used to describe availability as well as reliability. All else being equal, a higher MTBF means higher availability, although availability also depends on how long each repair takes.
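The ratio can be sketched as follows. The second function uses the common steady-state formula MTBF / (MTBF + MTTR), where MTTR is the mean time to repair; that formula is not stated in this definition and is included here as an assumption to show how MTBF relates to availability. All function names and sample values are hypothetical.

```python
def availability(operating_time: float, expected_time: float) -> float:
    """Fraction of the expected time that the system was actually operating."""
    if expected_time <= 0:
        raise ValueError("expected_time must be positive")
    return operating_time / expected_time


def availability_from_mtbf(mtbf: float, mttr: float) -> float:
    """Steady-state availability (assumed formula): MTBF / (MTBF + MTTR)."""
    return mtbf / (mtbf + mttr)


# The example from the text: 50 working minutes out of 60.
print(f"{availability(50, 60):.1%}")  # 83.3%

# Hypothetical figures: 990 hours between failures, 10 hours per repair.
print(f"{availability_from_mtbf(990, 10):.1%}")  # 99.0%
```

Both functions take any time unit, as long as the two arguments use the same one.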

Sometimes availability is expressed in qualitative terms. For instance, it might measure the extent to which a system can continue to work when a significant component or set of components is unavailable or not operating.

[Figure: list of four availability management metrics. System and software availability are measured by several different metrics.]

Serviceability

Serviceability is the ease with which a component, device or system can be maintained and repaired. Early detection of potential problems is a critical factor in serviceability. In determining serviceability, it's important to consider how easy it is to do the following:

  • Diagnose issues.
  • Repair problems.
  • Obtain parts.
  • Take a system down to effect repairs.
  • Return it to operation.

Mean time to repair (MTTR) is a metric used to measure serviceability. It's calculated by dividing the total time spent on repairs in a given period by the number of repairs. For example, if 20 minutes is spent on repairs resulting from two outages, the MTTR is 10 minutes.
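A minimal Python sketch of the MTTR calculation follows. The function name is hypothetical, and the split of the 20 minutes into 12-minute and 8-minute repairs is an assumption made only for illustration.

```python
def mttr_minutes(repair_durations_minutes: list[float]) -> float:
    """Mean time to repair: total repair time divided by the number of repairs."""
    if not repair_durations_minutes:
        raise ValueError("MTTR needs at least one repair in the period")
    return sum(repair_durations_minutes) / len(repair_durations_minutes)


# Two outages with 20 minutes of repair time in total (assumed split: 12 + 8).
print(mttr_minutes([12, 8]))  # 10.0
```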

Some systems are self-monitoring and use diagnostics to automatically identify and correct software and hardware faults before more serious trouble occurs. For example, OSes such as Microsoft Windows include built-in features that automatically detect and fix computer issues, and antivirus and antispyware tools include automatic detection and removal features. Ideally, maintenance and repair operations cause as little downtime or disruption as possible.

[Figure: descriptions of data center uptime tiers. Data centers use uptime tiers to ensure the right levels of availability are tied to specific components, systems and software.]

Important RAS features and design elements

There are many ways to improve RAS, and availability and reliability in particular. These include deploying computer systems and subsystems with more powerful CPUs and with multiple processors and memory modules, as well as using component redundancy, error detection firmware and error-correcting code.

Some of the key ways that RAS is designed into hardware and software are the following:

  • Overengineering. Systems are designed beyond the minimum specifications.
  • Duplication. Extensive use of redundant systems and components eliminates single points of failure and improves RAS.
  • Recoverability. Fault-tolerant engineering methods help ensure RAS.
  • Automatic updating. These systems keep OSes and critical applications current without user intervention.
  • Data backup. Effective data backup prevents catastrophic loss of critical information and maintains data integrity.
  • Data archiving. Archiving systems ensure older data is available when needed for audits and recovery needs.
  • Power-on replacement. This is the ability to hot swap components or peripherals, making upgrades and repairs easier.
  • Virtual machines (VMs). The use of VMs minimizes the impact of OS and software issues.
  • Surge suppressors. These minimize the risk of component damage resulting from power anomalies.
  • Continuous power. An uninterruptible power supply (UPS) lets systems remain operational when there is an interruption in the regular power supply.
  • Backup power sources. Batteries and generators keep systems operational during extended power interruptions.
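To illustrate why duplication in the list above improves availability, the sketch below estimates the combined availability of redundant components. It assumes the components fail independently and that the system stays up as long as at least one copy works; real deployments often violate both assumptions, so treat the result as an upper bound. The function name and the 99% figure are hypothetical.

```python
def redundant_availability(component_availability: float, copies: int) -> float:
    """Availability of N independent redundant copies: 1 - (1 - a) ** N."""
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("at least one copy is required")
    # The system is down only when every copy is down at the same time.
    return 1.0 - (1.0 - component_availability) ** copies


# Hypothetical component that is available 99% of the time.
for copies in (1, 2, 3):
    print(f"{copies} copies: {redundant_availability(0.99, copies):.6f}")
# 1 copies: 0.990000
# 2 copies: 0.999900
# 3 copies: 0.999999
```

Under these assumptions, each redundant copy of a 99% component adds roughly two more nines.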

The RAS concept is particularly important when designing a data center.

This was last updated in April 2023
