Oracle database availability -- and database availability in general -- is essential to business continuity and is a critical component of any contingency plan for dealing swiftly and efficiently with business interruptions. It is also the most important runtime quality of a database. After all, if the database isn't available to run the workload, its other qualities really don't matter.
Let's begin with a definition of availability. Simply stated, it is the proportion of time that a system is ready for use when it is needed. Scheduled or planned downtime is typically excluded when calculating Oracle database availability because, by definition, the system is "not needed" during that time.
Perhaps the most common availability calculation (Figure 1) involves two metrics: Mean Time Between Failures (MTBF) and Mean Time to Recovery (MTTR). MTBF is the average time the system is operational without failure. MTTR is the average time it takes to return the system to an operational state after a failure. There are two ways to improve availability: increase MTBF or reduce MTTR.
Availability = MTBF / (MTBF + MTTR)
Figure 1: Availability Calculation
For example, if a system runs for an average of three months (132,480 minutes) between failures and takes an average of 30 minutes to return to an operational state, your system availability is:
Availability = 132,480 minutes / (132,480 minutes + 30 minutes) = 99.977%
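The same arithmetic can be sketched in a few lines of Python. This is only an illustration: the function name `availability` and the choice of minutes as the unit are assumptions, and "three months" is taken as 92 days so that it matches the 132,480 minutes used in the example.

```python
def availability(mtbf_minutes: float, mttr_minutes: float) -> float:
    """Return availability as a fraction: MTBF / (MTBF + MTTR)."""
    if mtbf_minutes < 0 or mttr_minutes < 0:
        raise ValueError("MTBF and MTTR must be non-negative")
    total = mtbf_minutes + mttr_minutes
    if total == 0:
        raise ValueError("MTBF and MTTR cannot both be zero")
    return mtbf_minutes / total


MINUTES_PER_DAY = 24 * 60
mtbf = 92 * MINUTES_PER_DAY  # assumption: three months = 92 days = 132,480 minutes
mttr = 30                    # minutes to return to an operational state

print(f"Baseline:       {availability(mtbf, mttr):.3%}")
# The two levers: double the time between failures, or halve the recovery time.
print(f"Double MTBF:    {availability(2 * mtbf, mttr):.3%}")
print(f"Halve the MTTR: {availability(mtbf, mttr / 2):.3%}")
```

Doubling MTBF and halving MTTR give the same result, because only the ratio of the two metrics enters the formula. Which lever is cheaper to pull depends on the system.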
Availability is usually expressed as a percentage of time for a given year. For instance, you may have heard high availability referred to as "five nines" (99.999%); this equates to approximately 5 minutes of downtime per year. The following table (Figure 2) shows the relationship between availability and downtime.
Figure 2: Availability Chart
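The relationship between an availability percentage and yearly downtime is simple enough to compute directly. The sketch below assumes a 365-day year (leap years are ignored). The helper name and the list of availability levels are illustrative choices and are not necessarily the rows shown in Figure 2.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the unplanned downtime per year implied by an availability percentage."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


# Illustrative levels, from "two nines" to "five nines".
for pct in (99.0, 99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(pct)
    print(f"{pct}% -> {minutes:,.1f} minutes/year ({minutes / 60:,.2f} hours)")
```

Each additional nine cuts the allowed downtime by a factor of ten, which is why each step up the scale tends to be much harder to achieve than the last.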
System uptime vs. Oracle database availability
There are in fact many ways to view Oracle database availability, each with a different calculation, and this often leads to confusion. Be careful not to equate availability with uptime. A system can be up but not available. This distinction is important when measuring and reporting availability for such things as service level agreements (SLAs). Another area of confusion is the difference between a failure and a fault. A failure occurs when a system no longer delivers the expected service, resulting in unplanned downtime. It is observable by the system's users -- humans or other systems. A fault, on the other hand, is an interruption in expected service that is recovered in a manner transparent to system users and doesn't result in unplanned downtime. My suggestion is to begin Oracle database availability discussions by making sure all parties agree on context, terminology and how this quality is calculated.
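One way to make the agreed terminology concrete is to write the calculation down. The sketch below is one possible convention, not a standard: planned downtime is removed from the measurement period, and only unplanned, user-visible interruptions (failures) count as downtime, while transparently recovered faults do not. The `Outage` record, its field names and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    minutes: float
    planned: bool       # True for a scheduled maintenance window
    user_visible: bool  # False for a fault recovered transparently to users


def measured_availability(period_minutes: float, outages: list[Outage]) -> float:
    """Availability over the time the system was actually needed.

    Planned downtime is subtracted from the period; only unplanned,
    user-visible outages (failures) are counted as downtime.
    """
    planned = sum(o.minutes for o in outages if o.planned)
    failures = sum(o.minutes for o in outages
                   if not o.planned and o.user_visible)
    needed = period_minutes - planned
    if needed <= 0:
        raise ValueError("the system was not needed at any point in this period")
    return (needed - failures) / needed


# Hypothetical 30-day month.
month = 30 * 24 * 60
events = [
    Outage(minutes=240, planned=True, user_visible=True),    # maintenance window
    Outage(minutes=45, planned=False, user_visible=True),    # failure
    Outage(minutes=2, planned=False, user_visible=False),    # fault, transparent
]
print(f"{measured_availability(month, events):.3%}")
```

Changing either rule (for example, counting planned downtime against the SLA) changes the reported number, which is why the parties need to agree on the rules before comparing figures.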
It is important to remember that availability is ultimately measured end-to-end from a user's point of view. Users will define availability by whether the application or service is available when they need it and performs as they expect. For example, in a web-based e-commerce system, if the database is available but the web servers are not, the user considers the system down, and so should you. That being said, with today's complex architectures and business processes, it is not uncommon to see different availability requirements across the architecture and its components. Those same web servers and database might be needed from 8 a.m. to 5 p.m. for online access, while the database is needed an additional five hours a day for batch processing.
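A rough way to reason about end-to-end availability is to multiply the availabilities of the components a request must pass through. The sketch below rests on two assumptions that the text does not state: every component is required (a serial chain), and components fail independently of one another. The component figures are made up for illustration.

```python
from math import prod


def end_to_end_availability(component_availabilities) -> float:
    """Availability of a serial chain of independent, required components."""
    values = list(component_availabilities)
    if not values:
        raise ValueError("at least one component is required")
    if any(not 0 <= a <= 1 for a in values):
        raise ValueError("availabilities must be fractions between 0 and 1")
    return prod(values)


# Hypothetical e-commerce stack: web tier and database are both required.
web_tier = 0.999
database = 0.9995
print(f"{end_to_end_availability([web_tier, database]):.4%}")
```

Under these assumptions the end-to-end figure is never higher than that of the weakest component, so a highly available database does not by itself give users a highly available system.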
When describing availability requirements, my recommendation is to stay away from ambiguous terms like "high availability," which can lead to confusing and inconsistent designs. Instead, describe Oracle database availability requirements in objective and quantitative terms (Figure 3). Expect requirements to get more detailed as the solution progresses from conceptual to physical design.
Figure 3: Sample Availability Requirements
Finally, availability discussions must consider recoverability both under normal operations and in a disaster. Operational recovery requirements are usually more stringent than disaster recovery requirements. In either case, two metrics are used to define recovery: Recovery Time Objective (RTO) and Recovery Point Objective (RPO). RTO is defined as the maximum amount of time that an IT-based business process can be down before the organization starts suffering significant material losses. RPO is defined as the maximum amount of data an IT-based business process may lose before causing significant harm to the organization.
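As a minimal sketch of how these two objectives might be checked against a design, the code below compares an estimated recovery time and the largest gap between recovery points with the stated RTO and RPO. It assumes RPO is expressed as a window of time (a common convention, though the definition above speaks of an amount of data). All names and numbers are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryObjectives:
    rto: timedelta  # maximum tolerable downtime
    rpo: timedelta  # maximum tolerable data loss, expressed as a time window


def meets_objectives(objectives: RecoveryObjectives,
                     estimated_recovery_time: timedelta,
                     max_gap_between_recovery_points: timedelta) -> dict:
    """Report whether a design's estimates satisfy the RTO and the RPO."""
    return {
        "rto_met": estimated_recovery_time <= objectives.rto,
        "rpo_met": max_gap_between_recovery_points <= objectives.rpo,
    }


# Hypothetical requirements: operational recovery is stricter than disaster recovery.
operational = RecoveryObjectives(rto=timedelta(hours=1), rpo=timedelta(minutes=15))
disaster = RecoveryObjectives(rto=timedelta(hours=24), rpo=timedelta(hours=4))

# Hypothetical design: a two-hour restore, with recovery points every 30 minutes.
design = dict(estimated_recovery_time=timedelta(hours=2),
              max_gap_between_recovery_points=timedelta(minutes=30))

print("operational:", meets_objectives(operational, **design))
print("disaster:   ", meets_objectives(disaster, **design))
```

In this made-up case the design satisfies the disaster recovery objectives but misses both operational ones, which illustrates why the two sets of requirements are usually stated separately.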
In our next segment we'll take a look at Oracle database availability architecture by learning the causes of downtime and how Oracle database technologies address these interruptions.
Jeff McCormick is an architecture director at a major health service company and president of the Connecticut Oracle User Group. He has worked in IT for over 20 years as a data and infrastructure architect/administrator, focusing the last five years in enterprise business intelligence. He holds several certifications including Oracle Certified Professional and Microsoft Certified Professional.