IT pros have developed numerous metrics over the years to measure performance. Whether it is a system, process or other item that can fail to operate properly, technology and business professionals need a way to determine how long something is likely to take to fix. This is particularly important in the field of business continuity and disaster recovery.
Mean time to repair (MTTR) is a widely used metric that estimates the average time needed to repair a failed system and return it to normal operation. The lower the MTTR value, the faster the item can typically be fixed. When managing systems, technologies or processes, the goal is to reduce the average time a repair takes. MTTR does not indicate how likely an item is to fail; an MTTR close to 0 simply means that, when a failure does occur, the item is restored almost immediately.
When the goal is uninterrupted operation, a low MTTR value means that the item in question -- if it fails -- will be fairly easy to repair and will take minimal time to return to normal operations.
Why is MTTR important?
MTTR is a critical element in business continuity and disaster recovery (BCDR) plans and can become an essential metric to ensure that systems perform without interruption.
Assets with low MTTR recover quickly: if they fail, they return to normal operations in a minimal amount of time. By contrast, if BCDR teams find that a system has high MTTR, such as four to five days, they should consider replacing it. Updates and newer components are other options to reduce MTTR in an existing system. Management will need to decide the threshold at which high MTTR necessitates a complete replacement or redesign of the item.
How to calculate MTTR
MTTR is an average taken over multiple repair events. For a specific period of time, such as a day, week or month, the time IT spent on each unplanned repair is added together. That total, usually expressed in hours, is then divided by the number of unplanned or unscheduled repair events during the analysis period, that is, repairs of failures that were not meant to occur. Scheduled maintenance time frames are not included in MTTR calculations. For example, three unplanned repairs that took 2, 4.5 and 2.5 hours add up to 9 hours, which gives an MTTR of 3 hours.
In practice, BCDR teams apply this calculation to a series of repair events to obtain MTTR. From there, it is easier to judge how much they need to reduce MTTR or whether current systems are sufficient.
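The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not a standard tool: the `RepairEvent` record, its field names and the sample figures are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class RepairEvent:
    """One repair in the analysis period (hypothetical record layout)."""
    system: str
    repair_hours: float
    planned: bool = False  # True for scheduled maintenance


def mean_time_to_repair(events):
    """Return MTTR in hours, or None if there were no unplanned repairs."""
    # Scheduled maintenance is excluded from the calculation.
    unplanned = [e.repair_hours for e in events if not e.planned]
    if not unplanned:
        return None
    # Total unplanned repair time divided by the number of unplanned repairs.
    return sum(unplanned) / len(unplanned)


# Illustrative data for one month; the values are invented for this sketch.
events = [
    RepairEvent("mail server", 2.0),
    RepairEvent("mail server", 4.5),
    RepairEvent("mail server", 1.0, planned=True),  # not counted
    RepairEvent("mail server", 2.5),
]

mttr = mean_time_to_repair(events)
print(f"MTTR: {mttr} hours")  # (2.0 + 4.5 + 2.5) / 3 = 3.0
```

Returning `None` when there are no unplanned repairs avoids a division by zero and makes it explicit that MTTR is undefined for that period.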
While this calculation seems relatively simple and BCDR teams can easily set it up in a spreadsheet, potential flaws and errors can occur. For example, the MTTR equation assumes that tasks are performed sequentially by appropriately trained personnel. If the order of tasks is changed, if multiple tasks happen at once or if the person performing the tasks isn't properly trained, the resulting figure could be misleading.
MTTR vs. MTBF
Often used in conjunction with MTTR is mean time between failures (MTBF), another important performance and maintenance metric. Whereas MTTR deals with the average time needed to repair something, MTBF expresses the average time between occurrences of system and process failures. This metric indicates the reliability of a system or process.
Unlike MTTR, a higher MTBF value is better. An MTBF value approaching 0 means the system fails almost continuously, whereas a system that never failed would have no finite MTBF at all. Because system failures do occur, a higher MTBF value indicates that the system or process is less likely to fail yet may still experience infrequent outages. A small MTBF value (five to 10 hours or one to two days) means the system or process is much more likely to fail than if the MTBF is large (one to two years). Technology professionals aim for as high an MTBF value as possible yet must still be prepared for failures.
Both MTTR and MTBF provide measurements on the performance and reliability of a system, process or other activity. Values for each metric, as described, can indicate situations where remedial action is needed.
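To make the contrast with MTTR concrete, the sketch below computes MTBF using the common definition of total operating time divided by the number of failures. The article itself does not spell out this formula, so treat it as an assumption; the function name and sample figures are likewise made up for illustration.

```python
def mean_time_between_failures(operating_hours, failure_count):
    """Return MTBF in hours, or None if no failures were observed."""
    if failure_count == 0:
        # With no failures in the period, MTBF cannot be computed.
        return None
    return operating_hours / failure_count


# Illustrative figures for a 30-day period (invented for this sketch).
period_hours = 30 * 24   # 720 hours in the analysis period
downtime_hours = 9.0     # total time spent on unplanned repairs
failures = 3             # number of unplanned failures

# Operating time is the period length minus time spent down for repair.
mtbf = mean_time_between_failures(period_hours - downtime_hours, failures)
print(f"MTBF: {mtbf} hours")  # (720 - 9) / 3 = 237.0
```

Read together with MTTR, a figure like this shows both how often the system fails and how long each failure keeps it out of service.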
How to reduce MTTR
Lower MTTR means a failed system or process returns to service quickly. Reducing MTTR for specific items begins with setting a baseline MTTR that forms the starting point. Subsequent MTTR calculations compared to the baseline will show BCDR teams and admins whether repair performance has improved.
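Comparing a later MTTR value against the baseline is simple arithmetic. The following is a minimal sketch, assuming both values are in hours; the function name and the sample numbers are hypothetical.

```python
def mttr_change_percent(baseline_hours, current_hours):
    """Return the percentage change in MTTR relative to the baseline.

    A negative result means MTTR has gone down, which is the desired direction.
    """
    if baseline_hours <= 0:
        raise ValueError("baseline MTTR must be positive")
    return (current_hours - baseline_hours) / baseline_hours * 100.0


# Illustrative values: baseline of 4 hours, later measurement of 3 hours.
change = mttr_change_percent(4.0, 3.0)
print(f"MTTR change vs. baseline: {change}%")  # (3 - 4) / 4 * 100 = -25.0
```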
There are numerous actions an organization can take to reduce MTTR values. They include the following:
- build a supply of spare parts and components in case a production component fails;
- conduct regular tests and performance reviews to ensure systems are working well;
- perform a business impact analysis to indicate which systems and processes are most critical, and calculate MTTR to monitor performance;
- track MTTR alongside other performance metrics, such as MTBF, recovery time objective and recovery point objective;
- deploy an optimized incident response plan that protects mission-critical assets and enables rapid response to any malfunction;
- deploy special rapid response teams that respond to system and process outages beyond an incident response team;
- install monitoring systems with sensors that can provide alerts when systems cease to perform properly;
- streamline help desk resources to simplify the reporting process, from problem detection to ticket submission;
- fully train equipment repair teams and train personnel to serve as backups; and
- update the change management process to minimize the chance for error, such as adding verification testing to ensure the system is working properly.