How to manage disaster recovery in SQL Server
SQL Server offers a platform for hosting relational databases. Enacting a disaster recovery plan that won't disrupt mission-critical SQL Server workloads is essential.
Microsoft SQL Server is a relational database management system that can support a wide range of applications, but the data must be readily available to those applications whenever they need it.
Any disruptions to databases services -- whether from natural disaster, equipment failure, cyber attacks or other causes -- can prevent an organization from performing routine operations and conducting everyday business. This may lead to disgruntled users, lost revenue and tarnished reputations. An organization must have an effective disaster recovery strategy that helps to minimize disruptions to SQL Server services, especially when supporting mission-critical workloads.
Organizations of all types and sizes use SQL Server to support transaction processing, business intelligence and analytics applications.
Planning for the worst
Like similar database products, such as Oracle Database or IBM's Db2, SQL Server is built on the Structured Query Language (SQL), a standardized programming language that provides a foundation for managing relational databases and querying their data. SQL Server uses a modified implementation of SQL -- referred to as Transact-SQL -- that adds a set of proprietary programming extensions to the standard language.
Applications that rely on SQL Server must be able to access the data to meet workload requirements, but unexpected events can cause databases to become unavailable or lead to data loss. Disruptions to data services can occur for numerous reasons, including the following:
- a user inadvertently deletes critical data;
- a malware attack encrypts or destroys data;
- an employee spills coffee on the storage system;
- a power failure causes data to become corrupted;
- an earthquake destroys physical storage infrastructure;
- a hard disk drive (HDD) fails and all data on that drive is lost;
- an administrator inadvertently formats an HDD that contains active data;
- software corruption leads to data loss or lack of availability; and
- a rogue employee steals a physical storage device.
These are by no means the only reasons that data might become unavailable but illustrate the wide range of crises that could strike any organization at any time. Whatever the cause, the only way to protect against data loss and service disruptions is to implement a disaster recovery strategy that minimizes downtime and ensures the safe recovery of any affected data. Fortunately, SQL Server has features that can help protect against disaster, many of which are part of the platform's high-availability capabilities.
The primary goal of a DR plan is to ensure business continuity if disaster strikes. Not all plans are the same, however, nor should they be. A DR plan should be tailored to meet an organization's specific data protection requirements. For example, data that drives a mission-critical financial application might require immediate recovery, but data that's used to generate monthly sales reports can likely tolerate a longer delay. When planning a DR strategy, an organization should determine the following three metrics:
- Recovery Time Objective (RTO). This is the maximum amount of time an application can be offline as a result of unavailable data. The metric determines how quickly data needs to be back online after an incident.
- Recovery Point Objective (RPO). This is the acceptable amount of data loss that can be tolerated if an event occurs. This metric is often considered in terms of time. For example, if a database is backed up every two hours and a disaster occurs right before the next scheduled backup, all data changes that took place since the last backup are lost.
- Recovery Level Objective (RLO). This is the level of granularity at which the data should be recovered, such as the instance, database or table level.
An organization will ultimately have to balance the risk of data loss against the costs of implementing a DR strategy. The shorter the RTO and RPO (in terms of time) and the finer the RLO granularity, the greater the costs.
Disaster recovery strategies for SQL Server
After organizations determine the levels of protection they need, they can take advantage of the following SQL Server features to help mitigate the impact of unexpected service disruptions:
- Backup and recovery. A well-tested backup and restore strategy can help protect databases against a variety of issues, including ransomware, user error, hardware failures and natural disaster. Backing up data to a separate location makes it possible to restore a database to a consistent state and avoid catastrophic data loss; however, restoring databases can be a time-consuming process and data changes made after the last backup are lost. Backups have become especially important with the rise in ransomware attacks.
- Always On availability groups. Like backups, availability groups provide database-level protection. An availability group is made up of a group of databases that fail over together to a secondary instance. SQL Server can support up to eight sets of corresponding secondary databases in an availability group. Within the availability group, a database's transactions are applied to a replica database on another SQL Server instance. Availability groups serve as a replacement to database mirroring, which is still supported but was deprecated in SQL Server 2012.
- Always On failover cluster instances. Failover clusters use the Windows Server Failover Cluster (WSFC) nodes framework to provide instance-level protection that includes databases, linked servers, SQL Server Agent jobs and other instance and database objects. A failover cluster is made up of redundant nodes, but only one node can own the WSFC resource group at a time. A cluster also requires a shared storage system, such as a storage area network. Failover is automatic and transparent to connecting applications.
- SQL Server replication. Replication makes it possible to copy and distribute data and database objects from one database to another and then keep those databases synchronized. SQL Server supports three types of replication: transactional, snapshot and merge. Organizations can use replication to distribute data to different locations across local area networks, wide area networks, wireless connections and the internet. SQL Server replication uses SQL Server Agent to synchronize the databases and their data.
- Log shipping. Log shipping enables administrators to automatically apply a database's transaction log on one SQL Server instance to one or more secondary databases on separate instances. A monitor server maintains a record of the history and status of backup and restore operations and raises alerts if operations fail. Log shipping uses SQL Server's backup and restore capabilities to copy the transaction log from the primary instance to the secondary instances. Although log shipping is easy to implement, switching over to a secondary database is a manual operation that can take a significant amount of time.
Many organizations deploy these features in combination to help maximize data protection and minimize downtimes. For example, a database team might use log shipping along with availability groups to protect their databases. In addition, most organizations maintain backups regardless of any other strategies they deploy. Many organizations also implement SQL Server disaster recovery as part of a larger DR strategy that incorporates protections at multiple levels, such as configuring RAID 6 storage or backing up virtual machines running SQL Server instances.
Regardless of the DR strategy that an organization implements, it first requires careful planning that considers RTO, RPO and RLO requirements, as well as factors such as security and compliance. Disaster recovery also goes hand in hand with an organization's high-availability strategy, which relies on many of the same SQL Server tools. Regardless of an organization's type or size, disaster recovery should be a top priority, and the more prepared organizations are to handle disasters, the more likely they are to get through them when they occur.