Backup refers to the copying of physical or virtual files or databases to a secondary location for preservation in case of equipment failure or catastrophe. The process of backing up data is pivotal to a successful disaster recovery plan.
Enterprises back up data they deem to be vulnerable in the event of buggy software, data corruption, hardware failure, malicious hacking, user error or other unforeseen events. Backups capture and synchronize a point-in-time snapshot that is then used to return data to its previous state.
Backup and recovery testing examines an organization's practices and technologies for data security and data replication. The goal is to ensure rapid and reliable data retrieval should the need arise. The process of retrieving backed-up data files is known as file restoration.
The terms data backup and data protection are often used interchangeably, although data protection encompasses the broader goals of business continuity, data security, information lifecycle management and prevention of malware and computer viruses.
The importance of data backup
Data backups are among the most important infrastructure components in any organization because they help guard against data loss. Backups provide a way of restoring deleted files or recovering a file when it is accidentally overwritten.
In addition, backups are usually an organization's best option for recovering from a ransomware attack or from a major data loss event, such as a fire in the data center.
What data should be backed up and how frequently?
A backup process is applied to critical databases or related line-of-business applications. The process is governed by predefined backup policies that specify how frequently the data is backed up and how many duplicate copies -- known as replicas -- are required, as well as by service-level agreements (SLAs) that stipulate how quickly data must be restored.
Best practices suggest a full data backup should be scheduled to occur at least once a week, often during weekends or off-business hours. To supplement weekly full backups, enterprises typically schedule a series of differential or incremental backup jobs that capture only changed data: a differential job copies everything that has changed since the last full backup, while an incremental job copies only what has changed since the most recent backup of any type.
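The weekly schedule described above can be sketched in a few lines of Python. This is only an illustration: the choice of Sunday for the full backup and the function names `backup_type_for` and `plan_week` are assumptions, not part of any backup product.

```python
from datetime import date, timedelta

# Assumed choice: run the full backup on Sunday.
# date.weekday() numbers Monday as 0 and Sunday as 6.
FULL_WEEKDAY = 6


def backup_type_for(day: date, mode: str = "incremental") -> str:
    """Return the job type to run on a given day.

    A full backup runs once a week; every other day runs the
    supplementary job type selected by `mode`.
    """
    if mode not in ("incremental", "differential"):
        raise ValueError(f"unknown mode: {mode}")
    return "full" if day.weekday() == FULL_WEEKDAY else mode


def plan_week(start: date, mode: str = "incremental") -> list[tuple[date, str]]:
    """Build a seven-day plan beginning at `start`."""
    days = [start + timedelta(days=i) for i in range(7)]
    return [(day, backup_type_for(day, mode)) for day in days]


if __name__ == "__main__":
    # 2024-01-07 is a Sunday, so the plan opens with the full backup.
    for day, job in plan_week(date(2024, 1, 7)):
        print(day.isoformat(), job)
```

A real scheduler would also account for backup windows and job failures; the sketch only shows how one weekly full backup and daily supplementary jobs fit together.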
The evolution of backup storage media
Enterprises typically back up key data to dedicated backup disk appliances. Backup software -- either integrated in the appliances or running on a separate server -- manages the process of copying data to the disk appliances. Backup software handles processes such as data deduplication that reduce the amount of physical space required to store data. Backup software also enforces policies that govern how often specific data is backed up, how many copies are made and where backups are stored.
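To show how data deduplication reduces the physical space a backup consumes, here is a minimal sketch that assumes fixed-size chunks identified by SHA-256 digests. The `DedupStore` class and the 4 KB chunk size are hypothetical; commercial products typically use more sophisticated techniques, such as variable-size chunking.

```python
import hashlib

CHUNK_SIZE = 4096  # fixed-size chunks; an assumption for illustration


class DedupStore:
    """Toy backup target that stores each unique chunk only once."""

    def __init__(self) -> None:
        self.chunks: dict[str, bytes] = {}  # digest -> chunk contents

    def put(self, data: bytes) -> list[str]:
        """Store `data` and return its recipe: the ordered chunk digests."""
        recipe = []
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            # A chunk that is already present is not stored again.
            self.chunks.setdefault(digest, chunk)
            recipe.append(digest)
        return recipe

    def get(self, recipe: list[str]) -> bytes:
        """Rebuild the original data from its recipe."""
        return b"".join(self.chunks[digest] for digest in recipe)

    def stored_bytes(self) -> int:
        """Physical bytes held, counting each unique chunk once."""
        return sum(len(chunk) for chunk in self.chunks.values())


if __name__ == "__main__":
    store = DedupStore()
    data = b"A" * CHUNK_SIZE * 3 + b"B" * CHUNK_SIZE
    recipe = store.put(data)
    print(len(data), "logical bytes,", store.stored_bytes(), "stored bytes")
    assert store.get(recipe) == data
```

Backing up the same data a second time adds new recipes but no new chunks, which is why repeated backups of largely unchanged data deduplicate well.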
Before disk became the main backup medium in the early 2000s, most organizations used magnetic tape drive libraries to store data center backups. Tape is still used today, but mainly for archived data that does not need to be quickly restored. Some organizations have adopted the practice of using a removable external drive instead of a tape, but the basic concept of backing up data to removable media remains the same.
Disk-based backups made it possible for organizations to achieve continuous data protection. Prior to disk-based backups, organizations would typically create a single nightly backup. Early on, the nightly backups were all full system backups. As time went on, the backup files became larger, while the backup windows remained the same size or even shrank. This forced many organizations to create nightly incremental backups.
Continuous data protection platforms avoid these challenges completely. The systems perform an initial full backup to disk, and then perform incremental backups every few minutes as data is created or modified. These types of backups can protect both structured data -- data stored on a database server -- and unstructured or file data.
In the early days of disk backup, the backup software was designed to run on a separate server. This software coordinated the backup process and wrote backup data to a storage array. These systems gained rapid popularity because they acted as online backups, meaning data could be backed up or restored on demand, without having to mount a tape.
Although some backup products still use separate backup servers, backup vendors are increasingly transitioning to integrated data protection appliances. At its simplest, an integrated data appliance is essentially a file server outfitted with HDDs and backup software. These plug-and-play data storage devices often include automated features for monitoring disk capacity, expandable storage and preconfigured tape libraries.
Some backup vendors have also begun offering backup platforms that are based around the use of hyper-converged systems. These systems consist of collections of standardized servers that have been clustered together and collectively handle backup-related processes. One of the main benefits of hyper-converged systems is that they are easily scalable. Each node within a hyper-converged system contains its own integrated storage, compute and network resources. Administrators can scale the organization's backup capacity simply by adding more nodes to the cluster.
Whether hyper-converged or not, most disk-based backup appliances enable copies to be moved from spinning media to magnetic tape for long-term retention. Magnetic tape systems are still used because of increasing tape densities and the rise of the Linear Tape File System.
Early disk backup systems were known as virtual tape libraries (VTLs) because they included disk that worked the same way as tape drives. That way, backup software applications developed to write data to tape could treat disk as a physical tape library. VTLs faded from popular use after backup software vendors optimized their products for disk instead of tape.
Solid-state drives (SSDs) are rarely used for data backup because of price and endurance concerns. Some storage vendors include SSDs as a caching or tiering tool for managing writes with disk-based arrays. This is especially common in hyper-converged systems. Data is initially cached in flash storage and then written to disk. As vendors release SSDs with larger capacity than disk drives, flash drives might gain some use for backup.
Local backup vs. offline backup for primary storage
Modern primary storage systems have evolved to feature stronger native capabilities for data backup. These features include advanced RAID protection schemes, unlimited snapshots and tools for replicating snapshots to secondary backup or even tertiary off-site backup. Despite these advances, primary storage-based backup tends to be more expensive and lacks the indexing capabilities found in traditional backup products.
Local backups place data copies on external HDDs or magnetic tape systems, typically housed in or near an on-premises data center. The data is transmitted over a secure high-bandwidth network connection or corporate intranet.
One advantage of local backup is the ability to back up data behind a network firewall. Local backup is also much quicker and provides greater control over who can access the data.
Offline or cold backup is like local backup, although it is most often associated with backing up a database. An offline backup incurs downtime since the backup process occurs while the database is disconnected from its network.
Backup and cloud storage
Off-site backup transmits data copies to a remote location, which can include a company's secondary data center or leased colocation facility. Increasingly, off-site data backup equates to subscription-based cloud storage as a service, which provides low-cost, scalable capacity and eliminates a customer's need to purchase and maintain backup hardware. Despite its growing popularity, electing backup as a service (BaaS) requires users to encrypt data and take other steps to safeguard data integrity.
Cloud backup is divided into the following:
- Public cloud storage. Users ship data to a cloud services provider who charges them a monthly subscription fee based on consumed storage. There are additional fees for ingress and egress of data. AWS, Google Cloud and Microsoft Azure are the largest public cloud providers. Smaller managed service providers also host backups on their clouds or manage customer backups on the large public clouds.
- Private cloud storage. Data is backed up to different servers within a company's firewall, typically between an on-premises data center and a secondary DR site. For this reason, private cloud storage is sometimes referred to as internal cloud storage.
- Hybrid cloud storage. A company uses both local and off-site storage. Enterprises customarily use public cloud storage selectively for data archiving and long-term retention. They use private storage for local backup and for faster access to their most critical data.
Most backup vendors enable local applications to be backed up to a dedicated private cloud, effectively treating cloud-based data backup as an extension of a customer's physical data center. When the process enables applications to fail over in case of a disaster and fail back later, this is known as disaster recovery as a service.
Cloud-to-cloud (C2C) data backup is an alternative approach that has been gaining momentum. C2C backup protects data on SaaS platforms, such as Salesforce or Microsoft Office 365. This data often exists only in the cloud, but SaaS vendors often charge large fees to restore data lost due to customer error. C2C backup works by copying SaaS data to another cloud, from where it can be restored if any data is lost.
Backup storage for PCs and mobile devices
PC users can consider local backup from a computer's internal hard disk to an attached external hard drive or to removable media, such as a thumb drive.
Another alternative for consumers is to back up data from smartphones and tablets to personal cloud storage, which is available from vendors such as Box, Carbonite, Dropbox, Google Drive, Microsoft OneDrive and others. These services commonly provide a certain capacity for free, giving consumers the option to purchase additional storage as needed. Unlike enterprise cloud storage as a service, these consumer-based cloud offerings generally do not provide the level of data security businesses require.
Backup software and hardware vendors
Vendors that sell backup hardware platforms include Barracuda Networks, Cohesity, Dell EMC (Data Domain), Drobo, ExaGrid Systems, Hewlett Packard Enterprise, Hitachi Vantara, IBM, NEC Corp., Oracle StorageTek (tape libraries), Quantum Corp., Rubrik, Spectra Logic, Unitrends and Veritas NetBackup.
Leading enterprise backup software vendors include Acronis, Arcserve, Asigra, Commvault, Datto, Dell EMC Data Protection Suite (Avamar and NetWorker), Dell EMC RecoverPoint replication manager, Druva, Nakivo, Veeam Software and Veritas Technologies.
The Microsoft Windows Server OS inherently features the Microsoft Resilient File System (ReFS) to automatically detect and repair corrupted data. While not technically data backup, Microsoft ReFS is geared to be a preventive measure for safeguarding file system data against corruption.
VMware vSphere provides a suite of backup tools for data protection, high availability and replication. The VMware vStorage API for Data Protection (VADP) enables VMware or supported third-party backup software to safely take full and incremental backups of VMs. VADP implements backups via hypervisor-based snapshots. As an adjunct to data backup, VMware vSphere live migration enables VMs to be moved between different platforms to minimize the effect of a DR event. VMware Virtual Volumes also aid VM backup.
Backup types defined
- Full backup captures a copy of an entire data set. Although considered to be the most reliable backup method, performing a full backup is time-consuming and requires many disks or tapes. Most organizations run full backups only periodically.
- Incremental backup offers an alternative to full backups by backing up only the data that has changed since the most recent backup of any type, whether full or incremental. The drawback is that a full restore takes longer, because the last full backup and every subsequent incremental backup must be applied in order.
- Differential backup copies data changed since the last full backup. This enables a full restore to occur more quickly by requiring only the last full backup and the last differential backup. For example, if you create a full backup on Monday, the Tuesday backup would, at that point, be similar to an incremental backup. Wednesday's backup would then back up everything that has changed since Monday's full backup, including the changes already captured on Tuesday. The downside is that progressive growth of differential backups tends to adversely affect your backup window.
- Synthetic full backup is a variation of differential backup. In a synthetic full backup, the backup server produces an additional full copy, which is based on the original full backup and data gleaned from incremental copies. Each file in the synthetic copy is assembled by combining an earlier complete copy of it with one or more incremental copies created later. The assembled file is not a direct copy of any single current or previously created file, but rather is synthesized from the original file and any subsequent modifications to that file.
- Incremental-forever backups minimize the backup window while providing faster recovery access to data. An incremental-forever backup captures the full data set and then supplements it with incremental backups from that point forward. Backing up only changed blocks is also known as delta differencing. Full backups of data sets are typically stored on the backup server, which automates the restoration.
- Reverse-incremental backups capture the changes made between two instances of a mirror. Once an initial full backup is taken, each successive incremental backup applies any changes to the existing full backup. This essentially generates a new synthetic full backup copy each time an incremental change is applied, while also providing reversion to previous full backups.
- Hot backup, or dynamic backup, is applied to data that remains available to users while the backup is in process. This method sidesteps user downtime and productivity loss. The risk with hot backup is that, if the data is amended while the backup is underway, the resulting backup copy might not match the final state of the data.
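The restore-time difference between incremental and differential backups can be made concrete with a short sketch. It works out which backups must be applied, and in what order, to restore data as of a given day. The `Backup` record and `restore_chain` function are hypothetical names, and the sketch assumes each incremental backup records changes since the most recent backup of any type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backup:
    day: int   # ordinal day number (hypothetical field)
    kind: str  # "full", "incremental" or "differential"


def restore_chain(history: list[Backup], target_day: int) -> list[Backup]:
    """Return the backups needed, in apply order, to restore `target_day`."""
    eligible = sorted(
        (b for b in history if b.day <= target_day), key=lambda b: b.day
    )
    fulls = [b for b in eligible if b.kind == "full"]
    if not fulls:
        raise ValueError("no full backup on or before the target day")

    base = fulls[-1]
    chain = [base]
    later = [b for b in eligible if b.day > base.day]

    # A differential holds every change since the full backup, so only
    # the most recent one is needed.
    differentials = [b for b in later if b.kind == "differential"]
    if differentials:
        last_diff = differentials[-1]
        chain.append(last_diff)
        later = [b for b in later if b.day > last_diff.day]

    # Each remaining incremental holds only the changes since the backup
    # before it, so all of them must be applied in order.
    chain.extend(b for b in later if b.kind == "incremental")
    return chain


if __name__ == "__main__":
    incrementals = [Backup(1, "full")] + [Backup(d, "incremental") for d in (2, 3, 4)]
    differentials = [Backup(1, "full")] + [Backup(d, "differential") for d in (2, 3, 4)]
    print([b.day for b in restore_chain(incrementals, 4)])
    print([b.day for b in restore_chain(differentials, 4)])
```

With a full backup on day 1, restoring day 4 from incrementals needs all four backups, while restoring from differentials needs only the full backup and the day 4 differential. That shorter chain is the trade-off for the larger daily differential backups.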
Techniques and technologies to complement data backup
- Continuous data protection (CDP) refers to layers of associated technologies designed to enhance data protection. A CDP-based storage system backs up all enterprise data whenever a change is made. CDP tools enable multiple copies of data to be created. Many CDP systems contain a built-in engine that replicates data from a primary to a secondary backup server and/or tape-based storage. Disk-to-disk-to-tape backup is a popular architecture for CDP systems.
- Near-continuous CDP takes backup snapshots at set intervals, which are different from array-based vendor snapshots that are taken each time new data is written to storage.
- Data reduction lessens your storage footprint. There are two primary methods: data compression and data deduplication. These methods can be used singly, but vendors often combine the approaches. Reducing the size of data has implications on backup windows and restoration times.
- Disk cloning involves copying the contents of a computer's hard drive, saving it as an image file and transferring it to storage media. Disk cloning can be used for system provisioning, system recovery and rebooting or returning a system to its original configuration.
- Erasure coding, or forward error correction, evolved as a scalable alternative to traditional RAID systems and is most often associated with object storage. RAID stripes data writes across multiple drives, using a parity drive to ensure redundancy and resilience. Erasure coding, by contrast, breaks data into fragments and encodes them with additional redundant data. These encoded fragments are stored across different storage media, nodes or geographic locations. The associated fragments are used to reconstruct corrupted data using a technique known as oversampling.
- Flat backup is a data protection scheme in which a direct copy of a snapshot is moved to low-cost storage without the use of traditional backup software. The original snapshot retains its native format and location; the flat backup replica gets mounted should the original become unavailable or unusable.
- Mirroring places data files on more than one computer server to ensure they remain accessible to users. In synchronous mirroring, data is written to local and remote disk simultaneously. Writes from local storage are not acknowledged until a confirmation is sent from remote storage, thus ensuring the two sites have an identical data copy. Conversely, in asynchronous mirroring, local writes are considered complete before confirmation is sent from the remote server.
- Replication enables users to select the required number of replicas, or copies, of data needed to sustain or resume business operations. Data replication copies data from one location to another, providing an up-to-date copy to hasten DR.
- Recovery-in-place, or instant recovery, enables users to temporarily run a production application directly from a backup VM instance, thus maintaining data availability while the primary VM is being restored. Mounting a physical or VM instance directly on a backup or media server can hasten system-level recovery to within minutes. Recovery from a mounted image does result in degraded performance, since backup servers are not sized for production workloads.
- Storage snapshots capture a set of reference markers on disk for a given database, file or storage volume. Users refer to the markers, or pointers, to restore data from a selected point in time. Because it derives from an underlying source volume, an individual storage snapshot is an instance, not a full backup. As such, snapshots do not protect data against hardware failure.
Snapshots are generally grouped in three categories: changed block, clones and CDP. Snapshots first appeared as a management tool within a storage array. The advent of virtualization added hypervisor-based snapshots. Snapshots might also be implemented by backup software or even via a VM.
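The point that a storage snapshot is a set of pointers rather than a full backup can be illustrated with a toy model. The sketch below assumes a copy-on-write design, which is one common way to implement snapshots but is not named in this article. The `Volume` class and its methods are hypothetical.

```python
class Volume:
    """Toy block volume with pointer-based, copy-on-write snapshots."""

    def __init__(self, blocks: list[bytes]) -> None:
        self.blocks = list(blocks)
        # snapshot name -> {block index: contents at snapshot time}
        self.snapshots: dict[str, dict[int, bytes]] = {}

    def snapshot(self, name: str) -> None:
        # Taking a snapshot copies no data; it starts as an empty marker set.
        self.snapshots[name] = {}

    def write(self, index: int, data: bytes) -> None:
        # Before overwriting a block, preserve its old contents for every
        # snapshot that has not already saved that block.
        for saved in self.snapshots.values():
            saved.setdefault(index, self.blocks[index])
        self.blocks[index] = data

    def read_snapshot(self, name: str) -> list[bytes]:
        # Unchanged blocks are read from the live volume, which is why the
        # snapshot cannot outlive the underlying source volume.
        saved = self.snapshots[name]
        return [saved.get(i, block) for i, block in enumerate(self.blocks)]


if __name__ == "__main__":
    vol = Volume([b"a", b"b", b"c"])
    vol.snapshot("monday")
    vol.write(1, b"B")
    print(vol.blocks)
    print(vol.read_snapshot("monday"))
    print(vol.snapshots["monday"])
```

After one write, the snapshot holds a single preserved block and still depends on the live volume for the other two. If the hardware holding the volume is lost, the snapshot is lost with it.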
Copy data management and file sync and share
Tangentially related to backup is copy data management (CDM). This is software that provides insight into the multiple data copies an enterprise might create. It enables discrete groups of users to work from a common data copy. Although technically not a backup technology, CDM enables companies to efficiently manage data copies by identifying superfluous or underutilized copies, thus reducing backup storage capacity and backup windows.
File sync-and-share tools protect data on mobile devices used by employees. These tools basically copy modified user files between mobile devices. Although this protects the data files, it does not enable users to roll back to a particular point in time should the device fail.
How to choose the right backup option
When deciding which type of backup to use, you need to weigh several key considerations.
Enterprises commonly mix various data backup approaches, as dictated by the primacy of the data. A backup strategy should be governed by the SLAs that apply to an application, with respect to data access and availability, recovery time objectives and recovery point objectives. Choice of backups is also influenced by the versatility of a backup application, which should guarantee all data is backed up and provide replication and recovery while establishing efficient backup processes.
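As a rough illustration of how recovery point and recovery time objectives constrain a backup design, the following sketch checks a proposed backup interval and an estimated restore time against those objectives. The function names are hypothetical, and the restore estimate assumes a constant throughput, which real restores rarely achieve.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """Data written since the last backup is lost if a failure strikes
    just before the next one, so the worst case equals the interval."""
    return backup_interval


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """True if the worst-case data loss fits within the RPO."""
    return worst_case_data_loss(backup_interval) <= rpo


def estimated_restore_time(data_gb: float, throughput_gb_per_hour: float) -> timedelta:
    """Simplified estimate that assumes constant restore throughput."""
    return timedelta(hours=data_gb / throughput_gb_per_hour)


def meets_rto(restore_time: timedelta, rto: timedelta) -> bool:
    """True if the estimated restore time fits within the RTO."""
    return restore_time <= rto


if __name__ == "__main__":
    # A nightly backup cannot satisfy a four-hour RPO.
    print(meets_rpo(timedelta(hours=24), timedelta(hours=4)))
    # Restoring 500 GB at an assumed 250 GB per hour takes two hours.
    restore = estimated_restore_time(500, 250)
    print(restore, meets_rto(restore, timedelta(hours=4)))
```

Checks like these help match each application to a backup type: a tight RPO points toward frequent incremental or continuous protection, while a tight RTO favors approaches with short restore chains.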
Creating a backup policy
Most businesses create a backup policy to govern the methods and types of data protection they deploy and to ensure critical business data is backed up consistently and regularly. The backup policy also creates a checklist that IT can monitor and follow as the department is responsible for protecting all the organization's critical data.
A backup policy should include a schedule of backups. The policies are documented so others can follow them to back up and recover data if the main backup administrator is unavailable.
Data retention policies are also often part of a backup policy, especially for companies in regulated industries. Preset data retention rules can lead to automated deletion or migration of data to different media after it has been kept for a specific period. Data retention rules can also be set for individual users, departments and file types.
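A preset retention rule of this kind might look like the following sketch, which flags backups that have outlived their retention period. The department names, retention periods and the `BackupRecord` type are all hypothetical; real periods depend on the regulations that apply to the organization.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class BackupRecord:
    name: str
    taken: date
    department: str


# Hypothetical retention periods in days; not drawn from any regulation.
DEFAULT_RETENTION_DAYS = 90
RETENTION_BY_DEPARTMENT = {"finance": 2555, "marketing": 30}


def retention_days(record: BackupRecord) -> int:
    """Look up the rule for a record, falling back to the default."""
    return RETENTION_BY_DEPARTMENT.get(record.department, DEFAULT_RETENTION_DAYS)


def expired_backups(records: list[BackupRecord], today: date) -> list[str]:
    """Names of backups older than their retention period, in input order."""
    return [
        r.name
        for r in records
        if today - r.taken > timedelta(days=retention_days(r))
    ]


if __name__ == "__main__":
    records = [
        BackupRecord("fin-old", date(2023, 1, 1), "finance"),
        BackupRecord("mkt-old", date(2024, 4, 1), "marketing"),
        BackupRecord("eng-old", date(2024, 1, 1), "engineering"),
    ]
    print(expired_backups(records, date(2024, 6, 1)))
```

The returned names would then be handed to whatever process deletes the backups or migrates them to different media.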
A backup policy should call for capturing an initial full data backup, along with a series of differential or incremental backups in between full backups. At least two full backup copies should be maintained, with at least one located off-site.
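The copy-count rule can be expressed as a simple automated check. The sketch below is a hypothetical illustration: the `FullCopy` record and `policy_violations` function are invented names, and the thresholds default to the two-copy, one-off-site guidance given above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FullCopy:
    location: str   # free-text label, e.g. a site or cloud bucket name
    off_site: bool


def policy_violations(
    copies: list[FullCopy], min_copies: int = 2, min_off_site: int = 1
) -> list[str]:
    """Return human-readable problems; an empty list means compliant."""
    problems = []
    if len(copies) < min_copies:
        problems.append(
            f"only {len(copies)} full copies; policy requires {min_copies}"
        )
    off_site_count = sum(1 for c in copies if c.off_site)
    if off_site_count < min_off_site:
        problems.append(
            f"only {off_site_count} off-site copies; policy requires {min_off_site}"
        )
    return problems


if __name__ == "__main__":
    copies = [FullCopy("primary data center", False), FullCopy("backup appliance", False)]
    for problem in policy_violations(copies):
        print(problem)
```

In this example, two on-premises copies satisfy the count but not the off-site requirement, so the check reports one violation.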
Backup policies need to focus on recovery, often more so than the actual backup, because backed-up data is not much use if it cannot be recovered when needed. And recovery is key to DR.
Backup policies used to deal mainly with getting data to and from tape. Now, most data is backed up to disk, and public clouds are often used as backup targets. The process of moving data to and from disk, cloud and tape is different for each target, so that should be reflected in the policy. Backup processes can also vary depending on application -- for instance, a database might require different treatment than a file server.