The truth about using snapshot technology as data protection
Marc Staimer discusses using snapshot technology in your data protection strategy and outlines issues you should know about before opting for this approach.
Many data protection administrators, analysts and vendors have become enamored of snapshot technologies. There's a lot to like. There are also misconceptions that can lead to dysfunctional disaster recovery and business continuity plans. These technologies, like all data protection technologies, are not a panacea even though many administrators have come to believe that they are. All data protection technologies have tradeoffs. Snapshot technologies are no exception.
Disclaimer: the snapshot technology described herein is general and not a reference to any specific vendor's implementation. Vendors are very clever. There are many variations, and no two vendors or products are exactly alike. The descriptions expressed are based on the author's experience and are generally correct.
There are two discrete methodologies for data snapshots. The first creates an exact duplicate copy of the data at a specific point in time (PIT), commonly called a clone. The second copies only the data's state, or metadata, at a PIT. Either type of snapshot can be taken of a logical unit number (LUN), volume, virtual LUN, virtual volume, file system, file store, VMware Virtual Machine Disk (VMDK) or Microsoft Hyper-V Virtual Hard Disk (VHD). So, what's the difference?
Complete data duplication takes time, and the more data there is to copy, the longer it takes. Clones are commonly synchronous mirrors performed with each write. Many take advantage of triple mirrors, where the third mirror is broken off to create a PIT clone. After the clone is created, the third mirror is reestablished and resynchronized until it is up to date. Since the third mirror's data is already established, turning it into a PIT clone is instantaneous. The clone is immediately available for mounting and can be used for data protection, data warehousing, archiving, test/dev, etc. The PIT clone has no impact on the utilization of the second mirror for data protection. There are variations of the PIT clone that copy only the blocks or files that have changed since the last snapshot. But all variations consume considerable capacity.
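The split-mirror approach can be illustrated with a toy model. This is a minimal sketch, not any vendor's implementation: the class MirroredVolume and its methods are hypothetical names, blocks are plain Python list entries, and the resynchronization is a naive full copy where a real array would typically track changed blocks.

```python
class MirroredVolume:
    """Toy volume kept as N synchronous mirrors (illustrative only)."""

    def __init__(self, blocks, copies=3):
        # Every mirror holds a full, independent copy of the data.
        self.mirrors = [list(blocks) for _ in range(copies)]

    def write(self, index, data):
        # Synchronous mirroring: each write lands on every attached mirror.
        for mirror in self.mirrors:
            mirror[index] = data

    def split_clone(self):
        # Breaking off the last mirror is instantaneous because it is
        # already a complete copy; it becomes the point-in-time clone.
        return self.mirrors.pop()

    def reestablish_mirror(self):
        # Assumption: naive full resync from the primary mirror.
        self.mirrors.append(list(self.mirrors[0]))


mirrored = MirroredVolume(["a", "b"])
mirrored.write(0, "A")
clone = mirrored.split_clone()   # PIT clone frozen at this moment
mirrored.write(1, "B")           # later writes do not touch the clone
mirrored.reestablish_mirror()    # third mirror is rebuilt and caught up
```

The clone keeps the data as it was at the split, while the volume returns to three full copies. That is also why this approach consumes so much capacity.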
This is why state-based, or metadata, snapshots are more popular than PIT clones: they consume a very small amount of capacity. The state-based snapshot is also known as a copy-on-write (COW) or redirect-on-write (ROW) snapshot. COW and ROW snapshots make a copy of the metadata, or pointers, describing what the data looks like at a given PIT. That's not a lot of data to consume capacity, and taking the snapshot is, for all intents and purposes, instantaneous. However, there are significant differences between the two technologies.
COW requires that capacity be reserved for the full amount of the data being snapped. Today, many vendors utilize thin provisioning, so this is not as onerous as it sounds. The capacity is actually assigned only when the data is being copied, and the data gets copied only when changes are about to be made to that data set. Before a change is applied, the COW snapshot first copies the data that's about to be changed to the reserved space, which preserves that data as it existed at the PIT. This is known as a double-write performance penalty because the data being changed must be copied first, before any changes can be applied. COW theoretically makes a complete copy of the data for each snapshot as the data changes, but only if the data changes. The actual copying of the data acts as a natural governor on the number and frequency of snapshots a storage administrator will take and keep.
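The copy-before-change behavior described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not a real product's design: CowVolume, its methods and the copy_writes counter are hypothetical names, and the reserved space is modeled as one dictionary per snapshot.

```python
class CowVolume:
    """Toy block volume with copy-on-write snapshots (illustrative only)."""

    def __init__(self, blocks):
        self.blocks = list(blocks)  # live data, overwritten in place
        self.snapshots = []         # per snapshot: block index -> preserved data
        self.copy_writes = 0        # extra writes caused by preserving old data

    def snapshot(self):
        # No data is copied when the snapshot is taken; the preserve
        # area starts out empty.
        self.snapshots.append({})
        return len(self.snapshots) - 1

    def write(self, index, data):
        # First write: preserve the old block for any snapshot that has
        # not saved it yet. Second write: apply the change in place.
        for preserved in self.snapshots:
            if index not in preserved:
                preserved[index] = self.blocks[index]
                self.copy_writes += 1
        self.blocks[index] = data

    def read_snapshot(self, snap_id):
        # Preserved blocks come from the snapshot area; unchanged
        # blocks are still read from the live volume.
        preserved = self.snapshots[snap_id]
        return [preserved.get(i, live) for i, live in enumerate(self.blocks)]


cow = CowVolume(["a", "b", "c"])
cow_snap = cow.snapshot()
cow.write(1, "B")    # old "b" is copied out before the overwrite
cow.write(1, "B2")   # already preserved, so no second copy is needed
```

In this sketch only the first change to a block after a snapshot pays the extra copy, and blocks that never change are never copied.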
ROW does not require any capacity reservation because it writes all changes to the data separately from the PIT snapshot image and ties them together with pointers. ROW snapshots are commonly on the same volume, LUN, virtual volume, file system or virtual file system -- but not always. ROW is a bit more complicated than COW in that it requires more intelligence (smart algorithms) in piecing the data together on reads. The added complexity often adds marginal latency to reads as the volume of ROW snapshots increases.
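For comparison, a redirect-on-write snapshot can be sketched the same way. Again, this is a minimal sketch with hypothetical names (RowVolume, store, live_map). It assumes each snapshot keeps a full copy of the pointer map, which keeps reads simple here; implementations that chain snapshots together are where the added read latency mentioned above tends to come from.

```python
class RowVolume:
    """Toy block volume with redirect-on-write snapshots (illustrative only)."""

    def __init__(self, blocks):
        self.store = list(blocks)                 # physical blocks, never overwritten
        self.live_map = list(range(len(blocks)))  # logical index -> store location
        self.snapshots = []                       # each snapshot is a saved pointer map

    def snapshot(self):
        # Only pointers are copied; no data blocks are duplicated.
        self.snapshots.append(list(self.live_map))
        return len(self.snapshots) - 1

    def write(self, index, data):
        # The change is redirected to a new location and the live
        # pointer is updated; the old block stays where it was.
        self.store.append(data)
        self.live_map[index] = len(self.store) - 1

    def read_all(self):
        return [self.store[loc] for loc in self.live_map]

    def read_snapshot(self, snap_id):
        return [self.store[loc] for loc in self.snapshots[snap_id]]


row = RowVolume(["a", "b", "c"])
row_snap = row.snapshot()
row.write(1, "B")   # a single write; nothing is copied first
```

Unlike COW, the write path has no copy step. The snapshot and the live volume simply point at different blocks where they differ and at the same blocks where they do not.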
ROW does not actually make copies of the data nor does it consume as much capacity as COW, and thus it enables more frequent snapshots that can be retained for a longer period of time. But since there are no actual copies of data, ROW snapshots can be a significant data protection issue. If there is any corruption of the original data, all snapshots that follow are also corrupted. If there is a corruption of changed data, any snapshots that follow will also be corrupted. Note this also applies to COW snapshots before the snapshot is actually copied.
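The corruption risk follows directly from the shared pointers. The toy example below uses made-up block contents and hypothetical names to show that when an unchanged block is damaged, every snapshot that points to it sees the damage.

```python
# Physical blocks on the storage system (contents are made up).
block_store = {0: "ledger-v1", 1: "index-v1"}

# The live volume and each snapshot hold pointers only, not data.
live_view = {"ledger": 0, "index": 1}
snap_monday = dict(live_view)
snap_tuesday = dict(live_view)


def resolve(pointer_map):
    """Follow the pointers to build the image a reader would see."""
    return {name: block_store[loc] for name, loc in pointer_map.items()}


# Simulate corruption of a block that no snapshot ever copied.
block_store[0] = "#CORRUPT#"

images = [resolve(m) for m in (live_view, snap_monday, snap_tuesday)]
```

All three images now return the corrupted ledger block, because there was only ever one physical copy of it.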
This is not the only COW and ROW data protection issue. Both provide crash-consistent images of the data, meaning the snapshots look like exact data replicas at a PIT, the same as if the system had been shut down abruptly. The snapshots are not application-aware, which is an issue with structured data (data that requires a database). It is possible to snapshot a database application in an inconsistent state. If that happens, it takes manual effort and time from the database administrator (DBA) to bring the database to a consistent state, usually by journaling forward. Occasionally, the DBA cannot correct the situation.
To get an application-consistent image, a database application should be quiesced (temporarily stopped from performing any more transactions), its cache flushed and its writes completed in the proper order, including the index and metadata; only then should it be snapped. To do this with a storage-, appliance-, server- or hypervisor-based snapshot requires a plug-in (client or agent software).
It's pretty simple with Microsoft. Most database applications under Windows are compatible with VSS (Volume Shadow Copy Service). VSS will quiesce these database applications on demand. The snapshot technology just needs to talk to the VSS API, but not all snapshot technologies do. VMware has VSS compatibility with Microsoft Windows, and many storage vendors do as well.
Other applications, such as Oracle, Teradata, DB2, MySQL, PostgreSQL, MongoDB, etc., require specific software plug-ins. Those plug-ins are available from some backup software, replication software and storage vendors. The plug-ins allow the data protection server (or appliance) to put the database application into a consistent state and quiesce it, flush the cache, complete all the writes in the correct order, tell the snapshot application to take the snapshot, and then release the database back to an active state. The data protection server can control the catalog or leave it to the snapshot device. It can also copy the data out to another storage target, including SAN, network-attached storage (NAS), object or cloud storage.
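The sequence a plug-in drives (quiesce, flush, snapshot, release) can be sketched as follows. This is a hedged illustration only: DemoDatabase and its quiesce, flush and resume methods are hypothetical stand-ins, not the API of any real database, backup product or storage vendor.

```python
from contextlib import contextmanager


class DemoDatabase:
    """Hypothetical stand-in for a database plug-in (not a real API)."""

    def __init__(self):
        self.accepting_writes = True
        self.log = []

    def quiesce(self):
        self.accepting_writes = False
        self.log.append("quiesce")

    def flush(self):
        self.log.append("flush")

    def resume(self):
        self.accepting_writes = True
        self.log.append("resume")


@contextmanager
def quiesced(db):
    """Hold the database in a consistent state, then always release it."""
    db.quiesce()
    try:
        db.flush()
        yield db
    finally:
        # Release the database even if the snapshot call fails.
        db.resume()


def application_consistent_snapshot(db, take_snapshot):
    # take_snapshot is whatever callable triggers the storage snapshot.
    with quiesced(db):
        snap_id = take_snapshot()
        db.log.append("snapshot")
    return snap_id


demo_db = DemoDatabase()
demo_snap_id = application_consistent_snapshot(demo_db, lambda: "snap-001")
```

Wrapping the release step in a finally clause reflects a design concern rather than a vendor requirement: a database left quiesced after a failed snapshot is an outage.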
Like all data protection technologies, snapshots have flaws. Those flaws make snapshot technology best utilized as a part of a comprehensive data protection strategy, not as the entire data protection package.
Readers often want a declarative statement as to which snapshot technology is best. Unfortunately, it depends on organizational priorities. There is no one "best" snapshot variation for all circumstances. In summary: ROW consumes the least amount of capacity but has lower reliability and higher potential latency. Cloning has the highest reliability but consumes the most capacity (capacity consumption can be offset by not utilizing triple mirroring, with the downside being much longer snapshot creation times). And COW is somewhere in between, but a bit closer to ROW snapshots than to cloning. Choose the technology that works best for your organization, not the one that works best for the supplying vendor.
About the author:
Marc Staimer is the founder of Dragon Slayer Consulting in Beaverton, Ore. Marc can be reached at [email protected].