Whether your archive is considered a deep archive or an active archive, it essentially consists of the following:
Pre-archive policies: Selected data must be moved from production storage (expensive storage optimized for frequent data access and modification) to archival storage (usually, but not always, less expensive storage optimized for infrequent access and near-zero rates of modification), typically when specified conditions are met. Pre-archiving workflows are usually required to identify the data sets that will be retained in the archive and to define the policies governing their migration and retention schedules.
Archive data ingestion and protection: Preferably via some automated method, selected data needs to be moved to the archive environment and verified for readability. Once located on the data archiving technology, the data must be subjected to occasional data-copy processes to ensure it's protected from corruption or loss. Since archival data does not change frequently, data protection requirements in the archive environment may differ significantly from production data backup and replication. Backups can be made less frequently, and expensive "across the wire" WAN-based replication is usually not required or cost-justified.
Maintenance and administration: After a designated period of time, archived data needs to be migrated to fresh media -- un-ingested and re-ingested -- to accommodate changes in archive containers or wrappers, or to leverage a change in storage technology. At some point, archival data may expire -- and the need to retain it may end. When this happens, there is usually some sort of process for isolating and reviewing deletion candidates and for "electronic shredding" when the decision is made to delete. This is the administrative work of archiving, and it's a crucial step to combat the dangerous save-everything mentality that dominates many IT shops.
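The three sets of activities above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of any vendor's product: the function names, the idle-time selection rule, the use of SHA-256 for read-back verification and the use of file modification time as the retention clock are all hypothetical choices made for the example.

```python
import hashlib
import shutil
import time
from pathlib import Path
from typing import List, Optional

DAY = 86400  # seconds per day


def sha256_of(path: Path) -> str:
    """Read the whole file back and return its SHA-256 digest."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def select_candidates(production_root: Path, min_idle_days: float,
                      now: Optional[float] = None) -> List[Path]:
    """Pre-archive policy (assumed rule): files not modified for min_idle_days."""
    now = time.time() if now is None else now
    cutoff = now - min_idle_days * DAY
    return sorted(p for p in production_root.rglob("*")
                  if p.is_file() and p.stat().st_mtime <= cutoff)


def ingest(src: Path, production_root: Path, archive_root: Path) -> str:
    """Copy one file into the archive, verify it is readable, then free production."""
    dest = archive_root / src.relative_to(production_root)
    dest.parent.mkdir(parents=True, exist_ok=True)
    expected = sha256_of(src)
    shutil.copy2(src, dest)  # copy2 also carries over the modification time
    if sha256_of(dest) != expected:
        dest.unlink()
        raise IOError(f"verification failed for {dest}")
    src.unlink()  # remove the production copy only after a verified ingest
    return expected


def expired_candidates(archive_root: Path, retention_days: float,
                       now: Optional[float] = None) -> List[Path]:
    """Administration: list deletion candidates for review; nothing is deleted here."""
    now = time.time() if now is None else now
    cutoff = now - retention_days * DAY
    return sorted(p for p in archive_root.rglob("*")
                  if p.is_file() and p.stat().st_mtime <= cutoff)
```

A real policy would typically weigh more than idle time (data owner, regulatory class, project status), and the returned digest would be recorded so that later data-copy and media-migration passes can be checked against it.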
Clearly, technology plays a role in optimizing each of these three sets of activities. Specific data archiving technology has been developed over time to facilitate the processes described above. Data storage pros can keep the list above in mind when monitoring the technology benchmarks and trends I've outlined below.
In archive media, watch for capacity and resiliency improvements. Disk is poised to grow in capacity as a function of changes to write methods, including heat-assisted and acoustically assisted magnetic recording, and redesigned disk approaches, such as helium-filled drives with more physical platters. Seagate Technology, which promotes heat-assisted magnetic recording on shingled media as the next evolution of the disk drive, has demonstrated 60 TB capacities on a 3.5-inch drive, while Western Digital recently released a helium drive with seven platters instead of the normal five and a 6 TB capacity that it expects to grow significantly over time. In 2011, IBM and Toshiba demonstrated a 40 TB 2.5-inch drive using bit-patterned media. All vendors argue that the additional capacity will not come with increased energy requirements, suggesting that power costs for a disk-based archive will not grow with capacity in the future.
Of course, tape isn't standing still. With innovations such as barium ferrite particles and Nanocubic coating, Fujifilm has demonstrated (with IBM in 2010) a standard form factor tape with a potential 35 TB of capacity. Current-generation applications of the technology have produced an LTO tape with 2.5 TB of uncompressed capacity and an Oracle tape, the T10000D cartridge, with a native uncompressed capacity of 8.5 TB. Resiliency tests of these media types have substantiated vendor claims of 30-year durability with proper handling and environmental conditions. The net result is the possibility of a tape library in the very near future with more than 100 PB of capacity occupying two to four raised floor tiles and consuming less power than a few incandescent light bulbs.
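To put the 100 PB figure in perspective, here is a back-of-the-envelope calculation of how many cartridges such a library would need at each of the native capacities quoted above. It assumes decimal units (1 PB = 1,000,000 GB) and ignores compression, spare cartridges and redundant copies; the function and variable names are made up for the example.

```python
# Native (uncompressed) cartridge capacities quoted in the text, in gigabytes.
CAPACITY_GB = {
    "LTO (2.5 TB)": 2_500,
    "Oracle T10000D (8.5 TB)": 8_500,
    "Demonstrated 35 TB tape": 35_000,
}

LIBRARY_GB = 100 * 1_000_000  # 100 PB, assuming decimal units


def cartridges_needed(library_gb: int, cartridge_gb: int) -> int:
    """Ceiling division: a partly filled cartridge still occupies a slot."""
    return -(-library_gb // cartridge_gb)


for name, capacity in CAPACITY_GB.items():
    print(f"{name}: {cartridges_needed(LIBRARY_GB, capacity):,} cartridges")
```

On these assumptions, 100 PB takes 40,000 cartridges at 2.5 TB, 11,765 at 8.5 TB and 2,858 at 35 TB, which is why cartridge capacity matters so much to the floor-tile footprint.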
Less costly, more resilient media
So, media improvements suggest that archive platforms will become less expensive to operate and more capacious and resilient going forward. Disk array vendors are proffering scale-out architectures, ranging from storage virtualization-based platforms to storage clusters, that can leverage deduplication and compression features to squeeze the maximum amount of data onto the fewest spindles. Keep in mind, however, that the complexity of some of these platforms and their use of proprietary data reduction and content-addressing technologies raise concerns among some archive planners about the vulnerability of these systems as long-term archive repositories.
In fact, a significant technology for archivists is the Linear Tape File System (LTFS), introduced by IBM in 2010, which enables data to be stored on tape using a native file system. In 2013, IBM submitted LTFS to the Storage Networking Industry Association for development as a standard, and also released an Enterprise Edition of the technology that enhances its interoperability with IBM's General Parallel File System (GPFS). This provides not only shared access to data from heterogeneous clients, but also the ability to automatically migrate files into a tape-based LTFS repository from any storage tier running GPFS. The combination is said to return archive to its place in a tiered storage architecture, where it sat in the mainframe environments of the 1970s and 1980s.
Alternatively, Spectra Logic has introduced its own "deep storage" tier, which enables the ready archiving of large amounts of bulk data via a Deep Storage 3 (DS3) protocol using modified REST-standard commands (e.g., "bulk get" and "bulk put") and a specialized appliance called BlackPearl that front-ends an LTFS-formatted tape library. Spectra Logic has developed a unique data archive ingestion scheme that intervenes in the established production workflow and archives output through its technology onto tape. The company has also solved a knotty issue for LTFS: what to do with small (or short-block) files, for which LTFS is not optimized. Spectra's solution was to aggregate many smaller files into larger objects that can then be stored more efficiently in LTFS. So, in the final analysis, Spectra Logic has introduced a mash-up: object storage fitted to a familiar hierarchical file system.
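The general idea of aggregating small files into larger objects can be illustrated with Python's standard tarfile module. To be clear, this is not Spectra Logic's implementation and makes no use of the DS3 protocol; the bundle naming, the size threshold and the function names are assumptions made for the sketch.

```python
import tarfile
from pathlib import Path
from typing import Iterable, Iterator, List


def batch_by_size(files: Iterable[Path], target_bytes: int) -> Iterator[List[Path]]:
    """Group files into batches whose combined size stays near target_bytes.

    A single file larger than target_bytes ends up in a batch of its own.
    """
    batch: List[Path] = []
    size = 0
    for path in files:
        file_size = path.stat().st_size
        if batch and size + file_size > target_bytes:
            yield batch
            batch, size = [], 0
        batch.append(path)
        size += file_size
    if batch:
        yield batch


def pack_small_files(files: Iterable[Path], root: Path, out_dir: Path,
                     target_bytes: int) -> List[Path]:
    """Write each batch as one uncompressed tar container and return the containers."""
    out_dir.mkdir(parents=True, exist_ok=True)
    containers: List[Path] = []
    for index, batch in enumerate(batch_by_size(sorted(files), target_bytes)):
        container = out_dir / f"bundle-{index:05d}.tar"  # hypothetical naming scheme
        with tarfile.open(container, "w") as tar:
            for path in batch:
                # Store paths relative to root so the hierarchy survives a restore.
                tar.add(path, arcname=path.relative_to(root).as_posix())
        containers.append(container)
    return containers
```

Writing a handful of large containers rather than thousands of tiny files keeps the tape streaming instead of repeatedly stopping and repositioning, at the cost of needing an index to locate an individual file inside its container.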
It's worth noting that Spectra Logic's DS3 is derived from S3, the protocol Amazon Web Services uses for moving data into its successful cloud storage offerings. With or without DS3, vendors of LTFS appliances have been working to establish a cloud model of their own. Fujifilm has taken an early lead in this space by leveraging the StrongBox appliance from Crossroads Systems to support ingestion of data into archival clouds it has created for medical imaging, media and entertainment, and, more recently, general-purpose archive; Permivault and d:ternity are two examples. From an economic standpoint, tape-based cloud archives make significantly greater sense than disk-based ones, assuming the savings are passed along to consumers.