A primer on DNA data storage and its potential uses
DNA storage isn't just a futuristic concept -- many companies are actively involved in its development and promotion. There's a major use case in data archiving.
The history of storage technology has been a continuous struggle to encode more information in smaller spaces at lower cost. Once information storage became digital, the endeavor centered on creating smaller magnetic, optical and silicon structures to encode a bit of information. While technology continues to squeeze more bits onto a chip or disk, encoding data in the tightly wound double-helix of DNA promises far higher densities.
The (very) fine print on DNA
One full turn of the DNA double helix spans 10 base pairs, is 3.4 nanometers (nm) long and is about 2 nm in diameter. Each base pair is a combination of two nucleotides: adenine (A) with thymine (T), or cytosine (C) with guanine (G), according to the National Human Genome Research Institute. If each pair represented a bit, for example AT or TA as zero and CG or GC as one, a DNA strand could conceivably hold 10 bits in a footprint of 3.4 nm by 2 nm, or 6.8 square nm. In other words, DNA information density is about 1.47 terabits/mm² or 950 terabits/in², which is more than 800 times the density of HDDs.
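The sizing estimate above can be reproduced in a few lines of Python. This is only a sketch of the same back-of-the-envelope arithmetic: the constants come from the figures quoted in the text, and the function names are made up for illustration.

```python
# Back-of-the-envelope DNA areal density, using the figures quoted in the text.
BITS_PER_TURN = 10        # assumes one bit per base pair, 10 base pairs per helix turn
TURN_LENGTH_NM = 3.4      # length of one helix turn
HELIX_DIAMETER_NM = 2.0   # width of the double helix

NM2_PER_MM2 = 1e12        # (1e6 nm per mm) squared
MM2_PER_IN2 = 25.4 ** 2   # 645.16 square mm per square inch


def dna_density_bits_per_nm2():
    """Bits stored per square nanometer of helix footprint."""
    footprint_nm2 = TURN_LENGTH_NM * HELIX_DIAMETER_NM
    return BITS_PER_TURN / footprint_nm2


def to_terabits_per_mm2(bits_per_nm2):
    return bits_per_nm2 * NM2_PER_MM2 / 1e12


def to_terabits_per_in2(bits_per_nm2):
    return to_terabits_per_mm2(bits_per_nm2) * MM2_PER_IN2


density = dna_density_bits_per_nm2()
print(f"{to_terabits_per_mm2(density):.2f} Tbit/mm^2")
print(f"{to_terabits_per_in2(density):.0f} Tbit/in^2")
```

The per-square-inch result comes out slightly under 950, which is consistent with the rounded figure in the text.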
When one considers that there are 3 billion base pairs in a microscopic human genome tightly wound in each cell, the DNA data storage opportunities are vast.
Unfortunately, our back-of-the-envelope calculation vastly oversimplifies DNA storage processes. The technology used today to synthesize, store and sequence DNA is error-prone, so any DNA data storage system needs substantial redundancy and sophisticated data coding.
Nonetheless, the explosive growth in data generation will require revolutionary techniques for storage, particularly for archival purposes. Gartner has high expectations for DNA storage, noting that "all of human knowledge could be stored in a small amount of synthetic DNA." The organization claimed that 30% of "digital businesses" will conduct DNA data storage trials by 2024. Because DNA can be preserved for very long periods, Gartner sees archival storage of music, video and statistical data as potential applications for DNA storage.
Basic technology, challenges and limitations
DNA data storage and retrieval is a six-step process that converts a digital bitstream into a sequence of base pairs and back again. It is conceptually similar to encoding bits as a series of pits and lands on an optical disc.
Steps to complete this process include the following:
- Coding translates the bitstream into a sequence of base pairs and is an active area of research. Some schemes use the simple one-bit-per-pair scenario described above in our sizing estimate. However, more advanced techniques use Huffman coding, sometimes paired with Reed-Solomon error correction codes, to resist degradation errors from long-term storage.
- Synthesis and assembly uses various biological reactions to create short sequences of DNA and assemble them into longer strands. Because it is much faster and cheaper to generate DNA snippets of a few hundred base pairs than long genome-like sequences, DNA data storage breaks data apart into blocks that are encoded and indexed. The technique is conceptually similar to how a disk drive decomposes files or databases into logical blocks, or how IP networks packetize data before transmission.
- Storage preserves the synthesized DNA so that it degrades as little as possible over time. Exposure to water and oxygen greatly accelerates DNA degradation at room temperature. As a result, most systems keep samples in vitro, sealed in vials in an inert solution or solid. Indeed, in the right environment, DNA can remain intact for a very long time -- scientists recently extracted the genome from the teeth of a million-year-old Siberian mammoth.
- Retrieval extracts subsets of DNA from a larger sample. There are several techniques for random access extraction from a larger DNA pool. They typically use polymerase chain reaction (PCR) amplification, the same method used in COVID-19 PCR tests.
- Sequencing reads the series of DNA nucleotide base pairs through techniques similar to what medical genetic testing uses. DNA snippets are often sequenced in parallel to accelerate the process.
- Decoding turns the base pair sequence back into a binary stream by reversing the coding scheme and reassembling the indexed data segments in their original order.
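The coding, indexing and decoding steps can be sketched in Python. The example below is a toy illustration, not any vendor's scheme: it assumes a simple two-bits-per-nucleotide mapping (00 to A, 01 to C, 10 to G, 11 to T) instead of the one-bit-per-pair or Huffman-based schemes mentioned above, and the block size and 2-byte index prefix are hypothetical values chosen for readability. It also leaves out error correction entirely.

```python
import random

BASES = "ACGT"       # assumed mapping: 00->A, 01->C, 10->G, 11->T
INDEX_BYTES = 2      # hypothetical: each strand starts with a 2-byte block index
PAYLOAD_BYTES = 16   # hypothetical: bytes of user data carried per strand


def bytes_to_bases(data: bytes) -> str:
    """Coding: turn each byte into four nucleotides, two bits at a time."""
    out = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            out.append(BASES[(byte >> shift) & 0b11])
    return "".join(out)


def bases_to_bytes(strand: str) -> bytes:
    """Inverse of bytes_to_bases; expects a length that is a multiple of 4."""
    out = bytearray()
    for i in range(0, len(strand), 4):
        byte = 0
        for base in strand[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)


def encode(data: bytes) -> list[str]:
    """Split data into indexed blocks and encode each block as a short strand."""
    strands = []
    for n, start in enumerate(range(0, len(data), PAYLOAD_BYTES)):
        block = n.to_bytes(INDEX_BYTES, "big") + data[start:start + PAYLOAD_BYTES]
        strands.append(bytes_to_bases(block))
    return strands


def decode(strands: list[str]) -> bytes:
    """Decoding: read each strand's index, then reassemble blocks in order."""
    blocks = {}
    for strand in strands:
        raw = bases_to_bytes(strand)
        index = int.from_bytes(raw[:INDEX_BYTES], "big")
        blocks[index] = raw[INDEX_BYTES:]
    return b"".join(blocks[i] for i in sorted(blocks))


message = b"DNA storage keeps archives for a very long time."
pool = encode(message)
random.shuffle(pool)   # strands in a DNA pool have no inherent order
assert decode(pool) == message
```

The shuffle stands in for the fact that strands in a pool are unordered, which is why each block carries its own index, much like a sequence number in a network packet.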
Uses and notable companies
DNA data storage is rapidly headed from the lab to production. However, because the synthesis and sequencing processes are slow compared to electronic information processing, the only feasible application is archival storage. For example, it currently takes hours to write a few gigabytes of data, although an experimental parallel processing technique claims to reach a terabyte per day.
DNA storage has a tolerance for high error rates. Unlike in pharmaceutical uses where small errors in the DNA sequence can have profound effects, the ability to employ sophisticated redundancy and encoding algorithms means storage systems can maintain full data fidelity with error rates of 10% or higher in the synthesis and sequencing processes.
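To see why redundancy lets a storage system tolerate noisy synthesis and sequencing, consider the simplest possible scheme: read many copies of the same strand and take a majority vote at each position. The sketch below is an assumption-laden stand-in for the real codes (such as Reed-Solomon) mentioned earlier. It models substitution errors only, ignores insertions and deletions, and uses a hypothetical 10% per-base error rate and 15 reads.

```python
import random
from collections import Counter

BASES = "ACGT"


def corrupt(strand, error_rate, rng):
    """Return a copy in which each base is substituted with probability error_rate."""
    out = []
    for base in strand:
        if rng.random() < error_rate:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


def consensus(reads):
    """Majority vote at each position across equal-length reads."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*reads)
    )


rng = random.Random(0)   # seeded so the demo is repeatable
original = "".join(rng.choice(BASES) for _ in range(100))
reads = [corrupt(original, 0.10, rng) for _ in range(15)]
recovered = consensus(reads)

single_read_errors = sum(a != b for a, b in zip(original, reads[0]))
consensus_errors = sum(a != b for a, b in zip(original, recovered))
print(f"errors in one read:  {single_read_errors}")
print(f"errors in consensus: {consensus_errors}")
```

Any single read is expected to contain roughly 10 wrong bases out of 100, while the consensus is very likely to match the original. Production systems would use far more efficient codes than brute-force repetition, but the principle of trading extra copies for fidelity is the same.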
The video streaming industry produced a compelling example of the emerging use of DNA for archival data storage. Twist Bioscience recently worked with Netflix to demonstrate the feasibility of DNA for video preservation. Researchers at ETH Zurich encoded the first episode of the Netflix series Biohackers into DNA nucleotide sequences, which were then synthesized into DNA strands using Twist Bioscience's silicon platform. For a sense of scale, raw, uncompressed 4K video runs about 250 MBps, which translates to 750 GB for a 50-minute episode. It's an impressive demonstration of DNA's potential as an archival medium.
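The video-size figure is simple arithmetic, and it can be combined with the experimental terabyte-per-day write rate mentioned earlier to get a rough feel for write times. The sketch below assumes decimal units (1 GB = 1,000 MB) and a hypothetical case in which an episode is archived as raw video. The function names are illustrative only.

```python
MB_PER_GB = 1000   # decimal units assumed
GB_PER_TB = 1000


def raw_video_size_gb(rate_mb_per_s, minutes):
    """Size of a raw video stream of the given data rate and duration."""
    return rate_mb_per_s * minutes * 60 / MB_PER_GB


def write_time_hours(size_gb, tb_per_day):
    """Hours needed to write size_gb at a sustained rate of tb_per_day."""
    return size_gb / GB_PER_TB / tb_per_day * 24


size = raw_video_size_gb(250, 50)
print(size)                          # 750.0
print(write_time_hours(size, 1.0))   # 18.0
```

At a terabyte per day, a raw 750 GB episode would take about 18 hours to write, which illustrates why the technology is a fit for archives and not for active workloads.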
Twist Bioscience is a leader in DNA data storage and recently presented its technology at the Stanford Compression Workshop 2021. Twist Bioscience, Illumina, Microsoft and Western Digital recently formed the DNA Data Storage Alliance to promote the technology and develop an industry roadmap, use cases and educational materials. Other members include:
- Ansa Biotechnologies
- The Claude Nobs Foundation
- DNA Script
- École polytechnique fédérale de Lausanne -- Cultural Heritage & Innovation Center
- ETH Zurich -- The Swiss Federal Institute of Technology
- Molecular Assemblies
- Molecular Information Systems Lab at the University of Washington
There are several other significant companies -- including Evonetix, Helixworks, Kilobaser and Synthomics -- that are pioneering technologies such as DNA synthesis and storage materials. This work will facilitate DNA data storage as well as therapeutic applications.
DNA data storage is far closer to commercial reality than it is to science fiction. Data storage professionals responsible for archive strategies should follow developments in the field and factor DNA technology into roadmaps alongside evolutions in LTO tape and other archival storage media.