What You Need to Know About DNA Data Storage Today
Learn about DNA data storage from a senior principal research manager at Microsoft Research.
00:00 Karin Strauss: Good afternoon. It's a pleasure to be at the Flash Memory Summit today. My name is Karin Strauss, and I'm a senior principal research manager at Microsoft Research. Today, I'm going to talk about DNA data storage, that is storing digital data into synthetic DNA.
This is a collaboration between Microsoft and the University of Washington, and together we created MISL, the Molecular Information Systems Lab. And the goal is to study new developments in biology and chemistry and apply them to information technology industry problems.
So, let me talk about one such problem. What I'm showing here on the left is a chart by IDC that many of you are very familiar with, showing the growth of the digital universe, meaning all the bits that we generate over time, and contrasting that with the installed capacity, meaning all bits that we can store in devices over time. And what we're seeing is that there's a gap and it's a growing gap. Now, I'm not advocating that we store all the information we generate all the time. There's lots of temporary data, lots of data that we can discard, but the issue is that we want to store a portion of the information, and if we simply follow this trend, that portion is shrinking. And that's what I'm showing here on the right, so that can be a problem. And to address a problem like this, we may need quite different approaches to put a dent in it.
01:35 KS: So, let me use this example here to contrast the approach that we currently use with a potential new approach to doing things. So currently, the flash industry and really the semiconductor industry has been following Moore's law, that means essentially creating transistors, making them smaller, packing more devices into . . . More transistors into a device and making those devices more capable.
Now, that's not the only way to do things. So to contrast to that, I'd like to bring up something that came up in one of Feynman's famous lectures in the '50s where he was talking about manipulating molecules, and if we have the ability to manipulate molecules, then we can arrange them in ways that could be used for computation and could be used for storage. And in fact, he used DNA as an example of information storage in nature. Now, he stopped short of suggesting that DNA should be used for digital data storage, but that actually came only a few years later when the structure of DNA started to be more well understood. Okay, so what does it mean to store digital information in DNA?
03:00 KS: DNA is that double helix, and each side of the double helix is composed of bases, A, T, C and G. And if on one side of the double helix, we have A, on the other side of the double helix, we're going to have T. If on one side, we have C, on the other side, we're going to have G. So, from an information storage perspective, the two sides are redundant because if I know one side, I know what's going to be on the other side.
Now, let's focus on only one side of the double helix that is composed of sequences of these bases, A, T, C and G. So, if I want to store a sequence of bits in it, essentially what I need is a mapping, and what I'm showing here on the right is a mapping of bits to bases. So every two bits map to a single base, and so if we want to store a sequence of bits, all we have to do is to map those bits into sequences of bases, and then we can manufacture bases, these sequences of bases.
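The two-bits-per-base idea can be sketched in a few lines of Python. This is only an illustration: the specific assignment of bit pairs to bases below (`BITS_TO_BASE`) is an assumption, since the mapping on the slide is not reproduced in the transcript, and the function names are hypothetical. As the talk notes, the real encoding is more sophisticated than this.

```python
# Hypothetical mapping: the talk only says "every two bits map to a single base".
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def bytes_to_bases(data: bytes) -> str:
    """Turn bytes into a base sequence, two bits per base."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def bases_to_bytes(strand: str) -> bytes:
    """Invert bytes_to_bases; assumes the strand length is a multiple of 4."""
    bits = "".join(BASE_TO_BITS[base] for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
```

With this assumed mapping, `bytes_to_bases(b"Hi")` gives `"CAGACGGC"`, and `bases_to_bytes` turns that back into `b"Hi"`.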
Today, we have the technology to manufacture these molecules with arbitrary sequences. So obviously our encoding is more sophisticated than that, but this sort of illustrates this concept of mapping. Now we can make molecules of DNA, as I said, and what I want to make clear is that the molecules we make are not biological DNA, they're synthetic DNA. So, there's no life, no cells, no organisms involved in this type of digital data storage that I'm talking about. We're using DNA as a medium to store information. This is synthetic DNA.
04:39 KS: Alright, so why? First, density. So, what I'm showing here is a test tube, on the bottom of that test tube, there's a pink smear, and that is enough DNA . . . So that's DNA, it's dried DNA and it's enough DNA to store 10 terabytes of information. So, a whole hard drive can fit on the tip of that test tube. And to put that in perspective, here's a data center. Data centers today, you can store about an exabyte of data into a Walmart-size building. And to contrast that with DNA here, I'm showing a picture of a data center, and what you'd need to store that equivalent is essentially that pixel there, you can probably barely see it, but there's a pixel there, I promise. And in real size, that's about one cubic inch and so that's essentially one exabyte would fit on the palm of your hand. So that's really high density.
Also, one of the properties of DNA is that it may last for a long time if kept under the right conditions. In fact, there's demonstrations of DNA that's actually quite old, thousands of years or hundreds of thousands of years that has preserved its information. So, there's demonstration that under the right conditions, DNA will preserve its information. And in fact, scientists at ETH Zurich, who we work with, have shown that you can create the right conditions synthetically, so that the DNA can be encapsulated and preserve its information for a long time.
06:19 KS: And in addition to that, because of its high density, it's also pretty easy to keep the DNA around under these conditions. Alright, so that begs the question, how does that compare to other types of media? So, this is what I'm plotting here. On the Y-axis, I'm showing volumetric density, with different colors. The darker blue is showing where these technologies are today. The lighter blue is showing where we think they can get to, and then the limit shows the maximum theoretical density for DNA specifically. Now obviously, if we're going to build a system out of DNA, we need to discount different types of overhead, but even after we discount those overheads, there's still a few orders of magnitude improvement in terms of density compared to the other technologies. And also, the lifetime of DNA is quite long, as I mentioned in the previous slide.
07:21 KS: Alright. So, another desirable property of DNA is that it doesn't go obsolete. And the reason is that now that we know how to read DNA, we'll always have an interest in reading DNA because of its clinical applications. For health applications, life sciences will always want to have readers of DNA. And so, the technology won't go obsolete. It's not like floppy disks or CDs, where it's now hard to find readers for those media. With DNA, we'll always have readers. And in addition to that, the readers can improve over time independent of the media. So, we don't need to copy from one medium to the next. The medium is DNA, and readers and writers can improve independently.
Another property of DNA that we found out recently, and it was quite a pleasant surprise, is sustainability. So, what I'm talking about here is environmental sustainability, and we compared DNA to other types of commercial media. So here, specifically, I'm comparing to tape. We think the first application of DNA will be in archival storage, and so we're comparing it to tape.
08:55 KS: And we compare along three different dimensions: Greenhouse gas emissions, energy consumption and water use. And what we're showing here is that for storing one terabyte of information for a year, DNA requires less of everything, and so it's more sustainable than tape.
Alright, so let me walk you through a DNA data storage system. So, as I mentioned a few slides ago, we start with bits and then we convert them into bases, that's the encoding process. And then we make the molecules, and that's the write process, it's also called DNA synthesis. And that's what takes us to the molecular domain. So, until then, we were in the electronic domain, essentially a computer doing that encoding for us, but now, with synthesis, we go into the molecular domain. We store those molecules and when it's time to read, we'll do random access, meaning selecting molecules that contain the data we want to read. We'll sequence them, that's the read process, and that converts that information back into the electronic domain. It's a noisy representation of the molecules, and then we use coding theory to recover the bits, and this is not unlike what we do with hard drives, flash drives, etc.
10:21 KS: Alright, so is it practical? How far have people gone with it? So, we have been able to store one gigabyte of information into DNA. We tried to pick from different types of media like Project Gutenberg, the Universal Declaration of Human Rights. We stored a high-definition video from OK Go, which I would highly recommend. We stored databases, we stored archival-quality music in DNA. And one gigabyte doesn't sound like a lot for the storage industry, but it's a start. Flash hit one gigabyte at some point in the past as well. And it's actually been a big deal for the biotechnology area, so much so that we ended up on the cover of Nature Biotechnology. And that has been growing. What we're seeing here over time is essentially exponential growth in the amount of data we can store in DNA, and we think that this will continue to happen.
11:33 KS: Alright, so now, I'm going to walk you through each of the steps in encoding and decoding data from DNA. First step is encoding. So here, I'm showing this OK Go video. It's 44 megabytes of information. Now, we have the ability to make sequences of DNA that are about 10 to 20 bytes. And so, if we want to store 44 megabytes of information, what we need to do is to chop up the data and identify each piece of the sequence. And so, this is no different than mapping a big binary onto multiple floppy disks. You have to number those floppy disks. So essentially what we're doing is numbering the molecules that carry the information in the order that the information is organized in the original data. In addition to that, we want to correct errors, so we add some redundancy. And finally, we translate the bits into bases of DNA. We also add tags, and this is a chemical file identifier here.
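The "numbering the floppy disks" step can be sketched as follows. This is a minimal Python illustration, not the lab's actual format: the 16-byte payload size is simply a value inside the "10 to 20 bytes" range mentioned in the talk, the 4-byte index width is an assumption, and the redundancy and chemical file tags that the talk describes are left out.

```python
def split_into_indexed_chunks(data: bytes, payload_size: int = 16,
                              index_size: int = 4) -> list[bytes]:
    """Chop data into small payloads and prefix each with its position.

    payload_size and index_size are illustrative assumptions.
    """
    chunks = []
    for index, start in enumerate(range(0, len(data), payload_size)):
        payload = data[start:start + payload_size]
        # The index plays the role of the number written on each floppy disk.
        chunks.append(index.to_bytes(index_size, "big") + payload)
    return chunks


def reassemble(chunks: list[bytes], index_size: int = 4) -> bytes:
    """Restore the original data from chunks that arrive in any order."""
    ordered = sorted(chunks, key=lambda c: int.from_bytes(c[:index_size], "big"))
    return b"".join(chunk[index_size:] for chunk in ordered)
```

Because molecules in a pool have no physical order, the index carried inside each chunk is what lets `reassemble` put the pieces back in sequence.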
12:49 KS: So, once we have sequences that we want to store in DNA, store into molecules, the next step is DNA synthesis, so that's essentially manufacturing the molecules, and these molecules are manufactured essentially base by base. So let's say here in this example, we start with an A, we want to add a C, and so what happens is we flow in a fluid with all those Cs, a C attaches to that A, and that C you see has a blocking group on top of it, that little hat on top of the C prevents other Cs from attaching so we can attach a single C there. The next step strengthens the bond between A and C, and then the next step, once we're ready to add the new base to that sequence of DNA, we'll de-block the C and allow new molecules to attach to the sequence.
13:45 KS: So, the way to visualize it is that the sequence of DNA is sort of like grass growing from the ground up, base by base. Now, obviously, we don't do one sequence at a time. We can actually leverage massive parallelism to grow multiple sequences at a time. So, what you're seeing here is what's called array synthesis, a method to grow molecules in parallel. And in what you're seeing, different colors represent different sequences of DNA, and in addition to that, we get some redundancy by having multiple copies of the same sequence grow together, and this is an artifact of the way the process works, so it's actually free redundancy.
14:33 KS: So once the molecules are made, they're removed from the substrate where they were grown and encapsulated, and so here's one example of encapsulation. So, this is silica nanoparticles, the DNA is attached to these particles, and then another layer of silica is grown on top that protects the DNA from the environment. And then you can organize the DNA into what we're calling here a DNA library, which is very much the analog in the DNA world to a tape library, which is what I'm showing here on the bottom. So it's essentially a way to organize different pools of molecules in a spatial manner so that when you need to reference your data, you can go back to that same location and recover the molecules to read them.
15:28 KS: All right, so let's see how we read information from DNA. So what I'm showing here is an instrument. There's a type of sequencing of DNA called sequencing by synthesis, and the idea is that, if you remember the double helix that I was talking about earlier, we're essentially reconstituting the second side of the double helix and observing it as it grows, as it's reconstituted, and so just to illustrate that, here is a . . .
We're trying to detect an A, and we flow in a T that's attached to a green fluorophore that's essentially a molecule that glows green when excited at the right frequency, and so because we can see the color green here, we can tell that on the other side of the double helix there's an A, and using chemical tricks, we can remove that fluorescent molecule and then continue growing the molecule essentially base by base so that now this time we have a C, and so the complementary base is a G and perhaps it's attached to a yellow fluorophore, and so we can tell that this is a G being attached and therefore there must be a C on the other side, which is what we're reading.
16:49 KS: So, it's all indirect, it uses lots of computer vision, and what we get in the end is a bunch of noisy representations of the molecules. And so, I'm illustrating errors here with bold letters. And what's interesting about DNA is that the errors are not just substitutions, which is the equivalent of bit flips, but we also have base insertions, where a base that we were not expecting there gets inserted, or deletions, where a base is actually missing from the sequence, and this is not unlike networking and wireless transmission, for example. Okay, here's another way to read DNA, nanopores.
17:28 KS: So nanopore devices essentially have a bunch of these pores, they're nanoscale pores, the DNA is dragged through via electrical force, and as the DNA goes through, it causes disturbances in the current through the device. And so, by looking at these disturbances in the current, we can actually tell which of the four bases is passing by. So, this device also generates noisy representations of the molecules and even noisier than the previous device, because it's an emerging technology. It's already commercial, but it's noisier than the previous one. But it can offer other properties that are more desirable, like for example, reading in real time instead of a batch process.
18:19 KS: Okay, so we get a bunch of noisy representations of the molecules we had, and now we want to recover the exact bits that we stored in the beginning. So, the way we do that is through the decoding process, as many of you know. So, the first step is essentially to sort out the molecules and group them in molecules that are likely to have come from the same place in the original file.
Once that is sorted, we can do majority voting for each of these groups to infer what's the likely sequence that was stored, and then use . . . Convert to bits and use regular error correction techniques, like Reed-Solomon, which, again, the audience should be pretty familiar with, to decode and recover the original bits without any bit errors. So then once we have those bits, we can organize them again into the sequence with which we started and recover the file that we wanted to store.
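The majority-voting step can be sketched like this. It is a simplified illustration: it assumes the reads in a group have already been clustered and aligned to the same length, so it only handles substitutions. Real pipelines also have to cope with insertions and deletions, and the Reed-Solomon stage mentioned in the talk is not shown. The function name `consensus` is hypothetical.

```python
from collections import Counter


def consensus(reads: list[str]) -> str:
    """Per-position majority vote over equal-length reads of one molecule."""
    if not reads:
        raise ValueError("need at least one read")
    length = len(reads[0])
    if any(len(read) != length for read in reads):
        raise ValueError("this sketch assumes aligned, equal-length reads")
    # zip(*reads) yields one column of bases per position in the strand.
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*reads))
```

For example, `consensus(["ACGT", "ACGA", "TCGT"])` returns `"ACGT"`, because each error appears in only one of the three reads and is outvoted.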
19:30 KS: Alright, so I just walked you through this DNA data storage system. Our ambition as researchers is to develop this into something that looks like this, which we can place in our data centers and then wrap an archival service around and offer it to our customers. Now, if we can use, if or when we get to this point and we can store information in DNA, it really begs the question, "What else can we do with DNA?" And it turns out, I'm not going to go into a lot of detail here, but it turns out that there are certain kinds of computation that are well suited for DNA. We use that property that I was talking about, that A binds to T and C binds to G in the following way.
So, let's say we have a database of images, which I'm showing here. We do feature extraction over those images just in the traditional way, extract those features, but then use an additional step that maps these feature vectors into sequences of DNA, such that if the images are similar meaning, the feature vectors are not so distant, then they map to sequences of DNA that are not too distant.
20:45 KS: So that if we want to query within that database, we could again take another image, an image that presumably would not be in the database and find similar images, so we do the same thing. We do a feature extraction of that image, we create a sequence of DNA. Now, we want these sequences to bind, so we're going to complement that. So wherever there was an A we put a T, wherever there was a C we put a G so that when that sequence goes into the database, it will bind to similar images or similar items.
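The complementing step for a query can be sketched as follows. This follows the talk's simplified picture, pairing bases position by position and ignoring strand orientation and the actual chemistry of binding. The `paired_positions` score is a hypothetical stand-in for how strongly a probe would bind, not a physical model.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Wherever there was an A put a T, wherever there was a C put a G, and vice versa."""
    return "".join(COMPLEMENT[base] for base in strand)


def paired_positions(probe: str, target: str) -> int:
    """Count positions where the probe base pairs with the target base.

    A crude, assumed proxy for binding strength between two strands.
    """
    return sum(COMPLEMENT[p] == t for p, t in zip(probe, target))
```

A probe built from the query's own sequence pairs at every position with an identical stored sequence and at fewer positions with a more distant one, which is the property the similarity search relies on.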
21:25 KS: Now, we also need a way to fish it out, so here what I'm showing is a magnetic bead that's attached to the DNA. We actually have the technology to attach a magnetic bead to DNA. And then when we place that into a database so that molecule essentially goes into the database and binds to images that are similar to the query, then we can pull it out mechanically by using a magnet. So those molecules are pulled out and then re-read out, and so we have items from the database that are similar to the query.
22:03 KS: When we started this project, the DNA synthesis and DNA sequencing were processes that were automated, but not much else was automated, and we wanted to make sure that whatever automation we would apply to it would be low cost. So guided by that principle, we essentially built this prototype here, it's a contraption, it's really a prototype, it's not intended for large-scale DNA data storage, but what we wanted to do is to learn and to demonstrate that it was possible to create a DNA data storage system from scratch that would fully automate the process. And this is what we did here.
So, what you're seeing here is a bunch of bottles, that's the synthesis part, so that's the part that makes the DNA, and in fact, those four little bottles you see in the back are the four different bases. Naturally, there's other reagents here that are necessary to make the DNA, but the DNA is essentially made, it's stored sort of in the middle here, and then there's sequencing, which is the reading process, on the right-hand side. And what you're seeing there is one of those nanopore devices that I talked about.
23:15 KS: So, we essentially integrate everything. The goal was not large-scale DNA data storage. The goal was to show that automation is possible, and so we used this contraption to write the word "Hello" into one sequence of DNA. And we learned quite a lot from building a fluidic system that can implement DNA data storage. In addition to that, what you're seeing here is the storage, and what we're doing here is storing one sequence of DNA, but we wanted to be able to manipulate more sequences of DNA into a library. So that library that I showed you before, that's not what that test tube is, but we wanted to implement a library, and so we replaced this piece with a piece that's called digital microfluidics that I wanted to talk about in this crowd because it's essentially built out of a PCB and a few other components.
24:13 KS: So, what this is is a PCB with these pads that you're seeing there on the left. And you can essentially sandwich droplets of fluids between that PCB and a sheet of glass, which is what you see at the top. And then you can control voltages between that sheet of glass and the different pads that you're seeing to control via something called electrowetting, you can control how these droplets behave and you can change their shape and in the process move them around the board.
So, what I'm showing here in the middle is essentially a board in action. So this is real time, moving the droplet around, and in addition, we can use computer vision to track where the droplets are going so that we can move them around and use them as primitives for the protocols required to automate manipulating the DNA, preparing it for reading, for example, or drawing it from the library itself, so that we can organize the DNA into a library, organize it spatially and recover it with this device.
25:30 KS: So that's that. So, I just wanted to conclude by saying, the future may include molecules. We may have, in the future, hybrid molecular electronic systems. I'm not saying that molecular systems are going to completely replace electronics, each of them has its own place, but they can be combined to solve problems in the information technology industry. So, I just want to conclude by touching on these future systems. So just like today we have CPUs that are accelerated by GPUs and FPGAs, the implication of what I was talking about in a previous slide is that we may use biomolecules as accelerators, and we may also have a future with quantum computing and using physics to solve some of our problems.
26:23 KS: And I'd like to point out that this is obviously not just my work, this is the result of hard work by a very talented group of researchers, very diverse, all the way from computer science to coding theory, to computer architecture and systems, to mechanical and electrical engineering, to molecular biology and biochemistry. So great working with everyone, it's been very productive, and if you're interested in getting in touch with us or reading more about our work, I have added pointers here to this slide, so please feel free to explore.
27:02 KS: I thought I'd do a little bit of questions and answers. A question that I get very often is, "Is the throughput and cost of this technology already there for data storage applications?" And it still has a ways to go. It needs to improve. As you know with flash, it requires research, it requires even more engineering to get it to a point where it can be deployed at the large scale. It can be deployed at the small scale today, but to get to the larger scale, it will require some more research, a lot more engineering. This is normal for every storage technology, but the question we asked ourselves when we started the program is whether it can scale to these larger scales that I was talking about, and it turns out, we looked at fundamental limits, and we couldn't find any reason why it couldn't scale to that point.
27:53 KS: So we're very encouraged by that. And another question that's relevant to this crowd is, "Do you expect it to replace flash?" And not any time soon. Flash . . . Latency characteristics, latency is quite a lot better. We're looking at DNA data storage at least as the first stop into the archival storage area, so colder storage. The technology to read the DNA is improving and so latency may get better and we might see it in warmer layers, but really the first stop is on archival storage, and there's other applications that we're looking into, so for example, using DNA as molecular tags, so for example, molecular QR codes, we've been studying that. We've been studying search, as I pointed out in a previous slide, so there are other applications that seem quite interesting and that are being explored right now. I'd love to hear about folks working on more sustainable flash though.
29:01 KS: So, if you have any developments that you'd like to share, I'd love to hear about them. So, making flash as sustainable a storage technology as DNA would be amazing. I'd love to hear from you on that. Finally, I'd like to conclude by making an announcement. We are creating the DNA Data Storage Alliance along with Illumina, Twist Bioscience and Western Digital, and the goal of the alliance is to educate about the technology and also to look into and develop use cases for the technology to get to a roadmap of eventual commercial deployment. So, with that, if you're interested, please get in touch and we'll follow up on that. So, thank you so much for watching the talk today and hope to hear from you soon. Thank you.