Annual Update on Enterprise Storage
This presentation takes a look at the latest developments in enterprise storage.
00:00 Howard Marks: Hello, and welcome to the Flash Memory Summit. This is our annual update on flash in enterprise storage, session U.1. This year, we've subtitled it "Flash, It's Not Just for Tier 1 Anymore, or Flash Is the New Normal."
I'm sure you're wondering who this guy mumbling on is and so I am your not-so-humble speaker, Howard Marks. I spent about 40 years as an independent consultant and journalist and analyst, basically looking at products, writing about them and telling vendors how they did things wrong. I developed the kind of, "Show that to Howard, he hates everything, he'll find what's wrong with it," attitude. So, recently, that meant I was Chief Scientist at DeepStorage, which was my analyst firm, and the co-host of the "GreyBeards on Storage" podcast.
A little less than two years ago, I turned to the dark side and took a job as technologist extraordinary and plenipotentiary at Vast Data -- an all-flash system vendor -- but I am not here representing Vast Data. I'm here wearing my wizard hat, which feels comfortable when I get to put it on once or twice a year.
01:16 HM: So, you can reach me via email at [email protected] or follow me on the Twitters at @deepstoragenet. I've been doing this session for eight or nine years, so we can look at the 12 or so years that storage has been using flash. Before about 2008, flash was a very specialized device, a rackmount SSD, something you bought from Texas Memory Systems, now part of IBM, or from Violin.
In 2008, EMC started putting SSDs in their high-end disk arrays, but these were very high-cost devices and customers had endurance fears that turned out not to be terribly well-founded, and as a result, hybrid arrays emerged. From about 2010 until 2018 or 2019, we simply had further adoption of flash as the systems adapted to flash better and as the cost of flash came down to make it affordable to wider sets of users and wider sets of data sets. And these things are continuing, but we started in the 2000s and barely into the 2010s with flash as the seasoning, not the main meal, and then moved to where flash became a larger and more important portion of people's primary storage for OLTP applications and line-of-business applications, and those applications required very high performance, of course, but for a limited amount of data.
03:03 HM: And they did not need a lot of data management, so this was an all-flash array that used block storage. Lately, the story is changing yet again as the cost of flash comes down, based on changes in the market we'll talk about today, and the applications flash can be used for are expanding into more data-intensive areas like machine and deep learning.
So, today, we're going to give a quick update on the market; the state of the SSDs available to enterprise users; the state of NVMe over Fabrics as the new lingua franca of storage; a little review of where storage class memory is; how the adoption of artificial intelligence in the enterprise is demanding changes from storage, because those workloads are very different from the OLTP workloads that we primarily worried ourselves about; and how QLC flash is driving the cost of flash down and therefore expanding the applications of flash beyond Tier 1, those primary applications. And then we'll look at a couple of illustrative examples of storage systems that follow these trends or exemplify them.
04:25 HM: So, my first data point is that for approximately six quarters now, Pure Storage has been tied at least for fifth in IDC's market tracking, so an all-flash player is now one of what we would call the major players in the storage business.
Second data point: all-flash sales now exceed the sales of hybrid systems. For primary storage systems in general, in the first quarter of 2020 IDC had the all-flash market at $2.8 billion versus just $2.5 billion for hybrids. The markets have merged and flash has just emerged as primary storage, and the coverage of the market reflects that, so Gartner has merged their previous general-purpose storage and solid-state storage Magic Quadrants into a new Magic Quadrant for primary storage. And a new class of flash arrays has begun to appear: while the pure all-flash players like Pure and SolidFire used data reduction to reduce the cost of their systems, they were primarily performance plays, not cost plays. Using QLC flash, new systems are revolutionizing the economics and driving the cost of flash down to where, if your data reduces at all, you will be able to use an all-flash system for a cost less than that of a hybrid or possibly even an all-disk system.
06:01 HM: One of the things I've done every year is review and predict what I think is going to happen with flash prices. My predictions have been notoriously bad, but last year, the price difference between enterprise SSDs and enterprise hard drives was about 3.5-to-1, using prices at Newegg as an analog for actual costs. And Western Digital exited the 10,000 and 15,000 rpm hard drive business, so now when we say hard drive, the fastest hard drives are 7,200 rpm, and that makes SSDs more attractive, because as the size of hard drives grows but the performance of hard drives doesn't, the number of drives used to build a storage system of any given size goes down.
So, while we've made a transition in the past four or five years from 4 terabyte hard drives to 16 terabyte hard drives, the number of IOPS or the amount of bandwidth that 1 petabyte of storage using those hard drives can deliver has gone down by a factor of four as the capacity has gone up by a factor of four.
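To put rough numbers on that, here's a back-of-the-envelope sketch in Python; the ~200 random IOPS per 7,200 rpm drive is an assumed round figure, not a quoted spec.

```python
# Illustrative only: a 7,200 rpm drive delivers on the order of ~200 random
# IOPS no matter how big it is, so IOPS per petabyte falls as drives grow.
PETABYTE_TB = 1000
IOPS_PER_DRIVE = 200  # assumed, capacity-independent

for drive_tb in (4, 16):
    drives_per_pb = PETABYTE_TB / drive_tb
    iops_per_pb = drives_per_pb * IOPS_PER_DRIVE
    print(f"{drive_tb:>2} TB drives: ~{drives_per_pb:.0f} drives/PB, "
          f"~{iops_per_pb:,.0f} random IOPS/PB")

# 4 TB drives deliver roughly 4x the IOPS per petabyte of 16 TB drives.
```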
07:18 HM: This year, prices were basically stable. I've made some substitutions on the slide about particular products, but that 3.5 number is still about true. When you factor in that data reduction techniques like deduplication don't work well on spinning disk -- because they turn sequential I/Os into random I/Os -- that means that if data reduces at all, the costs are going to be competitive. In terms of making predictions, it was easy from 2008 to about 2015: flash got cheaper by about 30% every year. In 2016 and 2017, there was a supply shortage. Last year, I expected costs to go down and return to that 30% price reduction a year, and I was wrong -- prices remained about flat. Next year, I expect that 30% drop, but I am a cockeyed optimist, so you'll have to take that with a grain of salt.
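To see why that 3.5-to-1 price gap closes with even modest data reduction, here's a small sketch; the dollar figures are placeholders, not quotes, and only the 3.5 ratio comes from the talk.

```python
# Placeholder prices: only the ~3.5x SSD-to-HDD ratio is taken from the talk.
hdd_cost_per_tb = 20.0                   # assumed nearline disk price, $/raw TB
ssd_cost_per_tb = hdd_cost_per_tb * 3.5  # enterprise SSD at ~3.5x that price

for reduction in (1.0, 2.0, 3.5, 5.0):
    effective = ssd_cost_per_tb / reduction  # $ per TB of data actually stored
    print(f"{reduction:.1f}:1 reduction -> ${effective:5.2f}/effective TB "
          f"vs ${hdd_cost_per_tb:.2f}/TB on disk")

# At about 3.5:1 the all-flash system reaches price parity with raw disk,
# before counting that dedupe works poorly on spinning media in the first place.
```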
08:24 HM: In terms of what those SSDs look like that we're using in the data center, mainline data center use has switched to NVMe. NVMe SSD volume has now exceeded SAS and SATA SSD volume, and the mainline SSDs from major manufacturers are now larger than the largest hard drives. Samsung is now shipping or about to ship a 30 terabyte SSD; the largest hard drives from Western Digital or Seagate are about 18 terabytes. Nimbus is an outlier -- they ship 3.5-inch SSDs with up to 100 terabytes of capacity, but those are not terribly applicable to most enterprise uses because they have a SATA interface and they bottleneck at that SATA interface very quickly. There's much greater differentiation between SSDs targeted at enterprise uses than there has been in the past, and the performance, endurance and cost factors between them can now vary by five times or more.
09:33 HM: And that range starts with storage class memory, things like Intel's Optane or Micron's X100, which we'll talk about in a minute, and runs through the kind of drives that enterprise storage systems require, which have a dual-port SAS or dual-port NVMe controller so that the SSD can talk directly to two controllers in the storage system. Those drives typically also have a substantial amount of DRAM cache, and because they have a DRAM cache, they need super- or ultra-capacitors and power failure detection circuitry to protect the data. While the low-end SSDs for enterprise use are three and a half times or so the cost of a hard drive, these dual-port enterprise drives are about 10 times that cost.
And then there are the drives really intended for use in servers, which have a single port but may also have that DRAM buffer to manage both endurance and performance. And then there's a low-cost, very simple class of drives that is primarily used by hyperscalers -- they have no DRAM buffer, they don't run part of their flash as SLC to provide a write buffer; they're very simple and just provide higher performance than hard drives for the very large amount of data that's on a hyperscaler's long tail.
11:10 HM: And if NVMe is the new standard for SSDs, PCIe is the physical interface through which NVMe connects, and we are in the middle of the transition from PCIe 3.0 to PCIe 4.0. PCIe 3.0, frankly, has gotten long in the tooth; it's been 10 years since the spec came out. And because one PCIe slot with 16 lanes -- basically the largest slot most servers support -- is about 100 gigabits per second, you can only have one 100 gigabit per second port per slot. With PCIe 4.0, bandwidth doubles to about 200 gigabits per second per slot, and that means your SSDs can run faster. The PCIe link is now the limiting factor for some SSDs, especially for U.2 SSDs that only have four lanes.
Today, AMD's EPYC processors are shipping with PCIe 4.0 support; Intel has delayed PCIe 4.0 support until the Ice Lake Xeon generation, which is now sampling. About two years ago, when I first made this slide, the PCI-SIG was promising that PCIe 5.0 would be out by 2020. PCIe 5.0 represents another doubling of performance. Since they've already missed that target, I believe that PCIe 5.0 goes in the pile with technologies like heat-assisted magnetic recording for disks that will show up whenever they show up, but I don't believe predictions anymore.
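For reference, here's the bandwidth arithmetic behind those statements, using approximate raw per-lane signaling rates; usable throughput is somewhat lower once encoding and protocol overhead are taken out.

```python
# Approximate raw signaling rate per lane, in Gbit/s, for each PCIe generation.
LANE_GBPS = {"PCIe 3.0": 8, "PCIe 4.0": 16, "PCIe 5.0": 32}

for gen, per_lane in LANE_GBPS.items():
    for lanes in (4, 16):   # x4 covers U.2 drives, x16 the largest server slot
        print(f"{gen} x{lanes:>2}: ~{per_lane * lanes:>3} Gbit/s raw")

# A PCIe 3.0 x16 slot lands near one 100 GbE port per slot; PCIe 4.0 doubles
# that, and a x4 U.2 drive likewise doubles from roughly 32 to 64 Gbit/s.
```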
12:51 HM: Over the past few years, we've had some controversy about the form factors for SSDs. The 2.5-inch form factor, stolen from hard drives, made a lot of sense when we were talking about SAS and SATA SSDs that would fit in the same chassis. And U.2, the NVMe-enabled version of that form factor, made sense as a transitional point, but U.2 and that form factor have several problems, the main one being that the ratio of volume to surface area is just wrong. There's too much volume and not enough surface area -- you can fit three PC boards into that container, but the middle PC board has no airflow and no way to dissipate its heat. And so, we've adopted several new form factors. M.2 started off in laptops; it's now used regularly as a server boot device instead of SD cards, which were much less reliable. NF1 -- new form factor one -- lost to the EDSFF form factor, which comes in two lengths: short, about 3 inches, and long, about 9 inches. These form factors, because they're flat with an 8 millimeter Z height, allow better airflow, and for high-performance applications the long form factor supports up to eight PCIe 4.0 lanes, quadrupling the bandwidth of U.2 today.
14:45 HM: With the 8-millimeter-thick SSDs, you can fit 32 SSDs in a 1U server. And for higher power requirements, whether it's higher-performance SSDs or things like GPUs, the short form factor supports thicker versions with heat sinks, as you can see on this sample from Kioxia on the slide.
In terms of the market, the biggest news has been that Intel is exiting the flash market. This follows a pattern for Intel, where they started off as a memory supplier and have left memory markets as they've become commoditized and the margin gets beaten out of them. Intel wants to remain in the very high-margin processor and Optane businesses, and the Optane business and processor business are synergistic. So what this means is that Intel is selling their flash division to SK Hynix -- that includes their SSD business and their fab in Dalian, China -- for about $9 billion. But for the next five years, Intel is going to operate the Dalian fab and deliver the wafers to SK Hynix, so the intellectual property transfer is delayed; this is a five-year transaction. This makes SK Hynix the No. 2 vendor, at about 23% market share versus 31.4% for Samsung. NVMe over Fabrics extends the NVMe command semantics over a network, whether that's an RDMA network -- Ethernet or InfiniBand -- Fibre Channel or even TCP.
16:32 HM: And the Fabrics part adds things like namespaces, discovery and enclosure management to NVMe. So, NVMe over Fabrics is to NVMe the equivalent of what iSCSI or Fibre Channel is to SCSI -- that is, a transport. But the difference is that it's a transport that only adds 10 to 50 microseconds for both the protocol and traversing the network. So what NVMe over Fabrics does is allow us to use Ethernet or InfiniBand as an interconnect where we previously would have used SAS, and we can continue to use Ethernet or InfiniBand or Fibre Channel as the server-to-storage access network. But now, we can also use it as the controller-to-media connection, and that means instead of having to have dual-ported drives owned by a pair of controllers, we can make the drives and media shared across a larger number of controllers for resiliency, load balancing or any other reason.
17:42 HM: We're now seeing some of that from some innovative companies like the one I work for. And in the enterprise, what we're seeing is that NVMe over Fabrics has replaced both the shelf-to-controller connection and the host-to-storage connection in arrays like Pure's FlashArray//X and NetApp's AFF series, so users can take advantage of that lower latency. We've seen some use of RDMA networks to NVMe JBOFs, with a parallel file system, in the HPC and skunkworks kinds of communities. And hyperscalers are now investigating using TCP to allow that sharing, not between dedicated controllers in a storage system, but between a large number of the servers that process requests for drunken frat boy photos or other data that they may be storing in large data sets.
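For a sense of how little the host side of NVMe over Fabrics asks of you, here's a minimal sketch that shells out to nvme-cli from Python over NVMe/TCP; the address, port and subsystem NQN below are hypothetical placeholders.

```python
# A minimal sketch, not a production script: discover and connect to an
# NVMe/TCP target using nvme-cli (which must be installed and run as root).
import subprocess

TARGET_ADDR = "192.0.2.10"                        # hypothetical target address
TARGET_PORT = "4420"                              # conventional NVMe-oF port
SUBSYS_NQN = "nqn.2020-08.example.com:qlc-pool"   # hypothetical subsystem NQN

# Ask the target's discovery controller which subsystems it exports.
subprocess.run(["nvme", "discover", "-t", "tcp",
                "-a", TARGET_ADDR, "-s", TARGET_PORT], check=True)

# Connect to one subsystem; its namespaces then show up as local /dev/nvmeXnY
# block devices, reachable over the ordinary Ethernet network.
subprocess.run(["nvme", "connect", "-t", "tcp", "-n", SUBSYS_NQN,
                "-a", TARGET_ADDR, "-s", TARGET_PORT], check=True)
```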
18:48 HM: The biggest change over the past couple of years, in fact mainly over the past year, has been the rapid adoption and expansion of what have become known as data processing units, or DPUs -- what we called smart NICs two or three years ago. They combine an ASIC for acceleration of things like encryption, NVMe over Fabrics, or the Reed-Solomon and Galois field arithmetic for erasure coding with a programmable Arm or other RISC processor and one or more 100-gig Ethernet ports. All of these allow the offload of everything from networking and firewalls to NVMe over Fabrics processing. The other big advantage of a DPU is that it can be the PCIe root and main processor in a JBOF, where JBOFs today are typically made from x86 servers to get all the PCIe lanes they need.
With something like a Mellanox BlueField, you could have the BlueField provide the Ethernet or InfiniBand ports and do the NVMe over Fabrics routing to SSDs connected to the card via PCIe switches, without the need for the x86 server, which brings down both the cost and the latency of the solution.
20:17 HM: So, storage class memory in the form of 3D XPoint has been shipping from Intel as Optane SSDs for four or five years now, and they've frankly not been a huge success, because unless you have an application that requires their particular mix of capacity and performance, they don't offer a cost-performance advantage over standard SSDs. And most users have simply decided that rather than spending five times as much per gigabyte on Optane, they would buy five times as much capacity in standard SSDs. Micron last year started shipping their X100 and finally joined Intel, but that's an add-in-card-only form factor -- they're not shipping U.2 -- so it's not really enterprise-targeted; I think it's aimed more at HPC and hyperscalers. We're expecting a second generation of 3D XPoint within a year. And 3D XPoint is now an Intel product exclusively.
21:23 HM: Over the years at FMS, I've seen many other memory types that want to be persistent, faster than flash and cheaper than DRAM. The only other one that's gotten any traction at all is Everspin's spin-transfer torque MRAM, which a few vendors have used as a replacement for the DRAM cache on SSDs so that they don't have to build in power protection circuitry. The other interesting part of the high end for SSDs has been the return of SLC, although the vendors who use it don't call it that. Samsung's Z-NAND and Kioxia's XL-Flash are really SLC that's been further optimized, with many planes per chip so that they can have greater parallelism, and small pages so that you don't have problems with having to coalesce writes. And that gives you much better read latency, but it still has the read/write asymmetry that flash has. So, while both vendors are making flash SSDs using this accelerated SLC, we're not seeing it in the DIMM form factor. In fact, flash DIMMs seem passé. We saw them a few years ago, and the development of Optane DC seems to have killed that market off.
22:56 HM: What we are seeing is that new applications of various types of analytics, and especially artificial intelligence, are moving their way into enterprise data centers out of what I call HPC Land, but these workloads present very different characteristics to a storage system. The data sets are much larger. Things like training a self-driving car or facial recognition model require that those systems train on petabytes of files -- and they're files, not databases that you can store on block storage -- so you need some kind of file system overlay to manage those files. We get much larger I/Os, and therefore demand for much more bandwidth.
So, the rack you see on the right has three Nvidia DGX A100 GPU servers. Each one of those servers can consume over 150 gigabytes per second of bandwidth, and those I/Os are almost completely random, because they're reading the files -- each with one photograph -- in some order determined by their code, and that becomes random I/O across the system. And that means not only that this AI solution needs all-flash to keep that expensive GPU server fed -- and Nvidia doesn't like it when I call them expensive -- but it also means that we need to optimize all the pipes between the storage and the GPU servers.
24:38 HM: If you're working with small amounts of data that you're going to crunch extensively, you can load the data into the NVMe SSDs in the GPU server. But to really process a lot of data, Nvidia developed a new storage architecture called GPUDirect Storage. GPUDirect Storage is an Nvidia extension to Linux, and its drivers perform RDMA directly into the memory of the GPU. That bypasses the CPU and main DRAM, provides about twice the bandwidth of the CPU path, and results in much lower CPU utilization.
So you can do this with NVMe SSDs, you can do it with a file system driver, and with systems that support NFS over RDMA and extensions like nconnect and multipath, you can do it with multiple connections via NFS to get that 150 gigabytes per second or more. So, if you have a large amount of data that's going to be read randomly, QLC is probably a good place to put it. That archive data -- all of those facial recognition photos -- doesn't change, and so it doesn't have to be overwritten. That reduces how much endurance you use, but endurance is still a problem for QLC drives. On a simple QLC drive, you get about 20 times the endurance if you write in a way that aligns with the 128K page size of the internal flash, and so you can coalesce writes, and many systems do. The question is, "Where do you do that?"
26:25 HM: Do you use a DRAM buffer in the controller? Coalescing to 128K in NVRAM -- which you can only buy in about a 32 gig stick today -- becomes difficult, so you could use a local 3D XPoint buffer, a shared 3D XPoint buffer, or DRAM in the SSDs, but all of those have costs. You could throttle performance at the controllers, and so we'll see that some vendors simply don't provide very much performance in their QLC models compared to their TLC models. And then you have to minimize write amplification. You have to say, "I'm going to minimize how many writes I create on these SSDs by not doing things like post-process data reduction or excessive garbage collection or balancing the load when it doesn't need to be balanced." All of those move data from one place to another, and that means writing that data one more time and consuming one more of the write/erase cycles that QLC can manage. So, using QLC has now become relatively mainstream.
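As a toy illustration of that write-coalescing idea -- a sketch of the concept, not any vendor's implementation -- the buffer here is assumed to sit in front of QLC with a 128 KiB internal write unit.

```python
# Conceptual sketch: gather small host writes into full, aligned 128 KiB
# chunks before programming the QLC media, so each cell is written once.
PAGE = 128 * 1024  # assumed internal flash write unit

class WriteCoalescer:
    def __init__(self, program_fn):
        self.buf = bytearray()
        self.program_fn = program_fn   # called with each full 128 KiB chunk

    def write(self, data: bytes):
        self.buf.extend(data)
        while len(self.buf) >= PAGE:           # flush only whole pages
            self.program_fn(bytes(self.buf[:PAGE]))
            del self.buf[:PAGE]

coalescer = WriteCoalescer(lambda chunk: print(f"program {len(chunk)} bytes"))
for _ in range(100):                  # a hundred 4 KiB host writes...
    coalescer.write(b"x" * 4096)      # ...become three 128 KiB programs,
                                      # with the remainder still buffered
```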
27:40 HM: Several vendors have announced those products over the past 18 months or so. NetApp, about a month ago, announced the FAS 500F, and notably, this is a FAS, not an AFF. FAS is the brand name they use for their hybrid systems; all of their all-flash systems, which are now all-NVMe, are called AFF. That means it's got a smaller CPU, lower performance and somewhat less effective data reduction. The base system comes with 24 15-terabyte QLC SSDs and can be expanded once, so now customers can get the full set of NetApp functionality at a price in line with using QLC flash.
The other major player recently has been Pure, who announced their FlashArray//C -- it's based on the FlashArray//X, which you see here -- but rather than the 40 gigabyte per second shelf-to-controller connection, they've upgraded that to 50 gigabytes per second. It uses 24 or 49 terabyte modules that Pure builds themselves, which they call NVMe DirectFlash modules, and because I see these capacitors here in the photo they've provided, it appears that they are at least in part using DRAM on the SSD to coalesce writes and reduce the wear on the QLC flash.
29:17 HM: Finally, Vast Data, who I work for, has created a system that disaggregates the compute -- the controller function -- into a series of Docker containers and places all of the media in highly available JBOFs, which allows us to share 3D XPoint and QLC SSDs across all of those controllers. We use a global FTL to bring the cost down.
I'd like to thank you all very much for coming to this session, and I hope that you'll have a wonderful day.