Guest Post

Optimizing NVMe-oF Storage With EBOFs & Open Source Software

Data-driven apps run faster on multi-server systems when the available flash storage is networked to maximize usage, but one problem is how to network standard NVMe drives at fair costs while achieving high performance. NVMe over Fabrics can help.


00:00 Sujit Somandepalli: Hello, and welcome to Flash Memory Summit 2020. In this session, we will talk about extending your storage capabilities with NVMe over Fabrics. My name is Sujit Somandepalli, and I'm a principal storage solutions engineer at Micron, where I focus on performance-tuning SSDs for various NoSQL databases, including MongoDB, Cassandra and RocksDB, among other applications.

As part of this presentation, I will briefly go over NVMe over Fabrics, its benefits and the different ways of connecting flash to your servers. We will look at the Foxconn-Ingrasys EBOF with the Marvell 88SN2400 converter controller. I'll also provide a brief overview of Micron's heterogeneous-memory storage engine, and finally, conclude with some test results from running MongoDB against YCSB.

01:00 SS: What is NVMe over Fabrics? NVMe over Fabrics extends the NVMe protocol to the network, increasing the reach well beyond the server chassis that confines typical SSDs today. NVMe over Fabrics was first standardized in 2016. And since it's based on NVMe, it inherits all the benefits, like being a lightweight protocol with an efficient command set. It is multicore-aware and can take advantage of protocol parallelism. The figure on the right compares the NVMe and NVMe over Fabrics models and highlights the various transport options that are available to the user. There are two relevant Ethernet transport options: RDMA over Converged Ethernet, or RoCE v2, and TCP. Both of them have their advantages and disadvantages. RoCE v2 has lower latency but requires specialized RDMA-capable NICs. TCP transport has higher latency and higher CPU utilization but can use standard NICs on the system. Currently, RoCE v2 is more prevalent in the market.
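To make the two transport options concrete, here is a minimal sketch of how a host might form an nvme-cli `connect` command for each one. It is not from the talk: the target address, the NQN and the helper name `nvme_connect_args` are hypothetical, and the function only builds the command line rather than running it. On the host side, RoCE v2 uses the `rdma` transport type.

```python
from typing import List

# Hypothetical values for illustration only; substitute your own target.
TARGET_ADDR = "192.0.2.10"
TARGET_NQN = "nqn.2020-11.example:ebof-slot-01"


def nvme_connect_args(transport: str, traddr: str, nqn: str,
                      trsvcid: str = "4420") -> List[str]:
    """Build (but do not execute) an nvme-cli `connect` command line.

    transport: "rdma" for RoCE v2, or "tcp" for NVMe/TCP.
    trsvcid:   transport service ID; 4420 is the registered NVMe-oF port.
    """
    if transport not in ("rdma", "tcp"):
        raise ValueError(f"unsupported transport: {transport}")
    return [
        "nvme", "connect",
        "--transport", transport,
        "--traddr", traddr,
        "--trsvcid", trsvcid,
        "--nqn", nqn,
    ]


# The two commands differ only in the transport type.
roce_cmd = nvme_connect_args("rdma", TARGET_ADDR, TARGET_NQN)
tcp_cmd = nvme_connect_args("tcp", TARGET_ADDR, TARGET_NQN)
print(" ".join(roce_cmd))
print(" ".join(tcp_cmd))
```

The point of the sketch is that the choice of transport is a host-side connection parameter; the NIC requirements and latency trade-offs described above sit underneath it.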

02:12 SS: With NVMe, your options to scale are limited to the server chassis or to the rack. This limits the scope and reach of your NVMe storage to the rack itself. NVMe over Fabrics allows a virtually unlimited amount of storage to be connected across the entire data center. This allows for the creation of smart solutions where you can fully disaggregate storage and compute. It enables orchestration at scale by carving out storage as you provision solutions. It removes the limitation of drive slots in a system, enabling you to grow your storage footprint as you need.

There are many different ways of connecting flash to your server. In this session, we'll talk about the ways of doing so using the fabric. Just a bunch of flash, or JBOF, allows you to scale storage across a data center using PCIe switches to fan out your SSDs. An EBOF, or Ethernet bunch of flash, allows you to fan out your SSDs using Ethernet switching. EBOFs typically have lower costs because Ethernet switch networks are simpler and cheaper than the complicated PCIe networks, which require specialized switches, retimers and other hardware.

03:43 SS: In the case of a JBOF, the NVMe-to-NVMe-over-Fabrics conversion takes place at the shelf level using one or more data processing units, whereas in an EBOF, the bridging is usually done at the SSD controller itself, the SSD carrier or the enclosure. The Marvell 88SN2400, which is central to this session, is an NVMe over Fabrics SSD controller for cloud and enterprise data centers. It bridges NVMe and NVMe over Fabrics. It takes a PCIe Gen 3 x4 NVMe interface and presents two 25 Gigabit Ethernet ports. The chipset supports both TCP and RoCE v2. However, for this session, we will focus on using RoCE v2 as the transport.
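As a rough sanity check on the bridge described above, the raw line rates of its two sides can be compared with a few lines of arithmetic. This sketch is not from the talk: it uses the standard PCIe Gen 3 figures (8 GT/s per lane, 128b/130b encoding) and ignores all protocol overhead on both the PCIe and Ethernet sides, so the results are upper bounds rather than expected throughput.

```python
# Raw line-rate comparison for a PCIe Gen 3 x4 to 2 x 25GbE bridge.
# Assumption: protocol overhead (PCIe packets, Ethernet/IP/RoCE headers)
# is ignored, so these are upper bounds only.

PCIE_GEN3_GT_PER_LANE = 8.0      # gigatransfers per second per lane
PCIE_GEN3_ENCODING = 128 / 130   # 128b/130b line encoding efficiency
PCIE_LANES = 4

ETH_PORTS = 2
ETH_GBIT_PER_PORT = 25.0


def pcie_gen3_gbytes_per_s(lanes: int) -> float:
    """Raw PCIe Gen 3 bandwidth in gigabytes per second for `lanes` lanes."""
    return PCIE_GEN3_GT_PER_LANE * PCIE_GEN3_ENCODING * lanes / 8


def ethernet_gbytes_per_s(ports: int, gbit_per_port: float) -> float:
    """Raw Ethernet bandwidth in gigabytes per second across `ports` ports."""
    return ports * gbit_per_port / 8


pcie_side = pcie_gen3_gbytes_per_s(PCIE_LANES)
eth_side = ethernet_gbytes_per_s(ETH_PORTS, ETH_GBIT_PER_PORT)
one_port = ethernet_gbytes_per_s(1, ETH_GBIT_PER_PORT)

print(f"PCIe Gen 3 x4: {pcie_side:.2f} GB/s")
print(f"2 x 25GbE:     {eth_side:.2f} GB/s")
print(f"1 x 25GbE:     {one_port:.3f} GB/s")
```

Under these assumptions, the two Ethernet ports together offer more raw bandwidth than the PCIe Gen 3 x4 link behind them, while a single 25 Gigabit port offers somewhat less.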

The Foxconn-Ingrasys EBOF is an NVMe over Fabrics solution that's built around the Marvell 88SN2400 bridge chipset. It has 24 U.2 slots in a 2U chassis, and each of these U.2 slots has the 88SN2400 bridge chip. The EBOF can provide up to 3.2 terabits per second of redundant Ethernet connectivity, and this 2U shelf can provide 75 gigabytes per second of throughput and up to 20 million IOPS to your application.

Micron's homegrown heterogeneous-memory storage engine, or HSE, is one of the first storage engines designed from the ground up to accelerate Linux workloads using flash-based storage and storage class memory technologies.

05:32 SS: This unique open source software was designed to maximize the capabilities of these new technologies by intelligently and seamlessly managing data across multiple storage classes. The result is significantly improved performance, decreased latency and improved drive endurance through lower write amplification, even when deployed at scale. One of the first implementations of HSE is a MongoDB storage engine that delivers all of these features and more.

The next section of the presentation focuses on results from benchmarks that we ran with MongoDB and YCSB. MongoDB is a popular cross-platform document-oriented NoSQL database. It uses JSON-like documents with optional schemas to store its data. MongoDB has a built-in storage engine called WiredTiger. We tested MongoDB against YCSB with WiredTiger and also with Micron's HSE as the storage engine. A group of engineers from Yahoo first published this framework and set of workloads in 2010 to compare different key-value and cloud-serving databases.

06:53 SS: The workloads that we tested as part of this benchmarking effort are modelled to represent real-world scenarios. Workload A represents user session recording and has 50% reads and 50% updates to the database. Workload B represents metadata tagging for a photo website, for example, and has 95% reads and 5% updates. Workload C denotes the loading of a user profile cache when a user logs into a website and has a 100% read characteristic. Workload D represents user status updates in a social media network, for example, and has 5% inserts and a 95% read-latest characteristic. Workload F is similar to a user activity recording workload and is represented by 50% reads and 50% read-modify-writes. Our test configuration consisted of Dell PowerEdge R7525 servers with dual AMD EPYC 7502 processors and 512 gigabytes of RAM.
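The workload mixes described above can be written down as a small table of operation proportions. The sketch below is illustrative only and is not YCSB code: the names `WORKLOAD_MIX` and `sample_operations` are hypothetical, and it models just the operation ratios, not the key-selection distributions (for example, the "read latest" skew in workload D).

```python
import random
from collections import Counter

# Operation proportions for the workloads discussed in the talk.
WORKLOAD_MIX = {
    "A": {"read": 0.50, "update": 0.50},             # session recording
    "B": {"read": 0.95, "update": 0.05},             # metadata tagging
    "C": {"read": 1.00},                             # profile cache load
    "D": {"read": 0.95, "insert": 0.05},             # status updates
    "F": {"read": 0.50, "read_modify_write": 0.50},  # activity recording
}


def sample_operations(workload: str, count: int, seed: int = 0) -> Counter:
    """Draw `count` operations for a workload according to its mix.

    Returns a Counter mapping operation name to how often it was drawn.
    A fixed seed keeps the draw repeatable between runs.
    """
    mix = WORKLOAD_MIX[workload]
    rng = random.Random(seed)
    ops = rng.choices(list(mix), weights=list(mix.values()), k=count)
    return Counter(ops)


for name in WORKLOAD_MIX:
    print(name, dict(sample_operations(name, 10_000)))
```

Workloads A and F look alike at this level, but a read-modify-write in F is a read followed by a write of the same record, so it costs more per operation than a plain update.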

08:05 SS: We used Ubuntu 20.04 as the operating system on these servers. Mellanox ConnectX-5 100 gigabit networking was used to provide connectivity to the fabric and also to the clients. There were two primary configurations that we used. Local, where the Micron 7300 PRO NVMe drives were directly connected to the server and provided database storage. The second configuration was with the EBOF where we used the Foxconn-Ingrasys EBOF to provide NVMe over Fabrics connectivity to the servers. We had four drives connected to each of the servers, and because of the versatility of the EBOF, we were able to connect multiple servers to perform parallel testing. We tested a single-node MongoDB with the four drives aggregated with LVM storage as the database storage. YCSB is executed from a separate client machine on the same network as the database server.

09:12 SS: This 100-gigabit network is separate from the network that's used for connecting to the fabric. We generated a 2 terabyte data set by loading 2 billion records using YCSB's load flag. And for each test, we cleared all the system caches, restarted the database and restored the data set to get clean results each time. YCSB controls its load by changing the number of threads that it uses to connect to the database server. We tested various thread counts and found a load that gave us the optimum combination of YCSB operations per second and YCSB read latencies. Once this thread count was found (in this case, it was 96), we ran longer tests and repeated them multiple times to get consistent and repeatable results. The long-duration test also makes sure that any background operations that MongoDB has are captured, which represents a more realistic scenario.
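The load and thread-count steps described above might be scripted along the following lines. This is a sketch under stated assumptions, not the configuration used in the talk: the MongoDB URL, the workload file path, the helper name `ycsb_args` and every swept thread count other than 96 are hypothetical, and the function only builds YCSB command lines without running them. It also checks the record size implied by the data set, assuming decimal terabytes.

```python
from typing import List

# Figures from the talk: 2 TB data set, 2 billion records, 96 threads.
DATASET_BYTES = 2 * 10**12   # assumes decimal terabytes
RECORD_COUNT = 2 * 10**9
CHOSEN_THREADS = 96

# Hypothetical values for illustration only.
MONGODB_URL = "mongodb://db-server.example:27017/ycsb"
WORKLOAD_FILE = "workloads/workloada"
SWEEP_THREADS = [16, 32, 64, 96, 128]   # only 96 comes from the talk

# Implied size of each record; this matches YCSB's default record
# layout of 10 fields of 100 bytes each.
bytes_per_record = DATASET_BYTES // RECORD_COUNT


def ycsb_args(phase: str, threads: int) -> List[str]:
    """Build (but do not execute) a YCSB command line for MongoDB.

    phase: "load" to populate the data set, "run" to execute the workload.
    """
    if phase not in ("load", "run"):
        raise ValueError(f"unknown phase: {phase}")
    return [
        "bin/ycsb", phase, "mongodb", "-s",
        "-P", WORKLOAD_FILE,
        "-threads", str(threads),
        "-p", f"recordcount={RECORD_COUNT}",
        "-p", f"mongodb.url={MONGODB_URL}",
    ]


print(f"{bytes_per_record} bytes per record")
print(" ".join(ycsb_args("load", CHOSEN_THREADS)))
for threads in SWEEP_THREADS:
    print(" ".join(ycsb_args("run", threads)))
```

In a sweep like this, each run would be preceded by the cache clearing, database restart and data set restore mentioned above, so that every thread count starts from the same state.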

10:22 SS: Moving on to the results. This slide shows the performance comparison when using MongoDB with Micron's heterogeneous-memory storage engine as the storage engine inside MongoDB. As you can see from the chart on the left, the difference in performance when going from local to NVMe over Fabrics is minimal, and it is corroborated by the difference in latency seen on the right.

It's interesting to note that the fabric has higher performance when inserting data. We think that this is due to the difference between updates and inserts in MongoDB. An update in MongoDB is typically a read operation and an update operation. That is, MongoDB has to read the existing record from the database, update its data and insert a new record to the database, whereas an insert operation in MongoDB is simply an insert to the database. And we think that this is why, for workload D, the EBOF performs better.

This slide compares the performance of MongoDB with its built-in WiredTiger storage engine, and you can see here that there is some performance loss when using NVMe over Fabrics compared to the local NVMe devices. The interesting thing to note here is that typically MongoDB is deployed as a multinode system, and so a lot of these latency and operations-per-second differences can be absorbed by using a multinode system.

12:06 SS: Finally, this slide shows the performance comparison when we used local disks with WiredTiger and NVMe over Fabrics disks with HSE as a storage engine in MongoDB. You can see that the use of HSE as a storage engine shows more than five times the improvement in workload A, and a significant improvement in latencies can be seen as well. We see improvement in performance across the board for all the workloads, but workload A has a much higher improvement. This clearly shows that Micron's HSE not only increases the performance of your application, but it can also do it extremely well when used over the network using NVMe over Fabrics. Additional testing was performed on Nvidia's GPUDirect Storage using the DGX A100. In this case, we provided direct connectivity to the GPUs using the EBOF, and this allowed us to scale past the limitation of eight drives in the DGX A100. More details about the results and configuration can be seen in my colleague Wes Vaske's presentation in the AI track, "Analyzing the effects of storage on AI workloads."

13:28 SS: In conclusion, we have looked at how Micron's heterogeneous-memory storage engine in a fabric environment provides a dramatic improvement when compared to the legacy WiredTiger storage engine used in MongoDB. The Marvell 88SN2400 converter controller and the Foxconn-Ingrasys EBOF allow easy scaling of NVMe drives in a data center. NVMe over Fabrics allows the consumer to design fully disaggregated data centers where applications can be composed and provisioned on the fly. It also allows you to grow your compute and storage independently. Today, low-cost bridges or data processing unit-based platforms are used to connect and bridge NVMe SSDs into the fabric, and in the future, we may see native NVMe over Fabrics SSDs, which will further reduce TCO.

Here are some references to some of the technologies used in this presentation. Thank you for your time attending this session. If you have any questions, please stick around and I'll be happy to answer them. Thank you.
