NVMe-oF fends off scale-out storage challenges

NVMe over Fabrics has proven to be a useful fix for the problems faced by scale-out storage architectures, and vendors and users alike are taking notice.

There are, in general, two recognized ways to scale storage architectures: scale-up and scale-out. Scale-up products work by increasing the capacity and horsepower of a single hardware platform, whereas scale-out tools increase capability with extra servers or nodes. Historically, scale-out has been more complex to implement, but with the advent of NVMe over Fabrics, could this be set to change?

Developed to reduce performance overhead over a fabrics network, nonvolatile memory express over Fabrics (NVMe-oF) can be used to get around the limits faced by some scale-out storage architectures. As more organizations look for high scalability, vendors are beginning to incorporate NVMe-oF technology into their products to reduce the complexities involved in scale-out storage.

The scale-out challenge

Scale-out storage tends to fall into two categories:

  1. Tightly coupled: Storage nodes or servers are closely bound to each other, with features like shared memory and proprietary high-speed backplanes. Some products that we think of as monolithic or scale-up are actually scale-out architectures, such as Dell EMC's PowerMax.
  2. Loosely coupled: In this scenario, nodes are not tightly bound together, but can operate separately. Nodes are connected using some high-speed networking -- typically, Ethernet -- that isn't directly built into the platform. A good example here is NetApp's SolidFire, which uses multiple 1U servers and standard 10 Gigabit Ethernet (GbE) networking.

Tightly coupled scale-out products generally offer higher levels of resiliency and more consistent performance, whereas loosely coupled architectures scale much further but have to deal with the impact of both storage drive and node failures.

Implementing scale-out storage is a challenge because data needs to be both protected and consistent. This means implementing techniques into the platform to detect when nodes fail and to reprotect data across a scale-out cluster in the event of device or node failure.
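To make the detect-and-reprotect idea concrete, here is a minimal sketch in Python. It is a toy model, not how any particular product works: the `Cluster` class, the heartbeat timeout and the replication factor of two are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy scale-out cluster (hypothetical): tracks heartbeats and replica placement."""
    replicas: int = 2      # assumed replication factor
    timeout: float = 5.0   # assumed seconds of silence before a node counts as failed
    last_seen: dict = field(default_factory=dict)   # node -> time of last heartbeat
    placement: dict = field(default_factory=dict)   # block id -> set of nodes holding a copy

    def heartbeat(self, node, now):
        self.last_seen[node] = now

    def failed_nodes(self, now):
        return {n for n, t in self.last_seen.items() if now - t > self.timeout}

    def reprotect(self, now):
        """Re-replicate blocks that lost copies on failed nodes; return the copies made."""
        failed = self.failed_nodes(now)
        healthy = sorted(set(self.last_seen) - failed)
        moves = []
        for block, holders in sorted(self.placement.items()):
            holders.difference_update(failed)   # drop copies that lived on failed nodes
            for node in healthy:
                if len(holders) >= self.replicas:
                    break
                if node not in holders:
                    holders.add(node)
                    moves.append((block, node))
        return moves


cluster = Cluster()
for name in ("a", "b", "c"):
    cluster.heartbeat(name, now=0.0)
cluster.placement = {"blk1": {"a", "b"}, "blk2": {"b", "c"}}

# Node "b" goes silent; "a" and "c" keep reporting in.
cluster.heartbeat("a", now=10.0)
cluster.heartbeat("c", now=10.0)
moves = cluster.reprotect(now=10.0)
print(moves)
```

Even in this simplified form, every node failure triggers extra copy traffic between the surviving nodes, which is part of why the interconnect matters so much in loosely coupled designs.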

What is NVMe-oF?

As performance demands in the data center have increased, storage has been a constant bottleneck in delivering fast and efficient applications. NVMe is a technology that was developed to reduce the storage protocol performance overhead with solid-state media. NVMe-based SSDs, which connect to a server using Peripheral Component Interconnect Express (PCIe), deliver much higher IOPS and throughput, at much lower latency, than SAS and SATA SSDs.

Figure: How NVMe over Fabrics works

The next logical step in NVMe development has been to enable the protocol over a fabric or network. NVMe-oF describes a set of standards that has been developed to transmit the NVMe protocol over a Fibre Channel (FC), Ethernet or InfiniBand network.

Figure: NVM Express transport mapping. The NVM Express transport is an abstract protocol layer that provides NVMe command and data delivery.

Today, products exist for NVMe over FC, NVMe-oF using remote direct memory access (RDMA) over Converged Ethernet (RoCE), NVMe over InfiniBand and NVMe/TCP using standard Ethernet network interface cards.

How does NVMe-oF help with scale-out storage?

One scenario we're seeing is the disaggregation of the components in a typical storage appliance. This architecture enables a more direct path between the host and storage media, bypassing the need to transmit data through a centralized controller. Even current scale-out products route I/O through a controller in this way, which can leave much of the capability of SSDs unused. By providing a more direct I/O path, a single host can talk to many drives and vice versa. This reduces latency and increases scale-out capability.

Part of the NVMe specification provides the feature set to make these tools work. With SAS and SATA drives, I/O is stacked into a single queue, creating a bottleneck when reading from and writing to the internal NAND media. NVMe introduces the capability for 65,535 queues, each of which can hold up to 65,535 queue elements. This makes it possible to implement a highly parallel, many-to-many architecture between host and drive, with a separate queue for each host/drive relationship.
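The one-queue-per-relationship idea can be illustrated with a small Python model. This is a sketch only: the `Drive` class and its simple round-robin servicing are assumptions for illustration, and real NVMe queue pairs and arbitration are considerably more involved. The two limits are the figures quoted above.

```python
from collections import deque

MAX_QUEUES = 65_535        # queue count quoted in the text
MAX_QUEUE_DEPTH = 65_535   # per-queue element count quoted in the text


class Drive:
    """Toy drive (hypothetical) that keeps one bounded queue per host."""

    def __init__(self, name, depth=MAX_QUEUE_DEPTH):
        self.name = name
        self.depth = depth
        self.queues = {}   # host name -> deque of pending commands

    def submit(self, host, command):
        if host not in self.queues:
            if len(self.queues) >= MAX_QUEUES:
                raise RuntimeError("no free queues on drive")
            self.queues[host] = deque()
        queue = self.queues[host]
        if len(queue) >= self.depth:
            raise RuntimeError("queue full")
        queue.append(command)

    def service_round(self):
        """Take one command from each non-empty queue, so no host waits behind another."""
        return [(host, q.popleft()) for host, q in self.queues.items() if q]


drive = Drive("nvme0")
drive.submit("host-a", "read 0")
drive.submit("host-a", "read 1")
drive.submit("host-b", "write 7")

first = drive.service_round()    # one command from each host
second = drive.service_round()   # host-a's remaining command
print(first, second)

# Theoretical ceiling on outstanding commands per drive, using the figures above.
print(MAX_QUEUES * MAX_QUEUE_DEPTH)
```

With a single shared queue, host-b's write would sit behind both of host-a's reads. With per-host queues, it is serviced in the first round.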

Vendors implementing this kind of technology include E8 Storage with E8 NVMe appliances and host-based software drivers. The appliance acts as a metadata server and Ethernet-to-PCIe bridge, with traditional storage tasks, like snapshots, offloaded to each connected host.

Excelero has a software-based product that connects together many servers into a mesh of storage consumers and providers. The NVMesh software enables any storage consumer to access any drive in any server without going through the target server CPU. The result is an architecture where the addition of extra capacity can be achieved with almost negligible overhead on existing applications.

WekaIO uses a similar technique to deliver a scale-out file system architecture called Matrix. The low latency of NVMe across the network, combined with distributed processing, enables the Matrix file system to operate at a speed that is faster than local drives.

Hardware focus

Vendors have also focused on building hardware-only tools that enable high scalability.

Pavilion Data Systems has developed a platform that uses up to 20 custom hardware blade servers and 72 NVMe drives to create a rack-scale architecture capable of supporting 120 gigabytes per second of bandwidth at 100 microseconds of latency. Application hosts use standard 40 GbE or 100 GbE RoCE network adaptors and NVMe-oF drivers.
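As a rough sanity check on those figures, the arithmetic below works out what 120 GBps across 72 drives implies per drive, and how many 100 GbE ports it would take to carry that bandwidth. It is back-of-the-envelope only: it assumes line-rate links and ignores protocol overhead, and it says nothing about how the vendor's platform is actually configured.

```python
import math

total_gbytes_per_s = 120   # aggregate bandwidth quoted in the text
drives = 72                # drive count quoted in the text
link_gbits_per_s = 100     # one 100 GbE port, assumed to run at line rate

per_drive = total_gbytes_per_s / drives       # GBps each drive must sustain
link_gbytes_per_s = link_gbits_per_s / 8      # 100 Gbps expressed in GBps
links_needed = math.ceil(total_gbytes_per_s / link_gbytes_per_s)

print(f"{per_drive:.2f} GBps per drive")
print(f"{link_gbytes_per_s} GBps per 100 GbE port")
print(f"{links_needed} ports needed at line rate")
```

Under these assumptions, each drive has to sustain under 2 GBps, while the network side needs on the order of 10 ports, which suggests the fabric rather than the media is the harder part to scale.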

Vexata has developed an architecture that uses commodity hardware components to scale capacity and performance around an Ethernet midplane. Back-end scalability is achieved through hardware-based Enterprise Storage Modules (ESMs), while front-end connectivity offers NVMe-oF capability and a direct hardware I/O path with I/O modules (IOMs). Existing implementations offer up to 16 ESMs and two IOMs, although the architecture can scale well beyond that.

Apeiron Data Systems is another startup following the hardware model. The Apeiron ADS1000 platform uses NVMe over Ethernet and custom host bus adapters to deliver a scale-out architecture that can grow to support thousands of drives in a single configuration.

NVMe-oF is providing the capability to remove the constraints of a traditional architecture and to create products that are more distributed in nature. The common thread for all of these vendor offerings is to reduce the length and impact of the I/O path from host to media. This will be a feature of future storage designs, as latency remains the biggest challenge for storage to overcome.
