Guest Post

Software-Enabled Flash for Hyperscale Data Centers

Software-enabled flash is a technology that redefines digital storage in a new way. Here you will find an introduction to software-enabled flash and its key capabilities.


00:02 Rory Bolt: Hi, I'm Rory Bolt, one of the principal architects of software-enabled flash, a newly announced technology from Kioxia. Today, I'll be covering what software-enabled flash is, why we created it, and the concept and capabilities it contains. I'll finish up by directing you to a website where you can find more information on software-enabled flash.

What is software-enabled flash? Software-enabled flash is a media-based, host-managed approach to managing flash resources. We'll talk more about this in the following slides. We redefine host interactions with flash to allow applications to maximize performance and extract maximum parallelism from flash resources, and we also provide functionality that would be difficult, if not impossible, to achieve with existing interfaces. Giving the host control over the media enables the host to define the behavior and performance characteristics of flash-based solutions. This can enable new storage functionality that maximizes the performance and efficiencies of flash, or simply stated, software-enabled flash gives you the ability to extract the maximum value from your flash resources.

01:15 RB: Coupled with this media-based hardware approach is a software-enabled API. This API is designed to make managing flash media easy, while exposing all the capabilities inherent in flash. The highlights in this API are, first and foremost, it's an open source flash-native API. It's designed to make it easy to create storage solutions with powerful capabilities. It abstracts the low-level details that are vendor- and flash generation-specific, so that your code can take advantage of the latest flash technologies without modification. Being an open project, any flash vendor can build a software-enabled flash device that is optimized to their own media.

And finally, even though just the API and documentation are published today, Kioxia will be releasing an open source software development kit with reference source code, utilities and examples in the near future.

02:14 RB: As previously introduced, software-enabled flash consists of hardware and software components working together. Over the past several years, I've had a fantastic opportunity. I've been able to meet with storage developers for most of the world's hyperscalers and talk to them about their storage needs. Taken with the input of other engineers at Kioxia, it has allowed us to distill a list of basic requirements and features for hyperscale customers.

I should mention that although these hyperscalers face similar problems, their individual priorities and approaches vary significantly, so a generalized solution is required. On the requirement side, flash abstraction is not just about code reuse. There can be major economic and performance advantages to transitioning quickly to the newest generations of flash at scale. Scheduling is an increasingly important topic for the hyperscale customers.

03:10 RB: Access to all the media, or how to avoid the RAID tax, has implications on the overall cost of the solution. Host CPU is a sellable resource to the hyperscalers, and so we have to minimize our impact on the host CPU resources. Finally, flexibility in DRAM relieves the device from handling worst-case scenarios. On the functionality side, data placement is important to minimize write amplification. Isolation is important for both security and relief from noisy neighbor problems. Latency is a topic near and dear to hyperscale customers, and greatly affects the performance of their systems. Buffer management is tied to the flexible DRAM configurations that we'll be talking about soon. And adaptability. The hyperscale environment is dynamic and changes rapidly behind the scenes. The preference is for standard configurations that can be provisioned, configured and deployed in real time.

04:15 RB: In response to those requirements and required features, the software-enabled flash API covers these major areas and a few more. Grouped broadly, the areas of the API are concerned with hardware structure, buffer management, programming and error management of the flash. Not listed separately, but touching on almost all of these areas, is latency control.

In order to maximize the performance of flash storage, it is necessary to be aware of the geometry of the flash resources. The software-enabled flash API allows storage applications to query this geometry in a standardized manner. Some of the characteristics exposed are listed here. Additionally, the API allows for control over how many flash blocks may be open simultaneously and can control and manage the associated write buffers.

When discussing the programming of flash, it's important to note that the optimal programming algorithms vary from vendor to vendor, and often between flash generations for any given vendor. Software-enabled flash handles all of these details and lets the flash vendor optimize for their own media. The API was created with consideration for other flash vendors, as well as all the foreseeable needs of Kioxia's own future flash generations.

05:33 RB: Finally, with respect to error management, software-enabled flash allows vendors to optimize error-reduction techniques specific to their own flash media characteristics, and it also controls the time budget for error recovery attempts.

This is a high-level block diagram of our software-enabled flash controller. It's one possible design, and other vendors are free to implement different architectures, as long as they comply at the API level. For example, although we use a Toggle interface to connect to our flash chips, as illustrated on the bottom of the hardware block diagram, other vendors might implement an ONFI connection to their flash media. Note that our design uses standardized interfaces where possible and is focused on the management of the flash media itself: programming, lifetime management and defect management. The controller has advanced scheduling capabilities on a per-die basis and hardware acceleration for garbage collection if needed, as well as wear-leveling tasks. And another important point to call out is that the use of the on-device DRAM is optional. Host memory resources can be used instead.

06:53 RB: Now, I'd like to introduce the software components of software-enabled flash. Kioxia has created and will release a software development kit. This will provide open source reference block drivers, an open source reference flash translation layer and open source reference device management utilities. Bundled with the SDK will be an API library, once again in open source, as well as the device driver. This block diagram shows how the pieces of the SDK interface with each other.

Two notes on the software layering. First, as you can see, it's possible to write user applications that are software-enabled flash native that can interact directly with the device. We have a couple of proof-of-concept applications running today on our FPGA prototypes -- these include a software-enabled flash engine for FIO, as well as a version of RocksDB and Firecracker that are software-enabled flash native.

07:49 RB: Although these applications are currently just proofs of concept, in the future, we plan to include open source native applications as part of the SDK itself. The second thing to note is called out as the SDS stack in the center of the diagram; this is a software-defined storage stack. Hyperscalers are already running their own software-defined storage stacks in their environments; these software-defined storage stacks can be modified to interface with software-enabled flash at the library level and are not dependent on the use of the SDK at all. In such cases, the SDK serves as a detailed example of best practices for software-enabled flash.

Now, I'll introduce the concepts of software-enabled flash that are necessary to understand the features we provide. This is a depiction of a software-enabled flash unit. At the top is the actual software-enabled flash controller; beneath it, the die are arranged in four banks on eight channels, providing 32 die.

09:00 RB: The first concept that I will introduce is that of a virtual device. A virtual device is a set of one or more flash die providing hardware-level isolation. An individual flash die can only be allocated to one virtual device. The next concept is that of a QoS domain. A QoS domain is a mechanism to impose capacity quotas and provide scheduling policy and software-based isolation. You can have multiple QoS domains sharing a single virtual device.

The final notion is that of a placement ID, and this is a mechanism to group data at the superblock level within a QoS domain. It should be noted that QoS domains are the items that are exposed to the system as device nodes. This slide describes how the concepts introduced provide control over data isolation and placement. Virtual devices are groups of physical die. This provides hardware-level isolation, but scalability is limited by the number of die available on the device. QoS domains provide a logical isolation managed by the device.
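As a rough illustration of the hardware-level isolation rule just described, the following sketch models a unit with the 8-channel by 4-bank geometry from the talk and refuses to place one die in two virtual devices. The class and method names (`SefUnit`, `create_virtual_device`) are hypothetical and are not taken from the published API.

```python
from dataclasses import dataclass, field

CHANNELS, BANKS = 8, 4  # geometry from the talk: 8 channels x 4 banks = 32 die


@dataclass
class SefUnit:
    """Toy model of one unit; records which virtual device owns each die."""
    die_owner: dict = field(default_factory=dict)  # (channel, bank) -> name

    def create_virtual_device(self, name, dies):
        dies = set(dies)
        for channel, bank in dies:
            if not (0 <= channel < CHANNELS and 0 <= bank < BANKS):
                raise ValueError(f"die {(channel, bank)} is outside the unit")
        # A die may belong to only one virtual device.
        taken = sorted(d for d in dies if d in self.die_owner)
        if taken:
            raise ValueError(f"die already allocated: {taken}")
        for die in dies:
            self.die_owner[die] = name
        return dies


unit = SefUnit()
# First three channels, all four banks: 12 die in one virtual device.
first = unit.create_virtual_device(
    "vd0", [(ch, bank) for ch in range(3) for bank in range(BANKS)]
)
```

A second virtual device that tried to reuse die `(2, 0)` would raise `ValueError`, which is the property that limits scalability to the number of die on the device.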

10:17 RB: A QoS domain consists of a group of superblocks. The superblocks are never shared between QoS domains, so there is no mixing of data at the block level. QoS domains are highly scalable into the thousands of domains per device. Closely related to data isolation is data placement. Here, we introduce the concept of the nameless write mechanism to control data placement.

Why is a new write mechanism necessary? Although system benefits can be realized with control over data placement, if physical addressing is allowed for writes, the host becomes responsible for device wear. Flash memory is a consumable; poor choices for physical data placement can wear out flash devices quickly. So, how can a host have control over data placement without the responsibility of ensuring device health? The answer is nameless write. When a new superblock is required for a placement ID, or if a new superblock is manually allocated, the device chooses the optimal superblock to use from a free pool.

11:25 RB: Superblocks start out in the free pool within the storage device. As the QoS domains allocate storage, a block is drawn from this pool and assigned to the domain. The device can choose any free block from the pool, so it can track block wear and block health and assign the optimal block to maximize device endurance. When a superblock is released, it is returned to the free pool, so over the lifetime of the device, ownership of a superblock may change if there are multiple QoS domains within the virtual device. This is the framework for the nameless write mechanism. Now, let's show how a nameless write works. Nameless write allows the device to choose where to physically write the data, but allows the host to bound the possible choices of the device. As mentioned earlier, QoS domains are mapped to device nodes in the system, so the nameless write operations must supply a QoS domain handle as well as either a placement ID for auto allocation mode or a superblock flash address returned by a previous manual superblock allocate command.
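The free-pool behavior can be sketched in a few lines. This is only an illustration: the talk says the device picks the "optimal" block using wear and health data, and the least-erased-first policy below is an assumed stand-in for that choice. `FreePool` and its methods are hypothetical names.

```python
import heapq


class FreePool:
    """Toy free pool that hands out the least-worn superblock first."""

    def __init__(self, erase_counts):
        # erase_counts: superblock id -> number of erase cycles so far
        self._heap = [(count, sb) for sb, count in erase_counts.items()]
        heapq.heapify(self._heap)

    def allocate(self):
        """Device-side choice: the host never names the superblock."""
        if not self._heap:
            raise RuntimeError("no free superblocks")
        erase_count, superblock = heapq.heappop(self._heap)
        return superblock, erase_count

    def release(self, superblock, erase_count):
        """Return a superblock to the pool; any domain may receive it next."""
        heapq.heappush(self._heap, (erase_count, superblock))


pool = FreePool({0: 5, 1: 2, 2: 9})
superblock, wear = pool.allocate()  # superblock 1, the least worn
```

Because released superblocks go back into a shared pool, the same superblock can be owned by different domains over the life of the device, as the talk notes.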

12:34 RB: The QoS domain maps to a virtual device, which specifies which die can be used for the write operation. The placement ID or flash address specifies which superblock owned by the domain should be used for the write operation. If a placement ID is specified, a nameless write can span superblocks, and additional superblocks will be allocated by the domain as needed.

In the manual allocation mode, nameless writes cannot span superblocks. The device is free to write the data into any open space within the bounds specified by the host, and when the write is complete, the actual physical address is returned, enabling a direct physical read path with no address translation. The nameless write operation automatically handles media defects, and direct physical read optimizes performance and minimizes latency.
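A minimal sketch of the auto-allocation flow may help here. The host supplies only a placement ID and the data; the model picks the location, spills into a fresh superblock when the current one fills, and returns the physical addresses for later direct reads. All names and the tiny superblock size are assumptions made for illustration, and defect handling is left out.

```python
from dataclasses import dataclass, field

SUPERBLOCK_SIZE = 4  # data blocks per superblock; deliberately tiny


@dataclass
class Domain:
    """Toy domain: owns superblocks and performs nameless writes."""
    free_pool: list                              # superblock ids available
    open_sb: dict = field(default_factory=dict)  # placement id -> superblock
    media: dict = field(default_factory=dict)    # superblock -> written blocks

    def nameless_write(self, placement_id, blocks):
        """Write blocks; return the (superblock, offset) chosen for each."""
        addresses = []
        for block in blocks:
            sb = self.open_sb.get(placement_id)
            if sb is None or len(self.media[sb]) == SUPERBLOCK_SIZE:
                sb = self.free_pool.pop(0)  # the device's choice, not the host's
                self.open_sb[placement_id] = sb
                self.media[sb] = []
            self.media[sb].append(block)
            addresses.append((sb, len(self.media[sb]) - 1))
        return addresses

    def read(self, address):
        """Direct physical read: no translation table is consulted."""
        sb, offset = address
        return self.media[sb][offset]


domain = Domain(free_pool=[7, 3])
addrs = domain.nameless_write(1, [b"a", b"b", b"c", b"d", b"e"])
```

In this run the fifth block lands in a second superblock, mirroring the statement that a placement-ID write can span superblocks. A manual-mode variant would instead reject a write that does not fit in the named superblock.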

13:34 RB: Similar to the nameless write operation, software-enabled flash has a nameless copy operation that can be used to move data within the device without host processing. This is useful for implementing garbage collection, wear-leveling and other administrative tasks. The nameless copy function takes as input a source superblock, a destination superblock and a copy instruction. These copy instructions are powerful primitives supporting valid data bitmaps, lists of logical block addresses or even filters of logical block addresses to select which data will be copied from the source. The nameless copy operation can operate on entire superblocks with a single command. This animation illustrates the difference in impact between host-implemented garbage collection using standard read and write commands and garbage collection using a nameless copy.
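The valid-data-bitmap form of the copy instruction can be sketched as follows. Superblocks are modeled as plain Python lists, and `nameless_copy` is a hypothetical function name; the real command runs inside the device, which is what removes the host reads and writes.

```python
def nameless_copy(source, valid_bitmap, destination):
    """Copy the blocks of `source` whose bitmap entry is set.

    Returns a mapping of old offset -> new offset in `destination`,
    which a host would use to update its own lookup structures.
    """
    if len(valid_bitmap) != len(source):
        raise ValueError("bitmap must have one entry per source block")
    moved = {}
    for offset, (block, valid) in enumerate(zip(source, valid_bitmap)):
        if valid:
            destination.append(block)
            moved[offset] = len(destination) - 1
    return moved


source_sb = ["a", "stale", "c", "stale"]
dest_sb = ["x"]
moved = nameless_copy(source_sb, [1, 0, 1, 0], dest_sb)
```

After the call, the source superblock holds no data the host still needs, so it can be released back to the free pool.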

14:31 RB: Another key feature of software-enabled flash is the advanced queuing and scheduling features. We will spend a bit of time here with several slides. Scheduling and queuing control how die time is spent, read versus write versus copy, as well as prioritization of multiple workloads. Consider a multi-tenant environment; there may be business reasons to enforce fairness or to give certain tenants priority, and the business needs may change over time. These tenants can share a device, and weighted fair queuing can be used to support the performance goals that the business needs. The host is allowed to prioritize and manage die time through the software-enabled flash API, and the device will enforce the scheduling policy.

15:19 RB: This is the basic architecture of the software-enabled flash scheduler. Each virtual device has eight input FIFO queues, and since a virtual device can be as small as one individual die, this maps down to a 1-1 ratio with die. The device scheduler handles suspend-resume for all program and erase operations. The host can specify a specific queue for each type of flash access command, so you could have one queue used for read operations, a separate queue used for program operations and a third queue used for erase operations on a per-QoS-domain basis. Every queue can specify the die time weights for the individual read, program and erase commands. And, finally, the host can override the default queue assignments as well as the default weights for each individual flash access command.

16:22 RB: Taken together, this enables a very flexible scheduling system. If all the weights are set to zero across all the input queues, it works as an eight-level priority scheduler with queue zero being the highest priority and queue seven being the lowest priority input. When all the weights are the same value, it works as a round-robin scheduler. And when all the weights are unique, it works as a die time weighted fair queueing scheduler.
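One way to see how a single mechanism can yield the modes described is the sketch below. It assumes each queue is charged its weight every time one of its commands is dispatched, and that the least-charged non-empty queue goes next with ties broken by the lowest queue number. That charging rule is an interpretation for illustration, not the device's documented algorithm, and it simplifies the per-command read, program and erase weights to one weight per queue.

```python
from collections import deque


class DieScheduler:
    """Toy die-time scheduler with one weight per input queue."""

    def __init__(self, weights):
        self.weights = list(weights)
        self.queues = [deque() for _ in self.weights]
        self.charged = [0] * len(self.weights)  # die time charged so far

    def submit(self, queue_index, command):
        self.queues[queue_index].append(command)

    def next_command(self):
        ready = [i for i, q in enumerate(self.queues) if q]
        if not ready:
            return None
        # Least-charged queue wins; ties go to the lowest queue number.
        chosen = min(ready, key=lambda i: (self.charged[i], i))
        self.charged[chosen] += self.weights[chosen]
        return self.queues[chosen].popleft()


def drain(weights, submissions):
    """Submit (queue, command) pairs, then return the dispatch order."""
    scheduler = DieScheduler(weights)
    for queue_index, command in submissions:
        scheduler.submit(queue_index, command)
    order = []
    while (command := scheduler.next_command()) is not None:
        order.append(command)
    return order


work = [(1, "b1"), (0, "a1"), (1, "b2"), (0, "a2")]
priority_order = drain([0, 0], work)     # zero weights: strict priority
round_robin_order = drain([1, 1], work)  # equal weights: round robin
```

With zero weights no queue is ever charged, so queue 0 always wins the tie and the model degenerates to strict priority. With equal weights the queues alternate, and with unequal weights the queue with the larger weight is served proportionally less often.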

This is an image taken from a demonstration of our FPGA prototype. We defined three separate virtual devices. Each of the virtual devices we defined was running a separate storage protocol and a different synthetic workload. In this example, we are showing the ZNS protocol, the standard block protocol, and a custom hyperscale FTL being run simultaneously on the same physical unit. A bit of explanation about the graph you see. This is actually a heat map on the die level for the SEF unit, so labeled across the front of the diagram is the channel, and the banks extend backwards.

17:53 RB: So, we defined a QoS domain spanning a virtual device on the first three channels and four banks; this was running a mixed read/write workload on the ZNS protocol. You can see that the vertical bars -- blue for read and red for write -- are roughly even in their distribution. We then had an isolated workload running on a separate die set in the foreground spanning channels three through five, and this was a write-dominated workload; it was therefore showing red, or write, operations dominating the heat map of the die, and this was running the standard block protocol. And, finally, towards the back of the illustration, there is a third virtual device running the custom hyperscale FTL that was doing a read-dominated workload, hence the blue bars.

19:02 RB: We really don't think that this is a common use case, but it illustrates several things. It illustrates that these workloads have hardware-level isolation running on separate die sets and were not interfering with each other. And it also illustrates that the same device can be reconfigured and deployed in different modes of operation. This maps very well to the hyperscale environment, where they want to deploy a single device at scale and then provision and configure the device for their dynamic needs, and as new storage protocols evolve, software-enabled flash can quickly adapt.

19:46 RB: And now, a summary. Software-enabled flash fundamentally changes the relationship between flash media and the host. It consists of purpose-built hardware that manages the low-level details of media programming and lifecycle. It has an open source flash-native API. It leverages industry-standard protocols and as illustrated it can be used as a building block to implement other drive protocols. Finally, software-enabled flash is a combination of ease of use and full host control.

20:35 RB: For more information on software-enabled flash, I advise you to visit our microsite www.softwareenabledflash.com. There, you can find a white paper that you may download, and finally, you will find a link to a GitHub repository where you can find the actual SEF API as well as its documentation. Thank you.
