The microprocessor democratized computing and, ever since, programmable smarts have migrated to more devices. Consequently, processing has moved closer to the data source in most devices -- think PCs, smartphones, industrial sensors and surveillance cameras. But there's one exception: the data center. Once x86 systems broke the mainframe's monopoly in the data center, application processing stopped at the server. Sure, network switches and storage arrays have powerful CPUs, but with rare exceptions, those CPUs are dedicated to running embedded code, not user applications.
The GPU changed all that, accelerating machine and deep learning algorithms and fueling a new generation of AI applications. And now, we're on the verge of another epochal shift toward more localized computation for data-intensive operations via computational storage drives and devices.
Computational storage embeds processors within storage devices -- typically, at the SSD or NVMe-drive level and not within NAND flash memory modules. Here, we look at how computational storage is playing out in terms of storage architectures, the drives themselves, coding requirements and available products.
Despite the idea's simplicity, the technology to make computational storage devices is relatively new and only a few companies are working on products. The market is immature and dynamic.
Nonetheless, the Storage Networking Industry Association (SNIA) is working to develop documentation and standards for computational storage technology. The most recent draft of SNIA's Computational Storage Architecture and Programming Model defines the features and capabilities of several categories of computational storage devices (CSxes), namely:
- Computational storage drive (CSD), providing persistent storage and various computational services.
- Computational storage processor (CSP), acting as a discrete storage processor without local storage but used to add local processing to a storage array.
- Computational storage array (CSA), aggregating multiple CSDs or conventional drives with a CSP.
- Computational storage service (CSS), providing access to algorithms and functions acting on a computational storage drive or a CSA.
A CSS can be programmable and include an image loader, an embedded OS or a container platform that the host can configure with custom services, such as packet filters, deep learning inference calculations or a Hadoop cluster node. A CSS can also be fixed, with built-in functions, such as data compression, encryption, redundancy calculations like RAID and erasure coding, database operations, and regular expressions.
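One way to picture these categories is as a small data model. The Python sketch below is illustrative only: the class names, fields and the `load` method are assumptions made up for this example, not part of SNIA's draft or any vendor's API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Union
import zlib


@dataclass
class Service:
    """A CSS: a named function the device can run on data (hypothetical model)."""
    name: str
    run: Callable[[bytes], bytes]
    programmable: bool = False  # False = fixed function built into the device


@dataclass
class CSP:
    """Compute only: holds services but has no persistent storage."""
    services: Dict[str, Service] = field(default_factory=dict)

    def load(self, service: Service) -> None:
        # Register a service so callers can look it up by name.
        self.services[service.name] = service


@dataclass
class CSD(CSP):
    """Compute plus persistent storage (inheritance is only a shortcut here)."""
    blocks: Dict[int, bytes] = field(default_factory=dict)


@dataclass
class CSA:
    """An array: one CSP in front of several CSDs or conventional drives."""
    processor: CSP
    drives: List[Union[CSD, Dict[int, bytes]]] = field(default_factory=list)


csd = CSD()
csd.load(Service("compress", zlib.compress))  # fixed, built-in function
csd.load(Service("upper", lambda b: b.upper(), programmable=True))  # host-supplied
csd.blocks[0] = b"hello"
result = csd.services["upper"].run(csd.blocks[0])

# A conventional drive is modeled as a plain block map with no services.
array = CSA(processor=CSP(), drives=[csd, {}])
```

The only difference between the fixed and programmable services in this toy model is who supplies the function: the device ships with `compress`, while the host loads `upper` after the fact.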
SNIA's draft document describes two operating models that define how the host system accesses a CSD:
- Direct operation, in which the host system uses the drive interface, such as PCIe, to access a computational storage drive or processor and execute functions using its API.
- Transparent operation, in which the host system accesses computational features using a standard storage API.
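The difference between the two models is easier to see in a toy example. The sketch below uses a hypothetical in-memory class, `ToyCSD`, whose method names (`write`, `read`, `execute`, `stored_size`) are assumptions for illustration and don't correspond to any real drive interface. Compression stands in for a transparent feature, and a match counter stands in for a directly invoked function.

```python
import zlib


class ToyCSD:
    """Hypothetical in-memory stand-in for a computational storage drive."""

    def __init__(self):
        self._blocks = {}  # logical block address -> bytes as stored on the drive

    # Transparent operation: the host uses an ordinary read/write API and
    # never sees that the drive compresses data internally.
    def write(self, lba, data):
        self._blocks[lba] = zlib.compress(data)

    def read(self, lba):
        return zlib.decompress(self._blocks[lba])

    # Direct operation: the host explicitly asks the drive to run a named
    # function next to the data and gets back only the result.
    def execute(self, function_name, lba, *args):
        functions = {"count_matches": self._count_matches}
        return functions[function_name](lba, *args)

    def _count_matches(self, lba, needle):
        return self.read(lba).count(needle)

    def stored_size(self, lba):
        # Bytes actually held on the drive for this block.
        return len(self._blocks[lba])


drive = ToyCSD()
payload = b"error: disk full\n" * 1000

drive.write(0, payload)                              # transparent path
hits = drive.execute("count_matches", 0, b"error")   # direct path
```

In the transparent path, the host gets its original bytes back from `read` while the drive stores far fewer. In the direct path, only the integer count crosses the interface rather than the full payload, which is the main appeal of moving the computation to the drive.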
The SNIA specification outlines the type of actions a CSx should be capable of performing, including:
- Management operations, such as service discovery and device configuration, programming, initialization and monitoring.
- Operations, including data storage, retrieval and interchange with adjacent CSDs.
- Security, such as authentication, authorization, encryption and auditing.
SNIA's spec deals with the logical operation of CSxes. It doesn't cover the physical interface between hosts and devices. However, most implementations use NVMe with devices attached to an NVMe interface, NVMe fabric or PCIe bus.
SNIA's Computational Storage Technical Work Group is leaning toward NVMe and PCIe as the interface standard based on computational storage use cases and member submissions, according to Scott Shadley, work group co-chair and NGD Systems vice president of marketing. However, as the technology evolves, computational storage drives with embedded OSes will likely expose a TCP/IP port to supplement or replace the NVMe PCIe interface.
Computational storage drives in practice
All available CSDs are based on NVMe drives or PCIe cards and use the same protocol to communicate between the host CPU and drive processors. The devices use either a field-programmable gate array (FPGA) or an Arm system on chip (SoC) for local processing, and they vary in functionality and flexibility. The FPGA-based products are generally more limited than the Arm devices. Popular implementations include:
- PCIe accelerator card or U.2 or M.2 NVMe drive with embedded flash or high-bandwidth memory and an FPGA preprogrammed with data and volume management functions, such as compression, erasure coding, deduplication, encryption and file system or database operations.
- PCIe accelerator card or U.2 or M.2 NVMe drive with embedded memory, controller and an unprogrammed FPGA. These products enable users to add custom functions to a CSD using a high-level development environment like Xilinx Vitis or a low-level FPGA hardware description language.
- PCIe accelerator card or U.2 or M.2 NVMe flash drive with an embedded Arm SoC that's either preloaded with particular data transformation functions or fully programmable using custom Arm code. To our knowledge, NGD is the only vendor using Arm processors in a CSD, and these are preloaded with an embedded Linux OS.
Computational storage drive vendors like the NVMe interface because every server OS includes drivers for accessing the storage and functions. NGD, for instance, uses Linux on Arm with a standard NVMe drive, letting users access the device via a GUI, host drive interface or even an SSH tunnel to the CSD's network interface, Shadley said. The SNIA technical work group is defining new NVMe commands and creating user libraries to simplify the loading and execution of CSD programs, he added.
Although multiple CSDs can operate in parallel, they can't directly send data and compute requests among themselves. Instead, they require the host process to simultaneously send compute commands to each drive. Shadley said future versions will allow for peer-to-peer CSD communication, for example, to send the results from one device to another for further processing. Such a peer-to-peer system would be useful in deep learning networks where the output of one model layer becomes the input for another.
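The host-coordinated pattern described here can be sketched in a few lines of Python. Everything in this example is hypothetical: `ToyDrive`, `count_matching` and `fan_out` are made-up names, and each drive is simulated with an in-memory list of records rather than real hardware.

```python
from concurrent.futures import ThreadPoolExecutor


class ToyDrive:
    """Hypothetical drive that can filter its own records locally."""

    def __init__(self, records):
        self.records = list(records)

    def count_matching(self, predicate):
        # Runs "on the drive": only the count goes back to the host.
        return sum(1 for record in self.records if predicate(record))


def fan_out(drives, predicate):
    """Host sends the same compute request to every drive in parallel."""
    with ThreadPoolExecutor(max_workers=len(drives)) as pool:
        # pool.map returns results in the same order as the input drives.
        return list(pool.map(lambda d: d.count_matching(predicate), drives))


drives = [
    ToyDrive(range(0, 100)),
    ToyDrive(range(100, 200)),
    ToyDrive(range(200, 300)),
]

# The drives never talk to each other; the host collects and combines results.
partials = fan_out(drives, lambda x: x % 10 == 0)
total = sum(partials)
```

Because the drives can't exchange data, any second processing stage that depends on `partials` would also have to be issued by the host. A peer-to-peer design would remove that round trip.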
Code development and loading
Computational storage drives that use FPGAs must be programmed before use. Although FPGAs can be programmed after deployment -- that's the whole point of field programmability -- the process isn't as simple as loading a container image or Java Archive file. Thus, current FPGA-based CSDs can't be loaded with code on the fly. Instead, products are preloaded with fixed functions, eliminating the need for user programming.
In the case of NGD programmable CSDs, Arm code is loaded from the host CPU to the drive using the NVMe interface. "To run code in our Arm cores, the x86 CPU process has to be cross-compiled to Arm," Shadley said.
Because computational storage is nascent and largely proprietary, vendors generally don't provide public access to technical documentation or APIs. Given the lock-in potential of proprietary products, potential buyers should ask to see all available technical and programming documentation and acquire loaner hardware to perform a thorough evaluation before making this strategic purchase.
Computational storage vendors
Companies with CSD offerings include:
- Eideticom Communications offers FPGAs preprogrammed with several functions on either a PCIe add-in card or U.2 drive.
- NGD Systems offers the Arm-based Newport Platform as a PCIe add-in card, EDSFF (ruler drive), U.2 or M.2 drive.
- Samsung offers the SmartSSD, an FPGA-based drive that can be programmed via the Xilinx Vitis development environment or Xilinx HDL. Samsung has partnered with companies to provide custom IP development.
- ScaleFlux offers a preprogrammed FPGA with data compression/decompression algorithms designed to boost IOPS, reduce latency and increase capacity for database applications using Aerospike, MySQL and PostgreSQL. It is available as a PCIe card or U.2 drive.
NGD has also worked with academic researchers on the Catalina CSD, designed for high-performance computing applications, although we aren't aware of any commercial implementations.