Understand how the CXL SSD can aid performance
Although CXL SSDs are only in the research phase, their performance shows promise. When used as simple memory extensions, they should improve programs that manage large data sets.
Researchers have recently developed and simulated SSDs that use the Compute Express Link (CXL) interface to speed up their operation. These drives are intended to provide a large persistent memory pool through the interface that's cheaper than an all-DRAM pool.
A CXL SSD is also an alternative way to take advantage of the Storage Networking Industry Association (SNIA) NVM Programming Model, since non-volatile DIMMs are costly and Intel is winding down its support of Optane. Enterprises will see benefits when CXL SSDs team with next-generation programs that use that model. This new breed of storage should provide a much-needed boost to tomorrow's computing systems.
CXL SSDs might become popular in applications with large memory requirements. However, it may be unclear why a CXL SSD would perform any differently than an NVMe SSD.
What is a CXL-based SSD?
The difference between CXL and PCIe may not be obvious. At the signal level, the two are the same, but their protocols differ. CXL uses a faster, lower-overhead protocol than PCIe for memory traffic, although CXL.io still supports standard PCIe I/O devices.
CXL was developed to support large pools of memory off the server motherboard (far memory) to augment the memory that resides on the server motherboard (near memory). All this memory is mapped into the server's memory address space and falls under the management of the server processor chip's memory management unit.
Unlike PCIe, CXL needs to manage coherency. The contents of any memory address, whether in near or far memory, can be less current than the copy of that address held in a processor's cache. That's not a problem until another processor tries to read that memory address. CXL's memory coherency scheme is carefully specified to ensure that stale data never reaches a processor if a newer version exists in another processor's cache.
Software accesses the memory on a CXL.mem or CXL.cache device through byte semantics -- the software treats it the same as memory on the server board itself. If an SSD is a CXL device, it too must present itself to software, through the CXL.mem protocol, as if it were memory.
A standard SSD communicates through NVMe over the PCIe bus using block semantics. In the CXL standard, this kind of communication is carried by CXL.io.
CXL manages the timing differences between near memory and the necessarily slower far memory. A CXL SSD takes this to the extreme: The drive has the option of performing as slowly as a standard SSD, and the CXL channel still delivers the data to the processor using memory semantics. Expect to see CXL SSDs sporting relatively enormous caches to minimize such slow operation.
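To make the contrast between block semantics and byte semantics concrete, here is a minimal Python sketch. It is only an analogy: an ordinary temporary file stands in for the device, `mmap` stands in for memory-mapped access, and the 4 KB block size and all function names are assumptions made for the illustration, not part of CXL or NVMe.

```python
import mmap
import os
import tempfile

BLOCK_SIZE = 4096  # assumed block size, chosen only for this illustration


def make_demo_file(num_blocks: int = 3) -> str:
    """Create a temporary file whose byte at position i holds the value i % 251."""
    data = bytes(i % 251 for i in range(num_blocks * BLOCK_SIZE))
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(data)
        return f.name


def read_byte_block_semantics(path: str, offset: int) -> int:
    """Fetch one byte the block way: read the whole enclosing block, then index it."""
    block_start = (offset // BLOCK_SIZE) * BLOCK_SIZE
    with open(path, "rb") as f:
        f.seek(block_start)
        block = f.read(BLOCK_SIZE)  # an explicit I/O request moves a full block
    return block[offset - block_start]


def read_byte_memory_semantics(path: str, offset: int) -> int:
    """Fetch one byte the memory way: map the file and index it like an array."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return mapped[offset]  # looks like a plain load to the program


if __name__ == "__main__":
    demo_path = make_demo_file()
    try:
        print(read_byte_block_semantics(demo_path, 5000))
        print(read_byte_memory_semantics(demo_path, 5000))
    finally:
        os.remove(demo_path)
```

Both functions return the same byte. The difference is in how the program asks for it: the first issues a request for a whole block and picks one byte out of it, while the second simply indexes an address and leaves the data movement to the layers underneath.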
Samsung: This is the way
Samsung is an advocate of SSDs on the CXL interface, having demonstrated something that the company calls a "memory-semantic SSD" (MS-SSD) at the 2022 Flash Memory Summit.
The theory behind an MS-SSD is that, from a software standpoint, the persistent media in the drive is accessed through memory byte semantics rather than through the block semantics that are ordinarily used for SSDs.
I/O semantics (block semantics) are run through an interrupt-driven system, which has been the norm perhaps since the early 1980s. Back then, the software's I/O routine could add milliseconds of delay to a disk access without being noticed. It was a great way to deal with I/O devices that were significantly slower than the processor.
In the early 2000s, when SSDs first started to gain widespread acceptance, SSD users noticed that the I/O routine slowed down the SSD. A new focus on that software improved its speed, but the basic structure of disk I/O, which was managed with processor interrupts, limited the extent of the improvements.
Meanwhile, the DRAM bus, which had to run as fast as possible, had been made synchronous in the 1990s, and with that, the bus was stripped of any ability to pause for a slow memory device. Memory was the only thing that could attach to the memory channel. Storage was forced to go through an I/O channel.
The software element became an issue when persistent memory came along, as the figure depicts.
The red portion of both the upper and lower bars represents the delay that results from the software's I/O stack. It's a manageable portion of the overall access time for the upper bar, which represents the timing of an NVMe NAND flash SSD. In the lower bar, which includes Intel's Optane persistent memory modules, the red portion is about half of the total delay.
A 50% speed loss is unacceptable, which led to the development of CXL as a faster interface. CXL runs at near-memory speeds, but unlike the double data rate memory bus, it can work with memories of different speeds.
By designing an SSD with a CXL interface, Samsung has opened the door to using SSDs as memory. As a result, all the work that SNIA and various software houses have put into supporting persistent memory like Optane can be used with inexpensive NAND flash-based storage. The SNIA NVM Programming Model, which is the protocol that software uses to harness the power of Intel's Optane persistent memory, now has an additional use: It will support NAND flash SSDs on the CXL interface.
Since NAND flash is slow, though, Samsung's design uses a large DRAM cache to hold as much of the SSD's data as possible. The 2 TB prototype that Samsung demonstrated at Flash Memory Summit sported a whopping 16 GB internal DRAM cache. Such large caches will probably be the norm for CXL SSDs.
The large cache pays off, though. In the benchmarks that Samsung shared at that event, the company said it was able to improve random read performance by about 1,900% over existing SSDs. Samsung claimed that the MS-SSD can present to its host as a 2 TB extension of the server's memory. The average latency of that memory depends on the hit rate of the SSD's internal cache.
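The dependence of average latency on hit rate can be sketched with the usual weighted-average model. The function name and the latency figures below are hypothetical placeholders, not Samsung's numbers, and the model assumes a miss simply costs the full flash latency.

```python
def average_latency_ns(hit_rate: float,
                       cache_latency_ns: float,
                       flash_latency_ns: float) -> float:
    """Weighted average of hit and miss latency.

    Simplifying assumption: a miss costs flash_latency_ns in total,
    with no extra charge for the failed cache lookup.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * cache_latency_ns + (1.0 - hit_rate) * flash_latency_ns


if __name__ == "__main__":
    # Hypothetical latencies, chosen only to show the shape of the curve.
    CACHE_NS = 300.0
    FLASH_NS = 50_000.0
    for rate in (0.50, 0.90, 0.99, 1.00):
        print(f"hit rate {rate:.2f}: {average_latency_ns(rate, CACHE_NS, FLASH_NS):,.0f} ns")
```

Because the miss latency is so much larger than the hit latency, even a small miss rate dominates the average, which is why such a large DRAM cache is worth its cost.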
Not all of this SSD's 2 TB needs to be mapped as memory. Samsung's device can divide into memory areas and SSD areas, serviced by either CXL.mem or CXL.io, in what Samsung calls "dual-mode" operation. This is intended to provide more flexibility.
Samsung's first-generation prototype does not support memory persistence. That takes some work, since the entire 16 GB of DRAM would need to be kept alive during a power outage, either for the duration of that outage or until its entire contents could be written to the SSD's NAND flash. Both approaches require a large reserve of stored energy in a battery or supercapacitor. The company's goal is to include persistent memory operation in its second prototype, which was scheduled for completion by the end of last year.
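A back-of-the-envelope calculation shows why the energy reserve matters. Only the 16 GB cache size comes from the prototype described above; the flush bandwidth and drive power are hypothetical values picked for the arithmetic, and the function names are illustrative.

```python
GB = 10**9  # decimal gigabyte


def flush_time_seconds(cache_bytes: float, flush_bandwidth_bytes_per_s: float) -> float:
    """Time needed to write the whole DRAM cache to NAND flash."""
    if flush_bandwidth_bytes_per_s <= 0:
        raise ValueError("flush bandwidth must be positive")
    return cache_bytes / flush_bandwidth_bytes_per_s


def hold_up_energy_joules(flush_time_s: float, drive_power_watts: float) -> float:
    """Energy the battery or supercapacitor must supply during the flush."""
    return flush_time_s * drive_power_watts


if __name__ == "__main__":
    cache_bytes = 16 * GB      # from the prototype described in the text
    flush_bandwidth = 2 * GB   # ASSUMPTION: sustained NAND write rate in bytes/s
    drive_power = 10.0         # ASSUMPTION: drive power draw while flushing, in watts

    seconds = flush_time_seconds(cache_bytes, flush_bandwidth)
    joules = hold_up_energy_joules(seconds, drive_power)
    print(f"flush time: {seconds:.1f} s, energy needed: {joules:.1f} J")
```

Under these assumed figures the drive would have to stay powered for several seconds after losing external power, far longer than a conventional SSD needs to drain a small write buffer.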
If CXL SSDs gain acceptance as persistent memories, the storage administrator's job will develop the same way as it would have if Intel's Optane had become mainstream. Software and silicon suppliers expect automatic data management software to hide any issues from the storage administrator, but there will be considerations for data security, since sensitive data may reside in both the CXL SSD and in a conventional SSD that is mapped to a slower tier of the memory-storage hierarchy.
Other CXL SSD developments
Samsung is not the only company that plans to support SSDs on the CXL interface. Kioxia openly discusses a proof-of-concept CXL-based SSD that it has designed around the company's proprietary XL-Flash chip. XL-Flash is NAND flash with a focus on speed, presumably in response to the introduction of 3D XPoint memory, which some companies initially viewed as a threat to standard NAND flash SSDs. Kioxia's CXL SSD design focuses on the speed of 64-byte transactions, particularly 64-byte random writes.
The company explained that it expects its design to perform almost as fast as DRAM and faster than Optane. It supports memory sizes two orders of magnitude larger than near-memory DRAM and 10 times larger than near-memory Optane persistent memory DIMMs. The Kioxia design uses a large pre-fetch buffer but differs from the Samsung design by adding hardware compression to help improve speed. With the data smaller, the SSD requires less bandwidth and can speed up writes. The company's target is to have its CXL SSD emulator available this quarter.
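The effect of compression on bandwidth can be illustrated in software. This sketch uses Python's standard `zlib` module purely as a stand-in; Kioxia's hardware compression scheme is not described here, and the sample payload and function names are made up for the example.

```python
import zlib


def compressed_size(payload: bytes, level: int = 6) -> int:
    """Number of bytes that would actually be moved after compression."""
    return len(zlib.compress(payload, level))


def bandwidth_saving(payload: bytes, level: int = 6) -> float:
    """Fraction of transfer bytes saved by compressing the payload first."""
    if not payload:
        raise ValueError("payload must not be empty")
    return 1.0 - compressed_size(payload, level) / len(payload)


if __name__ == "__main__":
    # Hypothetical, highly repetitive data; real workloads compress less well.
    sample = b"sensor=42;status=ok;" * 512
    print(f"original bytes:   {len(sample)}")
    print(f"compressed bytes: {compressed_size(sample)}")
    print(f"bandwidth saved:  {bandwidth_saving(sample):.1%}")
```

How much is saved depends entirely on the data. Repetitive records shrink dramatically, while already-compressed or encrypted data may not shrink at all.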
In addition, the Computer Architecture and Memory Systems Laboratory (CAMEL) at the Korea Advanced Institute of Science and Technology, in collaboration with RISC-V and OpenExpress, has been performing simulations of a CXL SSD to better understand the performance benefits it might provide. It programmed an emulator to simulate the operation of a 32 GB SSD based on Samsung's super-fast Z-NAND chip -- somewhat similar to the Kioxia XL-Flash -- and compared its operation to that of a system with large DRAM.
Performance was highly dependent on the locality of the data in the SSD; this is only natural, since high locality implies that the bulk of SSD accesses would hit the SSD's DRAM cache, rather than its NAND flash. For high locality, the CXL SSD provided data at only 2.4 times the latency of the CPU's local DRAM. However, at the low-locality end of the spectrum, the CXL SSD's latency was 84 times that of DRAM. Clearly, a CXL SSD brings more performance to an application whose large data set is highly localized.
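The link between locality and cache hits can be reproduced with a toy simulation. The sketch below models the drive's DRAM as a simple least-recently-used (LRU) page cache and compares a skewed access pattern with a uniform one. The LRU policy, the page counts and the access mix are all assumptions for illustration; they are not taken from the CAMEL emulator.

```python
import random
from collections import OrderedDict


def lru_hit_rate(accesses, cache_pages: int) -> float:
    """Replay page accesses through an LRU cache and return the hit rate."""
    cache = OrderedDict()
    hits = 0
    total = 0
    for page in accesses:
        total += 1
        if page in cache:
            hits += 1
            cache.move_to_end(page)        # mark as most recently used
        else:
            cache[page] = True
            if len(cache) > cache_pages:
                cache.popitem(last=False)  # evict the least recently used page
    return hits / total if total else 0.0


def make_accesses(rng, total_pages: int, hot_pages: int,
                  hot_fraction: float, count: int):
    """hot_fraction of accesses target the first hot_pages pages; the rest are uniform."""
    accesses = []
    for _ in range(count):
        if rng.random() < hot_fraction:
            accesses.append(rng.randrange(hot_pages))
        else:
            accesses.append(rng.randrange(total_pages))
    return accesses


if __name__ == "__main__":
    # Hypothetical sizes: 100,000 flash pages, 1,000 of which fit in the cache.
    high = make_accesses(random.Random(0), 100_000, 500, 0.95, 20_000)
    low = make_accesses(random.Random(0), 100_000, 500, 0.0, 20_000)
    print(f"high locality hit rate: {lru_hit_rate(high, 1_000):.2%}")
    print(f"low locality hit rate:  {lru_hit_rate(low, 1_000):.2%}")
```

With a small hot set, nearly every access is served from the cache. With uniform accesses, the hit rate falls toward the ratio of cache size to total data size, so most requests pay the flash latency.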
CAMEL also found that an NVMe SSD had 129 times the latency of a CXL SSD in high-locality applications. However, its latency was only about 50% higher than that of the CXL SSD in low-locality applications.
This locality sensitivity mirrors the findings of Samsung, whose CXL SSD operated at about 20 million IOPS during cache hits, with sub-microsecond latency, on 128-byte random reads.
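Those figures can be turned into a data rate with simple arithmetic. The sketch below multiplies IOPS by transfer size, and uses Little's law (requests in flight = throughput x latency) to estimate concurrency. The 1 microsecond latency is an assumed upper bound standing in for "sub-microsecond," and the function names are illustrative.

```python
def throughput_bytes_per_s(iops: float, transfer_bytes: int) -> float:
    """Data rate implied by an IOPS figure at a fixed transfer size."""
    return iops * transfer_bytes


def requests_in_flight(iops: float, latency_s: float) -> float:
    """Little's law: average outstanding requests = throughput x latency."""
    return iops * latency_s


if __name__ == "__main__":
    iops = 20_000_000       # about 20 million IOPS, from the text
    transfer_bytes = 128    # 128-byte random reads, from the text
    latency_bound_s = 1e-6  # ASSUMPTION: 1 microsecond as an upper bound

    rate = throughput_bytes_per_s(iops, transfer_bytes)
    print(f"data rate: {rate / 1e9:.2f} GB/s")
    print(f"requests in flight at the latency bound: "
          f"{requests_in_flight(iops, latency_bound_s):.0f}")
```

At 128 bytes per read, 20 million IOPS works out to 2.56 GB/s, and sustaining that rate at up to a microsecond per access implies up to about 20 reads outstanding at any moment.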