CXL: A Basic Tutorial

Here is a brief introduction to Compute Express Link (CXL), a new high-speed CPU interconnect that enables efficient, high-performance communication between the CPU and platform enhancements and workload accelerators.

Hugh Curley, Guest Contributor

Published: 23 Feb 2021

00:21 Hugh Curley: Welcome to this 15-minute introduction to CXL, the new interface that runs on PCIe 5 or later. It is designed for high-end, ultra-high-bandwidth and ultra-low-latency demands. It is, without a doubt, the interface that belongs at the Flash Memory Summit. I am Hugh Curley, consultant with KnowledgeTek, and CXL is Compute Express Link.

00:49 HC: CXL moves shared system memory and cache to be near the distributed processors that will be using them, thus reducing the roadblocks of sharing a memory bus and reducing the time for memory accesses. I remember when a 1.8 microsecond memory access was considered good. Here, the engineers are shaving nanoseconds off the time to access memory.

You might say that graphics processing units (GPUs) and NVMe devices share system memory with the processor, and you are correct. But with the GPU, the sharing is one way only, from the host to a single graphics unit. With NVMe, both the host and the device can access the memory, but there is only a single instance of a given memory location. It may be on the host or on the NVMe device, with both the host and device able to access it. The memory of both the GPU and NVMe devices is controlled by the host memory manager, so no conflicts or coherency problems develop, and neither the GPU nor the NVMe devices are peers of the host.

02:03 HC: With CXL, multiple peer processors can be reading and updating any given memory location or cache location at the same time. To manage coherency, if any processor writes to a memory location, all other copies of that location are marked as invalid. Processors accessing that memory location must refetch the data before acting on it. This requires a lot of communication, and that communication is called CXL.

Why add this extra complexity and communication? CXL allows the system designer to move the memory and cache physically closer to the processor that is using it to reduce latency. When you add remote processors or processing devices, each device brings the memory and cache it needs. This allows the system owners to balance performance versus cost. System administrators can add more system memory by adding memory expansion units. There are additional requirements we must address.

03:13 HC: The reasons for CXL are high bandwidth and low latency. It must be scalable to address applications with different demands. But the elephant in this room is coherency. We are moving to new ground and it must be designed correctly. CXL is probably a bad choice for block access or where there is only a single instance of memory addresses. It is also probably a bad choice for mesh architectures with multiple accelerators, all working on the same problem. CCIX would probably be better. CXL coherency is our issue.

So how does it work? I mentioned it in one sentence earlier, which probably caused more questions than it answered. This single page may do the same, which is good. It means you'll be ready for the in-depth presentation and discussions later in this conference. Data is copied from the host processor memory to the device's memory, and there may be multiple devices with the same memory addresses.

04:19 HC: When a device updates a memory location, that location is marked as invalid in all other memories or caches. When any device wants to read or write a memory location that is either invalid or not in its memory, it must read it from main memory. Memory tables and coherency logic are in the host. This keeps down the cost and complexity for device designers and manufacturers.
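The invalidate-on-write behavior described here can be sketched as a toy model. This is only an illustration under simplifying assumptions, not the CXL protocol itself: the `Host` and `Device` classes and their method names are hypothetical, and real CXL tracks coherency state per cache line in hardware rather than in dictionaries.

```python
class Host:
    """Toy host: owns main memory and the coherency table (hypothetical model)."""

    def __init__(self):
        self.main_memory = {}  # address -> value
        self.holders = {}      # address -> set of devices holding a copy

    def read(self, device, addr):
        # Record that this device now holds a copy of the location.
        self.holders.setdefault(addr, set()).add(device)
        return self.main_memory.get(addr, 0)

    def write(self, writer, addr, value):
        self.main_memory[addr] = value
        # Mark every other copy of this location as invalid.
        for dev in self.holders.get(addr, set()):
            if dev is not writer:
                dev.invalidate(addr)
        self.holders[addr] = {writer}


class Device:
    """Toy device: keeps local copies and asks the host for anything missing."""

    def __init__(self, name, host):
        self.name = name
        self.host = host
        self.cache = {}  # address -> value; a missing key means invalid or never fetched

    def load(self, addr):
        if addr not in self.cache:
            # Invalid or absent: refetch from main memory through the host.
            self.cache[addr] = self.host.read(self, addr)
        return self.cache[addr]

    def store(self, addr, value):
        self.cache[addr] = value
        self.host.write(self, addr, value)

    def invalidate(self, addr):
        self.cache.pop(addr, None)
```

In this sketch, if devices `a` and `b` both load the same address and `a` then stores a new value, `b` loses its copy and its next `load` refetches the updated value through the host. Keeping the table in `Host` mirrors the point that the coherency logic lives on the host side.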

There are only a few host development companies, but many device development companies, so this asymmetric approach should reduce incompatibilities. We mentioned the CPU and PCIe before. CXL has three protocols which we will address:

CXL.mem: Used to maintain coherency among shared memories
CXL.cache: Used to maintain coherency among shared caches. This is the most complex of the three.
CXL.io: Used for administrator functions of discovery, etcetera. It is basically PCIe 5 with a non-posted write transaction added.

05:32 HC: There are also three types of CXL topologies.

Type 1 is for cache, but no shared memories. Of course, CXL.io is used for setup and error handling. This slide shows some usages.

Type 2 is for shared cache and shared memories. "HBM" is high bandwidth memory.

Type 3 is for shared memory, but no shared cache, such as for memory expansion units. The Type 3 does not have externally visible processors or cache in the device.
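As a compact way to keep the three protocols and three types straight, the mapping can be written down as a small lookup table. This is a sketch, not a structure from the specification: the `Protocol` enum, `DEVICE_TYPES` table and `uses` helper are hypothetical names. The talk only mentions CXL.io explicitly for Type 1; listing it for all three types is an assumption based on its role in discovery and setup.

```python
from enum import Flag, auto


class Protocol(Flag):
    IO = auto()     # CXL.io: discovery, setup, error handling
    CACHE = auto()  # CXL.cache: coherency among shared caches
    MEM = auto()    # CXL.mem: coherency among shared memories


# Which protocols each device type uses (CXL.io for every type is an assumption).
DEVICE_TYPES = {
    1: Protocol.IO | Protocol.CACHE,                 # cache, no shared memory
    2: Protocol.IO | Protocol.CACHE | Protocol.MEM,  # shared cache and shared memory
    3: Protocol.IO | Protocol.MEM,                   # shared memory, no shared cache
}


def uses(device_type, protocol):
    """Return True if the given device type uses the given protocol."""
    return protocol in DEVICE_TYPES[device_type]
```

For example, `uses(3, Protocol.CACHE)` is false, which matches the description of a memory expansion unit with no externally visible cache.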

One subject we must address is who manages the memory, the host or the device, and what management means. The easy answer is that device-managed memory is not accessible to the host. Therefore, CXL does not see it or manage it. Remember that the management logic and tables for CXL are in the host. Host-managed memory is memory on the host or device that the host and CXL can monitor and manage.

Two other concepts are host bias coherency model and device bias coherency model. A system can use either or be switchable between them, perhaps using host bias for transferring commands and status and switching to device bias for transferring data.

07:00 HC: Notice the red line in this picture, showing that the device must go to and through the host to address host-managed memory on the device. This is the host bias coherency model, and it is not a very efficient way for the device to access data. The next picture shows the device bias coherency model. Notice how efficiently the device can access the host-managed memory on the device now.

The purple line is to do the bias flip, if the system supports both host bias and device bias coherency models. Notice also that the host in these pictures has a coherency bridge and the device has a DCOH, which is a device coherency bridge: a simplified home agent and coherency bridge that are on the device instead of the host.
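The difference between the two bias models comes down to which path the device takes to reach host-managed memory on the device. A rough sketch follows, with loud caveats: `Bias`, `DeviceMemoryRegion`, `flip` and `device_access_path` are hypothetical names, the paths are simplified to a list of hops, and the real bias-flip handshake between host and device is not modeled.

```python
from enum import Enum


class Bias(Enum):
    HOST = "host bias"
    DEVICE = "device bias"


class DeviceMemoryRegion:
    """Host-managed memory that physically lives on the device (toy model)."""

    def __init__(self, bias=Bias.HOST):
        self.bias = bias

    def flip(self, new_bias):
        # Stand-in for the bias flip; the actual exchange is not modeled here.
        self.bias = new_bias

    def device_access_path(self):
        """Hops the device takes to reach this region under the current bias."""
        if self.bias is Bias.HOST:
            # Host bias: the device must go to and through the host first.
            return ["device", "host", "device memory"]
        # Device bias: the device reaches its local memory without the host round trip.
        return ["device", "device memory"]
```

A system that supports both models might keep a region in host bias while commands and status are exchanged, call `flip(Bias.DEVICE)` before a bulk data transfer, and flip back afterward.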

07:54 HC: What is CXL's relationship with PCIe 5? The CXL physical connector is the same as PCIe and can run in a PCIe 5 slot. If either the host or device is PCIe only, they operate as PCIe. If both are CXL, they will negotiate CXL and operate with that protocol. In PCIe 5, the training sequences TS1 and TS2 have an additional field called "alternate protocol." The first, and so far only, alternate protocol defined is CXL. If both sides claim to support CXL, there are other fields negotiated between the host and device that define CXL parameters. The physical cables and connectors are PCIe 5. The logical sub-block in the physical layer is called the Flex Bus, and it can operate as PCIe or CXL.
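The outcome of that negotiation reduces to a simple rule: the link runs CXL only if both ends advertise it, and otherwise falls back to PCIe. A minimal sketch, assuming a hypothetical `Port` record with a single `supports_cxl` flag; the actual TS1/TS2 field layout and the additional CXL parameters negotiated afterward are not modeled.

```python
from dataclasses import dataclass


@dataclass
class Port:
    name: str
    supports_cxl: bool  # stands in for what the port advertises during link training


def negotiate(host: Port, device: Port) -> str:
    """Return the protocol the link operates in after training (simplified)."""
    if host.supports_cxl and device.supports_cxl:
        return "CXL"
    # If either side is PCIe only, both operate as PCIe.
    return "PCIe"
```

So a CXL-capable host paired with a plain PCIe card yields `"PCIe"`, and only a CXL host with a CXL device yields `"CXL"`.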

08:57 HC: A new block called ARB/MUX is added to arbitrate which request gets serviced first. CXL is defined as PCIe Gen 5, x16 lanes wide. If you drop down to eight lanes or four lanes at Gen 5, the specification calls it "bifurcation." If you drop below Gen 5 x4 in either speed or lanes, it is called "degraded mode."

Now, the link layer and flits. All CXL transfers are 528-bit packets, made up of four slots of 16 bytes each and two bytes of CRC. Slot zero contains the header; the other three are called generic slots.
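The flit size and the three link modes can be checked with a few lines of arithmetic. This sketch follows the description in the talk rather than the wording of the specification: the constant names and the `link_mode` helper are hypothetical, and treating generations later than 5 the same as Gen 5 is an assumption.

```python
SLOTS_PER_FLIT = 4
SLOT_BYTES = 16
CRC_BYTES = 2

# 4 slots x 16 bytes + 2 bytes of CRC = 66 bytes = 528 bits
FLIT_BYTES = SLOTS_PER_FLIT * SLOT_BYTES + CRC_BYTES
FLIT_BITS = FLIT_BYTES * 8


def link_mode(gen: int, lanes: int) -> str:
    """Classify a link as described in the talk (simplified, hypothetical helper)."""
    if gen >= 5 and lanes >= 16:
        return "normal"
    if gen >= 5 and lanes in (4, 8):
        return "bifurcated"
    # Anything slower than Gen 5 or narrower than x4.
    return "degraded"
```

The arithmetic confirms that four 16-byte slots plus a 2-byte CRC give the 528-bit size quoted in the talk.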

Let's see what they contain. The flit header contains information such as the type: is this a protocol flit transferring data, commands or status, or is it a control flit? ACK means the sender is acknowledging eight flits that it has received from the other device.

10:12 HC: BE -- byte enable -- is this flit accessing memory or cache on a byte level or on a slot level? Slot: what kind of information is in slots zero, one, two and three? There are six format types for header slots, H0 through H5, and seven format types for generic slots, G0 through G6. The slot encoding identifies whether the contents are, for example, cache requests, cache responses, memory requests, memory headers or data. CRD -- credit -- how many credits are being sent? Each credit granted allows one transfer regardless of size. We will see the byte enable on the next slide.
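The credit rule, one transfer per credit regardless of size, is ordinary credit-based flow control. A minimal sketch of the sending side, assuming a hypothetical `CreditedSender` class; how credits are encoded in the CRD field and how they are split across message classes is not modeled.

```python
class CreditedSender:
    """Toy sender that may only transmit while it holds credits."""

    def __init__(self):
        self.credits = 0
        self.sent = []

    def grant(self, count):
        # The receiver grants credits (carried in the CRD field in the talk's description).
        self.credits += count

    def send(self, payload):
        """Send one transfer if a credit is available; return whether it was sent."""
        if self.credits == 0:
            return False
        # One credit is consumed per transfer, regardless of payload size.
        self.credits -= 1
        self.sent.append(payload)
        return True
```

With two credits granted, a 1-byte transfer and a 64-byte transfer both succeed, and a third attempt is refused until the receiver grants more.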

11:04 HC: This slide shows two formats of data flits. The one on the left is used for byte updates, and the one on the right is used for updating 16 bytes of information. We covered the why and how of CXL; types one, two and three; the three protocols of CXL.mem, CXL.cache and CXL.io; host bias coherency and device bias coherency; host-managed memory and device-managed memory; and the PCIe alternate protocol, the normal, bifurcated and degraded modes, and flits. That is a lot of information. As this is recorded, you can go back and review the entire lesson or any specific part of it. I hope this has prepared you for a very beneficial Flash Memory Summit. This chart shows a comparison of some new interfaces. Thank you.
