Download the presentation: Bringing NVMe/TCP Up to Speed
I'm Sagi Grimberg. I'm CTO and co-founder for Lightbits Labs, and I've been the co-author of the NVMe over TCP standard spec.
So, we'll start with a short intro. What is NVMe over TCP? NVMe over TCP is the standard transport binding that runs NVMe on top of standard TCP/IP networks. It follows the standard NVMe specification that defines the queueing interface and the multi-queue interface, and it runs just on top of TCP/IP sockets. It has the standard NVMe command set, it just happens to be encapsulated in what we call NVMe/TCP PDUs that map onto the TCP stream. So, in the diagram here, we have basically the NVMe architecture, the core architecture that defines the admin, I/O and other command sets.
01:14 SG: Below that we have NVMe over Fabrics that defines capsules, properties, discovery. And NVMe over TCP basically defines extra features and messaging, and also the transport mapping to the underlying fabric itself, which in our case is TCP/IP.
So how is a queue processed in NVMe over TCP, or how is it defined in NVMe over TCP? Basically, each queue maps to a bi-directional TCP connection, and command and data transfers are usually processed by a dedicated context, whether that is in software or somehow in hardware. So, in the diagram here, on the left-hand side, we have the host that has a queuing interface to the NVMe transport itself, with a submission queue and a completion queue. All the submissions and completions are handled in what we call an NVMe/TCP I/O thread, or some I/O context that is triggered either from the host that's issuing I/O or from the network, usually completing I/O or receiving data.
02:24 SG: The same picture happens on the right-hand side with the controller, and basically, these are the contexts that are in charge of transferring data between the host and the controller. So each of these queues is usually mapped to a dedicated CPU, but not necessarily, it could be more, it could be less actually, but the point is that there is no controller-wide serialization, so no queue depends on state shared with other queues, which makes it extremely parallel. And the diagram here is a standard diagram that's been shown before about the set of queues: you have the admin queue between the host and the controller, and then a set of I/O queues, queue pairs which are submission and completion queues. In NVMe/TCP, basically each of these queues maps to a bi-directional TCP connection. So, if we sort of look at the latency contributions, we have a number of those that could creep in. First of all, serialization, but in NVMe/TCP it's pretty lightweight, it's on a per-queue basis, so it scales fairly well.
03:43 SG: Context switching. So, we have two at a minimum contributed by the driver itself. Memory copy: usually on TX, we are able to do zero copy as a kernel-level driver. On RX, however, we do a memory copy; it's not a huge factor, but under very, very high load, it can contribute additional latency. Interrupts -- NIC interrupts -- are definitely impactful, they consume CPU and impact the scalability of how much a single queue can achieve. We have LRO and GRO, or adaptive interrupt moderation, which can mitigate that a bit, but then latency could be less consistent. Then we have the socket overhead; it exists but it's really not huge, it's pretty fast given that sockets are pretty uncontended in a multi-queue interface, but with small I/O, it could make an impact. Affinitization between interrupts, applications and I/O threads can definitely have an impact if not configured correctly, and we'll touch more on that. Cache pollution, obviously, resulting from memory copy: we have some, but it's not that excessive on modern CPU cores that have sufficiently large caches.
05:15 SG: And then we have head-of-line blocking, which could be apparent in mixed workloads, and we'll touch on how we address that. So, let's take an example of host direct data flow. It starts when the user basically issues a file or block I/O to a device or a file descriptor, and then we ignore a lot of layers in the stack, but eventually it comes down to the queuing callback that sits in the driver itself; we're talking about Linux, of course. That callback is the NVMe/TCP queue request, and it prepares an NVMe/TCP PDU and places it in an internal queue for further processing. Then we have the I/O work, that's the I/O context or the I/O thread; it picks up this I/O and starts processing it, actually sending it to the controller. Then the controller processes it, completes the I/O and sends back data, if it's a read, and the completion to the host.
06:21 SG: Once the host receives it, the first one to see it is the NIC, which generates an interrupt saying that it has additional datagrams that should be processed by the host. Then NAPI is triggered, which basically gets all these datagrams, inserts them into the TCP/IP stack and processes them. Then the driver, as the TCP consumer, gets a data-ready callback, at which point it triggers the context to basically process the data and the completion, and actually complete the I/O. And the next step is basically that the user context completes the I/O. If it's an async interface, it will probably get it through getevents or a signal. Now, the stuff to notice here is, first of all, we have a context switch between the part where we're preparing the I/O and the part where the scheduled I/O work picks up the I/O and processes it, then we have a soft IRQ between the interrupt and when NAPI is triggered, and then we have an additional context switch once the data-ready callback is called and the I/O work context actually goes ahead and processes the I/O.
07:36 SG: So, these are the three points that we can help eliminate. Now, the first optimization that was done is to address mixed workloads. As we said, head-of-line blocking could be apparent in the case where a large write coming from the host needs to be sent on a queue, so the host finds itself sending a large message, because NVMe/TCP defines the messaging on top of the TCP stream, and then behind it there is a small read that cannot make forward progress until this large write actually completes, if it happens to be on the same queue.
08:16 SG: So, this problem is apparent, and the mitigation starts with the separation into different queue maps that the Linux block layer allows today, which basically defines that different I/O types can use different queues. So we have three different queue maps in Linux. One of them is the default; basically it will host the default set of queues, and usually, if only the default queue map is defined, all I/Os will use that queue map.
Then we have a dedicated queue map just for read I/O, such that reads will have a dedicated set of queues associated with them, and the rest will go to the default queue map. And then the third one is the poll queue map: usually, when the application signals to the kernel through a flag, a high-priority flag, that it's interested in a high-priority I/O command, that I/O will be directed to the poll queue map, and this poll queue map is actually designed to host latency-sensitive I/Os.
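The steering rule described here can be sketched in a few lines. This is a simplified model, not the Linux block-layer code: the names `QueueMap` and `select_queue_map` are hypothetical, and the fallback to the default map when a map has no queues is an assumption drawn from the statement that all I/O uses the default map when only that map is defined.

```python
from enum import Enum


class QueueMap(Enum):
    DEFAULT = "default"
    READ = "read"
    POLL = "poll"


def select_queue_map(is_read, high_priority, nr_read_queues, nr_poll_queues):
    """Pick the queue map for one I/O (hypothetical helper).

    High-priority I/O goes to the poll map, reads go to the read map,
    and everything else (or any type whose map has no queues) goes to
    the default map.
    """
    if high_priority and nr_poll_queues > 0:
        return QueueMap.POLL
    if is_read and nr_read_queues > 0:
        return QueueMap.READ
    return QueueMap.DEFAULT


if __name__ == "__main__":
    # A large write and a small read no longer share a queue map.
    print(select_queue_map(is_read=False, high_priority=False,
                           nr_read_queues=4, nr_poll_queues=2))
    print(select_queue_map(is_read=True, high_priority=False,
                           nr_read_queues=4, nr_poll_queues=2))
```

With read queues configured, a write and a read always resolve to different maps, which is what removes the head-of-line blocking between them.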
09:38 SG: Now, what we did with that is that once we have mixed workloads, with different readers and writers that are all hosted in the same application or even in different applications on the same host, these reads and writes are now steered through different queues, so they will never see head-of-line blocking between reads and writes. So, basically what we did is allocate additional queues in NVMe/TCP depending on the user request, and then we plugged it into the block layer and ran a benchmark to see what the improvement is. So, the test basically had 16 readers, each of them doing synchronous reads, just one by one, and in the background we had one sort of unbound writer that issues a high burst of large one-megabyte writes. Now, this writer is unbound, meaning that this thread, issuing one-megabyte writes at a high queue depth, is going to rotate between the different CPU cores and is going to end up sharing queues with the rest of these 16 readers.
11:10 SG: So with the baseline, we see the read QoS: the read IOPS are somewhere around 80K IOPS with an average latency of 396 microseconds, and the tail latency can go up to 14 milliseconds, which is obviously not great. But now that we separated reads and writes into different queue maps, the IOPS have more than doubled to 171K IOPS, the average read latency is less than half at 181 microseconds, and the four-nines read tail latency has improved by an order of magnitude. So that's important to understand.
Next, we have affinity optimizations. So now that we have different queue maps, the question is how they affinitize to I/O threads and how those affinitize to applications. So basically, every NVMe/TCP queue in Linux has an I/O CPU that defines where the I/O context runs. What we did is use a separate alignment for the different queue maps, each starting at zero, so they're accounted individually and not combined all together, which achieves a very good alignment between . . . That's what you want, between the application and the I/O thread, regardless of the queue map that the I/O ends up landing on.
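A rough way to picture the per-map alignment is to number the queues inside each map from zero and wrap around the available CPUs. The sketch below is illustrative only: the function names are hypothetical, the modulo wrap is an assumption, and the "combined" variant is one reading of what counting all maps together would look like.

```python
MAP_ORDER = ("default", "read", "poll")


def io_cpu_per_map(queue_counts, nr_cpus):
    """Assign an I/O CPU to each queue, restarting at CPU 0 for every map."""
    assignment = {}
    for name in MAP_ORDER:
        for index in range(queue_counts.get(name, 0)):
            assignment[(name, index)] = index % nr_cpus
    return assignment


def io_cpu_combined(queue_counts, nr_cpus):
    """Comparison case (assumed): one running counter across all maps."""
    assignment = {}
    counter = 0
    for name in MAP_ORDER:
        for index in range(queue_counts.get(name, 0)):
            assignment[(name, index)] = counter % nr_cpus
            counter += 1
    return assignment


if __name__ == "__main__":
    counts = {"default": 2, "read": 2, "poll": 0}
    print(io_cpu_per_map(counts, nr_cpus=4))
    print(io_cpu_combined(counts, nr_cpus=4))
```

With two default queues and two read queues on four CPUs, the per-map scheme puts the first read queue on CPU 0, next to the first default queue, so an application running on CPU 0 finds its I/O context on the same CPU whichever map its I/O lands on. The combined counter would place that read queue on CPU 2 instead.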
12:50 SG: So we actually did a micro-benchmark to test the improvement, and in the benchmark here we tested queue depth 1, the canonical 4K read latency, and then we used, again, a single-threaded application, a single-threaded workload with a queue depth of 32, again 4K reads. So, at queue depth 1, we got a 10% improvement, which is great, and at queue depth 32 we actually achieved more, going from 179K IOPS to 231K. And that happens out of the box, so that was definitely a nice improvement.
Then, we'll focus on context switching and what has been done to mitigate that. So the goal is on the TX path. Remember that a couple of slides before, we showed that once the driver's queuing callback, NVMe/TCP queue request, gets an I/O, it actually posts it on its internal queue and then it triggers a context switch to the I/O context that owns the TCP stream. So, we want to eliminate that.
14:10 SG: So basically what we do here is that in the callback itself, we actually go ahead and send the I/O directly from that context, eliminating the context switch to the I/O thread. So obviously, now that we have two contexts that may access the TCP connection at once, they need to be serialized. So we need to add serialization between them, which was in the form of a mutex, and also the network send is not guaranteed to be atomic, so we need to change the locking scheme of the queuing interface in the block layer, changing it from RCU to SRCU.
But given that we now have two contexts that may access the stream together, we may not want them to contend. So we only do this optimization if the queue is empty, meaning that the context switch is really unneeded, and even if the queue is empty, which is an advisory indication, only if the mapped CPU matches the running CPU, meaning that the I/O context and the context that the submission is running on are on the same CPU core. That means that if we contend with someone, it's with ourselves, so we're likely to be able to grab the lock without being scheduled out sleeping on it.
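The conditions for sending directly from the submitting context can be modeled as below. This is a user-space sketch with hypothetical names (`SendQueue`, `queue_request`, `io_thread_run`), not the driver code; in particular, taking the mutex with a non-blocking try-lock and counting a deferred wakeup in place of scheduling the I/O thread are assumptions made for illustration.

```python
import threading


class SendQueue:
    """Toy model of one NVMe/TCP queue with an inline-send fast path."""

    def __init__(self, io_cpu):
        self.io_cpu = io_cpu              # CPU the I/O context is mapped to
        self.pending = []                 # requests waiting to be sent
        self.sent = []                    # stands in for the TCP stream
        self.deferred_wakeups = 0         # stands in for scheduling the I/O thread
        self.send_mutex = threading.Lock()

    def queue_request(self, request, running_cpu):
        was_empty = not self.pending      # advisory only, may race in real life
        self.pending.append(request)
        inline = (was_empty
                  and running_cpu == self.io_cpu
                  and self.send_mutex.acquire(blocking=False))
        if inline:
            try:
                self._drain()             # send from the submitter, no context switch
            finally:
                self.send_mutex.release()
        else:
            self.deferred_wakeups += 1    # fall back to the I/O thread

    def io_thread_run(self):
        with self.send_mutex:             # serialize against inline senders
            self._drain()

    def _drain(self):
        while self.pending:
            self.sent.append(self.pending.pop(0))


if __name__ == "__main__":
    q = SendQueue(io_cpu=2)
    q.queue_request("read-a", running_cpu=2)   # empty queue, same CPU: sent inline
    q.queue_request("read-b", running_cpu=1)   # different CPU: deferred
    q.io_thread_run()
    print(q.sent, q.deferred_wakeups)
```

Because the fast path is taken only when the submitter runs on the queue's own I/O CPU, any contention on the mutex is with work that would have run on that same core anyway.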
15:40 SG: So, we did that, and a second optimization is adding socket priority, which basically enables symmetric queuing with ADQ technology, so that egress traffic is steered to the dedicated ADQ NIC queue set.
Now, moving to the RX path. As we mentioned, once we get an interrupt, we don't process the I/O directly from the interrupt itself; we actually fire the I/O context, a worker thread, to process the incoming data and completions. So, consider the case where the application is polling; we mentioned the poll queue set and we mentioned the high-priority flag that is supported in Linux.
Now we have applications that are able to poll either via direct I/O or via io_uring. So, we basically added an NVMe/TCP poll callback and plugged it into the multi-queue block poll interface. That polling callback just leverages the busy-poll functionality in the networking stack, and then if the application is polling, we're able to identify that when we get the data-ready callback, and we don't schedule an extra context; we don't schedule the I/O work thread because we know that we're going to process the incoming data directly from the poll.
17:27 SG: So basically, with modern NICs, interrupt moderation works relatively well to reduce some of the interrupts, but with ADQ, where interrupts are mitigated more aggressively, it works very well. So we did . . .
Once we added these two eliminations of context switches, again, we ran the same benchmark. Now the baseline was with the I/O optimization applied, and we achieved an additional 10% on the queue depth 1 canonical latency for 4K reads, and we also achieved, I think, somewhere between 5 and 10% of improvement at queue depth 32, from 230K IOPS to 247K IOPS on a single thread. Well, that's good. Moving forward, of course, the latest E800 series, the latest NIC from Intel, introduced some ADQ improvements.
18:37 SG: So, what are the ADQ improvements that are available? In general, first of all, it's traffic isolation: you can associate a specific queue set and configure the NIC so it can steer a specific application workflow to a dedicated queue set that is not shared with the other applications on the host. So inbound, the configuration is basically to define the queue set configuration, apply TC flower filtering, and then add queue selection through RSS or Flow Director. Outbound, it is setting the socket priority, which we mentioned that we sort of turn on in the NVMe/TCP controller via a module parameter, and then also configuring transmit packet steering, XPS, so that it matches the symmetric queuing.
19:36 SG: The value is that we have no noisy traffic between neighboring workloads and an opportunity to customize network parameters for a specific workflow, and that is something that we leverage in NVMe/TCP. Other improvements: obviously, we're minimizing context switching and interrupt overhead by applying polling. So, once the application is polling friendly, we basically now have dedicated queues that are configured with ADQ, and they act as our polling queues in NVMe/TCP, that's the queue set. What we are doing is draining the network completions within the application context as it's polling, and we process completions directly in the application context. We handle the request and send the response, as I said, directly, eliminating the context switch between the I/O context and the application.
We also have the ability to group multiple queues together in a single NIC hardware queue to basically streamline sharing of the NIC hardware queues without incurring needless context switches. So if we remind ourselves that in the driver we had two context switches, one in the TX path and one in the RX path, and we had a soft interrupt driven from the NIC interrupt, now we have eliminated all three of them with the enhancements and ADQ together. So, the value is reduced CPU utilization, and lower latency and latency jitter overall.
21:25 SG: So here are some of the measurements of NVMe/TCP with ADQ enabled or disabled, and the latency at queue depth 1 achieves . . . That's, of course, I forgot to mention, the added latency on top of the media itself, and the transport overhead right now with ADQ is 17.7; it could be less depending on the CPU class also. And at queue depth 32, we actually get a massive improvement of almost 30%, if not more, from 245K IOPS to 341K IOPS on a single thread. Now if you multiply this thread and apply it to the overall throughput that you can get from the host, it basically scales pretty linearly, to a point where you can achieve host saturation with just over a handful of cores, or 10 cores.
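For reference, the relative gains behind the single-thread queue depth 32 figures quoted in the talk can be worked out directly. The helper name `percent_gain` and the step labels are only illustrative; the IOPS values are the ones stated above.

```python
def percent_gain(before_kiops, after_kiops):
    """Relative throughput gain in percent."""
    return (after_kiops - before_kiops) / before_kiops * 100.0


# Single thread, queue depth 32, 4K reads, in thousands of IOPS (from the talk).
steps = [
    ("affinity alignment", 179, 231),
    ("context-switch elimination", 230, 247),
    ("ADQ enabled", 245, 341),
]

if __name__ == "__main__":
    for name, before, after in steps:
        gain = percent_gain(before, after)
        print(f"{name}: {before}K -> {after}K IOPS, +{gain:.1f}%")
```

These work out to roughly 29%, 7% and 39% respectively, so the ADQ step is indeed somewhat more than the "almost 30%" mentioned.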
22:37 SG: Okay. Now we'll move on to the final optimizations, which are almost separate from all the latency optimizations for polling workloads. We move to latency optimizations that are batching related, which are more for applications that are less concerned about canonical latency and high-priority I/O, and more about having a bandwidth-oriented workload but still achieving good latency in the presence of high bandwidth. So, the optimizations are batch oriented. What we want is to leverage information that the block layer can provide to the driver about a queue building up. Because from the driver's perspective, you don't have a real view of all the queue that's building up in the block layer.
23:35 SG: So the first thing that we want to use, the first thing that is available, is that the block layer can basically indicate to the driver whether the request that's being queued is actually the last one or not. What we're doing then is modifying the queue and how it handles its internal queuing. We basically make that more lightweight, moving it from a list protected by a lock to a lockless list. And then, by that, we basically . . . In the interface between the queuing and the I/O context that is pulling requests from that queue, we make it a batch-oriented push and pull, so we get better utilization, less frequent atomic operations and more batch processing.
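The batch-oriented push and pull can be pictured with a small model. This is only a sketch with hypothetical names (`BatchQueue`, `push`, `pull_batch`): a Python deque stands in for the lockless list, and waking the I/O context only on the request flagged as last is an assumption about how the block-layer hint is used.

```python
from collections import deque


class BatchQueue:
    """Toy model of a push-at-head list drained in one batch."""

    def __init__(self):
        self._incoming = deque()   # stands in for the lockless list, newest first
        self.send_list = []        # private to the I/O context, FIFO order

    def push(self, request, last):
        """Submission side: add one request; return True if the I/O
        context should be woken (only for the last request of a batch)."""
        self._incoming.appendleft(request)
        return last

    def pull_batch(self):
        """I/O context side: detach everything at once and restore FIFO order."""
        batch = list(self._incoming)   # one grab instead of one lock per request
        self._incoming.clear()
        batch.reverse()                # newest-first back to submission order
        self.send_list.extend(batch)
        return batch


if __name__ == "__main__":
    q = BatchQueue()
    wake = [q.push(r, last=(r == "c")) for r in ("a", "b", "c")]
    print(wake, q.pull_batch())
```

The consumer touches the shared structure once per batch rather than once per request, which is where the saving in atomic operations comes from.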
24:43 SG: With all this information, we basically now have a better view of the queue that's building up, and we made the processing around it more efficient. Then we take this information and hint the network stack, through message flags on top of our send operations, to have the network act more efficiently. So, for example, if we know that we're going to send a piece of data to the network and we know that we have a queue building up behind us, we'll indicate it with MSG_MORE. So, we'll hint the stack that it needs to wait for more data and not necessarily go ahead and send it right away; it has an opportunity to batch. But if this is the last piece of data in the queue that we're going to send and we don't know of anything that's building up behind it, then we actually turn on MSG_EOR, end of record, to indicate to the stack that whatever it has built up, it should go ahead and send now.
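The flag choice described here reduces to a small decision. The sketch below is illustrative rather than the driver logic: `send_flags` is a hypothetical helper, and the numeric fallbacks are the Linux values for the two flags, used only if the local Python build does not expose them.

```python
import socket

# Fallbacks are the Linux values; labeled as an assumption for other platforms.
MSG_MORE = getattr(socket, "MSG_MORE", 0x8000)
MSG_EOR = getattr(socket, "MSG_EOR", 0x80)


def send_flags(is_last_piece, queue_building_up):
    """Choose the hint to pass along with one send operation.

    If more data is known to follow, ask the stack to hold on and batch;
    if this is the last piece and nothing is queued behind it, ask the
    stack to push out whatever it has accumulated.
    """
    if is_last_piece and not queue_building_up:
        return MSG_EOR
    return MSG_MORE


if __name__ == "__main__":
    print(send_flags(is_last_piece=True, queue_building_up=True) == MSG_MORE)
    print(send_flags(is_last_piece=True, queue_building_up=False) == MSG_EOR)
```

On a connected socket the result would be passed as the flags argument of a send call, for example `sock.send(payload, send_flags(True, False))`.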
25:53 SG: And the last one is work by a team at Cornell University that built an I/O scheduler that is optimized for this sort of TCP stream batching workload, optimizing for bandwidth and latency. You can find that in the i10 paper that is available through a link, or you can just search for "i10 Cornell."
This I/O scheduler is on its way upstream and will be submitted soon. Here are a couple of benchmark results from the Cornell team. On the left-hand side, this is just the standard, upstream NVMe/TCP without the i10 I/O scheduler and the optimization. We see the 4K read IOPS that it can achieve; we see here that latency is around 100 microseconds, and once we get to 145K or just under 150K IOPS, latency just starts spiking up because we don't have any more efficiency and latency just starts accumulating; we sort of hit a wall. And with i10, there is some sacrifice, a small sacrifice in latency at low queue depth, but at higher queue depths, we can see that we can push over 200K IOPS, or almost 225K IOPS, until we hit this wall where latency just starts increasing. So, what it shows us is that there is, first of all, higher throughput, and that's for a single thread. Higher throughput and higher IOPS, and also the latency stays pretty constant.
28:01 SG: On the right-hand side, we can just see the throughput, we have 16 cores against the RAM device. We have 16 cores and we see that overall, the throughput in all request sizes until it almost saturates the NIC, is improved with i10, and that should suggest to us that basically batching and batch processing is helpful specifically when we talk about a stream-based application, which NVMe/TCP actually is. Okay. So, we're on a recording, so we don't have any questions, so thank you for listening.