E-Handbook: How to solve storage latency issues when deploying NVMe-oF -- Article 3 of 4



NVMe-oF performance monitoring best practices that work

Don't let your storage network infrastructure get in the way of delivering NVMe-oF's low-latency benefits. See how these best practices can catch problems before they affect performance.

NVMe promises to reduce the internal communication latency between storage devices and processors to less than 100 microseconds. And NVMe-oF should bring these same low latencies to shared storage.

As a result, NVMe-oF has an opportunity to eliminate DAS, which has been regaining popularity because of the low latency demands of AI, machine learning and big data analytics. The challenge for IT pros is to put NVMe-oF performance monitoring in place to ensure the network configuration delivers NVMe-oF's low latency.

Why NVMe-oF performance monitoring is critical

In the past, the storage network was the fastest set of components within the storage infrastructure. The applications, storage systems and storage devices added far more latency than the network's switches and adapters, so a misconfigured network port, network adapter or substandard cable would often go undetected. In most cases, the only motivation to upgrade the storage network to a higher bandwidth was that the faster speed cost the same as -- or less than -- the lower one.

Now, NVMe-oF-connected storage systems with NVMe storage media inside are paired with enterprises running more AI, machine learning and big data analytics applications. As a result, the core of the network is under pressure to keep pace. Any misconfiguration makes the network the bottleneck that slows down both the storage hardware and these advanced applications. Detecting problems within the network before they affect performance is critical.

Another reason NVMe-oF performance monitoring is so important is the incredibly high expectations of application owners. They expect applications to perform at the levels promised by the storage system. In most cases, installing a faster, higher-bandwidth, lower-latency storage system and network improves application performance, but perhaps not enough to meet those expectations. Unlike in the past, the application itself is now usually to blame. Still, because of that history, IT infrastructure personnel must prove the network and the storage systems are configured correctly to deliver the promised performance. In other words, they must prove their innocence.

NVMe-oF performance monitoring best practices

How to monitor low-latency networks

How does IT configure the storage infrastructure correctly from the start, stay on top of changes and prove that the infrastructure is performing correctly if an application owner complains about performance?

It all comes down to collecting and understanding the telemetry data that network switches already produce. A network switch "sees" every I/O sent from the application to the storage system, but collecting that data and presenting it so busy IT professionals can quickly interpret it is often the missing link.
With low-latency networks, traffic traverses the network so quickly that typical methods of capturing telemetry data may miss events that affect network performance, while attempts to capture every last bit of telemetry data may themselves degrade overall infrastructure performance. Most storage network monitoring tools collect data at polling intervals, taking a snapshot of network traffic I/O and switch conditions every 10 seconds or so.
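To make the gap concrete, here is a minimal sketch of interval polling. The utilization function below is hypothetical, not a real switch API: a short burst that falls entirely between two 10-second snapshots never shows up in the collected samples.

```python
# Minimal sketch of interval polling, assuming a hypothetical switch counter.
# A 2-second traffic burst inside a 10-second window is never sampled.

def port_utilization_percent(t):
    """Hypothetical instantaneous port utilization: mostly idle, one burst."""
    return 95.0 if 12 <= t < 14 else 10.0

POLL_INTERVAL = 10
samples = []
for t in range(0, 40, POLL_INTERVAL):
    samples.append(port_utilization_percent(t))  # snapshots at t = 0, 10, 20, 30

print(samples)  # [10.0, 10.0, 10.0, 10.0] -- the 95% burst at t=12..14 is invisible
```

The burst would congest the network for two full seconds, yet every snapshot reports a quiet port, which is exactly the blind spot the article describes.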

A massive amount of I/O can traverse an NVMe-oF network in 10 seconds. In that window, polling tools can miss critical indicators of trouble and may not give IT the information it needs to determine whether an anomaly is harmless or the source of a problem. Decreasing the polling interval, however, increases the potential performance impact, and the tool may not be able to store all the data it captures.
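A rough back-of-the-envelope calculation illustrates the scale. The link speed and I/O size below are illustrative assumptions, not figures from any particular deployment.

```python
# How much I/O can pass between two polls on a fast storage network?
# Assumes one 100 Gbps link fully utilized with 4 KiB I/Os -- illustrative only.
LINK_GBPS = 100
IO_SIZE_BYTES = 4096
POLL_INTERVAL_S = 10

bytes_per_second = LINK_GBPS * 1e9 / 8           # 12.5 GB/s
iops = bytes_per_second / IO_SIZE_BYTES          # ~3 million I/Os per second
ios_per_interval = iops * POLL_INTERVAL_S        # ~30 million I/Os between snapshots

print(f"{ios_per_interval:,.0f} I/Os can pass between two polls")
```

Roughly 30 million I/Os between consecutive snapshots on a single link, any of which could carry the first signs of a problem.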

Another option is real-time telemetry capture, but here again, capture done on the switch itself may affect its performance. Today, as in the past, organizations use network taps that connect inline on the network cabling. These taps feed information in real time to telemetry analysis software without affecting switch performance. However, installing taps can be disruptive; while there are workarounds, most IT professionals plan for an outage during tap installation.

Instead of polling at specific intervals or incurring the cost and potential outage of implementing taps, an organization may want to look for network switches with a dedicated telemetry application-specific integrated circuit (ASIC). The dedicated ASIC enables real-time telemetry data capture with no performance impact.

Telemetry capture is only half the battle

Capturing telemetry data in real time without impacting storage network performance is an essential step in monitoring high-speed, low-latency storage networks. The next step is to assemble this data into something a busy IT professional can use to quickly diagnose potential issues or upcoming shortfalls in network resources.

Look for tools that not only clearly present the telemetry data, but also use machine learning and big data analytics to help diagnose problems on the network. The long-term goal should be to train the network monitoring system to automatically take corrective action by having it watch the steps human administrators take to resolve issues.
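As a simplified illustration of the kind of analysis such tools automate, the sketch below flags latency samples that deviate sharply from a recent baseline. The function and thresholds are hypothetical stand-ins, not taken from any product.

```python
from statistics import mean, stdev

# Illustrative only: flag latency samples far above the recent baseline,
# a toy stand-in for the analytics monitoring tools apply at scale.

def flag_anomalies(latencies_us, window=5, threshold=3.0):
    """Return indices whose latency exceeds baseline mean + threshold * stdev."""
    flagged = []
    for i in range(window, len(latencies_us)):
        baseline = latencies_us[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma and latencies_us[i] > mu + threshold * sigma:
            flagged.append(i)
    return flagged

samples = [90, 95, 92, 94, 91, 93, 450, 92]   # microseconds; one obvious spike
print(flag_anomalies(samples))  # [6] -- the 450 us outlier is flagged
```

A real tool would correlate such flags across ports, fabrics and time to point at a likely root cause rather than a single outlier.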

The low latency of NVMe and NVMe-oF combined with the I/O demands of modern workloads means poor network design and configuration can no longer hide behind the latency of other storage infrastructure components. IT needs to proactively monitor storage network infrastructure design and resource consumption to make sure it's always a step ahead of the organization's I/O demands.

Real-time telemetry capture, when driven by on-switch ASICs, enables an organization to do this sort of NVMe-oF performance monitoring and see what's happening to its network at any given moment. Combined with the right analysis and presentation tool, IT should be able to proactively address hot spots before they cause problems and plan for future infrastructure needs.
