E-Handbook: How to solve storage latency issues when deploying NVMe-oF, Article 3 of 4

NVMe-oF performance monitoring best practices that work

Don't let your storage network infrastructure get in the way of delivering NVMe-oF's low-latency benefits. See how these best practices can fix problems before they occur.

By George Crump

Published: 16 Jan 2020

NVMe promises to reduce the internal communication latency between storage devices and processors to less than 100 microseconds. And NVMe-oF should bring these same low latencies to shared storage.

As a result, NVMe-oF has an opportunity to eliminate DAS, which has been regaining popularity because of the low latency demands of AI, machine learning and big data analytics. The challenge for IT pros is to put NVMe-oF performance monitoring in place to ensure the network configuration delivers NVMe-oF's low latency.

Why NVMe-oF performance monitoring is critical

In the past, the storage network was the fastest set of components within the storage infrastructure. The applications, storage systems and storage devices were far more latent than the network's switches and adapters. A misconfigured network port, network adapter or below-grade cable would often go undetected. In most cases, the only motivation to upgrade the storage network to a higher bandwidth was to get a faster speed for the same price -- or cheaper -- than the lower rate.

Now we have NVMe-oF-connected storage systems with NVMe storage media inside, and enterprises are running more AI, machine learning and big data analytics applications. As a result, the core of the network is under pressure to keep pace. Any misconfiguration turns the network into the bottleneck that slows down the storage hardware and advanced applications. Detecting problems within the network before they affect performance is critical.

Another reason NVMe-oF performance monitoring is so important is the extremely high expectations of application owners. They expect applications to perform at the levels promised by the storage system. In most cases, installing a faster, higher-bandwidth, lower-latency storage system and network improves application performance, but perhaps not enough to meet those expectations. Today, unlike in the past, the application itself is usually to blame. Still, because the storage infrastructure was so often the bottleneck in the past, IT infrastructure personnel must prove the network and the storage systems are configured correctly to deliver the promised performance. In other words, they must prove their innocence.

NVMe-oF performance monitoring best practices

How to monitor low-latency networks

How does IT configure the storage infrastructure correctly from the start, stay on top of changes and prove that the infrastructure is performing correctly if an application owner complains about performance?

It all comes down to collecting and understanding the telemetry data that network switches already produce. A network switch "sees" every I/O sent from the application to the storage system, but collecting that data and presenting it so busy IT professionals can quickly interpret it is often the missing link.

On low-latency networks, traffic moves so quickly that the typical methods of capturing telemetry data may miss events that affect network performance, while attempts to capture every detail of that traffic may themselves affect overall infrastructure performance. Most storage network monitoring tools collect data at polling intervals, taking a snapshot of network traffic I/O and switching conditions every 10 seconds.

A massive amount of I/O can traverse an NVMe-oF network in 10 seconds. In that time, polling tools can miss critical indicators of problems. They may not be able to provide IT the information it needs to determine if an anomaly is just an anomaly or the source of a problem. However, decreasing data capture intervals increases the potential for performance impact, and the tool may not be able to store all the data it captures.
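To see why a 10-second interval can hide a problem, consider a rough sketch in Python. Everything in it is an assumption made for illustration: the 10 ms sample spacing, the 20% background load and the 50 ms burst are invented numbers, and the poll is modeled as an average over the interval rather than a single point-in-time snapshot (a snapshot would catch the burst only if it happened to land inside it).

```python
def window_averages(samples, window):
    """Average consecutive, non-overlapping groups of `window` samples."""
    return [
        sum(samples[i:i + window]) / window
        for i in range(0, len(samples) - window + 1, window)
    ]

# Hypothetical trace: link utilization (percent) sampled every 10 ms for 10 s.
SAMPLE_MS = 10
samples = [20.0] * 1000          # steady 20% background load
samples[400:405] = [100.0] * 5   # a 50 ms burst that saturates the link

poll_10s = window_averages(samples, 10_000 // SAMPLE_MS)  # one 10 s poll
poll_100ms = window_averages(samples, 100 // SAMPLE_MS)   # 100 ms polls

print(f"10 s poll average: {poll_10s[0]:.1f}%")        # 20.4%
print(f"100 ms poll peak:  {max(poll_100ms):.1f}%")    # 60.0%
print(f"raw sample peak:   {max(samples):.1f}%")       # 100.0%
```

In this made-up trace, the 10-second view reports 20.4% utilization, which looks healthy, even though the link was saturated for 50 ms. The finer interval exposes more of the burst but produces 100 times as many data points to collect and store, which is the tradeoff described above.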

Another option is real-time telemetry capture, but here again, if done on the switch, the capture may impact performance. Today, as in the past, organizations use network taps that connect inline on the network infrastructure cabling. These taps enable a real-time feed of information back to a telemetry analysis software solution without impacting switch performance. However, the installation of taps can be disruptive. While there are workarounds, most IT professionals assume an outage during tap installation.

Instead of polling at specific intervals or going through the cost and potential outage of implementing taps, an organization may want to look for network switches with a dedicated telemetry application-specific integrated circuit (ASIC). The dedicated ASIC enables real-time telemetry data capture with no performance impact.

Telemetry capture is only half the battle

Capturing telemetry data in real time without impacting storage network performance is an essential step in monitoring high-speed, low-latency storage networks. The next step is to assemble this data into something a busy IT professional can use to quickly diagnose current issues and anticipate shortfalls in network resources.
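As one possible illustration of turning raw telemetry into something readable at a glance, the Python sketch below reduces per-port latency samples to a median, a 99th percentile and a maximum, then names the port with the worst tail latency. The port names, the sample values and the `summarize` and `worst_port` helpers are all hypothetical; real tools would draw on whatever data the switch actually exports.

```python
from statistics import quantiles

def summarize(latencies_by_port):
    """Reduce raw latency samples (microseconds) to p50/p99/max per port."""
    summary = {}
    for port, samples in latencies_by_port.items():
        # n=100 yields 99 cut points; cuts[k - 1] is the k-th percentile.
        cuts = quantiles(samples, n=100)
        summary[port] = {"p50": cuts[49], "p99": cuts[98], "max": max(samples)}
    return summary

def worst_port(summary):
    """Name the port with the highest 99th-percentile latency."""
    return max(summary, key=lambda port: summary[port]["p99"])

# Hypothetical samples; port names and values are illustrative only.
telemetry = {
    "port-1": [90 + (i % 7) for i in range(200)],
    "port-2": [90 + (i % 7) + (300 if i % 50 == 0 else 0) for i in range(200)],
}
report = summarize(telemetry)
for port, stats in report.items():
    print(port, stats)
print("highest p99:", worst_port(report))
```

Both ports have nearly the same median in this invented data, so an average-based view would not separate them. The tail percentile is what singles out the port with occasional slow I/Os.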

Look for tools that not only clearly present the telemetry data, but also use machine learning and big data analytics to help diagnose problems on the network. The long-term goal should be to train the network monitoring system to automatically take corrective action by having it watch the steps human administrators take to resolve issues.
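The machine learning in commercial tools is far more involved than anything that fits here, but a simple statistical baseline gives a feel for the idea of separating a one-off anomaly from normal behavior. The Python sketch below is that baseline only, not a machine learning model: it flags any latency sample that sits more than three standard deviations above the mean of the preceding samples. The function name, the window size, the threshold and the trace are assumptions chosen for illustration.

```python
from statistics import mean, stdev

def flag_anomalies(latencies_us, baseline=20, threshold=3.0):
    """Return indexes whose value is more than `threshold` standard
    deviations above the mean of the preceding `baseline` samples."""
    flagged = []
    for i in range(baseline, len(latencies_us)):
        history = latencies_us[i - baseline:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and (latencies_us[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged

# Hypothetical per-I/O fabric latencies in microseconds.
trace = [90, 95, 92, 88, 94, 91, 93, 89, 96, 90] * 3
trace[25] = 400  # one slow I/O

print(flag_anomalies(trace))  # [25]
```

A single flagged sample like this may be just an anomaly. A run of flagged samples on the same port is the kind of pattern worth escalating, and a production system would need a more robust baseline, because one large outlier inflates the standard deviation for the samples that follow it.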

The low latency of NVMe and NVMe-oF combined with the I/O demands of modern workloads means poor network design and configuration can no longer hide behind the latency of other storage infrastructure components. IT needs to proactively monitor storage network infrastructure design and resource consumption to make sure it's always a step ahead of the organization's I/O demands.

Real-time telemetry capture, when driven by on-switch ASICs, enables an organization to do this sort of NVMe-oF performance monitoring and see what's happening on its network at any given moment. Combined with the right analysis and presentation tool, it should let IT address potential hot spots before they affect applications and plan for future infrastructure needs.
