

Network fault management in today's complex data centers

Network designs and technologies may change, but identifying -- and remediating -- faults is a fundamental task for today's system managers. Here's what you need to know.

Detecting, identifying and correcting network faults has never been easy. Today's large data centers and cloud networks make network fault management even more challenging. It's a sharp departure from the past when client-server ruled, applications ran on a designated server and end users were connected either through in-building Ethernet, leased WAN links or other services.

Technologies change, but the end result is what matters. The question is: Are users receiving the required quality of service? The answer depends on both application and network performance.

Applications today often execute on a public, private or hybrid cloud. Applications move from server to server as loads shift. Throughput between servers and data stores varies depending on the load placed on shared links by other applications.

Network performance depends on the type and capacity of the network connecting users to the application. Local users may be connected via Ethernet or Wi-Fi. Remote users connect via various WAN technologies, including the public internet or cellular networks. Each requires specialized methods to maintain required performance. Faults in any of these locations -- application or network -- can prevent user satisfaction.

Cloud fault detection

Many topologies and designs -- among them: virtualized servers, multiple virtual LANs (VLANs) and overlay networks -- complicate cloud fault detection and network fault management. A performance problem in one tenant's application may not appear to be connected to a problem affecting a different tenant, but both may stem from the same source. Each tenant's application could be executing on the same overloaded or misconfigured server, or both tenants' overlay networks may be routed over the same overloaded or failing link.


The sheer number of servers, network components and links creates one source of faults. Modern hardware is extremely reliable, but even with a mean time between failures (MTBF) measured in years for each component, a fleet of thousands of devices will see regular hardware failures.
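A quick back-of-envelope calculation shows why scale makes failures routine. The MTBF and fleet size below are illustrative assumptions, not figures from any specific data center:

```python
# Illustrative failure-rate estimate (numbers assumed, not from the article):
# even with a generous per-device MTBF of 5 years, a fleet of 10,000 devices
# sees multiple hardware failures every day.
MTBF_HOURS = 5 * 365 * 24      # assumed per-device MTBF: 5 years
DEVICE_COUNT = 10_000          # assumed fleet size

failures_per_hour = DEVICE_COUNT / MTBF_HOURS
failures_per_day = failures_per_hour * 24

print(f"Expected failures per day: {failures_per_day:.1f}")
```

With these assumed numbers, roughly five to six devices fail every day, which is why automated fault handling matters.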

Configuration errors are another source of problems that can be tracked by network fault management. Servers and network devices are constantly being added, upgraded or replaced. A large cloud will usually include components from many vendors, and even identical components from a single vendor may be running different software revision levels. In this environment, any change presents an opportunity for error, and a change to one component can affect others.

Simply detecting and reporting errors is not sufficient, because each fault can result in dozens of error reports. A sporadic link failure generates hardware fault indications from the switches at both ends of the link, and both issue a new report each time the link goes down and comes back up. Layer 2 and Layer 3 protocols report route changes, as do link traffic monitors signaling that traffic levels on alternate routes are nearing their maximum. Meanwhile, application performance monitors report problems from each application that routes traffic over that link.
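The flapping-link scenario above can be sketched in a few lines: many raw reports from both ends of one link collapse into a single incident when grouped by the link they describe. Device and link names here are invented for illustration:

```python
from collections import defaultdict

# Minimal sketch (names hypothetical): collapse the many per-flap reports
# from both ends of a sporadically failing link into a single incident,
# keyed by the link identifier.
events = [
    {"source": "switch-a", "link": "a<->b", "msg": "link down"},
    {"source": "switch-b", "link": "a<->b", "msg": "link down"},
    {"source": "switch-a", "link": "a<->b", "msg": "link up"},
    {"source": "switch-b", "link": "a<->b", "msg": "link up"},
    {"source": "switch-a", "link": "a<->b", "msg": "link down"},
]

incidents = defaultdict(list)
for ev in events:
    incidents[ev["link"]].append(ev)

for link, evs in incidents.items():
    downs = sum(1 for e in evs if e["msg"] == "link down")
    print(f"{link}: {len(evs)} raw reports, {downs} down transitions -> 1 incident")
```

Five raw reports become one actionable incident, which is the essence of deduplication before correlation.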

Fault correlation and its role in the network

No human network manager could sort through the avalanche of reports generated as the result of a single fault and quickly identify the root cause. Fault correlation software is essential. It's a critical component of network management products from each of the major system vendors.

Fault correlation packages use a variety of mechanisms to detect problems, among them SNMP traps, TL1 messages, application logs and SYSLOG entries. SNMP and product-specific polling monitors load on servers, switches and links. Correlation tools also monitor such things as device temperature, power supply voltages and disk free space to anticipate future problems.
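The anticipatory monitoring described above boils down to comparing polled metrics against thresholds. The metric names and threshold values below are assumptions chosen for illustration, not taken from any particular product:

```python
# Hypothetical sketch: evaluate polled device health metrics against
# thresholds to anticipate problems before they become hard faults.
# Metric names and threshold values are assumptions.
THRESHOLDS = {
    "temperature_c": 70.0,     # warn above 70 degrees C
    "psu_voltage_v": 11.4,     # warn below 11.4 V on a nominal 12 V rail
    "disk_free_pct": 10.0,     # warn below 10% free disk space
}

def check_device(metrics: dict) -> list:
    """Return warning strings for any out-of-range metric."""
    warnings = []
    if metrics["temperature_c"] > THRESHOLDS["temperature_c"]:
        warnings.append("temperature high")
    if metrics["psu_voltage_v"] < THRESHOLDS["psu_voltage_v"]:
        warnings.append("PSU voltage low")
    if metrics["disk_free_pct"] < THRESHOLDS["disk_free_pct"]:
        warnings.append("disk nearly full")
    return warnings

print(check_device({"temperature_c": 75.2, "psu_voltage_v": 11.9,
                    "disk_free_pct": 4.0}))
```

A real package would gather these metrics via SNMP polling or product-specific APIs; the comparison logic is the same.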

Network fault management software must maintain an accurate and up-to-date picture of the network. The software must be updated, either manually or via network mapping, to track added, removed, or updated components and links. It must maintain internal models of each component describing its configuration and capabilities and contain descriptions of network operating policies. It must also be updated with information such as service level agreements (SLAs) when applications are added.
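The internal models described above amount to an inventory of components plus per-application SLA targets. A minimal sketch of such a data model, with all names and values invented for illustration:

```python
from dataclasses import dataclass, field

# Illustrative data model (all names and values invented): the kind of
# inventory a fault-management system maintains -- components with their
# configuration, plus per-application SLA targets.
@dataclass
class Component:
    name: str
    kind: str                   # e.g. "switch", "server", "link"
    sw_version: str             # revision levels often differ across a fleet
    config: dict = field(default_factory=dict)

@dataclass
class SLA:
    application: str
    max_latency_ms: float
    min_throughput_mbps: float

inventory = {
    "spine-1": Component("spine-1", "switch", "7.0.3"),
    "web-01": Component("web-01", "server", "ubuntu-14.04"),
}
slas = [SLA("billing", max_latency_ms=50.0, min_throughput_mbps=100.0)]

print(inventory["spine-1"].kind, slas[0].max_latency_ms)
```

Keeping this model current -- via manual updates or automated network mapping -- is what lets correlation software reason about a fault's scope.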

In addition, fault correlation software must interface with cloud orchestration software to track what applications are executing, which servers they're running on and the VLANs and overlay networks associated with each tenant. Network fault management software must also continually monitor application performance levels against SLAs.

When a problem occurs, correlation software pulls together all the incoming fault indications and uses its information about network topology and how data was moving prior to the fault to determine the root cause and provide a concise report to network managers.
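One simple form of this topology-based reasoning: when several applications report degraded performance, intersect the network elements along each affected path, and the element common to all of them is the likeliest single root cause. The topology and alarm data below are invented for illustration:

```python
from functools import reduce

# Minimal root-cause sketch (topology and alarms invented for illustration):
# each application's traffic traverses a known set of network elements.
paths = {
    "app-1": {"leaf-1", "spine-1", "leaf-3"},
    "app-2": {"leaf-2", "spine-1", "leaf-4"},
    "app-3": {"leaf-1", "spine-1", "leaf-4"},
}
degraded = ["app-1", "app-2", "app-3"]   # apps whose SLA monitors fired

# The element shared by every degraded application's path is the
# candidate root cause.
shared = reduce(set.intersection, (paths[a] for a in degraded))
print(f"Candidate root cause: {shared}")
```

Production correlation engines use far richer models (timing, alarm severity, dependency graphs), but the intersection step captures the core idea of collapsing many symptoms into one cause.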

SDN networks change the equation

Clouds and data centers managed by SDN technology face the same set of potential problems as those relying on traditional techniques. They both require fault correlation software, but SDN architectures require correlation software to either be built into the network controller or tightly interfaced to it.

The reason for the difference is that traditional protocols such as Spanning Tree and Open Shortest Path First are implemented inside network devices. They reroute traffic as needed when a link or port problem blocks traffic. With SDN, all routes are determined in the controller. Fault correlation software must inform the controller about these types of problems so it can determine an alternate route.
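The controller-side reroute step can be sketched as a shortest-path search that skips any link the correlation layer has reported as failed. The topology and node names below are invented for illustration; a real controller works against its own network graph:

```python
from collections import deque

# Hypothetical sketch of the controller-side reroute step: after the
# correlation layer reports a failed link, recompute a shortest path
# that avoids it. The topology is invented for illustration.
topology = {
    "s1": ["s2", "s3"],
    "s2": ["s1", "s4"],
    "s3": ["s1", "s4"],
    "s4": ["s2", "s3"],
}

def shortest_path(src, dst, failed_links=frozenset()):
    """Breadth-first shortest path, skipping any link in failed_links."""
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == dst:
            return path
        for nxt in topology[node]:
            if nxt in seen or frozenset((node, nxt)) in failed_links:
                continue
            seen.add(nxt)
            queue.append(path + [nxt])
    return None

print(shortest_path("s1", "s4"))                              # normal route
print(shortest_path("s1", "s4", {frozenset(("s2", "s4"))}))   # after link failure
```

With the s2-s4 link marked failed, the search routes around it via s3, which is exactly the decision the device-resident protocols would have made on their own in a traditional network.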

OpenFlow-compliant white box switches support operating systems from a variety of vendors, each with its own method of detecting and reporting faults. Operating systems from Big Switch and Pica8, for example, both support SNMP, but Big Switch's controller and switch OS use OpenFlow unsolicited messages to communicate to and from devices. Correlation software communicates via interfaces to the controller to receive messages from devices and poll them for status.

Wi-Fi and WAN

Wi-Fi relies on a specialized set of tools to diagnose problems. Wi-Fi connectivity can suffer from problems such as signal interference, walls or solid objects that block the signal, and security vulnerabilities. A variety of troubleshooting products are available, spanning freeware and professional software products. Specialized hardware products are also required to diagnose some types of problems.

In the case of WAN connections owned and managed by a network service provider, the key parameters are throughput and round-trip time. Here again both free and professional products are available.
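A rough round-trip time estimate can be made by timing a TCP handshake, which is one of the simpler techniques such products use. This is a sketch, not a substitute for a dedicated measurement tool; the example host is a placeholder, and a real probe would repeat the measurement and report minimum, average and maximum:

```python
import socket
import time

# Rough RTT estimate (a sketch, not a dedicated measurement tool):
# time a TCP three-way handshake to a host. A real probe would repeat
# this and report min/avg/max rather than a single sample.
def tcp_rtt_ms(host, port=443, timeout=3.0):
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass  # connection established; handshake time ~ one RTT
    return (time.perf_counter() - start) * 1000.0

# Example (requires network access; host is a placeholder):
# print(f"RTT: {tcp_rtt_ms('example.com'):.1f} ms")
```

Throughput measurement is harder, since it requires moving enough data to saturate the path, which is why dedicated professional tools exist for the job.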

Meeting end-user performance expectations requires that all aspects of application performance operate properly. Problems will occur, and network fault management and fault detection products must identify the cause so faults can be fixed quickly and proper operation restored.


This was last published in May 2016
