fault management

Fault management is the component of network management concerned with detecting, isolating and resolving problems. Properly implemented, network fault management can keep connectivity, applications and services running at an optimum level, provide fault tolerance and minimize downtime. Platforms or tools designed specifically for this purpose are called fault management systems.

Faults result from malfunctions or events that interfere with, degrade or obstruct service delivery. Examples of faults include hardware failure, connectivity loss and port status changes. Once a fault is detected, the management platform notifies the administrator (and any additional authorized or designated parties) using an alarm or alert. These notifications can be viewed in the fault management system's GUI, and many platforms can now also forward them via email, SMS or a mobile app.

In addition, network fault management systems may be configured to automatically resolve or even prevent certain events using programs and scripts.
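
As a rough illustration of script-based resolution, the sketch below maps fault types to remediation handlers. The event fields (`type`, `service`, `interface`) and the handler names are hypothetical, and the handlers only describe the action they would take rather than execute anything.

```python
# Hypothetical remediation handlers: each returns a description of the
# action a real script would perform for the given fault event.
def restart_service(event):
    return f"restart service {event['service']}"


def reset_interface(event):
    return f"reset interface {event['interface']}"


# Dispatch table: fault type -> handler (names are assumptions for this sketch).
REMEDIATIONS = {
    "service_down": restart_service,
    "port_flapping": reset_interface,
}


def remediate(event, handlers=REMEDIATIONS):
    """Return the automated action for an event, or None if a human must act."""
    handler = handlers.get(event["type"])
    if handler is None:
        return None  # no preconfigured script for this fault type
    return handler(event)
```

Fault types without an entry in the table fall through to `None`, which a real system would treat as a case for manual intervention.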

Fault management is one component of FCAPS (fault management, configuration, accounting, performance and security), which is a network management framework established by the International Organization for Standardization (ISO).

[Figure: fault management workflow]

Important functions of fault management

As a whole, network fault management comprises a variety of functions. Here are some examples of actions and services performed by fault management systems to keep the network operational:

  • definition of thresholds for potential failure conditions;
  • constant monitoring of system status and usage levels;
  • continuous scanning for threats, such as viruses and Trojans;
  • general diagnostics;
  • remote control of system elements, including workstations and servers from a single location;
  • alarms that notify administrators and users of impending and actual malfunctions;
  • tracing the locations of potential and actual malfunctions;
  • automatic correction of potential problem-causing conditions;
  • automatic resolution of actual malfunctions; and
  • detailed logging of system status and actions taken.
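
The threshold and alarm functions in the list above can be sketched in a few lines. This is a minimal example under stated assumptions: the `Threshold` class, the metric names and the two-level warning/critical scheme are illustrative choices, not taken from any particular product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str
    warning: float   # level treated here as an impending malfunction
    critical: float  # level treated here as an actual malfunction


def evaluate(threshold, value):
    """Return 'ok', 'warning' or 'critical' for one metric reading."""
    if value >= threshold.critical:
        return "critical"
    if value >= threshold.warning:
        return "warning"
    return "ok"


def check_readings(thresholds, readings):
    """Return (metric, value, severity) for every reading that is not 'ok'."""
    by_metric = {t.metric: t for t in thresholds}
    alarms = []
    for metric, value in readings.items():
        threshold = by_metric.get(metric)
        if threshold is None:
            continue  # no threshold defined for this metric
        severity = evaluate(threshold, value)
        if severity != "ok":
            alarms.append((metric, value, severity))
    return alarms
```

For example, with a hypothetical `cpu_percent` threshold of 80 (warning) and 95 (critical), a reading of 97 would produce a critical alarm, while a reading of 50 would produce none.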

Types of fault management

There are two types of network fault management: active and passive.

Active fault management uses various tools, such as ping or TCP/UDP port checks, to continually query devices and determine their status. It's akin to a person asking everyone in the room at repeated intervals, "How are you?" This allows the fault management system to proactively identify and rectify potential issues in real time -- sometimes before they even become problems -- but the tradeoff is more network chatter.
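
A TCP port check of the kind mentioned above might look like the following sketch. It uses Python's standard `socket.create_connection`. The function names, the two-second timeout and the "up"/"down" labels are assumptions made for illustration.

```python
import socket


def tcp_port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def poll_once(targets, check=tcp_port_is_open):
    """Query every (host, port) target once and report 'up' or 'down'."""
    return {
        (host, port): "up" if check(host, port) else "down"
        for host, port in targets
    }
```

An active system would call something like `poll_once` at a fixed interval. Each call opens one connection per target, which is where the extra network chatter comes from.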

Passive fault management systems, on the other hand, monitor their network environments for events that indicate a fault or failure has occurred. This information may come from error logs or SNMP traps, among other sources. It's comparable to a person who quietly listens until someone calls out for help. While passive fault management is more conservative in its resource usage, its drawback is that it may not discover a fault until after service has already been affected.
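
The passive approach can be sketched as a filter over incoming log lines. The line format below (`<timestamp> <host> <SEVERITY>: <message>`) and the set of severities treated as faults are hypothetical. Real devices and log formats vary.

```python
import re

# Hypothetical log line format: "<timestamp> <host> <SEVERITY>: <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\S+) (?P<host>\S+) (?P<severity>[A-Z]+): (?P<message>.*)$"
)

# Severities this sketch treats as faults (an assumption, not a standard).
FAULT_SEVERITIES = {"ERROR", "CRITICAL"}


def fault_events(lines):
    """Yield a dict for each log line that reports a fault; skip the rest."""
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group("severity") in FAULT_SEVERITIES:
            yield match.groupdict()
```

Nothing is sent to the monitored devices here: the function only reacts to what they have already reported, so a device that fails silently produces no event at all.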

Fault management process

Although the fault management process used in commercial platforms may vary slightly among vendors, most follow this general lifecycle when issuing an alarm:

  1. Fault detection: The system discovers that service delivery has been interrupted or its performance has degraded.
  2. Fault diagnosis and isolation: The source of the fault, such as a component failure or power outage, and its location in the network topology are identified.
  3. Event correlation and aggregation: Because a single fault can cause multiple alarms, fault management systems often group related events for administrators and provide a root cause analysis.
  4. Restoration of service: The network management system automatically executes any preconfigured scripts or programs to get services up and running as soon as possible.
  5. Problem resolution: The source of the fault is corrected, repaired or replaced. Depending on the cause, manual intervention may be required.
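
The event correlation step can be illustrated with a simple topology-based grouping. This is a sketch under a simplifying assumption: each device has at most one upstream parent, and the root cause of an alarm is taken to be the furthest-upstream device on its path that is also alarming. Commercial platforms use richer correlation rules. The device names are hypothetical.

```python
from collections import defaultdict


def correlate(alarms, parent):
    """Group alarming devices under their most upstream alarming ancestor.

    alarms: list of device names that raised an alarm.
    parent: dict mapping each device to its upstream device (or None).
    """
    alarming = set(alarms)
    groups = defaultdict(list)
    for device in alarms:
        root = device
        seen = {device}  # guards against loops in a malformed topology
        node = parent.get(device)
        while node is not None and node not in seen:
            if node in alarming:
                root = node
            seen.add(node)
            node = parent.get(node)
        groups[root].append(device)
    return dict(groups)
```

If a switch and the two hosts behind it all raise alarms, the three alarms collapse into one group keyed by the switch, which is the single item an administrator would need to investigate first.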
This was last updated in February 2018
