
The definitive guide to enterprise IT monitoring

This comprehensive IT monitoring guide examines strategies to track systems, from servers to software UIs, and how to choose tools for every monitoring need.

George Harrison once sang, paraphrasing the Cheshire Cat from Lewis Carroll's Alice in Wonderland: "If you don't know where you're going, any road will take you there." That very same lack of insight has plagued IT organizations since the earliest days of computing, leading to inefficiency and wasted capital, nagging performance problems and perplexing availability issues that can be costly and time-consuming to resolve.

That's why IT monitoring has long been an essential element of any enterprise IT strategy, to oversee and guide the data center and its constituent applications and services. IT administrators and business leaders track metrics over time to ensure that the IT organization maintains its required level of production, using trends to validate infrastructure updates and changes long before applications and services are affected. At the same time, real-time alerting allows administrators to respond to immediate problems that could harm the business.

Enterprise IT monitoring uses software-based instrumentation, such as APIs and agents, to gather operational information about hardware and software across the enterprise infrastructure. Such information can include basic device and application health checks, as well as far more detailed metrics that track resource availability and utilization, system and network response times, error rates and alarms, and other data.

IT monitoring employs three fundamental layers. The foundation layer gathers data from the IT environment, often using combinations of agents, logs, APIs or other standardized communication protocols to access data from hardware and software. In the middle layer, monitoring software processes and analyzes the raw data; from this, the tools establish trends and generate alarms. The interface layer displays the analyzed data in graphs or charts through a GUI dashboard.
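The three layers can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any product's design: the function names (collect, analyze, render), the sample readings and the 90% alarm limit are all invented for the example.

```python
from statistics import mean

# Hypothetical raw readings, standing in for data from agents, logs or APIs.
RAW_SAMPLES = {
    "web-01": [41.0, 45.5, 39.0],
    "db-01": [88.0, 93.5, 97.0],
}

def collect():
    """Foundation layer: gather raw CPU utilization samples per host."""
    return RAW_SAMPLES

def analyze(samples, alarm_limit=90.0):
    """Middle layer: reduce raw samples to an average and an alarm flag."""
    results = {}
    for host, values in samples.items():
        avg = mean(values)
        results[host] = {"avg_cpu": round(avg, 1), "alarm": avg > alarm_limit}
    return results

def render(results):
    """Interface layer: format analyzed data for display (plain text here)."""
    lines = []
    for host in sorted(results):
        status = "ALARM" if results[host]["alarm"] else "ok"
        lines.append(f"{host:<8} {results[host]['avg_cpu']:>5.1f}%  {status}")
    return "\n".join(lines)

print(render(analyze(collect())))
```

In a real deployment each layer would typically be a separate component; keeping them as separate functions here simply mirrors that division of labor.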

Types of IT monitoring

Although the need for IT monitoring is ubiquitous, monitoring approaches have proliferated and diversified through the years. This has yielded an array of tools focused on specific aspects of monitoring, ranging from the fundamental IT infrastructure and the network to the application and even the user experience (UX). Regardless of the monitoring type, the goal is generally to answer four essential questions:

  • What is present?
  • Is it working?
  • How well is it working?
  • How much of what is present is being used?

The following graphic represents data from surveys conducted in late 2019 and early 2020, before the emergence of COVID-19, which upended all aspects of business, including IT plans. Nevertheless, the need to monitor all aspects of IT environments hasn't gone away.

Survey of enterprise IT monitoring plans for 2020, conducted prior to COVID-19.
Enterprise IT leaders surveyed from late November 2019 through mid-February 2020 indicated a wide range of IT monitoring plans, with emphasis on security, application and network performance.

IT infrastructure monitoring. The traditional foundation of compute infrastructure lies in an organization's local data center environment: fleets of physical and virtual servers as well as storage assets such as local disks and disk arrays. Server monitoring discovers and classifies the existing assets, including hardware, operating systems and drivers, and then collects and processes a series of metrics.

Metrics can track physical and virtual server availability (uptime) and performance, and measure resource capacity -- such as processors, memory, disk storage and any associated storage area network. By watching capacity metrics, a business can find unavailable, unused or underutilized resources and make informed predictions about when and how much to upgrade. Metrics also allow administrators to oversee benchmarks that gauge compute performance (actual versus normal performance), from which they can immediately identify and correct infrastructure failures. Changes in performance trends over time also might indicate potential future compute problems, such as over-stressed servers.
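As a rough illustration of how capacity metrics surface unavailable or underutilized servers, the sketch below groups hosts by uptime and average utilization. The host names, the sample figures and the 10% cutoff are assumptions for the example, not recommended values.

```python
def classify_capacity(servers, low_pct=10.0):
    """Group servers by state using uptime and average CPU utilization.

    `servers` maps a host name to a dict with 'up' (bool) and
    'avg_cpu_pct' (float). The 10% cutoff is an illustrative assumption.
    """
    report = {"unavailable": [], "underutilized": [], "in_use": []}
    for name, stats in sorted(servers.items()):
        if not stats["up"]:
            report["unavailable"].append(name)
        elif stats["avg_cpu_pct"] < low_pct:
            report["underutilized"].append(name)
        else:
            report["in_use"].append(name)
    return report

# Hypothetical fleet snapshot.
fleet = {
    "app-01": {"up": True, "avg_cpu_pct": 62.0},
    "app-02": {"up": True, "avg_cpu_pct": 3.5},
    "batch-01": {"up": False, "avg_cpu_pct": 0.0},
}
print(classify_capacity(fleet))
```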

As organizations embrace off-premises compute environments, infrastructure monitoring has expanded to include remote and cloud infrastructures. Although cloud monitoring has some limitations, IaaS providers often allow infrastructure visibility down to the server and operating system level, including processors, memory and storage. Some native tools let IT managers dig into the details of log data. Still, public cloud infrastructures are shrouded by a virtualization layer that conceals the underlying physical assets.

Network monitoring. Servers and storage have little value without a LAN and WAN (such as the internet) to connect them, so network monitoring has evolved as an important IT monitoring type. Unique devices in the network, including switches, routers, firewalls and gateways, rely on APIs and common communication protocols to provide details about configuration, such as routing and forwarding tables. Monitoring tools yield network performance metrics -- such as uptime, errors, bandwidth consumption and latency (response time) across all of the subnets of a complex LAN. It is challenging to find a single network monitoring tool to cover all network devices and process metrics into meaningful intelligence.

Another reason network monitoring is a separate branch of the IT monitoring tree is security. A network is the road system that carries data around an enterprise and to its users. It is also the principal avenue for attack on the organization's servers and storage. That makes it essential to have an array of network security tools for intrusion detection and prevention, vulnerability monitoring, access logging and so on. Today, the notion of continuous security monitoring relies on automation. It promises real-time, end-to-end oversight of the security environment, alerting security teams to potential breaches.

Application and user experience monitoring. Even an infrastructure and network that work perfectly might provide inadequate support for applications, leaving users frustrated and dissatisfied. This potentially leads to wasted investment and lost business. Ultimately, it's application availability and performance -- not the infrastructure -- that really matter. This has led to two relatively recent expressions of IT monitoring: application performance monitoring (APM) and UX monitoring.

APM typically uses traditional infrastructure and network monitoring tools to assess the underlying behaviors of the application's environment, but it also gathers metrics specifically related to application performance. These can include average latency, latency under peak load and bottleneck data such as delays accessing an essential database. An organization that demonstrates how an application performs within acceptable parameters and operates as expected can strengthen its governance posture. When an application's performance deviates from acceptable parameters, leaders can remediate problems quickly -- often without users ever knowing a problem existed.

APM took root in local data centers, but public cloud providers offer support tools -- such as Amazon CloudWatch and Azure Monitor -- for application-centric cloud monitoring. Organizations with a strong public cloud portfolio will need a range of cloud monitoring tools that not only track application performance, but also ensure security and calculate how efficiently resources are used. Common cloud application metrics include resource availability, response time, application errors and network traffic levels.

A recent iteration of application performance monitoring relates to applications based on a microservices architecture, which use APIs to integrate and allow communication between individual services. Such architectures are particularly challenging to monitor because of the ephemeral nature of microservices containers and the emphasis on LAN connectivity and performance. Consequently, microservices applications often rely on what's called semantic monitoring to run periodic, simulated tests within production systems, gathering metrics on the application's performance, availability, functionality and response times.

UX monitoring is closely related to APM, but the metrics are gathered from the users' perspective. For example, an application-responsiveness metric might refer to the time it takes for an application to deliver a requested web page. If this metric is low, that means the request was completed quickly and the user was satisfied with the result. As the request takes longer to complete, the user is less satisfied.

IT monitoring throughout the enterprise technology stack
IT monitoring happens throughout the enterprise IT stack, gathering metrics about system and application performance, network and security updates, and user experiences.

IT monitoring and DevOps

As IT activities transition into a business service in their own right, the role of IT monitoring has expanded from an infrastructure focus to include IT processes to ensure that workflows proceed quickly, efficiently and successfully. At the same time, software development has emerged as one of the most important IT processes. Development paradigms, such as DevOps, embrace rapid cycles of iteration and testing to drive software product development.

Rapid cyclical workflows can be rife with bottlenecks and delays caused by human error, poorly planned processes and inadequate or inappropriate tools. Monitoring the DevOps workflow allows an organization to collect metrics on deployment frequency, lead time, change volume, failed deployments, defect (bug) volume, mean time to detection/recovery, service-level agreement (SLA) compliance and other steps and missteps. These metrics can preserve DevOps efficiency while identifying potential areas for improvement.
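Several of these workflow metrics can be derived from simple deployment records. The sketch below is one possible calculation, assuming a hypothetical record format (commit time, deploy time, a failure flag and a restore time); real pipelines expose this data in their own ways.

```python
from datetime import datetime, timedelta

# Hypothetical deployment records: commit time, deploy time, outcome and,
# for failures, the time service was restored.
DEPLOYS = [
    {"commit": datetime(2024, 1, 1, 9), "deployed": datetime(2024, 1, 1, 15),
     "failed": False, "restored": None},
    {"commit": datetime(2024, 1, 2, 10), "deployed": datetime(2024, 1, 3, 10),
     "failed": True, "restored": datetime(2024, 1, 3, 11)},
    {"commit": datetime(2024, 1, 4, 8), "deployed": datetime(2024, 1, 4, 14),
     "failed": False, "restored": None},
    {"commit": datetime(2024, 1, 5, 9), "deployed": datetime(2024, 1, 5, 21),
     "failed": True, "restored": datetime(2024, 1, 6, 0)},
]

def hours(delta):
    """Convert a timedelta to hours."""
    return delta.total_seconds() / 3600

def devops_metrics(deploys, period_days):
    """Summarize deployment frequency, lead time, failure rate and recovery."""
    if not deploys:
        raise ValueError("no deployments recorded")
    lead_times = [d["deployed"] - d["commit"] for d in deploys]
    failures = [d for d in deploys if d["failed"]]
    recoveries = [d["restored"] - d["deployed"] for d in failures]
    return {
        "deploys_per_day": len(deploys) / period_days,
        "mean_lead_time_h": hours(sum(lead_times, timedelta())) / len(deploys),
        "failed_deploy_rate": len(failures) / len(deploys),
        "mean_time_to_recovery_h": (
            hours(sum(recoveries, timedelta())) / len(failures) if failures else 0.0
        ),
    }

print(devops_metrics(DEPLOYS, period_days=5))
```

Tracking these figures from one period to the next, rather than reading a single snapshot, is what shows whether the workflow is improving.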

IT monitoring tools used in DevOps environments often focus on end-to-end aspects of application performance and UX monitoring. Rather than simply observe the total or net performance, the goal is to help developers and project managers delve into the many associations and dependencies that occur around application performance. This helps them determine the root of performance problems and troubleshoot more effectively.

For example, it's useful to know that it takes an average of 12 seconds to return a user request. That metric, however, doesn't identify why that action takes so long. Tools that watch the underlying infrastructure reveal that, for example, web server 18 needs 7.4 seconds to get a response from the database server, and this is ultimately responsible for the unacceptable delay. With such insight, developers and administrators can address the underlying issues to improve and maintain application performance.
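To make the example concrete, the sketch below breaks one slow request into segments and reports the largest contributor. Only the 12-second total and the 7.4-second database wait come from the example above; the other segment names and timings are hypothetical values chosen to fill out the trace.

```python
# Hypothetical timing breakdown (seconds) for one slow request. Only the
# 12-second total and the 7.4-second database wait come from the text;
# the remaining segments are invented for illustration.
SEGMENTS = [
    ("load balancer", 0.3),
    ("web server 18: request handling", 1.9),
    ("web server 18: waiting on database server", 7.4),
    ("web server 18: page rendering", 1.6),
    ("network transfer to user", 0.8),
]

def slowest_segment(segments):
    """Return (name, seconds, share_of_total) for the largest contributor."""
    total = sum(seconds for _, seconds in segments)
    name, seconds = max(segments, key=lambda item: item[1])
    return name, seconds, seconds / total

name, seconds, share = slowest_segment(SEGMENTS)
print(f"{name}: {seconds:.1f}s ({share:.0%} of the request)")
```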

IT monitoring also has adapted and expanded to support virtualized containers since many new applications are designed for containers. Virtualization and VMs are hardly new to IT infrastructure, but rising use of containers brings a host of container monitoring challenges to IT organizations. Containers are notoriously ephemeral -- they spawn within the environment and last only as long as needed, sometimes only seconds. Containers use relatively few resources compared to VMs, but they are more numerous.

Moreover, a working container environment involves an array of supporting elements -- such as container hosts, container engines, cluster orchestration and management systems, service routing and mesh services, and application deployment paradigms, such as microservices. Taken altogether, containers greatly multiply the number of objects that IT teams must monitor, rendering traditional monitoring setup and configuration activities inadequate.

Container monitoring involves familiar metrics such as resource and network utilization, but there are other metrics to contend with, including node utilization, pods versus capacity and kube-system alerts. Even container logging must be reviewed and updated to ensure that meaningful log data is collected from the application, volume and container engine for analysis.

Container KPIs
Breakdown of key performance indicators for container infrastructure monitoring

Other advances in enterprise IT monitoring include the rise of real-time monitoring and trend/predictive monitoring. Real-time monitoring isn't just a matter of agents forwarding collected data and sending alerts to IT administrators. Instead, the goal is to stream continuous real-time information where it can be collected, analyzed and used to make informed decisions about immediate events as well as assess trends over time. Data is collected from the infrastructure, but with increasingly resilient and distributed modern infrastructures, the data collected from applications can be more valuable to IT staff.

Such monitoring employs a more analytical approach to alerting and thresholds, with techniques such as classification analysis and regression analysis used to make smart decisions about normal versus abnormal system and application behaviors. Classification analysis organizes data points into groups or clusters, allowing events outside of the classification to be easily identified for closer evaluation. Regression analysis generally makes decisions or predictions based on past events or behaviors. Normal distributions plot events based on probability to determine the mean (average) and variations (standard deviations), also allowing unusual events to be found quickly and effectively.
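The normal-distribution approach can be sketched with Python's standard statistics module. This is a minimal example that assumes the baseline readings are roughly normally distributed; the response times and the three-standard-deviation cutoff are illustrative assumptions.

```python
from statistics import mean, stdev

def find_outliers(baseline, new_values, max_sigma=3.0):
    """Flag new values more than `max_sigma` standard deviations from the
    baseline mean. Assumes the baseline is roughly normally distributed."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: anything different is unusual.
        return [v for v in new_values if v != mu]
    return [v for v in new_values if abs(v - mu) / sigma > max_sigma]

# Hypothetical response times in milliseconds.
baseline_ms = [102, 98, 101, 97, 103, 99, 100, 100]
incoming_ms = [101, 96, 140, 104]
print(find_outliers(baseline_ms, incoming_ms))
```

With this baseline the mean is 100 ms and the standard deviation is 2 ms, so only the 140 ms reading falls outside the cutoff.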

Classification and regression analysis are also closely related to machine learning and artificial intelligence (AI). Both technologies are making inroads into IT monitoring in product categories, such as AIOps (artificial intelligence for IT operations). Machine learning uses collected data to build a behavioral model and then expands and refines the model over time to provide accurate predictions about how something will behave. Machine learning technologies have proved effective in IT tasks such as failure prediction and predictive maintenance. AI and AIOps build on machine learning to bring autonomy to the machine learning model and enable software to make -- and respond to -- informed decisions.


Network monitoring poses another challenge. Frequent use of traditional Simple Network Management Protocol (SNMP) communication can be disruptive to busy modern networks, so a type of real-time monitoring called streaming telemetry pushes network operational data to data collection points. This offers better real-time monitoring than SNMP without disrupting network devices.

Building an effective IT monitoring strategy

When an organization clarifies the goals or reasons for monitoring, it narrows the choices and limits the proliferation of disparate monitoring tools deployed across the enterprise. A good strategy ultimately saves money, conserves limited IT resources, speeds troubleshooting and recovery and reduces the burden of managing multiple tools.

There are four basic IT monitoring strategies an organization can build upon:

  • Reduce or limit the number of monitoring tools. This works well for relatively homogeneous organizations that use a limited number of systems, architectures, workflows and policies. As one example, a company that does business with just one public cloud provider might use that provider's native monitoring tools along with one or two tools to support the local data center. However, this may be impractical for heterogeneous organizations with broad mixes of hardware, architectures and workflow models.
  • Develop monitoring that closely ties to the application architecture. For example, create a UI for application data using public cloud, serverless and managed services. This approach is intended for newer application architectures, which can be designed and supported from the ground up; it doesn't work well for legacy or heterogeneous architectures.
  • Develop an in-house monitoring environment. One common example is to use log aggregation and analytics tools to create a central repository of operational data, and analyze, report and even predict alerts. This strategy can integrate multiple monitoring tools along with database, data integration, monitoring and visualization tools to create a custom monitoring resource. Be aware that the DIY approach can be time-consuming and expensive to create and maintain.
  • Adopt an autonomous operations platform. Tools such as Moogsoft, Datameer, VictorOps, OpsGenie and AlertOps use data integration and machine learning to effectively create a unified monitoring system with a growing level of intelligence and autonomy to help speed IT incident responses.

Once a strategy is clear, an organization can make more granular choices about approaches and tools that define the monitoring implementation. There are plenty of choices.

Agents versus agentless monitoring: Monitoring is the process of collecting, processing and reporting data. But which data is collected -- and how -- can vary dramatically. A truly effective monitoring tool sees each target hardware or software object and can query details about each of them. In most cases, this requires the installation of agents on each object to be discovered and monitored. While they produce extremely detailed monitoring data, agents must be patched, updated and otherwise managed. They also require processing and network overhead, potentially harming the performance of the object on which the agent operates.

Agentless monitoring forgoes the use of agents and instead collects data through standardized communication protocols, such as Intelligent Platform Management Interface (IPMI), SNMP or interoperable APIs. Agentless monitoring sheds the disadvantages of agents, but the data it collects tends to be limited in both quantity and detail. Many monitoring products support both agent and agentless data collection.

Reactive monitoring versus proactive monitoring: This is another expression of real-time versus trend monitoring. Collecting and reporting real-time statistics and data -- from processor and memory utilization to overall service health and availability -- is a time-tested, proven approach for alerting and troubleshooting in a 24/7 data center environment. In this style of monitoring, IT administrators react to an event once it occurs.

Proactive monitoring seeks to look ahead and make assessments and recommendations that can potentially prevent problems from occurring. For example, if a monitoring tool alerts administrators that memory was not released when a virtual machine was destroyed, it could help them address a memory leak before the affected server runs out of memory and crashes. Proactive monitoring depends on reactive tools to collect data and create trends for the proactive tool to analyze, and is increasingly augmented with machine learning technologies to help spot abnormal behaviors and cyclical (recurring) events.
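One simple way to look ahead is to fit a trend line to collected samples and estimate when a resource will run out. The sketch below uses an ordinary least-squares fit; the straight-line model, the function name and the sample memory readings are all assumptions for illustration, and real tools generally use richer models.

```python
def hours_until_exhausted(samples, capacity):
    """Fit a least-squares line to (hour, used) samples and estimate how many
    hours after the last sample usage reaches `capacity`.

    Returns None when usage is flat or falling. A straight-line fit is a
    simplifying assumption.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("samples must span more than one point in time")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    hit_hour = (capacity - intercept) / slope
    return max(0.0, hit_hour - samples[-1][0])

# Hypothetical memory use in GB, sampled hourly on a 16 GB server.
memory_gb = [(0, 8.0), (1, 8.5), (2, 9.0), (3, 9.5), (4, 10.0)]
print(hours_until_exhausted(memory_gb, capacity=16.0))
```

Here memory grows by 0.5 GB per hour from 8 GB, so the fitted line reaches 16 GB at hour 16, which is 12 hours after the last sample.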

Distributed applications: Applications that traditionally run in the local data center are increasingly distributed across multiple computing infrastructure models, such as remote data centers as well as hybrid cloud and multi-cloud environments. For example, an application may run multiple instances in the public cloud, where ample scalability is readily available, but rely on other applications or data still hosted in the local data center. This adds tremendous monitoring complexity because each component of the overall application must be monitored to ensure that it operates properly.

One key choice in such complex environments is centralization or decentralization. Centralizing collects monitoring data from local and cloud platforms into a single tool to present a single, unified view. This is best to provide end-to-end monitoring across cloud and local infrastructures, although it requires careful integration of cloud and local monitoring. By contrast, decentralization continues the use of cloud and local tools without coordination or interdependency. This is simpler to manage and maintain with few dependencies, but organization and analysis of multiple monitoring data sources can be a challenge.

Monitoring and virtualization: Virtualization is a staple of cloud and local data centers, and is responsible for vastly improved resource utilization and versatility through "software-defined" technologies, such as software-defined networks. Monitoring must account for the presence of virtualization layers, whether hypervisors or container engines, to see the underlying physical layer wherever possible. Modern monitoring tools are typically virtualization-aware, but it's important to validate each tool's behavior.

For example, network virtualization divides a physical network into many logical networks, but it can mask performance or device problems from traditional monitoring tools. Proper monitoring at the network level may require monitoring individual VMs and hypervisors to ensure a complete performance picture.

The role of machine learning and AI: Enterprise IT monitoring involves a vast amount of information. There's real-time data and streaming telemetry to watch for current events and track trends over time, and countless detailed logs generated by servers, devices, operating systems and applications to sort and analyze for event triggers and root causes. Many monitoring alarms and alerts are false positives or have no consequent impact on performance or stability. It can be daunting for administrators to identify and isolate meaningful events from inconsequential ones.

Consider the issue of anomaly detection. Common thresholds can trigger an alert, but human intervention determines whether the alert is important. Monitoring tools increasingly incorporate AI and machine learning capabilities, which apply math and trends to flag events as statistically significant and help administrators separate the signal from the noise. In effect, AI sets thresholds automatically to reduce false positives and identify and prioritize the most important incidents.

Machine learning also aids anomaly detection in log analytics, a monitoring practice that is particularly effective for root cause analysis and troubleshooting. Here, machine learning uses regression analysis and event correlation to flag potential anomalies and predict future events, and can even adjust for seasonal or daily variations in trends to reduce false positives.

For an example of machine learning and AI in monitoring, consider the vast amounts of network traffic that an organization receives. Divining an attempted hack or other attack from that volume of traffic can be extremely challenging. But anomaly detection techniques can combine a view of traffic content, behaviors and log reporting to pinpoint likely attacks and take proactive steps to block the activity while it is investigated.

While machine learning provides powerful benefits for IT monitoring, the benefits are not automatic. Every business is different, so there is no single algorithm or model for machine learning to operate upon. This means IT administrators and software developers must ultimately create the model that drives machine learning for the organization, using a vast array of metrics -- such as network traffic volumes, source and target IP addresses, memory, storage, application latency, replication latency, message queue length and numerous other potential data points. A practical machine learning exercise might involve Apache Mesos and the K-means clustering algorithm for data clustering and analysis.
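For readers unfamiliar with K-means, the sketch below shows the core of the algorithm in plain Python. It does not involve Apache Mesos or any machine learning library, and the sample data (application latency paired with message queue length) and the starting centroids are hypothetical. Starting centroids are passed in explicitly so the result is repeatable; production implementations usually choose them randomly.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain K-means: assign each point to its nearest centroid, then move
    each centroid to the mean of its assigned points."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for every point.
        labels = [
            min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for i in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids

# Hypothetical samples: (application latency in ms, message queue length).
samples = [(20, 5), (22, 6), (19, 4), (21, 5), (95, 40), (102, 44), (98, 42)]
labels, centers = kmeans(samples, centroids=[samples[0], samples[4]])
print(labels)
print(centers)
```

The two clusters that emerge separate normal behavior from a group of slow, backed-up samples; a new data point far from both centers would be a candidate for closer evaluation.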

Best practices for IT monitoring: IT monitoring is a dynamic process that requires regular attention and support of data monitoring, thresholds and alerts, visualization or dashboard setup as well as integrations with other tools or workflows, such as CI/CD and AIOps. Machine learning and AI can help to alleviate some of the routine tasks involved, but regular attention is essential to maintain the automated workflows and to validate the evolving machine learning model.

Consider the simple importance of thresholds in IT monitoring. Monitoring can employ static and dynamic thresholds. Static thresholds are typically set based on worst-case situations, such as maximum processor or memory utilization percentages, and can typically be adjusted from any default thresholds included with the monitoring tool. A static threshold is rarely changed and doesn't account for variations in the environment. It applies to every instance, so it's easy to wind up over- or under-reporting critical issues, resulting in missed problems or false positives.

By comparison, dynamic thresholds generally use machine learning to determine what is normal and generate alerts only when the determined threshold is exceeded. Dynamic thresholds can adjust for seasonal or cyclical trends, and can better separate real events from false positives. Thresholds are adjusted automatically based on cyclical trends and new input. Dynamic thresholds are imperfect, and they can be disrupted when activity occurs outside of established patterns. Thus, dynamic thresholds still require some human oversight to ensure that any machine learning and automation proceeds in an acceptable manner.
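The difference between the two threshold styles can be shown with a small comparison. The sketch below is a simple statistical stand-in for the machine learning the text describes: the dynamic limit is the mean of past readings for the same hour of day plus three standard deviations. The 90% static limit and the CPU history are hypothetical values.

```python
from statistics import mean, stdev

STATIC_LIMIT = 90.0  # illustrative worst-case CPU percentage

def static_alert(value):
    """Static threshold: one fixed limit for every hour and every instance."""
    return value > STATIC_LIMIT

def dynamic_alert(value, history, sigmas=3.0):
    """Dynamic threshold: alert when the value exceeds the mean of past
    readings for the same hour of day by more than `sigmas` deviations."""
    limit = mean(history) + sigmas * stdev(history)
    return value > limit

# Hypothetical CPU history (%) for two hours of the day.
history_by_hour = {
    3: [10, 12, 11, 9, 13],    # overnight: normally quiet
    14: [70, 75, 72, 68, 74],  # mid-afternoon: normally busy
}

for hour, value in [(3, 60.0), (14, 78.0), (14, 95.0)]:
    print(hour, value, static_alert(value), dynamic_alert(value, history_by_hour[hour]))
```

A 60% reading at 3 a.m. never trips the static limit, yet it is far outside the overnight pattern, so the dynamic threshold flags it. A 78% reading in mid-afternoon is within the normal busy range and raises no alert under either approach.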

Overall, the best practices for enterprise IT monitoring and responses can be broken down into a series of practical guidelines.

  1. Focus on the system and apps. There are countless metrics that can be collected and analyzed, but the only metrics that most IT administrators should worry about are the metrics related to system (infrastructure) and application performance -- everything else is extraneous or cannot readily be acted upon by IT. For example, a metric such as cost per transaction has little value to IT monitoring, but a metric such as transaction latency can be vital to adequate performance and SLA compliance.
  2. Carefully configure alerts. Thresholds and alerts are typically the first line of defense when issues arise. Direct alerts to the most appropriate team members and then be sure to hold those staffers accountable. Ideally, IT should know about any problem before a supervisor -- or a customer. Integrate alerts into an automated ticketing or incident system, if possible, to speed assignment and remediation.
  3. Be selective with alerts and reports. Don't overwhelm IT staff with needless or informational alerts. Only configure alerts for metrics that pertain directly to IT operations, and turn off alerting for metrics over which the IT staff has no control. This reduces noise and stress, plus it allows staff to focus on the most relevant alerts.
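The guidelines above can be sketched as a small routing step that sits between the monitoring tool and the ticketing system. The metric names, team names and severity labels below are hypothetical; the point is that alerts with no owning team, or with purely informational severity, never reach staff.

```python
# Hypothetical routing table: metric name -> team that owns it. Metrics that
# IT cannot act on are deliberately absent, so they never page anyone.
ROUTES = {
    "transaction_latency_ms": "app-oncall",
    "cpu_utilization_pct": "infra-oncall",
    "disk_free_pct": "infra-oncall",
}

def route_alerts(alerts):
    """Drop informational or unowned alerts and group the rest by team."""
    tickets = {}
    for alert in alerts:
        team = ROUTES.get(alert["metric"])
        if team is None or alert["severity"] == "info":
            continue  # not actionable by IT, or purely informational
        tickets.setdefault(team, []).append(alert["metric"])
    return tickets

incoming = [
    {"metric": "transaction_latency_ms", "severity": "critical"},
    {"metric": "cost_per_transaction", "severity": "warning"},
    {"metric": "cpu_utilization_pct", "severity": "info"},
    {"metric": "disk_free_pct", "severity": "warning"},
]
print(route_alerts(incoming))
```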

IT monitoring tools

IT administrators only know and act on what they see, and what they see is ultimately enabled through tools. Organizations can employ a multitude of tools to oversee and manage infrastructure and services, but tools have various limitations in scope, discovery, interoperability and capability.

An IT team needs a clear perspective on criteria -- what problems are they trying to solve through the use of tools? For example, a business concerned with network performance or traffic analysis needs a network monitoring tool; a tool intended for server monitoring may offer some network insights, but that data likely is not meaningful enough to be useful.

In the end, an IT team faces a difficult decision: deploy a suite or framework that does everything to some extent, or use tools from a variety of vendors that provide detailed information but in a pieced-together arrangement that can be hard to integrate, learn and maintain.

Sometimes, new and innovative technologies offer powerful opportunities for monitoring, optimization and troubleshooting. One example of this innovation is the emergence of log analytics tools. Almost every system produces log files that contain valuable data about events, changes and errors. But logs can be huge, difficult to parse and challenging to correlate, making it almost impossible for humans to find real value in logs.

A relatively new classification of log analytics tools can discover, aggregate, analyze and report insights gleaned from logs across the infrastructure and applications. The recent addition of machine learning and AI capabilities to log analytics allows such tools to pinpoint anomalous behaviors and even predict potential events or issues. In addition to logs, the ability to access and aggregate vast amounts of monitoring data from other tools allows products such as Grafana or Datadog to offer more comprehensive pictures of what's happening in an environment.

Organizations with a local data center typically adopt some form of server monitoring tool to oversee each server's health, resources and performance. Many tools provide server and application or service management features. Tools include Cacti, ManageEngine Applications Manager, Microsoft System Center Operations Manager, Nagios, Opsview, SolarWinds Server and Application Monitor, Zabbix and more.

IT must also decide between vendor-native or third-party monitoring tools. Third-party tools such as SolarWinds Virtualization Manager and Veeam One monitor virtualized assets, such as VMs, and potentially provide superior visualizations and integrations at a lower cost than native hypervisor offerings, such as Microsoft's System Center 2019 or VMware vRealize Operations 8.0.

Extensibility and interoperability are critical when selecting an IT monitoring tool. Plugins, modules, connectors and other types of software-based interfaces allow tools to discover, configure, manage and troubleshoot additional systems and services. Adding a new plugin can be far easier and cheaper than purchasing a new tool. One example is the use of modules to extend a tool such as SolarWinds for additional IT operations tasks.

Interoperability is critical in building a broader monitoring and automation umbrella, and some tools are rising to the challenge. For example, the Dynatrace AIOps engine now collects metrics from the Kubernetes API and Prometheus time-series monitoring tool for Kubernetes clusters. Ideally, such integration improves detection of root cause events in Kubernetes; more broadly, the implications for integration and IT automation portend powerful advancements for AI in operations.

The ability to process and render vast amounts of infrastructure data at various levels, from dashboards to graphs, adds tremendous value to server and system monitoring. Sometimes, a separate visualization tool is most appropriate. Examples include Kibana, an open source log analysis platform that discovers, visualizes and builds dashboards on top of log data; and Grafana, a similar open source visualization tool, which is used with a variety of data stores and supports metrics.

Grafana alerting dashboard
This is how Grafana presents an alerting dashboard.

The shift of infrastructure and applications to the cloud means organizations need to track those resources as part of their enterprise IT monitoring efforts. Public cloud providers have opened their traditionally opaque infrastructures to accommodate this, and service providers offer their own native tools for cloud monitoring. Google Stackdriver (now folded into the Google Cloud Console portfolio) monitors Google Cloud as well as applications and VMs that run on AWS Elastic Compute Cloud; Microsoft Azure Monitor collects and analyzes data and resources from the Azure cloud; and AWS users have Amazon CloudWatch. Additional options include Oracle Application Performance Monitoring Cloud Service and Cisco CloudCenter, as well as tools such as Datadog for cloud analytics and monitoring and New Relic to track web applications.

Another major class of IT monitoring tools focuses on networks and security. Such tools can include physical devices, such as firewalls, and software, such as load balancers. They watch network activity for traffic patterns and performance between servers, systems and services.

A typical network monitoring tool -- such as Zabbix, Nagios, Wireshark, Datadog or SolarWinds' Network Performance Monitor -- will offer automatic discovery, automatic node and device inventory along with automatic and configurable trouble alerts and reporting. The interface should feature easy-to-read dashboards or charts, and it should include the ability to generate a network topology map. Virtualization and application awareness allow the tool to support advanced technologies such as network virtualization and application performance monitoring. Network monitoring can use agents, but may not need agents for all devices or applications. Graphing and reporting should ideally support interoperability with data visualization, log analytics and other monitoring tools.

Finally, organizations can leverage a variety of application and UX monitoring tools, such as New Relic, to ensure application performance and user satisfaction. These tools gather metrics on application behaviors, analyze that data to identify errors and troublesome transaction types, and offer detailed alerting and reporting to illustrate application and user metrics, as well as highlight SLA assessments. Others in the APM and UX segments with products to assist with monitoring include Datadog, Dynatrace, AppDynamics and Splunk.
