IT monitoring is the process of gathering metrics about the operation of an IT environment's hardware and software to ensure that everything functions as expected in support of applications and services.
Basic monitoring is performed through device operation checks, while more advanced monitoring gives granular views on operational statuses, including average response times, number of application instances, error and request rates, CPU usage and application availability.
How IT monitoring works
IT monitoring covers three sections, called the foundation, software and interpretation.
Foundation. The infrastructure is the lowest layer of a software stack and includes physical or virtual devices, such as servers, CPUs and VMs.
Software. This part is sometimes referred to as the monitoring section, and it analyzes what is running on the devices in the foundation, including CPU usage, load, memory and the number of running VMs.
Interpretation. Gathered metrics are presented through graphs or data charts, often on a GUI dashboard. This is often accomplished through integration with tools that specifically focus on data visualization.
IT monitoring can rely on agents or be agentless. Agents are independent programs that are installed on the monitored device to collect hardware or software performance data and report it to a management server. Agentless monitoring uses existing communication protocols, such as SNMP or WMI, to emulate an agent, with many of the same functionalities.
For example, to monitor server usage, an IT admin installs an agent on the server. A management server receives that data from the agent and displays it to the user via the IT monitoring system interface, often as a graph of performance over time. If the server stops working as intended, the tool alerts the administrator, who can repair, update or replace the item until it meets the standard for operation.
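The agent workflow described above can be sketched in a few lines of Python. This is a minimal illustration that uses only the standard library, not the design of any particular monitoring product. The function names, the 90% disk limit and the JSON payload layout are all assumptions made for the example, and the step that would actually transmit the payload to a management server is left out.

```python
import json
import os
import shutil
import socket
import time


def collect_metrics(path="/"):
    """Gather a small snapshot of host metrics using only the standard library."""
    disk = shutil.disk_usage(path)
    return {
        "host": socket.gethostname(),
        "timestamp": time.time(),
        "cpu_count": os.cpu_count(),
        "disk_used_pct": round(100 * disk.used / disk.total, 1),
    }


def check_thresholds(sample, disk_limit_pct=90.0):
    """Return alert messages for values at or above their limit (limit is an assumption)."""
    alerts = []
    if sample["disk_used_pct"] >= disk_limit_pct:
        alerts.append(f"{sample['host']}: disk {sample['disk_used_pct']}% full")
    return alerts


def to_payload(sample):
    """Serialize a sample as the JSON body a hypothetical agent might send to its server."""
    return json.dumps(sample).encode("utf-8")


sample = collect_metrics()
print(to_payload(sample))        # what the agent would report
print(check_thresholds(sample))  # what the management side would alert on
```

In a real deployment the agent would run this collection on a schedule, and the management server would store each sample so that it can be graphed over time.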
Real-time vs. trends monitoring
Real-time monitoring is a technique whereby IT teams use systems to continuously collect and access data to determine the active and ongoing status of an IT environment. Measurements from real-time monitoring software depict data from the current IT environment, as well as the recent past, which enables IT managers to react quickly to current events in the IT ecosystem.
Historical monitoring data enables the IT manager to improve the environment or identify potential complications before they occur by revealing patterns or trends in data from a period of operation. Trend analysis takes a long-term view of an IT ecosystem to assess system uptime and service-level agreement adherence and to inform capacity planning.
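As a rough illustration of trend analysis, the sketch below computes availability from a history of health checks and fits a straight line to disk usage to estimate when capacity will run out. The sample data, the 99.9% SLA target and the 90% capacity limit are assumptions chosen for the example, and a straight-line fit is only one simple way to project a trend.

```python
def availability_pct(checks):
    """Percentage of successful checks; `checks` is a list of booleans."""
    return 100.0 * sum(checks) / len(checks)


def linear_fit(ys):
    """Least-squares slope and intercept for evenly spaced samples (x = 0, 1, 2, ...)."""
    n = len(ys)
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def periods_until(ys, limit):
    """Estimate how many more sampling periods pass before the fitted line reaches `limit`.

    Returns None when usage is flat or falling.
    """
    slope, intercept = linear_fit(ys)
    if slope <= 0:
        return None
    return max(0.0, (limit - intercept) / slope - (len(ys) - 1))


# Hypothetical history: 1,000 health checks with 2 failures, and daily disk usage in percent.
checks = [True] * 998 + [False] * 2
disk_pct = [60.0, 62.0, 64.0, 66.0, 68.0]

print(availability_pct(checks) >= 99.9)  # False: 99.8% misses the assumed 99.9% target
print(periods_until(disk_pct, 90.0))     # 11.0 more days at the current growth rate
```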
Two extensions of real-time monitoring are reactive monitoring and proactive monitoring. The key difference is that reactive monitoring is triggered by an event or problem, while proactive monitoring seeks to uncover abnormalities without relying on a trigger event. The proactive approach can enable an IT staff to take action to address an issue, such as a memory leak that could crash an application or server, before it becomes a problem.
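To make the proactive case concrete, the following sketch scans a window of memory readings for steady growth, the kind of pattern a memory leak produces, without waiting for a failure event. The function name and both thresholds are assumptions for illustration. Real tools typically use more robust statistics.

```python
def looks_like_leak(samples_mb, min_growth_mb=50.0, min_rising_fraction=0.8):
    """Flag memory usage that rises steadily across the sampling window."""
    if len(samples_mb) < 3:
        return False
    steps = [b - a for a, b in zip(samples_mb, samples_mb[1:])]
    rising = sum(1 for s in steps if s > 0) / len(steps)
    growth = samples_mb[-1] - samples_mb[0]
    return growth >= min_growth_mb and rising >= min_rising_fraction


# Hypothetical readings in MB, taken at regular intervals.
leaking = [512, 530, 551, 570, 590, 612]
stable = [512, 520, 508, 515, 510, 514]

print(looks_like_leak(leaking))  # True: usage climbs at every step
print(looks_like_leak(stable))   # False: usage fluctuates around a constant level
```

A reactive monitor, by contrast, would run its logic only after an event such as an out-of-memory error had already been raised.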
Point-in-time vs. time-series monitoring: Point-in-time analysis examines one specific event at a particular instant. It can be used to identify a problem that must be fixed immediately, such as a 100% full disk drive. Time-series analysis plots metrics over time to account for seasonal or cyclical events and more accurately recognize abnormal behavior. Point-in-time analysis relies on fixed thresholds, while time-series analysis employs variable thresholds to paint a broader picture and better detect and even predict anomalies.
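The difference between fixed and variable thresholds can be shown with a short sketch. The first function compares one reading with a static limit. The second derives its limit from a rolling baseline of recent readings, which is one simple form of variable threshold. The window size, the multiplier k and the sample data are assumptions, and this rolling baseline does not model seasonality, which would require comparing against the same period in earlier cycles.

```python
from statistics import mean, stdev


def fixed_threshold_alert(value, limit=100.0):
    """Point-in-time check: compare a single reading with a fixed limit."""
    return value >= limit


def variable_threshold_alerts(series, window=5, k=3.0):
    """Time-series check: flag readings far from the recent rolling baseline.

    Returns the indexes of readings more than k standard deviations away from
    the mean of the preceding `window` readings.
    """
    flagged = []
    for i in range(window, len(series)):
        recent = series[i - window:i]
        baseline, spread = mean(recent), stdev(recent)
        if spread > 0 and abs(series[i] - baseline) > k * spread:
            flagged.append(i)
    return flagged


# Hypothetical requests-per-minute readings with one spike.
requests_per_min = [100, 102, 98, 101, 99, 100, 180, 101, 99]

print(fixed_threshold_alert(100.0))                 # True: a disk at 100% trips the fixed limit
print(fixed_threshold_alert(180.0, limit=200.0))    # False: the spike stays under a fixed limit of 200
print(variable_threshold_alerts(requests_per_min))  # [6]: the spike stands out against its baseline
```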
IT infrastructure monitoring
IT infrastructure monitoring is a foundation-level process that collects and reviews metrics concerning the IT environment's hardware and low-level software. Infrastructure monitoring provides a benchmark for ideal physical systems operation, which makes it easier to fine-tune systems and reduce downtime, and it enables IT teams to detect conditions that lead to outages, such as an overheated server.
Server monitoring and system monitoring tools review and analyze metrics, such as server uptime, operations, performance and security.
As more organizations embrace cloud computing, cloud monitoring capabilities and options have expanded as well. Cloud customers can get visibility into certain metrics, such as CPU, memory and storage usage, to gauge how well their applications perform, but the nature of cloud infrastructure limits the view into the physical assets on which cloud workloads run.
Network monitoring seeks out issues caused by slow or failing network components or security breaches. Metrics include response time, uptime, status request failures and HTTP/HTTPS/SMTP checks.
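An HTTP check of the kind mentioned above can be approximated with Python's standard library. The sketch below sends one request and records the status code and response time. The function name, the five-second timeout and the rule that 2xx and 3xx statuses count as healthy are assumptions for the example, not the behavior of any specific monitoring tool.

```python
import time
import urllib.error
import urllib.request


def http_check(url, timeout=5.0):
    """Request `url` once and report status, success and response time."""
    started = time.perf_counter()
    status = None
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        status = err.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        pass               # no answer: DNS failure, refused connection or timeout
    elapsed_ms = (time.perf_counter() - started) * 1000
    return {
        "url": url,
        "status": status,
        "ok": status is not None and 200 <= status < 400,
        "response_ms": round(elapsed_ms, 1),
    }
```

Calling `http_check` on a schedule against each monitored endpoint, and storing the results, would give both an uptime record and a response-time history.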
Security monitoring focuses on the detection and prevention of intrusions, typically at the network level. This includes monitoring for vulnerabilities, logging network access and identifying traffic patterns in real time to look for potential breaches.
Application performance monitoring
Application performance monitoring (APM) gathers software performance metrics based on both end-user experience and computational resource consumption. Examples of APM-provided metrics include average response time under peak load, load times and performance bottleneck data.
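As an illustration of how such metrics are derived, the sketch below aggregates a handful of per-request records into an average response time, an error rate, a request rate and the slowest endpoint. The record layout, the field names and the choice to count 5xx statuses as errors are assumptions made for the example. Commercial APM tools collect this data through their own instrumentation.

```python
from collections import defaultdict
from statistics import mean


def summarize(records, window_seconds):
    """Aggregate per-request records into a few APM-style metrics."""
    by_endpoint = defaultdict(list)
    for r in records:
        by_endpoint[r["endpoint"]].append(r["ms"])
    # The endpoint with the highest average response time is a bottleneck candidate.
    slowest = max(by_endpoint, key=lambda e: mean(by_endpoint[e]))
    return {
        "avg_response_ms": mean(r["ms"] for r in records),
        "error_rate": sum(r["status"] >= 500 for r in records) / len(records),
        "requests_per_second": len(records) / window_seconds,
        "slowest_endpoint": slowest,
    }


# Hypothetical request log covering a two-second window.
records = [
    {"endpoint": "/login", "ms": 120, "status": 200},
    {"endpoint": "/login", "ms": 80, "status": 200},
    {"endpoint": "/search", "ms": 400, "status": 200},
    {"endpoint": "/search", "ms": 600, "status": 500},
]

print(summarize(records, window_seconds=2))
```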
Cloud providers largely support APM capabilities with their own native tools. Cloud customers can also choose from many third-party APM tools to see metrics on resource availability, response times and security.
Application monitoring is within the scope of application performance management, a concept that involves more broadly controlling an application's performance levels.
IT monitoring tool options
Some APM vendors also offer IT infrastructure monitoring capabilities, and vice versa. Other tools focus on a single area, such as network or CPU performance. Some monitoring tools incorporate AI capabilities.
The following lists show just some examples of various monitoring tool types. These lists are not comprehensive, however, and many tools incorporate capabilities typically seen in other segments, such as AI or the ability to track cloud and on-premises infrastructure.
APM tools. BMC TrueSight, Cisco AppDynamics, Datadog, Dynatrace, ManageEngine Applications Manager, Microsoft Azure Application Insights, New Relic and SolarWinds APM.
IT infrastructure tools. LogicMonitor, ManageEngine OpManager, Microsoft System Center Operations Manager (SCOM), Nagios XI, SolarWinds, VMware vRealize Operations and Zabbix.
Cloud monitoring tools. Amazon CloudWatch, Google Stackdriver (since rebranded as part of Google Cloud's operations suite), Microsoft Azure Monitor, Cisco CloudCenter and Oracle Application Performance Monitoring Cloud Service.
Containers/microservices/distributed app monitoring tools. Confluent Kafka, Jaeger, LightStep and Prometheus.
AIops tools. BigPanda, Datadog, Dynatrace, Moogsoft and New Relic.
Log monitoring tools. Elastic Stack, Fluentd, Splunk and Sumo Logic.
Network security monitoring tools. Cisco DNA Analytics and Assurance, LiveAction LiveNX, LogRhythm and PRTG Network Monitor.