Reducing false positives in network monitoring

Network managers need to be aware of the importance of choosing the right tool to monitor their system's internal counters. In this tip, Brien Posey narrows in on alert mechanisms for Windows servers and helps you understand how to reduce false positives in network monitoring.

I have always thought that the Performance Monitor was one of the coolest tools included with Windows. It's great to be able to look at the various counters and see exactly what is going on inside the system. Even so, Microsoft seems to have really dropped the ball when it comes to the Performance Monitor's alert mechanism.

As I'm sure you know, the Performance Monitor is designed to allow you to monitor the values of various counters. Most of the Performance Monitor's counters have a threshold value. The threshold value is the point at which the value reflects an impending problem within the system.

The idea behind the Performance Monitor's Alert mechanism is that it can generate an alert whenever a threshold value is exceeded. The alert can be written to the event log, or an email notification can be sent to the system administrator. You can see an example of the alert interface in Figure A.

Figure A: The Performance Monitor can produce an alert whenever a threshold value is exceeded.

This type of alerting sounds good in theory, but it is fundamentally flawed. To see why this is the case, consider the % Processor Time counter. This counter reflects what percentage of the processor's total capacity is being used at any given moment. A generally accepted threshold value for this counter is 80%. If the counter's average value is in excess of 80%, the processor is being overworked and may be insufficient for the task being performed.

In that last sentence, the keyword is average. It is perfectly normal for the % Processor Time counter's value to spike to 100%. In fact, spikes up to 100% are very common. These spikes are harmless as long as the processor's average workload remains below 80%. Now, imagine that you had configured the Performance Monitor to generate an alert every time the % Processor Time counter's threshold value was exceeded. You could potentially receive thousands of alerts every single day, even if the system is working perfectly.
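To make the difference between spikes and averages concrete, here is a minimal Python sketch. It does not read real Performance Monitor data; the trace is a made-up series of % Processor Time readings, and the function names are hypothetical. It simply compares how many alerts a per-sample threshold check would raise with what the average says.

```python
from statistics import mean

THRESHOLD = 80.0  # percent; the generally accepted value cited above


def naive_alerts(samples, threshold=THRESHOLD):
    """Count the samples that would each raise an alert under a per-sample check."""
    return sum(1 for value in samples if value > threshold)


def average_exceeds(samples, threshold=THRESHOLD):
    """Report whether the average of the samples is above the threshold."""
    return mean(samples) > threshold


# Hypothetical one-minute trace, one reading per second:
# the CPU idles at 20% and spikes to 100% once every ten seconds.
samples = ([20.0] * 9 + [100.0]) * 6

print("per-sample alerts:", naive_alerts(samples))
print("average load:", mean(samples))
print("average above threshold:", average_exceeds(samples))
```

In this invented trace the average load is 28%, well under the threshold, yet the per-sample check still raises an alert for every spike.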

My point is that the Performance Monitor is a great tool for seeing what is going on "under the hood," but it isn't always the best tool for proactively monitoring your server. Keep in mind that I'm not saying that the Performance Monitor's alert mechanism is worthless. I'm just saying that it works better for some counters than others. For example, the alert mechanism may not be the best choice for monitoring a server's CPU, but it is perfectly fine for monitoring a server's available disk space. You could easily set up an alert that sends you an email if the Performance Monitor's LogicalDisk\Free Megabytes counter indicated that there was less than 1 GB of free disk space on the system drive. Since the system's available disk space probably doesn't routinely dip below this threshold value, you could safely assume that this type of alert wouldn't give you a lot of false positives.
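As a rough illustration of why a simple threshold suits a slow-moving value like free disk space, the check can be sketched in a few lines of Python. This is not the Performance Monitor's alert mechanism; it uses Python's standard shutil.disk_usage call as a stand-in for the LogicalDisk\Free Megabytes counter, and the function names and the print-based "alert" are assumptions for illustration only.

```python
import os
import shutil

MB = 1024 * 1024


def free_megabytes(path):
    """Return the free space on the volume holding `path`, in whole megabytes."""
    return shutil.disk_usage(path).free // MB


def low_disk_space(free_mb, threshold_mb=1024):
    """True when free space is below the threshold (1 GB by default)."""
    return free_mb < threshold_mb


# Root of the current drive; on a Windows server this would typically be C:\
system_root = os.path.abspath(os.sep)

if low_disk_space(free_megabytes(system_root)):
    # A real monitor would send an email or write to the event log here.
    print(f"ALERT: less than 1 GB free on {system_root}")
```

Because free disk space changes gradually rather than spiking, a single reading below the threshold almost always reflects a real condition.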

So the million-dollar question is: How can you reduce false positives for counters that tend to fluctuate a lot? I have seen some administrators lower the sampling frequency in an effort to reduce false positives. This technique may cut the number of false positives, but it doesn't solve the underlying problem. Each sample is still an instantaneous reading, so counters that fluctuate a lot will still produce false positive alerts.
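The effect of sampling less often can be sketched with the same kind of made-up data. In the Python example below, the one-hour trace and the seven-second interval are arbitrary assumptions, chosen only so that the slower sampling schedule sometimes lands on a spike.

```python
THRESHOLD = 80.0  # percent


def count_alerts(samples, threshold=THRESHOLD):
    """Count the readings that would each raise a per-sample alert."""
    return sum(1 for value in samples if value > threshold)


# Hypothetical one-hour trace at one reading per second:
# 20% load with a one-second spike to 100% every ten seconds (28% average).
trace = ([20.0] * 9 + [100.0]) * 360

every_second = trace            # original sampling rate
every_seven_seconds = trace[::7]  # reduced sampling rate

print("alerts at 1-second sampling:", count_alerts(every_second))
print("alerts at 7-second sampling:", count_alerts(every_seven_seconds))
```

Sampling less often lowers the alert count, but every one of the remaining alerts is still a false positive, because the average load never comes close to the threshold.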

A better solution is to use other Microsoft products to monitor Performance Monitor counters that tend to fluctuate. One example of such a product is Microsoft Exchange Server. Most people don't realize it, but Exchange Server 2003 has server monitoring tools built in.

Obviously, if you are not already running Exchange, it would be silly to deploy Exchange Server just to be able to monitor a server. If you do happen to be running Exchange, though, these tools are worth checking out.

You can access Exchange Server's server monitoring tools by opening the Exchange System Manager and navigating through the console tree to Administrative Groups | your administrative group | Servers | your server. Right-click the server that you want to monitor and select the Properties command from the resulting shortcut menu. When you do, you will see the server's properties sheet. This contains a Monitoring tab that acts as a sort of miniature Performance Monitor.

As you can see in Figure B, there is one very important difference between the Performance Monitor's alert mechanism and the one that's included with Exchange Server. The Exchange Server version contains a duration field. What this means is that an alert won't be generated unless the condition has persisted for a specific length of time. This prevents random spikes of activity from producing false positive alerts.
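The idea behind a duration field can be sketched in Python as follows. This is not how Exchange Server implements it; it is a minimal illustration that assumes the duration is expressed as a number of consecutive samples rather than minutes, and the function name is hypothetical.

```python
def sustained_alerts(samples, threshold, duration):
    """Return the sample indices at which an alert fires.

    An alert fires once per episode, at the moment the value has been
    above `threshold` for `duration` consecutive samples.
    """
    alerts = []
    run = 0  # length of the current streak of over-threshold samples
    for index, value in enumerate(samples):
        if value > threshold:
            run += 1
            if run == duration:
                alerts.append(index)
        else:
            run = 0  # any reading at or below the threshold resets the streak
    return alerts


# Isolated spikes never last long enough to trigger an alert.
spiky = [20, 100, 20, 100, 20, 100, 20]
print(sustained_alerts(spiky, threshold=80, duration=5))

# A sustained overload triggers exactly one alert.
overloaded = [20, 95, 95, 95, 95, 95, 95, 20]
print(sustained_alerts(overloaded, threshold=80, duration=5))
```

Resetting the streak on any reading at or below the threshold is the simplest possible policy. A production monitor might instead average over a sliding window so that one brief dip does not hide a genuine overload.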

Figure B: The CPU Utilization Threshold dialog box found in Exchange Server allows you to set a duration.

Of course, even Exchange Server's alert mechanism has its limits. For starters, it works only for servers that are running Exchange Server. Another drawback is that only a tiny subset of the Performance Monitor counters is represented. If you truly want to be able to monitor all of the Performance Monitor counters without having to worry about false positives, you need to use a product like Microsoft Operations Manager 2005 (MOM 2005).

The idea behind MOM 2005 is that it watches all of the Performance Monitor counters for you. MOM is intelligent enough to know what the threshold values should be for the various counters and what types of conditions indicate a problem. When a problem is detected, an alert is generated and the server's status is displayed through a dashboard-type display similar to the one that's shown in Figure C.

Figure C: Microsoft Operations Manager 2005 is better suited than Performance Monitor to generating counter-related alerts.

In this article, I have explained that there are several ways of monitoring your system's internal counters, but it is important to choose the right tool for the job in order to avoid being flooded with false positives.

About the author: Brien M. Posey, MCSE, is a Microsoft Most Valuable Professional for his work with Windows 2000 Server and IIS. He has served as CIO for a nationwide chain of hospitals and was once in charge of IT security for Fort Knox. As a freelance technical writer, he has written for Microsoft, TechTarget, CNET, ZDNet, MSD2D, Relevant Technologies and other technology companies. You can visit his personal Web site at www.brienposey.com.
