How to respond to 4 common IT alerts

IT alerts can announce server failures, congested networks and more. When they pop up, administrators need to act. Take steps to look for the source of trouble.

Like the tip of an iceberg, an IT alert is only the visible part of a problem. What's under the surface can be much larger. Administrators rarely see or touch much of the IT infrastructure that runs in the data center. These components can suffer from problems that aren't visible from the outside, such as server failure, lack of disk space and congested networks.

When users can't reach the resources they need, they are likely to call the service desk. By then, it's already too late. Organizations should have alerts in place that detect problems when they occur or, through trend analysis, before they happen.

Such alerts can reach administrators by email, text message or other channels so they can respond before problems occur or grow out of control.

IT alerts often fall into four main categories: capacity warnings, performance problems, availability issues and security incidents. Admins must be able to find the root cause by starting at the alert.

1. Capacity trouble

Let's say the operations team learns that a key server or other piece of IT infrastructure is running low on space. With virtualized workloads, it's simple enough to increase space. Most IT infrastructures, however, do not run out of space without first showing a steady space utilization curve that develops over weeks or months. IT infrastructures that suddenly see a surge in space usage and trigger alerts need to be checked out.

Was the infrastructure recently patched? Was it upgraded? An admin might have left behind gigabytes of upgrade software without cleaning up. This is a case where hastily adding space instead of removing the cause has a lot of ramifications, even if the cause is as simple as left-behind installers. The additional space usage affects backups and other disaster recovery abilities, not to mention the cost of cloud resources.

The key when it comes to capacity issues is the trend. If usage grows at a steady average rate with few spikes, that's likely normal behavior, and the correct solution is to add capacity. Spikes are different. It is important to investigate them because once admins begin to address these issues by expanding capacity, it's almost impossible to stop. Knee-jerk reactions might fix things temporarily, but they won't resolve whatever caused the sudden capacity problem.
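One way to separate a steady trend from a spike is to fit a line to daily usage samples and flag days whose growth is far above the typical daily change. The Python sketch below assumes one usage sample per day, in GB. The function names, the 3x spike factor, the 1 GB floor and the sample figures are illustrative assumptions, not part of any specific monitoring tool.

```python
from statistics import mean, median


def fit_daily_growth(used_gb):
    """Least-squares slope (GB per day) of one-sample-per-day usage data."""
    n = len(used_gb)
    if n < 2:
        raise ValueError("need at least two daily samples")
    days = range(n)
    x_bar, y_bar = mean(days), mean(used_gb)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(days, used_gb))
    den = sum((x - x_bar) ** 2 for x in days)
    return num / den


def days_until_full(used_gb, capacity_gb):
    """Estimate days left at the fitted growth rate; None if usage is not growing."""
    slope = fit_daily_growth(used_gb)
    if slope <= 0:
        return None
    return (capacity_gb - used_gb[-1]) / slope


def find_spikes(used_gb, factor=3.0, floor_gb=1.0):
    """Return sample indices where one-day growth exceeds factor x typical growth.

    floor_gb (an assumed value) keeps the threshold meaningful when
    typical daily growth is close to zero.
    """
    if len(used_gb) < 2:
        return []
    deltas = [b - a for a, b in zip(used_gb, used_gb[1:])]
    threshold = factor * max(median(deltas), floor_gb)
    return [i + 1 for i, d in enumerate(deltas) if d > threshold]


steady = [100, 101, 102, 103, 104]   # grows 1 GB per day
spiky = [100, 101, 102, 110, 111]    # jumps 8 GB on day 3

print(days_until_full(steady, capacity_gb=114))  # 10.0
print(find_spikes(steady))                       # []
print(find_spikes(spiky))                        # [3]
```

A steady series with no flagged days supports adding capacity on schedule. A flagged day is a prompt to look at what changed on that date, such as a patch or upgrade, before expanding anything.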

2. Sluggish performance

An application that seems to take forever to respond is considered a general fault alert, and it can be one of the most complex problems to track down. Many applications depend on a variety of IT infrastructure components, so the source of the problem could be in any number of places.

To begin to see the overall picture, understand all the pieces that an application touches along its path to a user. This allows admins to address the problem in manageable pieces. The downside to this type of response is that it takes time.

Moment-in-time performance stats can't always explain what is going on, but they can help to identify possible places to start. When combined with historical data, performance stats might reveal the source of the problem. This data will direct attention and bring admins closer to a fix, even if it doesn't reveal the root cause.
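As a rough illustration of combining a moment-in-time reading with historical data, the sketch below scores each component by how far its current reading sits above its own history, then sorts the components so the most unusual one comes first. The component names, the latency figures and the use of a simple z-score are assumptions made for the example. A real baseline would also need to account for time-of-day and weekly patterns.

```python
from statistics import mean, stdev


def deviation_from_baseline(history, current):
    """Number of standard deviations the current reading is above the history.

    Requires at least two historical samples. Negative values mean the
    reading is below the historical average.
    """
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Flat history: any change at all is unusual.
        return 0.0 if current == mu else float("inf")
    return (current - mu) / sigma


def rank_suspects(history_by_component, current_by_component):
    """Return component names, most abnormal (highest score) first."""
    scores = {
        name: deviation_from_baseline(history_by_component[name], reading)
        for name, reading in current_by_component.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical response times in milliseconds for three components.
history = {
    "db": [10, 12, 11, 9, 10],
    "web": [50, 52, 49, 51, 50],
    "net": [5, 5, 6, 5, 4],
}
current = {"db": 30, "web": 51, "net": 5}

print(rank_suspects(history, current))  # ['db', 'web', 'net']
```

The ranking does not name a root cause. It only suggests where to look first, which matches the role the stats play in the investigation.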

3. Availability issues

While hardware and other systems can fail abruptly, they rarely do. A big challenge when something goes down is to determine why. That information can be lost when the IT staff works quickly to restore services, because reboots and restores sometimes wipe the data that explains the failure. It's critical to capture whatever data you can before beginning restoration. This can be something as simple as taking a picture of an error code or dump screen. While all errors should be captured in log files, that doesn't always happen.
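Where the failed system is still reachable, capturing data before restoration can be as basic as copying the relevant log files to a timestamped folder on another volume. The sketch below shows that idea in Python. The function name, the folder layout and the file names are hypothetical, and which files are worth saving depends entirely on the system involved.

```python
import shutil
import time
from pathlib import Path


def capture_evidence(sources, dest_root, label):
    """Copy existing files from sources into a new timestamped folder.

    Returns the folder path and the names of the files that were copied.
    Missing files are skipped so a partial capture still succeeds.
    Files that share a name overwrite one another in this simple sketch.
    """
    stamp = time.strftime("%Y%m%dT%H%M%S")
    target = Path(dest_root) / f"{label}-{stamp}"
    target.mkdir(parents=True, exist_ok=False)
    copied = []
    for src in map(Path, sources):
        if src.is_file():
            shutil.copy2(src, target / src.name)  # copy2 keeps modification times
            copied.append(src.name)
    return target, copied
```

A call such as `capture_evidence(["/var/log/syslog"], "/mnt/evidence", "db01")` would be run before the reboot or restore, with both paths here serving only as placeholders.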

While a change in an IT infrastructure often triggers an availability problem, lack of change can also have an effect. Services that handle DNS, Dynamic Host Configuration Protocol (DHCP), key management and so forth perform their roles without daily care and are easy to forget. If they aren't rebooted, patched or maintained, these critical services can succumb to memory leaks and crashes. Losing a Microsoft key management server or something similar will have wide-ranging effects on all Microsoft products in an environment. That type of problem can be incredibly hard to track down, which is why admins must understand the flow of their applications.

4. Security incidents

Security incidents are a growing concern for IT administrators. Security-related issues can lead to any of the other three problems. For example, a denial-of-service attack can cause capacity or performance problems and, if no action is taken, a complete loss of availability.

IT infrastructures need tools in place that alert administrators to security breaches in the IT environment. Many vendors offer intrusion detection system (IDS) and intrusion prevention system (IPS) tools that can generate alerts or remediate the environment. Completely integrated services such as VMware NSX can combine the firewall with IDS/IPS, malware prevention and detection of suspicious behavior on the network. The network detection and response feature offers a holistic approach to provide insights into what's happening and how to respond.

The good and the bad of IT alerts

Alerts in IT are both helpful and annoying. Too many alerts will cause staff to ignore warnings. With too few alerts, staff might miss the chance to react before a small problem becomes a large one.

Some alerts will signal the start of something major, while others will indicate a less serious matter that can wait until Monday. Seeing the difference comes from knowing the tools in use and understanding environments at a deep level.
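The balance between too many and too few alerts can be sketched as a small triage step: suppress repeats of an alert that was seen recently, page someone for severe alerts and queue the rest for review. The Python example below is a minimal sketch under assumed conventions. The `Alert` fields, the 1-to-5 severity scale, the paging threshold and the 10-minute window are all hypothetical choices, not settings from a particular product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    source: str       # host or service that raised the alert
    category: str     # "capacity", "performance", "availability" or "security"
    severity: int     # assumed scale: 1 = informational, 5 = critical
    timestamp: float  # seconds since some reference point


def triage(alerts, page_threshold=4, dedup_window=600.0):
    """Split alerts into (page_now, review_later), dropping recent repeats.

    A repeat is an alert with the same source and category as one accepted
    less than dedup_window seconds earlier. This simple rule also drops a
    repeat that arrives with a higher severity, which a real system would
    likely want to handle differently.
    """
    last_accepted = {}
    page_now, review_later = [], []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.source, alert.category)
        previous = last_accepted.get(key)
        if previous is not None and alert.timestamp - previous < dedup_window:
            continue  # repeat of a recent alert, suppress it
        last_accepted[key] = alert.timestamp
        if alert.severity >= page_threshold:
            page_now.append(alert)
        else:
            review_later.append(alert)
    return page_now, review_later


alerts = [
    Alert("db01", "availability", 5, 0.0),
    Alert("db01", "availability", 5, 60.0),   # repeat within 10 minutes
    Alert("web02", "capacity", 2, 120.0),     # can wait until Monday
]
page_now, review_later = triage(alerts)
print(len(page_now), len(review_later))  # 1 1
```

Where the threshold and window should sit is exactly the judgment described above, and it depends on knowing the tools and the environment.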

This article was originally written by Brian Kirsch and expanded by Rob Bastiaansen.

Rob Bastiaansen is an independent trainer and consultant based in the Netherlands specializing in VMware and Linux. He writes articles for several print and online publications, and is founder of VMwarebits.com, a site dedicated to technical content related to VMware.
