How to respond to 3 common IT alerts
When those IT alerts pop up, the ops team needs to respond. Take steps to deal with the problems -- but also look out for possible sources of the trouble.
Like the tip of an iceberg, an IT alert is the part you see. What's under the surface can be something much larger. Good operations teams need to know how to react.
IT alerts often fall into three main categories: capacity warnings, performance problems and availability failures. The key is to remember that the tool generating an alert is part of a larger IT system; you must be able to follow the flow from the alert to its effects and, finally, to a resolution at the root cause.
1. Capacity trouble
Let's say the operations team learns that a key server or system is running low on space. With virtualized workloads, it's simple enough to increase space. That's a quick fix. Most systems, however, do not run out of space without showing a steady space utilization curve that you should be able to see develop for weeks or months. Systems that suddenly see a surge in space usage and trigger alerts need to be checked out.
Was the system recently patched? Was it upgraded? An admin might have left behind gigabytes of upgrade software without cleaning up. Even when the cause is as simple as left-behind installers, finding and correcting it quickly matters: The additional space usage affects backups and other disaster recovery capabilities, not to mention the cost if these are cloud resources where you pay for everything you use.
The key when it comes to capacity issues is the trend. If you see a steady average growth rate with few spikes, that's likely normal behavior, and the correct solution is to add capacity. Investigate spikes, though, because once you begin to address them by expanding capacity, it's almost impossible to stop. Knee-jerk reactions might fix things for now, but they won't resolve whatever caused that sudden capacity problem.
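To make the difference between steady growth and a spike concrete, here is a minimal Python sketch. It assumes you can export one disk-usage sample per day in gigabytes; the function name `analyze_capacity`, the `spike_factor` of 5 and the 0.1 GB floor are illustrative choices, not values from any particular monitoring tool.

```python
from statistics import median


def analyze_capacity(used_gb, capacity_gb, spike_factor=5.0):
    """Summarize daily disk-usage samples (oldest first, in GB).

    Returns typical daily growth, the sample indexes where usage jumped
    far above that typical growth, and a rough days-until-full estimate.
    """
    if len(used_gb) < 3:
        raise ValueError("need at least three samples")

    # Day-over-day growth between consecutive samples.
    deltas = [later - earlier for earlier, later in zip(used_gb, used_gb[1:])]

    # The median ignores the occasional spike, so it reflects "normal" growth.
    typical = median(deltas)

    # Assumed floor so a nearly flat system does not flag every tiny change.
    floor = max(typical, 0.1)
    spike_days = [i + 1 for i, d in enumerate(deltas) if d > spike_factor * floor]

    # Only project forward when the system is actually growing.
    days_until_full = None
    if typical > 0:
        days_until_full = (capacity_gb - used_gb[-1]) / typical

    return {
        "typical_daily_growth_gb": typical,
        "spike_days": spike_days,
        "days_until_full": days_until_full,
    }


if __name__ == "__main__":
    samples = [100, 101, 102, 103, 123, 124, 125]  # made-up data
    print(analyze_capacity(samples, capacity_gb=200))
```

With the made-up samples above, typical growth is 1 GB per day, which argues for simply adding capacity on schedule. The 20 GB jump at sample index 4 is the part worth investigating before anyone expands the volume.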
2. Sluggish performance
When an application seems to take forever to respond, this is considered a general fault alert. And it can be one of the most complex problems to track down. Many applications use a variety of IT systems, so the source of the problem could be in any number of places.
It's essential to understand the flow of the application. When you know all the pieces that it touches along its path to a user, you can begin to see an overall picture. This allows you to address the problem in bites. The downside to this type of response is that it takes time. And when people are complaining, any delay will seem excessive.
Trending again will be key here. Point-in-time performance stats can't always explain what is going on, but they can help you identify possible places to start. And, when combined with historical data, performance stats might reveal the source of your problem. This data will direct your attention and get you closer to a fix, even if it doesn't show you the root cause.
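One way to combine current stats with historical data is to score each component in the application's path against its own baseline. The sketch below is only an illustration: The component names, the latency figures and the threshold of three standard deviations are assumptions, and a real environment would pull these numbers from its monitoring tool.

```python
from statistics import mean, stdev


def rank_suspects(history, current, threshold=3.0):
    """Rank components whose current latency is far above their baseline.

    history: {component: [past latency samples in ms]}
    current: {component: latest latency in ms}
    Returns (component, z_score) pairs, worst first.
    """
    suspects = []
    for name, samples in history.items():
        # Skip components with no current reading or too little history.
        if name not in current or len(samples) < 2:
            continue
        baseline = mean(samples)
        spread = stdev(samples)
        if spread == 0:
            # Perfectly flat history gives no scale to judge a deviation.
            continue
        z_score = (current[name] - baseline) / spread
        if z_score >= threshold:
            suspects.append((name, round(z_score, 1)))
    return sorted(suspects, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical components along one application's path to the user.
    history = {
        "web": [20, 22, 21, 19, 18],
        "database": [5, 6, 5, 4, 5],
        "storage": [2, 3, 2, 3, 2],
    }
    current = {"web": 21, "database": 40, "storage": 3}
    for name, score in rank_suspects(history, current):
        print(f"{name}: {score} standard deviations above baseline")
```

A ranking like this points to a place to start, not to a root cause. In the sample data, the database stands out, but the reason it slowed down still has to be traced through the application flow.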
3. Availability questions
While hardware and other systems can fail abruptly, it's rare that they do. A big challenge when something goes down is to determine why. That information can be lost when the IT staff works quickly to restore services, because reboots and restores sometimes wipe the data that shows why something failed. It's critical to capture whatever data you can before you begin restoration. This can be something as simple as taking a picture of an error code or dump screen. While all errors should be captured in log files, in reality, that doesn't always happen.
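When the failed system is still reachable, capturing data can be as simple as copying the relevant files somewhere safe before the reboot. The following Python sketch shows one possible approach; the function name `capture_evidence`, the `incident-` folder naming and the example paths are hypothetical, and the files worth saving will differ by platform.

```python
import shutil
import time
from pathlib import Path


def capture_evidence(paths, dest_root):
    """Copy each existing file in `paths` into a new timestamped folder.

    Returns the folder, the names copied and the paths that were missing.
    Files that share a name overwrite each other in this simple sketch.
    """
    stamp = time.strftime("%Y%m%dT%H%M%S")
    dest = Path(dest_root) / f"incident-{stamp}"
    dest.mkdir(parents=True, exist_ok=True)

    copied, missing = [], []
    for path in map(Path, paths):
        if path.is_file():
            # copy2 also preserves modification times, which help with timelines.
            shutil.copy2(path, dest / path.name)
            copied.append(path.name)
        else:
            missing.append(str(path))

    # Record what was and was not captured so gaps are visible later.
    lines = [f"copied  {name}" for name in copied]
    lines += [f"missing {name}" for name in missing]
    (dest / "MANIFEST.txt").write_text("\n".join(lines) + "\n")
    return dest, copied, missing


if __name__ == "__main__":
    # Example paths only; substitute the logs and dumps that matter to you.
    folder, copied, missing = capture_evidence(
        ["/var/log/syslog", "/var/log/app/error.log"], "/tmp/evidence"
    )
    print(f"Saved {len(copied)} file(s) to {folder}; {len(missing)} not found")
```

The manifest lists missing files on purpose. Knowing that an expected log was never written is itself a clue when you look for the cause after services are restored.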
While a change in an IT system often triggers an availability problem, lack of change can also have an effect. It's easy for a busy IT shop to neglect some systems, particularly ones that are not customer-facing. Systems that handle Domain Name System (DNS), Dynamic Host Configuration Protocol (DHCP), key management services and so forth perform their roles without daily care and are easy to forget about. If they aren't rebooted, patched or maintained, these critical services can succumb to memory leaks and crash. Losing a Microsoft key management server or something similar will have wide-ranging effects on all Microsoft products in your environment. That type of problem can be incredibly hard to track down, which is why you must be good at understanding the flow of your applications.
The good and the bad of IT alerts
Alerts in IT are both helpful and annoying. Too many alerts will cause staff to ignore warnings. With too few alerts, staff might miss the chance to react before a small problem becomes a large one.
Some alerts will signal the start of something major, while others will indicate a less-serious matter that can wait until Monday. Seeing the difference comes from knowing the tools in use and understanding environments at a deep level.