Forget monitoring alerts, turn to IT root cause analysis

Alerts get your attention, but they don't always tell you where the core of a problem is to be found. Maybe it's time to shift your IT management strategy.

It wasn't that long ago that there were two primary ways to see what was going on across a technical platform: Put in place a complex and expensive framework product, such as IBM Tivoli or HP OpenView; or just plow through the alert logs. In fact, many IT staffs still wait for a problem and then turn to the logs.

Both of these methods, however, are somewhat like looking for the proverbial needle in a haystack. The meaningful information point is hidden within all the other items that were flagged -- just in case.

These days, alerts are a little better, if only slightly. Many monitoring systems are now modular, so you don't need to establish massive frameworks. Most come with red/amber/green traffic light indicators that flag when something is likely to be of interest and filter out much of the rubbish.
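As a rough illustration of how a red/amber/green indicator filters out noise, here is a minimal Python sketch. The function names, metric names and thresholds are hypothetical, chosen only for this example, and it assumes that higher readings are worse.

```python
from enum import Enum


class Status(Enum):
    GREEN = "green"
    AMBER = "amber"
    RED = "red"


def classify(value: float, amber_at: float, red_at: float) -> Status:
    """Map a metric reading to a traffic-light status.

    Assumes higher readings are worse and amber_at <= red_at.
    """
    if value >= red_at:
        return Status.RED
    if value >= amber_at:
        return Status.AMBER
    return Status.GREEN


def needs_attention(readings: dict[str, float],
                    amber_at: float,
                    red_at: float) -> dict[str, Status]:
    """Keep only the readings that are not green."""
    statuses = {name: classify(v, amber_at, red_at)
                for name, v in readings.items()}
    return {name: s for name, s in statuses.items() if s is not Status.GREEN}


# Hypothetical CPU utilization percentages per host.
cpu = {"web-1": 35.0, "web-2": 78.0, "db-1": 96.0}
flagged = needs_attention(cpu, amber_at=70.0, red_at=90.0)
```

Here `flagged` contains only `web-2` (amber) and `db-1` (red). The filter narrows down where to look, but it says nothing about why those hosts are busy.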

Find the actual problem

The Holy Grail is a monitoring system that can see a problem, identify its root cause and let an administrator know about that -- preferably with capabilities to auto-remediate the issue wherever possible.

The need for good IT root cause analysis can be seen with a couple of simple examples:

  • An app is running slowly. Does the problem result from bad code, a memory leak or resource constraints? Searching through multiple monitoring systems can prolong the problem. Steps taken to fix it may work -- or they may not.
  • Data is getting lost somewhere. Is it a wayward app, a corrupt storage system or something else? Maybe the instance was automatically spun up and the virtual LUN is wrong. Or maybe a virtual storage vault has been filled.

What's needed is a monitoring system that aggregates the event logs coming from across a platform and then tries to make sense of all that. Aggregation used to be the key strength of Splunk, which worked mainly alongside other monitoring and remediation systems. As some of these systems started to bring in their own aggregation capabilities, Splunk shifted its strategy to become more of a remediation system.
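The aggregation step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Splunk or any other product works: the `Event` type, the source names and the 30-second window are all hypothetical, and each source's events are assumed to arrive already sorted by time.

```python
import heapq
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    timestamp: datetime
    source: str   # hypothetical source name, e.g. "app" or "storage"
    message: str


def aggregate(*streams):
    """Merge per-source event lists (each time-sorted) into one timeline."""
    return list(heapq.merge(*streams, key=lambda e: e.timestamp))


def group_by_window(events, window: timedelta):
    """Split a time-sorted timeline into bursts of related events.

    A new group starts whenever the gap to the previous event
    exceeds `window`.
    """
    groups = []
    for event in events:
        if groups and event.timestamp - groups[-1][-1].timestamp <= window:
            groups[-1].append(event)
        else:
            groups.append([event])
    return groups


t0 = datetime(2024, 1, 1, 12, 0, 0)
app_log = [
    Event(t0, "app", "slow response"),
    Event(t0 + timedelta(minutes=10), "app", "timeout"),
]
storage_log = [Event(t0 + timedelta(seconds=5), "storage", "volume full")]

timeline = aggregate(app_log, storage_log)
bursts = group_by_window(timeline, timedelta(seconds=30))
```

Grouping by time is a crude way to "make sense" of the merged logs: the slow response and the full volume land in the same burst, which hints that they may be related. A real system would need far better correlation than timestamps alone.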

Shifting to IT root cause analysis

Systems management companies such as SolarWinds, Kaseya and ManageEngine all have monitoring and remediation systems built into them.

Some of these tools incorporate artificial intelligence. They also present a far more intuitive interface so that they no longer require a highly skilled (and highly expensive) systems administrator.

Often, problems are now flagged with the familiar red/amber/green indicators. Clicking on a flagged item, though, takes the user down through the layers to where the actual problem lies, sometimes with advice on what to do or an offer of automatic remediation.
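One way to picture that drill-down is as a walk down a dependency map, from the component that raised the alert to the deepest unhealthy thing it relies on. The Python sketch below is an assumption-laden illustration, not any vendor's method: the component names, the `depends_on` map and the `unhealthy` set are all hypothetical.

```python
def root_causes(start, depends_on, unhealthy):
    """Walk down from an alerting component to its likely root causes.

    A component counts as a likely root cause if it is unhealthy and
    none of its own dependencies are unhealthy.
    """
    causes = []
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        if node in seen or node not in unhealthy:
            continue
        seen.add(node)
        bad_deps = [d for d in depends_on.get(node, []) if d in unhealthy]
        if bad_deps:
            stack.extend(bad_deps)   # the trouble lies further down
        else:
            causes.append(node)      # nothing below is failing
    return sorted(causes)


# Hypothetical platform: each component lists what it relies on.
depends_on = {
    "web": ["app"],
    "app": ["db", "cache"],
    "db": ["storage"],
}
unhealthy = {"web", "app", "db", "storage"}

likely = root_causes("web", depends_on, unhealthy)
```

In this example four components show red, but `likely` holds only `storage`, the one layer that is failing without an unhealthy dependency beneath it. That is the item an admin, or an automated remediation step, should look at first.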

As such capabilities come through, admins will spend less time trying to identify and deal with what are often simple issues. These capabilities should also reduce the number of problems admins inadvertently introduce while trying something just to see if it solves the original trouble. The result should be higher uptime and better performance, with more time left for adding actual value to the IT platform.

So, where does this leave us? We should stop relying on raw alerts and pay less attention even to filtered ones. Now is the time to look ahead to IT monitoring tools that get us closer to actual root cause analysis and remediation.
