Tips and tools for making the most of event log files
Nearly every part of your system reports events that can be logged. This tip discusses which events you should log and best practices for making sense of that information once it is logged.
These days, nearly every part of your system reports events that can be logged somewhere. Applications, networks, servers and storage arrays report events. In fact, the first major problem with event reporting is deciding which events you want to log.
The second major problem is trying to decide how to make sense of all that information. Consider that a single action, even something as simple as a read or write, can produce up to a dozen log entries. Even a moderately large system can pump out a million or more reportable events every day. Many of these logs have the potential to be important (at least some of the time) to someone. Log files can be used for many different types of system analysis, but I'll focus on using them to monitor storage system performance.
While all these events are reported, the vast majority of them are only logged if you tell the system to log them using a tool like Windows Performance Monitor (or perfmon). Because there are so many events, many people simply don't keep log files.
This strategy of blissful ignorance is almost guaranteed to come back and bite you when there's a problem. Log files are critical to understanding what's going on in your storage system, finding problems and applying the correct fixes.
Pruning your log files
Deciding what to trim from your log files is tricky because, at the time you decide what to log, you often can't know what's going to be important. The best approach is to log the items related to what you're interested in, such as storage performance, and then narrow down what you look at depending on what you need.
Most event logging tools will let you pare down their reports after the fact to focus on what's important. That is, if an event is in the log, you can pull it out later with a filter based on what seems to be happening. So you can be more inclusive in deciding what to log and still not get buried in useless data. Many of the tools will let you specify filters based on patterns. A single event of a particular type may not be significant, but if you see dozens or hundreds of the same kind of event, or other interesting patterns, you may want to take a closer look.
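To make the idea of pattern-based filtering concrete, here is a minimal Python sketch that counts how often each event type appears and keeps only the ones that repeat. The record layout, the event IDs and the function name frequent_events are all made up for illustration; a real tool would expose this through its own filter settings.

```python
from collections import Counter

def frequent_events(events, min_count=3):
    """Return {event_id: count} for event IDs seen at least min_count times."""
    counts = Counter(event["event_id"] for event in events)
    return {eid: n for eid, n in counts.items() if n >= min_count}

# Hypothetical sample records; real entries would come from your logging tool's export.
sample = [
    {"event_id": 51, "source": "disk"},
    {"event_id": 51, "source": "disk"},
    {"event_id": 7, "source": "disk"},
    {"event_id": 51, "source": "disk"},
    {"event_id": 9, "source": "iscsi"},
]

print(frequent_events(sample, min_count=3))  # prints {51: 3}
```

The threshold of three is arbitrary. In practice you would tune min_count to the volume of events your system produces.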
Pick the right log formats
Programs generate logs in a lot of different formats. Although most common event reports are based on ASCII, many of them have special formatting or additional information that takes them beyond simple text files. Fortunately, most event logging programs can convert those event reports into one or more of about a dozen standard formats. Often, you get a choice of which format you want the output in.
While the format that's most useful depends on the tools you're using to analyze the data, CSV and SQL are probably the most useful formats for manipulating log file data. CSV is supported by spreadsheets like Excel so you can easily analyze the output. SQL format can be used to generate elaborate queries and reports using an SQL-aware application like Log Parser, a free tool from Microsoft, or an SQL database like Access or SQL Server.
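As a rough illustration of why the CSV and SQL combination is handy, the following Python sketch loads a CSV export into an in-memory SQLite table and summarizes it with a query. It uses SQLite from the Python standard library rather than Log Parser, Access or SQL Server, and the column names and sample rows are assumptions, not the layout of any particular tool's export.

```python
import csv
import io
import sqlite3

# Hypothetical CSV export; the column names are assumptions for illustration.
CSV_TEXT = """timestamp,source,event_id,level
2024-01-01 08:00:00,disk,51,warning
2024-01-01 08:05:00,disk,51,warning
2024-01-01 08:07:00,app,1000,error
2024-01-01 09:00:00,disk,7,error
"""

def load_events(csv_text):
    """Load CSV log rows into an in-memory SQLite table named events."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events (timestamp TEXT, source TEXT, event_id INTEGER, level TEXT)"
    )
    rows = csv.DictReader(io.StringIO(csv_text))
    # Each CSV row is a dict, so named placeholders map straight onto the columns.
    conn.executemany(
        "INSERT INTO events VALUES (:timestamp, :source, :event_id, :level)", rows
    )
    return conn

def counts_by_source(conn):
    """Return (source, event count) pairs, busiest source first."""
    return conn.execute(
        "SELECT source, COUNT(*) FROM events "
        "GROUP BY source ORDER BY COUNT(*) DESC, source"
    ).fetchall()

print(counts_by_source(load_events(CSV_TEXT)))  # prints [('disk', 3), ('app', 1)]
```

Once the data is in a table, the same approach extends to grouping by hour, by event ID or by severity with ordinary SQL.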
Don't forget your applications
Although storage performance analysis usually focuses on system statistics, it can be important to know what your applications are doing, too. This is especially true if an application is taking up a lot more storage than it should.
Most applications (and a lot of hardware) can report events, and some create their own logs, but they don't always do it in ways that are completely compatible with your operating system logs. For example, many applications put their logs in their own folder by default rather than in the system folder. So, it may take more work to make application and hardware logs accessible, but it's usually worth it.
Use alarms and reports
The key to dealing with log files is to concentrate on what's important. In log file analysis, organizing data is at least as important as the data itself. With most tools you can set your system to produce regular reports summarizing important activity, with events of interest highlighted. Many storage administrators use these reporting abilities to give them reports at regular intervals -- such as a high-level view first thing in the morning. Some of the fancier third-party tools have dashboards that can display data from current reports at any time with a couple of mouse clicks.
If something goes really off -- such as running out of disk space because you forgot to dump the old log files -- you don't want to wait for the next report. Most of the tools can issue warnings and alarms based on thresholds you set. If you have set the thresholds properly (i.e., to give you enough warning without burying you in false alarms), you can get a jump on the situation before it becomes a real problem.
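A threshold check of this kind can be sketched in a few lines of Python. The 20% warning and 10% critical levels below are placeholder assumptions, not recommendations, and the function names are invented for the example; shutil.disk_usage is part of the Python standard library.

```python
import shutil

def free_space_level(free_bytes, total_bytes, warn_pct=20.0, critical_pct=10.0):
    """Classify free space as 'ok', 'warning' or 'critical'.

    The default percentages are illustrative assumptions; set them to suit
    how fast your volumes fill up.
    """
    free_pct = 100.0 * free_bytes / total_bytes
    if free_pct < critical_pct:
        return "critical"
    if free_pct < warn_pct:
        return "warning"
    return "ok"

def check_volume(path):
    """Classify the free space on the volume that holds path."""
    usage = shutil.disk_usage(path)
    return free_space_level(usage.free, usage.total)

print(check_volume("."))
```

Using two levels rather than one is a simple way to get an early warning without treating every dip in free space as an emergency.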
Trend and graph your data
Trending and graphing are two of the most important features you can have for keeping up with what's happening. Graphing is an easy way to view a lot of data at once, while trends show you what's happening over time.
For storage, you want to watch performance and capacity trends to get an early warning of the need to add more storage, failing disks, poor performance because of network congestion and other issues. If you look at today's numbers, things may look fine, but if you have a chart showing the figures over the last couple of months you may see a very different story.
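One simple way to turn a capacity chart into an early warning is to fit a straight line to daily usage figures and see when it crosses the size of the volume. The Python sketch below does that with an ordinary least-squares fit. It assumes one sample per day and roughly linear growth, which real systems won't always follow, and the function names are invented for the example.

```python
def linear_fit(ys):
    """Least-squares slope and intercept for ys sampled at x = 0, 1, 2, ...

    Needs at least two samples.
    """
    n = len(ys)
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def days_until_full(used_gb, capacity_gb):
    """Estimate days from the last sample until capacity is reached.

    Returns None when usage is flat or shrinking.
    """
    slope, intercept = linear_fit(used_gb)
    if slope <= 0:
        return None
    full_day = (capacity_gb - intercept) / slope
    return max(0.0, full_day - (len(used_gb) - 1))

# Hypothetical daily usage figures in GB on a 128 GB volume.
print(days_until_full([100, 102, 104, 106, 108], capacity_gb=128))  # prints 10.0
```

Today's figure of 108 GB on a 128 GB volume looks comfortable on its own, but the trend says the volume fills in about 10 days.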
Rotate your log files
You need to set up a regular rotation schedule to purge old log files. Log files can get very large very quickly. If you're serious about log files you need to establish a retention policy for your files. Usually logs are rotated based on time interval or size. For example, you might want to rotate your files daily at the same time, or you might swap them out when the file hits a certain size. Rotation doesn't have to be an either-or proposition. If a log file contains significant events, you can set up your system to save it longer or archive it in secondary storage.
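For logs written by your own scripts, both kinds of rotation are available in Python's standard logging.handlers module. The sketch below is only an illustration of the two policies. The helper names, the 10 MB size limit and the retention counts are assumptions, and it does not manage logs produced by perfmon or other tools.

```python
import logging
import logging.handlers

def size_rotated_handler(path, max_bytes=10_000_000, backups=5):
    """Start a new file when the current one reaches max_bytes; keep `backups` old files."""
    return logging.handlers.RotatingFileHandler(
        path, maxBytes=max_bytes, backupCount=backups
    )

def daily_rotated_handler(path, days_kept=30):
    """Start a new file at midnight; keep `days_kept` old files."""
    return logging.handlers.TimedRotatingFileHandler(
        path, when="midnight", backupCount=days_kept
    )

def make_logger(name, handler):
    """Attach a rotating handler to a named logger."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger
```

A script would call something like make_logger("storage", size_rotated_handler("storage.log")) once at startup. The backup count acts as a crude retention policy, since older files beyond that count are deleted. Logs worth keeping longer would need to be copied to secondary storage before they age out.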
Use rules of thumb to help spot problems
Not everything your log files will show you is cut and dried. While experience with your system is the best teacher, there are also commonly available heuristics to help guide you in analyzing event logs. For example, the average disk request queue length should stay below 3.0 even on a heavily loaded system. If the system is heavily loaded, you may see occasional spikes as high as 20. But if you see too many spikes, or if the average length on a heavily loaded system goes above 2.5 to 3.0, you most likely have a problem.
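The queue length rule can be written down as a quick check over a series of samples. In the Python sketch below, the 3.0 average limit comes from the rule of thumb above. The definition of a spike (a sample of 10 or more) and of "too many" (more than 5% of samples) are my own assumptions for illustration, as is the function name.

```python
def assess_queue_length(samples, avg_limit=3.0, spike_level=10.0, max_spike_fraction=0.05):
    """Apply the queue length rule of thumb to a list of samples.

    avg_limit follows the rule of thumb in the text; spike_level and
    max_spike_fraction are illustrative assumptions.
    """
    average = sum(samples) / len(samples)
    spikes = sum(1 for s in samples if s >= spike_level)
    findings = []
    if average > avg_limit:
        findings.append("average queue length above limit")
    if spikes / len(samples) > max_spike_fraction:
        findings.append("too many spikes")
    return findings

# One spike to 20 in 100 samples on an otherwise quiet disk: nothing to report.
print(assess_queue_length([1.0] * 99 + [20.0]))  # prints []
```

An empty list means the samples pass both checks. Anything else is a prompt to look more closely, not a diagnosis.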
One source of rules of thumb and other useful information for judging storage performance is Microsoft's paper: "Disk Subsystem Performance Analysis for Windows."
Six log analysis tools
Perfmon and similar logging applications have some log management abilities, but they tend to be pretty limited. You're almost certainly going to want more visibility into your system than that.
While you can manage your log files by writing your own scripts and applications, there are a lot of available tools to help analyze and report on your log files. They range from elaborate analysis applications, like Log Parser and Splunk (a free download at Splunk.com), to simple scripts that people have written for their own use and made freely available on the Internet.
When choosing event log analysis tools, keep in mind that not all log management tools are aimed at performance analysis. In fact, the hot areas today are Web statistics, security and compliance. Vendors like LogLogic Inc. and products like Prism Microsystems Inc.'s EventTracker are aimed at security and compliance. Some, such as LogRhythm, are designed for all three markets.
Also remember that not all log analyzers work on all systems. For example, Microsoft's Server Performance Advisor is a very useful tool -- if you happen to be running Windows Server 2003. It won't work on other Windows operating systems.
Log Parser is the Swiss Army knife of log analysis tools. This package, which is available free from Microsoft, may well be the best value in log analysis tools for SMBs running Windows. Although Log Parser was originally written to analyze IIS log files, it has been expanded to cover log analysis of all sorts. It is built around a SQL-like query engine and uses a subset of SQL to generate reports, and it also offers graphing and other reporting functions.
Not only does Log Parser support about a dozen log and report formats out of the box, but you can also add your own formats using OLE automation. One particularly nice format for instant analysis is Datagrid, which presents the results of a SQL query in a Windows dialog box.
If Log Parser is the Swiss Army knife of log tools, Splunk is a three-in-one lathe-mill-drill machine tool. Billed as a search engine for IT, Splunk lets you handle event logs from many servers, networks and other sources and reduce them to elaborate reports, alarms, graphs and other goodies to help you keep track of what's going on with your system. Like a three-in-one machine tool, Splunk isn't cheap. Prices for licenses start at $5,000 and go up depending on the amount of data.
LogRhythm is another very powerful log analysis tool -- a CNC machining center of log analysis. It uses a dedicated (hardware) appliance to collect, process and analyze event logs from the operating systems and applications across the entire business. LogRhythm bills itself as an enterprise-class tool and it has the features needed to support large companies.
Event Analyst from Dorian Software is an inexpensive tool ($69.99 for a single-server license) that handles basic log file analysis and reporting.
Event Log Explorer
Event Log Explorer is a simple freeware utility that monitors event logs.
Event Log Watchdog
Event Log Watchdog is a utility that sends an email alert if a specified event shows up in your event log. You define the events you want to be alerted to via email.
About the author: Rick Cook specializes in writing about issues related to storage and storage management.