Machine learning can elevate an IT organization's performance monitoring strategy -- and, given the wide array of mature machine learning algorithms and frameworks now available, it's not as complicated to learn as it used to be.
Editor's note: Machine learning plays a growing role in IT organizations. Check out these articles on how IT teams can use this type of artificial intelligence for log analysis and anomaly detection. Organizations not ready to adopt machine learning can also consider time-series monitoring.
Let's review some of the key concepts related to machine learning in IT performance monitoring, in general, and then walk through an example using Apache Mesos and the K-means clustering algorithm.
Collect and define metrics
Most monitoring systems vacuum up logs, parse individual fields and then display them on a dashboard. But to predict or detect an outage, or anticipate a surge in demand, IT teams require metrics from a wide variety of systems -- business, technical and external -- fed into an algorithm.
However, each application and business is different, so no single algorithm can be built into performance monitoring tools to cover every case.
Instead, IT admins must write this code themselves. This process isn't terribly complicated, but does require knowledge of machine learning, as well as programming skills. An organization's existing monitoring systems provide most of the data it needs.
Think of machine learning algorithms as a black box: You throw a wealth of data into it and hope something useful comes out the other end. Machine learning models work best when they have hundreds or thousands of data points from which to draw conclusions.
There are metrics that affect IT system performance, such as spikes in traffic volume, and metrics that reflect them, such as web page latency. Machine learning enables admins to use both, which is another improvement over the log-scraping-in-isolation approach to IT monitoring. Some metrics an IT admin could plug into that "black box" include:
- network traffic volume, by source and target IP address;
- storage in use;
- end-to-end app latency;
- replication latency; and
- message queue length.
An admin with a spreadsheet of this data could only guess at which data points might be correlated. Data science eliminates the guesswork: It points out which items actually correlate, and provides the tools to flag anomalies and make predictions about system health and demand. And it can do this with hundreds of metrics, whereas a human with a spreadsheet can track only two or three.
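To make that concrete, here is a minimal sketch of measuring correlation programmatically. The metric names and values are invented for illustration; real inputs would come from the monitoring database.

```python
import numpy as np

# Invented metric samples taken once a minute; real values would come from
# the monitoring database.
traffic   = np.array([120, 150, 180, 240, 300, 420, 500, 610])  # requests/sec
latency   = np.array([ 80,  85,  95, 130, 170, 260, 310, 400])  # ms
queue_len = np.array([  3,   1,   4,   2,   5,   3,   2,   4])  # messages

# Pearson correlation matrix: entries near +/-1 mean strong linear correlation.
corr = np.corrcoef([traffic, latency, queue_len])

print(f"traffic vs latency:   {corr[0, 1]:.2f}")  # strongly correlated
print(f"traffic vs queue_len: {corr[0, 2]:.2f}")  # little correlation
```

With many metrics, the same matrix scales to hundreds of columns -- which is exactly the comparison a human with a spreadsheet cannot do.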
The challenge with labelled data
There are two primary kinds of machine learning algorithms: supervised and unsupervised. Supervised machine learning algorithms enable predictive models to, for example, predict system outages before they happen. This is possible as an outage is usually preceded by a cascading series of events.
To support these predictive models, however, IT teams require classified or labelled data. This data captures relationships between a cause and an effect -- such as a certain IT metric that results in a certain system status. A lack of comprehensive labelled data sets remains one of the biggest hurdles to the use of machine learning in IT performance monitoring.
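As a sketch of what a supervised model looks like once labelled data exists, the toy example below trains a scikit-learn classifier on invented rows that pair metric values with an outage/no-outage label. The metrics, values and labels are all hypothetical; a real training set would need far more rows and metrics.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical labelled rows: [replication latency in ms, message queue length],
# labelled 1 if an outage followed within the hour, 0 otherwise.
X = np.array([[20, 2], [25, 3], [30, 2], [200, 40], [250, 55], [300, 60]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X, y)

# Score a new snapshot of the same two metrics: high latency and a long
# queue land on the outage side of the learned boundary.
print(model.predict([[280, 50]]))
```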
Let's look at an example of how to use machine learning for IT performance monitoring using Apache Mesos and the K-means clustering algorithm for data clustering and analysis.
Mesos is a cluster management tool and resource manager. The metrics it provides span the entirety of an application, rather than pertain to a single attribute of a single application.
Mesos calls these observability metrics, of which there are approximately 100 -- a selection of them are shown below. Some are gauges -- metrics that measure an instant in time -- and others are counters -- metrics that accumulate. Mesos gathers that data at both the agent and master level (editor's note: The industry's terminology around master-agent architectures is still in flux). The master is the primary machine that coordinates the work of the agents, which are servers and containers throughout the organization's IT environment.
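The gauge-versus-counter distinction matters when the data is analyzed: a gauge is meaningful on its own, while a counter only becomes meaningful as a rate between two snapshots. A small illustration, with invented values:

```python
# Illustrative snapshot values (not real Mesos output) taken 60 seconds apart.
snapshot_t0 = {"master/mem_percent": 0.42,              # gauge: instant reading
               "master/messages_status_update": 1200}   # counter: cumulative total
snapshot_t1 = {"master/mem_percent": 0.57,
               "master/messages_status_update": 1500}
interval_sec = 60

# A gauge can be read directly from the latest snapshot.
mem_now = snapshot_t1["master/mem_percent"]

# A counter must be differenced between snapshots to yield a rate.
rate = (snapshot_t1["master/messages_status_update"]
        - snapshot_t0["master/messages_status_update"]) / interval_sec

print(mem_now, rate)  # 0.57 5.0
```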
To determine which of these metrics would work best for your IT monitoring needs, plug them all into the machine learning model -- in this case, the K-means clustering algorithm -- described below. The build process for these models is iterative: Part of tuning the model requires admins to remove certain metrics. Machine learning enables admins to identify the relative importance of each metric.
To apply machine learning to this scenario, take snapshots of this data over time and then plug it into an unsupervised machine learning model. This will require two programmers or two sets of skills.
First, Mesos does not persist these metrics, so write each data point to a database. Elasticsearch, for example, is best known as a log collection tool, but it is also a document database. Second, a data science programmer must write the model code; one of the easiest models to start with is K-means clustering.
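A minimal sketch of that first step follows. For portability, SQLite stands in for Elasticsearch here; the master address matches the placeholder used in the curl example later in this article, and the function and table names are this sketch's own.

```python
import json
import sqlite3
import time
from urllib.request import urlopen

# Placeholder master address; substitute your own master's host and port.
MASTER_URL = "http://172.31.47.43:5050/metrics/snapshot"

def fetch_snapshot(url=MASTER_URL):
    # Pull one snapshot of observability metrics from the Mesos master.
    with urlopen(url) as resp:
        return json.load(resp)

def store_snapshot(snapshot, conn):
    # Persist every metric in the snapshot as a timestamped row.
    conn.execute("CREATE TABLE IF NOT EXISTS metrics (ts REAL, name TEXT, value REAL)")
    ts = time.time()
    conn.executemany("INSERT INTO metrics VALUES (?, ?, ?)",
                     [(ts, name, value) for name, value in snapshot.items()])
    conn.commit()

# Demonstrate the storage path with a hand-written snapshot.
conn = sqlite3.connect(":memory:")
store_snapshot({"master/uptime_secs": 3600.0, "master/tasks_running": 12.0}, conn)
```

Run on a schedule (for example, once a minute), this builds up the time series of snapshots that the clustering model consumes.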
How K-means clustering works
Plug all those Mesos data points into a K-means clustering algorithm. It will find patterns in data by grouping it into clusters, as in the graph below.
There is no labelled data in unsupervised machine learning algorithms, which means the clusters in such a graph carry no inherent meaning for the viewer. The graph only illustrates that the three clusters of data are somehow different from one another.
While this does not indicate issues with a specific application, the data does point to a subset of machines, networks and systems to investigate further. Because IT performance monitoring aims to identify events that are outside of the norm, start with those metrics in the smallest cluster.
Then, feed those same metrics into that same machine learning model and divide that cluster again. Or, change the mix of metrics fed into the model by either dropping some metrics or adding others. Continue to build on the results and drop metrics until the data is visually recognizable -- such as isolating an issue to a subnet.
As mentioned above, this is an iterative process: The machine learning models only focus the IT admin's view onto what is important. It separates the signal from the noise.
Every day, IT must repeat this process. Choose a cluster on which to focus. Drop or add metrics. Write code to drop records from the data that might not be relevant -- machine learning algorithms will point those out. Query individual machines and metrics. Evaluate the model. Repeat.
These steps collect into a machine learning pipeline, which should run in a continual loop until it produces the data that suits your organization's situation and mix of machines and applications. The goal is to be able to derive a single metric: system status.
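The narrowing loop described above can be sketched as follows, with synthetic data standing in for real metric snapshots:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic metric snapshots: a large "normal" group, a smaller distinct group
# and a handful of outliers worth investigating.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (400, 5)),
               rng.normal(6, 1, (40, 5)),
               rng.normal(12, 1, (8, 5))])

for step in range(2):
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    smallest = np.argmin(np.bincount(labels))   # the cluster with fewest members
    X = X[labels == smallest]                   # zoom in on it and repeat
    print(f"step {step}: narrowed to {len(X)} snapshots")
```

Each pass discards the bulk of "normal" snapshots and keeps only the outliers, which is the signal-from-noise separation the pipeline is after.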
Write a K-means clustering program for Apache Mesos
The availability of machine learning frameworks has made it much easier to write machine learning programs. Most of these frameworks are built with Python. The scikit-learn Python machine learning framework is one of the easiest to use -- and one of the most widely used. The creation of the chart, as shown below, might be the most complex part of this process. Use the matplotlib plotting library for that task.
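Under those assumptions, a minimal end-to-end program might look like the sketch below. The data is synthetic; in practice, each row would be one snapshot of metrics read back from the database, and the axis labels would name real metrics.

```python
import matplotlib
matplotlib.use("Agg")              # headless backend; no display needed
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for metrics read back from the database:
# one row per snapshot, one column per metric.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (100, 2)),
               rng.normal(5, 1, (100, 2)),
               rng.normal([0, 8], 1, (30, 2))])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Scatter plot coloured by cluster, with centroids marked.
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, s=10)
plt.scatter(*kmeans.cluster_centers_.T, marker="x", color="red")
plt.xlabel("metric 1")
plt.ylabel("metric 2")
plt.savefig("clusters.png")
```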
Make REST API calls to the Mesos Observability endpoints and save those metrics in a database.
Run this command against the master:
curl http://172.31.47.43:5050/metrics/snapshot
It will echo JSON lines that look like the output below. This is the information you save to the database.
In JSON, this is a superset of the data shown in the master screen below:
This output also displays the information on each of the agents: