E-Handbook: Practical advice on cloud application management Article 2 of 4

Metrics that matter in cloud application monitoring

Error rates, compute cost, requests per minute -- there are many metrics to look at as part of a strategy for cloud application monitoring. Which ones should you prioritize?

For more than two decades, IT teams have been deploying application performance management (APM) tools to monitor and manage on-premises applications and infrastructure. But when organizations move to the cloud, these APM strategies need to evolve.

Cloud APM requires an organization to track more metrics than on-premises APM. There are also additional considerations to weigh for collecting and analyzing metrics data when dealing with cloud-based environments.

What's different about cloud APM?

At first glance, cloud environments and on-premises environments may not seem radically different as far as application monitoring is concerned. Cloud applications still run on servers -- usually, anyway -- and handle transactions in ways that are typically similar to on-premises apps.

You can use certain monitoring approaches both on premises and in the cloud. For example, the RED method focuses on three metrics for each service: request rate, errors and duration.
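
As a rough illustration of the RED method, the sketch below computes rate, error ratio and mean duration for one window of request records. The `Request` record, its field names and the 60-second window are assumptions made for this example; they do not come from any particular APM tool.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record; field names are illustrative only."""
    timestamp: float    # seconds since the start of the window
    duration_ms: float  # total time to complete the response
    is_error: bool      # True if the request ended in an error


def red_summary(requests, window_seconds):
    """Return rate (requests/second), error ratio and mean duration for one window."""
    total = len(requests)
    if total == 0:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "mean_duration_ms": 0.0}
    errors = sum(1 for r in requests if r.is_error)
    mean_duration = sum(r.duration_ms for r in requests) / total
    return {
        "rate_per_s": total / window_seconds,
        "error_ratio": errors / total,
        "mean_duration_ms": mean_duration,
    }


# Four made-up requests observed during a 60-second window.
sample = [
    Request(0.0, 120.0, False),
    Request(10.0, 80.0, False),
    Request(20.0, 400.0, True),
    Request(30.0, 100.0, False),
]
summary = red_summary(sample, window_seconds=60)
print(summary)
```

In practice you would likely report percentiles as well as the mean duration, since a mean can hide a slow tail.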

Cloud environments, however, pose additional challenges. When planning which metrics to monitor, IT teams will need to account for the following:

  • Distributed architectures. Cloud environments are more likely to include dozens or even hundreds of individual servers, with applications distributed across them. This makes it more important to monitor not just individual servers, but entire clusters. What matters most in the cloud is the health of your cluster -- not each server in it.
  • Shared ownership. In cloud environments, users don't typically have total control over host servers and operating systems, which are instead managed by the cloud provider. This can make it more difficult to collect certain types of data. For example, you can't pull OS logs from most cloud-based serverless compute services because you don't have access to the operating system.
  • Cost. Overprovisioned cloud environments can bloat cloud computing bills. This makes it more important to use cloud monitoring to help support cost optimization in addition to performance optimization. Of course, cost matters on premises as well -- but overprovisioning is less problematic in that context, given that the bulk of on-premises costs result from capital expenditures rather than operating expenditures.
  • Latency. Achieving low latency should be a goal for any type of application. When dealing with cloud-based apps, however, latency may pose more of a challenge. If the cloud data center is located far from your users, there's a higher risk of latency problems.
  • Load balancing. Although you may sometimes use load balancers for on-premises applications, it's more common to use them in the cloud to direct traffic between multiple instances of your application. This adds another layer of complexity to network and traffic monitoring.
  • Multiple clouds. If you use a multi-cloud or hybrid cloud architecture, it's harder to consolidate your APM tool chain around a single set of tools. For example, you can't rely on Amazon CloudWatch alone to monitor all of your resources if you spread them across multiple clouds; CloudWatch is built primarily for AWS resources.

All of these differences impact the approach that teams need to take to monitor and manage applications in the cloud.

Key cloud metrics to track

For virtually any type of cloud environment, you'll want to track the following types of metrics:

  • Requests per minute. By tracking how many requests a cloud application receives per minute, you will know the times of day or days of the week when request rates deviate from historical baselines. This enables an organization to more accurately predict when to increase the capacity for cloud resources. You can also use this type of metric to help identify problems such as distributed denial-of-service (DDoS) attacks.
  • Time to acknowledge. Tracking average time to acknowledge, which refers to the time it takes for your cloud-based app to start responding to a request, may reveal issues related to load balancers that fail to forward requests quickly enough. A slow time to acknowledge could also indicate an application that is underprovisioned and is struggling to handle all of its requests. For the best visibility, monitor and compare time-to-acknowledge metrics for each cloud region or individual cloud that you use, rather than analyzing them only in the aggregate. This will help you pinpoint latency issues that may be specific to one cloud region or cloud. Comparing acknowledgement time when a given request is and is not handled by a content delivery network (CDN) will also help you understand how best to minimize latency.
  • Response duration. Response duration, or the total time it takes the application to complete its response to a request, is also an indicator of whether your application has sufficient resources to handle the traffic directed at it. In addition, problems with response duration could indicate bugs or internal communication issues -- like the failure of one microservice to communicate efficiently with another -- inside the application itself. Response duration should also be tracked on a per-region and per-cloud basis in order to achieve the greatest visibility into latency.
  • Error rates. How often does a request result in an error? Which types of errors are most frequent? These metrics offer further visibility into the overall health of your application, as well as the cloud environment that is hosting it. Errors could reflect an application issue, but they can also indicate problems with your cloud environment itself, such as the unavailability of a cloud service -- which is typically an issue that the cloud provider needs to resolve -- or improperly configured access credentials for services running within your cloud environment.
  • Servers/nodes available. For distributed cloud environments, you should track how many servers or nodes within your cluster are up and available as a percentage of the total servers you have deployed. Although your cloud orchestration and automation tools may do a good job of automatically redistributing workloads from one node to another if a server goes down, they can only do that for so long before running out of healthy servers. You'll want to know if the number of available servers falls below about 90 percent of the total deployed, which could indicate a serious problem with your cloud server instances.
  • Average compute cost. Tracking the average cost of your cloud-based compute resources, such as virtual machines or serverless functions, over a given period will help you control costs. A spike in compute cost that can't be explained by a corresponding increase in application demand could signal an overprovisioned environment, for instance, which will waste money until it is corrected.
  • Average storage cost. You can also track the average cost of your cloud storage resources, including databases, object storage and block storage. Here again, storage cost increases that aren't tied to actual application needs could indicate a problem, such as improper data lifecycle management or inefficient use of data storage tiers.
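
A few of the checks described in this list can be sketched in a handful of lines. The example below is a minimal illustration, not a production alerting rule: the three-sigma cutoff for request rates, the region names and all sample numbers are assumptions chosen for the example, while the 90 percent availability threshold comes from the list above.

```python
from statistics import mean, pstdev


def deviates_from_baseline(current_rpm, historical_rpm, max_sigma=3.0):
    """True when requests per minute sit more than max_sigma standard
    deviations away from the historical mean (an assumed, simple rule)."""
    baseline = mean(historical_rpm)
    spread = pstdev(historical_rpm)
    if spread == 0:
        return current_rpm != baseline
    return abs(current_rpm - baseline) / spread > max_sigma


def availability_ratio(healthy_nodes, total_nodes):
    """Fraction of deployed nodes that are up and available."""
    if total_nodes <= 0:
        raise ValueError("total_nodes must be positive")
    return healthy_nodes / total_nodes


def slowest_region(durations_by_region):
    """Return (region, mean response duration in ms) for the slowest region."""
    averages = {
        region: mean(values)
        for region, values in durations_by_region.items()
        if values
    }
    region = max(averages, key=averages.get)
    return region, averages[region]


# Made-up sample data for illustration.
history = [100, 110, 90, 105, 95]
print(deviates_from_baseline(400, history))   # large spike in request rate
print(deviates_from_baseline(104, history))   # within normal variation

ratio = availability_ratio(healthy_nodes=17, total_nodes=20)
print(ratio < 0.90)                           # below the 90 percent threshold

# Hypothetical region names.
print(slowest_region({"us-east": [120, 130], "eu-west": [300, 340]}))
```

Comparing regions separately, rather than pooling all durations, is what lets a check like `slowest_region` point at a single region or cloud as the source of latency.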

Additional cloud metrics to consider

Depending on how you deploy and manage your applications, you may also want to consider the following types of metrics to help monitor your cloud applications and optimize the end-user experience:

  • Deployments per week (or day). If you use a CI/CD pipeline to deploy applications continuously into the cloud, measuring how many deployments you achieve per week -- or day, if you deploy particularly often -- will help you understand the overall health of your CI/CD operations.
  • Time to feature release. Along similar lines, tracking how long it takes your team to take a new feature from idea to deployment provides visibility into the efficiency of your CI/CD pipeline.
  • Mean time to resolve. Mean time to resolve metrics, which measure how long it takes your engineers to resolve incidents that occur in your environment, are important to track in any type of environment. But given the complexity of cloud environments, they can be especially important to watch when dealing with cloud-based apps.
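
Two of these metrics are straightforward to derive from timestamps you probably already record. The following sketch assumes you can export deployment times from your CI/CD system and opened/resolved times from your incident tracker; the function names and sample dates are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta


def deployments_per_week(deploy_times):
    """Count deployments per ISO (year, week) pair."""
    return Counter(tuple(t.isocalendar()[:2]) for t in deploy_times)


def mean_time_to_resolve(incidents):
    """Average of (resolved - opened) over a list of (opened, resolved) pairs."""
    if not incidents:
        return timedelta(0)
    total = sum(
        (resolved - opened for opened, resolved in incidents),
        timedelta(0),
    )
    return total / len(incidents)


# Made-up deployment timestamps: two in ISO week 1 of 2024, one in week 2.
deploys = [
    datetime(2024, 1, 1, 9, 0),
    datetime(2024, 1, 3, 15, 30),
    datetime(2024, 1, 9, 11, 0),
]
weekly = deployments_per_week(deploys)
print(weekly)

# Made-up incidents that took 30 and 90 minutes to resolve.
incidents = [
    (datetime(2024, 1, 2, 10, 0), datetime(2024, 1, 2, 10, 30)),
    (datetime(2024, 1, 5, 12, 0), datetime(2024, 1, 5, 13, 30)),
]
mttr = mean_time_to_resolve(incidents)
print(mttr)
```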

The specific metrics you'd collect in each category will depend on which types of cloud services you use and which metrics they expose. These metrics vary from one cloud platform to the next, but they are usually well documented by cloud providers. You can read all about the metrics exposed by Amazon EC2 or Azure Virtual Machines, to name just two basic examples of cloud services.

Whatever the specific cloud metrics you ingest into your APM tools, your key focus should be to collect information that helps you understand the state of complex, distributed cloud environments.

You should also strive to correlate data of different types and compare data across different clouds and services. This way, you can achieve full visibility into the performance -- and cost -- problems that may arise in the cloud.
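
One simple way to correlate cost with demand is to compare how fast each grows between two periods. The sketch below flags a period in which compute cost grew noticeably faster than request volume; the 10-percentage-point tolerance and all figures are assumptions for illustration, and a real check would need to account for pricing changes and reserved capacity.

```python
def cost_growth_unexplained(prev_cost, curr_cost,
                            prev_requests, curr_requests,
                            tolerance=0.10):
    """True when cost grew faster than demand by more than `tolerance`
    (expressed as a fraction, so 0.10 means 10 percentage points)."""
    if prev_cost <= 0 or prev_requests <= 0:
        raise ValueError("previous cost and request count must be positive")
    cost_growth = curr_cost / prev_cost - 1
    demand_growth = curr_requests / prev_requests - 1
    return cost_growth - demand_growth > tolerance


# Cost up 50 percent while requests rose only 5 percent: worth investigating.
flagged = cost_growth_unexplained(1000, 1500, 1_000_000, 1_050_000)
print(flagged)

# Cost and requests both up 20 percent: growth is explained by demand.
explained = cost_growth_unexplained(1000, 1200, 1_000_000, 1_200_000)
print(explained)
```

Running the same comparison per cloud or per service makes it easier to see where the unexplained spending sits.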

When you have a detailed look at what's happening, you'll be in a better position to prevent complications and improve the performance of a cloud deployment.
