As organizations seek improved reporting to inform security, FinOps and IT decisions, they need access to operational data from across the DevOps toolchain. Metrics are essential to DevOps success. But many teams face challenges when it comes to extracting timely and actionable data from Kubernetes.
Among the most important metrics a team can track are those that describe Kubernetes cluster health, such as cluster state, running containers, network I/O and resource consumption. Cluster performance serves as a health barometer for the deployment, and teams should read it alongside other metrics, such as CPU, memory and disk utilization at both the node and pod levels.
Control plane metrics -- covering etcd data stores, API servers, controller life cycles and the scheduler -- are also essential to track. Lastly, FinOps metrics, such as Kubernetes cost per environment, product, team or cluster, are increasingly important.
Tools and plugins for collecting Kubernetes metrics
Organizations have a range of tools and plugins at their disposal to collect Kubernetes metrics. Here are six common options to consider:
Kubelet is a core Kubernetes agent that runs on each node and manages its containers. With kubelet, teams can expose Kubernetes metrics such as CPU usage, memory consumption and network usage statistics. After exposing those metrics with kubelet, IT teams can scrape the data with an external monitoring system -- such as Prometheus -- to generate reports for stakeholders.
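To make the scraping step concrete, here is a minimal sketch of reading the Prometheus text format that kubelet's metrics endpoints emit. The sample payload is invented for illustration, and the parser is an assumption-laden simplification: it ignores optional timestamps and would mishandle label values that contain commas or escaped quotes.

```python
def parse_metrics(text):
    """Parse simple Prometheus text-format lines into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        name, _, label_part = series.partition("{")
        labels = {}
        for pair in label_part.rstrip("}").split(","):
            if pair:
                key, _, raw = pair.partition("=")
                labels[key.strip()] = raw.strip().strip('"')
        samples.append((name, labels, float(value)))
    return samples


# Illustrative payload only; real kubelet output is far longer.
SAMPLE = """\
# TYPE kubelet_running_pods gauge
kubelet_running_pods 12
container_memory_working_set_bytes{namespace="shop",pod="cart-0"} 5.24288e+07
"""

samples = parse_metrics(SAMPLE)
for name, labels, value in samples:
    print(name, labels, value)
```

In practice, a monitoring system such as Prometheus does this parsing itself; the sketch only shows what the scraped data looks like once it is structured.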
Container Advisor, usually called cAdvisor, is an open-source tool that collects resource usage and performance metrics for individual containers on Kubernetes nodes. Prometheus also integrates with cAdvisor to collect and monitor container-level metrics, including CPU, memory, disk and network usage.
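Container CPU usage is typically reported as an ever-increasing counter of CPU seconds, so a usage figure comes from comparing two readings. The sketch below shows that arithmetic under simple assumptions: the function name and sample values are hypothetical, and the reset handling is a rough approximation of what monitoring systems do.

```python
def cores_used(earlier, later):
    """Average CPU cores used between two (timestamp_s, cpu_seconds_total) samples."""
    (t1, v1), (t2, v2) = earlier, later
    elapsed = t2 - t1
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    # A counter that went down was reset (for example, the container restarted),
    # so count only what accumulated since the reset.
    increase = v2 - v1 if v2 >= v1 else v2
    return increase / elapsed


# Hypothetical readings taken 60 seconds apart.
steady = cores_used((0, 120.0), (60, 150.0))        # 30 CPU seconds over 60 s
after_restart = cores_used((0, 120.0), (60, 6.0))   # counter restarted from zero
print(steady, after_restart)
```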
As an open-source monitoring and alerting toolkit, Prometheus is popular in the Kubernetes ecosystem due in part to its native support for Kubernetes. Prometheus has extensive monitoring capabilities, including Kubernetes metrics related to API server, nodes and containers. The tool also offers a flexible query language, powerful alerting rules, and a large ecosystem of exporters and integrations.
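As a rough illustration of the query language, the sketch below builds a request URL for Prometheus's instant-query HTTP endpoint and reads a response of the shape that endpoint returns. The host name, the sample response and the helper names are assumptions made for the example; the query itself sums container CPU usage per namespace.

```python
import json
from urllib.parse import urlencode


def instant_query_url(base_url, promql):
    """Build a URL for Prometheus's instant-query endpoint (/api/v1/query)."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def values_by_label(response_text, label):
    """Map one label's value to the sample value for each series in a vector result."""
    body = json.loads(response_text)
    if body.get("status") != "success":
        raise ValueError("query failed")
    return {
        series["metric"].get(label, ""): float(series["value"][1])
        for series in body["data"]["result"]
    }


CPU_BY_NAMESPACE = "sum by (namespace) (rate(container_cpu_usage_seconds_total[5m]))"
# Hypothetical server address.
url = instant_query_url("http://prometheus.example:9090", CPU_BY_NAMESPACE)

# Invented response, shaped like an instant-vector result.
SAMPLE_RESPONSE = (
    '{"status":"success","data":{"resultType":"vector","result":'
    '[{"metric":{"namespace":"shop"},"value":[1700000000,"0.42"]}]}}'
)
cpu = values_by_label(SAMPLE_RESPONSE, "namespace")
print(url)
print(cpu)
```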
DevOps teams often pair Prometheus with Grafana, an open-source data visualization and monitoring tool, because of their strong integration capabilities.
One of the most common problems for many organizations is creating dashboards and reports. DevOps teams can use Grafana to create interactive dashboards and charts to visualize Kubernetes metrics that Prometheus collects. Teams that want to experiment with Kubernetes metric reporting and require support for various data sources often turn to Grafana to help generate reports for stakeholders.
Datadog automatically monitors the nodes of Kubernetes platforms. The Datadog agent collects metrics, events and logs from cluster components, workloads and other Kubernetes objects. However, Datadog isn't an open-source tool, so teams need to account for its cost.
Dynatrace monitors the availability and health of applications and processes, dependencies and connections among hosts, containers and cloud instances. It's a full-stack monitoring option for Kubernetes. The tool is a Datadog competitor and uses events, traces, metrics and behavioral information to reveal the inner workings of Kubernetes applications.
Challenges of drawing metrics from Kubernetes
Cloud and cloud-native tools like Kubernetes have brought DevOps and site reliability engineering teams into a new era of actionable data. But collecting the data that Kubernetes generates efficiently and reliably, especially at the cluster and pod level, requires additional tools and configuration to convert that data into a usable, actionable format.
First, teams must determine which metrics are relevant for their organization based on specific requirements and use cases. Organizations must choose the Kubernetes metrics that provide useful data on CPU usage, memory consumption and network throughput to see if they're meeting organizational goals.
Another important consideration is the granularity of the Kubernetes metrics. Kubernetes metrics span the cluster, node and pod levels, and each level provides valuable insights. The depth at which a team collects these metrics determines how difficult the undertaking will be. Understanding and interpreting those metrics is also a complex task because each level comes with its own aggregation methods.
Once teams obtain these metrics, the next consideration involves storing and retaining the data for long-term analysis and troubleshooting. The team must now factor in storage requirements and storage management for the data. Without defined requirements and a retention policy, organizations will encounter storage space issues as metrics data accumulates unchecked.
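To illustrate how the same pod-level data yields different answers depending on the level and the aggregation method, here is a small sketch. The pod names, node names and memory figures are made up for the example.

```python
from collections import defaultdict


def roll_up(samples, group_key, value_key, aggregate):
    """Group samples by one field and aggregate another field within each group."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample[group_key]].append(sample[value_key])
    return {group: aggregate(values) for group, values in groups.items()}


# Hypothetical pod-level memory readings.
POD_SAMPLES = [
    {"pod": "cart-0", "node": "node-a", "memory_mib": 200},
    {"pod": "cart-1", "node": "node-b", "memory_mib": 300},
    {"pod": "api-0", "node": "node-a", "memory_mib": 500},
]

node_total = roll_up(POD_SAMPLES, "node", "memory_mib", sum)  # memory used per node
node_peak = roll_up(POD_SAMPLES, "node", "memory_mib", max)   # largest pod per node
cluster_total = sum(s["memory_mib"] for s in POD_SAMPLES)     # cluster-wide usage
print(node_total, node_peak, cluster_total)
```

A sum answers a capacity question, while a maximum points at the heaviest pod, so teams need to agree on which aggregation a dashboard or alert actually uses.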
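A back-of-the-envelope estimate can help when drafting a retention policy. The sketch below multiplies retention time by ingestion rate and bytes per sample. The series count, scrape interval and the 2-bytes-per-sample figure are assumptions for illustration; actual sizes depend on the storage engine and how well the data compresses.

```python
def estimated_storage_bytes(active_series, scrape_interval_s, retention_days,
                            bytes_per_sample=2):
    """Estimate disk use as retention seconds x samples per second x bytes per sample."""
    retention_seconds = retention_days * 24 * 60 * 60
    # Each series contributes one sample per scrape interval.
    return retention_seconds * active_series * bytes_per_sample / scrape_interval_s


# Hypothetical cluster: 100,000 series scraped every 30 seconds, kept for 15 days.
needed = estimated_storage_bytes(100_000, 30, 15)
print(f"{needed / 1e9:.2f} GB")
```

Doubling the retention period or halving the scrape interval doubles the estimate, which is why both belong in the policy discussion.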
Scalability is another major challenge with Kubernetes metrics. Can the systems handle metric collection and monitoring as workloads change and pods and nodes are constantly added and removed? If not, the team will need to revamp its infrastructure to handle these stresses while maintaining performance and reliability.
From a staffing standpoint, teams will need IT professionals with expertise in building effective dashboards, creating alerts and setting up anomaly detection in their system. They'll also need team members to meaningfully analyze and visualize the data to gain the right insights and choose tools that align with the company's goals.
Custom metrics for specific use cases
Now that teams have identified and collected their Kubernetes metrics, it's time to apply them to specific use cases.
Application performance monitoring
APM metrics let IT teams monitor and measure the performance of applications running on Kubernetes and spot common issues such as bottlenecks. Examples of APM metrics include response times, latency, error rates, throughput and resource utilization.
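As a simple illustration of how such figures are derived, the sketch below computes throughput, error rate and 95th-percentile latency from a handful of request records. The records are invented, treating HTTP 5xx responses as errors is an assumption, and the percentile uses the nearest-rank method, which is only one of several common definitions.

```python
import math


def apm_summary(requests, window_seconds):
    """Summarize (latency_ms, http_status) records collected over a time window."""
    if not requests:
        raise ValueError("no requests to summarize")
    latencies = sorted(latency for latency, _ in requests)
    errors = sum(1 for _, status in requests if status >= 500)
    rank = math.ceil(95 * len(latencies) / 100)  # nearest-rank 95th percentile
    return {
        "throughput_rps": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
        "p95_latency_ms": latencies[rank - 1],
    }


# Hypothetical records from a 10-second window.
REQUESTS = [(100, 200), (120, 200), (130, 500), (150, 200), (900, 200)]
summary = apm_summary(REQUESTS, window_seconds=10)
print(summary)
```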
Resource monitoring and optimization
Custom metrics enable teams to track and analyze resource use within Kubernetes environments. IT teams can collect CPU, memory, disk I/O and network utilization metrics at the application, pod, container or node level. Teams can then use these metrics to identify resource-intensive components and adjust resource allocation to improve utilization and control costs.
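One way to act on these metrics is to compare what each workload uses against what it has requested. The sketch below flags workloads whose CPU usage falls well under their request. The workload names, millicore figures and the 30% threshold are all hypothetical choices for the example.

```python
def overprovisioned(workloads, min_utilization=0.3):
    """Return names of workloads using less than min_utilization of their CPU request."""
    flagged = []
    for name, requested_millicores, used_millicores in workloads:
        if requested_millicores > 0 and used_millicores / requested_millicores < min_utilization:
            flagged.append(name)
    return flagged


# Hypothetical (name, CPU requested, CPU used) figures in millicores.
WORKLOADS = [("cart", 1000, 150), ("api", 500, 450), ("batch", 2000, 400)]
candidates = overprovisioned(WORKLOADS)
print(candidates)
```

Flagged workloads are candidates for a smaller request, not automatic changes; usage over a longer window should confirm the pattern first.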
Autoscaling
Autoscaling is a crucial element of many cloud and Kubernetes FinOps strategies. IT ops teams can define and collect application-specific metrics based on their analysis of workload demands to trigger autoscaling actions.
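The scaling decision itself reduces to a ratio. The sketch below follows the formula the Kubernetes Horizontal Pod Autoscaler documentation describes, desired replicas = ceil(current replicas x current metric / target metric), with a tolerance band around the target. It is a simplification: the real autoscaler also accounts for pod readiness, missing metrics and stabilization windows, and the replica bounds and sample numbers here are assumptions.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Simplified replica calculation based on a per-pod custom metric."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to the target; do not scale
    wanted = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, wanted))


# Hypothetical metric: requests per second per pod, with a target of 100.
scale_up = desired_replicas(4, current_metric=150, target_metric=100)
no_change = desired_replicas(4, current_metric=105, target_metric=100)
scale_down = desired_replicas(4, current_metric=50, target_metric=100)
print(scale_up, no_change, scale_down)
```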
Service-level agreement and service-level objective monitoring
Teams can offset challenges associated with monitoring and enforcing SLAs and SLOs for applications by establishing custom metrics that measure availability, response times, error rates or any other metrics that align with service-level commitments. Such custom metrics give IT ops teams the actionable data they require to be proactive and ensure service quality.
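A common way to turn an availability SLO into something actionable is an error budget: the number of failures the objective still permits. The sketch below shows that arithmetic with a request-based SLO. The 99.9% target and the request counts are hypothetical.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Compare observed failures with the failures an availability SLO allows."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    allowed_failures = (1 - slo_target) * total_requests
    return {
        "availability": 1 - failed_requests / total_requests,
        "allowed_failures": allowed_failures,
        "budget_remaining": 1 - failed_requests / allowed_failures,
    }


# Hypothetical month: 1,000,000 requests, 400 failures, 99.9% availability target.
budget = error_budget(0.999, 1_000_000, 400)
print(budget)
```

A shrinking budget gives IT ops teams an early signal to slow risky changes before the service-level commitment is actually breached.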