The containerization and microservices ecosystem is flooded with tooling options, but Kubernetes is still the most popular container management and orchestration tool. With Kubernetes, developers, teams and enterprises can deploy resilient and distributed container applications at scale.
Kubernetes simplifies the management of containerized applications and microservices, but it can be a double-edged sword. Each layer of abstraction an organization builds adds more components and moving parts to monitor.
But before we jump into the what, why and how of Kubernetes observability, let's first define what observability is.
Observability is a property of an IT system: the degree to which its state and behavior can be understood from the metrics, logs and tracing data it produces.
- Monitoring, or metrics, gathers telemetry data from applications and services.
- Logging captures detailed error messages, as well as debugging logs and stack traces for troubleshooting.
- Tracing records how transactions move between users and microservices, in a single or distributed system, which helps pinpoint an issue or performance bottleneck in a microservices-based ecosystem.
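To make the three signals concrete, here is a minimal sketch that uses only the Python standard library. It is an illustration, not how any particular observability product works: the `handle_request` function, the `requests_total` counter name and the span layout are all assumptions made for this example.

```python
import logging
import time
import uuid
from collections import Counter
from contextlib import contextmanager

# Metrics: a simple in-process counter keyed by metric name (hypothetical names).
metrics = Counter()

# Logs: the standard library logger writes human-readable events.
logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("checkout")

# Traces: a list of finished spans, each tagged with the trace it belongs to.
spans = []


@contextmanager
def span(name, trace_id):
    """Record how long a named unit of work took within one trace."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append({
            "trace_id": trace_id,
            "name": name,
            "duration_s": time.perf_counter() - start,
        })


def handle_request(order_id):
    """Hypothetical request handler that emits all three signals."""
    trace_id = uuid.uuid4().hex
    metrics["requests_total"] += 1  # metric: how many requests were served
    with span("handle_request", trace_id):  # trace: the whole request
        with span("charge_card", trace_id):  # trace: one step inside it
            # log: a detailed event that carries the trace ID for correlation
            log.info("charging order=%s trace_id=%s", order_id, trace_id)
    return trace_id
```

The shared `trace_id` is what ties the log line to the spans, so an admin who sees a slow span can look up the matching log entries.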
The benefits of Kubernetes monitoring
The most effective way to anticipate issues before they affect an application's health or availability is to monitor the current state and health of the application running on Kubernetes. But the growing adoption of microservices infrastructure, which is diverse and distributed, makes tracing issues difficult and complicates logging and monitoring. The ephemeral nature of containers compounds this challenge.
Monitoring provides insights into Kubernetes, such as a high-level overview of cluster function, health, performance metrics and resource counts. It reports any deviations from the normal state and alerts admins to issues as early as possible, so they can act before an outage or bottleneck happens.
Monitoring Kubernetes also yields findings about resource utilization and the number of active nodes and pods. This data informs admins when to scale containers before a resource crunch or performance bottleneck occurs. It also provides historical data about resource consumption, so admins can tweak the resource limits to avoid overcommitting system resources and use the underlying resources more efficiently.
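As a rough sketch of how historical usage data might feed a scaling or limit-tuning decision, the following compares peak usage with a configured limit. The function names, the 80% and 30% thresholds and the returned labels are assumptions chosen for illustration; real thresholds depend on the workload.

```python
def utilization(usage, limit):
    """Return usage as a fraction of the limit, or None if no limit is set."""
    if not limit:
        return None
    return usage / limit


def scaling_hint(samples, limit, high=0.8, low=0.3):
    """Suggest an action from historical usage samples for one container.

    samples and limit must use the same unit, e.g. CPU millicores.
    The high/low thresholds are illustrative defaults, not recommendations.
    """
    if not samples:
        return "no-data"
    peak = utilization(max(samples), limit)
    if peak is None:
        return "set-limit"
    if peak >= high:
        return "scale-up-or-raise-limit"
    if peak <= low:
        return "lower-limit"
    return "ok"


# Example: CPU samples in millicores against a 500m limit (made-up numbers).
print(scaling_hint([120, 180, 450], limit=500))
```

Using the peak rather than the average is a deliberately conservative choice here, since a container that briefly hits its limit can still be throttled or killed.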
Key metrics to monitor in Kubernetes
Monitoring Kubernetes involves surveying all main components: clusters, nodes, pods, deployments and services.
The health of the entire Kubernetes cluster is critical. Monitoring the cluster shows admins which resources the cluster consumes, which informs capacity planning, and how many applications run on each node. These are some of the most useful metrics:
- Node status. Current health status and availability of the node.
- Node resource usage metrics. Disk and memory utilization, CPU and network bandwidth.
- Deployment status. Current and desired status of the deployments in the cluster.
- Number of pods. The count of running pods, which Kubernetes' internal processes and components use to handle the workload and schedule pods.
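The cluster-level metrics above can be derived from the objects the Kubernetes API returns. The sketch below works on plain dictionaries shaped like the JSON that a command such as `kubectl get nodes -o json` produces; the helper names and the sample data are made up for illustration, and a real setup would more likely rely on a monitoring tool than on hand-written parsing.

```python
from collections import Counter


def node_ready(node):
    """True if the node reports a Ready condition with status "True"."""
    for cond in node.get("status", {}).get("conditions", []):
        if cond.get("type") == "Ready":
            return cond.get("status") == "True"
    return False


def deployment_gap(deploy):
    """Desired replicas minus available replicas; 0 means fully rolled out."""
    # spec.replicas defaults to 1; availableReplicas is omitted when it is 0.
    desired = deploy.get("spec", {}).get("replicas", 1)
    available = deploy.get("status", {}).get("availableReplicas", 0)
    return desired - available


def pods_per_node(pods):
    """Count scheduled pods by the node they were assigned to."""
    counts = Counter()
    for pod in pods:
        node_name = pod.get("spec", {}).get("nodeName")
        if node_name:  # unscheduled pods have no nodeName yet
            counts[node_name] += 1
    return counts


# Made-up sample objects, trimmed to the fields used above.
sample_node = {"status": {"conditions": [{"type": "Ready", "status": "True"}]}}
sample_deploy = {"spec": {"replicas": 3}, "status": {"availableReplicas": 2}}
sample_pods = [
    {"spec": {"nodeName": "node-a"}},
    {"spec": {"nodeName": "node-a"}},
    {"spec": {}},
]
```

With this sample data, the deployment is one replica short of its desired state, and one of the three pods has not been scheduled yet.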
There are three main categories of metrics to monitor at the pod level:
- Kubernetes metrics. These metrics cover the number and types of resources inside a pod. They also include resource limit tracking, which helps avoid running out of system resources. Additionally, these metrics help confirm the continued health and desired state of all pods running on Kubernetes.
- Container metrics. These metrics capture resource utilization at the container level, such as CPU, memory and network usage.
- Application metrics. Application performance monitoring oversees metrics like availability, end-user experience and health. Such metrics include the number of active or online users and response times. The application itself usually exposes these metrics, so what is available varies from one application to the next.
Tools and monitoring
The right tool is essential for observability and monitoring, and for tracking and fixing system health issues and resource bottlenecks before they become dire. A good monitoring tool should capture and produce reliable metrics, and it should also enable users to troubleshoot issues, trace logs and visualize components and overall system health. Some popular tools among Kubernetes users and developers include:
- Prometheus. An open source project hosted by the Cloud Native Computing Foundation that provides powerful query, visualization and alerting capabilities. Setup is straightforward.
- Elastic Stack. This tool, also called the ELK Stack, has three parts: Elasticsearch, Logstash and Kibana. Elasticsearch is a popular open source, scalable, distributed search engine that enables developers to implement free text search and analytics in their applications. Logstash collects logs from various input sources, executes transformations on the data and then ships the transformed data to various output destinations. In other words, it is an ETL -- extract, transform, load -- tool. Kibana is the presentation layer of the ELK Stack, which enables users to create data visualizations and dashboards to represent logs, time series data and application metrics.
- Grafana. Grafana is an open source analytics and visualization tool. Like Kibana, it can connect to various data sources and present different types of data, logs and metrics in beautiful and meaningful visualizations. Organizations can use Prometheus and other tools with Grafana to visualize monitoring data from within their Kubernetes environment.
- Amazon CloudWatch and Azure Monitor. Both AWS and Microsoft enable out-of-the-box monitoring for containers running on Amazon Elastic Kubernetes Service (EKS) or Azure Kubernetes Service (AKS) using container insights. CloudWatch monitors EKS on AWS and Azure Monitor collects and analyzes telemetry data from AKS.
Other third-party tools, such as cAdvisor and Heapster (which has since been retired in favor of Metrics Server), are deployed inside Kubernetes environments and monitor Kubernetes and the applications running there. So, if a cluster fails, monitoring also fails. Yet monitoring is most essential precisely when production applications are down, mainly to debug, trace logs and identify the issue. Hence, it is worth sending logging and monitoring data to a destination outside the cluster, unless the systems have failover capabilities.
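As one possible way to get log data out of the cluster, the sketch below shows a standard library logging handler that batches records as JSON and hands each batch to a `send` callable. The `ExternalBatchHandler` class and the `send` parameter are assumptions for this example; in a real deployment `send` might post to an external log collector, and a log-shipping agent would often do this job instead of application code.

```python
import json
import logging


class ExternalBatchHandler(logging.Handler):
    """Buffer log records and pass them in batches to an external sender."""

    def __init__(self, send, batch_size=10):
        super().__init__()
        self.send = send  # hypothetical callable that ships a list of JSON strings
        self.batch_size = batch_size
        self.buffer = []

    def emit(self, record):
        # Serialize each record to one JSON string so it survives transport.
        self.buffer.append(json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "time": record.created,
        }))
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # Hand over whatever is buffered; also invoked by logging.shutdown().
        if self.buffer:
            batch, self.buffer = self.buffer, []
            self.send(batch)


# Usage with a stand-in sender that only collects batches in memory.
shipped = []
logger = logging.getLogger("orders")
logger.setLevel(logging.INFO)
logger.addHandler(ExternalBatchHandler(shipped.append, batch_size=2))
```

Because the handler only depends on the `send` callable, the destination can change without touching the application's logging calls.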