Availability and UX are valuable attributes of modern enterprise workloads. When availability falters due to software or infrastructure issues, customers and businesses can suffer.
But traditional troubleshooting techniques can't keep up with today's dynamic and ephemeral application environments. Microservices and other containerized applications are often complex, scalable and highly distributed -- and some containers might live for less time than it takes for an IT admin to see a trouble ticket.
To address container failures as they occur, platforms such as Kubernetes use container orchestration and automation capabilities to implement practical self-healing for containerized environments.
What is self-healing Kubernetes?
The idea behind self-healing Kubernetes is simple: If a container fails, Kubernetes automatically restarts or replaces the affected container to return the application to its desired state and restore operations.
Self-healing Kubernetes has four capabilities:
- restart failed containers;
- replace containers that require updates, such as a new software version;
- disable containers that don't respond to predefined health checks; and
- prevent containers from receiving traffic from users or other containers until they are ready.
Ideally, container detection and restoration should be seamless and immediate, minimize application disruption and mitigate negative UX. Organizations can specify how Kubernetes performs health checks and what actions it should take after it detects a problem.
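The restart behavior described above is governed by a pod's restart policy. The following is a minimal illustrative pod spec; the names web-pod, web and example.com/web:1.0 are placeholders, not values from any real deployment:

```yaml
# Minimal sketch of a pod whose containers Kubernetes restarts on failure.
# All names and images here are hypothetical placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: web-pod
spec:
  restartPolicy: Always   # kubelet restarts the container whenever it exits
  containers:
  - name: web
    image: example.com/web:1.0
    ports:
    - containerPort: 8080
```

With restartPolicy set to Always (the default for pods managed by a Deployment), the kubelet restarts the container on any exit; OnFailure and Never are the other options.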
How does self-healing work with Kubernetes?
Kubernetes clusters are composed of pods -- logical entities in which containers deploy. A pod passes through five possible phases:
- Pending. The pod has been created but is not running.
- Running. The pod and its containers are running without issue.
- Succeeded. The pod completes its container lifecycle properly -- it runs and stops normally.
- Failed. All containers in the pod have terminated, and at least one terminated in failure.
- Unknown. The pod's state and disposition cannot be determined.
Kubectl commands, such as kubectl get pods, can list the pods and their status for a given application.
Kubernetes uses two types of probes to gauge each pod's condition:
- A liveness probe checks whether each container is still running and healthy. If a container fails the liveness probe, Kubernetes terminates it and creates a new container according to the pod's restart policy.
- A readiness probe verifies a container's ability to service requests or handle traffic. If a container fails the readiness probe, Kubernetes removes the pod's IP address from the endpoints of any matching services, making the pod unavailable to traffic until it passes the probe again.
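Both probe types are declared per container in the pod spec. The following sketch is illustrative; the paths /healthz and /ready and the port are assumptions that depend on what the application actually exposes:

```yaml
# Hypothetical container spec combining both probe types.
containers:
- name: web
  image: example.com/web:1.0
  livenessProbe:            # failure here -> container is restarted
    httpGet:
      path: /healthz
      port: 8080
    periodSeconds: 10
    failureThreshold: 3
  readinessProbe:           # failure here -> pod removed from service endpoints
    httpGet:
      path: /ready
      port: 8080
    periodSeconds: 5
```

Kubernetes also supports exec and tcpSocket probe handlers for applications that don't serve HTTP health endpoints.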
The combination of probes and probe responses makes self-healing possible by enabling Kubernetes to restore the declarative state of the container cluster.
To accomplish this, Kubernetes frequently checks the status of pods and their containers. If Kubernetes determines that a container has failed or is unresponsive, it terminates and restarts or reschedules the pod as soon as possible -- assuming there is sufficient infrastructure available to do so. Detecting a failed container application or component can take up to five minutes.
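Detection time for in-container failures is largely a function of probe settings: the worst case is roughly periodSeconds multiplied by failureThreshold before Kubernetes acts. A sketch, with assumed illustrative values:

```yaml
# Hypothetical liveness probe tuned for faster detection.
# Worst-case detection here is about periodSeconds x failureThreshold = 30 s.
livenessProbe:
  exec:
    command: ["cat", "/tmp/healthy"]   # assumed health-marker file
  initialDelaySeconds: 5               # grace period for slow startup
  periodSeconds: 10                    # probe every 10 seconds
  failureThreshold: 3                  # act after 3 consecutive failures
```

Shorter periods detect failures sooner but add probe overhead and raise the risk of restarting a container that is merely slow; node-level failures are detected on a separate, slower timeline.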
Advantages and disadvantages of self-healing Kubernetes
The idea of self-healing applications has tantalized application and infrastructure engineers since the dawn of IT. Self-healing Kubernetes promises more reliable application management for container-based applications and components -- but it should be approached with careful consideration of the tradeoffs.
With self-healing Kubernetes, complex container environments can continue to function around the clock with virtually no need for human intervention when issues occur. Container problems are detected promptly and addressed using policies tailored by organizations. This strengthens reliability and speeds up issue resolution in ephemeral container environments -- which, in turn, can improve business outcomes.
Even though self-healing is a default capability of the Kubernetes platform, it still requires oversight. Self-healing maintains a container environment in a desired state, but management must first define that desired state by, for example, creating a pod template and updating it as configurations and needs change over time.
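In practice, the desired state is usually declared in a workload manifest such as a Deployment, which Kubernetes then continually reconciles against reality. A minimal sketch, with placeholder names:

```yaml
# Hypothetical Deployment declaring a desired state of three replicas.
# Kubernetes replaces any pod that fails to keep the count at three.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3               # desired state: three pods at all times
  selector:
    matchLabels:
      app: web
  template:                 # pod template that defines each replica
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: example.com/web:1.0
```

Applying this manifest with kubectl apply hands the desired state to the Deployment controller, which creates, replaces or reschedules pods as needed to maintain it.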
Moreover, Kubernetes self-healing operates at the application layer, which limits its capabilities. Consequently, additional tools might be necessary to monitor and remediate issues lower in the application stack.
Kubernetes self-healing best practices
Because self-healing is a straightforward feature integral to Kubernetes' normal operations, there are few direct best practices to optimize self-healing behavior.
Self-healing detects and restores deviations from a declarative desired state. Therefore, IT teams should implement policies and processes to document that state. This includes using version control in various contexts:
- container development and release;
- pod templates and other configuration files; and
- application and infrastructure documentation.
Clearly define the configuration state the business expects Kubernetes to maintain, and treat that configuration state as a version-based instance. Any changes to the configuration state should trigger a version update.
Not only does careful version control ensure compliance with current configurations, but it also enables precise rollbacks or restorations to previous configuration states if needed -- for example, in the event of an unforeseen bug in a container application update.
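One way to make configuration state version-aware within Kubernetes itself is to record a change cause on each revision and retain revision history for rollbacks. A sketch, with assumed names and image tags:

```yaml
# Hypothetical Deployment annotated so each revision records why it changed.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
  annotations:
    kubernetes.io/change-cause: "upgrade web image to 1.1"
spec:
  revisionHistoryLimit: 10  # keep ten prior revisions available for rollback
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: example.com/web:1.1
```

With history retained, kubectl rollout history deployment/web lists past revisions and kubectl rollout undo deployment/web restores the previous one -- complementing, not replacing, version control of the manifests themselves.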
Limitations of self-healing Kubernetes
While Kubernetes brings valuable self-healing capabilities to container-based applications, it does have limitations.
A typical containerized environment includes three major layers:
- The application layer houses the container entity, along with its code and dependencies.
- The Kubernetes component layer acts as an OS for containers. This layer includes the kubelet, kube-proxy and container runtime components that make Kubernetes work.
- The infrastructure layer is where servers, disks with container image files and network connectivity operate.
Self-healing operates at the application -- or top -- layer, where Kubernetes deploys and manages containers. If a pod crashes, Kubernetes can reschedule it.
Unfortunately, however, Kubernetes has no provision or mechanism to enable infrastructure self-healing. A problem with Kubernetes itself or the infrastructure, such as a failed disk or network switch, could therefore disrupt a containerized application beyond Kubernetes' ability to repair.
Organizations that implement self-healing Kubernetes should also integrate some form of application performance monitoring to oversee Kubernetes, as well as comprehensive infrastructure monitoring to alert IT admins to issues in the component and infrastructure layers.