Node Health in Thalassa Cloud Kubernetes
Maintaining healthy nodes is fundamental to running reliable applications in Kubernetes. Nodes are the worker machines that run your containerized workloads, and their health directly impacts the availability and performance of your applications. This guide explains how Kubernetes monitors node health, how Thalassa Cloud automatically recovers from node failures, and how you can monitor and troubleshoot node issues in your clusters.
About Kubernetes Node Health
In Kubernetes, a node represents a single machine in your cluster—typically a virtual machine in cloud environments like Thalassa Cloud. Each node runs several components: the kubelet (which communicates with the control plane), a container runtime (like containerd), and the kube-proxy (which handles networking). For your applications to run smoothly, these components must function correctly, and the node must have sufficient resources.
Kubernetes continuously monitors each node’s health through a system of conditions. These conditions reflect the current state of the node, including whether it’s ready to accept workloads, whether it’s experiencing resource pressure, and whether it can communicate with the cluster. When conditions indicate problems, Kubernetes takes action to protect your workloads.
Node Conditions
Kubernetes tracks node health through a set of standard conditions. Each condition has a status of True, False, or Unknown, along with a reason and message that provide context about the node’s state. Understanding these conditions helps you diagnose issues and understand what Kubernetes is doing to maintain cluster health.
The Ready Condition
The most important condition is Ready, which indicates whether the node is healthy and can accept new pods. When a node is Ready, it means the kubelet is functioning, the node can communicate with the API server, and there are no critical issues preventing workload scheduling. If Ready is False or Unknown, Kubernetes stops scheduling new pods on that node and may begin evicting existing pods.
You can check the Ready status of all nodes in your cluster:
kubectl get nodes

This shows a summary of all nodes, including their status. A node showing “Ready” is healthy. If you see “NotReady” or other statuses, the node has issues that need attention.
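If you want just the Ready condition, its status, and the reported reason for each node, a jsonpath query works as well. This is only a convenience sketch; adapt it to your own tooling:

kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\t"}{.status.conditions[?(@.type=="Ready")].reason}{"\n"}{end}'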
Resource Pressure Conditions
Kubernetes monitors three types of resource pressure that can affect node health. When a node experiences pressure, Kubernetes may prevent new pods from being scheduled there and might evict lower-priority pods to free resources.
- The MemoryPressure condition indicates that the node is running low on memory. When this condition is True, Kubernetes avoids scheduling new pods that would consume additional memory. In severe cases, Kubernetes may evict pods to prevent the node from running out of memory entirely, which could cause system instability.
- Similarly, DiskPressure indicates that the node’s disk space is running low. This can prevent pods from writing logs, creating temporary files, or using ephemeral storage. Kubernetes responds by preventing new pods from being scheduled and may evict pods to free disk space.
- The PIDPressure condition indicates that the node has too many processes running, approaching the system’s process ID limit. This is less common but can occur in clusters running many small containers. When PID pressure exists, Kubernetes avoids scheduling additional pods.
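When any of these pressure conditions becomes True, Kubernetes also taints the node (for example with node.kubernetes.io/memory-pressure), which is what actually keeps new pods away. A quick way to review a node’s conditions and taints together, with <node-name> as a placeholder:

kubectl get node <node-name> -o jsonpath='{range .status.conditions[*]}{.type}{"="}{.status}{"\n"}{end}'
kubectl get node <node-name> -o jsonpath='{.spec.taints}'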
Network Conditions
The NetworkUnavailable condition indicates that the node’s network is not properly configured. This typically appears briefly when a node is first joining the cluster, but if it persists, it indicates a networking problem that prevents the node from communicating with other nodes or the control plane.
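If the condition lingers, its reason and message usually point at the network plugin. Pulling just that condition keeps the output short; the node name is a placeholder:

kubectl get node <node-name> -o jsonpath='{.status.conditions[?(@.type=="NetworkUnavailable")]}'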
Inspecting Node Health
To get detailed information about a node’s health, use the kubectl describe command:
kubectl describe node <node-name>

This command shows information about the node, including all conditions, resource usage, and events. The conditions section shows the current status of each condition:
Conditions:
  Type                 Status  Reason                       Message
  ----                 ------  ------                       -------
  NetworkUnavailable   False   CiliumIsUp                   Cilium is running on this node
  MemoryPressure       False   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    KubeletReady                 kubelet is posting ready status

In this example, all conditions show healthy statuses. The Ready condition is True, and all pressure conditions are False, indicating the node is functioning normally.
The output also shows resource capacity and allocation, helping you understand how much CPU and memory the node has and how much is currently in use. This information helps you plan capacity and identify nodes that might be approaching resource limits.
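If you only need the raw numbers, you can query capacity and allocatable directly instead of scanning the full describe output; the grep line is a rough convenience and assumes the usual describe layout:

kubectl get node <node-name> -o jsonpath='{.status.capacity}'
kubectl get node <node-name> -o jsonpath='{.status.allocatable}'
kubectl describe node <node-name> | grep -A 10 'Allocated resources'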
Monitoring Node Health
Regular monitoring helps you identify issues before they impact your applications. Kubernetes provides several ways to monitor node health, and Thalassa Cloud integrates additional monitoring capabilities.
Using kubectl
The simplest way to monitor nodes is through kubectl commands. The kubectl get nodes command provides a quick overview, while kubectl top nodes shows current resource usage if the metrics server is enabled:
kubectl top nodes

This displays CPU and memory usage for each node, helping you identify nodes that are approaching capacity limits.
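Recent kubectl versions can also sort this output, which makes it easier to spot the busiest nodes first:

kubectl top nodes --sort-by=memory
kubectl top nodes --sort-by=cpu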
Node Events
Kubernetes generates events when node conditions change. You can view events related to a specific node:
kubectl get events --field-selector involvedObject.kind=Node --sort-by='.lastTimestamp'

These events show when nodes transition between states, when pods are evicted due to resource pressure, and other significant changes. Monitoring these events helps you understand what Kubernetes is doing to maintain cluster health.
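To focus on problems only, or to follow changes as they happen, you can narrow the field selector to warning events or watch the stream:

kubectl get events --field-selector involvedObject.kind=Node,type=Warning
kubectl get events -w --field-selector involvedObject.kind=Node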
Thalassa Cloud Monitoring
Thalassa Cloud provides built-in monitoring through the metrics server, which collects resource usage data from nodes and pods. This data is available through the Kubernetes API and can be integrated with external monitoring solutions. For more details about monitoring capabilities, see the Metrics Server documentation.
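If you want to see exactly what the metrics server reports, you can query its API group through the API server; this assumes the metrics server is installed and healthy:

kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes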
Auto-Healing in Thalassa Cloud
One of Kubernetes’ most powerful features is its ability to automatically detect and respond to node failures. Thalassa Cloud Kubernetes enhances this capability with automatic node recovery, ensuring that your workloads remain available even when individual nodes fail.
How Auto-Healing Works
The auto-healing process begins when Kubernetes detects that a node is unhealthy. The kubelet on each node sends regular heartbeats to the control plane. If these heartbeats stop, or if the node stops responding to health checks, Kubernetes marks the node as unreachable.
Once a node is marked as unreachable, Kubernetes begins protecting your workloads. The first step is to mark all pods on that node for eviction. Pods that are part of a Deployment, StatefulSet, or other controller are automatically recreated on healthy nodes. This process happens automatically—you don’t need to intervene.
For pods that aren’t managed by a controller, Kubernetes marks them for deletion after a grace period. This ensures that failed nodes don’t leave orphaned pods that can’t be recovered.
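How long Kubernetes waits before evicting pods from an unreachable node is governed by tolerations for the node.kubernetes.io/not-ready and node.kubernetes.io/unreachable taints, which Kubernetes adds to pods by default (typically with a 300-second window). You can inspect them on any pod; the pod name is a placeholder:

kubectl get pod <pod-name> -o jsonpath='{.spec.tolerations}'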
Node Recovery
When a node becomes unreachable, Thalassa Cloud’s infrastructure monitoring detects the issue and attempts to recover the node. If the node is experiencing a transient issue, such as a network partition or temporary resource exhaustion, it may recover automatically. Once the node is healthy again and can communicate with the control plane, Kubernetes reintroduces it to the cluster, and it begins accepting new workloads.
If the node cannot be recovered automatically, Thalassa Cloud may replace it entirely. This ensures that persistent hardware or software issues don’t leave your cluster with permanently unhealthy nodes. The replacement process provisions a new node, joins it to the cluster, and the cluster returns to full capacity.
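If you want to follow a recovery or replacement as it happens, watching the node list is usually enough:

kubectl get nodes -w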
Workload Rescheduling
During the auto-healing process, Kubernetes reschedules pods from unhealthy nodes to healthy ones. This rescheduling respects pod disruption budgets, affinity rules, and resource constraints, ensuring that your applications maintain their desired configuration even as nodes are replaced.
The rescheduling happens automatically and typically completes within minutes. For stateless applications, this process is transparent—users may not notice any interruption. For stateful applications, the process depends on how your application handles pod restarts and whether it uses persistent storage.
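If your application cannot tolerate losing too many replicas at once, a PodDisruptionBudget gives Kubernetes an explicit constraint to respect during voluntary disruptions such as node drains. A minimal sketch, assuming a workload labelled app=my-app (both names are placeholders):

kubectl create poddisruptionbudget my-app-pdb --selector=app=my-app --min-available=2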
Troubleshooting Node Issues
When nodes show unhealthy conditions, understanding how to diagnose and resolve issues helps you maintain cluster stability. Common problems include resource exhaustion, network connectivity issues, and component failures.
Investigating Resource Pressure
If nodes show MemoryPressure, DiskPressure, or PIDPressure conditions, the first step is to understand what’s consuming resources. Check pod resource usage:
kubectl top pods --all-namespaces

This shows which pods are using the most CPU and memory. You may need to adjust resource requests and limits, scale down resource-intensive workloads, or add more nodes to your cluster.
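If a workload’s requests or limits need adjusting, you can patch them in place; the deployment name and values below are placeholders, and the change triggers a new rollout:

kubectl set resources deployment my-app --requests=cpu=100m,memory=128Mi --limits=cpu=500m,memory=512Mi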
For disk pressure, check what’s consuming disk space. Kubernetes uses disk space for container images, logs, and ephemeral storage. If disk pressure persists, consider cleaning up unused images or increasing node disk capacity.
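To see how disk and ephemeral storage are being consumed on a node without SSH access, you can ask the kubelet for its stats summary through the API server proxy; the node name is a placeholder and the output is verbose JSON:

kubectl get --raw "/api/v1/nodes/<node-name>/proxy/stats/summary"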
Network Connectivity Issues
If a node shows NetworkUnavailable or cannot communicate with the control plane, check network connectivity. Verify that the node can reach the API server and that security groups or network policies aren’t blocking required traffic. In Thalassa Cloud, ensure that your VPC and subnet configurations allow the necessary communication.
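For deeper troubleshooting, you can start a debugging pod directly on the affected node (if your cluster and policies allow it) and test connectivity from there; the image is only an example:

kubectl debug node/<node-name> -it --image=busybox:1.36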
Component Failures
If the kubelet or other node components are failing, check the node’s system logs. You can access logs from the node if you have SSH access, or use kubectl to view component logs:
kubectl logs -n kube-system <component-pod-name>

Common issues include misconfigured kubelet settings, problems with the container runtime, or conflicts with system services. Thalassa Cloud manages these components, but understanding what to look for helps you diagnose issues reported by the platform.
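If you do have SSH access, the kubelet and container runtime typically run as systemd services, so their logs are available through journalctl; this assumes a systemd-based node image:

sudo journalctl -u kubelet --since '1 hour ago'
sudo journalctl -u containerd --since '1 hour ago'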
Node Not Ready
When a node shows as NotReady, start by checking the node conditions and events as described earlier. The reason and message fields in the conditions provide clues about what’s wrong. Common causes include kubelet failures, network issues, or resource exhaustion.
If a node remains NotReady for an extended period and doesn’t recover automatically, Thalassa Cloud will typically replace it. You can also manually drain and remove unhealthy nodes if needed, though the automatic recovery usually handles this.
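If you do need to step in manually, the usual sequence is to cordon the node, drain it, and then remove it so the platform can replace it; adjust the drain flags to match your workloads:

kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
kubectl delete node <node-name>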
Integration with Other Components
Node health monitoring integrates with several other Kubernetes features. The Node Problem Detector component can provide additional monitoring for kernel-level issues and system problems that might not be caught by standard Kubernetes health checks.
Node health also affects cluster autoscaling. If nodes are consistently at high resource utilization, you might need to enable the Node Pool Autoscaler to automatically add nodes when demand increases.
For applications that need to run across multiple zones for high availability, node health in each zone affects your application’s resilience. See the Highly Available Deployments guide for strategies to ensure applications remain available even when nodes fail.
Further Reading
- Kubernetes: Nodes Architecture & Health Monitoring: Official documentation covering node components, health checks, and troubleshooting.
- Thalassa Cloud: Nodes Documentation: Detailed guide on managing nodes within Thalassa Cloud clusters.
- Node Problem Detector: Learn how to extend node health monitoring for kernel and OS-level issues.
- Highly Available Deployments Guide: Strategies to maintain application uptime across node failures.
- Node Pool Autoscaler: Automatically adjust node resources in response to workload demand.