Node Health in Thalassa Cloud Kubernetes

Maintaining healthy nodes is fundamental to running reliable applications in Kubernetes. Nodes are the worker machines that run your containerized workloads, and their health directly impacts the availability and performance of your applications. This guide explains how Kubernetes monitors node health, how Thalassa Cloud automatically recovers from node failures, and how you can monitor and troubleshoot node issues in your clusters.

About Kubernetes Node Health

In Kubernetes, a node represents a single machine in your cluster—typically a virtual machine in cloud environments like Thalassa Cloud. Each node runs several components: the kubelet (which communicates with the control plane), a container runtime (like containerd), and the kube-proxy (which handles networking). For your applications to run smoothly, these components must function correctly, and the node must have sufficient resources.

Kubernetes continuously monitors each node’s health through a system of conditions. These conditions reflect the current state of the node, including whether it’s ready to accept workloads, whether it’s experiencing resource pressure, and whether it can communicate with the cluster. When conditions indicate problems, Kubernetes takes action to protect your workloads.

Node Conditions

Kubernetes tracks node health through a set of standard conditions. Each condition has a status of True, False, or Unknown, along with a reason and message that provide context about the node’s state. Understanding these conditions helps you diagnose issues and understand what Kubernetes is doing to maintain cluster health.

The Ready Condition

The most important condition is Ready, which indicates whether the node is healthy and can accept new pods. When a node is Ready, it means the kubelet is functioning, the node can communicate with the API server, and there are no critical issues preventing workload scheduling. If Ready is False or Unknown, Kubernetes stops scheduling new pods on that node and may begin evicting existing pods.

You can check the Ready status of all nodes in your cluster:

kubectl get nodes

This shows a summary of all nodes, including their status. A node showing “Ready” is healthy. If you see “NotReady” or other statuses, the node has issues that need attention.
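For example, output similar to the following indicates one node that needs attention (the node names, ages, and versions here are illustrative):

NAME              STATUS     ROLES    AGE   VERSION
worker-pool-a-1   Ready      <none>   14d   v1.29.4
worker-pool-a-2   Ready      <none>   14d   v1.29.4
worker-pool-b-1   NotReady   <none>   3d    v1.29.4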

Resource Pressure Conditions

Kubernetes monitors three types of resource pressure that can affect node health. When a node experiences pressure, Kubernetes may prevent new pods from being scheduled there and might evict lower-priority pods to free resources.

  • The MemoryPressure condition indicates that the node is running low on memory. When this condition is True, Kubernetes avoids scheduling new pods that would consume additional memory. In severe cases, Kubernetes may evict pods to prevent the node from running out of memory entirely, which could cause system instability.
  • Similarly, DiskPressure indicates that the node’s disk space is running low. This can prevent pods from writing logs, creating temporary files, or using ephemeral storage. Kubernetes responds by preventing new pods from being scheduled and may evict pods to free disk space.
  • The PIDPressure condition indicates that the node has too many processes running, approaching the system’s process ID limit. This is less common but can occur in clusters running many small containers. When PID pressure exists, Kubernetes avoids scheduling additional pods.
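To check these pressure conditions across every node at once, you can query the node objects directly. The following jsonpath query is one way to do it; it prints the MemoryPressure, DiskPressure, and PIDPressure status for each node:

kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="MemoryPressure")].status}{"\t"}{.status.conditions[?(@.type=="DiskPressure")].status}{"\t"}{.status.conditions[?(@.type=="PIDPressure")].status}{"\n"}{end}'

A node under pressure shows True in the corresponding column.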

Network Conditions

The NetworkUnavailable condition indicates that the node’s network is not properly configured. This typically appears briefly when a node is first joining the cluster, but if it persists, it indicates a networking problem that prevents the node from communicating with other nodes or the control plane.

Inspecting Node Health

To get detailed information about a node’s health, use the kubectl describe command:

kubectl describe node <node-name>

This command shows information about the node, including all conditions, resource usage, and events. The conditions section shows the current status of each condition:

Conditions:
  Type                 Status  Reason                       Message
  ----                 ------  ------                       -------
  NetworkUnavailable   False   CiliumIsUp                   Cilium is running on this node
  MemoryPressure       False   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    KubeletReady                 kubelet is posting ready status

In this example, all conditions show healthy statuses. The Ready condition is True, and all pressure conditions are False, indicating the node is functioning normally.

The output also shows resource capacity and allocation, helping you understand how much CPU and memory the node has and how much is currently in use. This information helps you plan capacity and identify nodes that might be approaching resource limits.
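For reference, the capacity section of the describe output looks similar to the following (the values shown are illustrative for a small worker node):

Capacity:
  cpu:                4
  ephemeral-storage:  102687672Ki
  memory:             16374584Ki
  pods:               110
Allocatable:
  cpu:                3920m
  ephemeral-storage:  94625963179
  memory:             15323832Ki
  pods:               110

The difference between Capacity and Allocatable is the amount reserved for system components and the kubelet itself.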

Monitoring Node Health

Regular monitoring helps you identify issues before they impact your applications. Kubernetes provides several ways to monitor node health, and Thalassa Cloud integrates additional monitoring capabilities.

Using kubectl

The simplest way to monitor nodes is through kubectl commands. The kubectl get nodes command provides a quick overview, while kubectl top nodes shows current resource usage if the metrics server is enabled:

kubectl top nodes

This displays CPU and memory usage for each node, helping you identify nodes that are approaching capacity limits.
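Example output (the values are illustrative):

NAME              CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
worker-pool-a-1   231m         5%     3214Mi          21%
worker-pool-a-2   1840m        46%    12110Mi         79%

A node that consistently runs close to 100% on either resource is a candidate for rebalancing workloads or adding capacity.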

Node Events

Kubernetes generates events when node conditions change. You can view events related to a specific node:

kubectl get events --field-selector involvedObject.kind=Node --sort-by='.lastTimestamp'

These events show when nodes transition between states, when pods are evicted due to resource pressure, and other significant changes. Monitoring these events helps you understand what Kubernetes is doing to maintain cluster health.
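To focus on problems only, you can narrow the query to warning events:

kubectl get events --field-selector involvedObject.kind=Node,type=Warning --sort-by='.lastTimestamp'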

Thalassa Cloud Monitoring

Thalassa Cloud provides built-in monitoring through the metrics server, which collects resource usage data from nodes and pods. This data is available through the Kubernetes API and can be integrated with external monitoring solutions. For more details about monitoring capabilities, see the Metrics Server documentation.

Auto-Healing in Thalassa Cloud

One of Kubernetes’ most powerful features is its ability to automatically detect and respond to node failures. Thalassa Cloud Kubernetes enhances this capability with automatic node recovery, ensuring that your workloads remain available even when individual nodes fail.

How Auto-Healing Works

The auto-healing process begins when Kubernetes detects that a node is unhealthy. The kubelet on each node sends regular heartbeats to the control plane in the form of node status updates and node leases. If these heartbeats stop, or if the node stops responding to health checks, the node controller marks the node's Ready condition as Unknown and treats the node as unreachable.

Once a node is marked as unreachable, Kubernetes begins protecting your workloads. The node controller taints the node, and once the pods' toleration period expires, the pods on it are marked for eviction. Pods that are part of a Deployment, StatefulSet, or other controller are automatically recreated on healthy nodes. This process happens automatically; you don't need to intervene.

For pods that aren't managed by a controller, Kubernetes still marks them for deletion after the grace period, but nothing recreates them, so workloads that must survive node failures should always run under a controller.
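The timing of these evictions is controlled by taint-based eviction. By default, Kubernetes adds tolerations for the node.kubernetes.io/not-ready and node.kubernetes.io/unreachable taints with tolerationSeconds set to 300, so pods stay bound to an unreachable node for about five minutes before eviction begins. Workloads that need faster failover can set a shorter toleration in their pod template; the snippet below is a sketch using an illustrative 60-second value:

tolerations:
  - key: "node.kubernetes.io/unreachable"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 60
  - key: "node.kubernetes.io/not-ready"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 60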

Node Recovery

When a node becomes unreachable, Thalassa Cloud’s infrastructure monitoring detects the issue and attempts to recover the node. If the node is experiencing a transient issue, such as a network partition or temporary resource exhaustion, it may recover automatically. Once the node is healthy again and can communicate with the control plane, Kubernetes reintroduces it to the cluster, and it begins accepting new workloads.

If the node cannot be recovered automatically, Thalassa Cloud may replace it entirely. This ensures that persistent hardware or software issues don't leave your cluster with permanently unhealthy nodes. The replacement process provisions a new node and joins it to the cluster, returning the cluster to full capacity.

Workload Rescheduling

During the auto-healing process, Kubernetes reschedules pods from unhealthy nodes to healthy ones. This rescheduling respects pod disruption budgets, affinity rules, and resource constraints, ensuring that your applications maintain their desired configuration even as nodes are replaced.

The rescheduling happens automatically and typically completes within minutes. For stateless applications, this process is transparent—users may not notice any interruption. For stateful applications, the process depends on how your application handles pod restarts and whether it uses persistent storage.
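If your application cannot tolerate losing too many replicas at once while nodes are drained and replaced, a PodDisruptionBudget tells Kubernetes how many pods must remain available. A minimal example (the name, label selector, and minAvailable value are placeholders for your own workload):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web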

Troubleshooting Node Issues

When nodes show unhealthy conditions, understanding how to diagnose and resolve issues helps you maintain cluster stability. Common problems include resource exhaustion, network connectivity issues, and component failures.

Investigating Resource Pressure

If nodes show MemoryPressure, DiskPressure, or PIDPressure conditions, the first step is to understand what’s consuming resources. Check pod resource usage:

kubectl top pods --all-namespaces

This shows which pods are using the most CPU and memory. You may need to adjust resource requests and limits, scale down resource-intensive workloads, or add more nodes to your cluster.
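To surface the heaviest consumers first, you can sort the output by memory or CPU:

kubectl top pods --all-namespaces --sort-by=memory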

For disk pressure, check what’s consuming disk space. Kubernetes uses disk space for container images, logs, and ephemeral storage. If disk pressure persists, consider cleaning up unused images or increasing node disk capacity.
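One way to inspect disk usage directly is to start a temporary debug pod on the node, which mounts the node's root filesystem under /host (this assumes you are permitted to create such pods in the cluster):

kubectl debug node/<node-name> -it --image=busybox -- df -h /host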

Network Connectivity Issues

If a node shows NetworkUnavailable or cannot communicate with the control plane, check network connectivity. Verify that the node can reach the API server and that security groups or network policies aren’t blocking required traffic. In Thalassa Cloud, ensure that your VPC and subnet configurations allow the necessary communication.
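A quick way to test connectivity from the node's own network is to run a debug pod on it (node debug pods use the host network) and probe the API server's health endpoint. The endpoint address below is a placeholder; you can find the real one with kubectl cluster-info:

kubectl debug node/<node-name> -it --image=curlimages/curl -- curl -sk https://<api-server-endpoint>/healthz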

Component Failures

If the kubelet or other node components are failing, check the node’s system logs. You can access logs from the node if you have SSH access, or use kubectl to view component logs:

kubectl logs -n kube-system <component-pod-name>

Common issues include misconfigured kubelet settings, problems with the container runtime, or conflicts with system services. Thalassa Cloud manages these components, but understanding what to look for helps you diagnose issues reported by the platform.
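If you do have SSH access to a node, the kubelet and container runtime logs are usually also available through systemd (the exact unit names depend on the node image; kubelet and containerd are typical):

journalctl -u kubelet --since "1 hour ago"
journalctl -u containerd --since "1 hour ago"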

Node Not Ready

When a node shows as NotReady, start by checking the node conditions and events as described earlier. The reason and message fields in the conditions provide clues about what’s wrong. Common causes include kubelet failures, network issues, or resource exhaustion.

If a node remains NotReady for an extended period and doesn’t recover automatically, Thalassa Cloud will typically replace it. You can also manually drain and remove unhealthy nodes if needed, though the automatic recovery usually handles this.
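If you do remove a node by hand, the usual sequence is to drain it first so workloads are rescheduled gracefully, then delete the Node object:

kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
kubectl delete node <node-name>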

Integration with Other Components

Node health monitoring integrates with several other Kubernetes features. The Node Problem Detector component can provide additional monitoring for kernel-level issues and system problems that might not be caught by standard Kubernetes health checks.

Node health also affects cluster autoscaling. If nodes are consistently at high resource utilization, you might need to enable the Node Pool Autoscaler to automatically add nodes when demand increases.

For applications that need to run across multiple zones for high availability, node health in each zone affects your application’s resilience. See the Highly Available Deployments guide for strategies to ensure applications remain available even when nodes fail.

Further Reading