Troubleshoot Node Not Ready failures that are followed by recoveries

Summary

This article helps you troubleshoot and resolve Node Not Ready problems in Azure Kubernetes Service (AKS) clusters. When a node enters the NotReady state, applications that run on that node can be disrupted and stop responding. Typically, the node recovers automatically after a short period. However, to prevent recurring problems and maintain a stable environment, you need to understand the underlying causes so that you can implement effective resolutions.

Cause

Several scenarios can cause a NotReady state:

  • The API server is unavailable. This condition causes the readiness probe to fail. A pod that fails its readiness probe is removed from the service's endpoints, so traffic is no longer forwarded to that pod instance.

  • Virtual machine (VM) host faults. To determine whether VM host faults occurred, check the following information sources:

Resolution

To resolve this issue, follow these steps:

  1. Run kubectl describe node <node-name> to review detailed information about the node's status. Look for any error messages or warnings that might indicate the root cause of the problem.
  2. Check the API server availability by running the kubectl get apiservices command. Make sure that the readiness probe is correctly configured in the deployment YAML file.
  3. Verify the node's network configuration to make sure that there are no connectivity problems.
  4. Check the node's resource usage, like CPU, memory, and disk, to identify potential constraints. For more information, see Monitor your Kubernetes cluster performance with Container insights.
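
As a rough illustration of step 1, the following Python sketch filters a node object for conditions that usually indicate a problem. It assumes the input is the parsed JSON that `kubectl get node <node-name> -o json` returns. The function names `unhealthy_conditions` and `summarize` and the `sample_node` data are hypothetical and written by hand for this example; they aren't part of any AKS or Kubernetes tooling.

```python
# Condition types other than Ready indicate a problem when their status is "True".
PROBLEM_WHEN_TRUE = {"MemoryPressure", "DiskPressure", "PIDPressure", "NetworkUnavailable"}


def unhealthy_conditions(node):
    """Return the status conditions of a node object that indicate a problem."""
    problems = []
    for cond in node.get("status", {}).get("conditions", []):
        ctype = cond.get("type")
        status = cond.get("status")
        # Ready is healthy only when its status is "True" ("False" or "Unknown" means NotReady).
        if ctype == "Ready" and status != "True":
            problems.append(cond)
        elif ctype in PROBLEM_WHEN_TRUE and status == "True":
            problems.append(cond)
    return problems


def summarize(node):
    """Build one readable line per unhealthy condition."""
    name = node.get("metadata", {}).get("name", "<unknown>")
    return [
        f"{name}: {c['type']}={c['status']} "
        f"reason={c.get('reason', '')} since={c.get('lastTransitionTime', '')}"
        for c in unhealthy_conditions(node)
    ]


# Hand-written sample that mimics the shape of a node object (not real cluster output).
sample_node = {
    "metadata": {"name": "example-node"},
    "status": {
        "conditions": [
            {"type": "MemoryPressure", "status": "False"},
            {"type": "DiskPressure", "status": "True", "reason": "KubeletHasDiskPressure"},
            {"type": "Ready", "status": "Unknown", "reason": "NodeStatusUnknown"},
        ]
    },
}

for line in summarize(sample_node):
    print(line)
```

To use the sketch against a real node, you could load the command output with `json.load` and pass the resulting dictionary to `summarize`. The `reason` and `lastTransitionTime` fields can help you correlate a NotReady period with the other checks in this list.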

For further steps, see Basic troubleshooting of Node Not Ready failures.

Prevention

To prevent this issue, take one or more of the following actions:

  • Make sure that you pay for your service tier.
  • Reduce the number of watch and get requests to the API server.
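
One possible way to reduce repeated get requests from your own tooling is to cache results on the client side for a short time. The following Python sketch shows that idea under stated assumptions: `TTLCache` is a hypothetical helper, and `fetch` stands in for whatever function your code uses to read an object from the API server. Whether short-lived caching is acceptable depends on how fresh the data must be for your workload.

```python
import time


class TTLCache:
    """Reuse a fetched value for ttl_seconds instead of requesting it again."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch        # hypothetical callable that performs the real request
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so the cache can be tested without sleeping
        self._entries = {}         # key -> (expiry time, cached value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]        # still fresh: no request is sent
        value = self._fetch(key)   # missing or expired: send one request and store the result
        self._entries[key] = (now + self._ttl, value)
        return value
```

For example, a script that checks the same node every few seconds could wrap its lookup function as `TTLCache(lookup, ttl_seconds=30)` so that at most one request per node is sent in any 30-second window.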