Debug School

Suyash Sambhare


VMware Common Issues

What if the most consumed host fails?

This health check simulates the failure of the host with the most resources consumed and shows what cluster resource consumption would look like afterwards. First, the resources on that host would no longer be available. Second, vSAN would attempt to re-protect all objects that are now running with reduced redundancy. Re-protecting all objects is impossible if the resulting utilization exceeds 100%.

Limits Health – After one additional host failure check

In addition to the basic limit health check, there is also a simulation of how resources would look after an ESXi host failure has occurred. If a single ESXi host fails, two things can happen. First, the resources on that ESXi host are no longer available. Second, vSAN attempts to re-protect all components belonging to objects that are now running with reduced redundancy due to the failure.
This health check simulates both actions described above. If the ESXi host with the most resources consumed fails, this health check calculates how many resources would be used on the remaining hosts in the cluster, and how many resources would still be available.
If there is already a failure in the cluster, this test will report one additional failure. Therefore, this test reports on the results of the current failure and the additional failure that it introduces.
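The calculation described above can be sketched in a few lines of Python. This is only an illustration of the aggregate arithmetic, not vSAN's actual implementation: the `HostUsage` type, the host names, and the assumption that everything stored on the failed host must be rebuilt on the survivors are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HostUsage:
    """Hypothetical per-host figures for one resource, e.g. disk capacity in GB."""
    name: str
    used: float
    capacity: float


def simulate_host_failure(hosts):
    """Return (failed host name, utilization % on the surviving hosts)."""
    if len(hosts) < 2:
        raise ValueError("need at least two hosts to simulate a failure")
    # The check fails the host that currently consumes the most resources.
    failed = max(hosts, key=lambda h: h.used)
    # Its capacity is no longer available to the cluster ...
    remaining_capacity = sum(h.capacity for h in hosts if h is not failed)
    # ... but (assumption) everything it held must be re-protected elsewhere,
    # so the total amount in use stays the same.
    total_used = sum(h.used for h in hosts)
    return failed.name, 100.0 * total_used / remaining_capacity


cluster = [
    HostUsage("esx-a", 600, 1000),
    HostUsage("esx-b", 500, 1000),
    HostUsage("esx-c", 400, 1000),
]
failed_host, utilization = simulate_host_failure(cluster)
print(f"{failed_host} fails -> {utilization:.0f}% used on remaining hosts")
```

With these made-up numbers, 1500 GB must fit into the 2000 GB left on the two surviving hosts, so the simulated utilization is 75% and re-protection would be possible.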

Error State

If this check reports that after a host failure, more than 100% of resources will be used, it means that re-protection fails for some objects because there are not enough resources available.
This health check simulation is very simple. It only looks at cluster aggregate resources, so just like the basic limits check, it does not consider the distribution and placement rules.
However, this simple simulation does verify that the vSAN cluster has been configured with enough resources to operate safely after a failure and the subsequent re-protection. This test does not check for balance or fault domains, so these need to be considered independently.
A user may enforce an operational business policy to have no less than 25% free disk space under normal conditions and no less than 15% free disk space after one failure. This check can be used to implement such a policy and to verify that this is indeed the case.
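A free-space policy like the one above could be checked along these lines. This is a minimal sketch under the same aggregate-only assumptions as the health check itself; the function name, the `(used, capacity)` tuple layout, and the sample numbers are illustrative and not part of any vSAN API.

```python
def check_free_space_policy(hosts, min_free_normal=25.0, min_free_after_failure=15.0):
    """hosts: list of (used, capacity) tuples, one per host, in the same unit.

    Returns (ok_normal, ok_after_failure).
    """
    total_used = sum(used for used, _ in hosts)
    total_capacity = sum(capacity for _, capacity in hosts)
    free_normal = 100.0 * (total_capacity - total_used) / total_capacity

    # Simulate losing the host with the most resources consumed.
    _, failed_capacity = max(hosts, key=lambda h: h[0])
    remaining_capacity = total_capacity - failed_capacity
    # Assumption: all data is re-protected, so total_used is unchanged.
    free_after = 100.0 * (remaining_capacity - total_used) / remaining_capacity

    return free_normal >= min_free_normal, free_after >= min_free_after_failure


# 50% free under normal conditions, 25% free after one failure: policy met.
print(check_free_space_policy([(600, 1000), (500, 1000), (400, 1000)]))
# 30% free normally, but over 100% utilization after one failure: policy violated.
print(check_free_space_policy([(800, 1000), (700, 1000), (600, 1000)]))
```

The thresholds default to the 25% and 15% figures from the example policy and can be changed per environment.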

Troubleshooting and Fix

This check is primarily informational. If it fails, you may wish to add resources to the cluster so that a rebuild can succeed after a failure. If you believe there should already be enough capacity in the cluster to rebuild after a failure, check whether any components, such as disk drives, are in a failed state.

Physical Disk Health - Congestion check

Congestion in vSAN happens when the lower layers fail to keep up with the I/O rate of the higher layers. If this health status is not green (OK), vSAN is still using the disk, but it is in a state of reduced performance, manifesting in low throughput/IOPS and high latencies for vSAN objects using this disk group. Congestion in these cases will apply to all objects on the disk group.

Error state

Typical reasons for congestion are faulty or badly sized hardware, misbehaving storage controller firmware, bad controller drivers, a low queue depth on the controller, or problems in the software. For example, if the flash cache device is not sized correctly, virtual machines performing a lot of write operations could fill up the write buffers on the flash cache device. In hybrid configurations, these buffers need to be destaged to magnetic disks. When destaging has to happen very frequently, congestion may be used to slow down the writes from the virtual machine so that destaging can keep up.
Another common scenario is a high read cache miss rate, which can also lead to congestion and slow down virtual machine read I/O.

Troubleshooting and Fix

Under high load, when vSAN is operating at its maximum performance, a low amount of congestion (typically a value under 200) is expected and is not a cause for concern. However, any congestion value above 0 combined with low throughput/IOPS indicates an issue. This health check is green (OK) for congestion values below 200, yellow (warning) for values between 200 and 220, and red (alert) for values above 220. The maximum value for congestion is 255.
For versions earlier than 6.7 U1, the thresholds remain 32 (yellow) and 64 (red).
High congestion could be the root cause of virtual machine storage performance degradation, operation failures, or even ESXi hosts going unresponsive.
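The thresholds above map to a simple classification. The sketch below is illustrative only: the function name and the `legacy` flag are made up, and the handling of the exact boundary values (200 and 220, or 32 and 64 on older versions) is an assumption, since the text does not say which side they fall on.

```python
def congestion_status(value, legacy=False):
    """Classify a vSAN congestion value (0-255) as 'green', 'yellow' or 'red'.

    legacy=True applies the thresholds quoted for versions earlier than 6.7 U1.
    """
    if not 0 <= value <= 255:
        raise ValueError("congestion values range from 0 to 255")
    warn, alert = (32, 64) if legacy else (200, 220)
    if value < warn:
        return "green"   # below the warning threshold
    if value <= alert:
        return "yellow"  # assumption: both boundary values count as warning
    return "red"         # above the alert threshold


for sample in (0, 150, 210, 240):
    print(sample, congestion_status(sample), congestion_status(sample, legacy=True))
```

Note that a "green" result here is not the whole story: as described above, any non-zero congestion combined with low throughput/IOPS still deserves a closer look.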


Data Health – vSAN Object Health check

The object health check is designed to convey two things at a glance:

  • It provides a cluster-wide overview by summarizing all objects in the cluster.
  • It categorizes object health so that you can assess not only whether an object is healthy or unhealthy, but also whether an administrator should take action or whether the environment is at risk.

Error state

These are the possible states that an object may report, including the healthy state.

  • remoteAccessible: This status is only applicable for the client vSAN cluster after mounting the remote vSAN datastore and indicates the object is accessible from all hosts in the client cluster. The actual object health status like reduced availability needs to be queried from the server cluster which the client cluster is mounting from.
  • Data move: vSAN is building data on the ESXi hosts and storage in the cluster either because you requested some form of maintenance mode or evacuation, or because of re-balancing activities. Objects in this state are fully compliant with their policy and are healthy, but vSAN is actively rebuilding them. You should not be worried, as the object is not at risk. However, a performance impact can be expected while objects are in this state. You can cross-reference to the re-syncing components view to learn more about active data sync activities.
  • Healthy: The object is in perfect condition, exactly aligned with its policy, and is not currently being moved or otherwise worked on.
  • Inaccessible: An object has suffered more failures (permanent or temporary) than it was configured to tolerate and is currently unavailable and inaccessible. If the failures are not temporary (unlike, for example, an ESXi host reboot), you should address the underlying root cause, such as failed ESXi hosts, a failed network, or removed disks, as quickly as possible to restore availability, because virtual machines that use these objects cannot function correctly while the objects are inaccessible.
  • Non-availability related in compliance: This is a catch-all state when none of the other states apply. An object with this state is not compliant with its policy but is meeting the availability (NumberOfFailuresToTolerate) policy. There is currently no documented case where this state would be applicable.
  • Non-availability related reconfig: vSAN is rebuilding data on the ESXi hosts and storage in the cluster because you requested a storage policy change that is unrelated to availability. In other words, such an object is fully in compliance with the NumberOfFailuresToTolerate policy and the data movement is to satisfy another policy change, such as NumberOfDiskStripesPerObject. You do not need to worry about an object in this state, as it is not at risk.
  • Reduced availability - active rebuild: The object has suffered a failure, but it was configured to be able to tolerate the failure. I/O continues to flow and the object is accessible. vSAN is actively working on re-protecting the object by rebuilding new components to bring the object back to compliance.
  • Reduced availability with no rebuild: The object has suffered a failure, but vSAN was able to tolerate it: I/O is flowing and the object is accessible. However, vSAN is not working on re-protecting the object. This is not due to the delay timer (see reduced availability with no rebuild - delay timer) but due to other reasons. There may not be enough resources in the cluster, there may not have been enough resources in the past, or a past attempt to re-protect failed and vSAN has yet to retry. Refer to the limits health check for a first assessment of whether any resources are exhausted. Resolve the failure or add resources as quickly as possible to become fully protected against a subsequent failure again.
  • Reduced availability with no rebuild - delay timer: The object has suffered a failure, but vSAN was able to tolerate it. I/O is flowing and the object is accessible. However, vSAN is not yet working on re-protecting the object, because it is waiting for the 60-minute (default) delay timer to expire before starting the re-protect. You can issue an explicit request to skip the delay timer and start the re-protect immediately if you know that the failed entity cannot be recovered within the delay period. However, if you know that the failed host is rebooting, or that the wrong drive was pulled by mistake and is being reinserted, it is advisable to wait for those tasks to finish, as that is the quickest way to fully re-protect the object.
  • Reduced Availability With Paused Rebuild: The object has suffered a failure or its policy was recently changed to have a higher availability requirement. However, the object rebuild is paused because of a lack of available resources.
  • Reduced Availability With Policy Pending: The object policy was recently changed but has not yet been applied to the object. The object's current availability is lower than what the new policy requires. Note that this is a transient status: it eventually transitions to either 'healthy' or 'Reduced Availability With Policy Pending Failed', depending on whether the new policy can be accepted given resource limitations. Depending on how much transient capacity is in use in the cluster, the object stays in this status for minutes to hours. No user action is needed for this status.
  • Reduced Availability With Policy Pending Failed: Object policy has been changed but failed to apply to the object because of a lack of available resources. Users need to add more resources to the cluster so that vSAN can re-apply the new availability policy to the object automatically to make it fully compliant.
  • Non-availability Related In-compliance With Policy Pending: The object policy was recently changed and has not yet been applied. The object is still fully compliant with the new availability policy, but not compliant with the new non-availability-related policies. Note that this is a transient status: it eventually transitions to either 'healthy' or 'Non-availability Related In-compliance With Policy Pending Failed', depending on whether the new policy can be accepted given resource limitations. Depending on how much transient capacity is in use in the cluster, the object stays in this status for minutes to hours. No user action is needed for this status.
  • Non-availability Related In-compliance With Policy Pending Failed: The object policy was recently changed but failed to apply to the object because of a lack of resources. The object is still fully compliant with the new availability policy. Users need to add more resources to the cluster so that vSAN can automatically re-apply the new non-availability-related policy to the object and make it fully compliant.
  • Non-availability Related In-compliance With Paused Rebuild: The object is not compliant with its current policy, but is meeting the availability (NumberOfFailuresToTolerate) policy. However, the object rebuild is paused because of a lack of available resources.

Troubleshooting and Fix

By reviewing the object state from the above list, you know what activities are occurring on the vSAN cluster from an object perspective, and whether any corrective actions should be taken.
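One way to turn the state list into a quick triage summary is sketched below. The three buckets ("ok", "watch", "act") are this sketch's own labels rather than vSAN terminology, the state strings are simply copied from the list above and are not API identifiers, and placing the two "Paused Rebuild" states under "act" is an inference from their "lack of available resources" description rather than something the list states outright.

```python
from collections import Counter

OK, WATCH, ACT = "ok", "watch", "act"

# State names as written in the list above; buckets are an illustrative reading.
TRIAGE = {
    "Healthy": OK,
    "Data move": OK,
    "Non-availability related reconfig": OK,
    "Reduced Availability With Policy Pending": OK,
    "Non-availability Related In-compliance With Policy Pending": OK,
    "remoteAccessible": WATCH,  # real status must be queried on the server cluster
    "Non-availability related in compliance": WATCH,
    "Reduced availability - active rebuild": WATCH,
    "Reduced availability with no rebuild - delay timer": WATCH,
    "Inaccessible": ACT,
    "Reduced availability with no rebuild": ACT,
    "Reduced Availability With Policy Pending Failed": ACT,
    "Non-availability Related In-compliance With Policy Pending Failed": ACT,
    "Reduced Availability With Paused Rebuild": ACT,  # inferred
    "Non-availability Related In-compliance With Paused Rebuild": ACT,  # inferred
}


def summarize(object_states):
    """Count objects per triage bucket; unknown states fall back to WATCH."""
    return Counter(TRIAGE.get(state, WATCH) for state in object_states)


sample = ["Healthy", "Healthy", "Data move", "Inaccessible",
          "Reduced availability with no rebuild - delay timer"]
print(dict(summarize(sample)))
```

Defaulting unknown states to "watch" is a deliberate choice in this sketch: a state the table does not recognize should prompt a look rather than be silently treated as healthy.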

Ref: https://kb.vmware.com/s/article/2108743
