Introduction
In the previous article, we saw how to configure and test HA.
In this article, we will cover advanced HA features and expert recommendations.
A note before we start:
You are NOT obligated to practice every advanced HA feature below.
Practice the ones that matter for your environment.
HA considerations
- Recovery Time Objective (RTO) of 5 minutes: when a host goes down, it takes about 5 minutes for its VMs to be fully running again.
- Enabling HA takes 4 to 9 minutes to be applied successfully.
- HA has a small impact of 3% to 9% on throughput when a host is down.
- HA is designed to support LAN networks with up to 10 ms latency between HA nodes.
- Sometimes DRS rules and HA can conflict, since a DRS rule may require that a VM run on a specific host, or that two VMs run on different hosts or on the same host, so take these rules into account.
Types of host failure:
- A host stops functioning.
- A host becomes network isolated.
- A host loses network connectivity with the master host (network partition).
Determining Responses to Host Issues
- If a host fails and its virtual machines must be restarted, you can control the order in which the virtual machines are restarted with the VM restart priority setting.
- You can also configure how vSphere HA responds if hosts lose management network connectivity with other hosts by using the host isolation response setting.
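The effect of the VM restart priority setting can be illustrated with a short Python sketch. The priority names and their ranking below are assumptions for illustration, and the VM names are hypothetical; this is not how vSphere HA is implemented internally.

```python
# Illustrative only: order VMs for restart by an assumed priority ranking.
# Lower rank number = restarted earlier.
PRIORITY_RANK = {"highest": 0, "high": 1, "medium": 2, "low": 3, "lowest": 4}

def restart_order(vms):
    """Return VM names sorted so that higher-priority VMs come first.

    `vms` is a list of (name, priority) tuples. Python's sort is stable,
    so VMs with the same priority keep their input order.
    """
    ordered = sorted(vms, key=lambda vm: PRIORITY_RANK[vm[1]])
    return [name for name, _priority in ordered]

# Hypothetical VMs that were running on the failed host.
vms = [("web01", "medium"), ("db01", "highest"), ("test01", "low"), ("app01", "high")]
print(restart_order(vms))  # ['db01', 'app01', 'web01', 'test01']
```

In practice this means a database VM marked with the highest priority is restarted before the application and web VMs that depend on it.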
Proactive HA failure
- Occurs when a host component fails, which results in a loss of redundancy or a noncatastrophic failure.
- However, the functional behavior of the VMs residing on the host is not yet affected.
- For example, if a power supply on the host fails, but other power supplies are available, that is a Proactive HA failure.
- If a Proactive HA failure occurs, you can automate the remediation action taken in the vSphere Availability section of the vSphere Client. The VMs on the affected host can be evacuated to other hosts, and the host is placed in either Quarantine mode or Maintenance mode.
Host isolation response
- Host isolation response determines what happens when a host in a vSphere HA cluster loses its management network connections, but continues to run.
- You can use the isolation response to have vSphere HA power off virtual machines that are running on an isolated host and restart them on a nonisolated host.
- Host isolation responses require that Host Monitoring Status is enabled. If Host Monitoring Status is disabled, host isolation responses are also suspended.
- A host determines that it is isolated when it is unable to communicate with the agents running on the other hosts and it is unable to ping its isolation addresses. The host then executes its isolation response.
- To use the [Shutdown and restart VMs] response, you must install VMware Tools in the guest operating system of the virtual machine.
- The responses are: either [Power off and restart VMs] or [Shutdown and restart VMs].
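The isolation rule described above (both checks must fail, and Host Monitoring must be enabled) can be sketched in Python. The function names and the boolean probe results are hypothetical stand-ins for what the HA agent does internally; this is only a sketch of the decision logic, not VMware's implementation.

```python
def is_isolated(peer_agents_reachable, isolation_addresses_pingable):
    """Return True when the host should consider itself isolated.

    Both arguments are lists of booleans (hypothetical probe results).
    The host is isolated only when NO peer HA agent answers AND
    NO isolation address answers a ping.
    """
    return not any(peer_agents_reachable) and not any(isolation_addresses_pingable)

def isolation_action(isolated, host_monitoring_enabled, response):
    """Return the action applied to the VMs on this host.

    `response` is the configured isolation response label. If Host
    Monitoring is disabled, isolation responses are suspended.
    """
    if not isolated or not host_monitoring_enabled:
        return "none"
    return response

# No peer agent answers and the isolation address does not answer either.
isolated = is_isolated([False, False], [False])
print(isolation_action(isolated, True, "Power off and restart VMs"))
```

Note that a host that can still ping an isolation address is not treated as isolated, even if it cannot reach any other HA agent.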
VM and Application Monitoring
- VM Monitoring restarts individual virtual machines if their VMware Tools heartbeats are not received within a set time.
- Similarly, Application Monitoring can restart a virtual machine if the heartbeats for an application it is running are not received.
- You can enable these features and configure the sensitivity with which vSphere HA monitors non-responsiveness.
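The heartbeat timeout idea behind VM Monitoring can be sketched as follows. The sensitivity names and the interval values in seconds are illustrative assumptions, so check the values configured in your own cluster; the function name is hypothetical.

```python
# Assumed failure intervals in seconds for each sensitivity level (illustrative).
SENSITIVITY_INTERVAL_S = {"low": 120, "medium": 60, "high": 30}

def should_restart_vm(last_heartbeat_s, now_s, sensitivity):
    """Return True when no VMware Tools heartbeat arrived within the interval.

    Times are plain seconds on the same clock. A higher sensitivity means
    a shorter interval, so the VM is restarted sooner.
    """
    interval = SENSITIVITY_INTERVAL_S[sensitivity]
    return (now_s - last_heartbeat_s) > interval

# Last heartbeat was 40 seconds ago.
print(should_restart_vm(100, 140, "high"))    # True: 40 s exceeds 30 s
print(should_restart_vm(100, 140, "medium"))  # False: 40 s is within 60 s
```

A higher sensitivity reacts faster to a hung guest but is also more likely to restart a VM that was only briefly unresponsive.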
VM Component Protection
- If VM Component Protection (VMCP) is enabled, vSphere HA can detect DataStore accessibility failures and provide automated recovery for affected virtual machines.
- VMCP provides protection against DataStore accessibility failures that can affect a virtual machine running on a host in a vSphere HA cluster.
- When a DataStore accessibility failure occurs, the affected host can no longer access the storage path for a specific DataStore.
- You can determine the response that vSphere HA will make to such a failure, ranging from the creation of event alarms to virtual machine restarts on other hosts.
Types of DataStore Failure
PDL (Permanent Device Loss)
- PDL is an unrecoverable loss of accessibility that occurs when a storage device reports that the DataStore is no longer accessible by the host. This condition cannot be reverted without powering off virtual machines.
APD (All Paths Down)
- APD represents a transient or unknown accessibility loss, or any other unidentified delay in I/O processing. This type of accessibility issue is recoverable.
How to Configure VMCP
- VM Component Protection is configured in the vSphere Client.
- Go to the Configure tab, click vSphere Availability, and then click Edit.
- Under Failures and Responses, you can select DataStore with PDL or DataStore with APD.
DataStore Heartbeating
- When the master host in a VMware vSphere High Availability cluster cannot communicate with a subordinate host over the management network, the master host uses DataStore heartbeating to determine whether that host has failed, is in a [network partition], or is [network isolated].
- If the host has stopped DataStore heartbeating, it is considered to have failed and its virtual machines are restarted elsewhere.
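The way the master host narrows down the state of a subordinate host can be sketched as a small decision function. This is a simplified illustration: the function name and the `host_reports_isolated` flag are hypothetical, and the way isolation is told apart from partition here is an assumption, not VMware's actual algorithm.

```python
def classify_host(management_reachable, datastore_heartbeat_seen, host_reports_isolated):
    """Classify a subordinate host from the master host's point of view.

    - Reachable over the management network: nothing to decide.
    - Unreachable and no DataStore heartbeat: the host has failed.
    - Unreachable but still heartbeating: the host is alive, so it is
      either network isolated or in a network partition (assumed here
      to be distinguished by a flag the host itself reports).
    """
    if management_reachable:
        return "healthy"
    if not datastore_heartbeat_seen:
        return "failed"
    if host_reports_isolated:
        return "isolated"
    return "partitioned"

print(classify_host(False, False, False))  # failed
print(classify_host(False, True, True))    # isolated
print(classify_host(False, True, False))   # partitioned
```

Only the "failed" outcome leads directly to the virtual machines being restarted elsewhere; a host that is still heartbeating is known to be alive.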
HA Interoperability [working with other products]
You can use vSAN with a vSphere HA cluster only if the following conditions are met:
- All the cluster’s ESXi hosts must be version 5.5 or later.
- The cluster must have a minimum of three ESXi hosts.
In a cluster using DRS and vSphere HA with admission control turned on:
- Virtual machines might not be evacuated from hosts that are entering maintenance mode.
- This behavior occurs because of the resources reserved for restarting virtual machines in the event of a failure.
- You must manually migrate the virtual machines off the hosts using vMotion.
In some scenarios, vSphere HA might not be able to fail over virtual machines because of resource constraints. This can occur for several reasons.
- HA admission control is disabled and Distributed Power Management (DPM) is enabled. This can result in DPM consolidating virtual machines onto fewer hosts and placing the empty hosts in standby mode, leaving insufficient powered-on capacity to perform a failover.
- VM-Host affinity (required) rules might limit the hosts on which certain virtual machines can be placed.
- There might be sufficient aggregate resources, but these can be fragmented across multiple hosts so that they cannot be used by virtual machines for failover.
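The fragmentation case is easy to see with numbers. The sketch below uses memory only, with hypothetical host and VM sizes, and a hypothetical function name; real placement also considers CPU, reservations, and rules.

```python
def can_fail_over(vm_memory_gb, host_free_memory_gb):
    """Compare aggregate free memory with what a single host can offer.

    Returns (aggregate_ok, single_host_ok). A VM must fit on ONE host,
    so aggregate_ok alone is not enough for a successful failover.
    """
    aggregate_ok = sum(host_free_memory_gb) >= vm_memory_gb
    single_host_ok = any(free >= vm_memory_gb for free in host_free_memory_gb)
    return aggregate_ok, single_host_ok

# Three surviving hosts with 4 GB free each; the VM needs 8 GB.
print(can_fail_over(8, [4, 4, 4]))  # (True, False)
```

The cluster has 12 GB free in total, yet no single host can hold the 8 GB VM, so the failover cannot complete until resources are defragmented.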
VM Component Protection (VMCP) has the following interoperability issues and limitations:
- VMCP does not support vSphere Fault Tolerance. If VMCP is enabled for a cluster using Fault Tolerance, the affected FT virtual machines will automatically receive overrides that disable VMCP.
- VMCP does not detect or respond to accessibility issues for files located on vSAN DataStores. If a virtual machine's configuration and VMDK files are located only on vSAN DataStores, they are not protected by VMCP.
- VMCP does not detect or respond to accessibility issues for files located on Virtual Volume DataStores. If a virtual machine's configuration and VMDK files are located only on Virtual Volume DataStores, they are not protected by VMCP.
- VMCP does not protect against inaccessible Raw Device Mappings (RDMs).
IPv6 and HA
vSphere HA can be used with IPv6 when the following conditions are met:
- The cluster contains only ESXi 6.0 or later hosts; earlier versions are NOT supported.
- The management network for all hosts in the cluster must be configured with the same IP version, either IPv6 or IPv4.
- [The network isolation addresses] used by vSphere HA must match the IP version used by the cluster for its management network.
- IPv6 cannot be used in vSphere HA clusters that also utilize vSAN.
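The IP-version rules above can be checked before enabling HA with a small script. This is a sketch using Python's standard `ipaddress` module; the function name and the sample addresses (taken from documentation ranges) are hypothetical, and it only covers the two IP-version conditions, not the ESXi version or vSAN conditions.

```python
import ipaddress

def check_ip_versions(management_ips, isolation_addresses):
    """Return a list of problems; an empty list means both rules hold.

    Rule 1: all management addresses use the same IP version.
    Rule 2: every isolation address uses that same IP version.
    """
    problems = []
    mgmt_versions = {ipaddress.ip_address(ip).version for ip in management_ips}
    if len(mgmt_versions) != 1:
        problems.append("management network must use exactly one IP version")
        return problems
    cluster_version = mgmt_versions.pop()
    for addr in isolation_addresses:
        if ipaddress.ip_address(addr).version != cluster_version:
            problems.append(f"isolation address {addr} is not IPv{cluster_version}")
    return problems

# IPv4 management network with an IPv6 isolation address: rule 2 is violated.
print(check_ip_versions(["192.0.2.10", "192.0.2.11"], ["2001:db8::1"]))
```

An empty result means the management network and the isolation addresses agree on one IP version.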
Conclusion
Some IT administrators find it difficult to master advanced HA configuration on the first attempt.
My opinion, as Maher Islaieh: start by practicing basic HA as covered in Part II,
then gradually test the advanced features as much as possible.