2021-08-02

Availability Group unexpected Failover due to LeaseTimeout and how to solve it

SQL Server Availability Group (AG) high relies on Windows Server Failover Cluster (WSFC) in failover detection. Frequent communication/heartbeat/ping called "Lease mechanism" is used to ensure that the two sides (the Cluster Service and SQL Server service) are in frequent contact, checking each other's state and ultimately preventing a split-brain scenario. By the way, if your AG primary replica is too heavily-loaded, it may occasionally just too busy to signal the WSFC service that it is still alive, in this case the WSFC will decide the SQL Server service is down and trigger a failover. You will see the below error message in the SQL Server Error Log:

Windows Server Failover Cluster did not receive a process event signal from SQL Server hosting availability group '<AG Group Name>' within the lease timeout period.
The state of the local availability replica in availability group '<AG Group Name>' has changed from 'PRIMARY_NORMAL' to 'RESOLVING_NORMAL'.  The state changed because the lease between the local availability replica and Windows Server Failover Clustering (WSFC) has expired.  For more information, see the SQL Server error log, Windows Server Failover Clustering (WSFC) management console, or WSFC log.

We can also read the same error from the Windows Cluster log:

SQL Server Availability Group <AG Group Name>: Lease renewal failed with timeout error.
SQL Server Availability Group <AG Group Name>: The lease is expired.

The lease is a heartbeat that detects the SQL Server process health hosting the primary replica. A thread runs at priority inside the SQL Server and communicates via a Windows event with the SQL Server resource DLL-hosted process (RHS.EXE) - if that thread does not respond within the lease timeout period, the SQL Server resource DLL reports a lease timeout and reports availability group no longer "looks alive", and the availability group resource transitions to RESOLVING state and fails over if configured to do so. Although the lease is primarily a synchronization mechanism between the primary instance and the cluster, it can also create failure conditions where there was otherwise no need to fail over. For example, high CPU, out-of-memory conditions (low virtual memory, process paging), SQL process not responding while generating a memory dump, system not responding, cluster (WSFC) going offline (e.g. due to quorum loss) can prevent lease renewal from the SQL instance and causing a restart or failover which is unexpected.
One resolution is to increase the LeaseTimeout value (default 20000, which means 20 seconds), for example, set it to 60000 (60 seconds). According to the Microsoft document, 1/2 * LeaseTimeout must be less than SameSubnetThreshold * SameSubnetDelay, and SameSubnetThreshold \<= CrossSubnetThreshold and SameSubnetDelay \<= CrossSubnetDelay should be true of all SQL Server clusters. So in order to set LeaseTimeout to let's say 60 secs, you must also increase the SubnetThreshold values and SubnetDelay values, by running the following PowerShell commands:

(get-cluster).SameSubnetDelay = 2000
(get-cluster).SameSubnetThreshold = 20
(get-cluster).CrossSubnetDelay = 4000
(get-cluster).CrossSubnetThreshold = 40
Get-Cluster | fl *subnet*

After you executed the above PowerShell commands, you set the LeaseTimeout value in Failover Cluster Manager > Roles > Availability Group Name > Other Resources > Right-click the AG resource at the bottom of the window (under Other Resources) and select Properties > properties tab. Follow these steps provided by Microsoft.

The new value of property 'LeaseTimeout' will take effect after the resource is taken offline and brought online again.