2025-08-04

Streamline the MSSQL High Availability to improve business continuity

This short document proposes an exploration into the possibility of refining the SQL Server high availability architecture to improve business continuity.

The issue we aim to address

Since March 2025, the Production SQL Server has faced outages 2 times. DBA team determined that the cause was the Windows Server Failover Cluster (WSFC) losing its quorum, which resulted in both the primary SQL availability replica in the US and the secondary SQL availability replica in the DK site being in a "Resolving" state, rendering them inaccessible for both reading and writing. The DBA team cannot fix this issue alone and needs help from the IT team to recover the WSFC quorum.

Current architecture for production SQL Server High Availability

Displayed below is the current WSFC architecture of the production SQL Server.

To mitigate the performance effects of network latency between the US and DK sites, the primary SQL replica and the secondary replica are configured to synchronize data in Asynchronous-Commit mode. This configuration limits the SQL availability group to Manual Failover only, where Automatic Failover is not supported, a fact that should already be known and accepted by us.

The table below has been taken from Microsoft Online regarding WSFC Quorum:

Based on the information provided by Microsoft, we can derive the following disaster recovery scenarios for our existing configuration:

Primary Server

 Cloud Witness

Secondary Server

Database accessibility within the availability group

Up

Up

Up

The primary database server allows read and write access.

Up

Up

Down

The primary database server allows read and write access.

Up

Down

Up

The primary database server allows read and write access.

Up

Down

Down

According to dynamic quorum behavior of WSFC, if the secondary server and the witness are taken down one at a time, the quorum continues to exist, and the primary database server is still accessible. However, if both the secondary server and the witness go down simultaneously from the primary server's perspective—imagine the internet connection to both is severed—the databases in the primary server's availability group will become inaccessible, and immediate IT support is needed to recover the quorum. If the secondary server is connected to the witness, we might be able to reinstate the database services on the secondary server through a forced manual failover.

Down

Up

Up

DBA must conduct a manual failover to the secondary server to resume read-write access to the databases in availability group.

Down

Down

Up

Quorum is lost; the databases are inaccessible, and immediate IT support is needed to recover the quorum.

 

The new architecture we present aims to tackle the highlighted scenarios.

The proposed architecture – Clusterless Availability Group

SQL Server 2017 introduces read-scale availability groups without a cluster.

A diagram of a group of objects

AI-generated content may be incorrect.
In the same way as a conventional clustered availability group, a Read-Scale availability group (often referred to as Clusterless; the term Read-Scale is merely a marketing label by Microsoft) allows for data synchronization between primary and secondary replicas to be configured in either Synchronous-Commit mode or Asynchronous-Commit mode at any time without causing service interruptions. A clusterless availability group only allows for manual failover, and does not support automatic failover. Using a clusterless availability group offers the benefit of removing the necessity for WSFC quorum and the cloud witness, thereby simplifying maintenance. Moreover, the disaster recovery scenarios for our SQL Servers will be less complex as described below:

Primary Server

Secondary Server

Database accessibility within the availability group

Up

Up

Primary database server is Read-Write accessible.

Up

Down

Primary database server is Read-Write accessible.

Down

Up

DBA must conduct a Manual Failover to the Secondary Server to resume Read-Write access to the databases in Availability Group.

 

Following Actions

DBA team is looking to obtain a new set of SQL Server VMs, one in the US and the other in DK, to set up a lab environment for assessing the clusterless availability group, focusing on disaster recovery and data integrity. After we secure satisfactory testing results, we can develop a migration plan for the clusterless architecture or include it in the forthcoming SQL Server upgrade initiative (upgrading from version 2019 to 2022/2025).

No comments:

Post a Comment