This short document proposes an exploration into the possibility of refining the SQL Server high availability architecture to improve business continuity.
The issue we aim to address
Since March 2025,
the Production SQL Server has faced outages 2 times. DBA team determined that
the cause was the Windows Server Failover Cluster (WSFC) losing its quorum,
which resulted in both the primary SQL availability replica in the US and the
secondary SQL availability replica in the DK site being in a
"Resolving" state, rendering them inaccessible for both reading and
writing. The DBA team cannot fix this issue alone and needs help from the IT
team to recover the WSFC quorum.
Current
architecture for production SQL Server High Availability
Displayed below is
the current WSFC architecture of the production SQL Server.
To mitigate the performance effects of network latency
between the US and DK sites, the primary SQL replica and the secondary replica
are configured to synchronize data in Asynchronous-Commit mode. This
configuration limits the SQL availability group to Manual Failover only,
where Automatic Failover is not supported, a fact that should already be known
and accepted by us.
The table below has been taken from Microsoft
Online regarding WSFC Quorum:
Based on the information provided by Microsoft, we can
derive the following disaster recovery scenarios for our existing
configuration:
Primary Server |
Cloud Witness |
Secondary Server |
Database accessibility within the availability group |
Up |
Up |
Up |
The primary database server allows
read and write access. |
Up |
Up |
Down |
The primary database server allows
read and write access. |
Up |
Down |
Up |
The primary database server allows
read and write access. |
Up |
Down |
Down |
According to dynamic quorum behavior of WSFC, if the
secondary server and the witness are taken down one at a time, the quorum
continues to exist, and the primary database server is still accessible.
However, if both the secondary server and the witness go down simultaneously
from the primary server's perspective—imagine the internet connection to both
is severed—the databases in the primary server's availability group will
become inaccessible, and immediate IT support is needed to recover the quorum.
If the secondary server is connected to the witness, we might be able to reinstate
the database services on the secondary server through a forced manual
failover. |
Down |
Up |
Up |
DBA must conduct a manual failover
to the secondary server to resume read-write access to the databases in availability
group. |
Down |
Down |
Up |
Quorum is lost; the databases are
inaccessible, and immediate IT support is needed to recover the quorum. |
The new architecture we present aims to tackle the highlighted
scenarios.
The proposed architecture – Clusterless Availability
Group
SQL Server 2017 introduces read-scale
availability groups without a cluster.
In the same way as a conventional clustered availability group, a Read-Scale
availability group (often referred to as Clusterless; the term Read-Scale is
merely a marketing label by Microsoft) allows for data synchronization between
primary and secondary replicas to be configured in either Synchronous-Commit
mode or Asynchronous-Commit mode at any time without causing service
interruptions. A clusterless availability group only allows for manual
failover, and does not support automatic failover. Using a clusterless
availability group offers the benefit of removing the necessity for WSFC quorum
and the cloud witness, thereby simplifying maintenance. Moreover, the disaster
recovery scenarios for our SQL Servers will be less complex as described below:
Primary Server |
Secondary Server |
Database accessibility within the
availability group |
Up |
Up |
Primary database server is
Read-Write accessible. |
Up |
Down |
Primary database server is
Read-Write accessible. |
Down |
Up |
DBA must conduct a Manual Failover
to the Secondary Server to resume Read-Write access to the databases in
Availability Group. |
Following Actions
DBA team is looking to obtain a new set of SQL Server VMs,
one in the US and the other in DK, to set up a lab environment for assessing
the clusterless availability group, focusing on disaster recovery and data
integrity. After we secure satisfactory testing results, we can develop a
migration plan for the clusterless architecture or include it in the
forthcoming SQL Server upgrade initiative (upgrading from version 2019 to
2022/2025).