2025-08-04

Streamline the MSSQL High Availability to improve business continuity

This short document proposes an exploration into the possibility of refining the SQL Server high availability architecture to improve business continuity.

The issue we aim to address

Since March 2025, the Production SQL Server has faced outages 2 times. DBA team determined that the cause was the Windows Server Failover Cluster (WSFC) losing its quorum, which resulted in both the primary SQL availability replica in the US and the secondary SQL availability replica in the DK site being in a "Resolving" state, rendering them inaccessible for both reading and writing. The DBA team cannot fix this issue alone and needs help from the IT team to recover the WSFC quorum.

Current architecture for production SQL Server High Availability

Displayed below is the current WSFC architecture of the production SQL Server.

To mitigate the performance effects of network latency between the US and DK sites, the primary SQL replica and the secondary replica are configured to synchronize data in Asynchronous-Commit mode. This configuration limits the SQL availability group to Manual Failover only, where Automatic Failover is not supported, a fact that should already be known and accepted by us.

The table below has been taken from Microsoft Online regarding WSFC Quorum:

Based on the information provided by Microsoft, we can derive the following disaster recovery scenarios for our existing configuration:

Primary Server

 Cloud Witness

Secondary Server

Database accessibility within the availability group

Up

Up

Up

The primary database server allows read and write access.

Up

Up

Down

The primary database server allows read and write access.

Up

Down

Up

The primary database server allows read and write access.

Up

Down

Down

According to dynamic quorum behavior of WSFC, if the secondary server and the witness are taken down one at a time, the quorum continues to exist, and the primary database server is still accessible. However, if both the secondary server and the witness go down simultaneously from the primary server's perspective—imagine the internet connection to both is severed—the databases in the primary server's availability group will become inaccessible, and immediate IT support is needed to recover the quorum. If the secondary server is connected to the witness, we might be able to reinstate the database services on the secondary server through a forced manual failover.

Down

Up

Up

DBA must conduct a manual failover to the secondary server to resume read-write access to the databases in availability group.

Down

Down

Up

Quorum is lost; the databases are inaccessible, and immediate IT support is needed to recover the quorum.

 

The new architecture we present aims to tackle the highlighted scenarios.

The proposed architecture – Clusterless Availability Group

SQL Server 2017 introduces read-scale availability groups without a cluster.

A diagram of a group of objects

AI-generated content may be incorrect.
In the same way as a conventional clustered availability group, a Read-Scale availability group (often referred to as Clusterless; the term Read-Scale is merely a marketing label by Microsoft) allows for data synchronization between primary and secondary replicas to be configured in either Synchronous-Commit mode or Asynchronous-Commit mode at any time without causing service interruptions. A clusterless availability group only allows for manual failover, and does not support automatic failover. Using a clusterless availability group offers the benefit of removing the necessity for WSFC quorum and the cloud witness, thereby simplifying maintenance. Moreover, the disaster recovery scenarios for our SQL Servers will be less complex as described below:

Primary Server

Secondary Server

Database accessibility within the availability group

Up

Up

Primary database server is Read-Write accessible.

Up

Down

Primary database server is Read-Write accessible.

Down

Up

DBA must conduct a Manual Failover to the Secondary Server to resume Read-Write access to the databases in Availability Group.

 

Following Actions

DBA team is looking to obtain a new set of SQL Server VMs, one in the US and the other in DK, to set up a lab environment for assessing the clusterless availability group, focusing on disaster recovery and data integrity. After we secure satisfactory testing results, we can develop a migration plan for the clusterless architecture or include it in the forthcoming SQL Server upgrade initiative (upgrading from version 2019 to 2022/2025).

2025-07-05

Out of Sync status for the AlwaysOn Availability Group Replica

 Last week, a customer mentioned to me that their MSSQL server is running slowly. I verified that the CPU usage of the SQL Server was low, and the memory showed a high Page Life Expectancy value. This application is a vendor package software, with a low workload. Before informing the customer about the health of their SQL Server, I executed Paul Randal's Wait Statistics script. This script indicates that the primary wait type is HADR_SYNC_COMMIT, which signifies that the server is waiting for transaction commit processing for the synchronized secondary databases to harden the log. The screenshot below indicates that 75% of the wait time is attributed to HADR_SYNC_COMMIT.

The customer stated that the Availability Group consists of two replicas: the primary one in Shanghai and the secondary one in Hong Kong, with a network bandwidth of 10Mb/s. They requested that I demonstrate how frequently the AAG Replicas are out of sync. I provided them with the hadr_db_partner_set_sync_state event from the alwayson_health Extended Event session, as shown in the two screenshots below (sync_state NOT indicate out-of-sync, while sync_state LOG signifies resumed):


Given that the out-of-sync situation arises often, I can easily show how to recognize it using the Availability Group Dashboard, as depicted in the image below.

To address this issue, I recommended that the customer switch the AAG sync mode from Synchronous mode to Asynchronous mode; however, the trade-off is that the AAG will not be able to perform automatic failover.

2025-06-05

Obtain all permissions and access rights for every user in the database

SET NOCOUNT ON;

SET XACT_ABORT ON;

BEGIN TRAN

/*

Security Audit Report

1) List all access provisioned to a sql user or windows user/group directly 

2) List all access provisioned to a sql user or windows user/group through a database or application role

3) List all access provisioned to the public role


Columns Returned:

UserName        : SQL or Windows/Active Directory user account.  This could also be an Active Directory group.

UserType        : Value will be either 'SQL User' or 'Windows User'.  This reflects the type of user defined for the 

                  SQL Server user account.

DatabaseUserName: Name of the associated user as defined in the database user account.  The database user may not be the

                  same as the server user.

Role            : The role name.  This will be null if the associated permissions to the object are defined at directly

                  on the user account, otherwise this will be the name of the role that the user is a member of.

PermissionType  : Type of permissions the user/role has on an object. Examples could include CONNECT, EXECUTE, SELECT

                  DELETE, INSERT, ALTER, CONTROL, TAKE OWNERSHIP, VIEW DEFINITION, etc.

                  This value may not be populated for all roles.  Some built in roles have implicit permission

                  definitions.

PermissionState : Reflects the state of the permission type, examples could include GRANT, DENY, etc.

                  This value may not be populated for all roles.  Some built in roles have implicit permission

                  definitions.

ObjectType      : Type of object the user/role is assigned permissions on.  Examples could include USER_TABLE, 

                  SQL_SCALAR_FUNCTION, SQL_INLINE_TABLE_VALUED_FUNCTION, SQL_STORED_PROCEDURE, VIEW, etc.   

                  This value may not be populated for all roles.  Some built in roles have implicit permission

                  definitions.          

ObjectName      : Name of the object that the user/role is assigned permissions on.  

                  This value may not be populated for all roles.  Some built in roles have implicit permission

                  definitions.

ColumnName      : Name of the column of the object that the user/role is assigned permissions on. This value

                  is only populated if the object is a table, view or a table value function.                 

*/


--List all access provisioned to a sql user or windows user/group directly 

SELECT  DB_NAME() AS DB,

    [UserName] = ulogin.[name],

    [UserType] =ulogin.type_desc,  

    [DatabaseUserName] = princ.[name],       

    [Role] = null,      

    [PermissionType] = perm.[permission_name],       

    [PermissionState] = perm.[state_desc],       

    [ObjectType] = obj.type_desc,--perm.[class_desc],       

    [ObjectName] = OBJECT_NAME(perm.major_id),

    [ColumnName] = col.[name]

FROM    

    --database user

    sys.database_principals princ  

LEFT JOIN

    --Login accounts

    sys.server_principals ulogin on princ.[sid] = ulogin.[sid]

LEFT JOIN        

    --Permissions

    sys.database_permissions perm ON perm.[grantee_principal_id] = princ.[principal_id]

LEFT JOIN

    --Table columns

    sys.columns col ON col.[object_id] = perm.major_id 

                    AND col.[column_id] = perm.[minor_id]

LEFT JOIN

    sys.objects obj ON perm.[major_id] = obj.[object_id]

WHERE 

    princ.[type] in ('S','U')

UNION

--List all access provisioned to a sql user or windows user/group through a database or application role

SELECT    DB_NAME() AS DB,

[UserName] = ulogin.[name],

    [UserType] =ulogin.type_desc,

    [DatabaseUserName] = memberprinc.[name],   

    [Role] = roleprinc.[name],      

    [PermissionType] = perm.[permission_name],       

    [PermissionState] = perm.[state_desc],       

    [ObjectType] = obj.type_desc,--perm.[class_desc],   

    [ObjectName] = OBJECT_NAME(perm.major_id),

    [ColumnName] = col.[name]

FROM    

    --Role/member associations

    sys.database_role_members members

JOIN

    --Roles

    sys.database_principals roleprinc ON roleprinc.[principal_id] = members.[role_principal_id]

JOIN

    --Role members (database users)

    sys.database_principals memberprinc ON memberprinc.[principal_id] = members.[member_principal_id]

LEFT JOIN

    --Login accounts

    sys.server_principals ulogin on memberprinc.[sid] = ulogin.[sid]

LEFT JOIN        

    --Permissions

    sys.database_permissions perm ON perm.[grantee_principal_id] = roleprinc.[principal_id]

LEFT JOIN

    --Table columns

    sys.columns col on col.[object_id] = perm.major_id 

                    AND col.[column_id] = perm.[minor_id]

LEFT JOIN

    sys.objects obj ON perm.[major_id] = obj.[object_id]

UNION

--List all access provisioned to the public role, which everyone gets by default

SELECT    DB_NAME() AS DB,

    [UserName] = '{All Users}',

    [UserType] = '{All Users}', 

    [DatabaseUserName] = '{All Users}',       

    [Role] = roleprinc.[name],      

    [PermissionType] = perm.[permission_name],       

    [PermissionState] = perm.[state_desc],       

    [ObjectType] = obj.type_desc,--perm.[class_desc],  

    [ObjectName] = OBJECT_NAME(perm.major_id),

    [ColumnName] = col.[name]

FROM    

    --Roles

    sys.database_principals roleprinc

LEFT JOIN        

    --Role permissions

    sys.database_permissions perm ON perm.[grantee_principal_id] = roleprinc.[principal_id]

LEFT JOIN

    --Table columns

    sys.columns col on col.[object_id] = perm.major_id 

                    AND col.[column_id] = perm.[minor_id]                   

JOIN 

    --All objects   

    sys.objects obj ON obj.[object_id] = perm.[major_id]

WHERE

    --Only roles

    roleprinc.[type] = 'R' AND

    --Only public role

    roleprinc.[name] = 'public' AND

    --Only objects of ours, not the MS objects

    obj.is_ms_shipped = 0

ORDER BY

    princ.[Name],

    OBJECT_NAME(perm.major_id),

    col.[name],

    perm.[permission_name],

    perm.[state_desc],

    obj.type_desc--perm.[class_desc] 

ROLLBACK

2025-05-07

Comparing Agent Jobs among SQL Server instances utilizing PowerShell

Utilizing dbatools along with the Compare-Object PowerShell cmdlet allows for the comparison of SQL Server Agent jobs across two SQL Server instances.

Set-ExecutionPolicy Unrestricted

Set-PSRepository -Name PSGallery -InstallationPolicy Trusted

Import-Module dbatools

Set-DbatoolsConfig -FullName sql.connection.trustcert -Value $true -Register

Set-DbatoolsConfig -FullName sql.connection.encrypt -Value $false -Register

$RefServer = Get-DbaAgentJob -SqlInstance REFERENCE_SQL -SqlCredential sa

$DiffServer = Get-DbaAgentJob -SqlInstance DIFFERENCE_SQL -SqlCredential sa

Compare-Object $RefServer $DiffServer

The result of the comparison indicates whether a property value appeared only in the reference object (<=) or only in the difference object (=>).



2025-04-09

Retrieve Logins and their associated Database Roles

To obtain the list of accounts at the database level along with their corresponding roles, execute the SQL script provided below:

USE tempdb

GO

SELECT GETDATE() AS [now];

SELECT @@SERVERNAME AS servername, [name] AS LoginName, [type_desc], create_date, sid,

CASE WHEN EXISTS (

SELECT 1 FROM sys.server_role_members AS M

JOIN sys.server_principals AS R ON M.role_principal_id = R.principal_id

WHERE LOWER(R.name) = 'sysadmin' AND M.member_principal_id = P.principal_id) THEN 1 ELSE 0 END AS is_sysadmin

INTO #tmp_Logins

FROM sys.server_principals AS P WHERE [type] IN ('S', 'U', 'G') AND is_disabled = 0

AND UPPER([name]) NOT LIKE 'NT SERVICE\%' AND UPPER([name]) NOT LIKE 'NT AUTHORITY\%';


SELECT servername, LoginName, [type_desc], create_date, is_sysadmin FROM #tmp_Logins;


EXEC sp_MSforeachdb 'USE [?];

IF DB_NAME() NOT IN (''master'', ''model'', ''tempdb'', ''msdb'')

BEGIN

SELECT DB_NAME() AS [database], SL.LoginName AS LoginName, R.[name] AS RoleName

FROM sys.database_principals AS DP JOIN #tmp_Logins AS SL ON DP.sid = SL.sid

LEFT JOIN sys.database_role_members AS M ON M.member_principal_id = DP.principal_id

LEFT JOIN sys.database_principals AS R ON R.principal_id = M.role_principal_id AND R.is_fixed_role = 1

END

';

DROP TABLE #tmp_Logins;

Here is an example of the result:


2025-04-08

Query Store Runtime Statistics and Wait Statistics

This SQL script queries the Query Store Catalog Views to extract Runtime and Wait Statistics for the slowest query in a stored procedure, specifically within a designated time frame. The database must have the Query Store feature enabled first.

SELECT SYSDATETIMEOFFSET();

USE AppDB; -- DB name here

GO

SET XACT_ABORT ON;

SET NOCOUNT ON;

BEGIN TRAN

SELECT OBJECT_NAME(qsq.object_id) AS ObjName,

qsq.query_id, [plan].plan_id, qsqt.query_text_id, qsqt.query_sql_text, RS.runtime_stats_id, Interval.runtime_stats_interval_id,

qsq.last_execution_time AS query_last_exec_time, [plan].last_execution_time AS plan_last_exec_time,

Interval.start_time AS interval_start, Interval.end_time AS interval_end,

RS.first_execution_time AS RS_1st_exec_time, RS.last_execution_time AS RS_last_exec_time, RS.count_executions,

RS.avg_duration/1000.0/1000.0 AS avg_secs, RS.max_duration/1000.0/1000.0 AS max_secs, RS.min_duration/1000.0/1000.0 AS min_secs, RS.execution_type_desc,

RS.avg_logical_io_reads, RS.max_logical_io_reads, RS.avg_rowcount, RS.max_rowcount, RS.avg_logical_io_writes, RS.max_logical_io_writes, RS.avg_physical_io_reads, RS.max_physical_io_reads, RS.avg_tempdb_space_used, RS.max_tempdb_space_used

INTO #Temp1_QS_Perf_Analysis

FROM sys.query_store_query qsq

JOIN sys.query_store_query_text qsqt

ON qsq.query_text_id = qsqt.query_text_id

JOIN sys.query_store_plan [plan]

ON qsq.query_id = [plan].query_id

JOIN sys.query_store_runtime_stats RS

ON RS.plan_id = [plan].plan_id

JOIN sys.query_store_runtime_stats_interval Interval

ON RS.runtime_stats_interval_id = Interval.runtime_stats_interval_id

WHERE

/* Modify Criteria Here */

qsq.object_id IN (

OBJECT_ID('AppSP')) -- stored proc name here

AND qsq.last_execution_time >= '20250408 00:00'

AND Interval.start_time >= '20250408 00:00' AND Interval.end_time <= '20250408 04:00'

AND [plan].last_execution_time >= '20250408 00:00'

;

 

DECLARE @SlowQueryId bigint, @SlowPlanId bigint, @SlowIntervalId bigint, @FastPlanId bigint, @FastIntervalId bigint;

SELECT TOP 1 'SLOWEST', * FROM #Temp1_QS_Perf_Analysis ORDER BY max_secs DESC;

SELECT TOP 1 @SlowQueryId=query_id, @SlowPlanId=plan_id, @SlowIntervalId=runtime_stats_interval_id FROM #Temp1_QS_Perf_Analysis ORDER BY max_secs DESC;

SELECT TOP 1 'FASTEST', * FROM #Temp1_QS_Perf_Analysis WHERE query_id=@SlowQueryId AND avg_rowcount>0 ORDER BY min_secs;

SELECT TOP 1 @FastPlanId=plan_id, @FastIntervalId=runtime_stats_interval_id FROM #Temp1_QS_Perf_Analysis WHERE query_id=@SlowQueryId ORDER BY min_secs;

SELECT * FROM #Temp1_QS_Perf_Analysis WHERE query_id=@SlowQueryId ORDER BY max_secs;


-- WAIT

SELECT 'SLOW Wait', wait_category_desc, execution_type_desc, avg_query_wait_time_ms/1000 AS avg_wait_time_Seconds

FROM sys.query_store_wait_stats

WHERE plan_id=@SlowPlanId AND runtime_stats_interval_id=@SlowIntervalId

--GROUP BY wait_category_desc, execution_type_desc

ORDER BY avg_wait_time_Seconds DESC;


SELECT 'FAST Wait', wait_category_desc, execution_type_desc, avg_query_wait_time_ms/1000 AS avg_wait_time_Seconds

FROM sys.query_store_wait_stats

WHERE plan_id=@FastPlanId AND runtime_stats_interval_id=@FastIntervalId

--GROUP BY wait_category_desc, execution_type_desc

ORDER BY avg_wait_time_Seconds DESC;

 

DROP TABLE #Temp1_QS_Perf_Analysis;

ROLLBACK

Demonstrative result:


2025-03-04

Side-by-side SQL Server Always-On cluster upgrade using Log Shipping

 Log Shipping 1st Run:

  1. Transfer SQL Server logins from the old SQL Server 2012&2016 to the new SQL Server 2022 primary and secondary servers.
  2. Migrate SQL Server Agent jobs from the old SQL Server 2012&2016 to the new SQL Server 2022 primary and secondary servers.
  3. Transfer Linked Servers (if any) from the old SQL Server 2012$&2016 to the new SQL Server 2022 primary and secondary servers.
  4. Disable existing Transaction-Log Backup jobs on the old SQL Server 2012&2016 to prevent interference with Log Shipping.
  5. Establish a shared folder as the log shipping backup file destination path, ensuring it has read-write permissions for the SQL service account and is accessible by the new SQL Server 2022 servers.
  6. Configure Log Shipping from the primary AG replica of the old SQL Server 2012&2016 to both the primary and secondary AG replicas of the new SQL Server 2022:
    1. Perform a compressed full database backup on the primary AG replica of the old SQL Server 2012&2016.
    2. Copy the backup file to both the primary and secondary AG replicas of the new SQL Server 2022.
    3. Restore the backups on the new SQL Server 2022 servers using the WITH NORECOVERY option.
    4. Enable Log Shipping on the databases in the primary AG replica of the old SQL Server 2012/&2016.
    5. Configure the log shipping jobs, selecting the created shared folder as the backup path.
    6. Add both the primary and secondary AG replicas of the new SQL Server 2022 as the Log Shipping destination servers.
    7. Verify the Log Shipping status using SQL Server 2022 standard reports --> Transactional Log Shipping Status.
    8. Repeat the above steps for each application database.

Bring SQL Server 2022 Database Online for Application Testing:
  1. Disable Log Shipping on the old SQL Server 2012&2016.
  2. Re-enable existing log backup jobs on the old SQL Server 2012&2016 that were disabled during the first run of Log Shipping.
  3. Bring databases online in the primary replica of the new SQL Server 2022 (RESTORE LOG <db> WITH RECOVERY).
  4. Configure Always-On Availability Group between the primary and secondary replicas of the new SQL Server 2022.

Log Shipping 2nd Run:
  1. Disable existing Transaction-Log Backup jobs on the old SQL Server 2012&2016 to prevent interference with Log Shipping.
  2. Remove the Always-On Availability Group in the new SQL 2022 servers.
  3. Configure Log Shipping from the primary AG replica of the old SQL Server 2012&2016 to both the primary and secondary AG replicas of the new SQL Server 2022:
    1. Perform a compressed full database backup on the primary AG replica of the old SQL Server 2012&2016.
    2. Copy the backup file to both the primary and secondary AG replicas of the new SQL Server 2022.
    3. Restore the backups on the new SQL Server 2022 servers using the WITH NORECOVERY option.
    4. Enable Log Shipping on the databases in the primary AG replica of the old SQL Server 2012&2016.
    5. Configure the log shipping jobs, selecting the created shared folder as the backup path.
    6. Add both the primary and secondary AG replicas of the new SQL Server 2022 as the Log Shipping destination servers.
    7. Verify the Log Shipping status using SQL Server 2022 standard reports --> Transactional Log Shipping Status.
    8. Repeat the above steps for each application database.

Switch-over:
  1. Terminate all client connections to the old SQL Server 2012&2016 by disabling SQL logins.
  2. Execute LS-Backup jobs on the old SQL Server 2012&2016.
  3. Execute LS-Copy and LS-Restore jobs on the new SQL Server 2022 servers.
  4. Disable Log Shipping on the old SQL Server 2012&2016.
  5. Bring databases online in the primary replica of the new SQL Server 2022 (RESTORE LOG <db> WITH RECOVERY).
  6. Configure Always-On Availability Group between the primary and secondary replicas of the new SQL Server 2022.
  7. Estimated switch-over time: 15 minutes.