2022-01-07

Server Low Memory causes SQL Server hang and Availability Group failover

This month blog post I would like to share a case that I encountered and was solved by calling Microsoft premier support. I have a SQL Server 2016 primary instance that configured high availability with other 3 nodes on an availability group, hosting a mission critical database total 6TB, the primary sql instance only had 64GB hardware memory. Client programs sometimes run a batch job which involves high volume of data import and read on that mission critical database. Database inaccessible occurs occasionally while the batch program running, made the batch failed and needed to restart again. Users complained on this problem. Checking the sql server error log, error messages were found as below:

Process 0:0:0 (0x5748) Worker 0x000001DD933D0160 appears to be non-yielding on Scheduler 7. Thread creation time: 13283945227971. Approx Thread CPU Used: kernel 130515 ms, user 0 ms. Process Utilization 62%. System Idle 33%. Interval: 130527 ms.

***Stack Dump being sent to D:\Program Files\Microsoft SQL Server\MSSQL13.HKRDBSP1\MSSQL\LOG\SQLDump0017.txt

Always On Availability Groups connection with secondary database terminated for primary database 'XXX' on the availability replica 'XXX' with Replica ID: {b94d35e1-6b50-48f3-8421-93fe6bf7c8b0}. This is an informational message only. No user action is required.

Unable to access availability database 'XXX' because the database replica is not in the PRIMARY or SECONDARY role. Connections to an availability database is permitted only when the database replica is in the PRIMARY or SECONDARY role. Try the operation again later.

As an application DBA, when we see dump was generated by SQL Server, what we can do is only call Microsoft for help. Below is the investigation done by Microsoft support engineer:

Below are non-yielding call stacks of the dump. We found SQL SERVER was hung when allocating memory. Kernel time is the majority which means the slowness is in windows OS.

# Call Site

00 sqldk!SOS_MemoryBlockAllocator::CommitBlockAndCheckNumaLocality

01 sqldk!SOS_MemoryBlockAllocator::AllocateBlock

02 sqldk!SOS_MemoryWorkSpace::AllocatePage

03 sqldk!MemoryNode::AllocatePagesInternal

04 sqldk!MemoryNode::AllocatePages

05 sqldk!MemoryClerkInternal::AllocatePagesWithFailureMode

06 sqldk!MemoryClerkInternal::AllocatePages

07 sqldk!TVarPageMgr<0>::PviNewVarPage

08 sqldk!TVarPageMgr<0>::PbAllocate

09 sqldk!CMemObj::InlineAlloc

0a sqldk!CMemObj::Alloc

  

IncidentType                          NonYieldingScheduler

OffenderThreadId                 ~256

IsPreemptive                          False

Wall_clock_time_ms             70215

Kernel_time_ms                     70203

User_time_ms                        0

passes                                     15

Suggestion

1. Please enable ‘Lock pages in memory’ for this SQL SERVER. ‘Lock pages in memory’ will use AWE api to allocate memory. It will be more efficient.

Please follow below official article. Add “Lock pages in memory” privilege to “SQL Service account”.

https://technet.microsoft.com/en-us/library/ms190730(v=sql.105).aspx

2. Please increase physical memory for this VM to 128GB, and increase the sql server max server memory setting to 112GB.