Troubleshooting Mirroring: Send/Receive Ack Time exceeded threshold

by Snehashish Ghosh

In this environment, a “Send/Receive Ack Time” threshold has been set up using Database Mirroring Monitor, and an alert is generated through SQL Server Agent.

What is Send/Receive Ack Time?

Milliseconds that messages waited for acknowledgement from the partner, in the last second. This counter is helpful in troubleshooting a problem that might be caused by a network bottleneck, such as unexplained failovers, a large send queue, or high transaction latency.

In such cases, we can analyse the value of this counter to determine whether the network is causing the problem.

http://msdn.microsoft.com/en-us/library/ms189931(v=sql.105).aspx

As Books Online mentions, “Send/Receive Ack Time” can be used to measure network latency between the principal and mirror servers.

This counter is useful when trying to determine whether we are experiencing database mirroring issues due to network latency. If this value is larger than normal, it suggests a network bottleneck between the principal and mirror servers, so we should focus our effort on finding and fixing the cause of that bottleneck.

Initial data to collect

1. What is the average CPU utilization on the principal server (P01DB001) and the mirror server (P01DB002)? How many physical processors are present?

This gives an overview of overall system performance.

2. What is the operating mode of mirroring?                                                                  

Asynchronous Database Mirroring (High-Performance Mode)

or

Synchronous Database Mirroring (High-Safety Mode)

Knowing this helps when trying a different mirroring operating mode (for example, if it is synchronous, try asynchronous and observe the behaviour in a test environment).

3. What is the current size of the database? What is the daily, weekly, and monthly incremental growth of the database?

This gives an idea of how much data is traversing the network.

4. What is the value of MAXDOP?

This shows whether any value other than 0 (the default) is specified, which would restrict CPU usage.

Troubleshooting

1. Because “Send/Receive Ack Time” can be used to measure network latency between the principal and mirror servers, we need to determine whether the amount of activity is pushing the limits of the network bandwidth.

How to use: the “Send/Receive Ack Time” counter is cumulative, so we should calculate the change in its value over a given time period.

A high rate of increase in this counter can indicate network problems or a bottleneck.
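The "change over a given time period" calculation can be sketched as follows. This is a minimal illustration, assuming two readings of the cumulative counter have already been captured; the `CounterSample` type, the function name, and the numbers are hypothetical and not part of any SQL Server tooling.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """One reading of a cumulative counter (hypothetical helper type)."""
    taken_at_s: float  # sample time in seconds (any consistent clock)
    value_ms: int      # cumulative Send/Receive Ack Time in milliseconds


def ack_ms_per_second(first: CounterSample, second: CounterSample) -> float:
    """Average ack wait (ms) accumulated per second between two samples."""
    elapsed = second.taken_at_s - first.taken_at_s
    if elapsed <= 0:
        raise ValueError("second sample must be taken after the first")
    return (second.value_ms - first.value_ms) / elapsed


# Illustrative numbers only: two readings taken 60 seconds apart.
before = CounterSample(taken_at_s=0.0, value_ms=1_200_000)
after = CounterSample(taken_at_s=60.0, value_ms=1_230_000)
rate = ack_ms_per_second(before, after)
print(rate)  # 500.0
```

Comparing this per-second rate against a baseline taken during a healthy period is more meaningful than looking at the raw cumulative value.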

The following Performance Monitor counters (perfmon) can be used in order to establish if the bandwidth is enough or fully utilized:

 i) On the database servers, the following perfmon counters from the SQLServer:Database Mirroring object need to be collected:

 Principal

Log Bytes Sent/Sec   – Number of bytes of log sent per second

Log Send Queue KB  – Total number of kilobytes of log that have not been sent to the mirror server

Transaction Delay  – Number of milliseconds transaction termination waited for acknowledgement per second.

Send/Receive Ack Time – Milliseconds messages waited for acknowledgement from the partner per second.

Log Compressed Bytes Sent/sec – Number of compressed log bytes sent per second; this is a subset of the Log Bytes Sent/sec counter.

Mirror

Log Bytes Received/Sec – Number of bytes of log received per second

Redo Queue KB  – Total number of kilobytes that redo on the mirror database is behind the hardened log

Transaction Delay  – Number of milliseconds transaction termination waited for acknowledgement per second.

Redo Bytes/Sec  – Number of bytes of log redone by the mirror database per second

Log Harden Time (ms) – This is the time to write the received log to disk.

Log Compressed Bytes Rcvd/sec – Number of compressed log bytes received per second; this is a subset of the Log Bytes Received/sec counter.

Send/Receive Ack Time – Milliseconds messages waited for acknowledgement from the partner per second.

We can watch the log send and receive rate to see how much of the log we are shipping and how fast.

Looking at Redo Queue KB, we can see how much faster the send rate is than the Redo Bytes/Sec processing rate.

If redo is slower than send it will build up the redo queue.

The transaction delay shows details around the send and ack delays.

Send/Receive Ack Time shows the time (ms) that we are throttled in flow control, where we had a packet ready to send but could not send it because the ack that releases flow control had not yet arrived. Adding network latency to the send and ack will drive this number up.

The Send/Receive Ack Time counter is helpful in troubleshooting a problem that might be caused by a network bottleneck, such as unexplained failovers, a large send queue, or high transaction latency. In such cases, we can analyze the value of this counter to determine whether the network is causing the problem.

 If we have a mirroring session enabled then these counters will also appear in the sys.dm_os_performance_counters DMV, making it easier for us to get the details without having to configure Performance Monitor to do the collection.
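A rough sketch of reading these counters from the DMV is shown below. The T-SQL is kept as a string so it can be run through any SQL Server client; the `LIKE` filter is an assumption meant to match both default and named instances. The `index_counters` helper, the sample rows, and the database name `SalesDB` are hypothetical, included only to show the shape of the result.

```python
# T-SQL to run on the principal or the mirror (for example in SSMS).
# The LIKE filter is an assumption intended to match the Database Mirroring
# object on both default and named instances.
MIRRORING_COUNTERS_SQL = """
SELECT object_name, counter_name, instance_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Database Mirroring%'
ORDER BY instance_name, counter_name;
"""


def index_counters(rows):
    """Map (database, counter) -> value from rows of the query above.

    The DMV returns its name columns as fixed-width text, so the names
    are stripped of padding before being used as dictionary keys.
    """
    result = {}
    for _object_name, counter_name, instance_name, cntr_value in rows:
        key = (instance_name.strip(), counter_name.strip())
        result[key] = int(cntr_value)
    return result


# Made-up rows in the shape the query returns; the database name is hypothetical.
sample_rows = [
    ("SQLServer:Database Mirroring   ", "Send/Receive Ack Time   ", "SalesDB   ", 1230000),
    ("SQLServer:Database Mirroring   ", "Log Send Queue KB       ", "SalesDB   ", 2048),
]
counters = index_counters(sample_rows)
print(counters[("SalesDB", "Send/Receive Ack Time")])  # 1230000
```

The values returned this way are raw counter values, so time-based counters still need to be sampled twice and differenced.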


2. If available, compare the Perfmon counter Send/Receive Ack Time between the environment that is running slowly and another environment that is running fast. If the counter value is much higher in the slow environment, that points to a networking issue between the partners, or to a network that is not configured correctly.

Look at the network traffic and check for a pattern of a small burst of I/O from the principal, a wait, then another small burst, and so on.

 3. Try to suppress the message by increasing the timeout value.

However, if the network is a bottleneck, there is a chance that Send/Receive Ack Time will still reach the threshold.

 The issue may be with a network switch, so we should collaborate with the networking team to fix any switch problems.

 4. To detect a network bottleneck, we can look at the following counters in the Network Interface object of Performance Monitor:

 Network Interface Object – The Network Interface performance object consists of counters that measure the rates at which bytes and packets are sent and received over a TCP/IP network connection. It includes counters that monitor connection errors.

 a. Network Interface\Bytes Total/Sec: This measures the rate at which bytes are sent and received over each network adapter, including framing characters. The network is saturated if more than 70 percent of the interface is consumed. For a 100-Mbps NIC, that threshold is about 8.75 MB/sec (100 Mbps = 100,000 kbps = 12.5 MB/sec; 12.5 MB/sec * 70 percent = 8.75 MB/sec). In a situation like this, you may want to add a faster network card or segment the network.

 b. Network Interface\Output Queue Length: This measures the length of the output packet queue, in packets.

There is network saturation if the value is more than 2. You can address this problem by adding a faster network card or segmenting the network.
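The two rules of thumb above (more than 70 percent of the link in use, or an output queue longer than 2) can be expressed as a small check. This is only a sketch: the function names are made up for illustration, and 1 MB is taken as 1,000,000 bytes to match the 12.5 MB/sec figure for a 100-Mbps link.

```python
def saturation_threshold_mb_per_s(link_mbps: float, fraction: float = 0.70) -> float:
    """Bytes Total/sec level (MB/s, 1 MB = 1,000,000 bytes) at `fraction` of link speed."""
    return link_mbps / 8 * fraction


def looks_saturated(bytes_total_per_s: float, output_queue_length: float,
                    link_mbps: float) -> bool:
    """Apply both rules of thumb: over 70 percent utilization, or queue above 2."""
    over_bandwidth = bytes_total_per_s / 1_000_000 > saturation_threshold_mb_per_s(link_mbps)
    over_queue = output_queue_length > 2
    return over_bandwidth or over_queue


print(f"{saturation_threshold_mb_per_s(100):.2f}")   # 8.75
print(f"{saturation_threshold_mb_per_s(1000):.2f}")  # 87.50
print(looks_saturated(9_500_000, 0, 100))            # True
print(looks_saturated(3_000_000, 1, 100))            # False
```

The counter values passed in would come from Network Interface\Bytes Total/Sec and Network Interface\Output Queue Length, sampled over a representative period.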

 5. To detect transaction latency, we can use warning thresholds and alerts on Mirror commit overhead in Database Mirroring Monitor. Mirror commit overhead specifies the number of milliseconds of average delay per transaction that are tolerated before a warning is generated on the principal server. This delay is the amount of overhead incurred while the principal server instance waits for the mirror server instance to write the transaction’s log record into the redo queue. This value is relevant only in high-safety mode.

Please refer: http://technet.microsoft.com/en-us/library/ms408393.aspx

 6. In addition, a tool such as NTttcp (see “How to Use NTttcp to Test Network Performance”, http://msdn.microsoft.com/en-us/library/windows/hardware/gg463264.aspx) might be able to confirm whether there is a “weakest link in the chain” between the principal and mirror during otherwise normal everyday network traffic.

7. – Find out the round-trip ping time.

– Increase the TcpWindowSize on both servers (HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\Interfaces\ interface-name)

http://technet.microsoft.com/en-us/library/cc938219.aspx
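One way to reason about why the round-trip time and the TCP window size belong together is the bandwidth-delay product: a single connection cannot move more than one window of data per round trip. The sketch below is a general networking calculation, not something taken from the mirroring documentation, and the link speed, round-trip time, and window size are illustrative assumptions.

```python
def bandwidth_delay_product_bytes(link_mbps: float, rtt_ms: float) -> float:
    """Bytes that must be in flight to keep the link full for one round trip."""
    bytes_per_second = link_mbps * 1_000_000 / 8
    return bytes_per_second * rtt_ms / 1000


def max_throughput_mbps(window_bytes: int, rtt_ms: float) -> float:
    """Upper bound on single-connection throughput for a given receive window."""
    return window_bytes * 8 * 1000 / rtt_ms / 1_000_000


# Illustrative figures: a 100 Mbps link with a 40 ms round trip.
print(bandwidth_delay_product_bytes(100, 40))     # 500000.0
# A 65,535-byte window over the same 40 ms round trip.
print(round(max_throughput_mbps(65_535, 40), 2))  # 13.11
```

In this made-up case a 65,535-byte window would cap one connection at roughly 13 Mbps on a 100 Mbps link, which is the kind of gap a larger window is meant to close.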

 8. Since we are using SQL 2008 R2 (SP2), we can take advantage of a Database Mirroring (DBM) feature introduced in SQL Server 2008 called “Database Mirroring Log Compression”.

The outgoing log stream from the principal to the mirror is compressed, thereby minimizing the network bandwidth used by database mirroring. We could make sure the feature is enabled and check whether this solves the issue.

 9. Another aspect that we could check is disk latency, as this could play a role.

Disk Write Bytes/sec: The rate at which the disk is written to.

For both the principal and the mirror, we should monitor this counter for the data disks as well as the log disks.

CONCLUSION

Consider a hypothetical scenario as below:

 Log Bytes Sent/sec = 1.80 MB

Log Compressed Bytes Sent/sec = 0.40 KB

Log Send Queue KB = 16.67 GB -> This is the amount of log yet to be sent to the mirror

Send/Receive Ack Time = 18.6 seconds -> Time that messages waited for acknowledgement from the partner

Log Send Flow Control Time = 0.9 sec -> This measures how long a mirroring connection had to wait before it could use the mirroring flow control buffer.

Looking at the above data, we see multiple factors that point to a network bottleneck:

1. The Send/Receive Ack Time is 18.6 seconds, which is far too high.

2. The Log Send Queue is 16.67 GB, meaning we are waiting for earlier messages to be acknowledged before we can pick up the next one from the send queue.
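To put the size of that send queue in perspective, a back-of-the-envelope estimate of how long it would take to drain at the observed send rate is sketched below. It assumes the rate stays constant and no new log is generated in the meantime, which is optimistic for a live system; the function name is made up for illustration.

```python
def drain_time_hours(send_queue_gb: float, send_rate_mb_per_s: float) -> float:
    """Hours to send the queued log at a constant rate, with no new log arriving."""
    if send_rate_mb_per_s <= 0:
        raise ValueError("send rate must be positive")
    queue_mb = send_queue_gb * 1024
    return queue_mb / send_rate_mb_per_s / 3600


# Values from the hypothetical scenario above.
hours = drain_time_hours(send_queue_gb=16.67, send_rate_mb_per_s=1.80)
print(f"{hours:.1f} hours")  # 2.6 hours
```

Even under these favourable assumptions the mirror would be hours behind the principal.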

 The issue might not be a bandwidth problem but some other kind of networking issue that is causing latency, which is why the networking team should be involved.

 Check the value of is_send_flow_controlled in sys.dm_db_mirroring_connections while the issue is happening.

This column indicates whether network sends have been postponed due to network flow control because the network is busy.

If the value is 1, network latency is affecting mirroring throughput.

Please refer: http://technet.microsoft.com/en-us/library/ms189796(v=sql.105).aspx

 Large transactions or reindexing or maintenance activities on the principal database can also affect database mirroring (DBM) performance.
