Monitoring Availability Groups with Redgate SQL Monitor

In previous posts here and here we discussed AlwaysOn Availability Group replication latency and monitoring concepts, specifically the importance of monitoring the send_queue and redo_queue. In this post I’m going to show you a technique for monitoring Availability Group replication latency with Redgate SQL Monitor and its Custom Metric functionality.

Here’s the rub, monitoring AGs is a little interesting for the following reasons

  1. We’re interested in trending and monitoring and that isn’t built into SQL Server or SSMS’s AlwaysOn Dashboard.  Both provide only point in time values. 
  2. We’ll need to monitor the health of the Availability Group as a whole. So we want to track performance data on all replicas in the AG. But interestingly the redo queue and send queue values in the DMVs on the primary are always NULL. So we need to get those values from the secondary replicas.
  3. Further, to work this into SQL Monitor’s Custom Metric framework we’ll need to limit our query’s result set to a single row and value.

Redo Queue

The redo queue is the amount of log records that haven’t been sent to a secondary replica in an AG. We want to track this as it is a measure of the amount of data on a secondary that is not yet redone into the database and can impact operations offloaded to secondaries

The bold text headings of the next section match the configuration areas on the Custom Metric screen in SQL Monitor.

Enter the T-SQL query that will collect data:

SELECT  rs.redo_queue_size
FROM    sys.dm_hadr_database_replica_states rs
        JOIN sys.availability_replicas r ON r.group_id = rs.group_id
                                            AND r.replica_id = rs.replica_id
WHERE   rs.database_id = DB_ID()
        AND r.replica_server_name = @@SERVERNAME
        AND rs.redo_queue_size IS NOT NULL;

This query will execute on each replica in the AG and will return data only for that replica. So if a replica does go offline we will loose the ability to query its information…but hey it’s offline. So it will be pretty hard to query regardless. But in reality this isn’t really good since the send queue values on the primary are always NULL. 

Select instance to collect from

Select all the instance that are in the AG you want to monitor. The T-SQL query will handle multiple Availability Groups as the query restricts its data to database ID on a particular server and a database can only be a member of one AG.

Choose databases to collect from

Please select all the databases that you are interested in monitoring.

Set collection frequency

Collect data every minute. I wish we could set this to a lower value.

Use collected or calculated values

Leave this unchecked. The DMVs queries return point in time values that we’ll want to graph the data as absolute values.

Send Queue

The send queue is used to store log records that will be sent from the primary to the secondary replica. We want to track this as it is a measure of the amount of data that is not on a secondary replica and can impact recovery point if there is a failure of a primary.

All of the configuration settings for this Custom Metric are the same as above except the T-SQL Query. Here’s that code:

Enter the T-SQL query that will collect data:

SELECT  rs.log_send_queue_size
FROM    sys.dm_hadr_database_replica_states rs
        JOIN sys.availability_replicas r ON r.group_id = rs.group_id
                                            AND r.replica_id = rs.replica_id
WHERE   rs.database_id = DB_ID()
        AND r.replica_server_name = @@SERVERNAME
        AND rs.log_send_queue_size IS NOT NULL

Send Queue Custom Metric Analysis

So let’s see this in action, like the last post we have a system with three replicas. Two are on the same subnet and a one on a simulated remote subnet. We added a 1000ms delay on the simulated WAN link to the remote location to simulate poor network conditions. 
 
Send Queue Availability Group Custom Metric
 
With the WAN latency added and a heavy write workload running, the primary replica is trying to send data to a secondary replica faster than the network can support and has to store the data in the send_queue and SQL Monitor gives us the ability to quickly see this in it’s Analysis tab.
 
Now let’s simulate a network outage by knocking out the WAN link that connects our primary site and secondary site. Sadly, at this point we loose visibility into the send_queue values on the unreachable secondary, but we’ll have the most recent value before it went offline.
 
With the WAN link down, the primary replica cannot send data to the secondary at the remote subnet and has to start queuing the data in the send_queue until the secondary comes back online.

Redo Queue Custom Metric Analysis

So with the WAN link down the send_queue builds up and once service is restored, all the log from the primary is sent over to the secondary. But at a rate that is faster than can be processed by the redo thread, so it goes into the redo_queue…let’s see that in a graph
 
Redo Queue Availability Group Custom Metric
 
 
Synthetic tests are great, but here is a screenshot from an actual workload from a production system. In this sample you can see a surge in the send_queue followed by a surge in the redo queue. The workload in this sample is database maintenance doing index rebuilds and more interesting is that this these replicas are on the same subnet!
 
Alwayson Availbility Group Replication Latency

 

 
I hope these Custom Metrics are useful to you and provide some insight into the replication health of your Availability Groups. Having a firm understanding of your AG replication and latency will allow you to answer some hard questions when it comes to where your data is at what point in time. 
 
Twitter @nocentino
Did you like this post? If so, click here for to sign up for our newsletter to get more SQL goodness delivered right to your inbox!