Using Extended Events to Visualize Availability Group Replication Internals

SQL 2014 Service Pack 2 was recently released by Microsoft and there is a ton of great new features and enhancements in this release.This isn’t just a collection of bug fixes…there’s some serious value in this Service Pack. Check out the full list here. One of the key things added in this Service Pack is an enhancement of the Extended Events for AlwaysOn Availability Group replication.

Why are the new Availability Group Extended Event interesting?

If you’ve used Availability Groups in production systems with high transaction volumes you know that replication latency can impact your availability. If you want to brush up on that check out our blog posts on AG Replication Latency, Monitoring for replication latency, and issues with the DMVs when monitoring. These new extended events add insight at nearly every point inside your Availability Group’s replication. More importantly they also include duration. So using these Extended Events we can pinpoint latency inside our Availability Group replication.

Here’s the list and description of the new Extended Events:

  • hadr_log_block_group_commit – Log block group commit processing duration
  • log_block_pushed_to_logpool – Log block start push to log pool
  • log_flush_start – Asynchronous log write start
  • hadr_log_block_compression – Log block compression processing duration
  • hadr_capture_log_block – Primary has captured a log block
  • hadr_capture_filestream_wait
  • ucs_connection_send_msg – UCS transport connection sent message
  • hadr_log_block_send_complete – After a log block message has been sent. This event is only used for fail points
  • log_flush_complete – Reports that synchronous send is complete
  • hadr_receive_harden_lsn_message – When a new hardened LSN from the secondary
  • hadr_db_commit_mgr_harden – Transaction commit harden result from Hadron Commit management
  • hadr_transport_receive_log_block_message – Fires when we’re receiving new log block message
  • hadr_log_block_decompression – Log block decompression processing duration
  • hadr_apply_log_block – Secondary is going to append a log block to the log
  • hadr_send_harden_lsn_message – Crafting a message to send containing a new hardened LSN on a secondary.  Test only
  • hadr_lsn_send_complete – After an LSN message has been sent.
The source for this list and it’s descriptions is Amit Banerjee’s HADR Virtual Chapter Talk and KB3173156

Using Extended Events to visualize Availability Group Replication Internals

Using these Extended Events, we’re able to get some insight into the replication internals of our Availability Groups. Like I described a second ago, we can use these to measure the duration of each event and sift out performance issues in our replication. But we can also use them to get a better understanding of what goes into Availability Group replication, let’s see how.
 
Here’s the steps I used to generate the chart below:
  1. Let’s add an Event Session with all of these new Events above to the primary and secondary replicas of a two replica synchronous Availability Group
  2. Then with the sessions running, insert only one row on the primary replica. This will cause the AG to replicate the data change event and trigger the new Extended Events on both the primary and secondary replicas
  3. Stop the Event Sessions
  4. Open the Event file in SSMS and View Target Data on both the primary and secondary replicas
  5. Order the Event data by timestamp
  6. Merge the event data together based on the timestamp (I did this in a spreadsheet)
With this list we have the entire replication sequence of Events for that single insert replicating data from the primary to the secondary from the perspective of both the primary and secondary. I took that information and made the following graphic for you visualizing what it takes to replicate data in a synchronous Availability Group.

##AG Synchrnous Replication Internals

 Figure 1: Availability Group Replication Extended Events

From Figure 1, let’s walk through the events…

On the primary

  • The first event that fires is hadr_log_block_group_commit starting things off. It’s my understanding that this initializes the needed code and program state for AG replication
  • Then up next is log_block_pushed_to_logpool. This fires when a log block is copied to the log pool. The log pool is a special memory location used to store log blocks that may need to be read again, since it’s in memory it keeps the readers from doing a physical IO on the transaction log
  • Then log_flush_start, this fires when log blocks are flushed to disk, normal write ahead logging
  • Once the log block has been copied into the log pool and flushed to disk, it’s read from the log pool and compressed then the hadr_log_block_compression event fires
  • After it’s compressed the AG “captures” the log block for replication to the secondary with hadr_capture_log_block 
  • Then builds and sends the message to the secondary and ucs_connection_send_msg fires
  • Once the message is sent the hadr_log_block_send_complete event fires
  • We mark the synchronous send complete with log_flush_complete

On the secondary

  • On the other side of the fence, basically the inverse occurs. We mark the receipt of the message with hadr_transport_receive_log_block_message
  • The log block is decompressed and hadr_log_block_decompression fires
  • Then the log block is appended to the local log buffer and hadr_apply_log_block fires
  • Then the log block is copied to the local log pool and hadr_pushed_to_logpool fires
  • Next, the asynchronous log write happens and log_flush_start marks that start of that
  • And once the asynchronous log write finishes, log_flush_complete fires
  • Once hardened locally we need to craft a message to send to the primary marking the the LSN is hardened and hadr_send_harden_lsn_message fires
  • ucs_connection_send_msg then transmits fires when the message is sent to the primary
  • hadr_lsn_send_complete marks the fires marking the end of sending the message

Back on the primary

  • hadr_receive_harden_lsn_message fires marking that the secondary has acknowledging receipt of the send LSN
  • And to wrap it all up hadr_db_commit_mgr_harden marks the end of the log block replication for this sequence

A few important notes…

  • Many of these Extended Events fire at different points and multiple times during replication under different conditions. There’s a field called mode, which is an integer value, that marks various conditions for the events. I have yet to decode what each value of mode is for all of the events but will update this blog as I get some insight.
     
  • It’s very important to note that this example highlights an insert into an Availability Group with a primary and only one secondary configured in synchronous Availability Mode. The code path and the subsequent Extended Events that fire can change based on Availability Mode (sync/async), replication health, and many other conditions that can impact replication.

I hope that I’ve shown you some interesting concepts that you can use to help troubleshoot your Availability Group replication using Extended Events. This post, highlights how you can use Extended Events to get some insight into the the internals of Availability Group replication.

If you have any questions about this or your Availability Group configuration or performance please feel free to email me at aen@centinosystems.com

Please follow me on Twitter @nocentino to keep up with new blog posts

References

Great references on the internals of the log pool – https://sqlactions.com/2014/03/31/how-it-works-logpool/

HADR Virtual Chapter – http://hadrvc.sqlpass.org/Home.aspx?EventID=5304