Designing for offloaded log backups in AlwaysOn Availability Groups – Part 1
AlwaysOn Availability Groups made their initial appearance in SQL Server 2012 and have generated a lot of buzz: HA and DR in one! Even with AGs, backups are still integral to your DR strategy, and AGs give you the option to offload backups to a secondary replica. In this post we’re going to talk about offloaded log backups and their potential impact on your databases’ recoverability under certain conditions. We’ll begin with some preliminaries on data movement in AGs.
How data is moved in Availability Groups
Roughly speaking, data is synchronized between replicas in an Availability Group by sending log blocks from the transaction log of the primary (read/write) replica over a database mirroring endpoint to the secondary replicas. However, log blocks are not sent immediately; they pass through a set of queues:
- Send queue – a queue used to store log records that will be sent from the primary to the secondary replica
- Redo queue – a queue used to store log records that have been received on the secondary and still have to be “played back” on the secondary replica
The send queue’s impact on recovery point
If log generation on the primary exceeds the rate at which log blocks are sent to the secondaries, or if a secondary becomes unavailable, log blocks build up in the send queue. Any unsent data in the send queue is the amount of data at risk in the event of a disastrous failure of the primary replica.
You might be thinking to yourself, “I configured the AG in synchronous availability mode, so I should be safe.” Even in synchronous mode, the send queue can build up. A secondary replica in an Availability Group can become unavailable for any number of reasons (patching, a network outage, and so on), and while it is unavailable the data destined for it is held in the send queue. Once the secondary comes back online, the primary works diligently to send the queued data to it.
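To put a number on that exposure, here is a minimal sketch of the arithmetic. It assumes constant log generation and send rates, which real workloads will not have, and the function name and the figures are purely illustrative.

```python
def send_queue_backlog_kb(log_rate_kb_s, send_rate_kb_s, seconds, start_kb=0.0):
    """Estimate unsent log (KB) after `seconds`.

    This is the data at risk if the primary is lost at that moment.
    Rates are assumed constant, which is a simplification.
    """
    growth = (log_rate_kb_s - send_rate_kb_s) * seconds
    # The queue cannot go below empty when sending outpaces generation.
    return max(0.0, start_kb + growth)


# Illustrative scenario: the secondary is offline for 30 minutes (send rate 0)
# while the primary generates 2,000 KB of log per second.
backlog = send_queue_backlog_kb(log_rate_kb_s=2000, send_rate_kb_s=0, seconds=30 * 60)
print(f"Unsent log at risk: {backlog / 1024:.0f} MB")  # prints: Unsent log at risk: 3516 MB
```

Even a modest log rate adds up to gigabytes of unprotected change over a routine maintenance window.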
The redo queue’s impact on recovery point
If log blocks arrive at a secondary faster than the redo thread can process them, the redo queue will grow. This can happen when the secondary simply cannot keep up with the rate of change on the primary, or after an outage of a secondary. A practical example of the latter is when a secondary comes back online after an outage during which a large amount of data changed. Think of the times when database maintenance is running on a weekend and the network team just happens to be updating switch firmware. All of that change is queued in the send queue on the primary and, when the secondary is back online, quickly shipped over to the redo queue on the secondary.
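A rough way to reason about how long the secondary stays behind is to divide the redo backlog by the redo throughput. The sketch below assumes a constant redo rate and no new log arriving in the meantime; the function name and the sample figures are hypothetical.

```python
import math


def redo_catch_up_seconds(redo_queue_kb, redo_rate_kb_s):
    """Estimate seconds for the redo thread to drain the current redo queue.

    Assumes a constant redo rate and ignores log that arrives while draining,
    so treat the result as a lower bound.
    """
    if redo_queue_kb <= 0:
        return 0.0
    if redo_rate_kb_s <= 0:
        return math.inf  # redo is stalled, so the queue never drains
    return redo_queue_kb / redo_rate_kb_s


# Illustrative scenario: 12 GB of queued log, redone at 8 MB per second.
seconds = redo_catch_up_seconds(redo_queue_kb=12 * 1024 * 1024, redo_rate_kb_s=8 * 1024)
print(f"Estimated catch-up time: {seconds / 60:.1f} minutes")  # prints: Estimated catch-up time: 25.6 minutes
```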
Now, with that huge chunk of data change hardened on the secondary and sitting in the redo queue, one would think we’re in the clear. Well, sort of. Yes, your data is in two places, but the offloaded transaction log backups on your secondary may start to fail if the secondary replica is too far behind the primary. Specifically, if the database’s last backup LSN (the log record beyond the end of the last log backup) on the primary is greater than the local redo LSN on the secondary, your backups will fail. This is a protection against gaps in the continuity of the transaction log across backups. It is also a condition you need to be aware of when designing offloaded backups: while it persists, the database cannot take a successful log backup, and that impacts your RPO.
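The rule described above boils down to a single comparison. The sketch below only models that rule; it is not how SQL Server implements the check. LSNs are simplified to plain integers, and the function name and sample values are made up for illustration.

```python
def offloaded_log_backup_allowed(last_backup_lsn_on_primary, redo_lsn_on_secondary):
    """Model of the rule described above, with LSNs simplified to integers.

    The log backup on the secondary fails when the primary's last backup LSN
    is greater than the secondary's local redo LSN.
    """
    return last_backup_lsn_on_primary <= redo_lsn_on_secondary


# Illustrative values only. The secondary has redone past the last backup LSN.
print(offloaded_log_backup_allowed(1_000_500, 1_002_000))  # True: backup can proceed

# The secondary's redo is still behind the last backup LSN.
print(offloaded_log_backup_allowed(1_000_500, 1_000_100))  # False: backup fails
```

Until redo on the secondary catches up past that point, every scheduled log backup on that replica fails and the gap since your last good log backup keeps growing.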
Figure A: last LSN and redo LSN positions in the transaction log
Designing for offloaded backups in AlwaysOn Availability Groups
Availability Groups allow us to offload backups to secondary replicas. This is a great option as it reduces IO on the primary replica. In doing so, system designers need to be aware of the impact that the health of AG replication has on offloaded backups. That means understanding when your data is at risk because the secondaries are not completely caught up with changes from the primary, and knowing the techniques to mitigate that risk and protect the RPO and RTO that the business expects.
Here are a few things to keep in mind:
- Understand and minimize the amount of log generation that occurs in your databases and design to support that load
- Monitor log_send_queue_size and redo_queue_size in sys.dm_hadr_database_replica_states on replicas to measure impact on recovery point objectives
- Understand your system’s operations and plan for downtime caused by patching and network maintenance
- Understand resource contention on shared infrastructure: are you competing for things like network bandwidth or disk IO?
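As a starting point for the monitoring item above, here is a minimal sketch of a threshold check. It assumes the queue sizes come from the log_send_queue_size and redo_queue_size columns of sys.dm_hadr_database_replica_states, both reported in KB. The query is shown only as text and is not executed; the thresholds, function name, and sample rows are hypothetical and should be replaced with values derived from your own RPO and log generation rate.

```python
# T-SQL one might run on a replica to collect the inputs (not executed here).
QUEUE_QUERY = """
SELECT DB_NAME(database_id) AS database_name,
       log_send_queue_size,   -- KB of log not yet sent to the secondary
       redo_queue_size        -- KB of log received but not yet redone
FROM sys.dm_hadr_database_replica_states;
"""

# Hypothetical thresholds in KB.
SEND_QUEUE_LIMIT_KB = 512 * 1024
REDO_QUEUE_LIMIT_KB = 1024 * 1024


def queue_alerts(rows, send_limit_kb=SEND_QUEUE_LIMIT_KB, redo_limit_kb=REDO_QUEUE_LIMIT_KB):
    """Return one message per database queue that exceeds its threshold."""
    alerts = []
    for row in rows:
        name = row["database_name"]
        # The DMV can return NULL for these columns; treat that as an empty queue.
        send_kb = row.get("log_send_queue_size") or 0
        redo_kb = row.get("redo_queue_size") or 0
        if send_kb > send_limit_kb:
            alerts.append(f"{name}: send queue {send_kb} KB exceeds {send_limit_kb} KB")
        if redo_kb > redo_limit_kb:
            alerts.append(f"{name}: redo queue {redo_kb} KB exceeds {redo_limit_kb} KB")
    return alerts


# Made-up sample rows shaped like the query result.
sample_rows = [
    {"database_name": "Sales", "log_send_queue_size": 0, "redo_queue_size": 2_500_000},
    {"database_name": "HR", "log_send_queue_size": None, "redo_queue_size": 10},
]
for message in queue_alerts(sample_rows):
    print(message)
```

A check like this could be scheduled alongside the log backup job so that a growing queue is noticed before backups on the secondary begin to fail.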